Schema merging in Spark SQL

Date: 2015-05-18 16:35:11

Spark SQL supports schema merging.

Here is the example from the official documentation.

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// sqlContext from the previous example is used in this example.
// This is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

// Create a simple DataFrame, stored into a partition directory
// (sc is the same SparkContext that sqlContext was built from)
val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
df1.saveAsParquetFile("data/test_table/key=1")

// Create another DataFrame in a new partition directory,
// adding a new column and dropping an existing column
val df2 = sc.makeRDD(6 to 10).map(i => (i, i * 3)).toDF("single", "triple")
df2.saveAsParquetFile("data/test_table/key=2")

// Read the partitioned table
val df3 = sqlContext.parquetFile("data/test_table")
df3.printSchema()

// The final schema consists of all 3 columns in the Parquet files together
// with the partitioning column that appears in the partition directory paths.
// root
// |-- single: int (nullable = true)
// |-- double: int (nullable = true)
// |-- triple: int (nullable = true)
// |-- key: int (nullable = true)

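In other words, df1 and df2 are both saved under the data/test_table directory, in the partition subdirectories key=1 and key=2.

The columns of df1 are single, double, and key.

The columns of df2 are single, triple, and key.

In both cases key is not stored in the Parquet files themselves; it is taken from the partition directory name. When df3 reads test_table, the columns of df1 and df2 are combined, so the columns of df3 are single, double, triple, and key.

Calling df3.show then gives: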

single double triple key
3      6      null   1  
4      8      null   1  
5      10     null   1  
1      2      null   1  
2      4      null   1  
8      null   24     2  
9      null   27     2  
10     null   30     2  
6      null   18     2  
7      null   21     2

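As you can see, the result is the union of df1 and df2: every row keeps its own values, and any column its source file lacks is filled with null. No join is needed.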
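To make the merge rule concrete, here is a small Spark-free sketch in plain Scala. It is only a model of the behavior shown above, not Spark's actual implementation; the object SchemaMergeSketch and the helpers mergeSchemas and align are hypothetical names invented for this illustration.

```scala
// A Spark-free model of schema merging. All names here are hypothetical
// and exist only for illustration; this is not Spark's implementation.
object SchemaMergeSketch {
  type Row = Map[String, Int]

  // Union of the per-file column names in first-seen order,
  // with the partition column appended at the end.
  def mergeSchemas(fileSchemas: Seq[Seq[String]], partitionColumn: String): Seq[String] =
    fileSchemas.flatten.distinct :+ partitionColumn

  // Align one row to the merged schema. A column the row does not have
  // becomes None, which plays the role of null in the df3.show output.
  def align(schema: Seq[String], row: Row): Seq[Option[Int]] =
    schema.map(row.get)

  def main(args: Array[String]): Unit = {
    // Rows as they would come from key=1 and key=2, with the partition value attached.
    val part1 = (1 to 5).map(i => Map("single" -> i, "double" -> i * 2, "key" -> 1))
    val part2 = (6 to 10).map(i => Map("single" -> i, "triple" -> i * 3, "key" -> 2))

    val schema = mergeSchemas(Seq(Seq("single", "double"), Seq("single", "triple")), "key")
    println(schema.mkString(" "))

    // No join: the rows of both partitions are simply concatenated.
    (part1 ++ part2).foreach { row =>
      println(align(schema, row).map(_.map(_.toString).getOrElse("null")).mkString(" "))
    }
  }
}
```

The sketch concatenates the rows of the two partitions rather than matching them on any column, which is why the result contains ten rows and not a joined set.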

Original article: http://www.cnblogs.com/hark0623/p/4512064.html
