If your project uses SBT, add the following dependency to build.sbt:
    libraryDependencies += "com.databricks" % "spark-csv_2.10" % "1.3.0"

If your project uses Maven, add the following dependency to pom.xml:
    <dependency>
        <groupId>com.databricks</groupId>
        <artifactId>spark-csv_2.10</artifactId>
        <version>1.3.0</version>
    </dependency>

6. SparkConf holds all the configuration needed to run a Spark program. In this example we run the program locally and use two cores (local[2]). Part of the code is shown here:
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("csvDataFrame").setMaster("local[2]")

7. Use the SparkConf to initialize a SparkContext object. SparkContext is the main entry point into Spark:
    val sc = new SparkContext(conf)

One of the simplest ways to query data in Spark is with SQL, so we also define a SQLContext object:
    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

8. Now we can load the data prepared in advance:
    import com.databricks.spark.csv._

    val students = sqlContext.csvFile(filePath = "StudentData.csv", useHeader = true, delimiter = '|')

Here the students object is of type org.apache.spark.sql.DataFrame.
The file can also be loaded through the generic reader by passing an options map:

    val options = Map("header" -> "true", "path" -> "E:\\StudentData.csv")
    val newStudents = sqlContext.read.options(options).format("com.databricks.spark.csv").load()

Appendix
    id|studentName|phone|email
    1|Burke|1-300-746-8446|ullamcorper.velit.in@ametnullaDonec.co.uk
    2|Kamal|1-668-571-5046|pede.Suspendisse@interdumenim.edu
    3|Olga|1-956-311-1686|Aenean.eget.metus@dictumcursusNunc.edu
    4|Belle|1-246-894-6340|vitae.aliquet.nec@neque.co.uk
    5|Trevor|1-300-527-4967|dapibus.id@acturpisegestas.net
    6|Laurel|1-691-379-9921|adipiscing@consectetueripsum.edu
    7|Sara|1-608-140-1995|Donec.nibh@enimEtiamimperdiet.edu
    8|Kaseem|1-881-586-2689|cursus.et.magna@euismod.org
    9|Lev|1-916-367-5608|Vivamus.nisi@ipsumdolor.com
    10|Maya|1-271-683-2698|accumsan.convallis@ornarelectusjusto.edu
    11|Emi|1-467-270-1337|est@nunc.com
    12|Caleb|1-683-212-0896|Suspendisse@Quisque.edu
    13|Florence|1-603-575-2444|sit.amet.dapibus@lacusAliquamrutrum.ca
    14|Anika|1-856-828-7883|euismod@ligulaelit.co.uk
    15|Tarik|1-398-171-2268|turpis@felisorci.com
    16|Amena|1-878-250-3129|lorem.luctus.ut@scelerisque.com
    17|Blossom|1-154-406-9596|Nunc.commodo.auctor@eratSed.co.uk
    18|Guy|1-869-521-3230|senectus.et.netus@lectusrutrum.com
    19|Malachi|1-608-637-2772|Proin.mi.Aliquam@estarcu.net
    20|Edward|1-711-710-6552|lectus@aliquetlibero.co.uk
    case class Employee(id: Int, name: String)

As before, we define the SparkConf, SparkContext, and SQLContext:
    val conf = new SparkConf().setAppName("colRowDataFrame").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

2. There are many ways to obtain Employee instances, for example by reading them from a relational database. To keep things simple, this article defines a List of Employee objects directly:
    val listOfEmployees = List(Employee(1, "iteblog"), Employee(2, "Jason"), Employee(3, "Abhi"))

3. Pass the listOfEmployees list to the createDataFrame function of SQLContext to create the DataFrame. We can then call the DataFrame's printSchema function to print its schema. The DataFrame has two columns, id and name, which are exactly the two parameters of Employee, with matching types.
    val empFrame = sqlContext.createDataFrame(listOfEmployees)
    empFrame.printSchema

    root
     |-- id: integer (nullable = false)
     |-- name: string (nullable = true)

The printed schema matches the two parameters of the Employee class because the DataFrame derives it internally through reflection.
4. Columns can be renamed with the withColumnRenamed function:

    val empFrameWithRenamedColumns = sqlContext.createDataFrame(listOfEmployees).withColumnRenamed("id", "empId")
    empFrameWithRenamedColumns.printSchema

    root
     |-- empId: integer (nullable = false)
     |-- name: string (nullable = true)

5. We can use Spark's SQL support to query the data. Before doing so, the DataFrame must be registered as a temporary table, which the registerTempTable function does:
    empFrameWithRenamedColumns.registerTempTable("employeeTable")

6. Now we can query the data in the DataFrame with SQL statements:
    val sortedByNameEmployees = sqlContext.sql("select * from employeeTable order by name desc")
    sortedByNameEmployees.show()

    +-----+-------+
    |empId|   name|
    +-----+-------+
    |    1|iteblog|
    |    2|  Jason|
    |    3|   Abhi|
    +-----+-------+

How it works
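The createDataFrame function is declared as follows (this is the RDD variant; the examples in this article pass a local collection, which goes through the overload that takes a Seq[A]):

    def createDataFrame[A <: Product : TypeTag](rdd: RDD[A]): DataFrame

A case class extends Product, which is why it satisfies this bound. The familiar TupleN types also extend scala.Product, so we can create a DataFrame from tuples as well: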
    val mobiles = sqlContext.createDataFrame(Seq((1, "Android"), (2, "iPhone")))
    mobiles.printSchema
    mobiles.show()

    root
     |-- _1: integer (nullable = false)
     |-- _2: string (nullable = true)

    +---+-------+
    | _1|     _2|
    +---+-------+
    |  1|Android|
    |  2| iPhone|
    +---+-------+

The two fields of a Tuple2 are named _1 and _2 by default. If these default names are not to your liking, the reflected column names can likewise be renamed with the withColumnRenamed function.
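To make the `A <: Product` bound more concrete, here is a minimal plain-Scala sketch (no Spark required) showing that both a case class and a tuple expose their fields through `scala.Product`. The `describe` helper and the `ProductDemo` object are hypothetical names used only for illustration; they are not part of Spark, and Spark's own schema inference works from the `TypeTag` rather than from these runtime calls.

```scala
// Same shape as the Employee case class used earlier in this article.
case class Employee(id: Int, name: String)

object ProductDemo {
  // Accepts any scala.Product, mirroring the upper bound in createDataFrame[A <: Product].
  def describe(p: Product): String = {
    val fields = p.productIterator.mkString(", ")
    s"${p.productPrefix}(arity=${p.productArity}): $fields"
  }

  def main(args: Array[String]): Unit = {
    println(describe(Employee(1, "iteblog"))) // a case class is a Product
    println(describe((1, "Android")))         // so is a Tuple2
  }
}
```

Both calls compile against the same `Product` parameter, which is the property the `createDataFrame` signature relies on.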
Printing the schema of the students DataFrame loaded with csvFile gives:

    students.printSchema

    root
     |-- id: string (nullable = true)
     |-- studentName: string (nullable = true)
     |-- phone: string (nullable = true)
     |-- email: string (nullable = true)
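The four separate columns above depend on `delimiter = '|'` having been passed to `csvFile`. The options-based reader takes the same setting as a string entry in its options map; spark-csv documents `header` and `delimiter` among its options. The sketch below only builds such a map in plain Scala. `csvOptions` and `CsvOptionsSketch` are hypothetical helper names, and the file name is the one used earlier in this article.

```scala
object CsvOptionsSketch {
  // Hypothetical helper: builds an option map for the spark-csv reader.
  // "path", "header" and "delimiter" are the option keys assumed here.
  def csvOptions(path: String, delimiter: Char, header: Boolean = true): Map[String, String] =
    Map(
      "path" -> path,
      "header" -> header.toString,
      "delimiter" -> delimiter.toString
    )

  def main(args: Array[String]): Unit = {
    val options = csvOptions("StudentData.csv", '|')
    options.foreach { case (key, value) => println(s"$key -> $value") }
  }
}
```

Passing the resulting map to `sqlContext.read.options(...).format("com.databricks.spark.csv").load()` should then split each line on `|` instead of reading it as a single column.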
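If the DataFrame is instead loaded with the load approach shown earlier (the options map there sets no delimiter, so each line is read as a single column), students.printSchema prints the following: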
    root
     |-- id|studentName|phone|email: string (nullable = true)

Sampling data from a DataFrame
The show function prints rows of the DataFrame. Called without arguments it prints the first 20 rows; it also accepts a row count and a flag that controls whether long values are truncated:

    students.show()   // prints 20 rows

    +---+-----------+--------------+--------------------+
    | id|studentName|         phone|               email|
    +---+-----------+--------------+--------------------+
    |  1|      Burke|1-300-746-8446|ullamcorper.velit...|
    |  2|      Kamal|1-668-571-5046|pede.Suspendisse@...|
    |  3|       Olga|1-956-311-1686|Aenean.eget.metus...|
    |  4|      Belle|1-246-894-6340|vitae.aliquet.nec...|
    |  5|     Trevor|1-300-527-4967|dapibus.id@acturp...|
    |  6|     Laurel|1-691-379-9921|adipiscing@consec...|
    |  7|       Sara|1-608-140-1995|Donec.nibh@enimEt...|
    |  8|     Kaseem|1-881-586-2689|cursus.et.magna@e...|
    |  9|        Lev|1-916-367-5608|Vivamus.nisi@ipsu...|
    | 10|       Maya|1-271-683-2698|accumsan.convalli...|
    | 11|        Emi|1-467-270-1337|        est@nunc.com|
    | 12|      Caleb|1-683-212-0896|Suspendisse@Quisq...|
    | 13|   Florence|1-603-575-2444|sit.amet.dapibus@...|
    | 14|      Anika|1-856-828-7883|euismod@ligulaeli...|
    | 15|      Tarik|1-398-171-2268|turpis@felisorci.com|
    | 16|      Amena|1-878-250-3129|lorem.luctus.ut@s...|
    | 17|    Blossom|1-154-406-9596|Nunc.commodo.auct...|
    | 18|        Guy|1-869-521-3230|senectus.et.netus...|
    | 19|    Malachi|1-608-637-2772|Proin.mi.Aliquam@...|
    | 20|     Edward|1-711-710-6552|lectus@aliquetlib...|
    +---+-----------+--------------+--------------------+
    only showing top 20 rows

    students.show(15)

    +---+-----------+--------------+--------------------+
    | id|studentName|         phone|               email|
    +---+-----------+--------------+--------------------+
    |  1|      Burke|1-300-746-8446|ullamcorper.velit...|
    |  2|      Kamal|1-668-571-5046|pede.Suspendisse@...|
    |  3|       Olga|1-956-311-1686|Aenean.eget.metus...|
    |  4|      Belle|1-246-894-6340|vitae.aliquet.nec...|
    |  5|     Trevor|1-300-527-4967|dapibus.id@acturp...|
    |  6|     Laurel|1-691-379-9921|adipiscing@consec...|
    |  7|       Sara|1-608-140-1995|Donec.nibh@enimEt...|
    |  8|     Kaseem|1-881-586-2689|cursus.et.magna@e...|
    |  9|        Lev|1-916-367-5608|Vivamus.nisi@ipsu...|
    | 10|       Maya|1-271-683-2698|accumsan.convalli...|
    | 11|        Emi|1-467-270-1337|        est@nunc.com|
    | 12|      Caleb|1-683-212-0896|Suspendisse@Quisq...|
    | 13|   Florence|1-603-575-2444|sit.amet.dapibus@...|
    | 14|      Anika|1-856-828-7883|euismod@ligulaeli...|
    | 15|      Tarik|1-398-171-2268|turpis@felisorci.com|
    +---+-----------+--------------+--------------------+
    only showing top 15 rows

    students.show(true)   // truncates long values; output is identical to students.show() above

    students.show(false)  // prints long values in full

    +---+-----------+--------------+-----------------------------------------+
    |id |studentName|phone         |email                                    |
    +---+-----------+--------------+-----------------------------------------+
    |1  |Burke      |1-300-746-8446|ullamcorper.velit.in@ametnullaDonec.co.uk|
    |2  |Kamal      |1-668-571-5046|pede.Suspendisse@interdumenim.edu        |
    |3  |Olga       |1-956-311-1686|Aenean.eget.metus@dictumcursusNunc.edu   |
    |4  |Belle      |1-246-894-6340|vitae.aliquet.nec@neque.co.uk            |
    |5  |Trevor     |1-300-527-4967|dapibus.id@acturpisegestas.net           |
    |6  |Laurel     |1-691-379-9921|adipiscing@consectetueripsum.edu         |
    |7  |Sara       |1-608-140-1995|Donec.nibh@enimEtiamimperdiet.edu        |
    |8  |Kaseem     |1-881-586-2689|cursus.et.magna@euismod.org              |
    |9  |Lev        |1-916-367-5608|Vivamus.nisi@ipsumdolor.com              |
    |10 |Maya       |1-271-683-2698|accumsan.convallis@ornarelectusjusto.edu |
    |11 |Emi        |1-467-270-1337|est@nunc.com                             |
    |12 |Caleb      |1-683-212-0896|Suspendisse@Quisque.edu                  |
    |13 |Florence   |1-603-575-2444|sit.amet.dapibus@lacusAliquamrutrum.ca   |
    |14 |Anika      |1-856-828-7883|euismod@ligulaelit.co.uk                 |
    |15 |Tarik      |1-398-171-2268|turpis@felisorci.com                     |
    |16 |Amena      |1-878-250-3129|lorem.luctus.ut@scelerisque.com          |
    |17 |Blossom    |1-154-406-9596|Nunc.commodo.auctor@eratSed.co.uk        |
    |18 |Guy        |1-869-521-3230|senectus.et.netus@lectusrutrum.com       |
    |19 |Malachi    |1-608-637-2772|Proin.mi.Aliquam@estarcu.net             |
    |20 |Edward     |1-711-710-6552|lectus@aliquetlibero.co.uk               |
    +---+-----------+--------------+-----------------------------------------+
    only showing top 20 rows

    students.show(10, false)

    +---+-----------+--------------+-----------------------------------------+
    |id |studentName|phone         |email                                    |
    +---+-----------+--------------+-----------------------------------------+
    |1  |Burke      |1-300-746-8446|ullamcorper.velit.in@ametnullaDonec.co.uk|
    |2  |Kamal      |1-668-571-5046|pede.Suspendisse@interdumenim.edu        |
    |3  |Olga       |1-956-311-1686|Aenean.eget.metus@dictumcursusNunc.edu   |
    |4  |Belle      |1-246-894-6340|vitae.aliquet.nec@neque.co.uk            |
    |5  |Trevor     |1-300-527-4967|dapibus.id@acturpisegestas.net           |
    |6  |Laurel     |1-691-379-9921|adipiscing@consectetueripsum.edu         |
    |7  |Sara       |1-608-140-1995|Donec.nibh@enimEtiamimperdiet.edu        |
    |8  |Kaseem     |1-881-586-2689|cursus.et.magna@euismod.org              |
    |9  |Lev        |1-916-367-5608|Vivamus.nisi@ipsumdolor.com              |
    |10 |Maya       |1-271-683-2698|accumsan.convallis@ornarelectusjusto.edu |
    +---+-----------+--------------+-----------------------------------------+
    only showing top 10 rows

We can also sample data with the head(n: Int) method. It takes the number of rows to sample and returns an array of Row objects, which we iterate over to print. There is also a no-argument head() function, which returns only the first row, again of type Row.
    students.head(5).foreach(println)

    [1,Burke,1-300-746-8446,ullamcorper.velit.in@ametnullaDonec.co.uk]
    [2,Kamal,1-668-571-5046,pede.Suspendisse@interdumenim.edu]
    [3,Olga,1-956-311-1686,Aenean.eget.metus@dictumcursusNunc.edu]
    [4,Belle,1-246-894-6340,vitae.aliquet.nec@neque.co.uk]
    [5,Trevor,1-300-527-4967,dapibus.id@acturpisegestas.net]

    println(students.head())

    [1,Burke,1-300-746-8446,ullamcorper.velit.in@ametnullaDonec.co.uk]

Besides show and head, we can also use the first and take functions, which call head() and head(n) respectively:
    println(students.first())

    [1,Burke,1-300-746-8446,ullamcorper.velit.in@ametnullaDonec.co.uk]

    students.take(5).foreach(println)

    [1,Burke,1-300-746-8446,ullamcorper.velit.in@ametnullaDonec.co.uk]
    [2,Kamal,1-668-571-5046,pede.Suspendisse@interdumenim.edu]
    [3,Olga,1-956-311-1686,Aenean.eget.metus@dictumcursusNunc.edu]
    [4,Belle,1-246-894-6340,vitae.aliquet.nec@neque.co.uk]
    [5,Trevor,1-300-527-4967,dapibus.id@acturpisegestas.net]

Selecting columns from a DataFrame
1. Selecting a single column:

    val emailDataFrame: DataFrame = students.select("email")
    emailDataFrame.show(3)

    +--------------------+
    |               email|
    +--------------------+
    |ullamcorper.velit...|
    |pede.Suspendisse@...|
    |Aenean.eget.metus...|
    +--------------------+
    only showing top 3 rows

2. Selecting multiple columns. The select function also accepts several column names:
    val studentEmailDF = students.select("studentName", "email")
    studentEmailDF.show(5)

    +-----------+--------------------+
    |studentName|               email|
    +-----------+--------------------+
    |      Burke|ullamcorper.velit...|
    |      Kamal|pede.Suspendisse@...|
    |       Olga|Aenean.eget.metus...|
    |      Belle|vitae.aliquet.nec...|
    |     Trevor|dapibus.id@acturp...|
    +-----------+--------------------+
    only showing top 5 rows

Note that the selected columns must be valid; in other words, every column passed to select must be one of the columns printed by printSchema. If a column name is invalid, an org.apache.spark.sql.AnalysisException is thrown, as shown here:
    val studentEmailDF = students.select("studentName", "iteblog")
    studentEmailDF.show(5)

    Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'iteblog' given input columns id, studentName, phone, email;

Filtering data by condition
    students.filter("id > 5").show(10)

    +---+-----------+--------------+--------------------+
    | id|studentName|         phone|               email|
    +---+-----------+--------------+--------------------+
    |  6|     Laurel|1-691-379-9921|adipiscing@consec...|
    |  7|       Sara|1-608-140-1995|Donec.nibh@enimEt...|
    |  8|     Kaseem|1-881-586-2689|cursus.et.magna@e...|
    |  9|        Lev|1-916-367-5608|Vivamus.nisi@ipsu...|
    | 10|       Maya|1-271-683-2698|accumsan.convalli...|
    | 11|        Emi|1-467-270-1337|        est@nunc.com|
    | 12|      Caleb|1-683-212-0896|Suspendisse@Quisq...|
    | 13|   Florence|1-603-575-2444|sit.amet.dapibus@...|
    | 14|      Anika|1-856-828-7883|euismod@ligulaeli...|
    | 15|      Tarik|1-398-171-2268|turpis@felisorci.com|
    +---+-----------+--------------+--------------------+
    only showing top 10 rows

    students.filter("studentName =''").show(7)

    +---+-----------+--------------+--------------------+
    | id|studentName|         phone|               email|
    +---+-----------+--------------+--------------------+
    | 21|           |1-598-439-7549|consectetuer.adip...|
    | 32|           |1-184-895-9602|accumsan.laoreet@...|
    | 45|           |1-245-752-0481|Suspendisse.eleif...|
    | 83|           |1-858-810-2204|sociis.natoque@eu...|
    | 94|           |1-443-410-7878|Praesent.eu.nulla...|
    +---+-----------+--------------+--------------------+

Look at the first filter statement: although id was parsed as a String, the program still performed the comparison correctly. We can also filter on several conditions at once:
    students.filter("studentName ='' OR studentName = 'NULL'").show(7)

    +---+-----------+--------------+--------------------+
    | id|studentName|         phone|               email|
    +---+-----------+--------------+--------------------+
    | 21|           |1-598-439-7549|consectetuer.adip...|
    | 32|           |1-184-895-9602|accumsan.laoreet@...|
    | 33|       NULL|1-105-503-0141|Donec@Inmipede.co.uk|
    | 45|           |1-245-752-0481|Suspendisse.eleif...|
    | 83|           |1-858-810-2204|sociis.natoque@eu...|
    | 94|           |1-443-410-7878|Praesent.eu.nulla...|
    +---+-----------+--------------+--------------------+

We can also filter the data with SQL-like syntax:
    students.filter("SUBSTR(studentName,0,1) ='M'").show(7)

    +---+-----------+--------------+--------------------+
    | id|studentName|         phone|               email|
    +---+-----------+--------------+--------------------+
    | 10|       Maya|1-271-683-2698|accumsan.convalli...|
    | 19|    Malachi|1-608-637-2772|Proin.mi.Aliquam@...|
    | 24|    Marsden|1-477-629-7528|Donec.dignissim.m...|
    | 37|      Maggy|1-910-887-6777|facilisi.Sed.nequ...|
    | 61|     Maxine|1-422-863-3041|aliquet.molestie....|
    | 77|      Maggy|1-613-147-4380| pellentesque@mi.net|
    | 97|    Maxwell|1-607-205-1273|metus.In@musAenea...|
    +---+-----------+--------------+--------------------+
    only showing top 7 rows

Sorting the data in a DataFrame
    students.sort(students("studentName").desc).show(7)

    +---+-----------+--------------+--------------------+
    | id|studentName|         phone|               email|
    +---+-----------+--------------+--------------------+
    | 50|      Yasir|1-282-511-4445|eget.odio.Aliquam...|
    | 52|       Xena|1-527-990-8606|in.faucibus.orci@...|
    | 86|     Xandra|1-677-708-5691|libero@arcuVestib...|
    | 43|     Wynter|1-440-544-1851|amet.risus.Donec@...|
    | 31|    Wallace|1-144-220-8159| lorem.lorem@non.net|
    | 66|      Vance|1-268-680-0857|pellentesque@netu...|
    | 41|     Tyrone|1-907-383-5293|non.bibendum.sed@...|
    +---+-----------+--------------+--------------------+
    only showing top 7 rows

We can also sort by multiple columns:
    students.sort("studentName", "id").show(10)

    +---+-----------+--------------+--------------------+
    | id|studentName|         phone|               email|
    +---+-----------+--------------+--------------------+
    | 21|           |1-598-439-7549|consectetuer.adip...|
    | 32|           |1-184-895-9602|accumsan.laoreet@...|
    | 45|           |1-245-752-0481|Suspendisse.eleif...|
    | 83|           |1-858-810-2204|sociis.natoque@eu...|
    | 94|           |1-443-410-7878|Praesent.eu.nulla...|
    | 91|       Abel|1-530-527-7467|    urna@veliteu.edu|
    | 69|       Aiko|1-682-230-7013|turpis.vitae.puru...|
    | 47|       Alma|1-747-382-6775|    nec.enim@non.org|
    | 26|      Amela|1-526-909-2605| in@vitaesodales.edu|
    | 16|      Amena|1-878-250-3129|lorem.luctus.ut@s...|
    +---+-----------+--------------+--------------------+
    only showing top 10 rows

The result above shows that the default sort order is ascending. The same statement can also be written like this:
    students.sort(students("studentName").asc, students("id").asc).show(10)

Both statements produce the same result.
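As a plain-Scala analogy for the multi-column ascending sort (no Spark involved), sorting by a tuple key orders rows by the first element and breaks ties with the second. The rows below are a hand-picked subset of the output shown above, and `SortSketch` is a hypothetical name. One assumption to note: `id` is an `Int` here, whereas in the `students` DataFrame it was read as a string.

```scala
object SortSketch {
  // (id, studentName) pairs taken from the sorted output shown above, in shuffled order.
  val rows: Seq[(Int, String)] = Seq(
    (16, "Amena"), (32, ""), (91, "Abel"), (21, ""), (69, "Aiko")
  )

  // Sort by studentName first, then by id; both ascending, like sort("studentName", "id").
  def sortedByNameThenId(input: Seq[(Int, String)]): Seq[(Int, String)] =
    input.sortBy { case (id, name) => (name, id) }

  def main(args: Array[String]): Unit =
    sortedByNameThenId(rows).foreach(println)
}
```

Empty names compare lower than any non-empty name, which is consistent with the blank studentName rows appearing at the top of the Spark output.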
A column can be renamed in the result with the as function:

    students.select(students("studentName").as("name"), students("email")).show(10)

    +--------+--------------------+
    |    name|               email|
    +--------+--------------------+
    |   Burke|ullamcorper.velit...|
    |   Kamal|pede.Suspendisse@...|
    |    Olga|Aenean.eget.metus...|
    |   Belle|vitae.aliquet.nec...|
    |  Trevor|dapibus.id@acturp...|
    |  Laurel|adipiscing@consec...|
    |    Sara|Donec.nibh@enimEt...|
    |  Kaseem|cursus.et.magna@e...|
    |     Lev|Vivamus.nisi@ipsu...|
    |    Maya|accumsan.convalli...|
    +--------+--------------------+
    only showing top 10 rows

Treating a DataFrame as a relational table
(1) First register the DataFrame as a temporary table:

    students.registerTempTable("students")

(2) Then query it with standard SQL:
    sqlContext.sql("select * from students where studentName!='' order by email desc").show(7)

    +---+-----------+--------------+--------------------+
    | id|studentName|         phone|               email|
    +---+-----------+--------------+--------------------+
    | 87|      Selma|1-601-330-4409|vulputate.velit@p...|
    | 96|   Channing|1-984-118-7533|viverra.Donec.tem...|
    |  4|      Belle|1-246-894-6340|vitae.aliquet.nec...|
    | 78|       Finn|1-213-781-6969|vestibulum.massa@...|
    | 53|     Kasper|1-155-575-9346|velit.eget@pedeCu...|
    | 63|      Dylan|1-417-943-8961|vehicula.aliquet@...|
    | 35|     Cadman|1-443-642-5919|ut.lacus@adipisci...|
    +---+-----------+--------------+--------------------+
    only showing top 7 rows

Joining two DataFrames
1. Inner join:

    val students1 = sqlContext.csvFile(filePath = "E:\\StudentPrep1.csv", useHeader = true, delimiter = '|')
    val students2 = sqlContext.csvFile(filePath = "E:\\StudentPrep2.csv", useHeader = true, delimiter = '|')
    val studentsJoin = students1.join(students2, students1("id") === students2("id"))
    studentsJoin.show(studentsJoin.count.toInt)

    +---+-----------+--------------+--------------------+---+------------------+--------------+--------------------+
    | id|studentName|         phone|               email| id|       studentName|         phone|               email|
    +---+-----------+--------------+--------------------+---+------------------+--------------+--------------------+
    |  1|      Burke|1-300-746-8446|ullamcorper.velit...|  1|BurkeDifferentName|1-300-746-8446|ullamcorper.velit...|
    |  2|      Kamal|1-668-571-5046|pede.Suspendisse@...|  2|KamalDifferentName|1-668-571-5046|pede.Suspendisse@...|
    |  3|       Olga|1-956-311-1686|Aenean.eget.metus...|  3|              Olga|1-956-311-1686|Aenean.eget.metus...|
    |  4|      Belle|1-246-894-6340|vitae.aliquet.nec...|  4|BelleDifferentName|1-246-894-6340|vitae.aliquet.nec...|
    |  5|     Trevor|1-300-527-4967|dapibus.id@acturp...|  5|            Trevor|1-300-527-4967|dapibusDifferentE...|
    |  6|     Laurel|1-691-379-9921|adipiscing@consec...|  6|LaurelInvalidPhone|     000000000|adipiscing@consec...|
    |  7|       Sara|1-608-140-1995|Donec.nibh@enimEt...|  7|              Sara|1-608-140-1995|Donec.nibh@enimEt...|
    |  8|     Kaseem|1-881-586-2689|cursus.et.magna@e...|  8|            Kaseem|1-881-586-2689|cursus.et.magna@e...|
    |  9|        Lev|1-916-367-5608|Vivamus.nisi@ipsu...|  9|               Lev|1-916-367-5608|Vivamus.nisi@ipsu...|
    | 10|       Maya|1-271-683-2698|accumsan.convalli...| 10|              Maya|1-271-683-2698|accumsan.convalli...|
    +---+-----------+--------------+--------------------+---+------------------+--------------+--------------------+

2. Right outer join: in addition to the inner-join result, it includes every row of the right table that has no match, with the left table's columns filled with NULL. Consider the following example:
    val studentsRightOuterJoin = students1.join(students2, students1("id") === students2("id"), "right_outer")
    studentsRightOuterJoin.show(studentsRightOuterJoin.count.toInt)

    +----+-----------+--------------+--------------------+---+--------------------+--------------+--------------------+
    |  id|studentName|         phone|               email| id|         studentName|         phone|               email|
    +----+-----------+--------------+--------------------+---+--------------------+--------------+--------------------+
    |   1|      Burke|1-300-746-8446|ullamcorper.velit...|  1|  BurkeDifferentName|1-300-746-8446|ullamcorper.velit...|
    |   2|      Kamal|1-668-571-5046|pede.Suspendisse@...|  2|  KamalDifferentName|1-668-571-5046|pede.Suspendisse@...|
    |   3|       Olga|1-956-311-1686|Aenean.eget.metus...|  3|                Olga|1-956-311-1686|Aenean.eget.metus...|
    |   4|      Belle|1-246-894-6340|vitae.aliquet.nec...|  4|  BelleDifferentName|1-246-894-6340|vitae.aliquet.nec...|
    |   5|     Trevor|1-300-527-4967|dapibus.id@acturp...|  5|              Trevor|1-300-527-4967|dapibusDifferentE...|
    |   6|     Laurel|1-691-379-9921|adipiscing@consec...|  6|  LaurelInvalidPhone|     000000000|adipiscing@consec...|
    |   7|       Sara|1-608-140-1995|Donec.nibh@enimEt...|  7|                Sara|1-608-140-1995|Donec.nibh@enimEt...|
    |   8|     Kaseem|1-881-586-2689|cursus.et.magna@e...|  8|              Kaseem|1-881-586-2689|cursus.et.magna@e...|
    |   9|        Lev|1-916-367-5608|Vivamus.nisi@ipsu...|  9|                 Lev|1-916-367-5608|Vivamus.nisi@ipsu...|
    |  10|       Maya|1-271-683-2698|accumsan.convalli...| 10|                Maya|1-271-683-2698|accumsan.convalli...|
    |null|       null|          null|                null|999|LevUniqueToSecondRDD|1-916-367-5608|Vivamus.nisi@ipsu...|
    +----+-----------+--------------+--------------------+---+--------------------+--------------+--------------------+

3. Left outer join: in addition to the inner-join result, it includes every row of the left table that has no match, with the right table's columns filled with NULL. Again, consider the following example:
    val studentsLeftOuterJoin = students1.join(students2, students1("id") === students2("id"), "left_outer")
    studentsLeftOuterJoin.show(studentsLeftOuterJoin.count.toInt)

    +---+-----------+--------------+--------------------+----+------------------+--------------+--------------------+
    | id|studentName|         phone|               email|  id|       studentName|         phone|               email|
    +---+-----------+--------------+--------------------+----+------------------+--------------+--------------------+
    |  1|      Burke|1-300-746-8446|ullamcorper.velit...|   1|BurkeDifferentName|1-300-746-8446|ullamcorper.velit...|
    |  2|      Kamal|1-668-571-5046|pede.Suspendisse@...|   2|KamalDifferentName|1-668-571-5046|pede.Suspendisse@...|
    |  3|       Olga|1-956-311-1686|Aenean.eget.metus...|   3|              Olga|1-956-311-1686|Aenean.eget.metus...|
    |  4|      Belle|1-246-894-6340|vitae.aliquet.nec...|   4|BelleDifferentName|1-246-894-6340|vitae.aliquet.nec...|
    |  5|     Trevor|1-300-527-4967|dapibus.id@acturp...|   5|            Trevor|1-300-527-4967|dapibusDifferentE...|
    |  6|     Laurel|1-691-379-9921|adipiscing@consec...|   6|LaurelInvalidPhone|     000000000|adipiscing@consec...|
    |  7|       Sara|1-608-140-1995|Donec.nibh@enimEt...|   7|              Sara|1-608-140-1995|Donec.nibh@enimEt...|
    |  8|     Kaseem|1-881-586-2689|cursus.et.magna@e...|   8|            Kaseem|1-881-586-2689|cursus.et.magna@e...|
    |  9|        Lev|1-916-367-5608|Vivamus.nisi@ipsu...|   9|               Lev|1-916-367-5608|Vivamus.nisi@ipsu...|
    | 10|       Maya|1-271-683-2698|accumsan.convalli...|  10|              Maya|1-271-683-2698|accumsan.convalli...|
    | 11|    iteblog|        999999| iteblog@iteblog.com|null|              null|          null|                null|
    +---+-----------+--------------+--------------------+----+------------------+--------------+--------------------+

Saving a DataFrame to a file
    val saveOptions = Map("header" -> "true", "path" -> "iteblog.csv")

For the purposes of this exercise, we select the studentName and email columns from the DataFrame and rename the studentName column to name.
    val copyOfStudents = students.select(students("studentName").as("name"), students("email"))

2. Next we call the save function to write the DataFrame above to the iteblog.csv directory:
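    import org.apache.spark.sql.SaveMode

    copyOfStudents.write.format("com.databricks.spark.csv").mode(SaveMode.Overwrite).options(saveOptions).save()

The mode function accepts Overwrite, Append, Ignore, and ErrorIfExists. The names are largely self-explanatory: Overwrite replaces any data that already exists in the directory; Append adds the new data to the specified directory; Ignore does nothing if the directory already contains files; ErrorIfExists throws an exception if the target directory already contains files.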
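The four modes can be summarised as a small decision table. The following plain-Scala sketch models only the description above; `Mode`, `Action`, `decide`, and `SaveModeSketch` are hypothetical names, and this is not Spark's implementation of `SaveMode`.

```scala
object SaveModeSketch {
  sealed trait Mode
  case object Overwrite extends Mode
  case object Append extends Mode
  case object Ignore extends Mode
  case object ErrorIfExists extends Mode

  sealed trait Action
  case object WriteNew extends Action        // nothing there yet: just write
  case object ReplaceExisting extends Action // discard the old data, then write
  case object AddToExisting extends Action   // keep the old data and add the new data
  case object DoNothing extends Action       // leave the existing data untouched
  case object Fail extends Action            // raise an error

  // What a save would do for a given mode, depending on whether the target already holds data.
  def decide(mode: Mode, targetExists: Boolean): Action =
    if (!targetExists) WriteNew
    else mode match {
      case Overwrite     => ReplaceExisting
      case Append        => AddToExisting
      case Ignore        => DoNothing
      case ErrorIfExists => Fail
    }

  def main(args: Array[String]): Unit =
    for (mode <- Seq[Mode](Overwrite, Append, Ignore, ErrorIfExists))
      println(s"$mode, target exists -> ${decide(mode, targetExists = true)}")
}
```

The modes only differ when the target already exists, which is why the sketch short-circuits to a plain write otherwise.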
Getting Started with Spark DataFrames: Creating and Manipulating DataFrames
Original article: http://blog.csdn.net/lw_ghy/article/details/51480358