Note that, before Spark 2.0, the main programming interface of Spark was the Resilient Distributed Dataset (RDD). After Spark 2.0, RDDs were replaced by Dataset, which is strongly typed like an RDD but with richer optimizations under the hood. The RDD interface is still supported, and you can find a more complete reference in the RDD programming guide. However, we highly recommend switching to Dataset, which has better performance than RDD. See the SQL programming guide for more information about Dataset.
scala> val text=spark.read.textFile("/tmp/20171024/tian.txt")
text: org.apache.spark.sql.Dataset[String] = [value: string]
scala> text.count
res0: Long = 6
scala> val text=sc.textFile("/tmp/20171024/tian.txt")
text: org.apache.spark.rdd.RDD[String] = /tmp/20171024/tian.txt MapPartitionsRDD[7] at textFile at <console>:24
scala> text.count
res1: Long = 6
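As noted above, the RDD interface is still supported, and the two views can be converted into each other. Here is a minimal sketch (the variable names ds, rdd2, and ds2 are illustrative, not from the original session):
scala> val ds = spark.read.textFile("/tmp/20171024/tian.txt")   // Dataset[String]
scala> val rdd2 = ds.rdd                                        // the underlying RDD[String]
scala> val ds2 = rdd2.toDS()                                     // back to a Dataset; spark-shell imports spark.implicits._ automatically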
You can get values from a Dataset directly by calling some actions, or transform the Dataset to get a new one. For more details, please read the API doc.
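For example, a transformation such as map or filter produces a new Dataset, while an action such as count or collect returns values to the driver. A minimal sketch, continuing with the ds Dataset from the sketch above (the exact results depend on the contents of tian.txt, which are not shown here):
scala> ds.map(_.length).collect()                 // transformation, then an action: length of each line
scala> ds.filter(_.contains("spark")).count()     // transformation, then an action: lines containing "spark"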
Spark also supports pulling data sets into a cluster-wide in-memory cache. This is very useful when data is accessed repeatedly, such as when querying a small “hot” dataset or when running an iterative algorithm like PageRank. As a simple example, let’s mark our text dataset to be cached:
scala> text.cache()
res2: text.type = /tmp/20171024/tian.txt MapPartitionsRDD[7] at textFile at <console>:24
scala> text.count
res3: Long = 6
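When the cached data is no longer needed, it can be released again; unpersist is the standard counterpart of cache (a minimal sketch):
scala> text.unpersist()    // remove the data from the in-memory cache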
It may seem silly to use Spark to explore and cache a six-line text file. The interesting part is that these same functions can be used on very large data sets, even when they are striped across tens or hundreds of nodes.
Original article: http://www.cnblogs.com/playforever/p/7810196.html