Storage Level | Description
--- | ---
MEMORY_ONLY | Store the RDD as deserialized objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they are needed. This is the default level.
MEMORY_AND_DISK | Store the RDD as deserialized objects in the JVM. If the RDD does not fit in memory, store the partitions that do not fit on disk and read them from there when needed.
MEMORY_ONLY_SER | Store the RDD as serialized objects (one byte array per partition). This is generally more space-efficient than storing deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
MEMORY_AND_DISK_SER | Similar to MEMORY_ONLY_SER, but spill partitions that do not fit in memory to disk instead of recomputing them each time they are needed.
DISK_ONLY | Store the RDD partitions only on disk.
DISK_ONLY_2 (and the other levels ending in _2) | Same as the levels above, but replicate each partition on two cluster nodes.
Notes:
1) Spark's default storage level is MEMORY_ONLY: a single copy is cached in memory in its native (deserialized) form.
2) With MEMORY_AND_DISK, partitions are kept in memory as long as it suffices; only when memory runs out are they written to disk.
cache() is essentially a call to persist(), whose default storage level is MEMORY_ONLY,
but persist(StorageLevel) lets you set a StorageLevel that matches the storage needs of your application.
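For example, a minimal sketch of the two calls (the input paths are placeholders):
import org.apache.spark.storage.StorageLevel
val logs = sc.textFile("hdfs:///logs/access.log")
logs.cache()                                   // same as persist() with the default MEMORY_ONLY level
val events = sc.textFile("hdfs:///logs/events.log")
events.persist(StorageLevel.MEMORY_AND_DISK)   // spill partitions to disk when memory is insufficient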
Creating a SparkContext consists of the following code:
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.SparkContext._
val conf = new SparkConf().setAppName(appName).setMaster(master_url)
val sc = new SparkContext(conf)
// create RDDs and run the corresponding operations
sc.stop() // stop the SparkContext
After the application code is finished, package it into a jar and submit it with spark-submit:
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
Reference: http://spark.apache.org/docs/latest/submitting-applications.html
Spark can run in several modes, determined by the value of the master setting passed to the SparkContext.
The following are the examples provided by the official documentation:
# Run locally on a machine with 8 cores (local mode)
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master local[8] \
/path/to/examples.jar \
100
# Run on a Spark standalone cluster
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://207.184.161.138:7077 \
--executor-memory 20G \
--total-executor-cores 100 \
/path/to/examples.jar \
1000
# Run on a YARN cluster
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn-cluster \ # can also be `yarn-client` for client mode
--executor-memory 20G \
--num-executors 50 \
/path/to/examples.jar \
1000
(1) Parallelized collections: take an existing Scala collection and run computations on it.
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
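parallelize also accepts an optional second argument that sets the number of partitions; a small sketch (the value 10 is arbitrary):
val distData10 = sc.parallelize(data, 10)   // 10 partitions
distData10.reduce((a, b) => a + b)          // sum the elements in parallel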
(2) External files: run a function on every record of a file; the file system can be HDFS or any storage system supported by Hadoop. Use textFile() to turn a local file or an HDFS file into an RDD.
val rdd1 = sc.textFile("file:///root/access_log/access_log*.filter")
val rdd2 = rdd1.map(_.split("\t")).filter(_.length == 6)
rdd2.count()
SparkContext.wholeTextFiles lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with textFile, which would return one record per line in each file.
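A brief sketch of wholeTextFiles (the directory path is a placeholder):
val files = sc.wholeTextFiles("hdfs:///logs/small-files")                // RDD of (filename, content)
val lineCounts = files.mapValues(content => content.split("\n").length)  // lines per file
lineCounts.collect().foreach { case (name, n) => println(s"$name: $n lines") }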
For SequenceFiles, use SparkContext's sequenceFile[K, V] method where K and V are the types of key and values in the file. These should be subclasses of Hadoop's Writable interface, like IntWritable and Text. In addition, Spark allows you to specify native types for a few common Writables; for example, sequenceFile[Int, String] will automatically read IntWritables and Texts.
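A hedged sketch of that usage (the HDFS path is a placeholder): a pair RDD is written with saveAsSequenceFile and read back using the native Int/String mapping to IntWritable/Text.
val pairs = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
pairs.saveAsSequenceFile("hdfs:///tmp/pairs.seq")                  // write as a Hadoop SequenceFile
val back = sc.sequenceFile[Int, String]("hdfs:///tmp/pairs.seq")   // Int/String map to IntWritable/Text
back.collect().foreach(println)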
For other Hadoop InputFormats, you can use the SparkContext.hadoopRDD method, which takes an arbitrary JobConf and input format class, key class and value class. Set these the same way you would for a Hadoop job with your input source. You can also use SparkContext.newAPIHadoopRDD for InputFormats based on the "new" MapReduce API (org.apache.hadoop.mapreduce).
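A minimal sketch using the path-based convenience method newAPIHadoopFile (the input path is a placeholder); newAPIHadoopRDD works the same way but takes a pre-built Configuration object:
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
// read each line as a (byte offset, line text) pair through the "new" MapReduce API
val hadoopRdd = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data/input")
val textLines = hadoopRdd.map { case (_, line) => line.toString }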
RDD.saveAsObjectFile and SparkContext.objectFile support saving an RDD in a simple format consisting of serialized Java objects. While this is not as efficient as specialized formats like Avro, it offers an easy way to save any RDD.
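A short round-trip sketch (the output path is a placeholder):
val nums = sc.parallelize(1 to 100)
nums.saveAsObjectFile("hdfs:///tmp/numbers-obj")             // serialized Java objects
val restored = sc.objectFile[Int]("hdfs:///tmp/numbers-obj")
println(restored.sum())                                      // 5050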
val lines = sc.textFile("data.txt") // a base rdd from an external file.
val lineLengths = lines.map(s => s.length) // lineLengths为map操作的结果,是lazy的,返回一个新的RDD
val totalLength = lineLengths.reduce((a, b) => a + b) // action
lineLengths.persist() // memory_only
val lines = sc.textFile("data.txt")
val pairs = lines.map(s => (s, 1))
val counts = pairs.reduceByKey((a, b) => a + b)
val kv = sc.parallelize(List(List(1, 2), List(3, 4), List(3, 6, 8)))
Inside groupBy, `val cleanF = sc.clean(f)` preprocesses the user function, and `this.map(t => (cleanF(t), t)).groupByKey(p)` first applies the function to every element via map and then groups the results with groupByKey. The partitioner p determines the number of partitions and the partition function, and therefore the degree of parallelism.
The result records have the form (K, (Iterable[V], Iterable[W])) (the shape returned by cogroup): for each key, the value is a tuple of the iterators over the two collections of elements that share that key in the two RDDs.
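To make that result shape concrete, a minimal sketch with made-up data:
val a = sc.parallelize(Seq((1, "x"), (2, "y"), (2, "z")))
val b = sc.parallelize(Seq((2, 20), (3, 30)))
val grouped = a.cogroup(b)            // RDD[(Int, (Iterable[String], Iterable[Int]))]
grouped.collect().foreach(println)
// e.g. (2,(CompactBuffer(y, z),CompactBuffer(20)))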
Original post: http://blog.csdn.net/feige1990/article/details/48008233