How to Partition Data in Spark Streaming
Cluster resources can be under-utilized if the number of parallel tasks used in any stage of the computation is not high enough. For example, for distributed reduce operations like reduceByKey and reduceByKeyAndWindow, the default number of parallel tasks is controlled by the spark.default.parallelism configuration property. You can pass the level of parallelism as an argument (see the PairDStreamFunctions documentation), or set the spark.default.parallelism configuration property to change the default.
Level of Parallelism in Data Processing
As the documentation quoted above explains, there are two options: pass the desired parallelism as an argument to the operation itself (see the PairDStreamFunctions documentation), or change the cluster-wide default via the spark.default.parallelism configuration property.
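For the first option, here is a minimal sketch, not from the original post, of passing the number of partitions directly to reduceByKey and reduceByKeyAndWindow in the Java API. The socket source, host, port, window lengths, and the value 8 are all assumed purely for illustration:

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class ParallelismExample {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("ParallelismExample");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        // Hypothetical input source; host and port are placeholders.
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        // Pair each record with a count of 1.
        JavaPairDStream<String, Integer> pairs =
            lines.mapToPair(line -> new Tuple2<>(line, 1));

        // Pass the desired number of parallel tasks (8 is an arbitrary
        // example value) directly to the shuffle operation:
        JavaPairDStream<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b, 8);

        // reduceByKeyAndWindow accepts the same numPartitions argument,
        // after the window and slide durations:
        JavaPairDStream<String, Integer> windowedCounts = pairs.reduceByKeyAndWindow(
            (a, b) -> a + b, Durations.seconds(30), Durations.seconds(10), 8);

        counts.print();
        windowedCounts.print();

        jssc.start();
        jssc.awaitTermination();
    }
}

The per-operation argument overrides spark.default.parallelism only for that shuffle, which is useful when one hot stage needs more tasks than the rest of the job.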
For the second option, set the property when constructing the SparkConf. For example:
SparkConf sparkConf = new SparkConf().setAppName("NAME").set("spark.default.parallelism", "5");
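The same default can also be supplied at submit time instead of in code, which avoids recompiling just to tune it. A sketch of the invocation, where the class and JAR names are placeholders:

spark-submit --class com.example.MyStreamingApp --conf spark.default.parallelism=5 my-app.jar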
Original post: http://www.cnblogs.com/gnivor/p/4575743.html