```scala
def reduceByKey(partitioner: Partitioner, func: JFunction2[V, V, V]): JavaPairRDD[K, V]
def reduceByKey(func: JFunction2[V, V, V], numPartitions: Int): JavaPairRDD[K, V]
```
This function applies the supplied function to merge all of the values V that share the same key K.
The parameters are as follows:
- func: the merge function that combines two values into one; define it to suit your needs;
- partitioner: the partitioner that determines how the output is partitioned;
- numPartitions: the number of output partitions; when only a partition count is given, a HashPartitioner is used by default.

Return value: as the signatures show, the result is again an RDD of (K, V) key-value pairs, one per distinct key. A short sketch of these call styles follows below.
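To make the overloads concrete, here is a minimal Scala sketch of the corresponding calls on the Scala pair-RDD API, which mirrors the JavaPairRDD signatures above and additionally offers a variant that takes only the merge function. The object name, app name, and `local[*]` master are illustrative assumptions, not from the original post:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Hypothetical demo object; only the reduceByKey calls matter here.
object ReduceByKeyOverloads {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("reduceByKeyDemo").setMaster("local[*]"))

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

    // 1) Merge function only: the default partitioning is used.
    val r1 = pairs.reduceByKey(_ + _)

    // 2) Merge function plus a partition count; internally Spark
    //    builds a HashPartitioner with that many partitions.
    val r2 = pairs.reduceByKey(_ + _, 4)

    // 3) Explicit Partitioner plus the merge function.
    val r3 = pairs.reduceByKey(new HashPartitioner(4), _ + _)

    r3.collect.foreach(println) // prints (a,2) and (b,1), in some order

    sc.stop()
  }
}
```

All three return an RDD of (K, V) pairs; they differ only in how the number and placement of the output partitions is chosen. The spark-shell session below shows the one-argument form in action.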
```
linux:/$ spark-shell
...
17/10/28 20:33:54 WARN SparkConf: In Spark 1.0 and later spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN).
17/10/28 20:33:55 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
Spark context available as sc.
17/10/28 20:33:57 WARN SessionState: load mapred-default.xml, HIVE_CONF_DIR env not found!
17/10/28 20:33:58 WARN SessionState: load mapred-default.xml, HIVE_CONF_DIR env not found!
SQL context available as sqlContext.

scala> val x = sc.parallelize(List(
     |   ("a", "b", 1),
     |   ("a", "b", 1),
     |   ("c", "b", 1),
     |   ("a", "d", 1))
     | )
x: org.apache.spark.rdd.RDD[(String, String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:21

scala> val byKey = x.map({case (id,uri,count) => (id,uri)->count})
byKey: org.apache.spark.rdd.RDD[((String, String), Int)] = MapPartitionsRDD[1] at map at <console>:23

scala> val reducedByKey = byKey.reduceByKey(_ + _)
reducedByKey: org.apache.spark.rdd.RDD[((String, String), Int)] = ShuffledRDD[2] at reduceByKey at <console>:25

scala> reducedByKey.collect.foreach(println)
((c,b),1)
((a,d),1)
((a,b),2)
```
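Two things are worth noting in the REPL output above. First, the keys are the (id, uri) tuples produced by the map step, and only ("a","b") occurs twice, so only its counts are summed (to 2). Second, reduceByKey yields a ShuffledRDD: because the merge function is applied pairwise, Spark can combine the values for each key on the map side before the shuffle, so only partially aggregated records cross the network.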
Original post: http://www.cnblogs.com/yy3b2007com/p/7748074.html