
combineByKey


Reading the source of reduceByKey and groupByKey shows that both operators are implemented on top of combineByKey, so let's now analyze the combineByKey operator itself.
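As a quick illustration of that relationship, here is a minimal Scala sketch, closely paraphrasing the shape of the (version-dependent) Spark source in PairRDDFunctions, of how the two operators delegate to combineByKey:

// reduceByKey: V and C are the same type, so the reduce function
// serves as both mergeValue and mergeCombiners.
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = {
  combineByKey[V]((v: V) => v, func, func, partitioner)
}

// groupByKey: the combined type is a buffer of values. Map-side
// combining is disabled because it would not reduce the data size.
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = {
  val bufs = combineByKey[CompactBuffer[V]](
    (v: V) => CompactBuffer(v),                                // createCombiner
    (buf: CompactBuffer[V], v: V) => buf += v,                 // mergeValue
    (b1: CompactBuffer[V], b2: CompactBuffer[V]) => b1 ++= b2, // mergeCombiners
    partitioner,
    mapSideCombine = false)
  bufs.asInstanceOf[RDD[(K, Iterable[V])]]
}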

/**
 * Simplified version of combineByKey that hash-partitions the output RDD.
 */
def combineByKey[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    numPartitions: Int): RDD[(K, C)] = {
  combineByKey(createCombiner, mergeValue, mergeCombiners, new HashPartitioner(numPartitions))
}


/**
 * Generic function to combine the elements for each key using a custom set of aggregation
 * functions. Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a "combined type" C
 * Note that V and C can be different -- for example, one might group an RDD of type
 * (Int, Int) into an RDD of type (Int, Seq[Int]). Users provide three functions:
 *
 * - `createCombiner`, which turns a V into a C (e.g., creates a one-element list)
 * - `mergeValue`, to merge a V into a C (e.g., adds it to the end of a list)
 * - `mergeCombiners`, to combine two C's into a single one.
 *
 * In addition, users can control the partitioning of the output RDD, and whether to perform
 * map-side aggregation (if a mapper can produce multiple items with the same key).
 */
def combineByKey[C](createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null): RDD[(K, C)] = {
  require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
  if (keyClass.isArray) {
    if (mapSideCombine) {
      throw new SparkException("Cannot use map-side combining with array keys.")
    }
    if (partitioner.isInstanceOf[HashPartitioner]) {
      throw new SparkException("Default partitioner cannot partition array keys.")
    }
  }
  val aggregator = new Aggregator[K, V, C](
    self.context.clean(createCombiner),
    self.context.clean(mergeValue),
    self.context.clean(mergeCombiners))
  if (self.partitioner == Some(partitioner)) {
    self.mapPartitions(iter => {
      val context = TaskContext.get()
      new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
    }, preservesPartitioning = true)
  } else {
    new ShuffledRDD[K, V, C](self, partitioner)
      .setSerializer(serializer)
      .setAggregator(aggregator)
      .setMapSideCombine(mapSideCombine)
  }
}
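Note the two branches at the end: if the parent RDD is already partitioned by the requested partitioner, the aggregation runs in place via mapPartitions and no shuffle happens; otherwise a ShuffledRDD is created and carries the aggregator with it. Below is a minimal sketch of the no-shuffle path, assuming a live SparkContext sc (as in spark-shell) and an import of org.apache.spark.HashPartitioner; the variable names are illustrative, not from the original post:

// Pre-partitioning the input makes self.partitioner == Some(partitioner),
// so combineByKey aggregates in place instead of building a ShuffledRDD.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  .partitionBy(new HashPartitioner(4))

val combined = pairs.combineByKey(
  (v: Int) => List(v),                         // createCombiner
  (c: List[Int], v: Int) => v :: c,            // mergeValue
  (c1: List[Int], c2: List[Int]) => c1 ::: c2, // mergeCombiners
  new HashPartitioner(4))                      // same partitioner as pairs

// combined.toDebugString should show no extra shuffle stage for this step.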


The combineByKey function takes three user-supplied functions: createCombiner, mergeValue, and mergeCombiners.

createCombiner: V => C : turns a V into a C (e.g., creates a one-element list). It is called the first time combineByKey encounters a given key within a partition, and converts that key's first value into the initial combiner, e.g., wrapping the value in a list.

mergeValue: (C, V) => C : merges a V into a C (e.g., adds it to the end of a list). For every subsequent occurrence of a key that combineByKey has already seen, the incoming value is merged into the existing combiner, e.g., appended to the end of the list (this function runs within each partition).

mergeCombiners: (C, C) => C : combines two C's into a single one. It merges the combiners built for the same key into one collection, e.g., concatenating the per-partition lists (this function runs across different partitions).
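To see the three functions working together, here is a self-contained example (the object and variable names are mine, not from the original post) that computes a per-key average, using (sum, count) as the combined type C:

import org.apache.spark.{SparkConf, SparkContext}

object CombineByKeyAverage {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("combineByKey-avg").setMaster("local[*]"))
    val scores = sc.parallelize(
      Seq(("a", 1), ("a", 3), ("b", 5), ("b", 7), ("a", 2)))

    // C = (sum, count)
    val sumCount = scores.combineByKey(
      (v: Int) => (v, 1),                                          // createCombiner: first value of a key in a partition
      (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),       // mergeValue: fold another value into the partition-local combiner
      (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2) // mergeCombiners: merge combiners from different partitions
    )

    sumCount.mapValues { case (sum, count) => sum.toDouble / count }
      .collect()
      .foreach(println)   // e.g., (a,2.0), (b,6.0)
    sc.stop()
  }
}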


Reference:

http://codingjunkie.net/spark-combine-by-key/


Original post: http://www.cnblogs.com/0xcafedaddy/p/7630254.html
