
The Difference Between reduceByKey and groupByKey

Date: 2017-10-04 14:20:48


First, let's look at the source code of reduceByKey and groupByKey, both of which are defined in PairRDDFunctions.scala:
/**
 * Merge the values for each key using an associative reduce function. This will also perform
 * the merging locally on each mapper before sending results to a reducer, similarly to a
 * "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/
 * parallelism level.
 */
def reduceByKey(func: (V, V) => V): RDD[(K, V)] = {
  reduceByKey(defaultPartitioner(self), func)
}
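
The doc comment above requires an associative reduce function because values are merged in two stages: locally on each mapper, then again on the reducer. The following is a minimal plain-Scala sketch of why that matters. It does not use Spark, and `AssociativitySketch`, `twoStageReduce`, `splitA`, and `splitB` are hypothetical names introduced only for illustration.

```scala
// Plain-Scala illustration (no Spark dependency); all names are hypothetical.
object AssociativitySketch {
  // Stage 1: reduce inside each partition. Stage 2: reduce the partial results.
  def twoStageReduce(parts: Seq[Seq[Int]], f: (Int, Int) => Int): Int =
    parts.filter(_.nonEmpty).map(_.reduce(f)).reduce(f)

  // The same four values for one key, split across partitions in two ways.
  val splitA: Seq[Seq[Int]] = Seq(Seq(10, 3), Seq(2, 1))
  val splitB: Seq[Seq[Int]] = Seq(Seq(10), Seq(3, 2, 1))

  def main(args: Array[String]): Unit = {
    val add = (x: Int, y: Int) => x + y
    val sub = (x: Int, y: Int) => x - y
    // Addition is associative, so the partition layout does not change the result.
    println(twoStageReduce(splitA, add) == twoStageReduce(splitB, add))
    // Subtraction is not associative, so the result depends on the layout.
    println(twoStageReduce(splitA, sub) == twoStageReduce(splitB, sub))
  }
}
```

With addition both layouts give 16. With subtraction they give 6 and 10, which is why a non-associative function passed to reduceByKey can produce results that vary with partitioning.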


/**
 * Group the values for each key in the RDD into a single sequence. Allows controlling the
 * partitioning of the resulting key-value pair RDD by passing a Partitioner.
 * The ordering of elements within each group is not guaranteed, and may even differ
 * each time the resulting RDD is evaluated.
 *
 * Note: This operation may be very expensive. If you are grouping in order to perform an
 * aggregation (such as a sum or average) over each key, using [[PairRDDFunctions.aggregateByKey]]
 * or [[PairRDDFunctions.reduceByKey]] will provide much better performance.
 *
 * Note: As currently implemented, groupByKey must be able to hold all the key-value pairs for any
 * key in memory. If a key has too many values, it can result in an [[OutOfMemoryError]].
 */
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = {
  // groupByKey shouldn't use map side combine because map side combine does not
  // reduce the amount of data shuffled and requires all map side data be inserted
  // into a hash table, leading to more objects in the old gen.
  val createCombiner = (v: V) => CompactBuffer(v)
  val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
  val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
  val bufs = combineByKey[CompactBuffer[V]](
    createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine=false)
  bufs.asInstanceOf[RDD[(K, Iterable[V])]]
}
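
The practical difference between the two methods is whether values are combined before the shuffle. The sketch below simulates that with plain Scala collections. It is not Spark's implementation, and `ShuffleSketch`, `partitions`, `shuffledWithCombine`, `shuffledWithoutCombine`, and `reduceSide` are hypothetical names chosen for this example.

```scala
// Plain-Scala simulation (no Spark dependency); all names are hypothetical.
object ShuffleSketch {
  type Record = (String, Int)

  // Two hypothetical map-side partitions of (key, value) pairs.
  val partitions: Seq[Seq[Record]] = Seq(
    Seq(("a", 1), ("b", 1), ("a", 1)),
    Seq(("a", 1), ("b", 1), ("b", 1))
  )

  // reduceByKey-style: merge values per key inside each partition first,
  // so at most one record per key leaves each partition.
  def shuffledWithCombine(parts: Seq[Seq[Record]], f: (Int, Int) => Int): Seq[Record] =
    parts.flatMap { part =>
      part.groupBy(_._1).map { case (k, recs) => (k, recs.map(_._2).reduce(f)) }.toSeq
    }

  // groupByKey-style (mapSideCombine = false): every record is shuffled as-is.
  def shuffledWithoutCombine(parts: Seq[Seq[Record]]): Seq[Record] = parts.flatten

  // Reduce side: merge whatever arrived for each key.
  def reduceSide(shuffled: Seq[Record], f: (Int, Int) => Int): Map[String, Int] =
    shuffled.groupBy(_._1).map { case (k, recs) => (k, recs.map(_._2).reduce(f)) }

  def main(args: Array[String]): Unit = {
    val sum = (x: Int, y: Int) => x + y
    val combined = shuffledWithCombine(partitions, sum)
    val raw = shuffledWithoutCombine(partitions)
    println(s"records shuffled with map-side combine: ${combined.size}")
    println(s"records shuffled without map-side combine: ${raw.size}")
    println(s"same final result: ${reduceSide(combined, sum) == reduceSide(raw, sum)}")
  }
}
```

In this toy input, 4 records cross the simulated shuffle with map-side combining versus 6 without, and both paths end with the same per-key sums. When the goal is grouping rather than aggregation, a map-side buffer would still have to carry every value, which matches the source comment that map-side combine does not reduce the amount of data shuffled for groupByKey.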
 

 


Original source: http://www.cnblogs.com/0xcafedaddy/p/7625358.html
