标签:
如果一个RDD很大以至于它的所有元素并不能在driver端机器的内存中存放下,请不要进行如下调用:
val values = myVeryLargeRDD.collect()
def collect(): Array[T] = { val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray) Array.concat(results: _*) }
注意:
PairRDDFunctions.scala def countByKey(): Map[K, Long] = self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap def collectAsMap(): Map[K, V] = { val data = self.collect() val map = new mutable.HashMap[K, V] map.sizeHint(data.length) data.foreach { pair => map.put(pair._1, pair._2) } map } RDD.scala def countByValue()(implicit ord: Ordering[T] = null): Map[T, Long] = { map(value => (value, null)).countByKey() }
【knowledgebase】不要在一个很大的RDD上调用collect
标签:
原文地址:http://www.cnblogs.com/luogankun/p/4277958.html