RDD（google rdd paper notes）

时间：2017-09-24 00:35:53 阅读：175 评论：0 收藏：0 [点我收藏+]

标签：points lin pagerank 位置方式 round 需要 func 数据

RDD

Twister HaLoop Dryad MR Pregel....

多个并行操作重用中间结果-抽象
自动容错、位置感知性调度和可伸缩性

容错：数据检查点和记录数据的更新
RDD只支持粗粒度转换，即在大量记录上执行的单个操作。将创建RDD的一系列转换记录下来（即Lineage），以便恢复丢失的分区。

数据并行的批量分析应用，包括数据挖掘、机器学习、图算法等，因为这些程序通常都会在很多记录上执行相同的操作。RDD不太适合那些异步更新共享状态的应用，例如并行web爬行器。因此，我们的目标是为大多数分析型应用提供有效的编程模型，而其他类型的应用交给专门的系统。

构建RDD的时候，运行时通过管道的方式传输多个转换。
哈希分区和范围分区
Consistent partition placement across it- erations is one of the main optimizations in Pregel and HaLoop, so we let users express this optimization.

错误日志在ram的多个节点上
如果某个errors分区丢失，Spark只在相应的lines分区上执行filter操作来重建该errors分区。

The Spark scheduler will pipeline the latter two transformations and send a set of tasks to compute them to the nodes holding the cached partitions of errors.

Scala语法中filter的参数是一个闭包
these objects can be serialized and loaded on another node to pass the closure across the network

Distributed Shared Memory
传统的共享内存系统，还包括那些通过分布式哈希表或分布式文件系统进行数据共享的系统，比如Piccolo
share data through a distributed hash table or filesystem

不仅可以通过批量转换创建（即“写”）RDD，还可以对任意内存位置读写。也就是说，RDD限制应用执行批量写操作，这样有利于实现有效的容错。

key lookups on hash or range partitioned RDDs

B. Nitzberg and V. Lo. Distributed shared memory: a survey of issues and algorithms. Computer, 24(8):52 –60, aug 1991.

R. Power and J. Li. Piccolo: Building fast, distributed programs with partitioned tables. In Proc. OSDI 2010, 2010.

work around issues with Scala’s closure objects using reflection
Scala解释器来使用Spark

通过Partitioner类获取RDD的分区顺序，然后将另一个RDD按照同样的方式分区。有些操作会自动产生一个哈希或范围分区的RDD，像groupByKey，reduceByKey和sort等。

transanction action

data-parallel

operations like maps and sums --bulk operations

logistic regression:
1 a cached RDD called points( a map transformation on a text file)
2 run map and reduce on points to compute the gradient

Bulk Synchronous Parallel paradigm
Programs run as a series of coordinated iterations called supersteps.
On each superstep, each vertex in the graph runs a user function that can update state associ- ated with the vertex, mutate the graph topology, and send messages to other vertices for use in the next superstep.

This model can express many graph algorithms, includ- ing shortest paths, bipartite matching, and PageRank.

两个需要被join的数据集可以用相同的方式做hash-partitioned，这样可以减少shuffle提高性能

RDD（google rdd paper notes）

标签：points lin pagerank 位置方式 round 需要 func 数据

原文地址：http://www.cnblogs.com/yumanman/p/7583271.html

踩

(0)

评论一句话评论（0）

分享档案

更多>

2021年07月29日 (22)
2021年07月28日 (40)
2021年07月27日 (32)
2021年07月26日 (79)
2021年07月23日 (29)
2021年07月22日 (30)
2021年07月21日 (42)
2021年07月20日 (16)
2021年07月19日 (90)
2021年07月16日 (35)

周排行