Spark Streaming

Posted: 2016-09-17 00:28:16

Spark Streaming is Spark's model for stream processing.

Data sources include Kafka, Flume, HDFS, and others.

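A DStream (discretized stream) is the abstraction Spark Streaming uses to represent a stream: a sequence of data received over time. Internally, the data in a DStream is stored as RDDs, and the DStream is the discrete sequence formed by those RDDs.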
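To make the "sequence of RDDs" idea concrete, the following is a minimal sketch that feeds hand-made RDDs into a DStream through `queueStream`, so that each queued RDD shows up as one batch. The object name, the 1-second batch interval, and the sample values are assumptions chosen for illustration only.

```scala
import scala.collection.mutable

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}

// Hypothetical demo object; not part of the original article.
object DStreamAsRddSequence {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("DStreamAsRddSequence")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Each RDD placed in this queue is delivered as one batch of the DStream.
    val rddQueue = new mutable.Queue[RDD[Int]]()
    val stream = ssc.queueStream(rddQueue)

    // foreachRDD exposes the RDD behind each batch together with its batch time.
    stream.foreachRDD { (rdd: RDD[Int], time: Time) =>
      println(s"batch at $time: ${rdd.collect().mkString(", ")}")
    }

    ssc.start()

    // Push three small RDDs, roughly one per batch interval.
    for (i <- 1 to 3) {
      rddQueue.synchronized {
        rddQueue += ssc.sparkContext.makeRDD(Seq(i, i * 10))
      }
      Thread.sleep(1000)
    }

    ssc.stop()
  }
}
```

`queueStream` is mainly useful for experiments and tests like this one; a real job would read from a socket, a file system, or a messaging system instead.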

Steps for writing a Streaming program:

1. Create a StreamingContext

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Create a local StreamingContext with two worker threads and a batch interval of 5 seconds.
// The master needs at least 2 cores to avoid starvation: one runs the receiver, the other processes data.
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(5))

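A local StreamingContext needs at least two worker threads: one runs the receiver and the other processes the received data.

2. Define the input source and create the input DStream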

// Create a DStream that connects to hostname:port (here node1:9999)
val lines = ssc.socketTextStream("node1", 9999)

3. Define the stream computation by applying transformations and output operations to the DStream

// Split each line into words
val words = lines.flatMap(_.split(" "))

// Count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)

// Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.print()

4. Start receiving and processing data with streamingContext.start()

ssc.start()             // Start the computation

5. Wait for processing to terminate with streamingContext.awaitTermination()

ssc.awaitTermination()  // Wait for the computation to terminate

6. Optionally stop processing manually with streamingContext.stop()
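Step 6 has no code in the text, so here is one possible sketch of a manual stop. It assumes the same socket source as the word count (node1:9999), and the 60-second timeout and the object name are arbitrary choices for illustration, not something the original prescribes.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical demo object; not part of the original article.
object GracefulStopSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("GracefulStopSketch")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Minimal pipeline: print the number of lines received in each batch.
    val lines = ssc.socketTextStream("node1", 9999)
    lines.count().print()

    ssc.start()

    // Wait up to 60 seconds (an arbitrary value for this sketch).
    // Returns true if the context stopped on its own within that time.
    val stopped = ssc.awaitTerminationOrTimeout(60 * 1000L)

    if (!stopped) {
      // stopGracefully = true lets data that was already received finish processing;
      // stopSparkContext = true also shuts down the underlying SparkContext.
      ssc.stop(stopSparkContext = true, stopGracefully = true)
    }
  }
}
```

Note that a stopped StreamingContext cannot be restarted; a new one has to be created if processing should resume.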

Data sources

There are two kinds of data sources:

1. Basic sources

Text, HDFS, etc.

2. Advanced sources

Flume, Kafka, etc.
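As an illustration of a basic source, the sketch below reads text files from a directory with `textFileStream`. The HDFS URI and the object name are hypothetical placeholders. Advanced sources such as Kafka and Flume are provided by separate connector libraries, so they are not shown here.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical demo object; not part of the original article.
object FileSourceSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("FileSourceSketch")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Hypothetical directory: files that newly appear in it are read as lines of text.
    val lines = ssc.textFileStream("hdfs://node1:8020/streaming/input")

    // Same word count as in the socket example, applied to the file-based stream.
    val wordCounts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```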

DStreams support two kinds of operations

I. Transformations

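Stateless transformations: the processing of each batch does not depend on data from earlier batches.

Stateful transformations: operations that track data across time intervals; data from earlier batches takes part in computing the new batch.

Sliding windows
Tracking state changes: updateStateByKey()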
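A running word count is a common way to illustrate `updateStateByKey()`. The sketch below is one possible version: the checkpoint path and the names `RunningWordCount` and `updateCount` are assumptions made for this example, and the socket source is the same node1:9999 used earlier.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical demo object; not part of the original article.
object RunningWordCount {
  // Adds the counts seen in the current batch to the running total for one key.
  def updateCount(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] =
    Some(runningCount.getOrElse(0) + newValues.sum)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("RunningWordCount")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Stateful transformations require a checkpoint directory (hypothetical path).
    ssc.checkpoint("/tmp/streaming-checkpoint")

    val pairs = ssc.socketTextStream("node1", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1))

    // The state for each word is its total count across all batches so far.
    val totals = pairs.updateStateByKey[Int](updateCount _)
    totals.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Keeping the update logic in a plain function such as `updateCount` makes it easy to check without starting a streaming job, for example `updateCount(Seq(1, 1, 1), Some(2))` evaluates to `Some(5)`.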

Window functions
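As a sketch of a window function, the example below counts words over the last 30 seconds and recomputes the result every 10 seconds using `reduceByKeyAndWindow`. The window and slide durations are assumptions for illustration; both must be multiples of the batch interval, which is 5 seconds here.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical demo object; not part of the original article.
object WindowedWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("WindowedWordCount")
    val ssc = new StreamingContext(conf, Seconds(5))

    val pairs = ssc.socketTextStream("node1", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1))

    // Window length 30s, slide interval 10s: every 10 seconds,
    // emit the word counts for the most recent 30 seconds of data.
    val windowedCounts = pairs.reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b,
      Seconds(30),
      Seconds(10)
    )

    windowedCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```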

II. Output operations
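Besides `print()`, two commonly used output operations are `saveAsTextFiles` and `foreachRDD`. The sketch below shows both on the word count stream; the output prefix, the "top 3" logic, and the object name are assumptions made for this example.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}

// Hypothetical demo object; not part of the original article.
object OutputOperationsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("OutputOperationsSketch")
    val ssc = new StreamingContext(conf, Seconds(5))

    val wordCounts = ssc.socketTextStream("node1", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Writes one output directory per batch, named from the prefix,
    // the batch time, and the suffix (hypothetical prefix).
    wordCounts.saveAsTextFiles("/tmp/wordcounts/batch", "txt")

    // foreachRDD runs arbitrary code against the RDD of each batch.
    wordCounts.foreachRDD { (rdd: RDD[(String, Int)], time: Time) =>
      val top = rdd.sortBy(_._2, ascending = false).take(3)
      println(s"$time top words: ${top.mkString(", ")}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Output operations are what trigger the actual computation: a DStream with transformations but no output operation is never executed.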

Original article: http://www.cnblogs.com/one--way/p/5877552.html
