1. Overview
Spark Streaming is a scalable, high-throughput, fault-tolerant stream-processing module built on top of the Spark Core API.
1) It supports many data sources, such as Kafka, Flume, sockets, and files;
2) Processed data can be written to Kafka, HDFS, local files, and many other destinations.
DStream:
Spark Streaming provides a high-level abstraction over continuously arriving data, called a DStream (discretized stream):
It represents a continuous stream of data.
Internally, a DStream is represented by a continuous series of RDDs; each RDD in a DStream contains data from a certain interval.
Any operation applied on a DStream translates to operations on the underlying RDDs.
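The "operations on a DStream translate to operations on the underlying RDDs" idea can be sketched in plain Python, with lists standing in for RDDs. This is a hypothetical model for illustration, not the Spark API:

```python
# A DStream modeled as a sequence of micro-batches; each inner list
# stands in for one RDD produced during one batch interval.
batches = [[1, 2], [3, 4, 5], [6]]

def dstream_map(batches, f):
    # A DStream transformation is just the same RDD-level operation
    # applied to every underlying batch/RDD.
    return [[f(x) for x in batch] for batch in batches]

doubled = dstream_map(batches, lambda x: x * 2)
print(doubled)  # [[2, 4], [6, 8, 10], [12]]
```

Each batch keeps its boundaries: the transformation never mixes records across intervals, which mirrors how a DStream `map` produces one output RDD per input RDD.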
What is an RDD?
RDD is short for Resilient Distributed Dataset, the most important concept in Spark.
An RDD is a read-only, partitioned, fault-tolerant collection of data.
What does "resilient" mean?
An RDD can move freely between memory and disk.
An RDD can be transformed into other RDDs, and can be generated from other RDDs.
An RDD can store data of any type.
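The fault-tolerance part of "resilient" comes from lineage: an RDD remembers its parent and the transformation that produced it, so a lost partition can be recomputed instead of restored from a replica. A toy sketch of that idea (hypothetical Python, not Spark code):

```python
# Parent "RDD": partition id -> data, plus the recorded transformation.
parent = {0: [1, 2], 1: [3, 4]}
transform = lambda x: x + 10

# Child "RDD" derived from the parent.
child = {pid: [transform(x) for x in data] for pid, data in parent.items()}

child.pop(1)  # simulate losing partition 1 (e.g. an executor dies)

# Recover the lost partition from the lineage: reapply the recorded
# transformation to the corresponding parent partition.
child[1] = [transform(x) for x in parent[1]]
print(child[1])  # [13, 14]
```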
2. Basic concepts
1) Add the dependency
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.3.1</version>
</dependency>
Other related dependencies can be looked up at:
https://search.maven.org/search?q=g:org.apache.spark%20AND%20v:2.2.0
2) When files are used as a DStream source, how are they monitored?
1) All files must have the same format;
2) A file becomes part of the stream based on its modification time, not its creation time;
3) Once a file has been processed, changes to it within the current window are not picked up, i.e. the file is not reread;
4) You can call FileSystem.setTimes() to change a file's timestamp so that it is picked up in a later window, even if its contents have not been modified.
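A pure-Python analogue of the `FileSystem.setTimes()` trick (this is `os.utime`, not the Hadoop API): bump a file's modification time forward without changing its content, so that an mtime-based monitor treats it as newly arrived.

```python
import os
import tempfile

# Create an empty temporary file to stand in for a monitored input file.
fd, path = tempfile.mkstemp()
os.close(fd)
old_mtime = os.path.getmtime(path)

# Move the modification time one hour forward without touching the
# content; an mtime-based monitor would now pick the file up again.
new_time = old_mtime + 3600
os.utime(path, (new_time, new_time))  # (access time, modification time)

bumped = os.path.getmtime(path)
os.remove(path)
print(bumped > old_mtime)  # True
```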
3. Transform operations
Window operations
Spark Streaming also provides windowed computations, which allow you to apply transformations over a sliding window of data.
Every time the window slides over a source DStream, the source RDDs that fall within the window are combined and operated upon to produce the RDDs of the windowed DStream.
In other words, all RDDs within one time window are merged into a single RDD for processing.
Any window operation needs to specify two parameters:
window length: the duration of the window
sliding interval: the interval at which the window operation is performed
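The two parameters can be illustrated with a pure-Python sketch (hypothetical, not the Spark API), where both are measured in units of batch intervals:

```python
# Slide a window over a stream of micro-batches and union the batches
# (stand-ins for RDDs) that fall inside each window position.
def windows(batches, window_length, sliding_interval):
    out = []
    for end in range(window_length, len(batches) + 1, sliding_interval):
        window = batches[end - window_length:end]
        out.append([x for batch in window for x in batch])  # merge RDDs
    return out

batches = [[1], [2], [3], [4], [5], [6]]
print(windows(batches, window_length=3, sliding_interval=2))
# [[1, 2, 3], [3, 4, 5]]
```

Note how batch `[3]` appears in both windows: with a window length of 3 and a sliding interval of 2, consecutive windows overlap by one batch.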
4. Output operations
Use foreachRDD:
dstream.foreachRDD is a powerful primitive that allows data to be sent out to external systems. However, it is important to understand how to use this primitive correctly and efficiently.
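A key point in the Spark documentation's foreachRDD guidance is to create expensive resources, such as connections to the external system, once per partition rather than once per record. A minimal pure-Python sketch of that pattern (the `Connection` class is hypothetical):

```python
# Count how many connections get opened when writing an RDD out
# partition by partition.
class Connection:
    count = 0                      # total connections opened

    def __init__(self):
        Connection.count += 1
        self.sent = []

    def send(self, record):
        self.sent.append(record)

def write_partition(partition):
    conn = Connection()            # ONE connection per partition...
    for record in partition:
        conn.send(record)          # ...reused for every record in it
    return conn.sent

rdd_partitions = [[1, 2, 3], [4, 5]]   # one RDD split into two partitions
results = [write_partition(p) for p in rdd_partitions]
print(Connection.count)  # 2
```

Two partitions, five records: the per-partition pattern opens 2 connections instead of 5. In real Spark code the same structure appears as `rdd.foreachPartition { ... }` inside `dstream.foreachRDD`.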
The checkpoint concept
Performance Tuning
Fault-tolerance Semantics
Original post: https://www.cnblogs.com/gm-201705/p/9533271.html