(Custom Spark Release) Lesson 18: Handling Empty RDDs in Spark Streaming and Gracefully Stopping Streaming Applications

Date: 2016-06-14 16:16:58

In this lesson:

1. Handling empty RDDs in Spark Streaming

2. Ways to stop a StreamingContext

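A Spark Streaming application generates one RDD per batch, according to the Batch Duration we configure. Some of those RDDs have partitions that contain no data, yet foreachPartition still runs on them: it acquires compute resources and executes, which wastes cluster resources. Such cases should therefore be filtered out while the program runs, as in the following code: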

package com.dt.spark.sparkstreaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object OnlineForeachRDD2DB {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf() // Create the SparkConf object
    conf.setAppName("OnlineForeachRDD2DB") // Application name, shown in the monitoring UI while the program runs
    conf.setMaster("spark://Master:7077") // Run the program on the Spark cluster
    /**
     * Set the batchDuration interval, which controls how often Jobs are generated,
     * and create the entry point for Spark Streaming execution.
     */
    val ssc = new StreamingContext(conf, Seconds(300))
    val lines = ssc.socketTextStream("Master", 9999)
    val words = lines.flatMap(line => line.split(" "))
    val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)
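    wordCounts.foreachRDD { rdd =>
      /**
       * What goes wrong when the rdd is empty?
       * The rdd has no elements, yet foreachPartition still runs and still performs the
       * database write (or the write to HDFS). There are no records in the rdd, but compute
       * resources are acquired and consumed anyway, which is pure waste,
       * so empty rdds must be handled.
       *
       * One option is rdd.count() > 0, but rdd.count() triggers a Job.
       *
       * rdd.isEmpty() can also trigger a Job, because it calls take(1):
       *
       *   def isEmpty(): Boolean = withScope {
       *     partitions.length == 0 || take(1).length == 0
       *   }
       *
       * rdd.partitions.isEmpty only checks whether length == 0, that is, whether the rdd
       * has any partitions at all:
       *
       *   def isEmpty: Boolean = { length == 0 }
       *
       * Note: rdd.isEmpty() and rdd.partitions.isEmpty are two different concepts.
       */
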
      // Skip RDDs that have no partitions at all.
      if (rdd.partitions.length > 0) {
        rdd.foreachPartition { partitionOfRecord =>
          if (partitionOfRecord.hasNext) { // Check whether this partition actually contains data
            // ConnectionPool is not defined in this listing; presumably a helper elsewhere in the package.
            val connection = ConnectionPool.getConnection()
            partitionOfRecord.foreach(record => {
              val sql = "insert into streaming_itemcount(item,rcount) values('" + record._1 + "'," + record._2 + ")"
              val stmt = connection.createStatement()
              stmt.executeUpdate(sql)
              stmt.close()
            })
            ConnectionPool.returnConnection(connection)
          }
        }
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
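
The per-partition guard in the code above can be tried out without a cluster. The following is a minimal sketch, not the article's actual ConnectionPool: `FakePool`, `FakeConnection` and `writePartition` are hypothetical names introduced only for illustration. `writePartition` has the same shape as the function passed to foreachPartition, and it takes a connection from the pool only when the iterator holds at least one record.

```scala
object EmptyPartitionGuard {
  // Hypothetical stand-in for a pooled database connection; it only counts rows.
  final class FakeConnection {
    var rows = 0
    def write(record: (String, Int)): Unit = { rows += 1 }
  }

  // Hypothetical pool that records how many connections were handed out and returned.
  final class FakePool {
    var acquired = 0
    var returned = 0
    def getConnection(): FakeConnection = { acquired += 1; new FakeConnection }
    def returnConnection(c: FakeConnection): Unit = { returned += 1 }
  }

  // Same shape as the body of rdd.foreachPartition: it receives the partition's iterator.
  // Returns the number of records written.
  def writePartition(records: Iterator[(String, Int)], pool: FakePool): Int = {
    if (!records.hasNext) {
      0 // Empty partition: never touch the pool.
    } else {
      val connection = pool.getConnection()
      try {
        records.foreach(connection.write)
        connection.rows
      } finally {
        pool.returnConnection(connection) // Always hand the connection back.
      }
    }
  }
}
```

An empty iterator leaves the pool untouched, while a non-empty one borrows exactly one connection and returns it, even if a write throws.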

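2. Ways to stop a Spark Streaming application

The first way stops the application immediately, whether or not the received data has been fully processed.

The second way stops it only after all received data has been processed. The second way is generally the one to use.

First way to stop: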

/**
 * Stop the execution of the streams immediately (does not wait for all received data
 * to be processed). By default, if `stopSparkContext` is not specified, the underlying
 * SparkContext will also be stopped. This implicit behavior can be configured using the
 * SparkConf configuration spark.streaming.stopSparkContextByDefault.
 *
 * Note: execution stops at once, without waiting for received data to be processed, and
 * the SparkContext is stopped too unless spark.streaming.stopSparkContextByDefault says otherwise.
 *
 * @param stopSparkContext If true, stops the associated SparkContext. The underlying SparkContext
 *                         will be stopped regardless of whether this StreamingContext has been
 *                         started.
 */
def stop(
    stopSparkContext: Boolean = conf.getBoolean("spark.streaming.stopSparkContextByDefault", true)
): Unit = synchronized {
  stop(stopSparkContext, false)
}

Second way to stop:

/**
 * Stop the execution of the streams, with option of ensuring all received data
 * has been processed.
 *
 * Note: when stopGracefully is true, the execution of the streams is stopped only after
 * all received data has been processed.
 *
 * @param stopSparkContext if true, stops the associated SparkContext. The underlying SparkContext
 *                         will be stopped regardless of whether this StreamingContext has been
 *                         started.
 * @param stopGracefully if true, stops gracefully by waiting for the processing of all
 *                       received data to be completed
 */
def stop(stopSparkContext: Boolean, stopGracefully: Boolean): Unit = {
  var shutdownHookRefToRemove: AnyRef = null
  if (AsynchronousListenerBus.withinListenerThread.value) {
    throw new SparkException("Cannot stop StreamingContext within listener thread of" +
      " AsynchronousListenerBus")
  }
  synchronized {
    try {
      state match {
        case INITIALIZED =>
          logWarning("StreamingContext has not been started yet")
        case STOPPED =>
          logWarning("StreamingContext has already been stopped")
        case ACTIVE =>
          scheduler.stop(stopGracefully)
          // Removing the streamingSource to de-register the metrics on stop()
          env.metricsSystem.removeSource(streamingSource)
          uiTab.foreach(_.detach())
          StreamingContext.setActiveContext(null)
          waiter.notifyStop()
          if (shutdownHookRef != null) {
            shutdownHookRefToRemove = shutdownHookRef
            shutdownHookRef = null
          }
          logInfo("StreamingContext stopped successfully")
      }
    } finally {
      // The state should always be Stopped after calling `stop()`, even if we haven't started yet
      state = STOPPED
    }
  }
  if (shutdownHookRefToRemove != null) {
    ShutdownHookManager.removeShutdownHook(shutdownHookRefToRemove)
  }
  // Even if we have already stopped, we still need to attempt to stop the SparkContext because
  // a user might stop(stopSparkContext = false) and then call stop(stopSparkContext = true).
  if (stopSparkContext) sc.stop()
}
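
In application code, the graceful variant is usually triggered by some external signal. The sketch below shows one way to do that under stated assumptions: `StreamingHandle`, `runUntilStopRequested` and the `shouldStop` callback are hypothetical names, and the trait merely mirrors two StreamingContext methods (awaitTerminationOrTimeout and the two-argument stop) so that the loop can be exercised without Spark.

```scala
object GracefulStop {
  // Hypothetical abstraction over the two StreamingContext methods this sketch needs.
  trait StreamingHandle {
    def awaitTerminationOrTimeout(timeoutMs: Long): Boolean
    def stop(stopSparkContext: Boolean, stopGracefully: Boolean): Unit
  }

  // Waits in slices of pollMs. After each slice that ends without termination,
  // asks shouldStop(); once it answers true, requests a graceful stop and returns.
  def runUntilStopRequested(ctx: StreamingHandle, shouldStop: () => Boolean, pollMs: Long): Unit = {
    var terminated = false
    while (!terminated) {
      terminated = ctx.awaitTerminationOrTimeout(pollMs)
      if (!terminated && shouldStop()) {
        // stopGracefully = true: let already received data finish processing first.
        ctx.stop(stopSparkContext = true, stopGracefully = true)
        terminated = true
      }
    }
  }
}
```

With a real StreamingContext `ssc`, a thin adapter that forwards both methods to `ssc` could be passed as `ctx` in place of the final `ssc.awaitTermination()` call. What `shouldStop` checks (a marker file, a flag set by an operator, and so on) is left to the application.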

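Notes:

Source: DT_大数据梦工厂 (custom Spark release series)

For more content, follow the WeChat public account: DT_Spark

If you are interested in big data and Spark, 王家林 gives a permanently free Spark open class every evening at 20:00. YY room number: 68917580

This article comes from the "DT_Spark大数据梦工厂" blog. Please keep this attribution: http://18610086859.blog.51cto.com/11484530/1789096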

