首页 > 移动开发 > 详细

YARN源码分析(一)-----ApplicationMaster

时间：2015-12-14 06:42:24 阅读：299 评论：0 收藏：0 [点我收藏+]

标签：

转自：http://blog.csdn.net/androidlushangderen/article/details/48128955

YARN学习系列：http://blog.csdn.net/Androidlushangderen/article/category/5780183

前言

在之前两周主要学了HDFS中的一些模块知识,其中的许多都或多或少有我们借鉴学习的地方,现在将目光转向另外一个块,被誉为MRv2,就是yarn,在Yarn中,解决了MR中JobTracker单点的问题,将此拆分成了ResourceManager和NodeManager这样的结构,在每个节点上,还会有ApplicationMaster来管理应用程序的整个生命周期,的确在Yarn中,多了许多优秀的设计,而今天,我主要分享的就是这个ApplicationMaster相关的一整套服务,他是隶属于ResoureManager的内部服务中的.了解了AM的启动机制,你将会更进一步了解Yarn的任务启动过程.

ApplicationMaster管理涉及类

ApplicationMaster管理涉及到了4大类,ApplicationMasterLauncher,AMLivelinessMonitor,ApplicationMasterService,以及ApplicationMaster自身类.下面介绍一下这些类的用途,在Yarn中,每个类都会有自己明确的功能模块的区分.

1.ApplicationMasterLauncher--姑且叫做AM启动关闭事件处理器,他既是一个服务也是一个处理器,在这个类中,只处理2类事件,launch和cleanup事件.分别对应启动应用和关闭应用的情形.

2.AMLivelinessMonitor--这个类从名字上可以看出他是监控类,监控的对象是AM存活状态的监控类,检测的方法与之前的HDFS一样,都是采用heartbeat的方式,如果有节点过期了,将会触发一次过期事件.

3.ApplicationMasterService--AM请求服务处理类.AMS存在于ResourceManager,中,服务的对象是各个节点上的ApplicationMaster,负责接收各个AM的注册请求,更新心跳包信息等.

4.ApplicationMaster--节点应用管理类,简单的说,ApplicationMaster负责管理整个应用的生命周期.

简答的描述完AM管理的相关类,下面从源码级别分析一下几个流程.

AM启动

要想让AM启动,启动的背景当然是有用户提交了新的Application的时候,之后ApplicationMasterLauncher会生成Launch事件,与对应的nodemanager通信,让其准备启动的新的AM的Container.在这里,就用到了ApplicationMasterLauncher这个类,之前在上文中已经提到,此类就处理2类事件,Launch启动和Cleanup清洗事件,先来看看这个类的基本变量设置

[java] view plain copy print ?

//Application应用事件处理器
public class ApplicationMasterLauncher extends AbstractService implements
EventHandler<AMLauncherEvent> {
private static final Log LOG = LogFactory.getLog(
ApplicationMasterLauncher.class);
private final ThreadPoolExecutor launcherPool;
private LauncherThread launcherHandlingThread;
//事件队列
private final BlockingQueue<Runnable> masterEvents
= new LinkedBlockingQueue<Runnable>();
//资源管理器上下文
protected final RMContext context;
public ApplicationMasterLauncher(RMContext context) {
super(ApplicationMasterLauncher.class.getName());
this.context = context;
//初始化线程池
this.launcherPool = new ThreadPoolExecutor(10, 10, 1,
TimeUnit.HOURS, new LinkedBlockingQueue<Runnable>());
//新建处理线程
this.launcherHandlingThread = new LauncherThread();
}

还算比较简单,有一个masterEvents事件队列,还有执行线程以及所需的线程池执行环境。在RM相关的服务中，基本都是继承自AbstractService这个抽象服务类的。ApplicationMasterLauncher中主要处理2类事件,就是下面的展示的

[java] view plain copy print ?

@Override
public synchronized void handle(AMLauncherEvent appEvent) {
AMLauncherEventType event = appEvent.getType();
RMAppAttempt application = appEvent.getAppAttempt();
//处理来自ApplicationMaster获取到的请求，分为启动事件和清洗事件2种
switch (event) {
case LAUNCH:
launch(application);
break;
case CLEANUP:
cleanup(application);
default:
break;
}
}

然后调用具体的实现方法，以启动事件launch事件为例

[java] view plain copy print ?

//添加应用启动事件
private void launch(RMAppAttempt application) {
Runnable launcher = createRunnableLauncher(application,
AMLauncherEventType.LAUNCH);
//将启动事件加入事件队列中
masterEvents.add(launcher);
}

这些事件被加入到事件队列之后，是如何被处理的呢，通过消息队列的形式，在一个独立的线程中逐一被执行

[java] view plain copy print ?

//执行线程实现
private class LauncherThread extends Thread {
public LauncherThread() {
super("ApplicationMaster Launcher");
}
@Override
public void run() {
while (!this.isInterrupted()) {
Runnable toLaunch;
try {
//执行方法为从事件队列中逐一取出事件
toLaunch = masterEvents.take();
//放入线程池池中进行执行
launcherPool.execute(toLaunch);
} catch (InterruptedException e) {
LOG.warn(this.getClass().getName() + " interrupted. Returning.");
return;
}
}
}
}

如果论到事件的具体执行方式，就要看具体AMLauch是如何执行的，AMLauch本身就是一个runnable实例。

[java] view plain copy print ?

/**
* The launch of the AM itself.
* Application事件执行器
*/
public class AMLauncher implements Runnable {
private static final Log LOG = LogFactory.getLog(AMLauncher.class);
private ContainerManagementProtocol containerMgrProxy;
private final RMAppAttempt application;
private final Configuration conf;
private final AMLauncherEventType eventType;
private final RMContext rmContext;
private final Container masterContainer;

在里面主要的run方法如下，就是按照事件类型进行区分操作

[java] view plain copy print ?

@SuppressWarnings("unchecked")
public void run() {
//AMLauncher分2中事件分别处理
switch (eventType) {
case LAUNCH:
try {
LOG.info("Launching master" + application.getAppAttemptId());
//调用启动方法
launch();
handler.handle(new RMAppAttemptEvent(application.getAppAttemptId(),
RMAppAttemptEventType.LAUNCHED));
...
break;
case CLEANUP:
try {
LOG.info("Cleaning master " + application.getAppAttemptId());
//调用作业清洗方法
cleanup();
...
break;
default:
LOG.warn("Received unknown event-type " + eventType + ". Ignoring.");
break;
}
}

后面的launch操作会调用RPC函数与远程的NodeManager通信来启动Container。然后到了ApplicationMaster的run()启动方法，在启动方法中，会进行应用注册的方法，

[java] view plain copy print ?

@SuppressWarnings({ "unchecked" })
public boolean run() throws YarnException, IOException {
LOG.info("Starting ApplicationMaster");
Credentials credentials =
UserGroupInformation.getCurrentUser().getCredentials();
DataOutputBuffer dob = new DataOutputBuffer();
credentials.writeTokenStorageToStream(dob);
// Now remove the AM->RM token so that containers cannot access it.
Iterator<Token<?>> iter = credentials.getAllTokens().iterator();
while (iter.hasNext()) {
Token<?> token = iter.next();
if (token.getKind().equals(AMRMTokenIdentifier.KIND_NAME)) {
iter.remove();
}
}
allTokens = ByteBuffer.wrap(dob.getData(), 0, dob.getLength());
//与ResourceManager通信，周期性发送心跳信息，包含了应用的最新信息
AMRMClientAsync.CallbackHandler allocListener = new RMCallbackHandler();
amRMClient = AMRMClientAsync.createAMRMClientAsync(1000, allocListener);
amRMClient.init(conf);
amRMClient.start();
.....
// Register self with ResourceManager
// This will start heartbeating to the RM
//启动之后进行AM的注册
appMasterHostname = NetUtils.getHostname();
RegisterApplicationMasterResponse response = amRMClient
.registerApplicationMaster(appMasterHostname, appMasterRpcPort,
appMasterTrackingUrl);
// Dump out information about cluster capability as seen by the
// resource manager
int maxMem = response.getMaximumResourceCapability().getMemory();
LOG.info("Max mem capabililty of resources in this cluster " + maxMem);
// A resource ask cannot exceed the max.
if (containerMemory > maxMem) {
LOG.info("Container memory specified above max threshold of cluster."
+ " Using max value." + ", specified=" + containerMemory + ", max="
+ maxMem);
containerMemory = maxMem;
}

在这个操作中，会将自己注册到AMLivelinessMonitor中，此刻开始启动心跳监控。

AMLiveLinessMonitor监控

在这里把重心从ApplicationMaster转移到AMLivelinessMonitor上，首先这是一个激活状态的监控线程，此类线程都有一个共同的父类

[java] view plain copy print ?

//应用存活状态监控线程
public class AMLivelinessMonitor extends AbstractLivelinessMonitor<ApplicationAttemptId> {

在AbstractlinessMonitor中定义监控类线程的一类特征和方法

[java] view plain copy print ?

//进程存活状态监控类
public abstract class AbstractLivelinessMonitor<O> extends AbstractService {
private static final Log LOG = LogFactory.getLog(AbstractLivelinessMonitor.class);
//thread which runs periodically to see the last time since a heartbeat is
//received.
//检查线程
private Thread checkerThread;
private volatile boolean stopped;
//默认超时时间5分钟
public static final int DEFAULT_EXPIRE = 5*60*1000;//5 mins
//超时时间
private int expireInterval = DEFAULT_EXPIRE;
//监控间隔检测时间，为超时时间的1/3
private int monitorInterval = expireInterval/3;
private final Clock clock;
//保存了心跳检验的结果记录
private Map<O, Long> running = new HashMap<O, Long>();

心跳检测本身非常的简单，做一次通信记录检查，然后更新一下，记录时间，当一个新的节点加入监控或解除监控操作

[java] view plain copy print ?

//新的节点注册心跳监控
public synchronized void register(O ob) {
running.put(ob, clock.getTime());
}
//节点移除心跳监控
public synchronized void unregister(O ob) {
running.remove(ob);
}

每次做心跳周期检测的时候，调用下述方法

[java] view plain copy print ?

//更新心跳监控检测最新时间
public synchronized void receivedPing(O ob) {
//only put for the registered objects
if (running.containsKey(ob)) {
running.put(ob, clock.getTime());
}
}

非常简单的更新方法，O ob对象在这里因场景而异，在AM监控中，为ApplicationID应用ID。在后面的AMS和AM的交互中会看到。新的应用加入AMLivelinessMonitor监控中后，后面的主要操作就是AMS与AM之间的交互操作了。

AM与AMS

在ApplicationMaster运行之后，会周期性的向ApplicationMasterService发送心跳信息，心跳信息包含有许多资源描述信息。

[java] view plain copy print ?

//ApplicationMaster心跳信息更新
@Override
public AllocateResponse allocate(AllocateRequest request)
throws YarnException, IOException {
ApplicationAttemptId appAttemptId = authorizeRequest();
//进行心跳信息时间的更新
this.amLivelinessMonitor.receivedPing(appAttemptId);
....

每次心跳信息一来，就会更新最新监控时间。在AMS也有对应的注册应用的方法

[java] view plain copy print ?

//ApplicationMaster在ApplicationMasterService上服务上进行应用注册
@Override
public RegisterApplicationMasterResponse registerApplicationMaster(
RegisterApplicationMasterRequest request) throws YarnException,
IOException {
ApplicationAttemptId applicationAttemptId = authorizeRequest();
ApplicationId appID = applicationAttemptId.getApplicationId();
.....
//在存活监控线程上进行心跳记录，更新检测时间，key为应用ID
this.amLivelinessMonitor.receivedPing(applicationAttemptId);
RMApp app = this.rmContext.getRMApps().get(appID);
// Setting the response id to 0 to identify if the
// application master is register for the respective attemptid
lastResponse.setResponseId(0);
responseMap.put(applicationAttemptId, lastResponse);
LOG.info("AM registration " + applicationAttemptId);
this.rmContext

如果在心跳监控中出现过期的现象，就会触发一个expire事件，在AMLiveLinessMonitor中，这部分的工作是交给CheckThread执行的

[java] view plain copy print ?

//进程存活状态监控类
public abstract class AbstractLivelinessMonitor<O> extends AbstractService {
...
//thread which runs periodically to see the last time since a heartbeat is
//received.
//检查线程
private Thread checkerThread;
....
//默认超时时间5分钟
public static final int DEFAULT_EXPIRE = 5*60*1000;//5 mins
//超时时间
private int expireInterval = DEFAULT_EXPIRE;
//监控间隔检测时间，为超时时间的1/3
private int monitorInterval = expireInterval/3;
....
//保存了心跳检验的结果记录
private Map<O, Long> running = new HashMap<O, Long>();
...
private class PingChecker implements Runnable {
@Override
public void run() {
while (!stopped && !Thread.currentThread().isInterrupted()) {
synchronized (AbstractLivelinessMonitor.this) {
Iterator<Map.Entry<O, Long>> iterator =
running.entrySet().iterator();
//avoid calculating current time everytime in loop
long currentTime = clock.getTime();
while (iterator.hasNext()) {
Map.Entry<O, Long> entry = iterator.next();
//进行超时检测
if (currentTime > entry.getValue() + expireInterval) {
iterator.remove();
//调用超时处理方法，将处理事件交由调度器处理
expire(entry.getKey());
LOG.info("Expired:" + entry.getKey().toString() +
" Timed out after " + expireInterval/1000 + " secs");
}
}
}

check线程主要做的事件就是遍历每个节点的最新心跳更新时间，通过计算差值进行判断是否过期，过期调用expire方法。此方法由其子类实现

[java] view plain copy print ?

//应用存活状态监控线程
public class AMLivelinessMonitor extends AbstractLivelinessMonitor<ApplicationAttemptId> {
//中央调度处理器
private EventHandler dispatcher;
...
@Override
protected void expire(ApplicationAttemptId id) {
//一旦应用过期，处理器处理过期事件处理
dispatcher.handle(
new RMAppAttemptEvent(id, RMAppAttemptEventType.EXPIRE));
}
}

产生应用超期事件，然后发给中央调度器去处理。之所以采用的这样的方式，是因为在RM中，所有的模块设计是以事件驱动的形式工作，最大程度的保证了各个模块间的解耦。不同模块通过不同的事件转变为不同的状态，可以理解为状态机的改变。最后用一张书中的截图简单的展示AM模块相关的调用过程。

技术分享

全部代码的分析请点击链接https://github.com/linyiqun/hadoop-yarn,后续将会继续更新YARN其他方面的代码分析。

参考文献

《Hadoop技术内部–HDFS结构设计与实现原理》.蔡斌等

YARN源码分析(一)-----ApplicationMaster

标签：

原文地址：http://www.cnblogs.com/cxzdy/p/5044020.html

踩

(0)

赞

(0)

举报

评论一句话评论（0）

分享档案

更多>

2021年07月29日 (22)
2021年07月28日 (40)
2021年07月27日 (32)
2021年07月26日 (79)
2021年07月23日 (29)
2021年07月22日 (30)
2021年07月21日 (42)
2021年07月20日 (16)
2021年07月19日 (90)
2021年07月16日 (35)

周排行

更多

友情链接

兰亭集智国之画百度统计站长统计阿里云 chrome插件新版天听网

关于我们 - 联系我们 - 留言反馈

© 2014 mamicode.com 版权所有联系我们:gaon5@hotmail.com

迷上了代码！