Tags: resource manager, capacityscheduler, NPE exception, yarn
I. Problem description
On a YARN 2.0 cluster, the ResourceManager on master2 went down, which triggered a ResourceManager failover.
II. Problem analysis
1) Check the ResourceManager log on master2
2016-06-26 12:35:41,504 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=warehouse OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1466451117456_12139 CONTAINERID=container_1466451117456_12139_02_000001
2016-06-26 12:35:41,504 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Updating application attempt appattempt_1466451117456_12139_000002 with final state: FAILED, and exit status: -100
2016-06-26 12:35:41,504 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1466451117456_12139_000002 State change from ALLOCATED to FINAL_SAVING
2016-06-26 12:35:41,504 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Unregistering app attempt : appattempt_1466451117456_12139_000002
2016-06-26 12:35:41,504 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type CONTAINER_EXPIRED to the scheduler
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1664)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:1231)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1117)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:114)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:686)
    at java.lang.Thread.run(Thread.java:724)
2016-06-26 12:35:41,504 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Application finished, removing password for appattempt_1466451117456_12139_000002
2016-06-26 12:35:41,504 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
2016-06-26 12:35:41,504 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1466451117456_12139_000002 State change from FINAL_SAVING to FAILED
2016-06-26 12:35:41,504 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: The number of failed attempts is 0. The max attempts is 2
2016-06-26 12:35:41,505 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1466451117456_12139_000003
2016-06-26 12:35:41,505 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1466451117456_12139_000003 State change from NEW to SUBM
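The log shows that a NullPointerException (NPE) thrown inside CapacityScheduler while handling a CONTAINER_EXPIRED event caused the ResourceManager to exit. This exit is itself a safety mechanism: it prevents a scheduler exception from leaving the ResourceManager running but unusable from then on.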
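The fail-fast behaviour seen in the log can be sketched in a few lines. This is only an illustration under my own assumptions: `FailFastDispatcher`, `submit`, `drain`, and the `onFatal` callback are hypothetical names, not Hadoop classes, and where the real ResourceManager exits the process this sketch merely runs a callback and stops handling events.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

// Hypothetical illustration; none of these names are Hadoop classes.
class FailFastDispatcher {
    private final Queue<String> events = new ArrayDeque<>();
    private final Consumer<String> handler;
    private final Runnable onFatal;
    private boolean stopped = false;

    FailFastDispatcher(Consumer<String> handler, Runnable onFatal) {
        this.handler = handler;
        this.onFatal = onFatal;
    }

    void submit(String eventType) {
        events.add(eventType);
    }

    // Handles queued events in order and returns how many succeeded.
    // The first handler exception stops the loop and fires onFatal.
    int drain() {
        int handled = 0;
        while (!stopped && !events.isEmpty()) {
            String eventType = events.poll();
            try {
                handler.accept(eventType);
                handled++;
            } catch (RuntimeException e) {
                System.err.println("Error in handling event type " + eventType + ": " + e);
                stopped = true;
                onFatal.run();
            }
        }
        return handled;
    }

    public static void main(String[] args) {
        FailFastDispatcher d = new FailFastDispatcher(
            type -> {
                // Stand-in for a scheduler bug that surfaces as an NPE.
                if (type.equals("CONTAINER_EXPIRED")) {
                    throw new NullPointerException();
                }
            },
            () -> System.err.println("Stopping so that a standby can take over"));
        d.submit("NODE_UPDATE");
        d.submit("CONTAINER_EXPIRED");
        d.submit("NODE_UPDATE");
        System.out.println("handled before stopping: " + d.drain());
    }
}
```

Events queued behind the failing one are deliberately left unprocessed: once the scheduler state is suspect, continuing could do more damage than handing over to the standby.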
2) The likely cause is CapacityScheduler's asynchronous scheduling. The relevant source (org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler) is:
static void schedule(CapacityScheduler cs) {
  // First randomize the start point
  int current = 0;
  Collection<FiCaSchedulerNode> nodes = cs.nodeTracker.getAllNodes();
  int start = random.nextInt(nodes.size());
  // While this loop runs, nodes may already have been modified by another thread
  for (FiCaSchedulerNode node : nodes) {
    if (current++ >= start) {
      cs.allocateContainersToNode(node);
    }
  }
  // Now, just get everyone to be safe
  for (FiCaSchedulerNode node : nodes) {
    cs.allocateContainersToNode(node);
  }
  try {
    Thread.sleep(cs.getAsyncScheduleInterval());
  } catch (InterruptedException e) {}
}
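To make the suspected failure mode concrete, here is a minimal sketch of how a lookup against shared state that another thread has already cleaned up turns into an NPE, next to a guarded variant. It is not the Hadoop code path (the code at LeafQueue.java:1664 is not shown in this post); `StaleLookupDemo`, `Node`, `releaseUnsafe`, and `releaseGuarded` are hypothetical names, and the interleaving is simulated in a single thread so that the result is deterministic.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in types; none of these names come from Hadoop.
class StaleLookupDemo {
    static final class Node {
        final String host;
        Node(String host) { this.host = host; }
    }

    // Shared registry that one thread reads while another may remove entries.
    static final Map<String, Node> nodes = new ConcurrentHashMap<>();

    // Unsafe: assumes the node is still registered when the event is handled.
    static String releaseUnsafe(String nodeId) {
        Node node = nodes.get(nodeId);
        return node.host; // NullPointerException if the node was removed
    }

    // Guarded: treats a missing node as already cleaned up and skips it.
    static String releaseGuarded(String nodeId) {
        Node node = nodes.get(nodeId);
        if (node == null) {
            return null;
        }
        return node.host;
    }

    public static void main(String[] args) {
        nodes.put("n1", new Node("host-1"));
        // Simulate the other thread winning the race: the node is removed
        // before the handler below gets to run.
        nodes.remove("n1");

        System.out.println("guarded result: " + releaseGuarded("n1"));
        try {
            releaseUnsafe("n1");
        } catch (NullPointerException e) {
            System.out.println("unsafe handler threw NullPointerException");
        }
    }
}
```

A thread-safe map alone does not help here: each call is atomic, but the assumption that an entry found earlier still exists is not.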
III. Solution
Edit capacity-scheduler.xml to turn off asynchronous scheduling:
<property>
  <name>yarn.scheduler.capacity.schedule-asynchronously.enable</name>
  <value>false</value>
</property>
The change takes effect only after the ResourceManager is restarted.
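Before restarting, it can be worth confirming that the edited file really carries the value you expect. The sketch below reads a property from Hadoop-style configuration XML using only the JDK's XML parser; `AsyncSchedulingCheck` and `lookup` are hypothetical helper names, and the XML string in `main` is an inline stand-in for the real capacity-scheduler.xml, whose location depends on your installation.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper; it uses the JDK XML parser, not Hadoop's own classes.
class AsyncSchedulingCheck {
    static final String KEY = "yarn.scheduler.capacity.schedule-asynchronously.enable";

    // Returns the <value> of the first <property> whose <name> equals key,
    // or null if no such property exists.
    static String lookup(String xml, String key) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element prop = (Element) props.item(i);
                NodeList names = prop.getElementsByTagName("name");
                NodeList values = prop.getElementsByTagName("value");
                if (names.getLength() == 0 || values.getLength() == 0) {
                    continue; // skip malformed entries
                }
                if (names.item(0).getTextContent().trim().equals(key)) {
                    return values.item(0).getTextContent().trim();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse configuration XML", e);
        }
    }

    public static void main(String[] args) {
        // Inline stand-in for the contents of capacity-scheduler.xml.
        String xml = "<configuration><property>"
            + "<name>" + KEY + "</name><value>false</value>"
            + "</property></configuration>";
        System.out.println(KEY + " = " + lookup(xml, KEY));
    }
}
```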
This article comes from the "散人" blog; please keep this attribution: http://zouqingyun.blog.51cto.com/782246/1878530
ResourceManager exits because of an NPE in CapacityScheduler, triggering a failover
Original article: http://zouqingyun.blog.51cto.com/782246/1878530