Today we received an alert from the production ResourceManager (RM).
The error log was as follows:
2014-07-08 13:22:54,118 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:xxxx:53356 Timed out after 600 secs
2014-07-08 13:22:54,118 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating Node xxxx:53356 as it is now LOST
2014-07-08 13:22:54,118 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: xxxx:53356 Node Transitioned from UNHEALTHY to LOST
2014-07-08 13:22:54,118 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_REMOVED to the scheduler
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeNode(FairScheduler.java:715)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:974)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:108)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:378)
    at java.lang.Thread.run(Thread.java:662)
2014-07-08 13:22:54,118 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
2014-07-08 13:22:54,119 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 1000
2014-07-08 13:22:54,119 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 2000
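This is a known bug, tracked as https://issues.apache.org/jira/browse/YARN-502
According to the bug description, it can be triggered when the RM removes a NodeManager (NM) that is already marked UNHEALTHY: the node was removed from the scheduler when it became UNHEALTHY, so a later attempt to remove it again (here, on the UNHEALTHY to LOST transition) fails.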
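To make the failure mode concrete, here is a minimal sketch of how a second removal of the same node can end in a NullPointerException. It is not the FairScheduler code: the class DoubleRemovalSketch, its fields, and its methods are all hypothetical, and the bookkeeping is reduced to a single map.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for a scheduler's node bookkeeping.
// This is NOT the real FairScheduler; it only illustrates the failure mode.
class DoubleRemovalSketch {
    // Maps a node id to the memory (in MB) tracked for that node.
    private final Map<String, Integer> nodes = new HashMap<>();
    private int clusterMemoryMb = 0;

    void addNode(String nodeId, int memoryMb) {
        nodes.put(nodeId, memoryMb);
        clusterMemoryMb += memoryMb;
    }

    // Unguarded removal: a second call for the same node gets null back
    // from the map, and unboxing that null throws NullPointerException.
    void removeNode(String nodeId) {
        Integer memoryMb = nodes.remove(nodeId);
        clusterMemoryMb -= memoryMb; // throws when memoryMb is null
    }

    // Returns true if removing the same node twice throws an NPE.
    static boolean secondRemovalThrows() {
        DoubleRemovalSketch scheduler = new DoubleRemovalSketch();
        scheduler.addNode("node-1", 8192);
        scheduler.removeNode("node-1");     // first removal, e.g. node became UNHEALTHY
        try {
            scheduler.removeNode("node-1"); // second removal, e.g. node became LOST
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("second removal throws NPE: " + secondRemovalThrows());
    }
}
```

The point of the sketch is only that a remove operation which assumes the node is still registered is not safe to call twice.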
Following the stack trace, let's look at the code:
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager (inner class SchedulerEventDispatcher, as shown in the stack trace):

protected ResourceScheduler scheduler;

// An EventProcessor thread is started to handle events
private final class EventProcessor implements Runnable {
  @Override
  public void run() {
    SchedulerEvent event;
    while (!stopped && !Thread.currentThread().isInterrupted()) {
      try {
        event = eventQueue.take(); // take the next event from the event queue
      } catch (InterruptedException e) {
        LOG.error("Returning, interrupted : " + e);
        return; // TODO: Kill RM.
      }
      try {
        scheduler.handle(event); // handle the event
      } catch (Throwable t) { // catch anything thrown while handling the event
        // An error occurred, but we are shutting down anyway.
        // If it was an InterruptedException, the very act of
        // shutdown could have caused it and is probably harmless.
        if (stopped) {
          LOG.warn("Exception during shutdown: ", t);
          break;
        }
        // According to the log, event.getType() here is NODE_REMOVED
        LOG.fatal("Error in handling event type " + event.getType()
            + " to the scheduler", t);
        if (shouldExitOnError
            && !ShutdownHookManager.get().isShutdownInProgress()) {
          LOG.info("Exiting, bbye..");
          System.exit(-1);
        }
      }
    }
  }
}
Here we can see that shouldExitOnError controls whether the RM process exits when event handling fails.
private boolean shouldExitOnError = false; // initially false

@Override
public synchronized void init(Configuration conf) {
  // During initialization the value is read from the configuration;
  // the key and its default are defined in the Dispatcher interface
  this.shouldExitOnError =
      conf.getBoolean(Dispatcher.DISPATCHER_EXIT_ON_ERROR_KEY,
          Dispatcher.DEFAULT_DISPATCHER_EXIT_ON_ERROR);
  super.init(conf);
}
The org.apache.hadoop.yarn.event.Dispatcher interface:

public interface Dispatcher {

  // Configuration to make sure dispatcher crashes but doesn't do system-exit in
  // case of errors. By default, it should be false, so that tests are not
  // affected. For all daemons it should be explicitly set to true so that
  // daemons can crash instead of hanging around.
  public static final String DISPATCHER_EXIT_ON_ERROR_KEY =
      "yarn.dispatcher.exit-on-error"; // the controlling parameter

  public static final boolean DEFAULT_DISPATCHER_EXIT_ON_ERROR = false; // default is false

  EventHandler getEventHandler();

  void register(Class<? extends Enum> eventType, EventHandler handler);
}
In the init method of the ResourceManager class:
@Override
public synchronized void init(Configuration conf) {
  this.conf = conf;
  // The effective value is now true for the RM
  // (this overrides the DEFAULT defined in Dispatcher)
  this.conf.setBoolean(Dispatcher.DISPATCHER_EXIT_ON_ERROR_KEY, true);
In other words, by default the RM exits when the dispatcher hits an error.
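The interaction between the interface default (false) and the explicit setBoolean call in the RM can be sketched as follows. This is only an illustration under stated assumptions: java.util.Properties stands in for Hadoop's Configuration to keep the example dependency-free, and the class ExitOnErrorPrecedenceSketch and its helper methods are hypothetical.

```java
import java.util.Properties;

// java.util.Properties is used here as a stand-in for Hadoop's Configuration
// (an assumption made so the sketch needs no Hadoop dependency).
class ExitOnErrorPrecedenceSketch {
    static final String KEY = "yarn.dispatcher.exit-on-error";
    static final boolean DEFAULT = false; // mirrors DEFAULT_DISPATCHER_EXIT_ON_ERROR

    // Same idea as conf.getBoolean(key, defaultValue): fall back to the
    // default only when the key has no value.
    static boolean getBoolean(Properties conf, String key, boolean defaultValue) {
        String raw = conf.getProperty(key);
        return raw == null ? defaultValue : Boolean.parseBoolean(raw);
    }

    // What a component sees when nothing has set the key (e.g. in tests).
    static boolean withoutOverride() {
        return getBoolean(new Properties(), KEY, DEFAULT);
    }

    // What a component sees after a daemon sets the key during init.
    static boolean withOverride() {
        Properties conf = new Properties();
        conf.setProperty(KEY, "true");
        return getBoolean(conf, KEY, DEFAULT);
    }

    public static void main(String[] args) {
        System.out.println("without override: " + withoutOverride());
        System.out.println("with override:    " + withOverride());
    }
}
```

The default only applies when the key is unset, so once a daemon sets the key explicitly, the default in the interface no longer matters.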
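In principle, whether to exit on error is governed by the configuration parameter yarn.dispatcher.exit-on-error (although, judging from the init code above, the RM sets it to true programmatically, so a value in the configuration file would likely be overwritten). In any case, changing this behavior has a wide impact, so it is better to leave it alone and fix the problem with a patch.
The official patch is fairly simple: when removing an NM, it checks the node's current state first so that the node is not removed from the scheduler a second time: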
--- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
+++ hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
@@ -501,8 +501,13 @@ public DeactivateNodeTransition(NodeState finalState) {
     public void transition(RMNodeImpl rmNode, RMNodeEvent event) {
       // Inform the scheduler
       rmNode.nodeUpdateQueue.clear();
-      rmNode.context.getDispatcher().getEventHandler().handle(
-          new NodeRemovedSchedulerEvent(rmNode));
+      // If the current state is NodeState.UNHEALTHY
+      // Then node is already been removed from the
+      // Scheduler
+      if (!rmNode.getState().equals(NodeState.UNHEALTHY)) {
+        rmNode.context.getDispatcher().getEventHandler()
+          .handle(new NodeRemovedSchedulerEvent(rmNode));
+      }
       rmNode.context.getDispatcher().getEventHandler().handle(
           new NodesListManagerEvent(
               NodesListManagerEventType.NODE_UNUSABLE, rmNode));
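The guard in the patch can be illustrated in isolation with a small sketch. The class DeactivateGuardSketch, its enum, and the string event names are hypothetical simplifications, not the real RMNodeImpl types; the sketch only mirrors the decision the patch makes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the deactivation transition.
class DeactivateGuardSketch {
    enum NodeState { RUNNING, UNHEALTHY, LOST }

    // Returns the events that deactivating a node would emit, given the
    // state the node is in just before the transition.
    static List<String> deactivate(NodeState currentState) {
        List<String> events = new ArrayList<>();
        // An UNHEALTHY node was already removed from the scheduler,
        // so the removal event is only sent for other states.
        if (currentState != NodeState.UNHEALTHY) {
            events.add("NODE_REMOVED");
        }
        // The node-list manager is informed in every case.
        events.add("NODE_UNUSABLE");
        return events;
    }

    public static void main(String[] args) {
        System.out.println("from RUNNING:   " + deactivate(NodeState.RUNNING));
        System.out.println("from UNHEALTHY: " + deactivate(NodeState.UNHEALTHY));
    }
}
```

With this check in place, a node going from UNHEALTHY to LOST no longer produces a second removal event for the scheduler to handle.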
This article comes from the blog "菜光光的博客"; please keep this attribution when reposting: http://caiguangguang.blog.51cto.com/1652935/1436087