Tags: hadoop
A Hive job in production failed with the following error:
Container [pid=28474,containerID=container_1411897705890_0181_01_000012] is running beyond physical memory limits. Current usage: 1.0 GB of 1 GB physical memory used; 1.5 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1411897705890_0181_01_000012 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 28474 19508 28474 28474 (bash) 0 0 9416704 309 /bin/bash -c /usr/java/jdk1.7.0_67/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx1024m -Djava.io.tmpdir=/data/yarn/local/usercache/hadoop/appcache/application_1411897705890_0181/container_1411897705890_0181_01_000012/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/data/yarn/logs/application_1411897705890_0181/container_1411897705890_0181_01_000012 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild 10.10.11.161 32875 attempt_1411897705890_0181_r_000000_3 12 1>/data/yarn/logs/application_1411897705890_0181/container_1411897705890_0181_01_000012/stdout 2>/data/yarn/logs/application_1411897705890_0181/container_1411897705890_0181_01_000012/stderr
|- 28481 28474 28474 28474 (java) 2356 397 1630285824 264098 /usr/java/jdk1.7.0_67/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx1024m -Djava.io.tmpdir=/data/yarn/local/usercache/hadoop/appcache/application_1411897705890_0181/container_1411897705890_0181_01_000012/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/data/yarn/logs/application_1411897705890_0181/container_1411897705890_0181_01_000012 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild 10.10.11.161 32875 attempt_1411897705890_0181_r_000000_3 12
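The kill decision can be read straight off the dump: the java child (pid 28481) reports 264098 resident pages, which, assuming the common 4 KB page size, is 264098 × 4096 ≈ 1,081,745,408 bytes ≈ 1.0 GB, right at the container's 1 GB physical memory limit. Likewise its 1630285824 bytes of virtual memory (about 1.5 GB, plus the bash parent's 9416704 bytes) matches the reported 1.5 GB of the 2.1 GB virtual limit.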
Judging from the exception, memory usage exceeded the configured limit and ContainersMonitorImpl killed the process, so the next thing to check is how the JVM was collecting memory.
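One way to capture a GC log like the one below is to append the standard HotSpot flags to the task's JVM options, e.g. via mapreduce.reduce.java.opts (the property name assumes Hadoop 2.x / MRv2):

-Xmx1024m -verbose:gc -XX:+PrintGCDetails

The resulting log for the failing attempt: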
[GC [PSYoungGen: 241753K->16036K(306176K)] 241753K->16116K(1005568K), 0.0362550 secs] [Times: user=0.31 sys=0.05, real=0.04 secs]
[GC [PSYoungGen: 210741K->4826K(306176K)] 210821K->282228K(1005568K), 0.0996080 secs] [Times: user=1.58 sys=0.15, real=0.10 secs]
[GC [PSYoungGen: 194630K->4762K(306176K)] 472032K->624439K(1005568K), 0.1418910 secs] [Times: user=2.30 sys=0.14, real=0.14 secs]
[Full GC [PSYoungGen: 4762K->0K(306176K)] [ParOldGen: 619677K->359650K(699392K)] 624439K->359650K(1005568K) [PSPermGen: 21635K->21622K(43520K)], 0.1742260 secs] [Times: user=0.82 sys=0.07, real=0.17 secs]
[GC-- [PSYoungGen: 192581K->192581K(306176K)] 655085K->833707K(1005568K), 0.0634170 secs] [Times: user=1.08 sys=0.00, real=0.06 secs]
[Full GC [PSYoungGen: 192581K->0K(306176K)] [ParOldGen: 641125K->640707K(699392K)] 833707K->640707K(1005568K) [PSPermGen: 21674K->21674K(49152K)], 0.0663990 secs] [Times: user=0.65 sys=0.05, real=0.07 secs]
[Full GC [PSYoungGen: 262656K->0K(306176K)] [ParOldGen: 640709K->8142K(699392K)] 903365K->8142K(1005568K) [PSPermGen: 24649K->24647K(49152K)], 0.0662210 secs] [Times: user=0.37 sys=0.00, real=0.07 secs]
[GC [PSYoungGen: 262656K->15936K(327680K)] 270798K->24078K(1027072K), 0.0175890 secs] [Times: user=0.14 sys=0.14, real=0.02 secs]
Heap
 PSYoungGen      total 327680K, used 201250K [0x00000000eaa80000, 0x0000000100000000, 0x0000000100000000)
  eden space 284160K, 65% used [0x00000000eaa80000,0x00000000f5f78b18,0x00000000fc000000)
  from space 43520K, 36% used [0x00000000fd580000,0x00000000fe510010,0x0000000100000000)
  to   space 22016K, 0% used [0x00000000fc000000,0x00000000fc000000,0x00000000fd580000)
 ParOldGen       total 699392K, used 8142K [0x00000000bff80000, 0x00000000eaa80000, 0x00000000eaa80000)
  object space 699392K, 1% used [0x00000000bff80000,0x00000000c07738d8,0x00000000eaa80000)
 PSPermGen       total 49152K, used 24726K [0x00000000bad80000, 0x00000000bdd80000, 0x00000000bff80000)
  object space 49152K, 50% used [0x00000000bad80000,0x00000000bc5a5908,0x00000000bdd80000)
There is no obvious memory leak or OutOfMemoryError here, so the tuning still starts from the heap. The log does show a young generation of only about 300 MB out of the roughly 1 GB heap, with data being promoted into the old generation and churned through repeated Full GCs. Since objects in an MR job are mostly short-lived, the fix is to enlarge the young generation so that more objects die there, making collection more efficient. The JVM options were adjusted to:
-Xms1024m -Xmx1024m -Xmn600m
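With -Xmx1024m and -Xmn600m, the young generation grows from the roughly 300 MB seen in the GC log to 600 MB, leaving about 424 MB for the old generation. For a Hive job, one way to hand these options to the reduce tasks (assuming Hadoop 2.x property names; the failing attempt attempt_1411897705890_0181_r_000000_3 is a reduce attempt) is:

set mapreduce.reduce.java.opts=-Xms1024m -Xmx1024m -Xmn600m;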
That fixed the problem. This incident was a physical memory overrun; there is a second variant in which a task is killed for exceeding the virtual memory limit.
The parameter yarn.nodemanager.vmem-pmem-ratio controls how much virtual memory is allowed per unit of physical memory. The default is 2.1: for every 1 MB of physical memory a container is allocated, it may use at most 2.1 MB of virtual memory. That is exactly where the "2.1 GB" virtual limit in the error above comes from: 1 GB × 2.1. A virtual memory overrun can be addressed by raising this ratio appropriately, or again by tuning JVM memory use and garbage collection.
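For reference, the relevant yarn-site.xml knobs; the 3.0 below is an illustrative value, not a recommendation, and changes take effect only after a NodeManager restart:

<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>3.0</value> <!-- default: 2.1 -->
</property>
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>true</value> <!-- false disables the virtual memory check entirely -->
</property>

These toggles surface in the monitoring code below as isVmemCheckEnabled() and isPmemCheckEnabled().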
Finally, a word on ContainersMonitorImpl's monitoring strategy. It records the pid of every container, and its internal MonitoringThread scans each running container's process tree at a fixed interval. The NodeManager builds the tree rooted at the container's process by reading /proc/<pid>/stat, and enforces the task's memory limit against the memory used by the entire tree.
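As an illustration of the per-process bookkeeping behind that tree, here is a minimal self-contained sketch; the real logic lives in the procfs-based ResourceCalculatorProcessTree implementation, and the hard-coded 4096-byte page size is an assumption of this sketch (the real code obtains it from the OS):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Parse /proc/<pid>/stat and extract what the monitor needs:
// the parent pid (to link the tree) and the memory counters.
public class ProcStatSketch {
  public static void main(String[] args) throws IOException {
    String pid = args.length > 0 ? args[0] : "self";
    String stat = new String(Files.readAllBytes(Paths.get("/proc/" + pid + "/stat")));
    // Field 2 (comm) is parenthesized and may contain spaces, so cut at the
    // last ')' before splitting the remainder on whitespace.
    String[] f = stat.substring(stat.lastIndexOf(')') + 2).split("\\s+");
    long ppid  = Long.parseLong(f[1]);   // stat field 4: parent pid
    long vsize = Long.parseLong(f[20]);  // stat field 23: virtual memory, bytes
    long rss   = Long.parseLong(f[21]);  // stat field 24: resident set, pages
    long pageSize = 4096;                // assumed here, queried from the OS in reality
    System.out.printf("ppid=%d vmem=%d bytes pmem=%d bytes%n",
        ppid, vsize, rss * pageSize);
  }
}

The monitoring loop itself, trimmed to the memory check: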
private class MonitoringThread extends Thread {
  @Override
  public void run() {
    while (true) {
      // Get the container's process tree and refresh it
      ResourceCalculatorProcessTree pTree = ptInfo.getProcessTree();
      pTree.updateProcessTree();
      // Total memory used by the container's process tree
      long currentVmemUsage = pTree.getCumulativeVmem();
      long currentPmemUsage = pTree.getCumulativeRssmem();
      // Memory used only by processes that have survived more than one
      // monitoring interval (age > 1)
      long curMemUsageOfAgedProcesses = pTree.getCumulativeVmem(1);
      long curRssMemUsageOfAgedProcesses = pTree.getCumulativeRssmem(1);
      long vmemLimit = ptInfo.getVmemLimit();
      long pmemLimit = ptInfo.getPmemLimit();
      boolean isMemoryOverLimit = false;
      String msg = "";
      // Kill the container if the whole tree (age > 0) uses more than twice
      // the configured limit, or if the aged processes (age > 1) alone
      // exceed the limit
      if (isVmemCheckEnabled()
          && isProcessTreeOverLimit(containerId.toString(),
              currentVmemUsage, curMemUsageOfAgedProcesses, vmemLimit)) {
        msg = formatErrorMessage("virtual", currentVmemUsage, vmemLimit,
            currentPmemUsage, pmemLimit, pId, containerId, pTree);
        isMemoryOverLimit = true;
      } else if (isPmemCheckEnabled()
          && isProcessTreeOverLimit(containerId.toString(),
              currentPmemUsage, curRssMemUsageOfAgedProcesses, pmemLimit)) {
        msg = formatErrorMessage("physical", currentVmemUsage, vmemLimit,
            currentPmemUsage, pmemLimit, pId, containerId, pTree);
        isMemoryOverLimit = true;
      }
      // ... (kill the container if over limit, then sleep until the next interval)
    }
  }
}
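The two-pronged rule in the comment is implemented by isProcessTreeOverLimit; simplified, with logging elided, it looks roughly like this:

boolean isProcessTreeOverLimit(String containerId,
    long currentMemUsage, long curMemUsageOfAgedProcesses, long vmemLimit) {
  boolean isOverLimit = false;
  if (currentMemUsage > (2 * vmemLimit)) {
    // The whole tree (age > 0) is past twice the limit: too large to be a
    // transient spike, kill right away
    isOverLimit = true;
  } else if (curMemUsageOfAgedProcesses > vmemLimit) {
    // Only processes that have survived more than one monitoring interval
    // (age > 1) are held to the plain limit
    isOverLimit = true;
  }
  return isOverLimit;
}

The one-interval grace period exists because a freshly forked child can momentarily appear to use as much memory as its parent, which would otherwise be double-counted and trigger spurious kills.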
So a task being killed is not always caused by a JVM heap overflow; the YARN and JVM parameters have to be tuned together. Nor is a bigger heap always better: getting the relative sizes of the generations right matters just as much.
This post is from "lotso's blog"; please retain this attribution: http://lotso.blog.51cto.com/3681673/1567548