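Tags: monitoring, nagios, dell, omsa, check_openmanage
Yesterday I shared some notes on server hardware monitoring and mentioned the check_openmanage tool at the end of that article. This article covers what the tool does for server hardware monitoring.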
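I. Introduction to check_openmanage
check_openmanage is a Nagios plugin. It gets its reporting data from OMSA and checks the health of Dell servers that have OpenManage Server Administrator (OMSA) installed, including the storage subsystem, power supplies, temperatures and so on.
Official site: http://folk.uio.no/trondham/software/check_openmanage.html
Latest version download link: http://folk.uio.no/trondham/software/files/check_openmanage-3.7.12.tar.gz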
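Architecture:
As the architecture diagram shows, Nagios can obtain the monitoring data in two ways.
1. The Nagios server uses check_nrpe to call check_openmanage on the monitored host. This requires installing OMSA and check_openmanage on the monitored host.
2. The Nagios server runs check_openmanage itself and monitors the host remotely. This requires perl-Net-SNMP on the Nagios server, and SNMP and OMSA on the monitored host.
Note:
With the first method, check_nrpe consumes resources on the monitored server, so the second method is recommended. The second method also suits operations environments that use Zabbix for monitoring.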
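II. Installing check_openmanage
Installing check_openmanage is very simple: download the package and unpack it. Because it is distributed both through a git repository and as a .tar.gz archive, both installation methods are listed here.
Method 1: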
[root@kvm-phy04-jz ~]# cd /usr/local/src
[root@kvm-phy04-jz src]# git clone git://git.uio.no/check_openmanage
[root@kvm-phy04-jz src]# cd check_openmanage
[root@kvm-phy04-jz check_openmanage]# ./check_openmanage    # with no arguments, prints the server's warning and critical alerts by default
Method 2:
[root@kvm-phy04-jz ~]# cd /usr/local/src
[root@kvm-phy04-jz src]# wget http://folk.uio.no/trondham/software/files/check_openmanage-3.7.11.tar.gz
[root@kvm-phy04-jz src]# tar zxf check_openmanage-3.7.11.tar.gz
[root@kvm-phy04-jz src]# cd check_openmanage-3.7.11
[root@kvm-phy04-jz check_openmanage-3.7.11]# ./check_openmanage
Note:
If the output reports "Storage Error", add the --no-storage option:
[root@kvm-phy04-jz check_openmanage-3.7.11]# ./check_openmanage --no-storage
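The fallback described in the note above can be automated. The following is only a minimal sketch: the helper name run_check and the CHECK_OPENMANAGE variable are made up for illustration, and it assumes the error text contains the literal string "Storage Error" as quoted above.

```shell
# Path to the plugin (assumed default; override to match where you unpacked it).
CHECK_OPENMANAGE="${CHECK_OPENMANAGE:-./check_openmanage}"

# Hypothetical helper: run the plugin once, and if its output mentions
# "Storage Error", run it again with --no-storage added.
run_check() {
    out="$("$CHECK_OPENMANAGE" "$@" 2>&1)"
    rc=$?
    if printf '%s\n' "$out" | grep -q 'Storage Error'; then
        out="$("$CHECK_OPENMANAGE" --no-storage "$@" 2>&1)"
        rc=$?
    fi
    # Print whichever output was kept and pass the plugin's exit status on.
    printf '%s\n' "$out"
    return "$rc"
}
```

Call it the same way as the plugin itself, for example `run_check -s`. Keep in mind that the retry silences storage checks entirely, so it is only appropriate on machines where the storage error is expected.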
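III. check_openmanage usage in detail
check_openmanage offers many options and parameters. The official help text is only available in English, so I have translated and annotated it here based on my own experience, to help you get started with the tool quickly.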
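[General options]
-f, --config                    # Specify a configuration file
-p, --perfdata                  # Output performance data; often combined with --only, do not combine with -d
-t, --timeout <value>           # Set the execution timeout of check_openmanage
-c, --critical                  # Custom critical threshold for temperatures
-w, --warning                   # Custom warning threshold for temperatures
-F, --fahrenheit                # Use Fahrenheit as the temperature unit
-d, --debug                     # Show all checked items
-h, --help                      # Show the check_openmanage help text
-V, --version                   # Show the check_openmanage version
[SNMP options]
-H, --hostname                  # Use SNMP to get hardware information from the given hostname or IP
-C, --community                 # Custom SNMP community name, default public
-P, --protocol                  # Custom SNMP protocol version, default 2c
--port                          # Custom SNMP port, default 161
-6, --ipv6                      # Use IPv6 instead of IPv4, default no
--tcp                           # Use TCP instead of UDP, default no
[Output options]
-i, --info                      # Prefix alert messages with the server's SN (service tag)
-e, --extinfo                   # Output system information
-s, --state                     # Prefix each message with its alert level, e.g. WARNING or CRITICAL
-S, --short-state               # Prefix each message with the abbreviated alert level, e.g. W or C
-o, --okinfo                    # Output on a single line (default)
-B, --show-blacklist            # Show the blacklist; useful for reviewing it once many entries have been added
-I, --htmlinfo                  # Output HTML with clickable links
[Check control and blacklisting]
-a, --all                       # Include log statistics and detailed log output
-b, --blacklist component=ID    # Blacklist: hide messages for the given ID of a component. IDs are shown by ./check_openmanage -d. Has no effect together with -d
--only                          # Output only one specific monitoring item
--check component=[0|1],esmlog=[0|1]   # Check a single item or a combination of items; 0 disables and 1 enables a check; use on its own
--no-storage                    # Do not check storage
--vdisk-critical                # Raise any warning on a virtual disk to critical
[Custom output message]
--postmsg 'custom message'      # Append the custom message to the end of the output
The following variables can be used in the custom message:
%m    # System model
%s    # System SN (service tag)
%b    # BIOS version
%d    # BIOS release date
%o    # Operating system name
%r    # Operating system kernel version
%p    # Number of physical disks
%l    # Number of logical disks
%n    # Line break
%%    # A literal percent sign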
References:
1. http://folk.uio.no/trondham/software/check_openmanage.html#download
2. check_openmanage -h
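To show how the options listed in this section combine, here is a small sketch of a wrapper for a remote SNMP check. The function name remote_check, the CHECK_OPENMANAGE variable, the 30 second timeout and the wording of the --postmsg text are all illustrative assumptions; only the options themselves come from the list above.

```shell
# Path to the plugin (assumed default; override to match your installation).
CHECK_OPENMANAGE="${CHECK_OPENMANAGE:-./check_openmanage}"

# Hypothetical wrapper: query a host over SNMP, prefix each alert with its
# state, and append model, service tag and BIOS version to the output.
#   $1 = hostname or IP, $2 = SNMP community (falls back to "public")
remote_check() {
    "$CHECK_OPENMANAGE" -H "$1" -C "${2:-public}" -P 2c -t 30 -s \
        --postmsg 'Model: %m, SN: %s, BIOS: %b'
}
```

For example, `remote_check 192.168.0.210` queries that host with the default community. The single quotes around the --postmsg value keep the shell from touching the % placeholders.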
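IV. Practical examples
Because check_openmanage has so many options, it can be hard to know which ones to use in practice, so here are some common needs and the matching command combinations. As described above, check_openmanage can obtain information in two ways. The examples here mainly cover the first half of the first method, that is, running check_openmanage locally.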
1. Running without any arguments. With no arguments, the plugin prints the server's warning and critical alerts by default.
[root@kvm-phy04-jz check_openmanage-3.7.11]# ./check_openmanage
Controller 0 [PERC H310 Mini]: Firmware '20.12.1-0002' is out of date
Physical Disk 0:1:0 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
Physical Disk 0:1:1 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
Physical Disk 0:1:2 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
Physical Disk 0:1:3 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
Physical Disk 0:1:4 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
Physical Disk 0:1:5 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
2. Output messages prefixed with their state
[root@kvm-phy04-jz check_openmanage-3.7.11]# ./check_openmanage -s
WARNING: Controller 0 [PERC H310 Mini]: Firmware '20.12.1-0002' is out of date
WARNING: Physical Disk 0:1:0 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
WARNING: Physical Disk 0:1:1 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
WARNING: Physical Disk 0:1:2 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
WARNING: Physical Disk 0:1:3 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
WARNING: Physical Disk 0:1:4 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
WARNING: Physical Disk 0:1:5 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
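The state prefix added by -s makes the output easy to post-process. As a rough sketch, the helper below (count_state is a made-up name) counts how many lines carry a given state, assuming the "STATE: message" layout shown in the example above.

```shell
# Hypothetical helper: read "check_openmanage -s" output on stdin and print
# the number of lines that start with the given state prefix.
#   $1 = state name, e.g. WARNING or CRITICAL
count_state() {
    # grep -c prints 0 but exits non-zero when nothing matches, so the
    # "|| true" keeps the helper's own exit status at 0 in that case.
    grep -c "^$1:" || true
}
```

Usage would look like `./check_openmanage -s | count_state WARNING`, which is handy when feeding a single number into a monitoring system such as Zabbix.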
3. Use the blacklist to skip the firmware update notice
[root@kvm-phy04-jz check_openmanage-3.7.11]# ./check_openmanage -s -b ctrl_fw=0
WARNING: Physical Disk 0:1:0 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
WARNING: Physical Disk 0:1:1 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
WARNING: Physical Disk 0:1:2 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
WARNING: Physical Disk 0:1:3 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
WARNING: Physical Disk 0:1:4 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
WARNING: Physical Disk 0:1:5 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
4. Use the blacklist to skip the "Not Certified" disk notices
[root@kvm-phy04-jz check_openmanage-3.7.11]# ./check_openmanage -s -b pdisk_cert=all
WARNING: Controller 0 [PERC H310 Mini]: Firmware '20.12.1-0002' is out of date
5. Use the blacklist to skip the firmware update notice for ID 0 and the "Not Certified" notice for the physical disk with ID 0:0:1:0
[root@kvm-phy04-jz check_openmanage-3.7.11]# ./check_openmanage -b ctrl_fw=0/pdisk=0:0:1:0
Physical Disk 0:1:1 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
Physical Disk 0:1:2 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
Physical Disk 0:1:3 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
Physical Disk 0:1:4 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
Physical Disk 0:1:5 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
6. Use the blacklist to skip the firmware update notice for ID 0 and the notices for all uncertified physical disks
[root@kvm-phy04-jz check_openmanage-3.7.11]# ./check_openmanage -b ctrl_fw=0/pdisk=ALL
OK - System: 'PowerEdge R720', SN: '33R0G42', 32 GB ram (4 dimms), 1 logical drives, 6 physical drives
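Examples 5 and 6 pass several component=ID pairs to -b by joining them with "/". When the list grows, it can be easier to build that string from separate arguments. The helper below is a sketch only; join_blacklist is a made-up name and not part of check_openmanage.

```shell
# Hypothetical helper: join component=ID pairs with "/" to form the value
# passed to -b, e.g. "ctrl_fw=0/pdisk=ALL".
# The body runs in a subshell, so the IFS change does not leak to the caller.
join_blacklist() (
    IFS=/
    printf '%s\n' "$*"
)
```

It could then be used as `./check_openmanage -b "$(join_blacklist ctrl_fw=0 pdisk=ALL)"`, which is equivalent to the command in example 6.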
7. Output all checked items
[root@kvm-phy04-jz check_openmanage-3.7.11]# ./check_openmanage -d
   System:      PowerEdge R720           OMSA version:    8.1.0
   ServiceTag:  33R0G42                  Plugin version:  3.7.11
   BIOS/date:   2.4.3 07/09/2014         Checking mode:   local
-----------------------------------------------------------------------------
   Storage Components
=============================================================================
  STATE  |    ID    |  MESSAGE TEXT
---------+----------+--------------------------------------------------------
 WARNING |        0 | Controller 0 [PERC H310 Mini]: Firmware '20.12.1-0002' is out of date
      OK |        0 | Controller 0 [PERC H310 Mini] is Degraded
 WARNING |  0:0:1:0 | Physical Disk 0:1:0 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
 WARNING |  0:0:1:1 | Physical Disk 0:1:1 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
 WARNING |  0:0:1:2 | Physical Disk 0:1:2 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
 WARNING |  0:0:1:3 | Physical Disk 0:1:3 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
 WARNING |  0:0:1:4 | Physical Disk 0:1:4 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
 WARNING |  0:0:1:5 | Physical Disk 0:1:5 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
      OK |      0:0 | Logical Drive '/dev/sda' [RAID-10, 836.63 GB] is Ready
      OK |      0:0 | Connector 0 [SAS Port RAID Mode] on controller 0 is Ready
      OK |      0:1 | Connector 1 [SAS Port RAID Mode] on controller 0 is Ready
      OK |    0:0:1 | Enclosure 0:0:1 [Backplane] on controller 0 is Ready
-----------------------------------------------------------------------------
   Chassis Components
=============================================================================
  STATE  |  ID  |  MESSAGE TEXT
---------+------+------------------------------------------------------------
      OK |    0 | Memory module 0 [DIMM_A1, 8192 MB] is Ok
      OK |    1 | Memory module 1 [DIMM_A2, 8192 MB] is Ok
      OK |    2 | Memory module 2 [DIMM_B1, 8192 MB] is Ok
      OK |    3 | Memory module 3 [DIMM_B2, 8192 MB] is Ok
      OK |    0 | Chassis fan 0 [System Board Fan1 RPM] reading: 3000 RPM
      OK |    1 | Chassis fan 1 [System Board Fan2 RPM] reading: 3000 RPM
      OK |    2 | Chassis fan 2 [System Board Fan3 RPM] reading: 2880 RPM
      OK |    3 | Chassis fan 3 [System Board Fan4 RPM] reading: 3000 RPM
      OK |    4 | Chassis fan 4 [System Board Fan5 RPM] reading: 2880 RPM
      OK |    5 | Chassis fan 5 [System Board Fan6 RPM] reading: 3000 RPM
      OK |    0 | Power Supply 0 [AC]: Presence Detected
      OK |    0 | Temperature Probe 0 [System Board Inlet Temp] reads 27 C (min=3/-7, max=42/47)
      OK |    1 | Temperature Probe 1 [System Board Exhaust Temp] reads 31 C (min=8/3, max=70/75)
      OK |    2 | Temperature Probe 2 [CPU1 Temp] reads 36 C (min=8/3, max=79/84)
      OK |    3 | Temperature Probe 3 [CPU2 Temp] reads 31 C (min=8/3, max=79/84)
      OK |    0 | Processor 0 [Intel Xeon E5-2630 v2 2.60GHz] is Present
      OK |    1 | Processor 1 [Intel Xeon E5-2630 v2 2.60GHz] is Present
      OK |    0 | Voltage sensor 0 [CPU1 VCORE PG] is Good
      OK |    1 | Voltage sensor 1 [CPU2 VCORE PG] is Good
      OK |    2 | Voltage sensor 2 [System Board 3.3V PG] is Good
      OK |    3 | Voltage sensor 3 [System Board 5V PG] is Good
      OK |    4 | Voltage sensor 4 [CPU2 PLL PG] is Good
      OK |    5 | Voltage sensor 5 [CPU1 PLL PG] is Good
      OK |    6 | Voltage sensor 6 [System Board 1.1V PG] is Good
      OK |    7 | Voltage sensor 7 [CPU1 M23 VDDQ PG] is Good
      OK |    8 | Voltage sensor 8 [CPU1 M23 VTT PG] is Good
      OK |    9 | Voltage sensor 9 [System Board FETDRV PG] is Good
      OK |   10 | Voltage sensor 10 [CPU2 VSA PG] is Good
      OK |   11 | Voltage sensor 11 [CPU1 VSA PG] is Good
      OK |   12 | Voltage sensor 12 [CPU2 M01 VDDQ PG] is Good
      OK |   13 | Voltage sensor 13 [CPU1 M01 VDDQ PG] is Good
      OK |   14 | Voltage sensor 14 [CPU2 M23 VTT PG] is Good
      OK |   15 | Voltage sensor 15 [CPU2 M01 VTT PG] is Good
      OK |   16 | Voltage sensor 16 [System Board NDC PG] is Good
      OK |   17 | Voltage sensor 17 [CPU2 VTT PG] is Good
      OK |   18 | Voltage sensor 18 [CPU1 VTT PG] is Good
      OK |   19 | Voltage sensor 19 [CPU2 M23 VDDQ PG] is Good
      OK |   20 | Voltage sensor 20 [System Board 1.5V PG] is Good
      OK |   21 | Voltage sensor 21 [System Board PS2 PG Fail] is Good
      OK |   22 | Voltage sensor 22 [System Board PS1 PG Fail] is Good
      OK |   23 | Voltage sensor 23 [System Board BP1 5V PG] is Good
      OK |   24 | Voltage sensor 24 [CPU1 M01 VTT PG] is Good
      OK |   25 | Voltage sensor 25 [PS1 Voltage 1] reads 220 V
      OK |    0 | Battery probe 0 [System Board CMOS Battery] is Good
      OK |    1 | Amperage probe 1 [System Board Pwr Consumption] reads 112 W
      OK |    0 | Chassis intrusion 0 detection: Ok (Chassis is closed)
      OK |    0 | SD Card 0 [vFlash] is Absent
-----------------------------------------------------------------------------
   Other messages
=============================================================================
  STATE  |  MESSAGE TEXT
---------+-------------------------------------------------------------------
      OK | ESM log health is Ok (less than 80% full)
      OK | Chassis Service Tag is sane
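The ID column of the -d output is what the -b blacklist option expects, so it can be useful to pull out the IDs of all rows in a given state. The following is a sketch under the assumption that rows are laid out as "STATE | ID | MESSAGE TEXT" as in the example above; ids_with_state is a made-up helper name.

```shell
# Hypothetical helper: read "check_openmanage -d" output on stdin and print
# the ID column of every three-column row whose STATE column equals $1.
ids_with_state() {
    awk -F'|' -v want="$1" '
        NF >= 3 {
            state = $1
            id = $2
            # Strip the padding around both columns before comparing.
            gsub(/^[ \t]+|[ \t]+$/, "", state)
            gsub(/^[ \t]+|[ \t]+$/, "", id)
            if (state == want) print id
        }'
}
```

For instance, `./check_openmanage -d | ids_with_state WARNING` lists the IDs to consider for blacklisting. The IDs alone do not say which component they belong to (a controller and a fan can both have ID 0), so check the message text before adding an entry.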
8. Prefix alert messages with the server's SN (service tag)
[root@kvm-phy04-jz check_openmanage-3.7.11]# ./check_openmanage -i
[33R0G42] Controller 0 [PERC H310 Mini]: Firmware '20.12.1-0002' is out of date
[33R0G42] Physical Disk 0:1:0 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
[33R0G42] Physical Disk 0:1:1 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
[33R0G42] Physical Disk 0:1:2 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
[33R0G42] Physical Disk 0:1:3 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
[33R0G42] Physical Disk 0:1:4 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
[33R0G42] Physical Disk 0:1:5 [Seagate ST3300657SS, 300GB] on ctrl 0 is Online, Not Certified
9. Do not check storage
[root@kvm-phy04-jz check_openmanage-3.7.11]# ./check_openmanage --no-storage
OK - System: 'PowerEdge R720', SN: '33R0G42', 32 GB ram (4 dimms), not checking storage
10. Use the blacklist to hide the firmware update and uncertified disk notices, and output system information
[root@kvm-phy04-jz check_openmanage-3.7.11]# ./check_openmanage -e -b ctrl_fw=0/pdisk=ALL
------ SYSTEM: PowerEdge R720, SN: 33R0G42
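V. Using check_openmanage to get information from remote servers
Normally, to check the local machine you can run check_openmanage directly, as in the commands above. It can also be run from one machine to check other physical servers centrally; in that case you must add -H ip_address. The following packages are also required: net-snmp and srvadmin-all on the monitored server, and perl-Net-SNMP on the monitoring server.
net-snmp
perl-Net-SNMP
srvadmin-all
As for installation order, net-snmp must be installed before srvadmin-all. That way srvadmin-all sets up the SNMP configuration for you automatically during its own installation.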
Installation example:
Monitored server kvm-phy05-jz:
[root@kvm-phy05-jz ~]# yum install -y net-snmp net-snmp-devel net-snmp-utils
[root@kvm-phy05-jz ~]# wget -q -O - http://linux.dell.com/repo/hardware/latest/bootstrap.cgi | bash
[root@kvm-phy05-jz ~]# yum -y install OpenIPMI srvadmin-all
[root@kvm-phy05-jz ~]# yum remove -y srvadmin-tomcat srvadmin-jre srvadmin-smweb
[root@kvm-phy05-jz ~]# rm -rf /opt/dell/srvadmin/lib64/openmanage/apache-tomcat
[root@kvm-phy05-jz ~]# /etc/init.d/snmpd restart
[root@kvm-phy05-jz ~]# chkconfig snmpd on
[root@kvm-phy05-jz ~]# /opt/dell/srvadmin/sbin/srvadmin-services.sh restart
[root@kvm-phy05-jz ~]# /opt/dell/srvadmin/sbin/srvadmin-services.sh enable
Monitoring server kvm-phy04-jz:
[root@kvm-phy04-jz check_openmanage-3.7.11]# yum install -y perl-Net-SNMP
[root@kvm-phy04-jz check_openmanage-3.7.11]# ./check_openmanage -H 192.168.0.210
Controller 0 [PERC H310 Mini]: Firmware '20.12.0-0004' is out of date
Physical Disk 0:1:0 [Unknown vendor INTEL SSDSC2BA200G3, 199GB] on ctrl 0 is Online, Not Certified
Physical Disk 0:1:1 [Unknown vendor INTEL SSDSC2BA200G3, 199GB] on ctrl 0 is Online, Not Certified
Physical Disk 0:1:2 [Unknown vendor INTEL SSDSC2BA200G3, 199GB] on ctrl 0 is Online, Not Certified
Physical Disk 0:1:3 [Unknown vendor INTEL SSDSC2BA200G3, 199GB] on ctrl 0 is Online, Not Certified
Physical Disk 0:1:4 [Unknown vendor INTEL SSDSC2BA200G3, 199GB] on ctrl 0 is Online, Not Certified
Physical Disk 0:1:5 [Unknown vendor INTEL SSDSC2BA200G3, 199GB] on ctrl 0 is Online, Not Certified
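Once remote checks work for one host, the same command can be looped over a list of servers. The sketch below assumes that check_openmanage, being a Nagios plugin, follows the usual Nagios exit code convention (0 OK, 1 WARNING, 2 CRITICAL, anything else UNKNOWN). The names state_name, check_hosts and CHECK_OPENMANAGE are made up for illustration.

```shell
# Path to the plugin (assumed default; override to match your installation).
CHECK_OPENMANAGE="${CHECK_OPENMANAGE:-./check_openmanage}"

# Map a Nagios-style plugin exit code to its state name.
state_name() {
    case "$1" in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        *) echo UNKNOWN ;;
    esac
}

# Hypothetical helper: check every host given as an argument over SNMP and
# print one "host STATE" line per host.
check_hosts() {
    for host in "$@"; do
        "$CHECK_OPENMANAGE" -H "$host" >/dev/null 2>&1
        rc=$?
        printf '%s %s\n' "$host" "$(state_name "$rc")"
    done
}
```

For example, `check_hosts 192.168.0.210` prints a single summary line for the host used above. The detailed messages are discarded here on purpose, so rerun the plugin against a host directly when it reports anything other than OK.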
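Summary:
If your operations environment uses a nagios+cacti monitoring stack, check_openmanage makes it very easy to monitor and alert on the hardware of production servers. Because our company's monitoring is built on Zabbix, I will not go further into the details of setting this up in Nagios. Interested readers can refer to the following blog posts:
http://dreamway.blog.51cto.com/1281816/1048274
http://www.2cto.com/os/201505/397023.html
http://www.2cto.com/os/201405/301212.html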
Common errors:
Error 1:
ERROR: You need perl module Net::SNMP to run check_openmanage in SNMP mode
Cause:
In SNMP monitoring mode, check_openmanage needs perl-Net-SNMP.
Solution:
Install the perl-Net-SNMP package:
# yum install -y perl-Net-SNMP
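Error 2:
ERROR: (SNMP) OpenManage is not installed or is not working correctly
Cause:
SNMP has not been configured. If SNMP is installed first, installing OMSA configures SNMP for you automatically.
The configuration is as follows (shown as a screenshot in the original post):
Solution:
1. Install net-snmp first, then install OMSA (that is, srvadmin-all)
or
2. Configure SNMP manually to match the configuration shown above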
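Since the screenshot of the configuration is not available here, the following is only a hedged sketch of one way to check whether OMSA's SNMP hook is present. It assumes that the OMSA installer registers itself in snmpd.conf with a line of the form "smuxpeer .1.3.6.1.4.1.674.10892.1" and that the file lives at /etc/snmp/snmpd.conf; verify both against your own system. The helper name has_dell_smuxpeer is made up.

```shell
# Hypothetical helper: succeed if the given snmpd.conf (default path is an
# assumption) contains an uncommented smuxpeer line for the Dell OID subtree.
has_dell_smuxpeer() {
    grep -Eq '^[[:space:]]*smuxpeer[[:space:]]+\.1\.3\.6\.1\.4\.1\.674\.10892\.1' \
        "${1:-/etc/snmp/snmpd.conf}"
}
```

On the monitored server this could be used as `has_dell_smuxpeer || echo "smuxpeer line missing"`. From the monitoring server, walking the same subtree with `snmpwalk -v 2c -c public 192.168.0.210 .1.3.6.1.4.1.674.10892.1` (snmpwalk comes from net-snmp-utils) is another way to see whether OMSA data is reachable over SNMP.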
Error 3:
SNMP CRITICAL: No response from remote host 'X.X.X.X'
Cause:
The SNMP service is not installed on the monitored host.
Solution:
Install the SNMP service:
# yum install -y net-snmp
OK, that is all for this article. I hope it helps fellow 51CTO bloggers!
This article comes from the "Not Only Linux" blog. Please keep this attribution: http://nolinux.blog.51cto.com/4824967/1665075
Original URL: http://nolinux.blog.51cto.com/4824967/1665075