Contents
1. Introduction and environment
2. Deploying the HA environment
3. Introduction to the crmsh interface
4. Resource constraints
5. Case study
6. Summary
1. Introduction and environment
The previous post covered the theoretical background of high availability. This post walks through installing and configuring the corosync + pacemaker HA stack and demonstrates it with a practical case. corosync provides the cluster's messaging layer, carrying heartbeat and cluster-transaction messages; pacemaker works at the resource-allocation layer as the cluster resource manager; and crmsh is the command-line interface used to configure resources. Before getting into the setup, here is a quick look at the common open-source HA stacks and the environment used for this deployment.
Common open-source HA stacks:
heartbeat v1 + haresources
heartbeat v2 + crm
heartbeat v3 + cluster-glue + pacemaker
corosync + cluster-glue + pacemaker
cman + rgmanager
keepalived + script
The test environment:
[root@nod1 tomcat]# cat /etc/issue
CentOS release 6.4 (Final)
Kernel \r on an \m

[root@nod1 tomcat]# uname -r
2.6.32-358.el6.x86_64
Both nodes run the same operating system.
2. Deploying the HA environment
[root@nod1 ~]# yum -y install pacemaker corosync
# pacemaker and corosync can simply be installed with yum, provided a yum repository is configured.
# Note: both nodes must be installed.
[root@nod1 ~]# rpm -ql corosync
/etc/corosync
/etc/corosync/corosync.conf.example        # template for the main configuration file
/etc/corosync/corosync.conf.example.udpu
/etc/corosync/service.d
/etc/corosync/uidgid.d
/etc/dbus-1/system.d/corosync-signals.conf
/etc/rc.d/init.d/corosync
/etc/rc.d/init.d/corosync-notifyd
/etc/sysconfig/corosync-notifyd
/usr/bin/corosync-blackbox
/usr/libexec/lcrso
/usr/libexec/lcrso/coroparse.lcrso
/usr/libexec/lcrso/objdb.lcrso
/usr/libexec/lcrso/quorum_testquorum.lcrso
/usr/libexec/lcrso/quorum_votequorum.lcrso
/usr/libexec/lcrso/service_cfg.lcrso
/usr/libexec/lcrso/service_confdb.lcrso
/usr/libexec/lcrso/service_cpg.lcrso
/usr/libexec/lcrso/service_evs.lcrso
/usr/libexec/lcrso/service_pload.lcrso
/usr/libexec/lcrso/vsf_quorum.lcrso
/usr/libexec/lcrso/vsf_ykd.lcrso
/usr/sbin/corosync
/usr/sbin/corosync-cfgtool
/usr/sbin/corosync-cpgtool
/usr/sbin/corosync-fplay
/usr/sbin/corosync-keygen      # generates the authkey for corosync; it reads the kernel entropy pool, and if there is not enough entropy the command appears to hang until enough randomness (e.g. from keyboard input) is collected to produce the authkey file
/usr/sbin/corosync-notifyd
/usr/sbin/corosync-objctl
/usr/sbin/corosync-pload
/usr/sbin/corosync-quorumtool
/usr/share/doc/corosync-1.4.7
/usr/share/doc/corosync-1.4.7/LICENSE
/usr/share/doc/corosync-1.4.7/SECURITY
/usr/share/man/man5/corosync.conf.5.gz
/usr/share/man/man8/confdb_keys.8.gz
/usr/share/man/man8/corosync-blackbox.8.gz
/usr/share/man/man8/corosync-cfgtool.8.gz
/usr/share/man/man8/corosync-cpgtool.8.gz
/usr/share/man/man8/corosync-fplay.8.gz
/usr/share/man/man8/corosync-keygen.8.gz
/usr/share/man/man8/corosync-notifyd.8.gz
/usr/share/man/man8/corosync-objctl.8.gz
/usr/share/man/man8/corosync-pload.8.gz
/usr/share/man/man8/corosync-quorumtool.8.gz
/usr/share/man/man8/corosync.8.gz
/usr/share/man/man8/corosync_overview.8.gz
/usr/share/snmp/mibs/COROSYNC-MIB.txt
/var/lib/corosync
/var/log/cluster
Generate the authentication key used between cluster nodes:
[root@nod1 ~]# corosync-keygen      # generate the authentication key
Corosync Cluster Engine Authentication key generator.
Gathering 1024 bits for key from /dev/random.
Press keys on your keyboard to generate entropy.
Press keys on your keyboard to generate entropy (bits = 80).
# When the entropy pool runs low the command sits here; you can open another terminal and carry on with the rest of the configuration in the meantime.
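If the key generation hangs for a long time, a commonly used workaround is to generate extra entropy from a second terminal. A rough sketch only; the haveged package is an assumption here (it normally comes from the EPEL repository, not the base repos):

[root@nod1 ~]# find / -type f -exec md5sum {} \; > /dev/null 2>&1    # create disk I/O to feed the entropy pool
[root@nod1 ~]# yum -y install haveged && service haveged start       # or, with EPEL configured, run a userspace entropy daemon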
Create the corosync configuration file from the shipped template:
[root@nod1 ~]# cd /etc/corosync
[root@nod1 corosync]# cp corosync.conf.example corosync.conf
[root@nod1 corosync]# ls
corosync.conf  corosync.conf.example  corosync.conf.example.udpu  service.d  uidgid.d
[root@nod1 corosync]# vim corosync.conf
# Please read the corosync.conf.5 manual page
compatibility: whitetank        # stay compatible with the whitetank generation, i.e. the pre-0.8 (openais) releases

totem {                         # how the corosync instances in the cluster communicate with each other
        version: 2

        # secauth: Enable mutual node authentication. If you choose to
        # enable this ("on"), then do remember to create a shared
        # secret with "corosync-keygen".
        #secauth: off
        secauth: on             # authenticate the nodes against each other using the authkey
        threads: 0              # number of worker threads; 0 disables threading, the default is fine

        # interface: define at least one interface to communicate
        # over. If you define more than one interface stanza, you must
        # also set rrp_mode.
        interface {             # which interface carries heartbeat and cluster-transaction traffic
                # Rings must be consecutively numbered, starting at 0.
                ringnumber: 0           # identifies this ring; with a single interface keep the default 0
                # This is normally the *network* address of the
                # interface to bind to. This ensures that you can use
                # identical instances of this configuration file
                # across all your cluster nodes, without having to
                # modify this option.
                bindnetaddr: 192.168.0.0        # the network address to bind to
                # However, if you have multiple physical network
                # interfaces configured for the same subnet, then the
                # network address alone is not sufficient to identify
                # the interface Corosync should bind to. In that case,
                # configure the *host* address of the interface
                # instead:
                # bindnetaddr: 192.168.1.1
                # When selecting a multicast address, consider RFC
                # 2365 (which, among other things, specifies that
                # 239.255.x.x addresses are left to the discretion of
                # the network administrator). Do not reuse multicast
                # addresses across multiple Corosync clusters sharing
                # the same network.
                mcastaddr: 239.255.21.111       # multicast address to listen on; do not keep the stock default
                # Corosync uses the port you specify here for UDP
                # messaging, and also the immediately preceding
                # port. Thus if you set this to 5405, Corosync sends
                # messages over UDP ports 5405 and 5404.
                mcastport: 5405                 # UDP port corosync uses; the default is fine
                # Time-to-live for cluster communication packets. The
                # number of hops (routers) that this ring will allow
                # itself to pass. Note that multicast routing must be
                # specifically enabled on most network routers.
                ttl: 1                          # packet TTL; keep the default
        }
}

logging {
        # Log the source file and line where messages are being
        # generated. When in doubt, leave off. Potentially useful for
        # debugging.
        fileline: off
        # Log to standard error. When in doubt, set to no. Useful when
        # running in the foreground (when invoking "corosync -f")
        to_stderr: no
        # Log to a log file. When set to "no", the "logfile" option
        # must not be set.
        to_logfile: yes
        logfile: /var/log/cluster/corosync.log
        # Log to the system log daemon. When in doubt, set to yes.
        to_syslog: no           # do not also send the logs to syslog
        # Log debug messages (very verbose). When in doubt, leave off.
        debug: off
        # Log messages with time stamps. When in doubt, set to on
        # (unless you are only logging to syslog, where double
        # timestamps can be annoying).
        timestamp: on           # prefix log entries with a timestamp; costs some extra CPU
        logger_subsys {
                subsys: AMF
                debug: off
        }
}

# Add the following sections:
service {
        ver: 0
        name: pacemaker         # start pacemaker as a corosync plugin
}

aisexec {                       # the user and group the openais/corosync processes run as; root is the default, so this block is optional
        user: root
        group: root
}
Once corosync-keygen finishes, the authkey authentication file appears in /etc/corosync/:
[root@nod1 corosync]# ls
authkey  corosync.conf  corosync.conf.example  corosync.conf.example.udpu  service.d  uidgid.d
[root@nod1 corosync]# scp authkey corosync.conf nod2.test.com:/etc/corosync/
# copy the key and the configuration file to the other node
[root@nod1 corosync]# service corosync start
# start the service -- and do not forget to start corosync on the other node as well
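Two optional sanity checks can be run with tools that ship with corosync (both appear in the rpm -ql listing above); note that the exact key names in the objctl output may vary between corosync versions:

[root@nod1 corosync]# corosync-cfgtool -s                # ring status of the local node
[root@nod1 corosync]# corosync-objctl | grep -i member   # runtime membership information (corosync 1.x)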
Verify that corosync started correctly; in a cluster these checks should be run on every node.
First, confirm that the Corosync Cluster Engine started and loaded its configuration:
[root@nod1 corosync]# grep -e "Corosync Cluster Engine" /var/log/cluster/corosync.log #查看corosync集群引擎是否启动 Jul 19 21:45:48 corosync [MAIN ] Corosync Cluster Engine (‘1.4.7‘): started and ready to provide service. [root@nod1 corosync]# grep -e "configuration file" /var/log/cluster/corosync.log #查看corosync的配置文件是否成功加载 Jul 19 21:45:48 corosync [MAIN ] Successfully read main configuration file ‘/etc/corosync/corosync.conf‘.
Check that the TOTEM interface we defined was brought up:
[root@nod1 corosync]# grep "TOTEM" /var/log/cluster/corosync.log Jul 19 21:45:48 corosync [TOTEM ] Initializing transport (UDP/IP Multicast). Jul 19 21:45:48 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0). Jul 19 21:45:48 corosync [TOTEM ] The network interface [192.168.0.201] is now up. Jul 19 21:45:48 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Check for errors during startup:
[root@nod1 corosync]# grep "ERROR" /var/log/cluster/corosync.log Jul 19 21:45:48 corosync [pcmk ] ERROR: process_ais_conf: You have configured a cluster using the Pacemaker plugin for Corosync. The plugin is not supported in this environment and will be removed very soon. Jul 19 21:45:48 corosync [pcmk ] ERROR: process_ais_conf: Please see Chapter 8 of ‘Clusters from Scratch‘ (http://www.clusterlabs.org/doc) for details on using Pacemaker with CMAN #上边的错误信息可以忽略,这里报错的信息主要意思是说pacemaker是以插件的方式配置的,在以后的版本中将不再支持
Verify that pacemaker started correctly:
[root@nod1 corosync]# grep "pcmk_startup" /var/log/cluster/corosync.log Jul 19 21:45:48 corosync [pcmk ] info: pcmk_startup: CRM: Initialized Jul 19 21:45:48 corosync [pcmk ] Logging: Initialized pcmk_startup Jul 19 21:45:48 corosync [pcmk ] info: pcmk_startup: Maximum core file size is: 18446744073709551615 Jul 19 21:45:48 corosync [pcmk ] info: pcmk_startup: Service: 9 Jul 19 21:45:48 corosync [pcmk ] info: pcmk_startup: Local hostname: nod1.test.com
3. Introduction to the crmsh interface
pacemaker has two configuration interfaces, crmsh and pcs; this article uses crmsh.
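For comparison, the equivalent operations in pcs look roughly like this. This is only a sketch and assumes the pcs package is available in your repositories; the rest of this article sticks with crmsh:

[root@nod1 ~]# yum -y install pcs
[root@nod1 ~]# pcs status                                # roughly "crm status"
[root@nod1 ~]# pcs property set stonith-enabled=false    # roughly "crm configure property stonith-enabled=false"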
crmsh depends on the pssh package, so both packages must be installed on every cluster node. They can be downloaded via http://crmsh.github.io/.
[root@nod1 ~]# ls
crmsh-2.1-1.6.x86_64.rpm  pssh-2.3.1-2.el6.x86_64.rpm
[root@nod1 ~]# yum install crmsh-2.1-1.6.x86_64.rpm pssh-2.3.1-2.el6.x86_64.rpm
The crm command provided by crmsh has two modes: a command (one-shot) mode, where you run a single command and crmsh prints the result to standard output, and an interactive shell mode. Plenty of examples of both follow.
Using the crm command:
[root@nod1 ~]# crm          # running crm with no arguments drops you into the interactive shell
crm(live)#
crm(live)# help             # list the subcommands crm supports
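The same subcommands can also be run in one-shot (command) mode straight from the shell, which is handy in scripts; for example:

[root@nod1 ~]# crm status               # same output as "status" inside the interactive shell
[root@nod1 ~]# crm configure show       # dump the current configuration
[root@nod1 ~]# crm ra classes           # list the resource agent classes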
Commonly used crmsh subcommands:
status: show the cluster's status
configure: configure the cluster (resources, constraints, properties)
node: manage node state
ra: browse resource agents and their metadata
resource: manage resources, e.g. stop a resource or clean up its recorded state (such as failure messages)
Start by looking at the cluster status:
[root@nod1 ~]# crm
crm(live)# status
Last updated: Tue Jul 21 21:21:35 2015
Last change: Sun Jul 19 23:01:34 2015
Stack: classic openais (with plugin)    # pacemaker is driven by corosync (openais) through the plugin mechanism
Current DC: nod1.test.com - partition with quorum
# DC stands for Designated Coordinator: nod1.test.com currently coordinates cluster transactions,
# and "partition with quorum" means the current partition holds a majority of the votes
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes    # two nodes configured, two votes expected
0 Resources configured                  # no cluster resources defined yet
Online: [ nod1.test.com nod2.test.com ] # both nodes are online
Look at the cluster's default configuration:
[root@nod1 ~]# crm
crm(live)# configure
crm(live)configure# show        # "show" prints the current cluster configuration; "show xml" prints it in XML format
node nod1.test.com
node nod2.test.com
property cib-bootstrap-options: \
        dc-version=1.1.11-97629de \
        cluster-infrastructure="classic openais (with plugin)" \
        expected-quorum-votes=2 \
        stonith-enabled=true \
        no-quorum-policy=stop \
        last-lrm-refresh=1436887216
crm(live)configure# verify      # "verify" checks the configuration for errors
   error: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
   error: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
   error: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity
Errors found during check: config not valid
# These errors say that no STONITH device has been defined, which a corosync+pacemaker cluster requires by default;
# the check can be disabled, as shown below.
Use the property subcommand to set cluster-wide (global) properties:
[root@nod1 ~]# crm
crm(live)# configure
crm(live)configure# property    # crmsh supports tab completion: type "property" and hit Tab twice to list the available parameters
batch-limit=                 maintenance-mode=            remove-after-stop=
cluster-delay=               migration-limit=             shutdown-escalation=
cluster-recheck-interval=    no-quorum-policy=            start-failure-is-fatal=
crmd-transition-delay=       node-action-limit=           startup-fencing=
dc-deadtime=                 node-health-green=           stonith-action=
default-action-timeout=      node-health-red=             stonith-enabled=
default-resource-stickiness= node-health-strategy=        stonith-timeout=
election-timeout=            node-health-yellow=          stop-all-resources=
enable-acl=                  pe-error-series-max=         stop-orphan-actions=
enable-startup-probes=       pe-input-series-max=         stop-orphan-resources=
is-managed-default=          pe-warn-series-max=          symmetric-cluster=
load-threshold=              placement-strategy=
crm(live)configure# property stonith-enabled=false
# turn off STONITH support; otherwise a STONITH device would have to be defined before the cluster will run resources
crm(live)configure# show
node nod1.test.com
node nod2.test.com
property cib-bootstrap-options: \
        dc-version=1.1.11-97629de \
        cluster-infrastructure="classic openais (with plugin)" \
        expected-quorum-votes=2 \
        stonith-enabled=false \         # now false
        no-quorum-policy=stop \
        last-lrm-refresh=1436887216
crm(live)configure# verify      # verification no longer reports errors
crm(live)configure# commit      # commit the configuration
Configuring cluster resources
Details about a resource agent are found through ra (resource agent). For example, to define a virtual IP resource:
[root@nod1 ~]# crm
crm(live)# ra
crm(live)ra# classes        # list the classes of resource agents the cluster knows about
lsb
ocf / heartbeat pacemaker
service
stonith
crm(live)ra# list ocf       # list the resource agents in the ocf class; IPaddr, which manages IP addresses, is among them
CTDB              ClusterMon        Delay             Dummy             Filesystem
HealthCPU         HealthSMART       IPaddr            IPaddr2           IPsrcaddr
LVM               MailTo            Route             SendArp           Squid
Stateful          SysInfo           SystemHealth      VirtualDomain     Xinetd
apache            conntrackd        controld          db2               dhcpd
ethmonitor        exportfs          iSCSILogicalUnit  mysql             named
nfsnotify         nfsserver         pgsql             ping              pingd
postfix           remote            rsyncd            symlink           tomcat
crm(live)ra# meta ocf:IPaddr    # "meta" prints a resource agent's full metadata, i.e. its usage help
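The same information is also available non-interactively, for example:

[root@nod1 ~]# crm ra list lsb                      # the LSB agents are simply the init scripts under /etc/rc.d/init.d
[root@nod1 ~]# crm ra meta ocf:heartbeat:IPaddr     # full metadata/help for the IPaddr agent
[root@nod1 ~]# crm ra info IPaddr                   # "info" prints the same agent metadata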
Primary resources are defined with the primitive command:
[root@nod1 ~]# crm
crm(live)# configure
crm(live)configure# primitive webip ocf:IPaddr params ip=192.168.0.100
crm(live)configure# verify
crm(live)configure# commit      # as soon as the commit succeeds the resource goes live
crm(live)configure# cd ..
crm(live)# status
Last updated: Tue Jul 21 22:14:43 2015
Last change: Tue Jul 21 22:12:44 2015
Stack: classic openais (with plugin)
Current DC: nod1.test.com - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
1 Resources configured
Online: [ nod1.test.com nod2.test.com ]
 webip  (ocf::heartbeat:IPaddr):        Started nod1.test.com    # the resource we just defined, running on nod1.test.com
[root@nod1 ~]# ip add show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 1000
    link/ether 00:0c:29:07:89:fe brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.201/24 brd 192.168.0.255 scope global eth0
    inet 192.168.0.100/24 brd 192.168.0.255 scope global secondary eth0    # the VIP we defined is now active
    inet6 fe80::20c:29ff:fe07:89fe/64 scope link
       valid_lft forever preferred_lft forever
Define the nginx service resource:
[root@nod1 ~]# crm
crm(live)# configure
crm(live)configure# primitive nginx lsb:nginx
# the nginx service is handled by a resource agent in the lsb class; the first "nginx" after primitive is the name of the cluster resource
crm(live)configure# verify
crm(live)configure# commit
crm(live)configure# cd ..
crm(live)# status
Last updated: Tue Jul 21 22:25:00 2015
Last change: Tue Jul 21 22:24:58 2015
Stack: classic openais (with plugin)
Current DC: nod1.test.com - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ nod1.test.com nod2.test.com ]
 webip  (ocf::heartbeat:IPaddr):        Started nod1.test.com
 nginx  (lsb:nginx):    Started nod2.test.com
# The nginx resource started on nod2.test.com. This illustrates that, by default, the cluster tries to spread
# resources evenly across the nodes -- whereas in practice we want webip and nginx running on the same node.
To keep several resources on the same node, either put them in a group or define a colocation constraint:
[root@nod1 ~]# crm
crm(live)# configure
crm(live)configure# group webservice webip nginx
crm(live)configure# verify
crm(live)configure# commit
crm(live)configure# cd ..
crm(live)# status
Last updated: Tue Jul 21 22:30:19 2015
Last change: Tue Jul 21 22:30:17 2015
Stack: classic openais (with plugin)
Current DC: nod1.test.com - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ nod1.test.com nod2.test.com ]
 Resource Group: webservice
     webip      (ocf::heartbeat:IPaddr):        Started nod1.test.com
     nginx      (lsb:nginx):    Started nod1.test.com
# both resources are now running on nod1.test.com
Next, verify that the resources can fail over to the other node:
[root@nod1 ~]# crm node standby     # put the current node into standby
[root@nod1 ~]# crm status
Last updated: Tue Jul 21 22:37:14 2015
Last change: Tue Jul 21 22:37:09 2015
Stack: classic openais (with plugin)
Current DC: nod1.test.com - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
2 Resources configured
Node nod1.test.com: standby
Online: [ nod2.test.com ]
 Resource Group: webservice
     webip      (ocf::heartbeat:IPaddr):        Started nod2.test.com
     nginx      (lsb:nginx):    Started nod2.test.com
# the resources in the webservice group have moved to nod2.test.com
Bring nod1.test.com back online and see whether the resources move back:
[root@nod1 ~]# crm node online      # bring the current node back online
You have new mail in /var/spool/mail/root
[root@nod1 ~]# crm status
Last updated: Tue Jul 21 22:38:37 2015
Last change: Tue Jul 21 22:38:33 2015
Stack: classic openais (with plugin)
Current DC: nod1.test.com - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ nod1.test.com nod2.test.com ]
 Resource Group: webservice
     webip      (ocf::heartbeat:IPaddr):        Started nod2.test.com
     nginx      (lsb:nginx):    Started nod2.test.com
# the webservice group did not move back to nod1.test.com, because no location preference has been defined for the group
If the corosync service on nod2.test.com is now stopped, will the resources in the webservice group move over to nod1.test.com? Let's test:
[root@nod2 ~]# service corosync stop
Signaling Corosync Cluster Engine (corosync) to terminate:  [  OK  ]
Waiting for corosync services to unload:.                   [  OK  ]
You have new mail in /var/spool/mail/root

# Check the cluster state from nod1.test.com:
[root@nod1 ~]# crm status
Last updated: Tue Jul 21 22:43:27 2015
Last change: Tue Jul 21 22:38:33 2015
Stack: classic openais (with plugin)
Current DC: nod1.test.com - partition WITHOUT quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ nod1.test.com ]
OFFLINE: [ nod2.test.com ]
The output shows that the resources did not move over. Why not? Look at "Current DC: nod1.test.com - partition WITHOUT quorum": the surviving partition does not hold a majority of the votes, so this node will not run resources and nothing fails over to it. There is more than one way to deal with this: add a ping node, add a quorum disk, use an odd number of cluster nodes, or simply tell the cluster to ignore the loss of quorum. The last option is the easiest and is what this article uses (a rough sketch of the ping-node alternative follows right after this paragraph):
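For completeness, here is a sketch of option one, the ping node. It is not used in the rest of this article, and the gateway address 192.168.0.1 and the parameter values are assumptions you would adapt to your own network:

crm(live)configure# primitive pingnode ocf:pacemaker:ping params host_list=192.168.0.1 multiplier=100 op monitor interval=10s
crm(live)configure# clone cl_pingnode pingnode      # run the ping check on every node
crm(live)configure# location webservice_on_connected_node webservice rule -inf: not_defined pingd or pingd lte 0
crm(live)configure# commit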
[root@nod2 ~]# service corosync start     # first start corosync on nod2.test.com again
Starting Corosync Cluster Engine (corosync):                [  OK  ]

[root@nod1 ~]# crm
crm(live)# status
Last updated: Tue Jul 21 22:50:08 2015
Last change: Tue Jul 21 22:38:33 2015
Stack: classic openais (with plugin)
Current DC: nod1.test.com - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ nod1.test.com nod2.test.com ]
 Resource Group: webservice
     webip      (ocf::heartbeat:IPaddr):        Started nod1.test.com
     nginx      (lsb:nginx):    Started nod1.test.com
crm(live)# configure
crm(live)configure# property no        # hit Tab twice to list the parameters starting with "no"
no-quorum-policy=      node-health-green=     node-health-strategy=
node-action-limit=     node-health-red=       node-health-yellow=
crm(live)configure# property no-quorum-policy=    # type "no-quorum-policy=" and hit Tab twice to see its help text
no-quorum-policy (enum, [stop]): What to do when the cluster does not have quorum
    What to do when the cluster does not have quorum  Allowed values: stop, freeze, ignore, suicide
crm(live)configure# property no-quorum-policy=ignore      # set it to "ignore"
crm(live)configure# verify
crm(live)configure# commit
crm(live)configure# show               # show the current configuration
node nod1.test.com \
        attributes standby=off
node nod2.test.com
primitive nginx lsb:nginx
primitive webip IPaddr \
        params ip=192.168.0.100
group webservice webip nginx
property cib-bootstrap-options: \
        dc-version=1.1.11-97629de \
        cluster-infrastructure="classic openais (with plugin)" \
        expected-quorum-votes=2 \
        stonith-enabled=false \
        no-quorum-policy=ignore \
        last-lrm-refresh=1436887216
crm(live)configure# cd ..
crm(live)# status
Last updated: Tue Jul 21 22:54:00 2015
Last change: Tue Jul 21 22:51:10 2015
Stack: classic openais (with plugin)
Current DC: nod1.test.com - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ nod1.test.com nod2.test.com ]
 Resource Group: webservice
     webip      (ocf::heartbeat:IPaddr):        Started nod1.test.com
     nginx      (lsb:nginx):    Started nod1.test.com
# the resources are currently running on nod1.test.com
Stop corosync on nod1.test.com and see whether the resources move to nod2.test.com:
[root@nod1 ~]# service corosync stop
Signaling Corosync Cluster Engine (corosync) to terminate:  [  OK  ]
Waiting for corosync services to unload:.                   [  OK  ]

[root@nod2 ~]# crm          # open the crm management interface on nod2.test.com
crm(live)# status
Last updated: Tue Jul 21 22:56:52 2015
Last change: Tue Jul 21 22:52:25 2015
Stack: classic openais (with plugin)
Current DC: nod2.test.com - partition WITHOUT quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ nod2.test.com ]
OFFLINE: [ nod1.test.com ]
 Resource Group: webservice
     webip      (ocf::heartbeat:IPaddr):        Started nod2.test.com
     nginx      (lsb:nginx):    Started nod2.test.com
# The resources moved to nod2.test.com. So in a two-node HA cluster, set "no-quorum-policy=ignore"
# so that losing more than half of the votes does not stop the surviving node from running resources.
What if we instead kill the nginx process on nod2.test.com -- will the cluster move the resources to nod1.test.com? Let's test:
[root@nod1 ~]# service corosync start     # first start corosync on nod1.test.com again
Starting Corosync Cluster Engine (corosync):                [  OK  ]
[root@nod1 ~]# crm status
Last updated: Wed Jul 22 22:22:56 2015
Last change: Wed Jul 22 22:19:55 2015
Stack: classic openais (with plugin)
Current DC: nod2.test.com - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ nod1.test.com nod2.test.com ]
 Resource Group: webservice
     webip      (ocf::heartbeat:IPaddr):        Started nod2.test.com
     nginx      (lsb:nginx):    Started nod2.test.com

# Now switch to nod2.test.com and kill the nginx processes:
[root@nod2 ~]# pgrep nginx
1798
1799
[root@nod2 ~]# killall nginx      # kill the nginx processes
[root@nod2 ~]# pgrep nginx        # no output means nginx is no longer running
[root@nod2 ~]# crm status
Last updated: Wed Jul 22 22:26:09 2015
Last change: Wed Jul 22 22:19:55 2015
Stack: classic openais (with plugin)
Current DC: nod2.test.com - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ nod1.test.com nod2.test.com ]
 Resource Group: webservice
     webip      (ocf::heartbeat:IPaddr):        Started nod2.test.com
     nginx      (lsb:nginx):    Started nod2.test.com
The status above shows that the resources stayed on nod2.test.com even though nginx is dead, which is unacceptable in production. The cluster therefore needs to monitor the resources we define: when a monitored resource disappears, the cluster first tries to restart it locally and, if that fails, moves it to another node. The next part shows how to define resource monitoring.
[root@nod2 ~]# service nginx start      # first bring back the nginx we killed above
Starting nginx:                                            [  OK  ]
A monitor operation is defined together with the resource in the primitive command. First delete the resources defined earlier so they can be recreated with monitoring:
[root@nod1 ~]# crm
crm(live)# resource
crm(live)resource# show
 Resource Group: webservice
     webip      (ocf::heartbeat:IPaddr):        Started
     nginx      (lsb:nginx):    Started
# The resource level shows the state of the configured resources; both are currently started.
crm(live)resource# stop webservice      # stop every resource in the webservice group; resources must be Stopped before they can be deleted
crm(live)resource# show
 Resource Group: webservice
     webip      (ocf::heartbeat:IPaddr):        Stopped
     nginx      (lsb:nginx):    Stopped
crm(live)resource# cd ..
crm(live)# configure
crm(live)configure# edit        # "edit" opens the resource configuration in vi, as shown below
node nod1.test.com \
        attributes standby=on
node nod2.test.com
primitive nginx lsb:nginx               # our resource -- delete this line
primitive webip IPaddr \                # our resource -- delete this line and its params line
        params ip=192.168.0.100
group webservice webip nginx \          # our resource -- delete this line and its meta line
        meta target-role=Stopped
property cib-bootstrap-options: \
        dc-version=1.1.11-97629de \
        cluster-infrastructure="classic openais (with plugin)" \
        expected-quorum-votes=2 \
        stonith-enabled=false \
        no-quorum-policy=ignore \
        last-lrm-refresh=1436887216
#vim:set syntax=pcmk
Delete our resource definitions in the editor, then save and quit; what remains is:
node nod1.test.com \
        attributes standby=on
node nod2.test.com
property cib-bootstrap-options: \
        dc-version=1.1.11-97629de \
        cluster-infrastructure="classic openais (with plugin)" \
        expected-quorum-votes=2 \
        stonith-enabled=false \
        no-quorum-policy=ignore \
        last-lrm-refresh=1436887216
#vim:set syntax=pcmk
crm(live)configure# verify      # check the syntax
crm(live)configure# commit      # commit the configuration
crm(live)configure# cd          # back to the top level
crm(live)# status               # check the cluster state
Last updated: Wed Jul 22 21:33:07 2015
Last change: Wed Jul 22 21:31:45 2015
Stack: classic openais (with plugin)
Current DC: nod2.test.com - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
0 Resources configured
Online: [ nod1.test.com nod2.test.com ]
The status output confirms that the resources are gone. Now define them again, this time with monitoring:
crm(live)configure# primitive webip ocf:IPaddr params ip=192.168.0.100 op monitor timeout=20s interval=60s
crm(live)configure# primitive webserver lsb:nginx op monitor timeout=20s interval=60s
crm(live)configure# group webservice webip webserver
crm(live)configure# verify
crm(live)configure# commit
crm(live)configure# cd
crm(live)# status
Last updated: Wed Jul 22 22:29:59 2015
Last change: Wed Jul 22 22:28:01 2015
Stack: classic openais (with plugin)
Current DC: nod2.test.com - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ nod1.test.com nod2.test.com ]
 Resource Group: webservice
     webip      (ocf::heartbeat:IPaddr):        Started nod1.test.com
     webserver  (lsb:nginx):    Started nod1.test.com
That is all it takes to define monitored resources; the meaning of the monitor parameters used above can be looked up with a command such as "crm(live)ra# meta ocf:IPaddr". Now kill nginx on nod1.test.com and watch what happens:
[root@nod1 ~]# pgrep nginx
3056
3063
[root@nod1 ~]# killall nginx
[root@nod1 ~]# pgrep nginx
[root@nod1 ~]# pgrep nginx
[root@nod1 ~]# pgrep nginx      # after a few tens of seconds nginx has been restarted
3337
3338
The cluster status now looks like this:
[root@nod1 ~]# crm status
Last updated: Wed Jul 22 22:33:29 2015
Last change: Wed Jul 22 22:28:01 2015
Stack: classic openais (with plugin)
Current DC: nod2.test.com - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ nod1.test.com nod2.test.com ]
 Resource Group: webservice
     webip      (ocf::heartbeat:IPaddr):        Started nod1.test.com
     webserver  (lsb:nginx):    Started nod1.test.com
Failed actions:
    webserver_monitor_60000 on nod1.test.com 'not running' (7): call=23, status=complete,
    last-rc-change='Wed Jul 22 22:32:02 2015', queued=0ms, exec=0ms
# the cluster recorded that the webserver resource was found not running
What happens if nginx is killed and cannot be started again? To simulate this, kill nginx and then immediately break its configuration file by appending a stray line, so that it fails the syntax check and cannot start:
[root@nod1 ~]# killall nginx
[root@nod1 ~]# echo "test" >> /etc/nginx/nginx.conf
[root@nod1 ~]# nginx -t
nginx: [emerg] unexpected end of file, expecting ";" or "}" in /etc/nginx/nginx.conf:44
nginx: configuration file /etc/nginx/nginx.conf test failed
[root@nod1 ~]# crm status
Last updated: Wed Jul 22 22:37:42 2015
Last change: Wed Jul 22 22:28:01 2015
Stack: classic openais (with plugin)
Current DC: nod2.test.com - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ nod1.test.com nod2.test.com ]
 Resource Group: webservice
     webip      (ocf::heartbeat:IPaddr):        Started nod2.test.com    # the resources have been moved to nod2.test.com
     webserver  (lsb:nginx):    Started nod2.test.com
Failed actions:
    webserver_start_0 on nod1.test.com 'unknown error' (1): call=30, status=complete,
    last-rc-change='Wed Jul 22 22:37:02 2015', queued=0ms, exec=70ms
# the failed start attempt on nod1.test.com is recorded as an "unknown error"
These two tests show that the cluster really does monitor its resources: when a resource fails, it first tries to restart it in place, and if that does not succeed it moves the resource elsewhere. Do not forget to restore the nginx configuration on nod1.test.com after the test; a quick way to do that is sketched below.
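Since the breakage was a single "test" line appended to the end of the file, one way to undo it (assuming nothing else was appended after it) is:

[root@nod1 ~]# sed -i '$d' /etc/nginx/nginx.conf     # drop the last line added by the echo above
[root@nod1 ~]# nginx -t                              # confirm the configuration is valid again
[root@nod1 ~]# crm resource cleanup webserver        # clear the failed-start record so nod1 is eligible again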
4. Resource constraints
Constraints express where we want resources to run, or that certain resources should stay together, without using a group.
Continuing from the previous experiment: we want webip and webserver to always stay together, but without the webservice group. Proceed as follows:
[root@nod1 ~]# crm
crm(live)# status
Last updated: Wed Jul 22 22:46:26 2015
Last change: Wed Jul 22 22:28:01 2015
Stack: classic openais (with plugin)
Current DC: nod2.test.com - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ nod1.test.com nod2.test.com ]
 Resource Group: webservice
     webip      (ocf::heartbeat:IPaddr):        Started nod2.test.com
     webserver  (lsb:nginx):    Started nod2.test.com
Failed actions:
    webserver_start_0 on nod1.test.com 'unknown error' (1): call=30, status=complete,
    last-rc-change='Wed Jul 22 22:37:02 2015', queued=0ms, exec=70ms
First clean up the recorded failure:
[root@nod1 ~]# crm
crm(live)# resource
crm(live)resource# cleanup webserver      # clear the resource's failure records
Cleaning up webserver on nod1.test.com
Cleaning up webserver on nod2.test.com
Waiting for 2 replies from the CRMd.. OK
crm(live)resource# cd
crm(live)# status
Last updated: Wed Jul 22 22:47:53 2015
Last change: Wed Jul 22 22:47:47 2015
Stack: classic openais (with plugin)
Current DC: nod2.test.com - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ nod1.test.com nod2.test.com ]
 Resource Group: webservice
     webip      (ocf::heartbeat:IPaddr):        Started nod2.test.com
     webserver  (lsb:nginx):    Started nod2.test.com
Next, delete the webservice group:
[root@nod1 ~]# crm
crm(live)# resource
crm(live)resource# status
 Resource Group: webservice
     webip      (ocf::heartbeat:IPaddr):        Started
     webserver  (lsb:nginx):    Started
crm(live)resource# cd ..
crm(live)# configure
crm(live)configure# delete webservice     # delete the group (its member resources are kept)
crm(live)configure# verify
crm(live)configure# commit
crm(live)configure# cd ..
crm(live)# status
Last updated: Wed Jul 22 23:00:13 2015
Last change: Wed Jul 22 23:00:09 2015
Stack: classic openais (with plugin)
Current DC: nod2.test.com - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ nod1.test.com nod2.test.com ]
 webip      (ocf::heartbeat:IPaddr):        Started nod1.test.com    # with the group gone, the cluster again spreads the resources across the nodes
 webserver  (lsb:nginx):    Started nod2.test.com                    # webserver runs on nod2.test.com
4.1 Colocation constraints
A colocation constraint states whether two resources must run together:
[root@nod1 ~]# crm
crm(live)# configure
crm(live)configure# help colocation       # read the colocation help
crm(live)configure# colocation webserver_with_webip inf: webserver webip
# a score of inf (positive infinity) means webserver must always run where webip runs
crm(live)configure# show xml              # inspect the constraint we just defined
crm(live)configure# verify
crm(live)configure# commit
crm(live)configure# cd ..
crm(live)# status
Last updated: Wed Jul 22 23:09:11 2015
Last change: Wed Jul 22 23:09:08 2015
Stack: classic openais (with plugin)
Current DC: nod2.test.com - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ nod1.test.com nod2.test.com ]
 webip      (ocf::heartbeat:IPaddr):        Started nod1.test.com    # both resources are running on nod1.test.com again
 webserver  (lsb:nginx):    Started nod1.test.com
4.2 Order constraints
An order constraint makes resources start in a given sequence; they stop in the reverse order:
[root@nod1 ~]# crm
crm(live)# configure
crm(live)configure# help order            # read the order help
crm(live)configure# order webip_before_webserver mandatory: webip webserver
# webip must start before webserver; see the help text for the details
crm(live)configure# verify
crm(live)configure# commit
crm(live)configure# show xml              # inspect the result
4.3 Location constraints
A location constraint expresses which node a resource prefers to run on.
[root@nod1 ~]# crm
crm(live)# status
Last updated: Wed Jul 22 23:20:08 2015
Last change: Wed Jul 22 23:15:39 2015
Stack: classic openais (with plugin)
Current DC: nod2.test.com - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ nod1.test.com nod2.test.com ]
 webip      (ocf::heartbeat:IPaddr):        Started nod1.test.com    # the resources are currently on nod1.test.com
 webserver  (lsb:nginx):    Started nod1.test.com
Define a location constraint that makes the resources prefer nod2.test.com:
[root@nod1 ~]# crm
crm(live)# configure
crm(live)configure# help location         # read the location help
crm(live)configure# location webip_on_nod2 webip inf: nod2.test.com
# webip's preference for nod2.test.com is positive infinity
crm(live)configure# verify
crm(live)configure# commit
crm(live)configure# cd
crm(live)# status
Last updated: Wed Jul 22 23:23:21 2015
Last change: Wed Jul 22 23:22:50 2015
Stack: classic openais (with plugin)
Current DC: nod2.test.com - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ nod1.test.com nod2.test.com ]
 webip      (ocf::heartbeat:IPaddr):        Started nod2.test.com
 webserver  (lsb:nginx):    Started nod2.test.com
Both webip and webserver moved to nod2.test.com, even though no location constraint was defined for webserver. Why? Because of the colocation constraint defined earlier: the score for keeping webserver with webip is inf (positive infinity), so wherever webip goes, webserver follows.
location can also be written in another form, using a rule:
[root@nod1 ~]# crm
crm(live)# configure
crm(live)configure# delete webip_on_nod2      # first delete the location constraint defined above
crm(live)configure# verify
crm(live)configure# commit
crm(live)configure# location webip_on_nod1 webip rule inf: #uname eq nod1.test.com
# webip's preference for the host whose uname is nod1.test.com is positive infinity
crm(live)configure# verify
crm(live)configure# commit
crm(live)configure# cd
crm(live)# status
Last updated: Wed Jul 22 23:33:38 2015
Last change: Wed Jul 22 23:33:18 2015
Stack: classic openais (with plugin)
Current DC: nod2.test.com - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ nod1.test.com nod2.test.com ]
 webip      (ocf::heartbeat:IPaddr):        Started nod1.test.com    # both resources moved back to nod1.test.com
 webserver  (lsb:nginx):    Started nod1.test.com
Now define one more location constraint:
crm(live)configure# location webserver_not_on_nod1 webserver rule -inf: #uname eq nod1.test.com
# webserver's score for running on nod1 is negative infinity, i.e. it must not run there
crm(live)configure# verify
crm(live)configure# commit
crm(live)configure# cd
crm(live)# status
Last updated: Wed Jul 22 23:41:25 2015
Last change: Wed Jul 22 23:41:19 2015
Stack: classic openais (with plugin)
Current DC: nod2.test.com - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ nod1.test.com nod2.test.com ]
 webip      (ocf::heartbeat:IPaddr):        Started nod2.test.com
 webserver  (lsb:nginx):    Started nod2.test.com
Both webip and webserver moved from nod1.test.com to nod2.test.com. Why? webserver's score for nod1 is negative infinity, while webip prefers nod1 with positive infinity, and the colocation keeps them together. So what is inf + (-inf)? The answer is -inf, so the pair will never run on nod1.test.com.
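If you want to see how the scheduler actually adds these scores up, pacemaker ships a crm_simulate tool; as far as I recall, -s prints the per-node allocation scores and -L works against the live cluster, for example:

[root@nod1 ~]# crm_simulate -sL | grep -E 'webip|webserver'    # allocation scores for the two resources on each node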
5. Case study
An HA cluster typically contains three kinds of resources: a virtual IP, a service, and shared storage. Let's add shared storage to the mix and build the whole stack again; with a new resource the constraints change as well, so first delete the IP and service resources defined above. How to delete cluster resources was covered earlier and is not repeated here.
With the resources removed the cluster is clean again:
[root@nod1 ~]# crm
crm(live)# status
Last updated: Fri Jul 24 20:58:49 2015
Last change: Fri Jul 24 20:58:32 2015
Stack: classic openais (with plugin)
Current DC: nod1.test.com - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
0 Resources configured
Online: [ nod1.test.com nod2.test.com ]
Next, prepare the shared storage. Here nod0.test.com exports an NFS share:
[root@nod0 ~]# yum -y install nfs-utils
[root@nod0 ~]# mkdir -pv /web/htdocs
[root@nod0 ~]# vim /etc/exports
/web/htdocs    192.168.0.0/24(rw,no_root_squash)
[root@nod0 ~]# echo "<h>NFS node</h>" > /web/htdocs/index.html     # the test page served from the share
[root@nod0 ~]# service rpcbind start
Starting rpcbind:                                          [  OK  ]
[root@nod0 ~]# service nfs start
Starting NFS services:                                     [  OK  ]
Starting NFS mountd:                                       [  OK  ]
Starting NFS daemon:                                       [  OK  ]
Starting RPC idmapd:                                       [  OK  ]

[root@nod2 ~]# mount -t nfs 192.168.0.200:/web/htdocs /usr/share/nginx/html/
# the first NFS mount can be slow, so mount it once by hand before handing it to the cluster
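Before relying on the export it is worth confirming that it is actually visible from the cluster nodes; a quick, optional check (nothing here is cluster-specific):

[root@nod0 ~]# exportfs -v                        # what the server thinks it is exporting
[root@nod1 ~]# showmount -e 192.168.0.200         # what the cluster nodes can see
[root@nod2 ~]# showmount -e 192.168.0.200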
Then start nginx on nod2.test.com and check that the test page is reachable via nod2's own address, 192.168.0.202:
[root@nod2 ~]# service nginx start
Starting nginx:                                            [  OK  ]
Once the test passes, stop nginx and unmount the shared storage:
[root@nod2 ~]# umount /usr/share/nginx/html/
You have new mail in /var/spool/mail/root
[root@nod2 ~]# service nginx stop
Stopping nginx:                                            [  OK  ]
Now define the cluster resources:
[root@nod1 ~]# crm
crm(live)# configure
crm(live)configure# primitive webip ocf:IPaddr params ip=192.168.0.100 op monitor timeout=10s interval=30s
crm(live)configure# primitive webserver lsb:nginx op monitor timeout=10s interval=30s
crm(live)configure# primitive webstore ocf:Filesystem params device="192.168.0.200:/web/htdocs" directory="/usr/share/nginx/html" fstype="nfs" op monitor timeout=30s interval=60s
crm(live)configure# verify
WARNING: webip: specified timeout 10s for monitor is smaller than the advised 20s
WARNING: webserver: specified timeout 10s for monitor is smaller than the advised 15
WARNING: webstore: default timeout 20s for start is smaller than the advised 60
# the NFS filesystem resource should define a start timeout; the default is 20s but 60s is advised
WARNING: webstore: default timeout 20s for stop is smaller than the advised 60
# likewise for the stop timeout
WARNING: webstore: specified timeout 30s for monitor is smaller than the advised 40
# verify complains that the monitor/start/stop timeouts are too small, so redefine the resources following the advice
crm(live)configure# cd ..
There are changes pending. Do you want to commit them (y/n)? n
# do not commit here; alternatively, "edit" could be used to fix the definitions in the XML/config directly
crm(live)# configure                      # go back into configure mode and redefine the resources
crm(live)configure# primitive webip ocf:IPaddr params ip=192.168.0.222 op monitor timeout=20s interval=30s
crm(live)configure# verify
crm(live)configure# primitive webserver lsb:nginx op monitor timeout=15s interval=30s
crm(live)configure# verify
crm(live)configure# primitive webstore ocf:Filesystem params device="192.168.0.200:/web/htdocs" directory="/usr/share/nginx/html" fstype="nfs" op monitor timeout=30s interval=60s op start timeout=60s op stop timeout=60s
crm(live)configure# verify
WARNING: webstore: specified timeout 30s for monitor is smaller than the advised 40    # one value is still too small
crm(live)configure# edit        # fix it directly in the editor; the result looks like this
node nod1.test.com \
        attributes standby=off
node nod2.test.com \
        attributes standby=off
primitive webip IPaddr \
        params ip=192.168.0.222 \
        op monitor timeout=20s interval=30s
primitive webserver lsb:nginx \
        op monitor timeout=15s interval=30s
primitive webstore Filesystem \
        params device="192.168.0.200:/web/htdocs" directory="/usr/share/nginx/html" fstype=nfs \
        op monitor timeout=40s interval=60s \
        op start timeout=60s interval=0 \
        op stop timeout=60s interval=0
property cib-bootstrap-options: \
        dc-version=1.1.11-97629de \
        cluster-infrastructure="classic openais (with plugin)" \
        expected-quorum-votes=2 \
        stonith-enabled=false \
        no-quorum-policy=ignore \
        last-lrm-refresh=1437576541
#vim:set syntax=pcmk
# remember to save and quit
crm(live)configure# verify      # verification is now clean
crm(live)configure# commit      # commit the configuration
Now think about the constraints these three resources need. First, in normal operation all three should run on one node, and they have relationships among themselves: the VIP belongs with the service (nginx), and the service belongs with the shared storage -- this can be expressed either with colocation constraints or with a group. Second, there is a start order: the VIP should come up before the service, and the shared storage must be mounted before the service starts. Let's define that:
crm(live)configure# group webservice webip webstore webserver      # one group containing all three resources
crm(live)configure# order webip_before_webstore_before_webserver inf: webip webstore webserver
# a mandatory (inf) order constraint: start webip first, then webstore, then webserver; stopping happens in reverse order
crm(live)configure# verify
crm(live)configure# show xml      # inspect the resulting XML
If the resources have no particular preference for either node you can simply commit at this point -- especially nowadays, with HA stacks often deployed on top of virtualized platforms such as Xen, KVM or OpenStack, node preference tends to matter much less. (If you did want a preference, a location constraint such as the sketch below could be added before committing.)
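A sketch of such an optional preference; the finite score 100 is an arbitrary example value, not something this setup requires:

crm(live)configure# location webservice_prefer_nod1 webservice 100: nod1.test.com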
crm(live)configure# commit
crm(live)configure# cd ..
crm(live)# status
Last updated: Fri Jul 24 22:25:06 2015
Last change: Fri Jul 24 22:25:02 2015
Stack: classic openais (with plugin)
Current DC: nod1.test.com - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
3 Resources configured
Online: [ nod1.test.com nod2.test.com ]
 Resource Group: webservice
     webip      (ocf::heartbeat:IPaddr):        Started nod1.test.com
     webstore   (ocf::heartbeat:Filesystem):    Started nod1.test.com
     webserver  (lsb:nginx):    Started nod1.test.com
# all three resources are running on nod1.test.com
Now browse to the VIP we defined and check that the site answers.
Next, test failover: put nod1.test.com into standby and see whether the resources move to nod2.test.com:
[root@nod1 ~]# crm node standby
[root@nod1 ~]# crm status
Last updated: Fri Jul 24 22:28:03 2015
Last change: Fri Jul 24 22:27:53 2015
Stack: classic openais (with plugin)
Current DC: nod1.test.com - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
3 Resources configured
Node nod1.test.com: standby
Online: [ nod2.test.com ]
 Resource Group: webservice
     webip      (ocf::heartbeat:IPaddr):        Started nod2.test.com
     webstore   (ocf::heartbeat:Filesystem):    Started nod2.test.com
     webserver  (lsb:nginx):    Started nod2.test.com
# all resources have moved to nod2.test.com
Refreshing the page shows the site is still being served.
Failover works. Next, check that the monitor operations do their job: try stopping nginx or unmounting the shared storage, and when the next monitor interval fires the cluster should restart the service or remount the share:
[root@nod2 ~]# service nginx stop
Stopping nginx:                                            [  OK  ]
A little later the cluster notices the failure:
[root@nod1 ~]# crm status
Last updated: Fri Jul 24 22:42:05 2015
Last change: Fri Jul 24 22:36:00 2015
Stack: classic openais (with plugin)
Current DC: nod1.test.com - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
3 Resources configured
Online: [ nod1.test.com nod2.test.com ]
 Resource Group: webservice
     webip      (ocf::heartbeat:IPaddr):        Started nod2.test.com
     webstore   (ocf::heartbeat:Filesystem):    Started nod2.test.com
     webserver  (lsb:nginx):    Started nod2.test.com
Failed actions:
    webserver_monitor_30000 on nod2.test.com 'not running' (7): call=41, status=complete,
    last-rc-change='Fri Jul 24 22:37:25 2015', queued=0ms, exec=0ms
Now test whether the shared storage is monitored and recovered as well:
[root@nod2 ~]# umount /usr/share/nginx/html/
Browsing the site at this point shows nginx's default page instead of the NFS test page.
When the next monitor interval fires, the cluster detects the problem and recovers the mount:
[root@nod1 ~]# crm status
Last updated: Fri Jul 24 22:44:03 2015
Last change: Fri Jul 24 22:36:00 2015
Stack: classic openais (with plugin)
Current DC: nod1.test.com - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
3 Resources configured
Online: [ nod1.test.com nod2.test.com ]
 Resource Group: webservice
     webip      (ocf::heartbeat:IPaddr):        Started nod2.test.com
     webstore   (ocf::heartbeat:Filesystem):    Started nod2.test.com
     webserver  (lsb:nginx):    Started nod2.test.com
Failed actions:
    webserver_monitor_30000 on nod2.test.com 'not running' (7): call=41, status=complete,
    last-rc-change='Fri Jul 24 22:37:25 2015', queued=0ms, exec=0ms
    webstore_monitor_60000 on nod2.test.com 'not running' (7): call=39, status=complete,
    last-rc-change='Fri Jul 24 22:42:55 2015', queued=0ms, exec=0ms
After recovery the web page is served correctly again.
That concludes the demonstration of high availability with corosync + pacemaker + crmsh.
6. Summary
Mastering high-availability architectures is an essential skill for a Linux operations engineer. The theory felt hard to grasp when I first studied it, but after working through the experiments above I have a much better feel for HA architecture, and for the theory covered in the previous post.
When building a two-node HA cluster with corosync + pacemaker, remember to set the global properties that disable STONITH and ignore the loss of quorum when no more than half of the votes remain:
crm(live)configure# property no-quorum-policy=ignore
crm(live)configure# property stonith-enabled=false
Source: the blog "专注运维,与Linux共舞" -- http://zhaochj.blog.51cto.com/368705/1678307