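Corosync overview:
Corosync is part of a cluster management suite. A simple configuration file defines how it passes messages between nodes, including the transport and the protocol. It was released in 2008, but it is not entirely new software: the OpenAIS project, started in 2002, grew too large and was split into two subprojects, and the part that carries HA heartbeat messaging became Corosync. Roughly 60% of its code comes from OpenAIS. Corosync can provide complete HA functionality on its own; more numerous and more complex features still require OpenAIS. Corosync is the direction of future development, and new projects generally adopt it.
For management, hb_gui offers a capable graphical HA management interface. Other graphical tools include luci + ricci from the RHCS suite and LCMC, a cluster management tool written in Java. Like heartbeat, Corosync is a tool for building highly available clusters. That covers the basics of corosync and pacemaker; next we look at how to install them.
Installing Corosync and pacemaker:
1. Environment
(1). Operating system
CentOS 6.5, x86_64
(2). Software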
**corosync-1.4.1-17.el6.x86_64
**crmsh-1.2.6-4.el6.x86_64.rpm
**pssh-2.3.1-2.el6.x86_64.rpm
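(3). Topology
Number of nodes: 3 (node1, node2, nfs)
node1: 172.16.100.6, node2: 172.16.100.7, nfs: 172.16.100.9, TestHost: 172.16.100.88
[Figure: cluster topology diagram]
2. Installation and configuration:
1. Preparation
To make a Linux host an HA node, the following preparation is usually required: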
1) Host name and IP address resolution must work for every node, and each node's host name must match the output of "uname -n". Make sure /etc/hosts on both nodes contains the following:
# vim /etc/hosts
172.16.100.6 node1.samlee.com node1
172.16.100.7 node2.samlee.com node2
To keep these host names after a reboot, run commands like the following on each node:
On node1:
# sed -i 's@\(HOSTNAME=\).*@\1node1.samlee.com@g' /etc/sysconfig/network
# hostname node1.samlee.com
On node2:
# sed -i 's@\(HOSTNAME=\).*@\1node2.samlee.com@g' /etc/sysconfig/network
# hostname node2.samlee.com
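The requirement that the name reported by "uname -n" is resolvable from the hosts file can be checked with a few lines of shell. This is only a sketch: hosts_has_name is a hypothetical helper, not part of corosync or of the original procedure, and it only looks for an exact match among the names and aliases in a hosts-format file.

```shell
#!/bin/sh
# Hypothetical helper (an illustration, not part of corosync):
# succeed if NAME appears as a host name or alias on a non-comment
# line of a hosts-format file (default: /etc/hosts).
hosts_has_name() {
    awk -v n="$1" '
        { sub(/#.*/, "") }                                  # drop comments
        { for (i = 2; i <= NF; i++) if ($i == n) found = 1 } # field 1 is the IP
        END { exit (found ? 0 : 1) }
    ' "${2:-/etc/hosts}"
}
```

On each node, running hosts_has_name "$(uname -n)" && echo "hosts entry OK" should print the message once the files above are in place.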
2) Set up key-based ssh between the two nodes, which can be done with the following commands:
On node1:
# ssh-keygen -t rsa -P ''
# ssh-copy-id -i ~/.ssh/id_rsa.pub root@node2
# ssh node2.samlee.com 'date';date
On node2:
# ssh-keygen -t rsa -P ''
# ssh-copy-id -i ~/.ssh/id_rsa.pub root@node1
# ssh node1.samlee.com 'date';date
3) Synchronize the time automatically every 5 minutes (configure on both node1 and node2)
# crontab -e
*/5 * * * * /usr/sbin/ntpdate 172.16.100.10 &> /dev/null
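2. Install and configure the Corosync cluster management tools
1) Install Corosync and pacemaker (with yum)
# yum -y install corosync pacemaker
Install crmsh (from rpm)
Since RHEL 6.4, crmsh, the command-line cluster configuration tool, is no longer shipped; pcs is provided instead. If you are used to the crm command, download the packages and install them yourself. crmsh depends on pssh, so download both.
# cd /root/corosync_packages/
# yum -y --nogpgcheck localinstall crmsh*.rpm pssh*.rpm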
2) Configure corosync (run on node1.samlee.com)
# cd /etc/corosync/
# cp corosync.conf.example corosync.conf
# vim corosync.conf
# Please read the corosync.conf.5 manual page
compatibility: whitetank
totem {
    version: 2
    # enable authentication
    secauth: on
    # thread count (number of CPUs)
    threads: 0
    interface {
        ringnumber: 0
        # network address of the network the cluster nodes run on
        bindnetaddr: 172.16.0.0
        # multicast address
        mcastaddr: 226.96.6.17
        # port used for heartbeat messages
        mcastport: 5405
        ttl: 1
    }
}
logging {
    fileline: off
    to_stderr: no
    to_logfile: yes
    to_syslog: yes
    logfile: /var/log/cluster/corosync.log
    debug: off
    timestamp: on
    logger_subsys {
        subsys: AMF
        debug: off
    }
}
amf {
    mode: disabled
}
## services started together with corosync
service {
    ver: 0
    name: pacemaker
}
## identity that ais runs as
aisexec {
    user: root
    group: root
}
Set the IP address after bindnetaddr in this file to the network address of the network your NIC is on. The two nodes here are on the 172.16.0.0 network, so it is set to 172.16.0.0:
bindnetaddr: 172.16.0.0
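The bindnetaddr value is the node address with the host bits masked off. A minimal sketch of that calculation follows; network_addr is a hypothetical helper (not a corosync tool), and the 255.255.0.0 netmask is taken from the /16 prefix used for these nodes later in the article.

```shell
#!/bin/sh
# Hypothetical helper (an illustration, not a corosync tool):
# print the IPv4 network address for an address and a dotted-quad
# netmask by AND-ing the two octet by octet.
# The body runs in a subshell so the IFS change does not leak out.
network_addr() (
    set -f                      # no pathname expansion while splitting
    IFS=.
    set -- $1 $2                # 4 address octets followed by 4 mask octets
    [ "$#" -eq 8 ] || exit 1
    echo "$(( $1 & $5 )).$(( $2 & $6 )).$(( $3 & $7 )).$(( $4 & $8 ))"
)
```

For node1, network_addr 172.16.100.6 255.255.0.0 prints 172.16.0.0, the value used for bindnetaddr above.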
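3) Generate the authentication key file used for communication between the nodes:
# corosync-keygen
If the system does not have enough random data, type rapidly on the keyboard of a logged-in session until the command finishes.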
4) Copy corosync.conf and authkey to node2:
# scp -p corosync.conf authkey node2:/etc/corosync/
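The -p option matters because the key should stay private on the copy as well. The sketch below assumes that corosync-keygen leaves authkey readable only by its owner (mode 0400); authkey_is_private is a hypothetical helper that simply compares the permission string reported by ls.

```shell
#!/bin/sh
# Hypothetical helper (an assumption-based check, not a corosync tool):
# succeed if the file is a regular file whose permission string is
# exactly owner read-only (mode 0400).
authkey_is_private() (
    perms=$(ls -ld "$1" | cut -c1-10)
    [ "$perms" = "-r--------" ]
)
```

Usage on either node: authkey_is_private /etc/corosync/authkey || echo "authkey permissions differ from 0400".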
5) Create the directory for the logs corosync generates, on both node1 and node2:
# mkdir /var/log/cluster
# ssh node2 'mkdir /var/log/cluster'
6) Start the corosync service
# service corosync start
# ssh node2 '/etc/init.d/corosync start'
7) Check that the corosync cluster engine started correctly:
# grep -e "Corosync Cluster Engine" -e "configuration file" /var/log/cluster/corosync.log
# ssh node2 'grep -e "Corosync Cluster Engine" -e "configuration file" /var/log/cluster/corosync.log'
Output like the following indicates a normal start:
Aug 13 11:26:58 corosync [MAIN ] Corosync Cluster Engine ('1.4.1'): started and ready to provide service.
Aug 13 11:26:58 corosync [MAIN ] Successfully read main configuration file '/etc/corosync/corosync.conf'.
8) Check that the initial membership notifications were sent:
# grep TOTEM /var/log/cluster/corosync.log
# ssh node2 'grep TOTEM /var/log/cluster/corosync.log'
Output like the following indicates they were sent:
Aug 13 13:19:20 corosync [TOTEM ] Initializing transport (UDP/IP Multicast).
Aug 13 13:19:20 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Aug 13 13:19:20 corosync [TOTEM ] The network interface [172.16.100.6] is now up.
Aug 13 13:19:20 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug 13 11:26:59 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
9) Check whether any errors occurred during startup. The error messages below say that pacemaker will soon no longer run as a corosync plugin and therefore recommend cman as the cluster infrastructure service; they can be safely ignored here.
# grep ERROR: /var/log/cluster/corosync.log | grep -v unpack_resources
Aug 13 13:19:20 corosync [pcmk ] ERROR: process_ais_conf: You have configured a cluster using the Pacemaker plugin for Corosync. The plugin is not supported in this environment and will be removed very soon.
Aug 13 13:19:20 corosync [pcmk ] ERROR: process_ais_conf: Please see Chapter 8 of 'Clusters from Scratch' (http://www.clusterlabs.org/doc) for details on using Pacemaker with CMAN
10) Check that pacemaker started correctly:
# grep pcmk_startup /var/log/cluster/corosync.log
Aug 13 13:19:20 corosync [pcmk ] info: pcmk_startup: CRM: Initialized
Aug 13 13:19:20 corosync [pcmk ] Logging: Initialized pcmk_startup
Aug 13 13:19:20 corosync [pcmk ] info: pcmk_startup: Maximum core file size is: 18446744073709551615
Aug 13 13:19:20 corosync [pcmk ] info: pcmk_startup: Service: 9
Aug 13 13:19:20 corosync [pcmk ] info: pcmk_startup: Local hostname: node1.samlee.com
11) If crmsh is installed, the following command shows the startup status of the cluster nodes:
# crm status
Last updated: Sat Aug 13 13:42:26 2016
Last change: Sat Aug 13 13:19:58 2016 by hacluster via crmd on node1.samlee.com
Stack: classic openais (with plugin)
Current DC: node1.samlee.com (version 1.1.14-8.el6-70404b0) - partition with quorum
2 nodes and 0 resources configured, 2 expected votes
Online: [ node1.samlee.com node2.samlee.com ]
12) Check that the corosync port is open:
# ss -tunlp | grep 5405
udp UNCONN 0 0 172.16.100.6:5405 *:* users:(("corosync",5879,15))
udp UNCONN 0 0 226.96.6.17:5405 *:* users:(("corosync",5879,11))
# ssh node2 'ss -tunlp | grep 5405'
udp UNCONN 0 0 172.16.100.7:5405 *:* users:(("corosync",5047,15))
udp UNCONN 0 0 226.96.6.17:5405 *:* users:(("corosync",5047,11))
The output above shows that both nodes have started normally and that the cluster is working.
13) Run ps auxf to see the processes corosync started:
# ps auxf
root 5879 0.9 0.9 545200 4648 ? Ssl 13:19 0:17 corosync
496 5884 0.0 2.1 94608 10672 ? S< 13:19 0:00 \_ /usr/libexec/pacemaker/cib
root 5885 0.0 0.8 95148 3968 ? S< 13:19 0:00 \_ /usr/libexec/pacemaker/stonithd
root 5886 0.0 0.5 62932 2788 ? S< 13:19 0:00 \_ /usr/libexec/pacemaker/lrmd
496 5887 0.0 0.6 85936 3196 ? S< 13:19 0:00 \_ /usr/libexec/pacemaker/attrd
496 5888 0.0 3.7 117468 18504 ? S< 13:19 0:00 \_ /usr/libexec/pacemaker/pengine
496 5889 0.0 0.8 135988 4228 ? S< 13:19 0:01 \_ /usr/libexec/pacemaker/crmd
3. Cluster resource management
Introduction to crmsh
[root@node1 ~]# crm    ## enter crmsh
crm(live)# help    ## show help
This is crm shell, a Pacemaker command line interface.
Available commands:
    cib              manage shadow CIBs    ## CIB management
    resource         resources management    ## resource management
    configure        CRM cluster configuration    ## CRM configuration: resource stickiness, resource types, resource constraints, etc.
    node             nodes management    ## node management
    options          user preferences    ## user preferences
    history          CRM cluster history    ## CRM history
    site             Geo-cluster support    ## geo-cluster support
    ra               resource agents information center    ## resource agent information
    status           show cluster status    ## show cluster status
    help,?           show help (help topics for list of topics)    ## show help
    end,cd,up        go back one level    ## go back one level
    quit,bye,exit    exit the program    ## exit
crm(live)# configure    ## enter configuration mode
crm(live)configure# show    ## show the current configuration
node node1.samlee.com
node node2.samlee.com
property $id="cib-bootstrap-options"
    dc-version="1.1.10-14.el6-368c726"
    cluster-infrastructure="classic openais (with plugin)"
    expected-quorum-votes="2"
crm(live)configure# verify    ## check the current configuration; it reports errors because no STONITH is defined, which can be disabled
error: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
error: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
error: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity
Errors found during check: config not valid
crm(live)configure# property stonith-enabled=false    ## disable stonith, then check again: no errors
crm(live)configure# verify
crm(live)configure# commit    ## commit the configuration
crm(live)configure# cd
crm(live)# ra    ## enter RA (resource agent) mode
crm(live)ra# help
This level contains commands which show various information about
the installed resource agents. It is available both at the top
level and at the `configure` level.
Available commands:
    classes      list classes and providers    ## list RA classes
    list         list RA for a class (and provider)    ## list the RAs of a class (or provider)
    meta         show meta data for a RA    ## show details of an RA
    providers    show providers for a RA and a class    ## show the providers for an RA and a class
    help         show help (help topics for list of topics)
    end          go back one level
    quit         exit the program
crm(live)ra# classes
lsb
ocf / heartbeat pacemaker
service
stonith
crm(live)ra# list ocf pacemaker
ClusterMon Dummy HealthCPU HealthSMART Stateful SysInfo SystemHealth controld ping pingd remote
crm(live)ra# info ocf:heartbeat:IPaddr
crm(live)ra# cd
crm(live)# status    ## show cluster status
Last updated: Sat Aug 13 15:51:13 2016
Last change: Sat Aug 13 15:46:19 2016 via cibadmin on node1.samlee.com
Stack: classic openais (with plugin)
Current DC: node2.samlee.com - partition with quorum
Version: 1.1.10-14.el6-368c726
2 Nodes configured, 2 expected votes
0 Resources configured
Online: [ node1.samlee.com node2.samlee.com ]
The quorum problem:
In a two-node cluster the number of votes is even. When the heartbeat link fails (split-brain), neither node can reach quorum, and the default quorum policy shuts down the cluster services. To avoid this, either add votes so that the total is odd, or change the default quorum policy to ignore.
crm(live)# configure
crm(live)configure# property no-quorum-policy=ignore
crm(live)configure# show
node node1.samlee.com
node node2.samlee.com
property $id="cib-bootstrap-options"
    dc-version="1.1.10-14.el6-368c726"
    cluster-infrastructure="classic openais (with plugin)"
    expected-quorum-votes="2"
    stonith-enabled="false"
    no-quorum-policy="ignore"
crm(live)configure# verify
crm(live)configure# commit
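The vote arithmetic behind this can be sketched in a few lines. This is an illustration only, not Pacemaker code: has_quorum is a hypothetical helper that assumes one vote per node and treats a partition as quorate when it holds more than half of the expected votes.

```shell
#!/bin/sh
# Hypothetical helper (an illustration, not Pacemaker code):
# $1 = expected votes in the whole cluster
# $2 = votes held by the partition being checked
# Succeeds when the partition holds strictly more than half of the votes.
has_quorum() {
    [ $(( $2 * 2 )) -gt "$1" ]
}
```

With 2 expected votes, has_quorum 2 1 fails for each isolated node, while has_quorum 3 2 succeeds. That is why a two-node cluster needs either an odd vote total or no-quorum-policy=ignore.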
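Preventing resources from moving back after a node recovers:
When a failure occurs, resources migrate to a healthy node. Once the failed node recovers, the resources may move back to it. This is not always the best strategy, because every migration involves downtime, and for complex applications such as an Oracle database that downtime is longer. To avoid this, set a resource stickiness.
crm(live)configure# rsc_defaults resource-stickiness=100    ## set resource stickiness to 100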
Example: configuring a highly available web cluster
(1) Define the VIP:
crm(live)# configure
crm(live)configure# primitive webip ocf:heartbeat:IPaddr params ip=172.16.100.99 nic=eth0 cidr_netmask=16
crm(live)configure# verify
crm(live)configure# commit
crm(live)configure# cd
crm(live)# status
Last updated: Sat Aug 13 17:46:25 2016
Last change: Sat Aug 13 17:46:17 2016 via cibadmin on node1.samlee.com
Stack: classic openais (with plugin)
Current DC: node2.samlee.com - partition with quorum
Version: 1.1.10-14.el6-368c726
2 Nodes configured, 2 expected votes
1 Resources configured
Online: [ node1.samlee.com node2.samlee.com ]
 webip (ocf::heartbeat:IPaddr): Started node1.samlee.com
The last line shows that the resource has started on node1. The ip addr show command confirms that the VIP is active:
# ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:0c:29:07:45:da brd ff:ff:ff:ff:ff:ff
    inet 172.16.100.6/16 brd 172.16.255.255 scope global eth0
    inet 172.16.100.99/16 brd 172.16.255.255 scope global secondary eth0    ## the VIP is active
    inet6 fe80::20c:29ff:fe07:45da/64 scope link
       valid_lft forever preferred_lft forever
(2) Configure the httpd resource
Web service setup on node1:
# yum -y install httpd
# echo "<h1>node1.samlee.com</h1>" >/var/www/html/index.html
# service httpd start
# chkconfig httpd off
# service httpd stop
Web service setup on node2:
# yum -y install httpd
# echo "<h1>node2.samlee.com</h1>" >/var/www/html/index.html
# service httpd start
# chkconfig httpd off
# service httpd stop
---------------------------------------------------------------------
crm(live)# configure
crm(live)configure# primitive webserver lsb:httpd
crm(live)configure# show
node node1.samlee.com
node node2.samlee.com
primitive webip ocf:heartbeat:IPaddr
    params ip="172.16.100.99"
primitive webserver lsb:httpd
property $id="cib-bootstrap-options"
    dc-version="1.1.10-14.el6-368c726"
    cluster-infrastructure="classic openais (with plugin)"
    expected-quorum-votes="2"
    stonith-enabled="false"
    no-quorum-policy="ignore"
rsc_defaults $id="rsc-options"
    resource-stickiness="100"
crm(live)configure# verify
crm(live)configure# commit
crm(live)configure# cd
crm(live)# status
Last updated: Sat Aug 13 17:55:46 2016
Last change: Sat Aug 13 17:55:19 2016 via cibadmin on node1.samlee.com
Stack: classic openais (with plugin)
Current DC: node2.samlee.com - partition with quorum
Version: 1.1.10-14.el6-368c726
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ node1.samlee.com node2.samlee.com ]
 webip (ocf::heartbeat:IPaddr): Started node1.samlee.com
 webserver (lsb:httpd): Started node2.samlee.com
The output above shows that webip and webserver may end up running on different nodes. That does not work for an application that serves the web through this IP: both resources must run on the same node. How can two resources be kept on the same node?
(1) Move a resource to another node by hand (for cases where automatic placement does not give the desired result; for testing only)
crm(live)# resource
crm(live)resource# list
 webip (ocf::heartbeat:IPaddr): Started
 webserver (lsb:httpd): Started
crm(live)resource# migrate webserver
crm(live)resource# cd
crm(live)# status
Last updated: Mon Aug 15 09:57:34 2016
Last change: Mon Aug 15 09:57:09 2016 via crm_resource on node1.samlee.com
Stack: classic openais (with plugin)
Current DC: node1.samlee.com - partition with quorum
Version: 1.1.10-14.el6-368c726
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ node1.samlee.com node2.samlee.com ]
 webip (ocf::heartbeat:IPaddr): Started node1.samlee.com
 webserver (lsb:httpd): Started node1.samlee.com
[Figure: result after the switch]
(2) Create a resource group (put the resources that must start together into one group)
crm(live)# configure
crm(live)configure# group webservice webip webserver
crm(live)configure# verify
crm(live)configure# commit
crm(live)configure# cd
crm(live)# resource
crm(live)resource# list
 Resource Group: webservice
     webip (ocf::heartbeat:IPaddr): Started
     webserver (lsb:httpd): Started
crm(live)resource# cd
crm(live)# status
Last updated: Mon Aug 15 10:06:17 2016
Last change: Mon Aug 15 10:04:33 2016 via cibadmin on node1.samlee.com
Stack: classic openais (with plugin)
Current DC: node1.samlee.com - partition with quorum
Version: 1.1.10-14.el6-368c726
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ node1.samlee.com node2.samlee.com ]
 Resource Group: webservice
     webip (ocf::heartbeat:IPaddr): Started node1.samlee.com
     webserver (lsb:httpd): Started node1.samlee.com
[Figure: test result]
After testing, delete the resource group:
crm(live)# resource
crm(live)resource# stop webservice
crm(live)resource# cleanup webservice
crm(live)resource# cd
crm(live)# configure
crm(live)configure# delete webservice
crm(live)configure# verify
crm(live)configure# commit
crm(live)configure# cd
crm(live)# status
Last updated: Mon Aug 15 10:31:30 2016
Last change: Mon Aug 15 10:26:21 2016 via cibadmin on node1.samlee.com
Stack: classic openais (with plugin)
Current DC: node1.samlee.com - partition with quorum
Version: 1.1.10-14.el6-368c726
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ node1.samlee.com node2.samlee.com ]
 webip (ocf::heartbeat:IPaddr): Started node1.samlee.com
 webserver (lsb:httpd): Started node2.samlee.com
## Stopping resources and clearing their records
# crm
crm(live)# resource
crm(live)resource# stop webservice
crm(live)resource# list
crm(live)resource# cleanup webservice
crm(live)resource# cleanup webip
crm(live)resource# cleanup webserver
crm(live)resource# cd
crm(live)# node
crm(live)node# clearstate node1.samlee.com
crm(live)node# clearstate node2.samlee.com
crm(live)node# cd
crm(live)# resource
crm(live)resource# start webservice
crm(live)resource# reprobe
crm(live)resource# refresh
crm(live)resource# cd
crm(live)# configure
crm(live)configure# show
crm(live)configure# edit
crm(live)configure# verify
crm(live)configure# commit
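(3) Fine-grained resource management with resource constraints
As the examples above show, even when a cluster has every resource it needs, it may still not handle them correctly. Resource constraints specify which cluster nodes a resource runs on, the order in which resources are loaded, and which other resources a given resource depends on. pacemaker provides three kinds of resource constraints:
1) Resource Location (location constraint): defines on which nodes a resource may, may not, or should preferably run;
2) Resource Colocation (colocation constraint): defines which cluster resources may or may not run together on the same node;
3) Resource Order (order constraint): defines the order in which cluster resources start on a node;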
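A constraint definition also needs a score. Scores of all kinds are central to how the cluster works: everything from migrating a resource to deciding which resources to stop in a degraded cluster is done by adjusting scores in some way. Scores are calculated per resource, and any node with a negative score for a resource cannot run that resource. Once the scores are calculated, the cluster picks the node with the highest score. INFINITY is currently defined as 1,000,000. Adding and subtracting infinity follows these 3 basic rules:
1) any value + INFINITY = INFINITY
2) any value - INFINITY = -INFINITY
3) INFINITY - INFINITY = -INFINITY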
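The three rules can be written out as a small function. This is a sketch of the arithmetic only, not Pacemaker's implementation: score_add and INF are hypothetical names, scores are passed as plain integers, and -INFINITY is checked first so that it wins over +INFINITY as in rule 3.

```shell
#!/bin/sh
# Hypothetical illustration of the score rules; not Pacemaker code.
INF=1000000

# Print the sum of two integer scores, saturating at +/- INFINITY.
# Runs in a subshell so its variables stay local.
score_add() (
    a=$1
    b=$2
    if [ "$a" -le "-$INF" ] || [ "$b" -le "-$INF" ]; then
        echo "-$INF"            # rules 2 and 3: -INFINITY always wins
    elif [ "$a" -ge "$INF" ] || [ "$b" -ge "$INF" ]; then
        echo "$INF"             # rule 1: any value + INFINITY
    else
        sum=$(( a + b ))
        if [ "$sum" -ge "$INF" ]; then
            echo "$INF"
        elif [ "$sum" -le "-$INF" ]; then
            echo "-$INF"
        else
            echo "$sum"
        fi
    fi
)
```

For example, score_add 100 1000000 prints 1000000, while score_add 1000000 -1000000 prints -1000000.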
A score can also be given to each constraint when it is defined. The score is the value assigned to that constraint: constraints with higher scores are applied first, those with lower scores later. By creating several location constraints with different scores for a given resource, you can specify the order of the nodes it fails over to.
So the earlier problem of webip and webserver possibly running on different nodes can be solved with a colocation constraint:
crm(live)# configure
crm(live)configure# colocation webserver_with_webip inf: webserver webip
crm(live)configure# verify
crm(live)configure# commit
crm(live)configure# cd
crm(live)# status
Last updated: Mon Aug 15 11:03:31 2016
Last change: Mon Aug 15 11:02:47 2016 via cibadmin on node1.samlee.com
Stack: classic openais (with plugin)
Current DC: node1.samlee.com - partition with quorum
Version: 1.1.10-14.el6-368c726
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ node1.samlee.com node2.samlee.com ]
 webip (ocf::heartbeat:IPaddr): Started node1.samlee.com
 webserver (lsb:httpd): Started node1.samlee.com
Both resources now run on the same node. Next, an order constraint defines the order in which they start:
## start webip first, then webserver
crm(live)configure# order webip_before_webserver mandatory: webip webserver
crm(live)configure# verify
crm(live)configure# commit
[Figure: test result]
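In addition, an HA cluster does not require its nodes to have the same or similar performance, so sometimes we want the service to run on a particular, more powerful node whenever things are normal. A location constraint does this:
crm(live)# configure
crm(live)configure# location webip_on_node2 webip 200: node2.samlee.com
crm(live)configure# verify
crm(live)configure# commit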
Define resource monitoring, so that the cluster notices when a service stops or restarts:
crm(live)configure# primitive vip ocf:heartbeat:IPaddr params ip=172.16.100.100 op monitor interval=30s timeout=20s
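-- This concludes "High-availability cluster technology: corosync in detail (Part 1)".
This article comes from the "Opensamlee" blog; please keep this attribution: http://gzsamlee.blog.51cto.com/9976612/1838084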