
RAC Cluster Node Failure Simulation Tests

Published: 2015-05-22 00:00:21


RAC node failure simulation tests

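Single RAC node reboot simulation test
Test procedure
Reboot the node with shutdown -Fr, then observe how the system reacts and how long the database takes to restart.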
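The restart time mentioned in this step can be measured by polling the cluster status instead of watching it by hand. The following is a minimal Python sketch, not part of the original test: the function names, the 10-second polling interval and the 30-minute timeout are assumptions, and "recovered" is taken to mean that every resource row in the crs_stat -t output has State equal to Target.

```python
import time
from typing import Callable, List, Tuple


def parse_crs_stat(text: str) -> List[Tuple[str, str, str]]:
    """Return (name, target, state) for each resource row of `crs_stat -t` output."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Resource rows start with "ora." and carry Name, Type, Target, State.
        if len(parts) >= 4 and parts[0].startswith("ora."):
            rows.append((parts[0], parts[2], parts[3]))
    return rows


def wait_until_recovered(
    fetch: Callable[[], str],
    timeout_s: float = 1800.0,   # assumed upper bound, not from the test
    interval_s: float = 10.0,    # assumed polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll until every resource's State equals its Target; return elapsed seconds."""
    start = clock()
    while True:
        rows = parse_crs_stat(fetch())
        if rows and all(target == state for _, target, state in rows):
            return clock() - start
        if clock() - start >= timeout_s:
            raise TimeoutError("cluster did not recover within the timeout")
        sleep(interval_s)
```

Here fetch is any function that returns the current crs_stat -t text, for example one that runs the command through subprocess.run on a surviving node. Injecting clock and sleep keeps the loop testable without a real cluster.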
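Expected result
When a single node is rebooted, its VIP fails over to another node. Once the system is back up, the cluster services and the database on that node start automatically and rejoin the cluster, and the VIP moves back to its original node.
Test record
Reboot the third node with the shutdown command.
After the third node has gone down, check the CRS service status: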
RAC02:oracle:db2 > crs_stat -t
Name           Type           Target    State     Host        
------------------------------------------------------------
ora.db.db application         ONLINE    ONLINE    rac...ac01 
ora....b1.inst application    ONLINE    ONLINE    rac...ac01 
ora....b2.inst application    ONLINE    ONLINE    rac...ac02 
ora....b3.inst application    ONLINE    OFFLINE               
ora....SM1.asm application    ONLINE    ONLINE    rac...ac01 
ora....01.lsnr application    ONLINE    ONLINE    rac...ac01 
ora....c01.gsd application    ONLINE    ONLINE    rac...ac01 
ora....c01.ons application    ONLINE    ONLINE    rac...ac01 
ora....c01.vip application    ONLINE    ONLINE    rac...ac01 
ora....SM2.asm application    ONLINE    ONLINE    rac...ac02 
ora....02.lsnr application    ONLINE    ONLINE    rac...ac02 
ora....c02.gsd application    ONLINE    ONLINE    rac...ac02 
ora....c02.ons application    ONLINE    ONLINE    rac...ac02 
ora....c02.vip application    ONLINE    ONLINE    rac...ac02 
ora....SM3.asm application    ONLINE    OFFLINE               
ora....03.lsnr application    ONLINE    OFFLINE               
ora....c03.gsd application    ONLINE    OFFLINE               
ora....c03.ons application    ONLINE    OFFLINE               
ora....c03.vip application    ONLINE    ONLINE    rac...ac01    


2012-06-15 10:50:33.574: [  CRSRES][1409472832] Attempting to start `ora.rac03.vip` on member `rac01`
2012-06-15 10:50:36.806: [  CRSRES][1409472832] Start of `ora.rac03.vip` on member `rac01` succeeded.
2012-06-15 10:50:36.821: [  CRSEVT][1407371584] Post recovery done evmd event for: rac03
2012-06-15 10:50:36.821: [    CRSD][1407371584] SM: recoveryDone: 0
As the log and the crs_stat output show, the third node's VIP (ora.rac03.vip) has failed over to the first node.
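Reading these status tables by eye gets tedious across many test runs. As a rough sketch (not part of the original test), the Python function below lists the OFFLINE resources and the VIPs that are currently hosted away from their home node. It assumes the node number is the last run of digits in both the truncated resource name (ora....c03.vip) and the truncated host name (rac...ac01), which holds for the naming used in this cluster but is only a heuristic.

```python
import re
from typing import Dict, List


def summarize_crs_stat(text: str) -> Dict[str, List[str]]:
    """List OFFLINE resources and VIPs hosted on a node other than their own."""
    offline: List[str] = []
    relocated: List[str] = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 4 or not parts[0].startswith("ora."):
            continue  # skip the header, the separator and blank lines
        name, state = parts[0], parts[3]
        host = parts[4] if len(parts) > 4 else ""
        if state != "ONLINE":
            offline.append(name)
        elif name.endswith(".vip"):
            # Heuristic (assumption): compare the last digit run of the
            # resource name with the last digit run of the host name.
            home = re.findall(r"\d+", name)
            current = re.findall(r"\d+", host)
            if home and current and home[-1] != current[-1]:
                relocated.append(f"{name} -> {host}")
    return {"offline": offline, "relocated": relocated}
```

Applied to the table above, this would report the third node's instance, ASM, listener, GSD and ONS resources as offline and ora....c03.vip as relocated to rac...ac01.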

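About 15 minutes later, the third node's host came back up.
CRS and the database services were brought up automatically and returned to normal about two minutes later, and the VIP moved back to its original node.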
RAC02:oracle:db2 > crs_stat -t
Name           Type           Target    State     Host        
------------------------------------------------------------
ora.db.db application    ONLINE    ONLINE    rac...ac01 
ora....b1.inst application    ONLINE    ONLINE    rac...ac01 
ora....b2.inst application    ONLINE    ONLINE    rac...ac02 
ora....b3.inst application    ONLINE    ONLINE    rac...ac03 
ora....SM1.asm application    ONLINE    ONLINE    rac...ac01 
ora....01.lsnr application    ONLINE    ONLINE    rac...ac01 
ora....c01.gsd application    ONLINE    ONLINE    rac...ac01 
ora....c01.ons application    ONLINE    ONLINE    rac...ac01 
ora....c01.vip application    ONLINE    ONLINE    rac...ac01 
ora....SM2.asm application    ONLINE    ONLINE    rac...ac02 
ora....02.lsnr application    ONLINE    ONLINE    rac...ac02 
ora....c02.gsd application    ONLINE    ONLINE    rac...ac02 
ora....c02.ons application    ONLINE    ONLINE    rac...ac02 
ora....c02.vip application    ONLINE    ONLINE    rac...ac02 
ora....SM3.asm application    ONLINE    ONLINE    rac...ac03 
ora....03.lsnr application    ONLINE    ONLINE    rac...ac03 
ora....c03.gsd application    ONLINE    ONLINE    rac...ac03 
ora....c03.ons application    ONLINE    ONLINE    rac...ac03 

Test: shut down the first and second nodes and reboot the servers
RAC03:oracle:db3 > crs_stat -t
Name           Type           Target    State     Host        
------------------------------------------------------------
ora.db.db application    ONLINE    ONLINE    rac...ac03 
ora....b1.inst application    ONLINE    OFFLINE               
ora....b2.inst application    ONLINE    OFFLINE               
ora....b3.inst application    ONLINE    ONLINE    rac...ac03 
ora....SM1.asm application    ONLINE    OFFLINE               
ora....01.lsnr application    ONLINE    OFFLINE               
ora....c01.gsd application    ONLINE    OFFLINE               
ora....c01.ons application    ONLINE    OFFLINE               
ora....c01.vip application    ONLINE    ONLINE    rac...ac03 
ora....SM2.asm application    ONLINE    OFFLINE               
ora....02.lsnr application    ONLINE    OFFLINE               
ora....c02.gsd application    ONLINE    OFFLINE               
ora....c02.ons application    ONLINE    OFFLINE               
ora....c02.vip application    ONLINE    ONLINE    rac...ac03 
ora....SM3.asm application    ONLINE    ONLINE    rac...ac03 
ora....03.lsnr application    ONLINE    ONLINE    rac...ac03 
ora....c03.gsd application    ONLINE    ONLINE    rac...ac03 
ora....c03.ons application    ONLINE    ONLINE    rac...ac03 
ora....c03.vip application    ONLINE    ONLINE    rac...ac03
The nodeapps, ASM and database instances of the first and second nodes are all OFFLINE, and their VIPs have moved to the third node.

About 10 minutes later, once the nodes had rebooted, CRS started all components automatically and the system was back to normal.
RAC03:oracle:db3 > crs_stat -t
Name           Type           Target    State     Host        
------------------------------------------------------------
ora.db.db application    ONLINE    ONLINE    rac...ac03 
ora....b1.inst application    ONLINE    ONLINE    rac...ac01 
ora....b2.inst application    ONLINE    ONLINE    rac...ac02 
ora....b3.inst application    ONLINE    ONLINE    rac...ac03 
ora....SM1.asm application    ONLINE    ONLINE    rac...ac01 
ora....01.lsnr application    ONLINE    ONLINE    rac...ac01 
ora....c01.gsd application    ONLINE    ONLINE    rac...ac01 
ora....c01.ons application    ONLINE    ONLINE    rac...ac01 
ora....c01.vip application    ONLINE    ONLINE    rac...ac01 
ora....SM2.asm application    ONLINE    ONLINE    rac...ac02 
ora....02.lsnr application    ONLINE    ONLINE    rac...ac02 
ora....c02.gsd application    ONLINE    ONLINE    rac...ac02 
ora....c02.ons application    ONLINE    ONLINE    rac...ac02 
ora....c02.vip application    ONLINE    ONLINE    rac...ac02 
ora....SM3.asm application    ONLINE    ONLINE    rac...ac03 
ora....03.lsnr application    ONLINE    ONLINE    rac...ac03 
ora....c03.gsd application    ONLINE    ONLINE    rac...ac03 
ora....c03.ons application    ONLINE    ONLINE    rac...ac03 
ora....c03.vip application    ONLINE    ONLINE    rac...ac03

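Test summary
When a host is forcibly rebooted, the database instance on that node is shut down with it. The database performs instance recovery on its own and the VIP fails over at the same time, so front-end applications are not affected. After the host has rebooted, the cluster services and the database services start automatically and rejoin the cluster, and the VIP moves back. Normal operation is maintained.
Network failure simulation tests
This part simulates incidents that can occur in production by interrupting the RAC public network, interrupting the interconnect network, and crashing the listener. The tests are used to determine the measures to take and the relevant metrics.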

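Public network failure simulation test
Test procedure
On one RAC node, unplug the cable from the public network interface to simulate a public network outage.
Expected result
The node's VIP and database instance should be deregistered from the surviving listeners, and after a short while the VIP should fail over to a surviving node.
Test record
A. Unplugging the public network cable of the third node
Unplug the public network cable of the third node.
The log shows CRS detecting the public network failure on the third node, attempting a repair, and finally moving the VIP to the first node. The VIP failover completes within a few seconds.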
2012-06-15 14:35:28.078: [  CRSRES][1407371584] In stateChanged, ora.rac03.vip target is ONLINE
2012-06-15 14:35:28.078: [  CRSRES][1407371584] ora.rac03.vip on rac03 went OFFLINE unexpectedly
2012-06-15 14:35:28.078: [  CRSRES][1407371584] StopResource: setting CLI values
2012-06-15 14:35:28.081: [  CRSRES][1407371584] Attempting to stop `ora.rac03.vip` on member `rac03`
2012-06-15 14:35:28.349: [  CRSRES][1407371584] Stop of `ora.rac03.vip` on member `rac03` succeeded.
2012-06-15 14:35:28.349: [  CRSRES][1407371584] ora.rac03.vip RESTART_COUNT=0 RESTART_ATTEMPTS=0
2012-06-15 14:35:28.351: [  CRSRES][1407371584] ora.rac03.vip failed on rac03 relocating.
2012-06-15 14:35:28.372: [  CRSRES][1407371584] StopResource: setting CLI values
2012-06-15 14:35:28.374: [  CRSRES][1407371584] Attempting to stop `ora.rac03.LISTENER_RAC03.lsnr` on member `rac03`
2012-06-15 14:35:28.632: [  CRSRES][1407371584] Stop of `ora.rac03.LISTENER_RAC03.lsnr` on member `rac03` succeeded.
2012-06-15 14:35:28.683: [  CRSRES][1407371584] Attempting to start `ora.rac03.vip` on member `rac01`
2012-06-15 14:35:32.149: [  CRSRES][1407371584] Start of `ora.rac03.vip` on member `rac01` succeeded.
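The "within a few seconds" figure can be read straight off the timestamps. The sketch below is an illustration rather than an Oracle tool: it measures the interval from the line reporting that a resource went OFFLINE unexpectedly to the next line reporting a successful start of the same resource. The function name and the regular expression are assumptions based only on the line layout visible in this excerpt.

```python
import re
from datetime import datetime

# Timestamp, "[MODULE][thread-id]", then the message text.
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}): \[[^\]]*\]\[\d+\]\s*(.*)$"
)


def vip_failover_seconds(log_text: str, resource: str) -> float:
    """Seconds from '<resource> ... went OFFLINE unexpectedly' to its next successful start."""
    went_offline = None
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # lines without a timestamp are ignored
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
        msg = m.group(2)
        if went_offline is None:
            if msg.startswith(resource + " ") and "went OFFLINE unexpectedly" in msg:
                went_offline = ts
        elif msg.startswith(f"Start of `{resource}`") and "succeeded" in msg:
            return (ts - went_offline).total_seconds()
    raise ValueError(f"no complete failover found for {resource}")
```

For the excerpt above, ora.rac03.vip goes offline at 14:35:28.078 and is started on rac01 at 14:35:32.149, so the function returns about 4.1 seconds.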

The CRS status now shows that the VIP has moved to the first node and that the third node's listener is OFFLINE:
RAC02:oracle:db2 > crs_stat -t
Name           Type           Target    State     Host        
------------------------------------------------------------
ora.db.db application    ONLINE    ONLINE    rac...ac03 
ora....b1.inst application    ONLINE    ONLINE    rac...ac01 
ora....b2.inst application    ONLINE    ONLINE    rac...ac02 
ora....b3.inst application    ONLINE    ONLINE    rac...ac03 
ora....SM1.asm application    ONLINE    ONLINE    rac...ac01 
ora....01.lsnr application    ONLINE    ONLINE    rac...ac01 
ora....c01.gsd application    ONLINE    ONLINE    rac...ac01 
ora....c01.ons application    ONLINE    ONLINE    rac...ac01 
ora....c01.vip application    ONLINE    ONLINE    rac...ac01 
ora....SM2.asm application    ONLINE    ONLINE    rac...ac02 
ora....02.lsnr application    ONLINE    ONLINE    rac...ac02 
ora....c02.gsd application    ONLINE    ONLINE    rac...ac02 
ora....c02.ons application    ONLINE    ONLINE    rac...ac02 
ora....c02.vip application    ONLINE    ONLINE    rac...ac02 
ora....SM3.asm application    ONLINE    ONLINE    rac...ac03 
ora....03.lsnr application    ONLINE    OFFLINE               
ora....c03.gsd application    ONLINE    ONLINE    rac...ac03 
ora....c03.ons application    ONLINE    ONLINE    rac...ac03 
ora....c03.vip application    ONLINE    ONLINE    rac...ac01

Then plug the third node's cable back in.
Manually start the listener service on that node:
RAC03:oracle:db3 > srvctl start listener -n RAC03


The log shows that, as the listener service started, the VIP was reassigned and moved back to its original node.
2012-06-15 14:35:28.374: [  CRSRES][1407371584] Attempting to stop `ora.rac03.LISTENER_RAC03.lsnr` on member `rac03`
2012-06-15 14:35:28.632: [  CRSRES][1407371584] Stop of `ora.rac03.LISTENER_RAC03.lsnr` on member `rac03` succeeded.
2012-06-15 14:35:28.683: [  CRSRES][1407371584] Attempting to start `ora.rac03.vip` on member `rac01`
2012-06-15 14:35:32.149: [  CRSRES][1407371584] Start of `ora.rac03.vip` on member `rac01` succeeded.
2012-06-15 14:45:44.548: [  CRSRES][1405270336] StopResource: setting CLI values
2012-06-15 14:45:44.562: [  CRSRES][1405270336] Attempting to stop `ora.rac03.vip` on member `rac01`
2012-06-15 14:45:44.816: [  CRSRES][1405270336] Stop of `ora.rac03.vip` on member `rac01` succeeded.
2012-06-15 14:45:44.819: [  CRSRES][1405270336] startRunnable: setting CLI values
2012-06-15 14:45:44.820: [  CRSRES][1405270336] Attempting to start `ora.rac03.vip` on member `rac03`
2012-06-15 14:45:48.831: [  CRSRES][1405270336] Start of `ora.rac03.vip` on member `rac03` succeeded.
2012-06-15 14:45:48.834: [  CRSRES][1405270336] startRunnable: setting CLI values
2012-06-15 14:45:48.850: [  CRSRES][1405270336] Attempting to start `ora.rac03.LISTENER_RAC03.lsnr` on member `rac03`
2012-06-15 14:45:52.868: [  CRSRES][1405270336] Start of `ora.rac03.LISTENER_RAC03.lsnr` on member `rac03` succeeded.
2012-06-15 14:45:52.878: [  CRSRES][1405270336] CRS-1002: Resource 'ora.rac03.LISTENER_RAC03.lsnr' is already running on member 'rac03'
The final, recovered state is as follows:
RAC02:oracle:db2 > crs_stat -t
Name           Type           Target    State     Host        
------------------------------------------------------------
ora.db.db application    ONLINE    ONLINE    rac...ac03 
ora....b1.inst application    ONLINE    ONLINE    rac...ac01 
ora....b2.inst application    ONLINE    ONLINE    rac...ac02 
ora....b3.inst application    ONLINE    ONLINE    rac...ac03 
ora....SM1.asm application    ONLINE    ONLINE    rac...ac01 
ora....01.lsnr application    ONLINE    ONLINE    rac...ac01 
ora....c01.gsd application    ONLINE    ONLINE    rac...ac01 
ora....c01.ons application    ONLINE    ONLINE    rac...ac01 
ora....c01.vip application    ONLINE    ONLINE    rac...ac01 
ora....SM2.asm application    ONLINE    ONLINE    rac...ac02 
ora....02.lsnr application    ONLINE    ONLINE    rac...ac02 
ora....c02.gsd application    ONLINE    ONLINE    rac...ac02 
ora....c02.ons application    ONLINE    ONLINE    rac...ac02 
ora....c02.vip application    ONLINE    ONLINE    rac...ac02 
ora....SM3.asm application    ONLINE    ONLINE    rac...ac03 
ora....03.lsnr application    ONLINE    ONLINE    rac...ac03 
ora....c03.gsd application    ONLINE    ONLINE    rac...ac03 
ora....c03.ons application    ONLINE    ONLINE    rac...ac03 
ora....c03.vip application    ONLINE    ONLINE    rac...ac03

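B. Unplugging the public network cables of the second and third nodes
Unplug the public network cables of the second and third nodes.
The log shows CRS detecting the public network failure on the second node, attempting a repair, and finally moving the VIP to the first node. The VIP failover completes within a few seconds.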
2012-06-15 14:48:38.768: [  CRSAPP][1409472832] CheckResource error for ora.rac02.vip error code = 1
2012-06-15 14:48:38.770: [  CRSRES][1409472832] In stateChanged, ora.rac02.vip target is ONLINE
2012-06-15 14:48:38.771: [  CRSRES][1409472832] ora.rac02.vip on rac02 went OFFLINE unexpectedly
2012-06-15 14:48:38.771: [  CRSRES][1409472832] StopResource: setting CLI values
2012-06-15 14:48:38.774: [  CRSRES][1409472832] Attempting to stop `ora.rac02.vip` on member `rac02`
2012-06-15 14:48:39.024: [  CRSRES][1409472832] Stop of `ora.rac02.vip` on member `rac02` succeeded.
2012-06-15 14:48:39.024: [  CRSRES][1409472832] ora.rac02.vip RESTART_COUNT=0 RESTART_ATTEMPTS=0
2012-06-15 14:48:39.027: [  CRSRES][1409472832] ora.rac02.vip failed on rac02 relocating.
2012-06-15 14:48:39.043: [  CRSRES][1409472832] StopResource: setting CLI values
2012-06-15 14:48:39.046: [  CRSRES][1409472832] Attempting to stop `ora.rac02.LISTENER_RAC02.lsnr` on member `rac02`
2012-06-15 14:48:39.312: [  CRSRES][1409472832] Stop of `ora.rac02.LISTENER_RAC02.lsnr` on member `rac02` succeeded.
2012-06-15 14:48:39.328: [  CRSRES][1409472832] Attempting to start `ora.rac02.vip` on member `rac01`
2012-06-15 14:48:42.591: [  CRSRES][1409472832] Start of `ora.rac02.vip` on member `rac01` succeeded.


2012-06-15 14:57:49.352: [  CRSRES][1405270336] StopResource: setting CLI values
2012-06-15 14:57:49.366: [  CRSRES][1405270336] Attempting to stop `ora.rac02.vip` on member `rac01`
2012-06-15 14:57:49.625: [  CRSRES][1405270336] Stop of `ora.rac02.vip` on member `rac01` succeeded.
2012-06-15 14:57:49.629: [  CRSRES][1405270336] startRunnable: setting CLI values
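The CRS service status at this point shows that the VIPs of the second and third nodes have moved to the first node and that the listeners on the second and third nodes are OFFLINE: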
RAC01:oracle:db1 > crs_stat -t
Name           Type           Target    State     Host        
------------------------------------------------------------
ora.db.db application    ONLINE    ONLINE    rac...ac03 
ora....b1.inst application    ONLINE    ONLINE    rac...ac01 
ora....b2.inst application    ONLINE    ONLINE    rac...ac02 
ora....b3.inst application    ONLINE    ONLINE    rac...ac03 
ora....SM1.asm application    ONLINE    ONLINE    rac...ac01 
ora....01.lsnr application    ONLINE    ONLINE    rac...ac01 
ora....c01.gsd application    ONLINE    ONLINE    rac...ac01 
ora....c01.ons application    ONLINE    ONLINE    rac...ac01 
ora....c01.vip application    ONLINE    ONLINE    rac...ac01 
ora....SM2.asm application    ONLINE    ONLINE    rac...ac02 
ora....02.lsnr application    ONLINE    OFFLINE               
ora....c02.gsd application    ONLINE    ONLINE    rac...ac02 
ora....c02.ons application    ONLINE    ONLINE    rac...ac02 
ora....c02.vip application    ONLINE    ONLINE    rac...ac01 
ora....SM3.asm application    ONLINE    ONLINE    rac...ac03 
ora....03.lsnr application    ONLINE    OFFLINE               
ora....c03.gsd application    ONLINE    ONLINE    rac...ac03 
ora....c03.ons application    ONLINE    ONLINE    rac...ac03 
ora....c03.vip application    ONLINE    ONLINE    rac...ac01

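After the cables are plugged back in and the affected listeners are restarted, the VIPs automatically move back to the second and third nodes: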
RAC01:oracle:db1 > crs_stat -t
Name           Type           Target    State     Host        
------------------------------------------------------------
ora.db.db application    ONLINE    ONLINE    rac...ac03 
ora....b1.inst application    ONLINE    ONLINE    rac...ac01 
ora....b2.inst application    ONLINE    ONLINE    rac...ac02 
ora....b3.inst application    ONLINE    ONLINE    rac...ac03 
ora....SM1.asm application    ONLINE    ONLINE    rac...ac01 
ora....01.lsnr application    ONLINE    ONLINE    rac...ac01 
ora....c01.gsd application    ONLINE    ONLINE    rac...ac01 
ora....c01.ons application    ONLINE    ONLINE    rac...ac01 
ora....c01.vip application    ONLINE    ONLINE    rac...ac01 
ora....SM2.asm application    ONLINE    ONLINE    rac...ac02 
ora....02.lsnr application    ONLINE    ONLINE    rac...ac02 
ora....c02.gsd application    ONLINE    ONLINE    rac...ac02 
ora....c02.ons application    ONLINE    ONLINE    rac...ac02 
ora....c02.vip application    ONLINE    ONLINE    rac...ac02 
ora....SM3.asm application    ONLINE    ONLINE    rac...ac03 
ora....03.lsnr application    ONLINE    ONLINE    rac...ac03 
ora....c03.gsd application    ONLINE    ONLINE    rac...ac03 
ora....c03.ons application    ONLINE    ONLINE    rac...ac03 
ora....c03.vip application    ONLINE    ONLINE    rac...ac03

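Test summary
After the public network cable is unplugged, CRS detects the network fault, attempts a repair, and then moves the VIP to a node that can still serve clients.
After the public network cable is reconnected, the VIP automatically moves back to its original node after about one minute. Alternatively, the listener service can be started immediately with srvctl start listener.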


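Interconnect NIC failure simulation test
Test procedure
On one RAC node, unplug the cable from the private network interface to simulate an interconnect outage.
Expected result
CRS should detect the cluster split (split-brain), and the node and its database instance will be evicted from the CRS cluster and the RAC cluster respectively.
In a two-node RAC cluster, the node with the lowest instance number survives.
In a multi-node RAC cluster, the largest subcluster survives.
If the subclusters are of equal size, the subcluster containing the lowest instance number survives.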
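The three survival rules listed above can be written as a single ordering. The Python sketch below only illustrates the rules as stated in this document; it is not Oracle's implementation, and representing each subcluster as a set of integer instance numbers is an assumption made for the example.

```python
from typing import List, Set


def surviving_subcluster(subclusters: List[Set[int]]) -> Set[int]:
    """Pick the survivor: the largest subcluster wins, ties go to the lowest number."""
    if not subclusters or any(not s for s in subclusters):
        raise ValueError("need at least one non-empty subcluster")
    # Sort key: larger size first, then smaller minimum instance number.
    return max(subclusters, key=lambda s: (len(s), -min(s)))
```

With the third node cut off from the interconnect in a three-node cluster, the split is {1, 2} versus {3}, so surviving_subcluster([{1, 2}, {3}]) returns {1, 2} and the third node is the one evicted, which matches the test record that follows.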
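Test record
Unplug the interconnect cable of the third node.
CRS then detects the interconnect failure and automatically reboots the third node.
The status is as follows; the VIP has moved to the first node: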
RAC02:oracle:db2 > crs_stat -t
Name           Type           Target    State     Host        
------------------------------------------------------------
ora.db.db application    ONLINE    ONLINE    rac...ac01 
ora....b1.inst application    ONLINE    ONLINE    rac...ac01 
ora....b2.inst application    ONLINE    ONLINE    rac...ac02 
ora....b3.inst application    ONLINE    OFFLINE               
ora....SM1.asm application    ONLINE    ONLINE    rac...ac01 
ora....01.lsnr application    ONLINE    ONLINE    rac...ac01 
ora....c01.gsd application    ONLINE    ONLINE    rac...ac01 
ora....c01.ons application    ONLINE    ONLINE    rac...ac01 
ora....c01.vip application    ONLINE    ONLINE    rac...ac01 
ora....SM2.asm application    ONLINE    ONLINE    rac...ac02 
ora....02.lsnr application    ONLINE    ONLINE    rac...ac02 
ora....c02.gsd application    ONLINE    ONLINE    rac...ac02 
ora....c02.ons application    ONLINE    ONLINE    rac...ac02 
ora....c02.vip application    ONLINE    ONLINE    rac...ac02 
ora....SM3.asm application    ONLINE    OFFLINE               
ora....03.lsnr application    ONLINE    OFFLINE               
ora....c03.gsd application    ONLINE    OFFLINE               
ora....c03.ons application    ONLINE    OFFLINE               
ora....c03.vip application    ONLINE    ONLINE    rac...ac01
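Because the interconnect recovered automatically once the third node had rebooted (confirmed with the vendor), the node rejoined the cluster automatically, all components started automatically, and the system returned to normal.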
2012-06-15 15:49:58.977: [  CLSVER][3482500800] Active Version from OCR:11.1.0.7.0
2012-06-15 15:49:58.977: [  CLSVER][3482500800] Active Version and Software Version are same
2012-06-15 15:49:58.977: [ CRSMAIN][3482500800] Initializing OCR
2012-06-15 15:49:58.984: [  OCRRAW][3482500800]proprioo: for disk 0 (/dev/raw/raw1), id match (1), my id set (1669906634,1028247821) total id sets (1), 1st set (1669906634,1028247821), 2nd set (0,0) my votes (2), total votes (2)
2012-06-15 15:49:59.013: [    CRSD][3482500800] ENV Logging level for Module: allcomp  0
2012-06-15 15:49:59.014: [    CRSD][3482500800] ENV Logging level for Module: default  0
2012-06-15 15:49:59.015: [    CRSD][3482500800] ENV Logging level for Module: COMMCRS  0
2012-06-15 15:49:59.016: [    CRSD][3482500800] ENV Logging level for Module: COMMNS  0
2012-06-15 15:49:59.016: [    CRSD][3482500800] ENV Logging level for Module: CRSUI  0
2012-06-15 15:49:59.017: [    CRSD][3482500800] ENV Logging level for Module: CRSCOMM  0
2012-06-15 15:49:59.018: [    CRSD][3482500800] ENV Logging level for Module: CRSRTI  0
2012-06-15 15:49:59.019: [    CRSD][3482500800] ENV Logging level for Module: CRSMAIN  0
2012-06-15 15:49:59.020: [    CRSD][3482500800] ENV Logging level for Module: CRSPLACE  0
2012-06-15 15:49:59.021: [    CRSD][3482500800] ENV Logging level for Module: CRSAPP  0
2012-06-15 15:49:59.022: [    CRSD][3482500800] ENV Logging level for Module: CRSRES  0
2012-06-15 15:49:59.023: [    CRSD][3482500800] ENV Logging level for Module: CRSOCR  0
2012-06-15 15:49:59.024: [    CRSD][3482500800] ENV Logging level for Module: CRSTIMER  0
2012-06-15 15:49:59.024: [    CRSD][3482500800] ENV Logging level for Module: CRSEVT  0
2012-06-15 15:49:59.025: [    CRSD][3482500800] ENV Logging level for Module: CRSD  0
2012-06-15 15:49:59.026: [    CRSD][3482500800] ENV Logging level for Module: CLUCLS  0
2012-06-15 15:49:59.027: [    CRSD][3482500800] ENV Logging level for Module: CLSVER  0
2012-06-15 15:49:59.028: [    CRSD][3482500800] ENV Logging level for Module: OCRRAW  0
2012-06-15 15:49:59.029: [    CRSD][3482500800] ENV Logging level for Module: OCROSD  0
2012-06-15 15:49:59.030: [    CRSD][3482500800] ENV Logging level for Module: OCRCAC  0
2012-06-15 15:49:59.031: [    CRSD][3482500800] ENV Logging level for Module: CSSCLNT  0
2012-06-15 15:49:59.032: [    CRSD][3482500800] ENV Logging level for Module: OCRAPI  0
2012-06-15 15:49:59.033: [    CRSD][3482500800] ENV Logging level for Module: OCRUTL  0
2012-06-15 15:49:59.034: [    CRSD][3482500800] ENV Logging level for Module: OCRMSG  0
2012-06-15 15:49:59.034: [    CRSD][3482500800] ENV Logging level for Module: OCRCLI  0
2012-06-15 15:49:59.035: [    CRSD][3482500800] ENV Logging level for Module: OCRSRV  0
2012-06-15 15:49:59.036: [    CRSD][3482500800] ENV Logging level for Module: OCRMAS  0
2012-06-15 15:49:59.036: [ CRSMAIN][3482500800] Filename is /home/crs/crs/init/rac03.pid
[  clsdmt][1333913920]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=rac03DBG_CRSD))
2012-06-15 15:49:59.057: [ CRSMAIN][3482500800] Using Authorizer location: /home/crs/crs/auth/
2012-06-15 15:49:59.065: [ CRSMAIN][3482500800] Initializing RTI
2012-06-15 15:49:59.077: [ CRSMAIN][3482500800] Initializing EVMMgr
2012-06-15 15:49:59.077: [CRSTIMER][1350699328] Timer Thread Starting.
2012-06-15 15:49:59.293: [ COMMCRS][1359092032]clsc_connect: (0x2aaaab01b3d0) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=SYSTEM.evm.acceptor.auth))

2012-06-15 15:49:59.765: [ COMMCRS][1359092032]clsc_connect: (0x2aaaab01b3d0) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=SYSTEM.evm.acceptor.auth))

2012-06-15 15:50:00.886: [ CRSMAIN][3482500800] CRSD locked during state recovery, please wait.
2012-06-15 15:50:00.953: [ CRSMAIN][3482500800] CRSD recovered, unlocked.
2012-06-15 15:50:00.954: [ CRSMAIN][3482500800] QS socket on: (ADDRESS=(PROTOCOL=ipc)(KEY=ora_crsqs))
2012-06-15 15:50:00.958: [ CRSMAIN][3482500800] CRSD UI socket on: (ADDRESS=(PROTOCOL=ipc)(KEY=CRSD_UI_SOCKET))
2012-06-15 15:50:00.960: [ CRSMAIN][3482500800] E2E socket on: (ADDRESS=(PROTOCOL=tcp)(HOST=rac03_priv)(PORT=49896))
2012-06-15 15:50:00.960: [ CRSMAIN][3482500800] Starting Threads
2012-06-15 15:50:00.960: [ CRSMAIN][3482500800] CRS Daemon Started.

Check the CRS status:
RAC02:oracle:db2 > crs_stat -t
Name           Type           Target    State     Host        
------------------------------------------------------------
ora.db.db application    ONLINE    ONLINE    rac...ac01 
ora....b1.inst application    ONLINE    ONLINE    rac...ac01 
ora....b2.inst application    ONLINE    ONLINE    rac...ac02 
ora....b3.inst application    ONLINE    ONLINE    rac...ac03 
ora....SM1.asm application    ONLINE    ONLINE    rac...ac01 
ora....01.lsnr application    ONLINE    ONLINE    rac...ac01 
ora....c01.gsd application    ONLINE    ONLINE    rac...ac01 
ora....c01.ons application    ONLINE    ONLINE    rac...ac01 
ora....c01.vip application    ONLINE    ONLINE    rac...ac01 
ora....SM2.asm application    ONLINE    ONLINE    rac...ac02 
ora....02.lsnr application    ONLINE    ONLINE    rac...ac02 
ora....c02.gsd application    ONLINE    ONLINE    rac...ac02 
ora....c02.ons application    ONLINE    ONLINE    rac...ac02 
ora....c02.vip application    ONLINE    ONLINE    rac...ac02 
ora....SM3.asm application    ONLINE    ONLINE    rac...ac03 
ora....03.lsnr application    ONLINE    ONLINE    rac...ac03 
ora....c03.gsd application    ONLINE    ONLINE    rac...ac03 
ora....c03.ons application    ONLINE    ONLINE    rac...ac03 
ora....c03.vip application    ONLINE    ONLINE    rac...ac03

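Test summary
When the interconnect is interrupted, CRS detects the cluster split and first tries to reboot the failed node. If the node's interconnect recovers, the node rejoins the cluster automatically and returns to normal.
If the interconnect cannot be restored, the failed node is evicted according to fixed rules.
In a two-node RAC cluster, the node with the lowest instance number survives.
Once the interconnect is restored, the failed node must be rebooted so that its services start normally.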


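Oracle listener crash simulation test
Test procedure
Use 'kill -9 <listener PID>' to simulate a failure of the listener process.
Expected result
CRS detects the abnormal listener process and automatically tries to restart the listener service. The listener is restarted within a very short time. No more than 5 restart attempts are made.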
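The restart limit corresponds to the RESTART_COUNT and RESTART_ATTEMPTS values that appear in the crsd.log excerpts of this test. The following is a simplified Python model of that bookkeeping, offered as a sketch only: the class name, the default threshold of 3600 seconds, and the rule that the counter resets once the resource has been up longer than the threshold are assumptions, not behaviour verified by this test.

```python
from dataclasses import dataclass


@dataclass
class RestartPolicy:
    restart_attempts: int = 5           # RESTART_ATTEMPTS=5 as seen in the log
    uptime_threshold_s: float = 3600.0  # hypothetical value, not given in the test
    restart_count: int = 0              # mirrors RESTART_COUNT

    def on_failure(self, uptime_s: float) -> str:
        """Return 'restart' or 'give_up' for a resource that has just failed."""
        if uptime_s > self.uptime_threshold_s:
            # Assumed rule: a long enough stable run clears earlier failures.
            self.restart_count = 0
        if self.restart_count >= self.restart_attempts:
            return "give_up"
        self.restart_count += 1
        return "restart"
```

In this model a resource with restart_attempts set to 0 is never restarted in place, which is consistent with the VIP log lines earlier in this document (RESTART_ATTEMPTS=0 followed by relocation), while the listener, with 5 attempts, is restarted locally and its counter goes from 0 to 1 and then from 1 to 2.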
Test record
Kill the listener on one node:
RAC03:oracle:db3 > ps -fe |grep tns
oracle   19815     1  0 11:03 ?        00:00:00 /home/oracle/product/11g/bin/tnslsnr LISTENER_RAC03 -inherit
oracle   19961 30903  0 11:03 pts/0    00:00:00 grep tns
RAC03:oracle:db3 > kill -9 19815
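The crsd log shows that CRS detected the listener failure and repaired it immediately. About 10 minutes passed between the kill and CRS detecting the failure; the restart itself then completed in under a second.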
2012-06-14 11:13:41.159: [  CRSAPP][1419966784] CheckResource error for ora.rac03.LISTENER_RAC03.lsnr error code = 1
2012-06-14 11:13:41.162: [  CRSRES][1419966784] In stateChanged, ora.rac03.LISTENER_RAC03.lsnr target is ONLINE
2012-06-14 11:13:41.162: [  CRSRES][1419966784] ora.rac03.LISTENER_RAC03.lsnr on rac03 went OFFLINE unexpectedly
2012-06-14 11:13:41.162: [  CRSRES][1419966784] StopResource: setting CLI values
2012-06-14 11:13:41.170: [  CRSRES][1419966784] Attempting to stop `ora.rac03.LISTENER_RAC03.lsnr` on member `rac03`
2012-06-14 11:13:41.432: [  CRSRES][1419966784] Stop of `ora.rac03.LISTENER_RAC03.lsnr` on member `rac03` succeeded.
2012-06-14 11:13:41.432: [  CRSRES][1419966784] ora.rac03.LISTENER_RAC03.lsnr RESTART_COUNT=0 RESTART_ATTEMPTS=5
2012-06-14 11:13:41.432: [  CRSRES][1419966784] Restarting ora.rac03.LISTENER_RAC03.lsnr on rac03
2012-06-14 11:13:41.435: [  CRSRES][1419966784] startRunnable: setting CLI values
2012-06-14 11:13:41.436: [  CRSRES][1419966784] Attempting to start `ora.rac03.LISTENER_RAC03.lsnr` on member `rac03`
2012-06-14 11:13:41.705: [  CRSRES][1419966784] Start of `ora.rac03.LISTENER_RAC03.lsnr` on member `rac03` succeeded.
2012-06-14 11:13:41.705: [  CRSRES][1419966784] Successfully restarted ora.rac03.LISTENER_RAC03.lsnr on rac03, RESTART_COUNT=1
2012-06-14 11:13:41.713: [  CRSRES][1419966784] ora.rac03.LISTENER_RAC03.lsnr Updated LAST_RESTART time in ocr


Kill the listeners on two nodes:
RAC02:oracle:db2 > ps -fe |grep tns
oracle   25826     1  0 Jun12 ?        00:00:00 /home/oracle/product/11g/bin/tnslsnr LISTENER_RAC02 -inherit
oracle   29140 29079  0 11:17 pts/0    00:00:00 grep tns
RAC02:oracle:db2 > kill -9 25826

RAC03:oracle:db3 > ps -fe |grep tns
oracle   25751     1  0 11:13 ?        00:00:00 /home/oracle/product/11g/bin/tnslsnr LISTENER_RAC03 -inherit
oracle   27596 30903  0 11:16 pts/0    00:00:00 grep tns
RAC03:oracle:db3 > kill -9 25751
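The crsd log shows that CRS detected the listener failure and repaired it immediately. About 6 minutes passed from the kill until the repair completed. The log also records the number of restarts performed so far (RESTART_COUNT=1).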
2012-06-14 11:23:41.950: [  CRSAPP][1422068032] CheckResource error for ora.rac03.LISTENER_RAC03.lsnr error code = 1
2012-06-14 11:23:41.952: [  CRSRES][1422068032] In stateChanged, ora.rac03.LISTENER_RAC03.lsnr target is ONLINE
2012-06-14 11:23:41.952: [  CRSRES][1422068032] ora.rac03.LISTENER_RAC03.lsnr on rac03 went OFFLINE unexpectedly
2012-06-14 11:23:41.953: [  CRSRES][1422068032] StopResource: setting CLI values
2012-06-14 11:23:41.956: [  CRSRES][1422068032] Attempting to stop `ora.rac03.LISTENER_RAC03.lsnr` on member `rac03`
2012-06-14 11:23:42.216: [  CRSRES][1422068032] Stop of `ora.rac03.LISTENER_RAC03.lsnr` on member `rac03` succeeded.
2012-06-14 11:23:42.216: [  CRSRES][1422068032] ora.rac03.LISTENER_RAC03.lsnr RESTART_COUNT=1 RESTART_ATTEMPTS=5
2012-06-14 11:23:42.216: [  CRSRES][1422068032] ora.rac03.LISTENER_RAC03.lsnr Uptime does not exceed uptime_threshold
2012-06-14 11:23:42.217: [  CRSRES][1422068032] Restarting ora.rac03.LISTENER_RAC03.lsnr on rac03
2012-06-14 11:23:42.220: [  CRSRES][1422068032] startRunnable: setting CLI values
2012-06-14 11:23:42.220: [  CRSRES][1422068032] Attempting to start `ora.rac03.LISTENER_RAC03.lsnr` on member `rac03`
2012-06-14 11:23:42.481: [  CRSRES][1422068032] Start of `ora.rac03.LISTENER_RAC03.lsnr` on member `rac03` succeeded.
2012-06-14 11:23:42.481: [  CRSRES][1422068032] Successfully restarted ora.rac03.LISTENER_RAC03.lsnr on rac03, RESTART_COUNT=2
2012-06-14 11:23:42.486: [  CRSRES][1422068032] ora.rac03.LISTENER_RAC03.lsnr Updated LAST_RESTART time in ocr
Checking the service status shows that everything is back to normal:
RAC01:oracle:db1 > crs_stat -t
Name           Type           Target    State     Host        
------------------------------------------------------------
ora.db.db application    ONLINE    ONLINE    rac...ac01 
ora....b1.inst application    ONLINE    ONLINE    rac...ac01 
ora....b2.inst application    ONLINE    ONLINE    rac...ac02 
ora....b3.inst application    ONLINE    ONLINE    rac...ac03 
ora....SM1.asm application    ONLINE    ONLINE    rac...ac01 
ora....01.lsnr application    ONLINE    ONLINE    rac...ac01 
ora....c01.gsd application    ONLINE    ONLINE    rac...ac01 
ora....c01.ons application    ONLINE    ONLINE    rac...ac01 
ora....c01.vip application    ONLINE    ONLINE    rac...ac01 
ora....SM2.asm application    ONLINE    ONLINE    rac...ac02 
ora....02.lsnr application    ONLINE    ONLINE    rac...ac02 
ora....c02.gsd application    ONLINE    ONLINE    rac...ac02 
ora....c02.ons application    ONLINE    ONLINE    rac...ac02 
ora....c02.vip application    ONLINE    ONLINE    rac...ac02 
ora....SM3.asm application    ONLINE    ONLINE    rac...ac03 
ora....03.lsnr application    ONLINE    ONLINE    rac...ac03 
ora....c03.gsd application    ONLINE    ONLINE    rac...ac03 
ora....c03.ons application    ONLINE    ONLINE    rac...ac03 
ora....c03.vip application    ONLINE    ONLINE    rac...ac03  

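Test summary
The cluster detects the loss of a listener that has been killed and restarts the listener service automatically. There is no noticeable impact on production.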

Oracle CRS failure simulation tests

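CRSD process crash simulation test
Test procedure
Use 'kill -9 <crsd PID>' to simulate a failure of the CRSD process.
Expected result
The CRSD process is restarted immediately.
Test record
Run the test on the third node. Check the service status and the state of the crsd process: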
RAC03:oracle:db3 > crs_stat -t
Name           Type           Target    State     Host        
------------------------------------------------------------
ora.db.db application    ONLINE    ONLINE    rac...ac01 
ora....b1.inst application    ONLINE    ONLINE    rac...ac01 
ora....b2.inst application    ONLINE    ONLINE    rac...ac02 
ora....b3.inst application    ONLINE    ONLINE    rac...ac03 
ora....SM1.asm application    ONLINE    ONLINE    rac...ac01 
ora....01.lsnr application    ONLINE    ONLINE    rac...ac01 
ora....c01.gsd application    ONLINE    ONLINE    rac...ac01 
ora....c01.ons application    ONLINE    ONLINE    rac...ac01 
ora....c01.vip application    ONLINE    ONLINE    rac...ac01 
ora....SM2.asm application    ONLINE    ONLINE    rac...ac02 
ora....02.lsnr application    ONLINE    ONLINE    rac...ac02 
ora....c02.gsd application    ONLINE    ONLINE    rac...ac02 
ora....c02.ons application    ONLINE    ONLINE    rac...ac02 
ora....c02.vip application    ONLINE    ONLINE    rac...ac02 
ora....SM3.asm application    ONLINE    ONLINE    rac...ac03 
ora....03.lsnr application    ONLINE    ONLINE    rac...ac03 
ora....c03.gsd application    ONLINE    ONLINE    rac...ac03 
ora....c03.ons application    ONLINE    ONLINE    rac...ac03 
ora....c03.vip application    ONLINE    ONLINE    rac...ac03 
RAC03:oracle:db3 > ps -fe |grep crsd
root       358  8055  0 14:10 pts/1    00:00:00 tail -200f crsd.log
oracle    1070 30903  0 14:11 pts/0    00:00:00 grep crsd
root     10870     1  0 Jun12 ?        00:00:00 /bin/sh /etc/init.d/init.crsd run
root     11981 10870  0 Jun12 ?        00:02:23 /home/crs/bin/crsd.bin reboot
RAC03:oracle:db3 > exit
logout
RAC03:~ # kill -9 11981
RAC03:~ # ps -fe |grep crsd
root       358  8055  0 14:10 pts/1    00:00:00 tail -200f crsd.log
root      1816     1  0 14:12 ?        00:00:00 /bin/sh /etc/init.d/init.crsd run
root      2180  1816  4 14:12 ?        00:00:00 /home/crs/bin/crsd.bin restart
root      2378 30842  0 14:12 pts/0    00:00:00 grep crsd  
The service restarts immediately. The crsd.log shows crsd starting and completing recovery within a very short time:
2012-06-14 14:12:23.039: [ CRSMAIN][1951506112] Filename is /home/crs/crs/init/rac03.pid
[  clsdmt][1333913920]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=rac03DBG_CRSD))
2012-06-14 14:12:23.045: [ CRSMAIN][1951506112] Using Authorizer location: /home/crs/crs/auth/
2012-06-14 14:12:23.053: [ CRSMAIN][1951506112] Initializing RTI
2012-06-14 14:12:23.065: [ CRSMAIN][1951506112] Initializing EVMMgr
2012-06-14 14:12:23.065: [CRSTIMER][1350699328] Timer Thread Starting.
2012-06-14 14:12:23.106: [ CRSMAIN][1951506112] CRSD locked during state recovery, please wait.
2012-06-14 14:12:23.417: [  CRSRES][1951506112] ora.rac03.vip check shows ONLINE
2012-06-14 14:12:23.718: [  CRSRES][1951506112] ora.rac03.gsd check shows ONLINE
2012-06-14 14:12:24.010: [  CRSRES][1951506112] ora.rac03.ons check shows ONLINE
2012-06-14 14:12:24.318: [  CRSRES][1951506112] ora.rac03.LISTENER_RAC03.lsnr check shows ONLINE
2012-06-14 14:12:24.631: [  CRSRES][1951506112] ora.rac03.ASM3.asm check shows ONLINE
2012-06-14 14:12:25.386: [  CRSRES][1951506112] ora.db.db3.inst check shows ONLINE
2012-06-14 14:12:25.388: [ CRSMAIN][1951506112] CRSD recovered, unlocked.
2012-06-14 14:12:25.389: [ CRSMAIN][1951506112] QS socket on: (ADDRESS=(PROTOCOL=ipc)(KEY=ora_crsqs))
2012-06-14 14:12:25.392: [ CRSMAIN][1951506112] CRSD UI socket on: (ADDRESS=(PROTOCOL=ipc)(KEY=CRSD_UI_SOCKET))
2012-06-14 14:12:25.394: [ CRSMAIN][1951506112] E2E socket on: (ADDRESS=(PROTOCOL=tcp)(HOST=rac03_priv)(PORT=49896))
2012-06-14 14:12:25.394: [ CRSMAIN][1951506112] Starting Threads
2012-06-14 14:12:25.394: [ CRSMAIN][1951506112] CRS Daemon Started.
2012-06-14 14:12:25.394: [ CRSMAIN][1394764096] Starting runCommandServer for (UI = 1, E2E = 0). 0
2012-06-14 14:12:25.394: [ CRSMAIN][1396865344] Starting runCommandServer for (UI = 1, E2E = 0). 1
2012-06-14 14:12:25.425: [  CRSRES][1405270336] CRS-1002: Resource 'ora.db.db' is already running on member 'rac01'
Next, repeat the crsd kill test on node 2.
RAC02:oracle:db2 > ps -fe |grep crsd
oracle    6563  6349  0 14:18 pts/1    00:00:00 tail -200f crsd.log
oracle    6706 29079  0 14:18 pts/0    00:00:00 grep crsd
root     10851     1  0 Jun12 ?        00:00:00 /bin/sh /etc/init.d/init.crsd run
root     11949 10851  0 Jun12 ?        00:02:11 /home/crs/bin/crsd.bin reboot
RAC02:oracle:db2 > exit
logout
RAC02:~ # kill -9 11949
The crsd.log on node 2 shows that CRS recovered in under 10 seconds.
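Test Summary
When the crsd process fails abnormally, the cluster detects it and restarts it automatically within 10 seconds. The database as a whole continues to serve clients without interruption.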
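As a rough cross-check of the restart time, the crsd.log timestamps can be subtracted directly. The following is a minimal Python sketch, not part of the original test. The helper names `timestamp_of` and `seconds_between` are invented for this illustration, and the sample lines are copied from the node 3 crsd.log excerpt. It measures only from the first log line written by the new crsd to "CRS Daemon Started", so it excludes the time init took to respawn the process.

```python
from datetime import datetime

# Sample lines copied from the node 3 crsd.log excerpt in this test.
CRSD_LOG = """\
2012-06-14 14:12:23.039: [ CRSMAIN][1951506112] Filename is /home/crs/crs/init/rac03.pid
2012-06-14 14:12:23.106: [ CRSMAIN][1951506112] CRSD locked during state recovery, please wait.
2012-06-14 14:12:25.388: [ CRSMAIN][1951506112] CRSD recovered, unlocked.
2012-06-14 14:12:25.394: [ CRSMAIN][1951506112] CRS Daemon Started.
"""

TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def timestamp_of(line):
    """Return the leading timestamp of a crsd.log line, or None if it has none."""
    try:
        return datetime.strptime(line[:23], TS_FORMAT)
    except ValueError:
        return None


def seconds_between(log_text, start_marker, end_marker):
    """Seconds from the first line containing start_marker to the first
    later line containing end_marker."""
    start = None
    for line in log_text.splitlines():
        ts = timestamp_of(line)
        if ts is None:
            continue  # continuation lines such as the clsdmt entry carry no timestamp
        if start is None and start_marker in line:
            start = ts
        elif start is not None and end_marker in line:
            return (ts - start).total_seconds()
    raise ValueError("markers not found in order")


elapsed = seconds_between(CRSD_LOG, "Filename is", "CRS Daemon Started")
print(f"crsd restart took {elapsed:.3f} s")
```

The same function can be pointed at other marker pairs, for example "CRSD locked during state recovery" and "CRSD recovered, unlocked", to time only the state recovery phase.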


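OCSSD Process Crash Simulation
Simulation Steps
Use 'kill -9 <ocssd PID>' to simulate an OCSSD process failure.
Expected Result
The node will reboot.
Test Record
The test is run on node 3. Record the status beforehand: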
RAC03:oracle:db3 > crs_stat -t
Name           Type           Target    State     Host        
------------------------------------------------------------
ora.db.db application    ONLINE    ONLINE    rac...ac01 
ora....b1.inst application    ONLINE    ONLINE    rac...ac01 
ora....b2.inst application    ONLINE    ONLINE    rac...ac02 
ora....b3.inst application    ONLINE    ONLINE    rac...ac03 
ora....SM1.asm application    ONLINE    ONLINE    rac...ac01 
ora....01.lsnr application    ONLINE    ONLINE    rac...ac01 
ora....c01.gsd application    ONLINE    ONLINE    rac...ac01 
ora....c01.ons application    ONLINE    ONLINE    rac...ac01 
ora....c01.vip application    ONLINE    ONLINE    rac...ac01 
ora....SM2.asm application    ONLINE    ONLINE    rac...ac02 
ora....02.lsnr application    ONLINE    ONLINE    rac...ac02 
ora....c02.gsd application    ONLINE    ONLINE    rac...ac02 
ora....c02.ons application    ONLINE    ONLINE    rac...ac02 
ora....c02.vip application    ONLINE    ONLINE    rac...ac02 
ora....SM3.asm application    ONLINE    ONLINE    rac...ac03 
ora....03.lsnr application    ONLINE    ONLINE    rac...ac03 
ora....c03.gsd application    ONLINE    ONLINE    rac...ac03 
ora....c03.ons application    ONLINE    ONLINE    rac...ac03 
ora....c03.vip application    ONLINE    ONLINE    rac...ac03 
RAC03:oracle:db3 > exit
logout
RAC03:~ # ps -fe |grep ocssd
root      8304 30842  0 14:22 pts/0    00:00:00 grep ocssd
oracle   12969 12042  0 Jun12 ?        00:02:13 /home/crs/bin/ocssd.bin
RAC03:~ # kill -9 12969


After the ocssd process on node 3 is killed, the status seen from node 2 is:
RAC02:oracle:db2 > crs_stat -t
Name           Type           Target    State     Host        
------------------------------------------------------------
ora.db.db application    ONLINE    ONLINE    rac...ac01 
ora....b1.inst application    ONLINE    ONLINE    rac...ac01 
ora....b2.inst application    ONLINE    ONLINE    rac...ac02 
ora....b3.inst application    ONLINE    OFFLINE               
ora....SM1.asm application    ONLINE    ONLINE    rac...ac01 
ora....01.lsnr application    ONLINE    ONLINE    rac...ac01 
ora....c01.gsd application    ONLINE    ONLINE    rac...ac01 
ora....c01.ons application    ONLINE    ONLINE    rac...ac01 
ora....c01.vip application    ONLINE    ONLINE    rac...ac01 
ora....SM2.asm application    ONLINE    ONLINE    rac...ac02 
ora....02.lsnr application    ONLINE    ONLINE    rac...ac02 
ora....c02.gsd application    ONLINE    ONLINE    rac...ac02 
ora....c02.ons application    ONLINE    ONLINE    rac...ac02 
ora....c02.vip application    ONLINE    ONLINE    rac...ac02 
ora....SM3.asm application    ONLINE    OFFLINE               
ora....03.lsnr application    ONLINE    OFFLINE               
ora....c03.gsd application    ONLINE    OFFLINE               
ora....c03.ons application    ONLINE    OFFLINE               
ora....c03.vip application    ONLINE    ONLINE    rac...ac01
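
A listing like this can also be checked mechanically rather than by eye. The sketch below is a hypothetical Python helper, not something the original test used. It parses `crs_stat -t` rows and reports resources whose target is ONLINE but whose state is not, plus VIPs running on a host other than their own. It assumes the truncated names printed by `crs_stat -t` still end in the node number, as they do in this cluster.

```python
import re

# Rows copied from the crs_stat -t listing taken while node 3 was down
# (crs_stat itself truncates the resource and host names).
CRS_STAT = """\
Name           Type           Target    State     Host
------------------------------------------------------------
ora....b3.inst application    ONLINE    OFFLINE
ora....c01.vip application    ONLINE    ONLINE    rac...ac01
ora....c02.vip application    ONLINE    ONLINE    rac...ac02
ora....SM3.asm application    ONLINE    OFFLINE
ora....03.lsnr application    ONLINE    OFFLINE
ora....c03.vip application    ONLINE    ONLINE    rac...ac01
"""


def parse_rows(text):
    """Turn crs_stat -t rows into dicts; OFFLINE rows have no host column."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4 or not fields[0].startswith("ora."):
            continue  # skip the header, the separator and blank lines
        name, _type, target, state = fields[:4]
        host = fields[4] if len(fields) > 4 else None
        rows.append({"name": name, "target": target, "state": state, "host": host})
    return rows


def offline_resources(rows):
    """Resources that should be ONLINE but are not."""
    return [r["name"] for r in rows if r["target"] == "ONLINE" and r["state"] != "ONLINE"]


def relocated_vips(rows):
    """VIPs whose trailing node number differs from the host's node number."""
    moved = []
    for r in rows:
        vip = re.search(r"(\d+)\.vip$", r["name"])
        host = re.search(r"(\d+)$", r["host"] or "")
        if vip and host and vip.group(1) != host.group(1):
            moved.append((r["name"], r["host"]))
    return moved


rows = parse_rows(CRS_STAT)
print("offline:", offline_resources(rows))
print("relocated VIPs:", relocated_vips(rows))
```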


2012-06-14 14:25:14.269: [  CRSRES][1415764288] Attempting to start `ora.rac03.vip` on member `rac01`
2012-06-14 14:25:14.624: [  CRSRES][1415764288] Start of `ora.rac03.vip` on member `rac01` succeeded.
2012-06-14 14:25:14.637: [  CRSEVT][1405270336] Post recovery done evmd event for: rac03
2012-06-14 14:25:14.637: [    CRSD][1405270336] SM: recoveryDone: 0
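The node 3 VIP has migrated to node 1, so client connections are not affected. Node 3 has begun rebooting and all of its nodeapps are OFFLINE.
After node 3 rebooted, the CRS log shows that the VIP and nodeapps started successfully but the database instance failed to start. The log shows why: during the automatic restart CRS tried to start the instance ahead of ASM, and the instance cannot start until ASM has finished starting.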
2012-06-14 14:38:00.991: [  CRSRES][1403169088] Attempting to stop `ora.rac03.vip` on member `rac01`
2012-06-14 14:38:01.005: [  CRSRES][1407371584] startRunnable: setting CLI values
2012-06-14 14:38:01.016: [  CRSRES][1409472832] startRunnable: setting CLI values
2012-06-14 14:38:01.020: [  CRSRES][1407371584] Attempting to start `ora.db.db3.inst` on member `rac03`
2012-06-14 14:38:01.023: [  CRSRES][1409472832] Attempting to start `ora.rac03.ASM3.asm` on member `rac03`
2012-06-14 14:38:01.256: [  CRSRES][1403169088] Stop of `ora.rac03.vip` on member `rac01` succeeded.
2012-06-14 14:38:01.270: [  CRSRES][1403169088] startRunnable: setting CLI values
2012-06-14 14:38:01.270: [  CRSRES][1403169088] Attempting to start `ora.rac03.vip` on member `rac03`
2012-06-14 14:38:05.834: [  CRSRES][1403169088] Start of `ora.rac03.vip` on member `rac03` succeeded.
2012-06-14 14:38:05.864: [  CRSRES][1403169088] startRunnable: setting CLI values
2012-06-14 14:38:05.870: [  CRSRES][1403169088] Attempting to start `ora.rac03.LISTENER_RAC03.lsnr` on member `rac03`
2012-06-14 14:38:06.157: [  CRSRES][1440954688] CRS-1002: Resource 'ora.rac03.vip' is already running on member 'rac03'

2012-06-14 14:38:07.410: [  CRSAPP][1407371584] StartResource error for ora.db.db3.inst error code = 1
2012-06-14 14:38:08.925: [  CRSRES][1407371584] Start of `ora.db.db3.inst` on member `rac03` failed.
2012-06-14 14:38:09.264: [  CRSRES][1403169088] Start of `ora.rac03.LISTENER_RAC03.lsnr` on member `rac03` succeeded.
2012-06-14 14:38:09.952: [  CRSRES][1405270336] CRS-1002: Resource 'ora.rac03.LISTENER_RAC03.lsnr' is already running on member 'rac03'

2012-06-14 14:38:10.939: [  CRSRES][1405270336] startRunnable: setting CLI values
2012-06-14 14:38:10.957: [  CRSRES][1405270336] Attempting to start `ora.rac03.ons` on member `rac03`
2012-06-14 14:38:12.433: [  CRSRES][1405270336] Start of `ora.rac03.ons` on member `rac03` succeeded.
2012-06-14 14:38:13.462: [  CRSRES][1409472832] Start of `ora.rac03.ASM3.asm` on member `rac03` succeeded.
2012-06-14 14:38:13.463: [  CRSRES][1409472832] Skip online resource: ora.rac03.ons
2012-06-14 14:38:13.481: [  CRSRES][1411574080] startRunnable: setting CLI values
2012-06-14 14:38:13.484: [  CRSRES][1411574080] Attempting to start `ora.rac03.gsd` on member `rac03`
2012-06-14 14:38:13.794: [  CRSRES][1411574080] Start of `ora.rac03.gsd` on member `rac03` succeeded.
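The ordering problem in this log can be detected automatically. The following Python sketch is an illustration only; the function names are invented here. It scans the "Start of ... succeeded/failed" lines and flags any instance resource whose start failed on a node before that node's ASM resource reported a successful start. It assumes resource names end in `.inst` and `.asm`, as they do in this cluster.

```python
import re

# Lines copied from the node 3 crsd.log excerpt after the reboot.
CRSD_LOG = """\
2012-06-14 14:38:01.020: [  CRSRES][1407371584] Attempting to start `ora.db.db3.inst` on member `rac03`
2012-06-14 14:38:01.023: [  CRSRES][1409472832] Attempting to start `ora.rac03.ASM3.asm` on member `rac03`
2012-06-14 14:38:08.925: [  CRSRES][1407371584] Start of `ora.db.db3.inst` on member `rac03` failed.
2012-06-14 14:38:13.462: [  CRSRES][1409472832] Start of `ora.rac03.ASM3.asm` on member `rac03` succeeded.
"""

START_RESULT = re.compile(
    r"^(?P<ts>\S+ \S+): .*Start of `(?P<res>[^`]+)` on member `(?P<node>[^`]+)` "
    r"(?P<result>succeeded|failed)\."
)


def start_results(log_text):
    """Return (timestamp, resource, node, result) for each completed start, in log order."""
    events = []
    for line in log_text.splitlines():
        m = START_RESULT.match(line)
        if m:
            events.append((m["ts"], m["res"], m["node"], m["result"]))
    return events


def instances_failed_before_asm(log_text):
    """Instances whose start failed on a node before that node's ASM start succeeded."""
    asm_up = set()  # nodes whose ASM resource has reported a successful start
    failed = []
    for _ts, res, node, result in start_results(log_text):
        if res.endswith(".asm") and result == "succeeded":
            asm_up.add(node)
        elif res.endswith(".inst") and result == "failed" and node not in asm_up:
            failed.append(res)
    return failed


print(instances_failed_before_asm(CRSD_LOG))
```

An empty list would mean that no instance start failed before ASM was up on its node.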
The database instance was then started manually, and the cluster returned to normal:
RAC02:oracle:db2 > crs_stat -t
Name           Type           Target    State     Host        
------------------------------------------------------------
ora.db.db application    ONLINE    ONLINE    rac...ac01 
ora....b1.inst application    ONLINE    ONLINE    rac...ac01 
ora....b2.inst application    ONLINE    ONLINE    rac...ac02 
ora....b3.inst application    ONLINE    ONLINE    rac...ac03 
ora....SM1.asm application    ONLINE    ONLINE    rac...ac01 
ora....01.lsnr application    ONLINE    ONLINE    rac...ac01 
ora....c01.gsd application    ONLINE    ONLINE    rac...ac01 
ora....c01.ons application    ONLINE    ONLINE    rac...ac01 
ora....c01.vip application    ONLINE    ONLINE    rac...ac01 
ora....SM2.asm application    ONLINE    ONLINE    rac...ac02 
ora....02.lsnr application    ONLINE    ONLINE    rac...ac02 
ora....c02.gsd application    ONLINE    ONLINE    rac...ac02 
ora....c02.ons application    ONLINE    ONLINE    rac...ac02 
ora....c02.vip application    ONLINE    ONLINE    rac...ac02 
ora....SM3.asm application    ONLINE    ONLINE    rac...ac03 
ora....03.lsnr application    ONLINE    ONLINE    rac...ac03 
ora....c03.gsd application    ONLINE    ONLINE    rac...ac03 
ora....c03.ons application    ONLINE    ONLINE    rac...ac03 
ora....c03.vip application    ONLINE    ONLINE    rac...ac03

Test an ocssd process crash on node 2:
RAC02:oracle:db2 > ps -fe |grep ocssd
oracle   12951 12034  0 Jun12 ?        00:02:09 /home/crs/bin/ocssd.bin
oracle   27312 11260  0 14:52 pts/0    00:00:00 grep ocssd
RAC02:oracle:db2 > exit
logout
RAC02:~ # kill -9 12951

Node 2 then reboots; its VIP migrates to node 1 and its nodeapps go OFFLINE:
RAC01:oracle:db1 > crs_stat -t
Name           Type           Target    State     Host        
------------------------------------------------------------
ora.db.db application    ONLINE    ONLINE    rac...ac01 
ora....b1.inst application    ONLINE    ONLINE    rac...ac01 
ora....b2.inst application    ONLINE    OFFLINE               
ora....b3.inst application    ONLINE    ONLINE    rac...ac03 
ora....SM1.asm application    ONLINE    ONLINE    rac...ac01 
ora....01.lsnr application    ONLINE    ONLINE    rac...ac01 
ora....c01.gsd application    ONLINE    ONLINE    rac...ac01 
ora....c01.ons application    ONLINE    ONLINE    rac...ac01 
ora....c01.vip application    ONLINE    ONLINE    rac...ac01 
ora....SM2.asm application    ONLINE    OFFLINE               
ora....02.lsnr application    ONLINE    OFFLINE               
ora....c02.gsd application    ONLINE    OFFLINE               
ora....c02.ons application    ONLINE    OFFLINE               
ora....c02.vip application    ONLINE    ONLINE    rac...ac01 
ora....SM3.asm application    ONLINE    ONLINE    rac...ac03 
ora....03.lsnr application    ONLINE    ONLINE    rac...ac03 
ora....c03.gsd application    ONLINE    ONLINE    rac...ac03 
ora....c03.ons application    ONLINE    ONLINE    rac...ac03 
ora....c03.vip application    ONLINE    ONLINE    rac...ac03 

2012-06-14 14:52:40.611: [  CRSRES][1407371584] Attempting to start `ora.rac02.vip` on member `rac01`
2012-06-14 14:52:43.842: [  CRSRES][1407371584] Start of `ora.rac02.vip` on member `rac01` succeeded.
2012-06-14 14:52:43.857: [  CRSEVT][1411574080] Post recovery done evmd event for: rac02


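After node 2 rebooted, the CRS log shows that the VIP and nodeapps started successfully but the database instance failed to start. As on node 3, the log shows that CRS tried to start the instance ahead of ASM during the automatic restart, and the instance cannot start until ASM has finished starting.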
2012-06-14 15:05:27.034: [  CRSRES][1409472832] startRunnable: setting CLI values
2012-06-14 15:05:27.037: [  CRSRES][1411574080] Attempting to start `ora.db.db2.inst` on member `rac02`
2012-06-14 15:05:27.040: [  CRSRES][1409472832] Attempting to start `ora.rac02.ASM2.asm` on member `rac02`
2012-06-14 15:05:27.253: [  CRSRES][1403169088] Stop of `ora.rac02.vip` on member `rac01` succeeded.
2012-06-14 15:05:27.258: [  CRSRES][1403169088] startRunnable: setting CLI values
2012-06-14 15:05:27.258: [  CRSRES][1403169088] Attempting to start `ora.rac02.vip` on member `rac02`
2012-06-14 15:05:31.879: [  CRSRES][1403169088] Start of `ora.rac02.vip` on member `rac02` succeeded.
2012-06-14 15:05:31.915: [  CRSRES][1403169088] startRunnable: setting CLI values
2012-06-14 15:05:31.921: [  CRSRES][1403169088] Attempting to start `ora.rac02.LISTENER_RAC02.lsnr` on member `rac02`
2012-06-14 15:05:32.176: [  CRSRES][1440954688] CRS-1002: Resource 'ora.rac02.vip' is already running on member 'rac02'

2012-06-14 15:05:33.432: [  CRSAPP][1411574080] StartResource error for ora.db.db2.inst error code = 1
2012-06-14 15:05:34.966: [  CRSRES][1411574080] Start of `ora.db.db2.inst` on member `rac02` failed.
2012-06-14 15:05:35.338: [  CRSRES][1403169088] Start of `ora.rac02.LISTENER_RAC02.lsnr` on member `rac02` succeeded.
2012-06-14 15:05:36.020: [  CRSRES][1405270336] CRS-1002: Resource 'ora.rac02.LISTENER_RAC02.lsnr' is already running on member 'rac02'

2012-06-14 15:05:37.051: [  CRSRES][1405270336] startRunnable: setting CLI values
2012-06-14 15:05:37.057: [  CRSRES][1405270336] Attempting to start `ora.rac02.ons` on member `rac02`
2012-06-14 15:05:38.517: [  CRSRES][1405270336] Start of `ora.rac02.ons` on member `rac02` succeeded.
2012-06-14 15:05:39.530: [  CRSRES][1409472832] Start of `ora.rac02.ASM2.asm` on member `rac02` succeeded.
2012-06-14 15:05:39.531: [  CRSRES][1409472832] Skip online resource: ora.rac02.ons
2012-06-14 15:05:39.547: [  CRSRES][1411574080] startRunnable: setting CLI values
2012-06-14 15:05:39.550: [  CRSRES][1411574080] Attempting to start `ora.rac02.gsd` on member `rac02`
2012-06-14 15:05:39.844: [  CRSRES][1411574080] Start of `ora.rac02.gsd` on member `rac02` succeeded.
The node 2 database instance was then started manually, and the cluster returned to normal:
RAC01:oracle:db1 > crs_stat -t
Name           Type           Target    State     Host        
------------------------------------------------------------
ora.db.db application    ONLINE    ONLINE    rac...ac01 
ora....b1.inst application    ONLINE    ONLINE    rac...ac01 
ora....b2.inst application    ONLINE    ONLINE    rac...ac02 
ora....b3.inst application    ONLINE    ONLINE    rac...ac03 
ora....SM1.asm application    ONLINE    ONLINE    rac...ac01 
ora....01.lsnr application    ONLINE    ONLINE    rac...ac01 
ora....c01.gsd application    ONLINE    ONLINE    rac...ac01 
ora....c01.ons application    ONLINE    ONLINE    rac...ac01 
ora....c01.vip application    ONLINE    ONLINE    rac...ac01 
ora....SM2.asm application    ONLINE    ONLINE    rac...ac02 
ora....02.lsnr application    ONLINE    ONLINE    rac...ac02 
ora....c02.gsd application    ONLINE    ONLINE    rac...ac02 
ora....c02.ons application    ONLINE    ONLINE    rac...ac02 
ora....c02.vip application    ONLINE    ONLINE    rac...ac02 
ora....SM3.asm application    ONLINE    ONLINE    rac...ac03 
ora....03.lsnr application    ONLINE    ONLINE    rac...ac03 
ora....c03.gsd application    ONLINE    ONLINE    rac...ac03 
ora....c03.ons application    ONLINE    ONLINE    rac...ac03 
ora....c03.vip application    ONLINE    ONLINE    rac...ac03


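Test Summary
The ocssd process is one of the core processes of the cluster. Killing ocssd causes the host to reboot immediately. After the reboot the nodeapps restart automatically and the VIP fails back to its original node, but the database instance fails to start because ASM has not yet finished starting. The instance must be started manually before it rejoins the cluster.

To fix this, add the ASM dependency to the REQUIRED_RESOURCES setting of each instance. When the OCSSD crash test was repeated after this change, the database instance started after ASM once the server had rebooted, with no manual intervention.
The dependency is added as follows: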
srvctl modify instance -d db -i db1 -s +ASM1
srvctl modify instance -d db -i db2 -s +ASM2
srvctl modify instance -d db -i db3 -s +ASM3
The resulting resource profile:
NAME=ora.db.db1.inst
TYPE=application
ACTION_SCRIPT=/home/oracle/product/11g/bin/racgwrap
ACTIVE_PLACEMENT=0
AUTO_START=1
CHECK_INTERVAL=300
DESCRIPTION=CRS application for Instance
FAILOVER_DELAY=0
FAILURE_INTERVAL=0
FAILURE_THRESHOLD=0
HOSTING_MEMBERS=RAC01
OPTIONAL_RESOURCES=
PLACEMENT=restricted
REQUIRED_RESOURCES=ora.rac01.ASM1.asm
RESTART_ATTEMPTS=5
SCRIPT_TIMEOUT=600
START_TIMEOUT=900
STOP_TIMEOUT=300
UPTIME_THRESHOLD=7d
USR_ORA_ALERT_NAME=
USR_ORA_CHECK_TIMEOUT=0
USR_ORA_CONNECT_STR=/ as sysdba
USR_ORA_DEBUG=0
USR_ORA_DISCONNECT=false
USR_ORA_FLAGS=
USR_ORA_IF=
USR_ORA_INST_NOT_SHUTDOWN=
USR_ORA_LANG=
USR_ORA_NETMASK=
USR_ORA_OPEN_MODE=
USR_ORA_OPI=false
USR_ORA_PFILE=
USR_ORA_PRECONNECT=none
USR_ORA_SRV=
USR_ORA_START_TIMEOUT=0
USR_ORA_STOP_MODE=immediate
USR_ORA_STOP_TIMEOUT=0
USR_ORA_VIP=
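
To confirm that the change took effect on every instance, the profile text can be parsed and the REQUIRED_RESOURCES entry checked. This is a minimal Python sketch under two assumptions: the profile is a plain KEY=VALUE listing as shown, and REQUIRED_RESOURCES holds a whitespace-separated list of resource names. The function names are invented for this illustration.

```python
# Abridged copy of the ora.db.db1.inst profile from this test.
PROFILE = """\
NAME=ora.db.db1.inst
TYPE=application
AUTO_START=1
HOSTING_MEMBERS=RAC01
OPTIONAL_RESOURCES=
REQUIRED_RESOURCES=ora.rac01.ASM1.asm
RESTART_ATTEMPTS=5
USR_ORA_CONNECT_STR=/ as sysdba
"""


def parse_profile(text):
    """Parse KEY=VALUE lines into a dict; values may be empty or contain spaces."""
    profile = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        profile[key.strip()] = value.strip()
    return profile


def asm_dependencies(profile):
    """Return the ASM resources listed in REQUIRED_RESOURCES.

    Assumption: the attribute is a whitespace-separated list of resource names.
    """
    required = profile.get("REQUIRED_RESOURCES", "").split()
    return [name for name in required if name.endswith(".asm")]


profile = parse_profile(PROFILE)
deps = asm_dependencies(profile)
if deps:
    print(f"{profile['NAME']} depends on {', '.join(deps)}")
else:
    print(f"{profile['NAME']} has no ASM dependency")
```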


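EVMD Process Crash Simulation
Simulation Steps
Use 'kill -9 <evmd PID>' to simulate an EVMD process failure.
Expected Result
The EVMD process will be restarted.
Test Record
The test is run on node 3: check the evmd processes and kill the evmd.bin process.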
RAC03:oracle:db3 > ps -fe |grep evmd
root     11383     1  0 14:37 ?        00:00:00 /bin/sh /etc/init.d/init.evmd run
oracle   12478 11383  0 14:37 ?        00:00:00 /bin/su -l oracle -c sh -c 'ulimit -c unlimited; cd /home/crs/log/rac03/evmd; exec /home/crs/bin/evmd '
oracle   12479 12478  0 14:37 ?        00:00:00 /home/crs/bin/evmd.bin
oracle   15712 16457  0 15:31 pts/0    00:00:00 grep evmd
RAC03:oracle:db3 > kill -9 12479
Checking again immediately shows that the evmd process has already been restarted:
RAC03:oracle:db3 > ps -fe |grep evmd
root     16303     1  0 15:32 ?        00:00:00 /bin/sh /etc/init.d/init.evmd run
oracle   16635 16303  0 15:32 ?        00:00:00 /bin/su -l oracle -c sh -c 'ulimit -c unlimited; cd /home/crs/log/rac03/evmd; exec /home/crs/bin/evmd '
oracle   16636 16635 10 15:32 ?        00:00:00 /home/crs/bin/evmd.bin
oracle   16768 16457  0 15:32 pts/0    00:00:00 grep evmd
The evmd.log shows:
2012-06-14 15:05:26.337: [  EVMEVT][1191237952][ENTER] Establishing P2P connection with node: rac02
2012-06-14 15:32:13.209: [    EVMD][1568729504] EVMD Starting
2012-06-14 15:32:13.209: [    EVMD][1568729504] Initializing OCR
2012-06-14 15:32:13.215: [    EVMD][1568729504] Get OCR context succeeded
2012-06-14 15:32:13.231: [    EVMD][1568729504] Active Version from OCR:11.1.0.7.0
2012-06-14 15:32:13.231: [    EVMD][1568729504] Active Version and Software Version are same
2012-06-14 15:32:13.231: [    EVMD][1568729504] Initializing Diagnostics Settings
2012-06-14 15:32:13.234: [    EVMD][1568729504] ENV Logging level for Module: allcomp  0
2012-06-14 15:32:13.235: [    EVMD][1568729504] ENV Logging level for Module: default  0
2012-06-14 15:32:13.236: [    EVMD][1568729504] ENV Logging level for Module: COMMCRS  0
2012-06-14 15:32:13.237: [    EVMD][1568729504] ENV Logging level for Module: COMMNS  0
2012-06-14 15:32:13.239: [    EVMD][1568729504] ENV Logging level for Module: EVMD  0
2012-06-14 15:32:13.239: [    EVMD][1568729504] ENV Logging level for Module: EVMDMAIN  0
2012-06-14 15:32:13.240: [    EVMD][1568729504] ENV Logging level for Module: EVMCOMM  0
2012-06-14 15:32:13.241: [    EVMD][1568729504] ENV Logging level for Module: EVMEVT  0
2012-06-14 15:32:13.242: [    EVMD][1568729504] ENV Logging level for Module: EVMAPP  0
2012-06-14 15:32:13.243: [    EVMD][1568729504] ENV Logging level for Module: EVMAGENT  0
2012-06-14 15:32:13.244: [    EVMD][1568729504] ENV Logging level for Module: CRSOCR  0
2012-06-14 15:32:13.245: [    EVMD][1568729504] ENV Logging level for Module: CLUCLS  0
2012-06-14 15:32:13.246: [    EVMD][1568729504] ENV Logging level for Module: OCRRAW  0
2012-06-14 15:32:13.247: [    EVMD][1568729504] ENV Logging level for Module: OCROSD  0
2012-06-14 15:32:13.248: [    EVMD][1568729504] ENV Logging level for Module: OCRCAC  0
2012-06-14 15:32:13.249: [    EVMD][1568729504] ENV Logging level for Module: OCRAPI  0
2012-06-14 15:32:13.250: [    EVMD][1568729504] ENV Logging level for Module: OCRUTL  0
2012-06-14 15:32:13.251: [    EVMD][1568729504] ENV Logging level for Module: OCRMSG  0
2012-06-14 15:32:13.253: [    EVMD][1568729504] ENV Logging level for Module: OCRCLI  0
2012-06-14 15:32:13.254: [    EVMD][1568729504] ENV Logging level for Module: CSSCLNT  0
2012-06-14 15:32:13.254: [    EVMD][1568729504] Creating pidfile  /home/crs/evm/init/rac03.pid
[  clsdmt][1098918208]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=rac03DBG_EVMD))
2012-06-14 15:32:13.270: [    EVMD][1568729504] Authorization database built successfully.
2012-06-14 15:32:13.349: [  EVMEVT][1568729504][ENTER] EVM Listening on: 11732346
2012-06-14 15:32:13.350: [  EVMAPP][1568729504] EVMD Started
2012-06-14 15:32:13.352: [  EVMEVT][1166059840] Listening at (ADDRESS=(PROTOCOL=tcp)(HOST=rac03_priv)(PORT=49898)) for P2P evmd connections requests
2012-06-14 15:32:13.355: [    EVMD][1568729504] Authorization database built successfully.
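Node 2 behaved the same way: evmd restarted shortly after being killed.

Test Summary
The evmd process is the event manager process of the cluster services; it records cluster events. When it is killed it is restored immediately, with no impact on the system.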
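
The restart can be confirmed by comparing the PID of evmd.bin before and after the kill. The sketch below is a hypothetical Python helper, not part of the original test. It assumes the eight-column `ps -fe` layout seen in this test (UID, PID, PPID, C, STIME, TTY, TIME, CMD), and the sample lines are abridged from the two listings.

```python
# Abridged from the ps -fe listings taken before and after the kill.
PS_BEFORE = """\
root     11383     1  0 14:37 ?        00:00:00 /bin/sh /etc/init.d/init.evmd run
oracle   12479 12478  0 14:37 ?        00:00:00 /home/crs/bin/evmd.bin
oracle   15712 16457  0 15:31 pts/0    00:00:00 grep evmd
"""

PS_AFTER = """\
root     16303     1  0 15:32 ?        00:00:00 /bin/sh /etc/init.d/init.evmd run
oracle   16636 16635 10 15:32 ?        00:00:00 /home/crs/bin/evmd.bin
oracle   16768 16457  0 15:32 pts/0    00:00:00 grep evmd
"""


def pids_of(ps_output, command_suffix):
    """PIDs of processes whose executable (first word of CMD) ends with command_suffix."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 7)  # the 8th field keeps the whole command line
        if len(fields) == 8 and fields[7].split()[0].endswith(command_suffix):
            pids.append(int(fields[1]))
    return pids


def was_restarted(before, after, command_suffix):
    """True if the process exists in both listings but under entirely new PIDs."""
    old = set(pids_of(before, command_suffix))
    new = set(pids_of(after, command_suffix))
    return bool(old) and bool(new) and old.isdisjoint(new)


print(pids_of(PS_BEFORE, "evmd.bin"), "->", pids_of(PS_AFTER, "evmd.bin"))
print("restarted:", was_restarted(PS_BEFORE, PS_AFTER, "evmd.bin"))
```

Matching on the first word of the command keeps the `grep evmd` line out of the result.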


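Oracle ASM Failure Simulation
ASM Instance Crash Simulation
Simulation Steps
Use 'kill -9 <ASM pmon PID>' to simulate an ASM instance crash.

Expected Result
The *.asm and *.inst resources on that node will show OFFLINE, and the cluster will restart them automatically.
Another database node will perform instance recovery.
Client connections are preserved: SELECT statements on TAF-configured connections continue to run, while DML operations are rolled back.
Test Record
Node 3 is used for this test.
First kill the ASM pmon process to simulate an ASM crash: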
RAC03:~ # ps -fe |grep pmon
oracle   13909     1  0 Jun15 ?        00:00:00 asm_pmon_+ASM3
oracle   14205     1  0 Jun15 ?        00:00:00 ora_pmon_db3
root     26249 26079  0 11:01 pts/2    00:00:00 grep pmon
RAC03:~ # kill -9 13909
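The log shows that about 28 seconds elapsed between CRS detecting the ASM failure on node 3 and the database instance being restarted (11:02:05.726 to 11:02:34.144):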
2012-06-18 11:02:05.726: [  CRSRES][1407371584] In stateChanged, ora.rac03.ASM3.asm target is ONLINE
2012-06-18 11:02:05.726: [  CRSRES][1407371584] ora.rac03.ASM3.asm on rac03 went OFFLINE unexpectedly
2012-06-18 11:02:05.726: [  CRSRES][1407371584] StopResource: setting CLI values
2012-06-18 11:02:05.729: [  CRSRES][1407371584] Attempting to stop `ora.rac03.ASM3.asm` on member `rac03`
2012-06-18 11:02:06.293: [  CRSRES][1405270336] In stateChanged, ora.db.db3.inst target is ONLINE
2012-06-18 11:02:06.293: [  CRSRES][1405270336] ora.db.db3.inst on rac03 went OFFLINE unexpectedly
2012-06-18 11:02:06.293: [  CRSRES][1405270336] StopResource: setting CLI values
2012-06-18 11:02:06.296: [  CRSRES][1405270336] Attempting to stop `ora.db.db3.inst` on member `rac03`
2012-06-18 11:02:08.732: [  CRSRES][1405270336] Stop of `ora.db.db3.inst` on member `rac03` succeeded.
2012-06-18 11:02:08.733: [  CRSRES][1405270336] ora.db.db3.inst RESTART_COUNT=0 RESTART_ATTEMPTS=5
2012-06-18 11:02:09.168: [  CRSRES][1407371584] Stop of `ora.rac03.ASM3.asm` on member `rac03` succeeded.
2012-06-18 11:02:09.168: [  CRSRES][1407371584] ora.rac03.ASM3.asm RESTART_COUNT=0 RESTART_ATTEMPTS=5
2012-06-18 11:02:09.168: [  CRSRES][1407371584] Restarting ora.rac03.ASM3.asm on rac03
2012-06-18 11:02:09.171: [  CRSRES][1407371584] startRunnable: setting CLI values
2012-06-18 11:02:09.172: [  CRSRES][1407371584] Attempting to start `ora.rac03.ASM3.asm` on member `rac03`
2012-06-18 11:02:15.079: [  CRSRES][1407371584] Start of `ora.rac03.ASM3.asm` on member `rac03` succeeded.
2012-06-18 11:02:15.080: [  CRSRES][1407371584] Successfully restarted ora.rac03.ASM3.asm on rac03, RESTART_COUNT=1
2012-06-18 11:02:15.114: [  CRSRES][1407371584] ora.rac03.ASM3.asm Updated LAST_RESTART time in ocr
2012-06-18 11:02:15.115: [  CRSRES][1405270336] Restarting ora.db.db3.inst on rac03
2012-06-18 11:02:15.118: [  CRSRES][1405270336] startRunnable: setting CLI values
2012-06-18 11:02:15.119: [  CRSRES][1405270336] Attempting to start `ora.db.db3.inst` on member `rac03`
2012-06-18 11:02:16.567: [  OCRUTL][1266772288]u_freem: mem passed is null
2012-06-18 11:02:34.144: [  CRSRES][1405270336] Start of `ora.db.db3.inst` on member `rac03` succeeded.
2012-06-18 11:02:34.144: [  CRSRES][1405270336] Successfully restarted ora.db.db3.inst on rac03, RESTART_COUNT=1
2012-06-18 11:02:34.162: [  CRSRES][1405270336] ora.db.db3.inst Updated LAST_RESTART time in ocr
2012-06-18 11:02:34.205: [  CRSRES][1407371584] CRS-1002: Resource 'ora.db.db3.inst' is already running on member 'rac03'
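The outage of each resource can be read off the log by pairing its "went OFFLINE unexpectedly" line with its "Successfully restarted" line. The following Python sketch illustrates the idea; the function and pattern names are invented here, and the sample lines are copied from the log for this ASM crash.

```python
import re
from datetime import datetime

# Lines copied from the crsd.log excerpt for the node 3 ASM crash.
CRSD_LOG = """\
2012-06-18 11:02:05.726: [  CRSRES][1407371584] ora.rac03.ASM3.asm on rac03 went OFFLINE unexpectedly
2012-06-18 11:02:06.293: [  CRSRES][1405270336] ora.db.db3.inst on rac03 went OFFLINE unexpectedly
2012-06-18 11:02:15.080: [  CRSRES][1407371584] Successfully restarted ora.rac03.ASM3.asm on rac03, RESTART_COUNT=1
2012-06-18 11:02:34.144: [  CRSRES][1405270336] Successfully restarted ora.db.db3.inst on rac03, RESTART_COUNT=1
"""

# timestamp, "[ MODULE][thread id]", then the message text
LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}): \[\s*\w+\]\[\d+\]\s*(.*)$"
)
WENT_OFFLINE = re.compile(r"^(\S+) on \S+ went OFFLINE unexpectedly")
RESTARTED = re.compile(r"^Successfully restarted (\S+) on ")


def outage_seconds(log_text):
    """Map each resource to the seconds between going OFFLINE and being restarted."""
    down_since = {}
    outages = {}
    for raw in log_text.splitlines():
        m = LINE.match(raw)
        if not m:
            continue
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
        message = m.group(2)
        off = WENT_OFFLINE.match(message)
        if off:
            down_since.setdefault(off.group(1), ts)  # keep the earliest OFFLINE time
            continue
        up = RESTARTED.match(message)
        if up and up.group(1) in down_since:
            outages[up.group(1)] = (ts - down_since.pop(up.group(1))).total_seconds()
    return outages


for resource, seconds in outage_seconds(CRSD_LOG).items():
    print(f"{resource}: {seconds:.3f} s")
```

On these lines the ASM resource is back after roughly 9 seconds and the database instance after roughly 28 seconds, most of which is the instance start itself.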
Finally, all resources are recovered:
RAC02:oracle:db2 > crs_stat -t
Name           Type           Target    State     Host        
------------------------------------------------------------
ora.db.db application    ONLINE    ONLINE    rac...ac01 
ora....b1.inst application    ONLINE    ONLINE    rac...ac01 
ora....b2.inst application    ONLINE    ONLINE    rac...ac02 
ora....b3.inst application    ONLINE    ONLINE    rac...ac03 
ora....SM1.asm application    ONLINE    ONLINE    rac...ac01 
ora....01.lsnr application    ONLINE    ONLINE    rac...ac01 
ora....c01.gsd application    ONLINE    ONLINE    rac...ac01 
ora....c01.ons application    ONLINE    ONLINE    rac...ac01 
ora....c01.vip application    ONLINE    ONLINE    rac...ac01 
ora....SM2.asm application    ONLINE    ONLINE    rac...ac02 
ora....02.lsnr application    ONLINE    ONLINE    rac...ac02 
ora....c02.gsd application    ONLINE    ONLINE    rac...ac02 
ora....c02.ons application    ONLINE    ONLINE    rac...ac02 
ora....c02.vip application    ONLINE    ONLINE    rac...ac02 
ora....SM3.asm application    ONLINE    ONLINE    rac...ac03 
ora....03.lsnr application    ONLINE    ONLINE    rac...ac03 
ora....c03.gsd application    ONLINE    ONLINE    rac...ac03 
ora....c03.ons application    ONLINE    ONLINE    rac...ac03 
ora....c03.vip application    ONLINE    ONLINE    rac...ac03

Next, simulate an ASM process crash on both node 2 and node 3.
The status is as follows:
RAC02:oracle:db2 > crs_stat -t
Name           Type           Target    State     Host        
------------------------------------------------------------
ora.db.db application    ONLINE    ONLINE    rac...ac01 
ora....b1.inst application    ONLINE    ONLINE    rac...ac01 
ora....b2.inst application    ONLINE    OFFLINE               
ora....b3.inst application    ONLINE    OFFLINE               
ora....SM1.asm application    ONLINE    ONLINE    rac...ac01 
ora....01.lsnr application    ONLINE    ONLINE    rac...ac01 
ora....c01.gsd application    ONLINE    ONLINE    rac...ac01 
ora....c01.ons application    ONLINE    ONLINE    rac...ac01 
ora....c01.vip application    ONLINE    ONLINE    rac...ac01 
ora....SM2.asm application    ONLINE    OFFLINE               
ora....02.lsnr application    ONLINE    ONLINE    rac...ac02 
ora....c02.gsd application    ONLINE    ONLINE    rac...ac02 
ora....c02.ons application    ONLINE    ONLINE    rac...ac02 
ora....c02.vip application    ONLINE    ONLINE    rac...ac02 
ora....SM3.asm application    ONLINE    OFFLINE               
ora....03.lsnr application    ONLINE    ONLINE    rac...ac03 
ora....c03.gsd application    ONLINE    ONLINE    rac...ac03 
ora....c03.ons application    ONLINE    ONLINE    rac...ac03 
ora....c03.vip application    ONLINE    ONLINE    rac...ac03
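The log shows that the cluster detected the ASM failure, restarted ASM, and then restarted the database instance. On node 3 this took about 32 seconds (11:22:31.371 to 11:23:03.424):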
2012-06-18 11:22:31.371: [  CRSRES][1407371584] In stateChanged, ora.rac03.ASM3.asm target is ONLINE
2012-06-18 11:22:31.371: [  CRSRES][1407371584] ora.rac03.ASM3.asm on rac03 went OFFLINE unexpectedly
2012-06-18 11:22:31.371: [  CRSRES][1407371584] StopResource: setting CLI values
2012-06-18 11:22:31.374: [  CRSRES][1407371584] Attempting to stop `ora.rac03.ASM3.asm` on member `rac03`
2012-06-18 11:22:32.051: [  CRSRES][1405270336] In stateChanged, ora.db.db3.inst target is ONLINE
2012-06-18 11:22:32.051: [  CRSRES][1405270336] ora.db.db3.inst on rac03 went OFFLINE unexpectedly
2012-06-18 11:22:32.051: [  CRSRES][1405270336] StopResource: setting CLI values
2012-06-18 11:22:32.055: [  CRSRES][1405270336] Attempting to stop `ora.db.db3.inst` on member `rac03`
2012-06-18 11:22:34.482: [  CRSRES][1405270336] Stop of `ora.db.db3.inst` on member `rac03` succeeded.
2012-06-18 11:22:34.483: [  CRSRES][1405270336] ora.db.db3.inst RESTART_COUNT=2 RESTART_ATTEMPTS=5
2012-06-18 11:22:34.483: [  CRSRES][1405270336] ora.db.db3.inst Uptime does not exceed uptime_threshold
2012-06-18 11:22:34.797: [  CRSRES][1407371584] Stop of `ora.rac03.ASM3.asm` on member `rac03` succeeded.
2012-06-18 11:22:34.797: [  CRSRES][1407371584] ora.rac03.ASM3.asm RESTART_COUNT=2 RESTART_ATTEMPTS=5
2012-06-18 11:22:34.798: [  CRSRES][1407371584] ora.rac03.ASM3.asm Uptime does not exceed uptime_threshold
2012-06-18 11:22:34.798: [  CRSRES][1407371584] Restarting ora.rac03.ASM3.asm on rac03
2012-06-18 11:22:34.801: [  CRSRES][1407371584] startRunnable: setting CLI values
2012-06-18 11:22:34.801: [  CRSRES][1407371584] Attempting to start `ora.rac03.ASM3.asm` on member `rac03`
2012-06-18 11:22:42.713: [  CRSRES][1407371584] Start of `ora.rac03.ASM3.asm` on member `rac03` succeeded.
2012-06-18 11:22:42.713: [  CRSRES][1407371584] Successfully restarted ora.rac03.ASM3.asm on rac03, RESTART_COUNT=3
2012-06-18 11:22:42.727: [  CRSRES][1407371584] ora.rac03.ASM3.asm Updated LAST_RESTART time in ocr
2012-06-18 11:22:42.728: [  CRSRES][1405270336] Restarting ora.db.db3.inst on rac03
2012-06-18 11:22:42.731: [  CRSRES][1405270336] startRunnable: setting CLI values
2012-06-18 11:22:42.731: [  CRSRES][1405270336] Attempting to start `ora.db.db3.inst` on member `rac03`
2012-06-18 11:22:43.818: [  OCRUTL][1317128512]u_freem: mem passed is null
2012-06-18 11:23:03.424: [  CRSRES][1405270336] Start of `ora.db.db3.inst` on member `rac03` succeeded.
2012-06-18 11:23:03.424: [  CRSRES][1405270336] Successfully restarted ora.db.db3.inst on rac03, RESTART_COUNT=3
2012-06-18 11:23:03.443: [  CRSRES][1405270336] ora.db.db3.inst Updated LAST_RESTART time in ocr
2012-06-18 11:23:03.839: [  CRSRES][1407371584] CRS-1002: Resource 'ora.db.db3.inst' is already running on member 'rac03'
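Repeated crash tests consume the restart budget of each resource: the log reports RESTART_COUNT=2 against RESTART_ATTEMPTS=5 before this restart, and notes that the uptime has not exceeded uptime_threshold. The Python sketch below extracts those counters and reports how many restarts remain. It rests on an assumption the log itself does not confirm: that CRS stops restarting a resource in place once RESTART_COUNT reaches RESTART_ATTEMPTS within the UPTIME_THRESHOLD window. The function name is invented for this illustration.

```python
import re

# Lines copied from the crsd.log excerpt for the second ASM crash test.
CRSD_LOG = """\
2012-06-18 11:22:34.483: [  CRSRES][1405270336] ora.db.db3.inst RESTART_COUNT=2 RESTART_ATTEMPTS=5
2012-06-18 11:22:34.483: [  CRSRES][1405270336] ora.db.db3.inst Uptime does not exceed uptime_threshold
2012-06-18 11:22:34.797: [  CRSRES][1407371584] ora.rac03.ASM3.asm RESTART_COUNT=2 RESTART_ATTEMPTS=5
2012-06-18 11:22:42.713: [  CRSRES][1407371584] Successfully restarted ora.rac03.ASM3.asm on rac03, RESTART_COUNT=3
"""

# resource name followed by both counters, as logged just before a restart
COUNTERS = re.compile(r"\]\s*(\S+) RESTART_COUNT=(\d+) RESTART_ATTEMPTS=(\d+)")


def restart_budget(log_text):
    """Map resource -> (count, attempts, remaining) from the latest counter line.

    'remaining' is attempts minus count, floored at zero. Treating it as the
    number of restarts left is an assumption about CRS behaviour.
    """
    budget = {}
    for line in log_text.splitlines():
        m = COUNTERS.search(line)
        if m:
            count, attempts = int(m.group(2)), int(m.group(3))
            budget[m.group(1)] = (count, attempts, max(attempts - count, 0))
    return budget


for resource, (count, attempts, remaining) in restart_budget(CRSD_LOG).items():
    print(f"{resource}: {count}/{attempts} used, {remaining} remaining before this restart")
```

When planning several kill tests in a row, a check like this helps avoid exhausting the budget and mistaking the result for a failover defect.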
The status after recovery:
RAC02:oracle:db2 > crs_stat -t
Name           Type           Target    State     Host        
------------------------------------------------------------
ora.db.db application    ONLINE    ONLINE    rac...ac01 
ora....b1.inst application    ONLINE    ONLINE    rac...ac01 
ora....b2.inst application    ONLINE    ONLINE    rac...ac02 
ora....b3.inst application    ONLINE    ONLINE    rac...ac03 
ora....SM1.asm application    ONLINE    ONLINE    rac...ac01 
ora....01.lsnr application    ONLINE    ONLINE    rac...ac01 
ora....c01.gsd application    ONLINE    ONLINE    rac...ac01 
ora....c01.ons application    ONLINE    ONLINE    rac...ac01 
ora....c01.vip application    ONLINE    ONLINE    rac...ac01 
ora....SM2.asm application    ONLINE    ONLINE    rac...ac02 
ora....02.lsnr application    ONLINE    ONLINE    rac...ac02 
ora....c02.gsd application    ONLINE    ONLINE    rac...ac02 
ora....c02.ons application    ONLINE    ONLINE    rac...ac02 
ora....c02.vip application    ONLINE    ONLINE    rac...ac02 
ora....SM3.asm application    ONLINE    ONLINE    rac...ac03 
ora....03.lsnr application    ONLINE    ONLINE    rac...ac03 
ora....c03.gsd application    ONLINE    ONLINE    rac...ac03 
ora....c03.ons application    ONLINE    ONLINE    rac...ac03 
ora....c03.vip application    ONLINE    ONLINE    rac...ac03

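Test Summary
An abnormal ASM instance crash also brings down the database instance on that node. The cluster detects the failure automatically, restarts ASM, and then restarts and recovers the database instance.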


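Oracle Instance Failure Simulation
Instance Crash Simulation
Simulation Steps
Use 'kill -9 <pmon PID>' to simulate a database instance crash.
Expected Result
The *.inst resource will show OFFLINE, and the cluster will restart the database instance automatically.
Test Record
Start with node 3.
Kill the pmon process: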
RAC03:~ # ps -fe |grep pmon
oracle    7581     1  0 11:22 ?        00:00:00 asm_pmon_+ASM3
oracle    7890     1  0 11:22 ?        00:00:00 ora_pmon_db3
root     19824 26079  0 11:42 pts/2    00:00:00 grep pmon
RAC03:~ # kill -9 7890
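Both the ASM instance and the database instance run a pmon process on this node, so the PID has to be chosen by full process name: killing 7581 instead of 7890 would crash ASM rather than the database. A minimal Python sketch of that selection, assuming `ps -fe` output in the column layout shown above (the function name `find_pmon_pid` is hypothetical and not part of any Oracle tool):

```python
def find_pmon_pid(ps_output, sid):
    """Return the PID of the database pmon process for the given SID, or None."""
    target = "ora_pmon_" + sid
    for line in ps_output.splitlines():
        fields = line.split()
        # ps -fe columns: UID PID PPID C STIME TTY TIME CMD
        if len(fields) >= 8 and fields[7] == target:
            return int(fields[1])
    return None


# Sample copied from the transcript above.
PS_OUTPUT = """\
oracle    7581     1  0 11:22 ?        00:00:00 asm_pmon_+ASM3
oracle    7890     1  0 11:22 ?        00:00:00 ora_pmon_db3
root     19824 26079  0 11:42 pts/2    00:00:00 grep pmon
"""

print(find_pmon_pid(PS_OUTPUT, "db3"))  # 7890
```

Matching on the exact command name also skips the `grep pmon` line, which a plain substring match would pick up.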
The resulting state is as follows:
RAC01:oracle:db1 > crs_stat -t
Name           Type           Target    State     Host        
------------------------------------------------------------
ora.db.db      application    ONLINE    ONLINE    rac...ac01
ora....b1.inst application    ONLINE    ONLINE    rac...ac01 
ora....b2.inst application    ONLINE    ONLINE    rac...ac02 
ora....b3.inst application    ONLINE    OFFLINE               
ora....SM1.asm application    ONLINE    ONLINE    rac...ac01 
ora....01.lsnr application    ONLINE    ONLINE    rac...ac01 
ora....c01.gsd application    ONLINE    ONLINE    rac...ac01 
ora....c01.ons application    ONLINE    ONLINE    rac...ac01 
ora....c01.vip application    ONLINE    ONLINE    rac...ac01 
ora....SM2.asm application    ONLINE    ONLINE    rac...ac02 
ora....02.lsnr application    ONLINE    ONLINE    rac...ac02 
ora....c02.gsd application    ONLINE    ONLINE    rac...ac02 
ora....c02.ons application    ONLINE    ONLINE    rac...ac02 
ora....c02.vip application    ONLINE    ONLINE    rac...ac02 
ora....SM3.asm application    ONLINE    ONLINE    rac...ac03 
ora....03.lsnr application    ONLINE    ONLINE    rac...ac03 
ora....c03.gsd application    ONLINE    ONLINE    rac...ac03 
ora....c03.ons application    ONLINE    ONLINE    rac...ac03 
ora....c03.vip application    ONLINE    ONLINE    rac...ac03
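Spotting the one OFFLINE row in a listing like this can be automated. The following Python sketch assumes the `crs_stat -t` text has already been captured as a string; the helper name `offline_resources` is hypothetical. It reports the abbreviated resource names exactly as `crs_stat -t` prints them.

```python
def offline_resources(crs_stat_output):
    """Return names of resources whose Target is ONLINE but whose State is not."""
    offline = []
    for line in crs_stat_output.splitlines():
        fields = line.split()
        # Data rows look like: Name Type Target State [Host]
        # The header and the dashed separator do not start with "ora.".
        if len(fields) >= 4 and fields[0].startswith("ora."):
            name, _type, target, state = fields[:4]
            if target == "ONLINE" and state != "ONLINE":
                offline.append(name)
    return offline


# A shortened sample in the same layout as the listing above.
SAMPLE = """\
Name           Type           Target    State     Host
------------------------------------------------------------
ora.db.db      application    ONLINE    ONLINE    rac...ac01
ora....b2.inst application    ONLINE    ONLINE    rac...ac02
ora....b3.inst application    ONLINE    OFFLINE
ora....SM3.asm application    ONLINE    ONLINE    rac...ac03
"""

print(offline_resources(SAMPLE))  # ['ora....b3.inst']
```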
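CRS detected the abnormal state of the node 3 instance immediately and restarted it. The original record gives the recovery time as 23 seconds; in the log excerpt below, about 19 seconds pass between the OFFLINE detection (11:42:22.365) and the successful start (11:42:41.445).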
2012-06-18 11:42:22.365: [  CRSRES][1407371584] In stateChanged, ora.db.db3.inst target is ONLINE
2012-06-18 11:42:22.365: [  CRSRES][1407371584] ora.db.db3.inst on rac03 went OFFLINE unexpectedly
2012-06-18 11:42:22.365: [  CRSRES][1407371584] StopResource: setting CLI values
2012-06-18 11:42:22.368: [  CRSRES][1407371584] Attempting to stop `ora.db.db3.inst` on member `rac03`
2012-06-18 11:42:24.795: [  CRSRES][1407371584] Stop of `ora.db.db3.inst` on member `rac03` succeeded.
2012-06-18 11:42:24.795: [  CRSRES][1407371584] ora.db.db3.inst RESTART_COUNT=3 RESTART_ATTEMPTS=5
2012-06-18 11:42:24.796: [  CRSRES][1407371584] ora.db.db3.inst Uptime does not exceed uptime_threshold
2012-06-18 11:42:24.796: [  CRSRES][1407371584] Restarting ora.db.db3.inst on rac03
2012-06-18 11:42:24.799: [  CRSRES][1407371584] startRunnable: setting CLI values
2012-06-18 11:42:24.800: [  CRSRES][1407371584] Attempting to start `ora.db.db3.inst` on member `rac03`
2012-06-18 11:42:24.854: [  OCRUTL][1249986880]u_freem: mem passed is null
2012-06-18 11:42:41.445: [  CRSRES][1407371584] Start of `ora.db.db3.inst` on member `rac03` succeeded.
2012-06-18 11:42:41.445: [  CRSRES][1407371584] Successfully restarted ora.db.db3.inst on rac03, RESTART_COUNT=4
2012-06-18 11:42:41.450: [  CRSRES][1407371584] ora.db.db3.inst Updated LAST_RESTART time in ocr
2012-06-18 11:42:46.441: [  OCRUTL][1241594176]u_freem: mem passed is null
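The recovery time can be read straight from the log timestamps. The Python sketch below assumes the crsd log line format shown above and measures from the "went OFFLINE unexpectedly" entry to the next "Start of ... succeeded" entry for the same resource. The function names are hypothetical, and the measurement does not include the time between the kill and its detection.

```python
import re
from datetime import datetime

STAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}):")


def stamp_of(line):
    """Parse the leading timestamp of a crsd log line."""
    match = STAMP.match(line)
    if match is None:
        raise ValueError("no timestamp in line: " + line)
    return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")


def recovery_seconds(log_lines, resource):
    """Seconds from the unexpected OFFLINE event to the successful start."""
    down = up = None
    for line in log_lines:
        if resource not in line:
            continue
        if "went OFFLINE unexpectedly" in line and down is None:
            down = stamp_of(line)
        elif "Start of" in line and "succeeded" in line and down is not None:
            up = stamp_of(line)
            break
    if down is None or up is None:
        return None
    return (up - down).total_seconds()


# Three lines copied from the log excerpt above.
LOG = [
    "2012-06-18 11:42:22.365: [  CRSRES][1407371584] ora.db.db3.inst on rac03 went OFFLINE unexpectedly",
    "2012-06-18 11:42:24.795: [  CRSRES][1407371584] Stop of `ora.db.db3.inst` on member `rac03` succeeded.",
    "2012-06-18 11:42:41.445: [  CRSRES][1407371584] Start of `ora.db.db3.inst` on member `rac03` succeeded.",
]

print(recovery_seconds(LOG, "ora.db.db3.inst"))  # 19.08
```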
The final state after recovery is as follows:
RAC01:oracle:db1 > crs_stat -t
Name           Type           Target    State     Host        
------------------------------------------------------------
ora.db.db      application    ONLINE    ONLINE    rac...ac01
ora....b1.inst application    ONLINE    ONLINE    rac...ac01 
ora....b2.inst application    ONLINE    ONLINE    rac...ac02 
ora....b3.inst application    ONLINE    ONLINE    rac...ac03 
ora....SM1.asm application    ONLINE    ONLINE    rac...ac01 
ora....01.lsnr application    ONLINE    ONLINE    rac...ac01 
ora....c01.gsd application    ONLINE    ONLINE    rac...ac01 
ora....c01.ons application    ONLINE    ONLINE    rac...ac01 
ora....c01.vip application    ONLINE    ONLINE    rac...ac01 
ora....SM2.asm application    ONLINE    ONLINE    rac...ac02 
ora....02.lsnr application    ONLINE    ONLINE    rac...ac02 
ora....c02.gsd application    ONLINE    ONLINE    rac...ac02 
ora....c02.ons application    ONLINE    ONLINE    rac...ac02 
ora....c02.vip application    ONLINE    ONLINE    rac...ac02 
ora....SM3.asm application    ONLINE    ONLINE    rac...ac03 
ora....03.lsnr application    ONLINE    ONLINE    rac...ac03 
ora....c03.gsd application    ONLINE    ONLINE    rac...ac03 
ora....c03.ons application    ONLINE    ONLINE    rac...ac03 
ora....c03.vip application    ONLINE    ONLINE    rac...ac03

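The instance crash simulation was then repeated on nodes 2 and 3.
The outcome matched the previous test: the cluster detected the abnormal instance state and restarted the instance. The final state after recovery is as follows: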
RAC01:oracle:db1 > crs_stat -t
Name           Type           Target    State     Host        
------------------------------------------------------------
ora.db.db      application    ONLINE    ONLINE    rac...ac01
ora....b1.inst application    ONLINE    ONLINE    rac...ac01 
ora....b2.inst application    ONLINE    ONLINE    rac...ac02 
ora....b3.inst application    ONLINE    ONLINE    rac...ac03 
ora....SM1.asm application    ONLINE    ONLINE    rac...ac01 
ora....01.lsnr application    ONLINE    ONLINE    rac...ac01 
ora....c01.gsd application    ONLINE    ONLINE    rac...ac01 
ora....c01.ons application    ONLINE    ONLINE    rac...ac01 
ora....c01.vip application    ONLINE    ONLINE    rac...ac01 
ora....SM2.asm application    ONLINE    ONLINE    rac...ac02 
ora....02.lsnr application    ONLINE    ONLINE    rac...ac02 
ora....c02.gsd application    ONLINE    ONLINE    rac...ac02 
ora....c02.ons application    ONLINE    ONLINE    rac...ac02 
ora....c02.vip application    ONLINE    ONLINE    rac...ac02 
ora....SM3.asm application    ONLINE    ONLINE    rac...ac03 
ora....03.lsnr application    ONLINE    ONLINE    rac...ac03 
ora....c03.gsd application    ONLINE    ONLINE    rac...ac03 
ora....c03.ons application    ONLINE    ONLINE    rac...ac03 
ora....c03.vip application    ONLINE    ONLINE    rac...ac03

Test result summary
If a database instance crashes abnormally, the cluster automatically detects the abnormal state, then restarts and recovers the instance.

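Database failure simulation with the application connected
Failure simulation with the application connected
Simulation steps
Start the front-end application and, while keeping it connected, simulate a database crash.
Expected result
The application stays connected throughout; the test measures the impact on in-flight operations.
Test record
Working with the application vendor, vendor staff kept the application running, logged in as a user, and performed some operations.
The TNS connect descriptor used is as follows: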
test =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = 10.0.12.18)(PORT = 1521))
    (ADDRESS = (PROTOCOL = TCP)(HOST = 10.0.12.20)(PORT = 1521))
    (ADDRESS = (PROTOCOL = TCP)(HOST = 10.0.12.22)(PORT = 1521))
    (LOAD_BALANCE = yes)
    (FAILOVER = ON)
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = db)
      (FAILOVER_MODE =
         (TYPE = SESSION)
         (METHOD = BASIC)
      )
    )
  )
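With `FAILOVER = ON` and `LOAD_BALANCE = yes`, the client picks among the listed addresses at random and moves on to the next one when an attempt fails. The Python sketch below is only a loose model of that connect-time behavior, not Oracle Net's actual implementation. The `connect` callable and `fake_connect` are hypothetical stand-ins for a real driver call, and which address stays reachable is an assumption made for the example. The `FAILOVER_MODE` clause is a separate matter: per the Oracle Net documentation, `TYPE = SESSION` re-creates a lost session on a surviving instance but does not attempt to recover in-flight selects.

```python
import random

# Addresses copied from the descriptor above.
ADDRESSES = [("10.0.12.18", 1521), ("10.0.12.20", 1521), ("10.0.12.22", 1521)]


def connect_with_failover(addresses, connect, load_balance=True, rng=None):
    """Try each address until `connect` succeeds; return (address, connection).

    `connect` is a caller-supplied function that takes (host, port) and
    raises ConnectionError when the listener cannot be reached.
    """
    order = list(addresses)
    if load_balance:
        # LOAD_BALANCE = yes: try the addresses in random order.
        (rng or random).shuffle(order)
    last_error = None
    for address in order:
        try:
            return address, connect(*address)
        except ConnectionError as exc:
            last_error = exc  # FAILOVER = ON: move on to the next address.
    raise ConnectionError("all addresses failed") from last_error


def fake_connect(host, port):
    # Assumption for the example: only the first listed address is reachable.
    if host != "10.0.12.18":
        raise ConnectionError(host + " unreachable")
    return "session on " + host


address, session = connect_with_failover(ADDRESSES, fake_connect)
print(address, session)
```

Whatever order the shuffle produces, the loop ends on the one reachable address, which is the property connect-time failover provides.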


Next, kill -9 on the ocssd process was used to force nodes 2 and 3 to reboot. The database state was as follows:
RAC01:oracle:db1 > crs_stat -t
Name           Type           Target    State     Host        
------------------------------------------------------------
ora.db.db      application    ONLINE    ONLINE    rac...ac01
ora....b1.inst application    ONLINE    ONLINE    rac...ac01 
ora....b2.inst application    ONLINE    OFFLINE               
ora....b3.inst application    ONLINE    OFFLINE               
ora....SM1.asm application    ONLINE    ONLINE    rac...ac01 
ora....01.lsnr application    ONLINE    ONLINE    rac...ac01 
ora....c01.gsd application    ONLINE    ONLINE    rac...ac01 
ora....c01.ons application    ONLINE    ONLINE    rac...ac01 
ora....c01.vip application    ONLINE    ONLINE    rac...ac01 
ora....SM2.asm application    ONLINE    OFFLINE               
ora....02.lsnr application    ONLINE    OFFLINE               
ora....c02.gsd application    ONLINE    OFFLINE               
ora....c02.ons application    ONLINE    OFFLINE               
ora....c02.vip application    ONLINE    ONLINE    rac...ac01 
ora....SM3.asm application    ONLINE    OFFLINE               
ora....03.lsnr application    ONLINE    OFFLINE               
ora....c03.gsd application    ONLINE    OFFLINE               
ora....c03.ons application    ONLINE    OFFLINE               
ora....c03.vip application    ONLINE    ONLINE    rac...ac01 

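The in-flight operation could not continue; the user had to log in again before continuing.
The application kept its database connection throughout, with no errors.
Test result summary
The test shows that when the current connection is interrupted abnormally, the front-end application's database connection fails over to another available node. The in-flight application operation cannot complete, however, and the user must log in again to continue. The application itself keeps its database connection throughout and needs no additional action.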

Original article: http://www.cnblogs.com/shengs/p/4521092.html
