一号节点:
2017-10-24 08:49:49.963 [CLSECHO(5883)]CRS-10001: 24-Oct-17 08:49 AFD-9204: false 2017-10-24 08:49:59.005 [CLSECHO(7786)]CRS-10001: 24-Oct-17 08:49 AFD-9204: false 2017-10-24 08:51:36.138 [OCSSD(271335)]CRS-1612: Network communication with node 12crac2 (2) missing for 50% of timeout interval. Removal of this node from cluster in 14.050 seconds 2017-10-24 08:51:43.139 [OCSSD(271335)]CRS-1611: Network communication with node 12crac2 (2) missing for 75% of timeout interval. Removal of this node from cluster in 7.050 seconds 2017-10-24 08:51:48.140 [OCSSD(271335)]CRS-1610: Network communication with node 12crac2 (2) missing for 90% of timeout interval. Removal of this node from cluster in 2.050 seconds
二号节点
2017-10-24 08:49:49.896 [CLSECHO(78374)]CRS-10001: 24-Oct-17 08:49 AFD-9204: false 2017-10-24 08:49:58.902 [CLSECHO(79205)]CRS-10001: 24-Oct-17 08:49 AFD-9204: false 2017-10-24 08:51:07.952 [ORAAGENT(281402)]CRS-5011: Check of resource "policejx" failed: details at "(:CLSN00007:)" in "/u01/app/grid/diag/crs/12crac2/crs/trace/crsd_oraagent_oracle.trc" 2017-10-24 08:51:10.064 [ORAAGENT(83682)]CRS-8500: Oracle Clusterware ORAAGENT process is starting with operating system process ID 83682 2017-10-24 09:10:21.937 [OHASD(202039)]CRS-8500: Oracle Clusterware OHASD process is starting with operating system process ID 202039
查看crsd_oraagent_oracle.trc:
2017-10-24 08:51:07.955381 : AGENT:3816797952: {0:13:2033} Agfw calling user exitCB, will exit on return 2017-10-24 08:51:07.955390 : AGENT:3816797952: {0:13:2033} returned from user exitCB, exiting 2017-10-24 08:51:07.955455 : AGFW:3816797952: {0:13:2033} Agent is exiting with exit code: 1 2017-10-24 08:51:08.006860 : USRTHRD:3812595456: {0:13:2} ORA-01092: ORACLE instance terminated. Disconnection forced Process ID: 0 Session ID: 4739 Serial number: 3830
查看:ohasd_oraagent_grid.trc
2017-10-24 08:50:48.592645 :CLSDYNAM:1666320128: [ora.gipcd]{0:0:2} [check] ClsdmClient::sendMessage clsdmc_respget return: status=0, ecode=0 2017-10-24 08:51:09.396880 : USRTHRD:2702620416: Usrco UsrcoEventForwarder::postMyEvent posting event "INSTANCE VERSION=1.0 service=policejx database=policejx instance=policejx_1 host=12crac2 status=down reason=FAILURE timestamp=2017-10-24 08:51:07 timezone=+08:00 db_domain= " 2017-10-24 08:51:09.407933 : USRTHRD:2702620416: clsnUsrco: path=/u01/app/12.1.0/grid/racg/usrco/
crsd_oraagent_oracle.trc
2017-10-24 08:51:07.952430 :CLSDYNAM:3797776128: [ora.policejx.db]{0:13:2} [check] DbAgent:checkCbk shutdown reset s_PDBStatusMap
2017-10-24 08:51:07.952493 :CLSDYNAM:3797776128: [ora.policejx.db]{0:13:2} [check] InstAgent::checkState db/asm 2clsagfw_res_status 5 poolState 2
2017-10-24 08:51:07.952606 : USRTHRD:3797776128: {0:13:2} Gimh::destructor gimh_dest_query_ctx rc=0
2017-10-24 08:51:07.952802 : USRTHRD:3797776128: {0:13:2} Gimh::destructor gimh_dest_inst_ctx rc=0
2017-10-24 08:51:07.952827 :CLSDYNAM:3797776128: [ora.policejx.db]{0:13:2} [check] ConnectionPool::stopConnection
2017-10-24 08:51:07.952850 :CLSDYNAM:3797776128: [ora.policejx.db]{0:13:2} [check] ConnectionPool::removeConnection connection count 0
查看alert_sid.log:
Auto-tuning: Starting background process GTXi Tue Oct 24 08:41:09 2017 Dumping diagnostic data in directory=[cdmp_20171024084109], requested by (instance=2, osid=1165285 (M000)), summary=[incident=643254]. Tue Oct 24 08:43:27 2017 Tue Oct 24 08:43:27 2017 System State dumped to trace file /u01/app/oracle/diag/rdbms/policejx/policejx_1/trace/policejx_1_ora_784542.trc System State dumped to trace file /u01/app/oracle/diag/rdbms/policejx/policejx_1/trace/policejx_1_ora_783891.trc Tue Oct 24 08:43:27 2017 Warning: VKTM detected a time drift. Time drifts can result in an unexpected behavior such as time-outs. Please check trace file for more details. Tue Oct 24 08:43:35 2017 LMON (ospid: 646174) waits for event ‘latch: enqueue hash chains‘ for 248 secs. LMON (ospid: 646174) waits for latch ‘enqueue hash chains‘ for 248 secs. Tue Oct 24 08:43:44 2017 Errors in file /u01/app/oracle/diag/rdbms/policejx/policejx_1/trace/policejx_1_mmon_646328.trc (incident=577523) (PDBNAME=CDB$ROOT): ORA-00445: background process "m005" did not start after 120 seconds Incident details in: /u01/app/oracle/diag/rdbms/policejx/policejx_1/incident/incdir_577523/policejx_1_mmon_646328_i577523.trc Dumping diagnostic data in directory=[cdmp_20171024084803], requested by (instance=1, osid=646328 (MMON)), summary=[incident=577523]. Tue Oct 24 08:48:35 2017 Dumping diagnostic data in directory=[cdmp_20171024084313], requested by (instance=2, osid=1165285 (M000)), summary=[incident=643255]. Tue Oct 24 08:49:47 2017 Errors in file /u01/app/oracle/diag/rdbms/policejx/policejx_1/trace/policejx_1_ora_775205.trc (incident=586003) (PDBNAME=POLICE): ORA-00445: 后台进程 "PP9S" 在 120 秒之后仍没有启动 Incident details in: /u01/app/oracle/diag/rdbms/policejx/policejx_1/incident/incdir_586003/policejx_1_ora_775205_i586003.trc Tue Oct 24 08:49:52 2017 Dumping diagnostic data in directory=[cdmp_20171024084952], requested by (instance=1, osid=775205), summary=[incident=586003]. Tue Oct 24 08:51:06 2017 Dumping diagnostic data in directory=[cdmp_20171024085106], requested by (instance=2, osid=1175981 (M004)), summary=[incident=649702]. Tue Oct 24 08:51:07 2017 DRM FREEZE TIMEOUT: kjfzpdrmfrz: ospid 775215 not frozen. Process waiting on ‘SQL*Net message from client‘, 62 secs since wait started. Parallel DRM freeze timeout (70 secs) exceeded, terminating the instance. See /u01/app/oracle/diag/rdbms/policejx/policejx_1/trace/policejx_1_rmv6_646880.trc. USER (ospid: 646880): terminating the instance due to error 481 Tue Oct 24 08:51:07 2017 DRM FREEZE TIMEOUT: kjfzpdrmfrz: ospid 770029 not frozen. Process waiting on ‘gc buffer busy acquire‘, 76 secs since wait started. Parallel DRM freeze timeout (70 secs) exceeded, terminating the instance. See /u01/app/oracle/diag/rdbms/policejx/policejx_1/trace/policejx_1_rmv1_646886.trc. Tue Oct 24 08:51:07 2017 opiodr aborting process unknown ospid (281446) as a result of ORA-1092 Tue Oct 24 08:51:08 2017
查看mmon dump
*** 2017-10-24 08:39:46.952
loadavg : 191.78 118.51 71.78
System user time: 0.17 sys time: 0.49 context switch: 88107
Memory (Avail / Total) = 658.31M / 2066865.29M
Swap (Avail / Total) = 127618.42M / 131072.00M
skgpgcmdout: read() for cmd /bin/ps -elf | /bin/egrep ‘PID | 646150‘ | /bin/grep -v grep timed out after 15.000 seconds
skgpgcmdout: read() for cmd /bin/cat /proc/646150/task/646150/status timed out after 0.000 seconds
Short stack dump:
相关问题:
DRM的步骤
1. Oracle停止所有在需要进行remastering的buffer上的操作。注意:DRM是渐进的,也就是说以windows 为单位,每次对一部分的buffer 进行remastering 操作。
2. Lmon 通知所有实例,准备进行remastering
3. 在旧的master实例清除对应buffer的master信息
4. 将master信息传递给新的master实例
5. 在新的master实例构建资源的最新状态
6. 结束,并释放所有之前所有步骤占用的资源。
关闭DRM:
_gc_affinity_time=0
_gc_undo_affinity=FALSE
这2个参数是静态参数,必须要重启实例才能生效。
或者设置成超长时间:
_gc_affinity_limit=250
_gc_affinity_minimum=10485760
_gc_policy_time :单位为分钟,控制DRM统计实例访问buffer次数的时间间隔,默认为是10分钟。
_gc_affinity_ratio:控制进行remastering所需要达到的最小比例(阀值),默认为50。也就是说,如果某个实例在10分钟(_gc_policy_time)之内,访问某个数据库对象的次数大于其他所有实例50倍时(注意:是50倍,而不是50次),对该数据库对象的buffer进行remastering。
注意:11g里改为_gc_affinity_limit改名为_gc_policy_limit;_gc_affinity_time改名为_gc_policy_time;_gc_affinity_minimun改名为_gc_policy_minimum
诊断DRM:"gcs drm freeze in enter server mode" 等待事件:
Script to Collect DRM Information (drmdiag.sql) (文档 ID 1492990.1)
-- NAME: DRMDIAG.SQL -- ------------------------------------------------------------------------ -- AUTHOR: Michael Polaski - Oracle Support Services -- ------------------------------------------------------------------------ -- PURPOSE: -- This script is intended to provide a user friendly guide to troubleshoot -- drm (dynamic resource remastering) waits. The script will create a file -- called drmdiag_<timestamp>.out in your local directory.set echo off set feedback off column timecol new_value timestamp column spool_extension new_value suffix select to_char(sysdate,‘Mondd_hh24mi‘) timecol, ‘.out‘ spool_extension from sys.dual; column output new_value dbname select value || ‘_‘ output from v$parameter where name = ‘db_name‘; spool drmdiag_&&dbname&×tamp&&suffix set trim on set trims on set lines 140 set pages 100 set verify off set feedback on PROMPT DRMDIAG DATA FOR &&dbname&×tamp PROMPT Important paramenters: PROMPT PROMPT _gc_policy_minimum (default is 1500). Increasing this would cause DRMs to happen less frequently. PROMPT Use the "OBJECT_POLICY_STATISTICS" section later in this report to see how active various objects are. PROMPT PROMPT _gc_policy_time (default to 10 (minutes)). Amount of time to evaluate policy stats. Use the PROMPT "OBJECT_POLICY_STATISTICS" section later in this report to see how active various objects are for the PROMPT _gc_policy_time. Usually not necessary to change this parameter. PROMPT PROMPT _gc_read_mostly_locking (default is TRUE). Setting this to FALSE would disable read mostly related DRMs. PROMPT PROMPT gcs_server_processes (default is derived from CPU count/4). May need to increase this above the PROMPT default to add LMS processes to complte the work during a DRM but the default is usually adequate. PROMPT PROMPT _gc_element_percent (default is 110). May need to apply the fix for bug 14791477 and increase this to PROMPT 140 if running out of lock elements. Usually not necessary to change this parameter. PROMPT PROMPT GC Related parameters set in this instance: show parameter gc PROMPT PROMPT CPU count on this instance: show parameter cpu_count PROMPT PROMPT SGA INFO FOR &&dbname&×tamp PROMPT PROMPT Larger buffer caches (above 100 gig) may increase the cost of DRMs significantly. set lines 120 set pages 100 column component format a40 tru column current_size format 99999999999999999 column min_size format 99999999999999999 column max_size format 99999999999999999 column user_specified_size format 99999999999999999 select component, current_size, min_size, max_size, user_specified_size from v$sga_dynamic_components where current_size > 0; PROMPT PROMPT ASH THRESHOLD... PROMPT PROMPT This will be the threshold in milliseconds for total drm freeze PROMPT times. This will be used for the next queries to look for the worst PROMPT ‘drm freeze‘ minutes. Any minutes that have an average log file PROMPT sync time greater than the threshold will be analyzed further. column threshold_in_ms new_value threshold format 999999999.999 select decode(min(threshold_in_ms),null,0,min(threshold_in_ms)) threshold_in_ms from (select inst_id, to_char(sample_time,‘Mondd_hh24mi‘) minute, sum(time_waited)/1000 threshold_in_ms from gv$active_session_history where event like ‘%drm freeze%‘ group by inst_id,to_char(sample_time,‘Mondd_hh24mi‘) order by 3 desc) where rownum <= 10; PROMPT PROMPT ASH WORST MINUTES FOR DRM FREEZE WAITS: PROMPT PROMPT APPROACH: These are the minutes where the avg drm freeze time PROMPT was the highest (in milliseconds). column event format a30 tru column program format a35 tru column total_wait_time format 999999999999.999 column avg_time_waited format 999999999999.999 select to_char(sample_time,‘Mondd_hh24mi‘) minute, inst_id, event, sum(time_waited)/1000 TOTAL_WAIT_TIME , count(*) WAITS, avg(time_waited)/1000 AVG_TIME_WAITED from gv$active_session_history where event like ‘%drm freeze%‘ group by to_char(sample_time,‘Mondd_hh24mi‘), inst_id, event having sum(time_waited)/1000 > &&threshold order by 1,2; PROMPT PROMPT ASH DRM BACKGROUND PROCESS WAITS DURING WORST MINUTES: PROMPT PROMPT APPROACH: What are LMS and RMV doing when ‘drm freeze‘ waits PROMPT are happening? LMD and LMON info may also be relevant column inst format 999 column minute format a12 tru column event format a50 tru column program format a55 wra select to_char(sample_time,‘Mondd_hh24mi‘) minute, inst_id inst, sum(time_waited)/1000 TOTAL_WAIT_TIME , count(*) WAITS, avg(time_waited)/1000 AVG_TIME_WAITED, program, event from gv$active_session_history where to_char(sample_time,‘Mondd_hh24mi‘) in (select to_char(sample_time,‘Mondd_hh24mi‘) from gv$active_session_history where event like ‘%drm freeze%‘ group by to_char(sample_time,‘Mondd_hh24mi‘), inst_id having sum(time_waited)/1000 > &&threshold and sum(time_waited)/1000 > 0.5) and (program like ‘%LMS%‘ or program like ‘%RMV%‘ or program like ‘%LMD%‘ or program like ‘%LMON%‘ or event like ‘%drm freeze%‘) group by to_char(sample_time,‘Mondd_hh24mi‘), inst_id, program, event order by 1,2,3,5 desc, 4; PROMPT PROMPT POLICY HISTORY INFO: PROMPT See if you can correlate policy history events with minutes of high PROMPT wait time. select * from gv$policy_history order by event_date; PROMPT PROMPT DYNAMIC_REMASTER_STATS PROMPT This shows where time is spent during DRM operations. set heading off set lines 60 select ‘Instance: ‘||inst_id inst, ‘Remaster Ops: ‘||remaster_ops rops, ‘Remaster Time: ‘||remaster_time rtime, ‘Remastered Objects: ‘||remastered_objects robjs, ‘Quiesce Time: ‘||quiesce_time qtime, ‘Freeze Time: ‘||freeze_time ftime, ‘Cleanup Time: ‘||cleanup_time ctime, ‘Replay Time: ‘||replay_time rptime, ‘Fixwrite Time: ‘||fixwrite_time fwtime, ‘Sync Time: ‘||sync_time stime, ‘Resources Cleaned: ‘||resources_cleaned rclean, ‘Replayed Locks Sent: ‘||replayed_locks_sent rlockss, ‘Replayed Locks Received: ‘||replayed_locks_received rlocksr, ‘Current Objects: ‘||current_objects from gv$dynamic_remaster_stats order by 1; set lines 120 set heading on PROMPT PROMPT OBJECT_POLICY_STATISTICS: PROMPT The sum of the last 3 columns (sopens,xopens,xfers) decides whether the object PROMPT will be considered for DRM (_gc_policy_minimum). The duration of the stats PROMPT are controlled by _gc_policy_time (default is 10 minutes). select object,node,sopens,xopens,xfers from x$object_policy_statistics; PROMPT PROMPT ACTIVE OBJECTS (OBJECT_POLICY_STATISTICS) PROMPT These are the objects that are above the default _gc_policy_minimum (1500). select object, node, sopens+xopens+xfers activity from x$object_policy_statistics where sopens+xopens+xfers > 1500 order by 3 desc; PROMPT LWM FOR LE FREELIST PROMPT This number should never get near zero, if it does consider the fix for bug 14791477 PROMPT and/or increasing _gc_element_percent. select sum(lwm) from x$kclfx; PROMPT PROMPT GCSPFMASTER INFO WITH OBJECT NAMES column objname format a120 tru select o.name || ‘ - ‘|| o.subname objname, o.type#, h.* from v$gcspfmaster_info h, obj$ o where h.data_object_id=o.dataobj# order by data_object_id; PROMPT PROMPT ASH DETAILS FOR WORST MINUTES: PROMPT PROMPT APPROACH: If you cannot determine the problem from the data PROMPT above, you may need to look at the details of what each session PROMPT is doing during each ‘bad‘ snap. Most likely you will want to PROMPT note the times of the high drm freeze waits, look at what PROMPT LMS, RMV, LMD0, LMON is doing at those times, and go from there... set lines 140 column program format a45 wra column sample_time format a25 tru column event format a30 tru column time_waited format 999999.999 column p1 format a40 tru column p2 format a40 tru column p3 format a40 tru select sample_time, inst_id inst, session_id, program, event, time_waited/1000 TIME_WAITED, p1text||‘: ‘||p1 p1,p2text||‘: ‘||p2 p2,p3text||‘: ‘||p3 p3 from gv$active_session_history where to_char(sample_time,‘Mondd_hh24mi‘) in (select to_char(sample_time,‘Mondd_hh24mi‘) from gv$active_session_history where event like ‘%drm freeze%‘ group by to_char(sample_time,‘Mondd_hh24mi‘), inst_id having sum(time_waited)/1000 > &&threshold) and time_waited > 0.5 order by 1,2,3,4,5; spool off PROMPT PROMPT OUTPUT FILE IS: drmdiag_&&dbname&×tamp&&suffix PROMPT
相关文档: 390483.1
ORA-00445: Background Process "xxxx" Did Not Start After 120 Seconds (文档 ID 1345364.1)
add/modify this parameter in /etc/sysctl.conf
kernel.randomize_va_space=0
kernel.exec-shield=0
原文地址:http://snowhill.blog.51cto.com/339421/1975703