版本 : 11.2.0.4 RAC 雙節點
-------------
DB alert log 錯誤
ORA-32701: Possible hangs up to hang ID=6 detected
Tue Oct 23 10:32:55 2018
Errors in file /ORA/db/diag/rdbms/lg/LG1/trace/LG1_dia0_438170.trc (incident=196081):
ORA-32701: Possible hangs up to hang ID=6 detected
Incident details in: /ORA/db/diag/rdbms/lg/LG1/incident/incdir_196081/LG1_dia0_438170_i196081.trc
DIA0 requesting termination of session sid:789 with serial # 39305 (ospid:103221) on instance 2
due to a GLOBAL, HIGH confidence hang with ID=6.
Hang Resolution Reason: Although the number of affected sessions did not
justify automatic hang resolution initially, this previously ignored
hang was automatically resolved.
DIA0: Examine the alert log on instance 2 for session termination status of hang with ID=6.
Tue Oct 23 10:32:56 2018
Sweep [inc][196081]: completed
Sweep [inc2][196081]: completed
Tue Oct 23 10:34:29 2018
有東西被 hangs 然後 DB 自動去砍掉 hangs 的process
trc 內容
*** 2018-10-23 10:32:55.067
Resolvable Hangs in the System
Root Chain Total Hang
Hang Hang Inst Root #hung #hung Hang Hang Resolution
ID Type Status Num Sess Sess Sess Conf Span Action
----- ---- -------- ---- ----- ----- ----- ------ ------ -------------------
6 HANG RSLNPEND 2 789 2 2 HIGH GLOBAL Terminate Process
Hang Resolution Reason: Although the number of affected sessions did not
justify automatic hang resolution initially, this previously ignored
hang was automatically resolved.
inst# SessId Ser# OSPID PrcNm Event
----- ------ ----- --------- ----- -----
1 674 21509 268777 M001 enq: WF - contention
2 789 39305 103221 M000 not in wait
The incident file on instance 2 may contain information
about session 789 with serial number 39305 and operating system
process ID 103221. The incident file may contain a short stack
and a process state dump.
Victim Information
Ignored
HangID Inst# Sessid Ser Num OSPID Fatal BG Previous Hang Count
------ ----- ------ ------- --------- -------- ------------- -------
6 2 789 39305 103221 N Existing Hang 1
*** 2018-10-23 10:32:55.067
這邊看到 是 inst 1 的 M001 被砍掉 , 原因是 被 inst 2 的 M000 hangs 住了 M001
在往 trc 下看 確實是
-------------------------------------------------------------------------------
Chain 1:
-------------------------------------------------------------------------------
Oracle session identified by:
{
instance: 1 (lg.lg1)
os id: 268777
process id: 99, oracle@ora-36 (M001)
session id: 674
session serial #: 21509
}
is waiting for 'enq: WF - contention' with wait info:
{
p1: 'name|mode'=0x57460006
p2: '0'=0x38
p3: '0'=0x0
time in wait: 1 min 56 sec
timeout after: never
wait id: 5150
blocking: 0 sessions
wait history:
* time between current wait and wait #1: 0.000306 sec
1. event: 'db file sequential read'
time waited: 0.000270 sec
wait id: 5149 p1: 'file#'=0x6c
p2: 'block#'=0x99fa
p3: 'blocks'=0x1
* time between wait #1 and #2: 0.000638 sec
2. event: 'db file sequential read'
time waited: 0.000477 sec
wait id: 5148 p1: 'file#'=0x6c
p2: 'block#'=0x99f9
p3: 'blocks'=0x1
* time between wait #2 and #3: 0.000106 sec
3. event: 'db file sequential read'
time waited: 0.000301 sec
wait id: 5147 p1: 'file#'=0x6c
p2: 'block#'=0x9d07
p3: 'blocks'=0x1
}
and is blocked by
=> Oracle session identified by:
{
instance: 2 (lg.lg2)
os id: 103221
process id: 453, oracle@ora-35 (M000)
session id: 789
session serial #: 39305
}
which is not in a wait:
{
last wait: 26 min 36 sec ago
blocking: 1 session
wait history:
1. event: 'gc current grant 2-way'
time waited: 0.000089 sec
wait id: 6377 p1: ''=0x5
p2: ''=0x36cb
p3: ''=0x20001ea
* time between wait #1 and #2: 0.000053 sec
2. event: 'gc current grant 2-way'
time waited: 0.000138 sec
wait id: 6376 p1: ''=0x5
p2: ''=0x36ca
p3: ''=0x20001ea
* time between wait #2 and #3: 0.000419 sec
3. event: 'gc current grant 2-way'
time waited: 0.000110 sec
wait id: 6375 p1: ''=0x5
p2: ''=0x36c9
p3: ''=0x20001ea
}
Chain 1 Signature: <not in a wait><='enq: WF - contention'
Chain 1 Signature Hash: 0xee4c6e02
-------------------------------------------------------------------------------
不過看到 M000 這個其實也不用看擔心, M000 開頭的 ,是 Oracle 是mmon 的process, 用於收集AWR 相關數據 ,所以被砍掉暫時不會對系統造成影響
不過 M000 hang住被砍掉表示 AWR 或者 系統使用 SYS.DBMS_STATS.GATHER_TABLE_STATS 有問題
以下是我遇過的案例
案例1.有一條 delete sql , AP執行了但未commit及rollback,卡了快一天
AP也似乎掛掉,導致這一條資源一直無法釋放, 當發現的時候是半夜DB在做 Automated Maintenance Tasks (Optimizer Statistics Gathering
),因為這條delete sql 讓 table 無法收集資訊一直hangs, 導致系統資源吃滿 session 爆量
解決: 此delete sql 砍掉, 但我這邊見鬼的砍不掉 ,pmon也無法釋放該資源 ,暫時先把Optimizer Statistics Gathering 關掉, 但過一個月後因為太久沒有 Optimizer Statistics Gathering, sql的執行計畫有一些跑掉, 這邊是重開DB 解決
要是無法重開 就手動收集Statistics吧
案例2.在ORA-32701: Possible hangs up to hang ID=6 detected 發生前 EM 有 大量發生 insert into wrh$_sql_bind_metadata
看來是這個 wrh$_sql_bind_metadata 出現問題
使用下面的命令:
alter system set "_awr_disabled_flush_tables" = 'wrh$_sql_bind_metadata';
該命令是動態的,不需要重啟實例,禁用多個表時參數用逗號分隔;
禁用了和上面表相關的統計信息flush ash to AWR