Monday, October 22, 2018

ORA-32701: Possible hangs up to hang ID=6 detected


Version: 11.2.0.4 RAC, two nodes

-------------

Error in the DB alert log:

ORA-32701: Possible hangs up to hang ID=6 detected




Tue Oct 23 10:32:55 2018
Errors in file /ORA/db/diag/rdbms/lg/LG1/trace/LG1_dia0_438170.trc  (incident=196081):
ORA-32701: Possible hangs up to hang ID=6 detected
Incident details in: /ORA/db/diag/rdbms/lg/LG1/incident/incdir_196081/LG1_dia0_438170_i196081.trc
DIA0 requesting termination of session sid:789 with serial # 39305 (ospid:103221) on instance 2
     due to a GLOBAL, HIGH confidence hang with ID=6.
     Hang Resolution Reason: Although the number of affected sessions did not
    justify automatic hang resolution initially, this previously ignored
    hang was automatically resolved.
DIA0: Examine the alert log on instance 2 for session termination status of hang with ID=6.
Tue Oct 23 10:32:56 2018
Sweep [inc][196081]: completed
Sweep [inc2][196081]: completed
Tue Oct 23 10:34:29 2018



Something was hung, and the DB automatically killed the hanging process.



Trace file contents:

*** 2018-10-23 10:32:55.067
Resolvable Hangs in the System
                     Root       Chain Total               Hang
  Hang Hang          Inst Root  #hung #hung  Hang   Hang  Resolution
    ID Type Status   Num  Sess   Sess  Sess  Conf   Span  Action
 ----- ---- -------- ---- ----- ----- ----- ------ ------ -------------------
      6 HANG RSLNPEND    2   789     2     2   HIGH GLOBAL Terminate Process
  Hang Resolution Reason: Although the number of affected sessions did not
    justify automatic hang resolution initially, this previously ignored
    hang was automatically resolved.

  inst# SessId  Ser#     OSPID PrcNm Event
  ----- ------ ----- --------- ----- -----
      1    674 21509    268777  M001 enq: WF - contention
      2    789 39305    103221  M000 not in wait

  The incident file on instance 2 may contain information
  about session 789 with serial number 39305 and operating system
  process ID 103221.  The incident file may contain a short stack
  and a process state dump.

Victim Information
                                                                      Ignored
  HangID  Inst#  Sessid  Ser Num      OSPID  Fatal BG  Previous Hang    Count
  ------  -----  ------  -------  ---------  --------  -------------  -------
       6      2     789    39305     103221     N      Existing Hang        1

*** 2018-10-23 10:32:55.067


From the output above, the session that got killed is actually M000 on instance 2 (sid 789): it was hanging instance 1's M001, so DIA0 terminated the blocker.
Reading further down the trace confirms this:

-------------------------------------------------------------------------------
Chain 1:
-------------------------------------------------------------------------------
    Oracle session identified by:
    {
                instance: 1 (lg.lg1)
                   os id: 268777
              process id: 99, oracle@ora-36 (M001)
              session id: 674
        session serial #: 21509
    }
    is waiting for 'enq: WF - contention' with wait info:
    {
                      p1: 'name|mode'=0x57460006
                      p2: '0'=0x38
                      p3: '0'=0x0
            time in wait: 1 min 56 sec
           timeout after: never
                 wait id: 5150
                blocking: 0 sessions
            wait history:
              * time between current wait and wait #1: 0.000306 sec
              1.       event: 'db file sequential read'
                 time waited: 0.000270 sec
                     wait id: 5149            p1: 'file#'=0x6c
                                              p2: 'block#'=0x99fa
                                              p3: 'blocks'=0x1
              * time between wait #1 and #2: 0.000638 sec
              2.       event: 'db file sequential read'
                 time waited: 0.000477 sec
                     wait id: 5148            p1: 'file#'=0x6c
                                              p2: 'block#'=0x99f9
                                              p3: 'blocks'=0x1
              * time between wait #2 and #3: 0.000106 sec
              3.       event: 'db file sequential read'
                 time waited: 0.000301 sec
                     wait id: 5147            p1: 'file#'=0x6c
                                              p2: 'block#'=0x9d07
                                              p3: 'blocks'=0x1
    }
    and is blocked by
 => Oracle session identified by:
    {
                instance: 2 (lg.lg2)
                   os id: 103221
              process id: 453, oracle@ora-35 (M000)
              session id: 789
        session serial #: 39305
    }
    which is not in a wait:
    {
               last wait: 26 min 36 sec ago
                blocking: 1 session
            wait history:
              1.       event: 'gc current grant 2-way'
                 time waited: 0.000089 sec
                     wait id: 6377            p1: ''=0x5
                                              p2: ''=0x36cb
                                              p3: ''=0x20001ea
              * time between wait #1 and #2: 0.000053 sec
              2.       event: 'gc current grant 2-way'
                 time waited: 0.000138 sec
                     wait id: 6376            p1: ''=0x5
                                              p2: ''=0x36ca
                                              p3: ''=0x20001ea
              * time between wait #2 and #3: 0.000419 sec
              3.       event: 'gc current grant 2-way'
                 time waited: 0.000110 sec
                     wait id: 6375            p1: ''=0x5
                                              p2: ''=0x36c9
                                              p3: ''=0x20001ea
    }

Chain 1 Signature: <not in a wait><='enq: WF - contention'
Chain 1 Signature Hash: 0xee4c6e02
-------------------------------------------------------------------------------


That said, seeing M000 here is nothing to worry about. Processes named M000, M001, and so on are MMON slave processes that Oracle uses to collect AWR-related data, so killing one has no immediate impact on the system.
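
If you want to confirm which MMON slaves are running at any moment, a quick check against GV$SESSION works; a minimal sketch, assuming the default naming where the slave name shows up in parentheses in PROGRAM:

-- List MMON slave sessions on all instances (program ends with (M000), (M001), ...)
SELECT inst_id, sid, serial#, program, event, status
  FROM gv$session
 WHERE program LIKE '%(M0%';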


However, M000 being hung and killed does suggest there is a problem with AWR, or with the system's use of SYS.DBMS_STATS.GATHER_TABLE_STATS.
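
Since M001 was stuck on 'enq: WF - contention' (the enqueue that serializes AWR flushes), one way to see who holds or is waiting on that enqueue is GV$LOCK; a sketch:

-- Holders (LMODE > 0) and waiters (REQUEST > 0) of the WF enqueue
SELECT inst_id, sid, type, id1, id2, lmode, request, block
  FROM gv$lock
 WHERE type = 'WF';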


Below are some cases I have run into.

Case 1: An application executed a DELETE statement but never committed or rolled back, and it sat there for almost a day.
The application itself also appeared to have died, so the resources held by that transaction could never be released. By the time we noticed, it was the middle of the night and the DB was running its Automated Maintenance Tasks (Optimizer Statistics Gathering). Because of that open DELETE, statistics gathering on the table kept hanging, which saturated system resources and made the session count explode.
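
To spot a transaction like this one (open for a long time with no commit or rollback), a minimal sketch that joins GV$TRANSACTION to GV$SESSION:

-- Long-open transactions; GV$TRANSACTION.ADDR joins to GV$SESSION.TADDR
SELECT s.inst_id, s.sid, s.serial#, s.username, t.start_time, s.sql_id
  FROM gv$transaction t
  JOIN gv$session s
    ON s.inst_id = t.inst_id
   AND s.taddr   = t.addr;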

Resolution: kill that DELETE session. In my case, weirdly enough, it could not be killed, and PMON could not release the resources either, so as a stopgap we disabled Optimizer Statistics Gathering. A month later, though, because statistics had not been gathered for so long, some SQL execution plans went off the rails, and in the end it took a DB restart to fix.
If a restart is not possible, gather statistics manually instead; a sketch of these steps follows.
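
For reference, the steps above roughly map to the commands below. This is only a sketch: the sid,serial# pair and HR.MY_TABLE are placeholders I made up, so substitute the values from your own hang chain.

-- 1. Try to kill the offending session (run on the instance that owns it;
--    '123,45678' is a placeholder sid,serial# pair)
ALTER SYSTEM KILL SESSION '123,45678' IMMEDIATE;

-- 2. Stopgap: disable the automatic statistics gathering task
BEGIN
  DBMS_AUTO_TASK_ADMIN.DISABLE(
    client_name => 'auto optimizer stats collection',
    operation   => NULL,
    window_name => NULL);
END;
/

-- 3. While the task is off, gather statistics manually
--    (HR.MY_TABLE is a hypothetical table name)
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname => 'HR',
    tabname => 'MY_TABLE',
    cascade => TRUE);
END;
/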


Case 2: Before ORA-32701: Possible hangs up to hang ID=6 detected occurred, EM showed a flood of insert into wrh$_sql_bind_metadata statements.

It looks like wrh$_sql_bind_metadata was the table causing the problem.
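
One quick sanity check is how large that table has grown; a sketch against DBA_SEGMENTS (run as a DBA user):

-- Size of the suspect AWR table (owned by SYS)
SELECT owner, segment_name, ROUND(bytes/1024/1024) AS mb
  FROM dba_segments
 WHERE segment_name = 'WRH$_SQL_BIND_METADATA';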


Use the following command:

alter system set "_awr_disabled_flush_tables" = 'wrh$_sql_bind_metadata';

This command is dynamic, so no instance restart is needed; to disable more than one table, separate the table names with commas in the parameter value.

This disables the flushing of the statistics related to the table above (ASH to AWR).
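
To check that the hidden parameter actually took effect, you can read it back from the X$ views; a sketch that must be run as SYS:

-- Read back the hidden parameter (X$ tables are visible only to SYS)
SELECT a.ksppinm AS name, b.ksppstvl AS value
  FROM x$ksppi  a,
       x$ksppcv b
 WHERE a.indx = b.indx
   AND a.ksppinm = '_awr_disabled_flush_tables';

-- Hypothetical example of disabling two tables at once (comma-separated):
-- alter system set "_awr_disabled_flush_tables" = 'wrh$_sql_bind_metadata,wrh$_sql_plan';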