Monday, October 22, 2018

ORA-32701: Possible hangs up to hang ID=6 detected


Version: 11.2.0.4 RAC, two nodes

-------------

Error in the DB alert log:

ORA-32701: Possible hangs up to hang ID=6 detected




Tue Oct 23 10:32:55 2018
Errors in file /ORA/db/diag/rdbms/lg/LG1/trace/LG1_dia0_438170.trc  (incident=196081):
ORA-32701: Possible hangs up to hang ID=6 detected
Incident details in: /ORA/db/diag/rdbms/lg/LG1/incident/incdir_196081/LG1_dia0_438170_i196081.trc
DIA0 requesting termination of session sid:789 with serial # 39305 (ospid:103221) on instance 2
     due to a GLOBAL, HIGH confidence hang with ID=6.
     Hang Resolution Reason: Although the number of affected sessions did not
    justify automatic hang resolution initially, this previously ignored
    hang was automatically resolved.
DIA0: Examine the alert log on instance 2 for session termination status of hang with ID=6.
Tue Oct 23 10:32:56 2018
Sweep [inc][196081]: completed
Sweep [inc2][196081]: completed
Tue Oct 23 10:34:29 2018



Something was hung, and the DB automatically killed the hanging process.



Trace file contents:

*** 2018-10-23 10:32:55.067
Resolvable Hangs in the System
                     Root       Chain Total               Hang
  Hang Hang          Inst Root  #hung #hung  Hang   Hang  Resolution
    ID Type Status   Num  Sess   Sess  Sess  Conf   Span  Action
 ----- ---- -------- ---- ----- ----- ----- ------ ------ -------------------
      6 HANG RSLNPEND    2   789     2     2   HIGH GLOBAL Terminate Process
  Hang Resolution Reason: Although the number of affected sessions did not
    justify automatic hang resolution initially, this previously ignored
    hang was automatically resolved.

  inst# SessId  Ser#     OSPID PrcNm Event
  ----- ------ ----- --------- ----- -----
      1    674 21509    268777  M001 enq: WF - contention
      2    789 39305    103221  M000 not in wait

  The incident file on instance 2 may contain information
  about session 789 with serial number 39305 and operating system
  process ID 103221.  The incident file may contain a short stack
  and a process state dump.

Victim Information
                                                                      Ignored
  HangID  Inst#  Sessid  Ser Num      OSPID  Fatal BG  Previous Hang    Count
  ------  -----  ------  -------  ---------  --------  -------------  -------
       6      2     789    39305     103221     N      Existing Hang        1

*** 2018-10-23 10:32:55.067


From the output above, the session that got killed is actually M000 on instance 2 (sid 789): it was hanging instance 1's M001, so DIA0 terminated the blocker.
Reading further down the trace confirms this:

-------------------------------------------------------------------------------
Chain 1:
-------------------------------------------------------------------------------
    Oracle session identified by:
    {
                instance: 1 (lg.lg1)
                   os id: 268777
              process id: 99, oracle@ora-36 (M001)
              session id: 674
        session serial #: 21509
    }
    is waiting for 'enq: WF - contention' with wait info:
    {
                      p1: 'name|mode'=0x57460006
                      p2: '0'=0x38
                      p3: '0'=0x0
            time in wait: 1 min 56 sec
           timeout after: never
                 wait id: 5150
                blocking: 0 sessions
            wait history:
              * time between current wait and wait #1: 0.000306 sec
              1.       event: 'db file sequential read'
                 time waited: 0.000270 sec
                     wait id: 5149            p1: 'file#'=0x6c
                                              p2: 'block#'=0x99fa
                                              p3: 'blocks'=0x1
              * time between wait #1 and #2: 0.000638 sec
              2.       event: 'db file sequential read'
                 time waited: 0.000477 sec
                     wait id: 5148            p1: 'file#'=0x6c
                                              p2: 'block#'=0x99f9
                                              p3: 'blocks'=0x1
              * time between wait #2 and #3: 0.000106 sec
              3.       event: 'db file sequential read'
                 time waited: 0.000301 sec
                     wait id: 5147            p1: 'file#'=0x6c
                                              p2: 'block#'=0x9d07
                                              p3: 'blocks'=0x1
    }
    and is blocked by
 => Oracle session identified by:
    {
                instance: 2 (lg.lg2)
                   os id: 103221
              process id: 453, oracle@ora-35 (M000)
              session id: 789
        session serial #: 39305
    }
    which is not in a wait:
    {
               last wait: 26 min 36 sec ago
                blocking: 1 session
            wait history:
              1.       event: 'gc current grant 2-way'
                 time waited: 0.000089 sec
                     wait id: 6377            p1: ''=0x5
                                              p2: ''=0x36cb
                                              p3: ''=0x20001ea
              * time between wait #1 and #2: 0.000053 sec
              2.       event: 'gc current grant 2-way'
                 time waited: 0.000138 sec
                     wait id: 6376            p1: ''=0x5
                                              p2: ''=0x36ca
                                              p3: ''=0x20001ea
              * time between wait #2 and #3: 0.000419 sec
              3.       event: 'gc current grant 2-way'
                 time waited: 0.000110 sec
                     wait id: 6375            p1: ''=0x5
                                              p2: ''=0x36c9
                                              p3: ''=0x20001ea
    }

Chain 1 Signature: <not in a wait><='enq: WF - contention'
Chain 1 Signature Hash: 0xee4c6e02
-------------------------------------------------------------------------------


That said, seeing M000 here is nothing to worry about. Processes named M000, M001, and so on are MMON slave processes that Oracle uses to collect AWR-related data, so killing one has no immediate impact on the system.
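
If you want to confirm which MMON slaves are running at any moment, a quick check against GV$SESSION works; a minimal sketch, assuming the default naming where the slave name shows up in parentheses in PROGRAM:

-- List MMON slave sessions on all instances (program ends with (M000), (M001), ...)
SELECT inst_id, sid, serial#, program, event, status
  FROM gv$session
 WHERE program LIKE '%(M0%';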


However, M000 being hung and killed does suggest there is a problem with AWR, or with the system's use of SYS.DBMS_STATS.GATHER_TABLE_STATS.
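
Since M001 was stuck on 'enq: WF - contention' (the enqueue that serializes AWR flushes), one way to see who holds or is waiting on that enqueue is GV$LOCK; a sketch:

-- Holders (LMODE > 0) and waiters (REQUEST > 0) of the WF enqueue
SELECT inst_id, sid, type, id1, id2, lmode, request, block
  FROM gv$lock
 WHERE type = 'WF';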


Below are some cases I have run into.

Case 1: An application executed a DELETE statement but never committed or rolled back, and it sat there for almost a day.
The application itself also appeared to have died, so the resources held by that transaction could never be released. By the time we noticed, it was the middle of the night and the DB was running its Automated Maintenance Tasks (Optimizer Statistics Gathering). Because of that open DELETE, statistics gathering on the table kept hanging, which saturated system resources and made the session count explode.
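
To spot a transaction like this one (open for a long time with no commit or rollback), a minimal sketch that joins GV$TRANSACTION to GV$SESSION:

-- Long-open transactions; GV$TRANSACTION.ADDR joins to GV$SESSION.TADDR
SELECT s.inst_id, s.sid, s.serial#, s.username, t.start_time, s.sql_id
  FROM gv$transaction t
  JOIN gv$session s
    ON s.inst_id = t.inst_id
   AND s.taddr   = t.addr;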

Resolution: kill that DELETE session. In my case, weirdly enough, it could not be killed, and PMON could not release the resources either, so as a stopgap we disabled Optimizer Statistics Gathering. A month later, though, because statistics had not been gathered for so long, some SQL execution plans went off the rails, and in the end it took a DB restart to fix.
If a restart is not possible, gather statistics manually instead; a sketch of these steps follows.
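
For reference, the steps above roughly map to the commands below. This is only a sketch: the sid,serial# pair and HR.MY_TABLE are placeholders I made up, so substitute the values from your own hang chain.

-- 1. Try to kill the offending session (run on the instance that owns it;
--    '123,45678' is a placeholder sid,serial# pair)
ALTER SYSTEM KILL SESSION '123,45678' IMMEDIATE;

-- 2. Stopgap: disable the automatic statistics gathering task
BEGIN
  DBMS_AUTO_TASK_ADMIN.DISABLE(
    client_name => 'auto optimizer stats collection',
    operation   => NULL,
    window_name => NULL);
END;
/

-- 3. While the task is off, gather statistics manually
--    (HR.MY_TABLE is a hypothetical table name)
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname => 'HR',
    tabname => 'MY_TABLE',
    cascade => TRUE);
END;
/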


Case 2: Before ORA-32701: Possible hangs up to hang ID=6 detected occurred, EM showed a flood of insert into wrh$_sql_bind_metadata statements.

It looks like wrh$_sql_bind_metadata was the table causing the problem.
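
One quick sanity check is how large that table has grown; a sketch against DBA_SEGMENTS (run as a DBA user):

-- Size of the suspect AWR table (owned by SYS)
SELECT owner, segment_name, ROUND(bytes/1024/1024) AS mb
  FROM dba_segments
 WHERE segment_name = 'WRH$_SQL_BIND_METADATA';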


Use the following command:

alter system set "_awr_disabled_flush_tables" = 'wrh$_sql_bind_metadata';

This command is dynamic, so no instance restart is needed; to disable more than one table, separate the table names with commas in the parameter value.

This disables the flushing of the statistics related to the table above (ASH to AWR).
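
To check that the hidden parameter actually took effect, you can read it back from the X$ views; a sketch that must be run as SYS:

-- Read back the hidden parameter (X$ tables are visible only to SYS)
SELECT a.ksppinm AS name, b.ksppstvl AS value
  FROM x$ksppi  a,
       x$ksppcv b
 WHERE a.indx = b.indx
   AND a.ksppinm = '_awr_disabled_flush_tables';

-- Hypothetical example of disabling two tables at once (comma-separated):
-- alter system set "_awr_disabled_flush_tables" = 'wrh$_sql_bind_metadata,wrh$_sql_plan';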