2018年10月22日 星期一

ORA-32701: Possible hangs up to hang ID=6 detected


版本 : 11.2.0.4 RAC 雙節點

-------------

DB alert log 錯誤

ORA-32701: Possible hangs up to hang ID=6 detected




Tue Oct 23 10:32:55 2018
Errors in file /ORA/db/diag/rdbms/lg/LG1/trace/LG1_dia0_438170.trc  (incident=196081):
ORA-32701: Possible hangs up to hang ID=6 detected
Incident details in: /ORA/db/diag/rdbms/lg/LG1/incident/incdir_196081/LG1_dia0_438170_i196081.trc
DIA0 requesting termination of session sid:789 with serial # 39305 (ospid:103221) on instance 2
     due to a GLOBAL, HIGH confidence hang with ID=6.
     Hang Resolution Reason: Although the number of affected sessions did not
    justify automatic hang resolution initially, this previously ignored
    hang was automatically resolved.
DIA0: Examine the alert log on instance 2 for session termination status of hang with ID=6.
Tue Oct 23 10:32:56 2018
Sweep [inc][196081]: completed
Sweep [inc2][196081]: completed
Tue Oct 23 10:34:29 2018



有東西被 hangs 然後 DB 自動去砍掉 hangs 的process



trc 內容

*** 2018-10-23 10:32:55.067
Resolvable Hangs in the System
                     Root       Chain Total               Hang
  Hang Hang          Inst Root  #hung #hung  Hang   Hang  Resolution
    ID Type Status   Num  Sess   Sess  Sess  Conf   Span  Action
 ----- ---- -------- ---- ----- ----- ----- ------ ------ -------------------
      6 HANG RSLNPEND    2   789     2     2   HIGH GLOBAL Terminate Process
  Hang Resolution Reason: Although the number of affected sessions did not
    justify automatic hang resolution initially, this previously ignored
    hang was automatically resolved.

  inst# SessId  Ser#     OSPID PrcNm Event
  ----- ------ ----- --------- ----- -----
      1    674 21509    268777  M001 enq: WF - contention
      2    789 39305    103221  M000 not in wait

  The incident file on instance 2 may contain information
  about session 789 with serial number 39305 and operating system
  process ID 103221.  The incident file may contain a short stack
  and a process state dump.

Victim Information
                                                                      Ignored
  HangID  Inst#  Sessid  Ser Num      OSPID  Fatal BG  Previous Hang    Count
  ------  -----  ------  -------  ---------  --------  -------------  -------
       6      2     789    39305     103221     N      Existing Hang        1

*** 2018-10-23 10:32:55.067


這邊看到 是 inst 1 的  M001  被砍掉 , 原因是 被 inst 2 的 M000 hangs 住了 M001 
在往  trc 下看 確實是

-------------------------------------------------------------------------------
Chain 1:
-------------------------------------------------------------------------------
    Oracle session identified by:
    {
                instance: 1 (lg.lg1)
                   os id: 268777
              process id: 99, oracle@ora-36 (M001)
              session id: 674
        session serial #: 21509
    }
    is waiting for 'enq: WF - contention' with wait info:
    {
                      p1: 'name|mode'=0x57460006
                      p2: '0'=0x38
                      p3: '0'=0x0
            time in wait: 1 min 56 sec
           timeout after: never
                 wait id: 5150
                blocking: 0 sessions
            wait history:
              * time between current wait and wait #1: 0.000306 sec
              1.       event: 'db file sequential read'
                 time waited: 0.000270 sec
                     wait id: 5149            p1: 'file#'=0x6c
                                              p2: 'block#'=0x99fa
                                              p3: 'blocks'=0x1
              * time between wait #1 and #2: 0.000638 sec
              2.       event: 'db file sequential read'
                 time waited: 0.000477 sec
                     wait id: 5148            p1: 'file#'=0x6c
                                              p2: 'block#'=0x99f9
                                              p3: 'blocks'=0x1
              * time between wait #2 and #3: 0.000106 sec
              3.       event: 'db file sequential read'
                 time waited: 0.000301 sec
                     wait id: 5147            p1: 'file#'=0x6c
                                              p2: 'block#'=0x9d07
                                              p3: 'blocks'=0x1
    }
    and is blocked by
 => Oracle session identified by:
    {
                instance: 2 (lg.lg2)
                   os id: 103221
              process id: 453, oracle@ora-35 (M000)
              session id: 789
        session serial #: 39305
    }
    which is not in a wait:
    {
               last wait: 26 min 36 sec ago
                blocking: 1 session
            wait history:
              1.       event: 'gc current grant 2-way'
                 time waited: 0.000089 sec
                     wait id: 6377            p1: ''=0x5
                                              p2: ''=0x36cb
                                              p3: ''=0x20001ea
              * time between wait #1 and #2: 0.000053 sec
              2.       event: 'gc current grant 2-way'
                 time waited: 0.000138 sec
                     wait id: 6376            p1: ''=0x5
                                              p2: ''=0x36ca
                                              p3: ''=0x20001ea
              * time between wait #2 and #3: 0.000419 sec
              3.       event: 'gc current grant 2-way'
                 time waited: 0.000110 sec
                     wait id: 6375            p1: ''=0x5
                                              p2: ''=0x36c9
                                              p3: ''=0x20001ea
    }

Chain 1 Signature: <not in a wait><='enq: WF - contention'
Chain 1 Signature Hash: 0xee4c6e02
-------------------------------------------------------------------------------


不過看到 M000 這個其實也不用看擔心, M000 開頭的 ,是 Oracle 是mmon 的process, 用於收集AWR 相關數據 ,所以被砍掉暫時不會對系統造成影響


不過 M000 hang住被砍掉表示 AWR 或者 系統使用 SYS.DBMS_STATS.GATHER_TABLE_STATS 有問題 


以下是我遇過的案例

案例1.有一條 delete sql , AP執行了但未commit及rollback,卡了快一天
AP也似乎掛掉,導致這一條資源一直無法釋放, 當發現的時候是半夜DB在做 Automated Maintenance Tasks (Optimizer Statistics Gathering
),因為這條delete sql 讓 table 無法收集資訊一直hangs, 導致系統資源吃滿 session 爆量

解決: 此delete sql 砍掉, 但我這邊見鬼的砍不掉 ,pmon也無法釋放該資源 ,暫時先把Optimizer Statistics Gathering 關掉, 但過一個月後因為太久沒有 Optimizer Statistics Gathering, sql的執行計畫有一些跑掉, 這邊是重開DB 解決
要是無法重開 就手動收集Statistics吧


案例2.在ORA-32701: Possible hangs up to hang ID=6 detected 發生前 EM 有 大量發生 insert into wrh$_sql_bind_metadata

看來是這個 wrh$_sql_bind_metadata 出現問題 

















使用下面的命令:

alter system set "_awr_disabled_flush_tables" = 'wrh$_sql_bind_metadata';

該命令是動態的,不需要重啟實例,禁用多個表時參數用逗號分隔;

禁用了和上面表相關的統計信息flush ash to AWR
















2018年6月6日 星期三

ORACLE 11G 設定帶有參數的 SCHEDULER JOB


---------- 建立一個 SCHEDULER 裏頭的 PROGRAM 先設定變數型態
---------- 以下範例的 PROCEDURE 是需要帶入兩個參數 , 這邊先設定變數型態 一個是 NUMBER 另外一個是 VARCHAR2

BEGIN
  SYS.DBMS_SCHEDULER.CREATE_PROGRAM
    (
      program_name         => 'DBA_TN_REPORT_P'
     ,program_type         => 'STORED_PROCEDURE'
     ,program_action       => 'DBA_TN_REPORT'
     ,number_of_arguments  => 2
     ,enabled              => FALSE
     ,comments             => 'DBA_TN_REPORT'
    );

  SYS.DBMS_SCHEDULER.DEFINE_PROGRAM_ARGUMENT
    (
      program_name         => 'DBA_TN_REPORT_P'
     ,argument_name        => 't_thread_no'
     ,argument_position    => 1
     ,argument_type        => 'NUMBER'
     ,default_value        => ''
    );

   SYS.DBMS_SCHEDULER.DEFINE_PROGRAM_ARGUMENT
    (
      program_name         => 'DBA_TN_REPORT_P'
     ,argument_name        => 't_over_night'
     ,argument_position    => 2
     ,argument_type        => 'VARCHAR2'
     ,default_value        => ''
    );

  SYS.DBMS_SCHEDULER.ENABLE
    (name                  => 'DBA_TN_REPORT_P');
END;
/



---------- 建立一個 SCHEDULER 設定排程, 且這邊帶入參數

BEGIN
  SYS.DBMS_SCHEDULER.CREATE_JOB
    (
       job_name        => 'DBA_TN_REPORT_JOB'
      ,start_date      => TO_TIMESTAMP_TZ('2017/04/14 01:54:00.000000 +08:00','yyyy/mm/dd hh24:mi:ss.ff tzr')
      ,repeat_interval => 'FREQ=minutely; INTERVAL= 3'
      ,end_date        => NULL
      ,program_name    => 'DBA_TN_REPORT_P'
      ,comments        => 'DBA_TN_REPORT_P'
    );
  SYS.DBMS_SCHEDULER.SET_ATTRIBUTE
    ( name      => 'DBA_TN_REPORT_JOB'
     ,attribute => 'RESTARTABLE'
     ,value     => FALSE);
  SYS.DBMS_SCHEDULER.SET_ATTRIBUTE
    ( name      => 'DBA_TN_REPORT_JOB'
     ,attribute => 'LOGGING_LEVEL'
     ,value     => SYS.DBMS_SCHEDULER.LOGGING_OFF);
  SYS.DBMS_SCHEDULER.SET_ATTRIBUTE_NULL
    ( name      => 'DBA_TN_REPORT_JOB'
     ,attribute => 'MAX_FAILURES');
  SYS.DBMS_SCHEDULER.SET_ATTRIBUTE_NULL
    ( name      => 'DBA_TN_REPORT_JOB'
     ,attribute => 'MAX_RUNS');
  SYS.DBMS_SCHEDULER.SET_ATTRIBUTE
    ( name      => 'DBA_TN_REPORT_JOB'
     ,attribute => 'STOP_ON_WINDOW_CLOSE'
     ,value     => FALSE);
  SYS.DBMS_SCHEDULER.SET_ATTRIBUTE
    ( name      => 'DBA_TN_REPORT_JOB'
     ,attribute => 'JOB_PRIORITY'
     ,value     => 3);
  SYS.DBMS_SCHEDULER.SET_ATTRIBUTE_NULL
    ( name      => 'DBA_TN_REPORT_JOB'
     ,attribute => 'SCHEDULE_LIMIT');
  SYS.DBMS_SCHEDULER.SET_ATTRIBUTE
    ( name      => 'DBA_TN_REPORT_JOB'
     ,attribute => 'AUTO_DROP'
     ,value     => FALSE);

---  這邊帶入想要的值進去
  SYS.DBMS_SCHEDULER.SET_JOB_ARGUMENT_VALUE
    ( job_name             => 'DBA_TN_REPORT_JOB'
     ,argument_name        => 't_thread_no'
     ,argument_value       => 1);
 
 
  SYS.DBMS_SCHEDULER.SET_JOB_ARGUMENT_VALUE
    ( job_name             => 'DBA_TN_REPORT_JOB'
     ,argument_name        => 't_over_night'
     ,argument_value       => 's');
END;
/

ORACLE RAC ASM 出現 壞BLOCK


線上業務系統的DB,,, 突然被告知無法寫資料進去某張table

本以為是 table block 壞掉,,  
但被RD人員無法執行的SQL是 insert into .... select * from 

噴出來的錯誤訊息是 
### Cause: java.sql.SQLException: ORA-01578: ORACLE data block corrupted (file # 39, block # 288386)

查了一下 此 block 的 object  是什麼
select * from dba_extents where file_id = 39  and block_id >= 288381   and  block_id  <= 288386 +10

還好此 object 是 index ....  重新 rebuild  就可以正常運作


之後使用dbv 檢測一下此datafile
* 這邊的userid 必須可以登入 asm 且需要 sysasm 權限


[grid@db-15 ~]$ dbv file='+***_***01/**/datafile/l****.285.9909' userid=sysasm/******

DBVERIFY: Release 11.2.0.4.0 - Production on Wed Jun 6 15:25:28 2018

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

DBVERIFY - Verification starting : FILE = +***_***01/**/datafile/l****.285.976987909


DBVERIFY - Verification complete

Total Pages Examined         : 1310720
Total Pages Processed (Data) : 431638
Total Pages Failing   (Data) : 0
Total Pages Processed (Index): 345145
Total Pages Failing   (Index): 0
Total Pages Processed (Other): 4632
Total Pages Processed (Seg)  : 0
Total Pages Failing   (Seg)  : 0
Total Pages Empty            : 529305
Total Pages Marked Corrupt   : 0
Total Pages Influx           : 0
Total Pages Encrypted        : 0
Highest block SCN            : 0 (0.0)


沒有看到 Total Pages Marked Corrupt  筆數
...............

select * from v$database_block_corruption; 
也沒有資料了


比較可惜的是 此案例是線上生產DB, 比較急著要處理 ,就沒有完整紀錄處理前後DB的狀態





案例二:
出現了也是 相關錯誤, 但不是index ,

而是 partition table 的extent出現錯誤

這不過是已經沒資料的 partition table , 剛好可以直接砍掉, 
在重新執行一次 就沒問題


看來搜尋到空的partition table 的壞掉區段 他也會爆錯








2018年4月9日 星期一

ORACLE SINGLE DB + DATA GUARD RESIZE REDO LOG


Oracle single db + data guard  ,  resize redo log 

環境: 11.2.0.4

目前的 single db + data guard 的 redo log 和 standby redo log 都是50MB , 現在想要增加到 100MB



1.主庫修改  standby_file_management 為 MANUAL

show parameter standby_file_management
alter system set standby_file_management=manual;






2.對主庫刪除舊的 standby redo log , 並且直接新增 standby redo log


SQL> select group#,status from v$standby_log;

    GROUP# STATUS
---------- ----------
4 UNASSIGNED
5 UNASSIGNED
6 UNASSIGNED
7 UNASSIGNED

SQL> alter database drop standby logfile group 4;

Database altered.

SQL> alter database drop standby logfile group 5;

Database altered.

SQL> alter database drop standby logfile group 6;

Database altered.

SQL> alter database drop standby logfile group 7;

Database altered.
SQL> 
SQL> alter database add standby logfile group 13 ('/ORA/11.2.4/oradata/LG/redo13.log') size 100M; 

Database altered.

SQL> alter database add standby logfile group 14 ('/ORA/11.2.4/oradata/LG/redo14.log') size 100M; 

Database altered.

SQL> alter database add standby logfile group 15 ('/ORA/11.2.4/oradata/LG/redo15.log') size 100M; 

Database altered.

SQL> alter database add standby logfile group 16 ('/ORA/11.2.4/oradata/LG/redo16.log') size 100M; 

Database altered.

SQL> alter database add standby logfile group 17 ('/ORA/11.2.4/oradata/LG/redo17.log') size 100M; 

Database altered.

SQL> alter database add standby logfile group 18 ('/ORA/11.2.4/oradata/LG/redo18.log') size 100M; 

Database altered.

SQL> 
SQL> select group#,thread#,bytes,status from v$standby_log;

    GROUP#    THREAD#    BYTES STATUS
---------- ---------- ---------- ----------
13     0  104857600 UNASSIGNED
14     0  104857600 UNASSIGNED
15     0  104857600 UNASSIGNED
16     0  104857600 UNASSIGNED
17     0  104857600 UNASSIGNED
18     0  104857600 UNASSIGNED

6 rows selected.

SQL> 
SQL> 







3.對主庫新增 redo log, 再來刪除舊的 redo log 



SQL>   select group#,thread#,bytes,status,archived from v$log;

    GROUP#    THREAD#    BYTES STATUS   ARC
---------- ---------- ---------- ---------------- ---
1     1 52428800 CURRENT   NO
2     1 52428800 INACTIVE   YES
3     1 52428800 INACTIVE   YES

SQL> 

SQL> alter database add logfile group 8 ('/ORA/11.2.4/oradata/LG/redo08.log') size 100M reuse;

Database altered.

SQL> 
SQL> alter database add logfile group 9 ('/ORA/11.2.4/oradata/LG/redo09.log') size 100M reuse;

Database altered.

SQL> 
SQL> alter database add logfile group 10 ('/ORA/11.2.4/oradata/LG/redo10.log') size 100M reuse;

Database altered.

SQL> 
SQL> alter database add logfile group 11 ('/ORA/11.2.4/oradata/LG/redo11.log') size 100M reuse;

Database altered.

SQL> 
SQL> alter database add logfile group 12 ('/ORA/11.2.4/oradata/LG/redo12.log') size 100M reuse;

Database altered.

SQL> 
SQL> 
SQL> alter database drop logfile group 3;

Database altered.

SQL> alter database drop logfile group 2;

Database altered.

SQL> 

因為 group 1 狀態是 CURRENT, 使用check point 指令讓他變成 INACTIVE


SQL> 
SQL> alter system checkpoint;

System altered.

SQL> 
SQL> select group#,thread#,bytes,status,archived from v$log;

    GROUP#    THREAD#    BYTES STATUS   ARC
---------- ---------- ---------- ---------------- ---
1     1 52428800 INACTIVE   YES
8     1 83886080 CURRENT   NO
9     1 83886080 UNUSED   YES
10     1 83886080 UNUSED   YES
11     1 83886080 UNUSED   YES
12     1 83886080 UNUSED   YES

6 rows selected.

SQL> 
SQL> alter database drop logfile group 1;

Database altered.


SQL> 


select f.group#,l.thread#,l.bytes,l.status,f.type from v$logfile f,v$log l where f.group#=l.group#
union all
select f.group#,s.thread#,s.bytes,s.status,f.type from v$logfile f,v$standby_log s where f.group#=s.group#;





4.先在備庫處理 standby redo log


查看一下 備庫狀態
SQL> select group#,thread#,bytes,status,archived from v$log;

    GROUP#    THREAD#    BYTES STATUS   ARC
---------- ---------- ---------- ---------------- ---
1     1 52428800 CLEARING   YES
2     1 52428800 CLEARING   YES
3     1 52428800 CLEARING   YES

SQL> 
SQL> 
SQL> select group#,status from v$standby_log;

    GROUP# STATUS
---------- ----------
4 UNASSIGNED
5 UNASSIGNED
6 UNASSIGNED
7 UNASSIGNED

SQL> 

SQL> 


先關掉備庫同步

SQL> 
SQL> alter database recover managed standby database cancel;

Database altered.


刪除standby redo log

SQL> 
SQL>  alter database drop standby logfile group 4;

Database altered.

SQL>  alter database drop standby logfile group 5;

Database altered.

SQL>  alter database drop standby logfile group 6;

Database altered.

SQL>  alter database drop standby logfile group 7;

Database altered.

SQL> 
SQL> 
SQL> 

新增 standby redo log

SQL> 
SQL> alter database add standby logfile group 13 ('/ORA/11.2.4/oradata/LG/redo13.log') size 100M; 

Database altered.

SQL> alter database add standby logfile group 14 ('/ORA/11.2.4/oradata/LG/redo14.log') size 100M; 

Database altered.

SQL> alter database add standby logfile group 15 ('/ORA/11.2.4/oradata/LG/redo15.log') size 100M; 

Database altered.

SQL> alter database add standby logfile group 16 ('/ORA/11.2.4/oradata/LG/redo16.log') size 100M; 

Database altered.

SQL> alter database add standby logfile group 17 ('/ORA/11.2.4/oradata/LG/redo17.log') size 100M; 

Database altered.

SQL> alter database add standby logfile group 18 ('/ORA/11.2.4/oradata/LG/redo18.log') size 100M; 

Database altered.

SQL> select group#,status from v$standby_log;

    GROUP# STATUS
---------- ----------
13 UNASSIGNED
14 UNASSIGNED
15 UNASSIGNED
16 UNASSIGNED
17 UNASSIGNED
18 UNASSIGNED

6 rows selected.

SQL> 


備庫新增redo log 和刪除舊有的redo log

SQL> 
SQL> alter database add logfile group 8 ('/ORA/11.2.4/oradata/LG/redo08.log') size 100M reuse;

Database altered.

SQL> alter database add logfile group 9 ('/ORA/11.2.4/oradata/LG/redo09.log') size 100M reuse;

Database altered.

SQL> alter database add logfile group 10 ('/ORA/11.2.4/oradata/LG/redo10.log') size 100M reuse;

Database altered.

SQL> alter database add logfile group 11 ('/ORA/11.2.4/oradata/LG/redo11.log') size 100M reuse;

Database altered.

SQL> alter database add logfile group 12 ('/ORA/11.2.4/oradata/LG/redo12.log') size 100M reuse;

Database altered.

SQL> 
SQL> 
SQL>   select group#,thread#,bytes,status,archived from v$log
  2  ;

    GROUP#    THREAD#    BYTES STATUS   ARC
---------- ---------- ---------- ---------------- ---
1     1 52428800 CLEARING   YES
2     1 52428800 CLEARING   YES
3     1 52428800 UNUSED   YES
8     1  104857600 UNUSED   YES
9     1  104857600 UNUSED   YES
10     1  104857600 UNUSED   YES
11     1  104857600 CURRENT   YES
12     1  104857600 UNUSED   YES


8 rows selected.


group 3 status 是 UNUSED , 可以直接刪除

SQL>  alter database drop logfile group 3;


Database altered.

group 1,2 是 CLEARING ,刪除會爆 ORA-01624 錯誤 先執行 clear logfile


SQL>  alter database drop logfile group 2;
 alter database drop logfile group 2
*
ERROR at line 1:
ORA-01624: log 2 needed for crash recovery of instance LG (thread 1)
ORA-00312: online log 2 thread 1: '/ORA/11.2.4/oradata/LG/redo02.log'

--------------


SQL> 
SQL> ALTER DATABASE CLEAR LOGFILE GROUP 1;

Database altered.

SQL> ALTER DATABASE CLEAR LOGFILE GROUP 2;

Database altered.


SQL> 
SQL>  select group#,thread#,bytes,status,archived from v$log;

    GROUP#    THREAD#    BYTES STATUS   ARC
---------- ---------- ---------- ---------------- ---
1     1 52428800 UNUSED   YES
2     1 52428800 UNUSED   YES
8     1  104857600 UNUSED   YES
9     1  104857600 UNUSED   YES
10     1  104857600 UNUSED   YES
11     1  104857600 CURRENT   YES
12     1  104857600 UNUSED   YES

7 rows selected.

SQL>  alter database drop logfile group 1;

Database altered.

SQL>  alter database drop logfile group 2;

Database altered.

SQL> 

接下來正備庫執行


SQL> 
SQL> ALTER SYSTEM SET STANDBY_FILE_MANAGEMENT='AUTO';

System altered.

SQL> 

備庫恢復同步

SQL> 
SQL>  ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT;

Database altered.




正庫多次執行 alter system switch lofile;
看看 log 是否可以正常切換





RAC 

ALTER DATABASE ADD STANDBY LOGFILE THREAD 1 group 4 ('+DATA_TAB01/lg/onlinelog/') SIZE 1000M;
ALTER DATABASE ADD STANDBY LOGFILE THREAD 2 group 5 ('+DATA_TAB01/lg/onlinelog/') SIZE 1000M;

alter database add logfile THREAD 1 group 8 ('/ORA/11.2.4/oradata/LG/redo08.log') size 100M reuse;
alter database add logfile THREAD 2 group 9 ('/ORA/11.2.4/oradata/LG/redo08.log') size 100M reuse;