Wednesday, December 19, 2018

LGWR (ospid: 40339): terminating the instance due to error 4021 (Active Data Guard)


One of our Active Data Guard standby instances crashed. Checking the alert log shows:



WARNING: inbound connection timed out (ORA-3136)
  Time: 19-DEC-2018 17:14:42
  Tracing not turned on.
  Tns error struct:
    ns main err code: 12535
TNS-12535: TNS:operation timed out
    ns secondary err code: 12606
    nt main err code: 0
    nt secondary err code: 0
    nt OS err code: 0
WARNING: inbound connection timed out (ORA-3136)
WARNING: inbound connection timed out (ORA-3136)

Wed Dec 19 17:14:42 2018
Errors in file /ORA/db/diag/rdbms/lg_st/LG/trace/LG_lgwr_189317.trc:
ORA-04021: timeout occurred while waiting to lock object
LGWR (ospid: 189317): terminating the instance due to error 4021


Before the crash, the log shows a flood of:
WARNING: inbound connection timed out (ORA-3136) 


The key line is:
LGWR (ospid: 189317): terminating the instance due to error 4021

LGWR is what brought the database down.


First, look at the LGWR trace file:

Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
ORACLE_HOME = /ORA/db/11.2.4
System name: Linux
Node name: ora-74
Release: 2.6.32-696.el6.x86_64
Version: #1 SMP Tue Feb 21 00:53:17 EST 2017
Machine: x86_64
Instance name: LG
Redo thread mounted by this instance: 1
Oracle process number: 21
Unix process pid: 189317, image: oracle@ora-74 (LGWR)


*** 2018-12-19 17:14:42.853
*** SESSION ID:(1261.1) 2018-12-19 17:14:42.853
*** CLIENT ID:() 2018-12-19 17:14:42.853
*** SERVICE NAME:(SYS$BACKGROUND) 2018-12-19 17:14:42.853
*** MODULE NAME:() 2018-12-19 17:14:42.853
*** ACTION NAME:() 2018-12-19 17:14:42.853
error 4021 detected in background process
ORA-04021: timeout occurred while waiting to lock object 
kjzduptcctx: Notifying DIAG for crash event
----- Abridged Call Stack Trace -----
ksedsts()+465<-kjzdssdmp()+267<-kjzduptcctx()+232<-kjzdicrshnfy()+63<-ksuitm()+5570<-ksbrdp()+3507<-opirip()+623<-opidrv()+603<-sou2o()+103<-opimai_real()+250<-ssthrdmain()+265<-main()+201<-__libc_start_main()+253 
----- End of Abridged Call Stack Trace -----

*** 2018-12-19 17:14:42.929
LGWR (ospid: 189317): terminating the instance due to error 4021
ksuitm: waiting up to [5] seconds before killing DIAG(189287)




These symptoms are a good match for what MOS (My Oracle Support) describes:

Bug 18242740 : ACTIVE STANDBY DATABASE CRASHED WITH ORA-4021
LGWR (ospid: Xxx): Terminating The Instance Due To Error 4021 (Doc ID 2177473.1)



On an Active Data Guard standby, LGWR needs to acquire a parse lock before it can flush; some session probably spent a long time parsing. Once the configured timeout elapsed and LGWR still had not obtained the parse lock, ORA-4021 fired and the instance crashed.


Workaround: increase the timeout.


alter system set "_adg_parselock_timeout" = 3000 scope=both;
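To confirm the change, a common (unofficial) way to read a hidden parameter is to query the internal x$ksppi / x$ksppcv views as SYS; a minimal sketch:

-- Read back the hidden parameter's current value (run as SYS).
select i.ksppinm  parameter_name,
       v.ksppstvl current_value
  from x$ksppi i, x$ksppcv v
 where i.indx = v.indx
   and i.ksppinm = '_adg_parselock_timeout';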



Affected versions: 12.2 and below.
It can also be fixed by applying the official patch.



















Thursday, December 6, 2018

Oracle Data Guard move datafile



Standby database:


SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE cancel;

Database altered.

SQL> 
SQL> ALTER SYSTEM SET STANDBY_FILE_MANAGEMENT=MANUAL SCOPE=BOTH;

System altered.

SQL> 
SQL> shutdown immediate;
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup nomount ;
ORACLE instance started.

SQL> alter database mount standby database;

Database altered.

SQL> host mv /ORA/TEST/TEST01  /ORA/OTHER_DATA/TEST/TEST01


SQL> alter database rename file '/ORA/TEST/TEST01' to '/ORA/OTHER_DATA/TEST/TEST01';

Database altered.
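Before restarting recovery, it is worth confirming the controlfile now points at the new location; a quick sketch (the LIKE pattern is only illustrative):

SQL> select name from v$datafile where name like '%TEST01%';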

SQL> 
SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT;

Database altered.

SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE cancel;

Database altered.

SQL> alter database open read only;

Database altered.

SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT;


Database altered.


SQL> ALTER SYSTEM SET STANDBY_FILE_MANAGEMENT=AUTO SCOPE=BOTH;

System altered.




Done.

Monday, November 12, 2018

Oracle cursor: mutex X




A few days ago, on one of our production RAC clusters, an application's session count kept spiking intermittently. It never got as far as an outage, but watching it alert over and over was unnerving.


Investigating:

EM and AWR both showed heavy cursor: mutex X waits, with cursor: mutex X accounting for 34.6% of wait time, even in the middle of the night when throughput is low.




Every one of the cursor: mutex X waits was caused by a single SQL statement:


SELECT *
  FROM RELAY_NAME b
 WHERE b.id = :1
   AND b.name = :2
   AND b.create_date >= TRUNC( :3 )
   AND b.create_date <  TRUNC( :4 + 1 )


The SQL is very simple, the index is fine, and the table is not large.



Other cases on the web were bugs triggered by a very large number of bind variables, but this statement has only four.


My first instinct was to switch cursor_sharing to FORCE, but this production system carries a heavy load, and with bad luck that could take the whole thing down.


---------------------------

A note here: I do not recommend switching cursor_sharing to FORCE directly; there are plenty of cases where that itself drives cursor: mutex X up.
---------------------------



First, let's look for the cause.


The Y/N column values in V$SQL_SHARED_CURSOR show why the child cursors of a given SQL_ID cannot be shared:



select * from v$sql_shared_cursor where sql_id = '168vjw7bg9p8g';



Of all the columns, only BIND_EQUIV_FAILURE showed Y.



BIND_LENGTH_UPGRADEABLE, TRANSLATION_MISMATCH, and AUTH_CHECK_MISMATCH were all N.
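A narrower query makes this easier to scan per child cursor (these columns all exist in 11.2's V$SQL_SHARED_CURSOR):

select child_number,
       bind_equiv_failure,
       bind_length_upgradeable,
       translation_mismatch,
       auth_check_mismatch
  from v$sql_shared_cursor
 where sql_id = '168vjw7bg9p8g';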



It looks like the problem is not that complicated after all...



The Oracle documentation says:

BIND_EQUIV_FAILURE: the bind value's selectivity does not match that used to optimize the existing child cursor. That definition was the first clue: Adaptive Cursor Sharing.



the bind value’s selectivity does not match that used to optimize the existing child cursor


That sentence reminded me that a few days earlier I had run the SQL Tuning Advisor against this statement and accepted its recommendation, so Oracle was executing the "optimal" plan the advisor produced.



So the first step was to disable that.
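A minimal sketch of disabling an accepted SQL profile without dropping it; the profile name below is hypothetical, so look up the real one in DBA_SQL_PROFILES first:

-- Find the profile attached to the statement.
select name, status, created from dba_sql_profiles;

-- Disable it; it can be re-enabled later by setting STATUS back to ENABLED.
begin
  dbms_sqltune.alter_sql_profile(
    name           => 'SYS_SQLPROF_0123456789abcdef',  -- hypothetical name
    attribute_name => 'STATUS',
    value          => 'DISABLED');
end;
/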





Right after disabling it, the SQL swung wildly for a moment... after about a minute Oracle had more or less settled on a plan for the bind values...

After that, the flood of cursor: mutex X never reappeared.






This is, of course, a very case-specific fix.



One more note:

BIND_EQUIV_FAILURE means the existing child cursor was not reused because the selectivity of the current bind values differs significantly from the selectivity of the bind values that child cursor was optimized with; for performance reasons, the cursor cannot be shared.


If bind peeking and Adaptive Cursor Sharing turn out not to be involved, the next thing to suspect is a histogram on the bind-variable columns sitting on skewed data.
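A quick check for histograms on the bound columns; the table name is taken from the SQL above, and an OWNER predicate should be added for your schema:

select column_name, num_distinct, histogram
  from dba_tab_col_statistics
 where table_name = 'RELAY_NAME';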





Here is one more case.


At 02:00 one morning, a single SQL statement's load spiked, and every wait event was cursor: mutex X.



Finding the cause:

Again, the Y/N columns of V$SQL_SHARED_CURSOR show why the SQL_ID's child cursors cannot be shared:

select * from v$sql_shared_cursor where sql_id = '168vjw7bg9p8g';


Once again, the only column showing Y was BIND_EQUIV_FAILURE.


My guess was a skewed histogram, or table statistics that kept changing. The first instinct was to re-gather statistics on the table and its indexes (with DBMS_STATS.GATHER_TABLE_STATS), but it didn't help.
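For reference, a minimal sketch of that gather call; the schema name is hypothetical, and CASCADE also gathers the index statistics:

begin
  dbms_stats.gather_table_stats(
    ownname => 'APPUSER',    -- hypothetical schema
    tabname => 'RELAY_NAME',
    cascade => true);        -- gather index statistics too
end;
/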

MOS suggests running ALTER SYSTEM FLUSH SHARED_POOL to clear the shared pool, or using the _CURSOR_OBSOLETE_THRESHOLD parameter.

But... this is production; that kind of experiment is not really allowed.



Then it clicked: BIND_EQUIV_FAILURE shows up precisely because the selectivity of the current bind values differs greatly from the selectivity of the binds the existing child cursor was built with.


Could the table's statistics be changing at a very high frequency?!

I ran dbms_stats.lock_table_stats, and cursor: mutex X dropped back to normal levels; sessions stopped spiking.
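A minimal sketch of the lock call (same hypothetical schema as above); DBMS_STATS.UNLOCK_TABLE_STATS reverses it:

begin
  dbms_stats.lock_table_stats(
    ownname => 'APPUSER',    -- hypothetical schema
    tabname => 'RELAY_NAME');
end;
/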








If none of the above helps, and neither re-analyzing the table and indexes nor locking (or unlocking) the table statistics solves it, kill the SQL's plan in the shared pool and let the database generate a fresh one.


1. First, look up the SQL's address:
select SQL_TEXT,sql_id, address, hash_value, executions, loads, parse_calls, invalidations  
from v$sqlarea  where sql_id = '168vjw7bg9p8g'; 


What we need here are the values of the ADDRESS and HASH_VALUE columns.




2. Purge just this statement's cursor:
exec sys.dbms_shared_pool.purge('0000001D4A15D578,3606369551','c') ; 


The 'c' flag parameter means CURSOR.
Valid flag values:

  P          package/procedure/function
  Q          sequence
  R          trigger
  T          type
  JS         java source
  JC         java class
  JR         java resource
  JD         java shared data
  C          cursor

3. Check whether the SQL is still there:
select SQL_TEXT,sql_id, address, hash_value, executions, loads, parse_calls, invalidations  
from v$sqlarea  where sql_id = '168vjw7bg9p8g'; 

No rows returned: the purge succeeded.


4. Once a new execution plan had been generated, the high cursor: mutex X and cursor: mutex S waits did not come back.








References


Troubleshooting: High Version Count Issues (Doc ID 296377.1)
VIEW: “V$SQL_SHARED_CURSOR” Reference Note (Doc ID 120655.1)


https://hourim.wordpress.com/2015/04/06/bind_equiv_failure-or-when-you-will-regret-using-adaptive-cursor-sharing/


http://yong321.freeshell.org/computer/SharedPoolPurging.html

http://www.xifenfei.com/2013/05/dbms_shared_pool-purge%E5%B7%A5%E4%BD%9C%E5%8E%9F%E7%90%86%E7%8C%9C%E6%B5%8B.html








Thursday, November 8, 2018

Oracle temp ORA-01122 / ORA-01110 verification failures


An Oracle Data Guard standby database reported these errors:


Completed: ALTER DATABASE RECOVER MANAGED STANDBY DATABASE cancel
alter database open read only
Fri Nov 09 10:08:37 2018
Read of datafile '/ORA/db/oradata/TX/temp01.dbf' (fno 201) header failed with ORA-01203
Rereading datafile 201 header failed with ORA-01203
Errors in file /ORA/db/diag/rdbms/lg_st/TX/trace/LG_dbw0_450588.trc:
ORA-01186: file 201 failed verification tests
ORA-01122: database file 201 failed verification check
ORA-01110: data file 201: '/ORA/db/oradata/TX/temp01.dbf'
ORA-01203: wrong incarnation of this file - wrong creation SCN
File 201 not verified due to error ORA-01122
Fri Nov 09 10:08:37 2018


Running:

select s.name tbsname, t.name, (t.bytes/1024/1024) mb, t.status
  from v$tablespace s, v$tempfile t
 where s.ts# = t.ts#;

raises the same ORA-01122 and ORA-01110 errors.



The tempfile has become invalid.

The fix is to add a new tempfile and drop the original one.




SQL> alter tablespace temp add tempfile '/ORA/db/oradata/TX/temp01_01.dbf' size 100m autoextend on;


Tablespace altered.


SQL> alter tablespace temp drop tempfile '/ORA/db/oradata/TX/temp01.dbf';

Tablespace altered.



That resolves it.

Running:

select s.name tbsname, t.name, (t.bytes/1024/1024) mb, t.status
  from v$tablespace s, v$tempfile t
 where s.ts# = t.ts#;

now shows the newly added tempfile.


alert.log contents:

Fri Nov 09 10:26:34 2018
alter tablespace temp add tempfile '/ORA/db/oradata/TX/temp01_01.dbf' size 100m autoextend on
Completed: alter tablespace temp add tempfile '/ORA/db/oradata/TX/temp01_01.dbf' size 100m autoextend on
Fri Nov 09 10:27:21 2018
alter tablespace temp drop tempfile '/ORA/db/oradata/TX/temp01.dbf'
Errors in file /ORA/db/diag/rdbms/lg_st/TX/trace/LG_ora_451105.trc:
ORA-01122: database file 201 failed verification check
ORA-01110: data file 201: '/ORA/db/oradata/TX/temp01.dbf'
ORA-01203: wrong incarnation of this file - wrong creation SCN
Errors in file /ORA/db/diag/rdbms/lg_st/TX/trace/LG_ora_451105.trc:
ORA-01258: unable to delete temporary file /ORA/db/oradata/TX/temp01.dbf
Completed: alter tablespace temp drop tempfile '/ORA/db/oradata/TX/temp01.dbf'
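Note that the drop logged ORA-01258 (the OS file could not be deleted, since its header fails verification) yet still shows Completed: the tempfile is removed from the controlfile, and the leftover OS file can be deleted by hand. For reference, a sketch of an alternative single-statement form that also asks Oracle to remove the OS file:

alter database tempfile '/ORA/db/oradata/TX/temp01.dbf' drop including datafiles;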














Monday, October 22, 2018

ORA-32701: Possible hangs up to hang ID=6 detected


Version: 11.2.0.4 RAC, two nodes

-------------

DB alert log error:

ORA-32701: Possible hangs up to hang ID=6 detected




Tue Oct 23 10:32:55 2018
Errors in file /ORA/db/diag/rdbms/lg/LG1/trace/LG1_dia0_438170.trc  (incident=196081):
ORA-32701: Possible hangs up to hang ID=6 detected
Incident details in: /ORA/db/diag/rdbms/lg/LG1/incident/incdir_196081/LG1_dia0_438170_i196081.trc
DIA0 requesting termination of session sid:789 with serial # 39305 (ospid:103221) on instance 2
     due to a GLOBAL, HIGH confidence hang with ID=6.
     Hang Resolution Reason: Although the number of affected sessions did not
    justify automatic hang resolution initially, this previously ignored
    hang was automatically resolved.
DIA0: Examine the alert log on instance 2 for session termination status of hang with ID=6.
Tue Oct 23 10:32:56 2018
Sweep [inc][196081]: completed
Sweep [inc2][196081]: completed
Tue Oct 23 10:34:29 2018



Something was hung, and the database automatically killed the hung process.



Trace file contents:

*** 2018-10-23 10:32:55.067
Resolvable Hangs in the System
                     Root       Chain Total               Hang
  Hang Hang          Inst Root  #hung #hung  Hang   Hang  Resolution
    ID Type Status   Num  Sess   Sess  Sess  Conf   Span  Action
 ----- ---- -------- ---- ----- ----- ----- ------ ------ -------------------
      6 HANG RSLNPEND    2   789     2     2   HIGH GLOBAL Terminate Process
  Hang Resolution Reason: Although the number of affected sessions did not
    justify automatic hang resolution initially, this previously ignored
    hang was automatically resolved.

  inst# SessId  Ser#     OSPID PrcNm Event
  ----- ------ ----- --------- ----- -----
      1    674 21509    268777  M001 enq: WF - contention
      2    789 39305    103221  M000 not in wait

  The incident file on instance 2 may contain information
  about session 789 with serial number 39305 and operating system
  process ID 103221.  The incident file may contain a short stack
  and a process state dump.

Victim Information
                                                                      Ignored
  HangID  Inst#  Sessid  Ser Num      OSPID  Fatal BG  Previous Hang    Count
  ------  -----  ------  -------  ---------  --------  -------------  -------
       6      2     789    39305     103221     N      Existing Hang        1

*** 2018-10-23 10:32:55.067


Here we can see that it was instance 2's M000 (sid 789) that was terminated, because it was blocking instance 1's M001. Reading further down the trace confirms it:

-------------------------------------------------------------------------------
Chain 1:
-------------------------------------------------------------------------------
    Oracle session identified by:
    {
                instance: 1 (lg.lg1)
                   os id: 268777
              process id: 99, oracle@ora-36 (M001)
              session id: 674
        session serial #: 21509
    }
    is waiting for 'enq: WF - contention' with wait info:
    {
                      p1: 'name|mode'=0x57460006
                      p2: '0'=0x38
                      p3: '0'=0x0
            time in wait: 1 min 56 sec
           timeout after: never
                 wait id: 5150
                blocking: 0 sessions
            wait history:
              * time between current wait and wait #1: 0.000306 sec
              1.       event: 'db file sequential read'
                 time waited: 0.000270 sec
                     wait id: 5149            p1: 'file#'=0x6c
                                              p2: 'block#'=0x99fa
                                              p3: 'blocks'=0x1
              * time between wait #1 and #2: 0.000638 sec
              2.       event: 'db file sequential read'
                 time waited: 0.000477 sec
                     wait id: 5148            p1: 'file#'=0x6c
                                              p2: 'block#'=0x99f9
                                              p3: 'blocks'=0x1
              * time between wait #2 and #3: 0.000106 sec
              3.       event: 'db file sequential read'
                 time waited: 0.000301 sec
                     wait id: 5147            p1: 'file#'=0x6c
                                              p2: 'block#'=0x9d07
                                              p3: 'blocks'=0x1
    }
    and is blocked by
 => Oracle session identified by:
    {
                instance: 2 (lg.lg2)
                   os id: 103221
              process id: 453, oracle@ora-35 (M000)
              session id: 789
        session serial #: 39305
    }
    which is not in a wait:
    {
               last wait: 26 min 36 sec ago
                blocking: 1 session
            wait history:
              1.       event: 'gc current grant 2-way'
                 time waited: 0.000089 sec
                     wait id: 6377            p1: ''=0x5
                                              p2: ''=0x36cb
                                              p3: ''=0x20001ea
              * time between wait #1 and #2: 0.000053 sec
              2.       event: 'gc current grant 2-way'
                 time waited: 0.000138 sec
                     wait id: 6376            p1: ''=0x5
                                              p2: ''=0x36ca
                                              p3: ''=0x20001ea
              * time between wait #2 and #3: 0.000419 sec
              3.       event: 'gc current grant 2-way'
                 time waited: 0.000110 sec
                     wait id: 6375            p1: ''=0x5
                                              p2: ''=0x36c9
                                              p3: ''=0x20001ea
    }

Chain 1 Signature: <not in a wait><='enq: WF - contention'
Chain 1 Signature Hash: 0xee4c6e02
-------------------------------------------------------------------------------


Seeing M000 here is actually no cause for panic: processes named Mnnn are MMON slave processes that Oracle uses to collect AWR-related data, so one being killed has no immediate impact on the system.


Still, an M000 hanging and getting killed means something is wrong with AWR, or with the system's use of SYS.DBMS_STATS.GATHER_TABLE_STATS.


Here are cases I have run into.

Case 1: A DELETE statement was executed by the application but never committed or rolled back, and sat stuck for almost a day. The application also appeared to have died, so the resource was never released. It was discovered in the middle of the night while the database was running its Automated Maintenance Tasks (Optimizer Statistics Gathering): the stuck DELETE kept statistics gathering on that table hanging, which saturated system resources and blew the session count up.

Resolution: kill the DELETE session. In my case, maddeningly, it could not be killed, and PMON could not release the resource either, so I temporarily disabled Optimizer Statistics Gathering. A month later, after going that long without gathering, some execution plans drifted badly; in the end a database restart resolved it. If a restart is not an option, gather the statistics manually. A sketch for toggling the auto task follows below.
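A minimal sketch of turning the auto task off and back on with DBMS_AUTO_TASK_ADMIN ('auto optimizer stats collection' is the standard client name in 11.2):

-- Disable the nightly Optimizer Statistics Gathering auto task.
begin
  dbms_auto_task_admin.disable(
    client_name => 'auto optimizer stats collection',
    operation   => null,
    window_name => null);
end;
/

-- Re-enable it once the blocker is gone.
begin
  dbms_auto_task_admin.enable(
    client_name => 'auto optimizer stats collection',
    operation   => null,
    window_name => null);
end;
/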


Case 2: Before the ORA-32701 appeared, EM showed a flood of INSERT INTO wrh$_sql_bind_metadata.

It looked like wrh$_sql_bind_metadata itself was the problem.

Use the following command:

alter system set "_awr_disabled_flush_tables" = 'wrh$_sql_bind_metadata';

The command is dynamic and does not require an instance restart; to disable flushing for multiple tables, separate the table names with commas (see the example below).

This disables the flushing of the statistics related to the table above into AWR (the ASH-to-AWR flush for that table).
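For example, a sketch disabling two tables at once (the second table name is purely illustrative):

alter system set "_awr_disabled_flush_tables" = 'wrh$_sql_bind_metadata,wrh$_active_session_history';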