Wednesday, December 19, 2018

LGWR (ospid: 40339): terminating the instance due to error 4021 (Active Data Guard)


One of our Active Data Guard standby instances crashed. Checking the alert log shows:



WARNING: inbound connection timed out (ORA-3136)
  Time: 19-DEC-2018 17:14:42
  Tracing not turned on.
  Tns error struct:
    ns main err code: 12535
TNS-12535: TNS:operation timed out
    ns secondary err code: 12606
    nt main err code: 0
    nt secondary err code: 0
    nt OS err code: 0
WARNING: inbound connection timed out (ORA-3136)
WARNING: inbound connection timed out (ORA-3136)

Wed Dec 19 17:14:42 2018
Errors in file /ORA/db/diag/rdbms/lg_st/LG/trace/LG_lgwr_189317.trc:
ORA-04021: timeout occurred while waiting to lock object
LGWR (ospid: 189317): terminating the instance due to error 4021


Before the crash, the log shows a flood of:
WARNING: inbound connection timed out (ORA-3136) 


The key line is:
LGWR (ospid: 189317): terminating the instance due to error 4021

LGWR is what brought the database down.


First, look at the LGWR trace file:

Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
ORACLE_HOME = /ORA/db/11.2.4
System name: Linux
Node name: ora-74
Release: 2.6.32-696.el6.x86_64
Version: #1 SMP Tue Feb 21 00:53:17 EST 2017
Machine: x86_64
Instance name: LG
Redo thread mounted by this instance: 1
Oracle process number: 21
Unix process pid: 189317, image: oracle@ora-74 (LGWR)


*** 2018-12-19 17:14:42.853
*** SESSION ID:(1261.1) 2018-12-19 17:14:42.853
*** CLIENT ID:() 2018-12-19 17:14:42.853
*** SERVICE NAME:(SYS$BACKGROUND) 2018-12-19 17:14:42.853
*** MODULE NAME:() 2018-12-19 17:14:42.853
*** ACTION NAME:() 2018-12-19 17:14:42.853
error 4021 detected in background process
ORA-04021: timeout occurred while waiting to lock object 
kjzduptcctx: Notifying DIAG for crash event
----- Abridged Call Stack Trace -----
ksedsts()+465<-kjzdssdmp()+267<-kjzduptcctx()+232<-kjzdicrshnfy()+63<-ksuitm()+5570<-ksbrdp()+3507<-opirip()+623<-opidrv()+603<-sou2o()+103<-opimai_real()+250<-ssthrdmain()+265<-main()+201<-__libc_start_main()+253 
----- End of Abridged Call Stack Trace -----

*** 2018-12-19 17:14:42.929
LGWR (ospid: 189317): terminating the instance due to error 4021
ksuitm: waiting up to [5] seconds before killing DIAG(189287)




These symptoms are a good match for what MOS (My Oracle Support) describes:

Bug 18242740 : ACTIVE STANDBY DATABASE CRASHED WITH ORA-4021
LGWR (ospid: Xxx): Terminating The Instance Due To Error 4021 (Doc ID 2177473.1)



On an Active Data Guard standby, LGWR needs to acquire a parse lock before it can flush; some session probably spent a long time parsing. Once the configured timeout elapsed and LGWR still had not obtained the parse lock, ORA-4021 fired and the instance crashed.


Workaround: increase the timeout.


alter system set "_adg_parselock_timeout" = 3000 scope=both;
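To confirm the change, a common (unofficial) way to read a hidden parameter is to query the internal x$ksppi / x$ksppcv views as SYS; a minimal sketch:

-- Read back the hidden parameter's current value (run as SYS).
select i.ksppinm  parameter_name,
       v.ksppstvl current_value
  from x$ksppi i, x$ksppcv v
 where i.indx = v.indx
   and i.ksppinm = '_adg_parselock_timeout';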



Affected versions: 12.2 and below.
It can also be fixed by applying the official patch.



















Thursday, December 6, 2018

Oracle Data Guard move datafile



Standby database:


SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE cancel;

Database altered.

SQL> 
SQL> ALTER SYSTEM SET STANDBY_FILE_MANAGEMENT=MANUAL SCOPE=BOTH;

System altered.

SQL> 
SQL> shutdown immediate;
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup nomount ;
ORACLE instance started.

SQL> alter database mount standby database;

Database altered.

SQL> host mv /ORA/TEST/TEST01  /ORA/OTHER_DATA/TEST/TEST01


SQL> alter database rename file '/ORA/TEST/TEST01' to '/ORA/OTHER_DATA/TEST/TEST01';

Database altered.
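Before restarting recovery, it is worth confirming the controlfile now points at the new location; a quick sketch (the LIKE pattern is only illustrative):

SQL> select name from v$datafile where name like '%TEST01%';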

SQL> 
SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT;

Database altered.

SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE cancel;

Database altered.

SQL> alter database open read only;

Database altered.

SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT;


Database altered.


SQL> ALTER SYSTEM SET STANDBY_FILE_MANAGEMENT=AUTO SCOPE=BOTH;

System altered.




Done.

Monday, November 12, 2018

Oracle cursor: mutex X




A few days ago, on one of our production RAC clusters, an application's session count kept spiking intermittently. It never got as far as an outage, but watching it alert over and over was unnerving.


Investigating:

EM and AWR both showed heavy cursor: mutex X waits, with cursor: mutex X accounting for 34.6% of wait time, even in the middle of the night when throughput is low.




Every one of the cursor: mutex X waits was caused by a single SQL statement:


SELECT *
  FROM RELAY_NAME b
 WHERE b.id = :1
   AND b.name = :2
   AND b.create_date >= TRUNC( :3 )
   AND b.create_date <  TRUNC( :4 + 1 )


The SQL is very simple, the index is fine, and the table is not large.



Other cases on the web were bugs triggered by a very large number of bind variables, but this statement has only four.


My first instinct was to switch cursor_sharing to FORCE, but this production system carries a heavy load, and with bad luck that could take the whole thing down.


---------------------------

A note here: I do not recommend switching cursor_sharing to FORCE directly; there are plenty of cases where that itself drives cursor: mutex X up.
---------------------------



First, let's look for the cause.


The Y/N column values in V$SQL_SHARED_CURSOR show why the child cursors of a given SQL_ID cannot be shared:



select * from v$sql_shared_cursor where sql_id = '168vjw7bg9p8g';



Of all the columns, only BIND_EQUIV_FAILURE showed Y.



BIND_LENGTH_UPGRADEABLE, TRANSLATION_MISMATCH, and AUTH_CHECK_MISMATCH were all N.
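A narrower query makes this easier to scan per child cursor (these columns all exist in 11.2's V$SQL_SHARED_CURSOR):

select child_number,
       bind_equiv_failure,
       bind_length_upgradeable,
       translation_mismatch,
       auth_check_mismatch
  from v$sql_shared_cursor
 where sql_id = '168vjw7bg9p8g';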



It looks like the problem is not that complicated after all...



The Oracle documentation says:

BIND_EQUIV_FAILURE: the bind value's selectivity does not match that used to optimize the existing child cursor. That definition was the first clue: Adaptive Cursor Sharing.



the bind value’s selectivity does not match that used to optimize the existing child cursor


That sentence reminded me that a few days earlier I had run the SQL Tuning Advisor against this statement and accepted its recommendation, so Oracle was executing the "optimal" plan the advisor produced.



So the first step was to disable that.
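A minimal sketch of disabling an accepted SQL profile without dropping it; the profile name below is hypothetical, so look up the real one in DBA_SQL_PROFILES first:

-- Find the profile attached to the statement.
select name, status, created from dba_sql_profiles;

-- Disable it; it can be re-enabled later by setting STATUS back to ENABLED.
begin
  dbms_sqltune.alter_sql_profile(
    name           => 'SYS_SQLPROF_0123456789abcdef',  -- hypothetical name
    attribute_name => 'STATUS',
    value          => 'DISABLED');
end;
/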





Right after disabling it, the SQL swung wildly for a moment... after about a minute Oracle had more or less settled on a plan for the bind values...

After that, the flood of cursor: mutex X never reappeared.






This is, of course, a very case-specific fix.



One more note:

BIND_EQUIV_FAILURE means the existing child cursor was not reused because the selectivity of the current bind values differs significantly from the selectivity of the bind values that child cursor was optimized with; for performance reasons, the cursor cannot be shared.


If bind peeking and Adaptive Cursor Sharing turn out not to be involved, the next thing to suspect is a histogram on the bind-variable columns sitting on skewed data.
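A quick check for histograms on the bound columns; the table name is taken from the SQL above, and an OWNER predicate should be added for your schema:

select column_name, num_distinct, histogram
  from dba_tab_col_statistics
 where table_name = 'RELAY_NAME';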





Here is one more case.


At 02:00 one morning, a single SQL statement's load spiked, and every wait event was cursor: mutex X.



Finding the cause:

Again, the Y/N columns of V$SQL_SHARED_CURSOR show why the SQL_ID's child cursors cannot be shared:

select * from v$sql_shared_cursor where sql_id = '168vjw7bg9p8g';


Once again, the only column showing Y was BIND_EQUIV_FAILURE.


My guess was a skewed histogram, or table statistics that kept changing. The first instinct was to re-gather statistics on the table and its indexes (with DBMS_STATS.GATHER_TABLE_STATS), but it didn't help.
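For reference, a minimal sketch of that gather call; the schema name is hypothetical, and CASCADE also gathers the index statistics:

begin
  dbms_stats.gather_table_stats(
    ownname => 'APPUSER',    -- hypothetical schema
    tabname => 'RELAY_NAME',
    cascade => true);        -- gather index statistics too
end;
/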

MOS suggests running ALTER SYSTEM FLUSH SHARED_POOL to clear the shared pool, or using the _CURSOR_OBSOLETE_THRESHOLD parameter.

But... this is production; that kind of experiment is not really allowed.



Then it clicked: BIND_EQUIV_FAILURE shows up precisely because the selectivity of the current bind values differs greatly from the selectivity of the binds the existing child cursor was built with.


Could the table's statistics be changing at a very high frequency?!

I ran dbms_stats.lock_table_stats, and cursor: mutex X dropped back to normal levels; sessions stopped spiking.
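A minimal sketch of the lock call (same hypothetical schema as above); DBMS_STATS.UNLOCK_TABLE_STATS reverses it:

begin
  dbms_stats.lock_table_stats(
    ownname => 'APPUSER',    -- hypothetical schema
    tabname => 'RELAY_NAME');
end;
/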








If none of the above helps, and neither re-analyzing the table and indexes nor locking (or unlocking) the table statistics solves it, kill the SQL's plan in the shared pool and let the database generate a fresh one.


1. First, look up the SQL's address:
select SQL_TEXT,sql_id, address, hash_value, executions, loads, parse_calls, invalidations  
from v$sqlarea  where sql_id = '168vjw7bg9p8g'; 


What we need here are the values of the ADDRESS and HASH_VALUE columns.




2. Purge just this statement's cursor:
exec sys.dbms_shared_pool.purge('0000001D4A15D578,3606369551','c') ; 


The 'c' flag parameter means CURSOR.
Valid flag values:

  P          package/procedure/function
  Q          sequence
  R          trigger
  T          type
  JS         java source
  JC         java class
  JR         java resource
  JD         java shared data
  C          cursor

3. Check whether the SQL is still there:
select SQL_TEXT,sql_id, address, hash_value, executions, loads, parse_calls, invalidations  
from v$sqlarea  where sql_id = '168vjw7bg9p8g'; 

No rows returned: the purge succeeded.


4. Once a new execution plan had been generated, the high cursor: mutex X and cursor: mutex S waits did not come back.








References


Troubleshooting: High Version Count Issues (Doc ID 296377.1)
VIEW: “V$SQL_SHARED_CURSOR” Reference Note (Doc ID 120655.1)


https://hourim.wordpress.com/2015/04/06/bind_equiv_failure-or-when-you-will-regret-using-adaptive-cursor-sharing/


http://yong321.freeshell.org/computer/SharedPoolPurging.html

http://www.xifenfei.com/2013/05/dbms_shared_pool-purge%E5%B7%A5%E4%BD%9C%E5%8E%9F%E7%90%86%E7%8C%9C%E6%B5%8B.html








Thursday, November 8, 2018

Oracle temp ORA-01122 / ORA-01110 verification failures


An Oracle Data Guard standby database reported these errors:


Completed: ALTER DATABASE RECOVER MANAGED STANDBY DATABASE cancel
alter database open read only
Fri Nov 09 10:08:37 2018
Read of datafile '/ORA/db/oradata/TX/temp01.dbf' (fno 201) header failed with ORA-01203
Rereading datafile 201 header failed with ORA-01203
Errors in file /ORA/db/diag/rdbms/lg_st/TX/trace/LG_dbw0_450588.trc:
ORA-01186: file 201 failed verification tests
ORA-01122: database file 201 failed verification check
ORA-01110: data file 201: '/ORA/db/oradata/TX/temp01.dbf'
ORA-01203: wrong incarnation of this file - wrong creation SCN
File 201 not verified due to error ORA-01122
Fri Nov 09 10:08:37 2018


Running:

select s.name tbsname, t.name, (t.bytes/1024/1024) mb, t.status
  from v$tablespace s, v$tempfile t
 where s.ts# = t.ts#;

raises the same ORA-01122 and ORA-01110 errors.



The tempfile has become invalid.

The fix is to add a new tempfile and drop the original one.




SQL> alter tablespace temp add tempfile '/ORA/db/oradata/TX/temp01_01.dbf' size 100m autoextend on;


Tablespace altered.


SQL> alter tablespace temp drop tempfile '/ORA/db/oradata/TX/temp01.dbf';

Tablespace altered.



That resolves it.

Running:

select s.name tbsname, t.name, (t.bytes/1024/1024) mb, t.status
  from v$tablespace s, v$tempfile t
 where s.ts# = t.ts#;

now shows the newly added tempfile.


alert.log contents:

Fri Nov 09 10:26:34 2018
alter tablespace temp add tempfile '/ORA/db/oradata/TX/temp01_01.dbf' size 100m autoextend on
Completed: alter tablespace temp add tempfile '/ORA/db/oradata/TX/temp01_01.dbf' size 100m autoextend on
Fri Nov 09 10:27:21 2018
alter tablespace temp drop tempfile '/ORA/db/oradata/TX/temp01.dbf'
Errors in file /ORA/db/diag/rdbms/lg_st/TX/trace/LG_ora_451105.trc:
ORA-01122: database file 201 failed verification check
ORA-01110: data file 201: '/ORA/db/oradata/TX/temp01.dbf'
ORA-01203: wrong incarnation of this file - wrong creation SCN
Errors in file /ORA/db/diag/rdbms/lg_st/TX/trace/LG_ora_451105.trc:
ORA-01258: unable to delete temporary file /ORA/db/oradata/TX/temp01.dbf
Completed: alter tablespace temp drop tempfile '/ORA/db/oradata/TX/temp01.dbf'
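Note that the drop logged ORA-01258 (the OS file could not be deleted, since its header fails verification) yet still shows Completed: the tempfile is removed from the controlfile, and the leftover OS file can be deleted by hand. For reference, a sketch of an alternative single-statement form that also asks Oracle to remove the OS file:

alter database tempfile '/ORA/db/oradata/TX/temp01.dbf' drop including datafiles;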














Monday, October 22, 2018

ORA-32701: Possible hangs up to hang ID=6 detected


Version: 11.2.0.4 RAC, two nodes

-------------

DB alert log error:

ORA-32701: Possible hangs up to hang ID=6 detected




Tue Oct 23 10:32:55 2018
Errors in file /ORA/db/diag/rdbms/lg/LG1/trace/LG1_dia0_438170.trc  (incident=196081):
ORA-32701: Possible hangs up to hang ID=6 detected
Incident details in: /ORA/db/diag/rdbms/lg/LG1/incident/incdir_196081/LG1_dia0_438170_i196081.trc
DIA0 requesting termination of session sid:789 with serial # 39305 (ospid:103221) on instance 2
     due to a GLOBAL, HIGH confidence hang with ID=6.
     Hang Resolution Reason: Although the number of affected sessions did not
    justify automatic hang resolution initially, this previously ignored
    hang was automatically resolved.
DIA0: Examine the alert log on instance 2 for session termination status of hang with ID=6.
Tue Oct 23 10:32:56 2018
Sweep [inc][196081]: completed
Sweep [inc2][196081]: completed
Tue Oct 23 10:34:29 2018



Something was hung, and the database automatically killed the hung process.



Trace file contents:

*** 2018-10-23 10:32:55.067
Resolvable Hangs in the System
                     Root       Chain Total               Hang
  Hang Hang          Inst Root  #hung #hung  Hang   Hang  Resolution
    ID Type Status   Num  Sess   Sess  Sess  Conf   Span  Action
 ----- ---- -------- ---- ----- ----- ----- ------ ------ -------------------
      6 HANG RSLNPEND    2   789     2     2   HIGH GLOBAL Terminate Process
  Hang Resolution Reason: Although the number of affected sessions did not
    justify automatic hang resolution initially, this previously ignored
    hang was automatically resolved.

  inst# SessId  Ser#     OSPID PrcNm Event
  ----- ------ ----- --------- ----- -----
      1    674 21509    268777  M001 enq: WF - contention
      2    789 39305    103221  M000 not in wait

  The incident file on instance 2 may contain information
  about session 789 with serial number 39305 and operating system
  process ID 103221.  The incident file may contain a short stack
  and a process state dump.

Victim Information
                                                                      Ignored
  HangID  Inst#  Sessid  Ser Num      OSPID  Fatal BG  Previous Hang    Count
  ------  -----  ------  -------  ---------  --------  -------------  -------
       6      2     789    39305     103221     N      Existing Hang        1

*** 2018-10-23 10:32:55.067


Here we can see that it was instance 2's M000 (sid 789) that was terminated, because it was blocking instance 1's M001. Reading further down the trace confirms it:

-------------------------------------------------------------------------------
Chain 1:
-------------------------------------------------------------------------------
    Oracle session identified by:
    {
                instance: 1 (lg.lg1)
                   os id: 268777
              process id: 99, oracle@ora-36 (M001)
              session id: 674
        session serial #: 21509
    }
    is waiting for 'enq: WF - contention' with wait info:
    {
                      p1: 'name|mode'=0x57460006
                      p2: '0'=0x38
                      p3: '0'=0x0
            time in wait: 1 min 56 sec
           timeout after: never
                 wait id: 5150
                blocking: 0 sessions
            wait history:
              * time between current wait and wait #1: 0.000306 sec
              1.       event: 'db file sequential read'
                 time waited: 0.000270 sec
                     wait id: 5149            p1: 'file#'=0x6c
                                              p2: 'block#'=0x99fa
                                              p3: 'blocks'=0x1
              * time between wait #1 and #2: 0.000638 sec
              2.       event: 'db file sequential read'
                 time waited: 0.000477 sec
                     wait id: 5148            p1: 'file#'=0x6c
                                              p2: 'block#'=0x99f9
                                              p3: 'blocks'=0x1
              * time between wait #2 and #3: 0.000106 sec
              3.       event: 'db file sequential read'
                 time waited: 0.000301 sec
                     wait id: 5147            p1: 'file#'=0x6c
                                              p2: 'block#'=0x9d07
                                              p3: 'blocks'=0x1
    }
    and is blocked by
 => Oracle session identified by:
    {
                instance: 2 (lg.lg2)
                   os id: 103221
              process id: 453, oracle@ora-35 (M000)
              session id: 789
        session serial #: 39305
    }
    which is not in a wait:
    {
               last wait: 26 min 36 sec ago
                blocking: 1 session
            wait history:
              1.       event: 'gc current grant 2-way'
                 time waited: 0.000089 sec
                     wait id: 6377            p1: ''=0x5
                                              p2: ''=0x36cb
                                              p3: ''=0x20001ea
              * time between wait #1 and #2: 0.000053 sec
              2.       event: 'gc current grant 2-way'
                 time waited: 0.000138 sec
                     wait id: 6376            p1: ''=0x5
                                              p2: ''=0x36ca
                                              p3: ''=0x20001ea
              * time between wait #2 and #3: 0.000419 sec
              3.       event: 'gc current grant 2-way'
                 time waited: 0.000110 sec
                     wait id: 6375            p1: ''=0x5
                                              p2: ''=0x36c9
                                              p3: ''=0x20001ea
    }

Chain 1 Signature: <not in a wait><='enq: WF - contention'
Chain 1 Signature Hash: 0xee4c6e02
-------------------------------------------------------------------------------


Seeing M000 here is actually no cause for panic: processes named Mnnn are MMON slave processes that Oracle uses to collect AWR-related data, so one being killed has no immediate impact on the system.


Still, an M000 hanging and getting killed means something is wrong with AWR, or with the system's use of SYS.DBMS_STATS.GATHER_TABLE_STATS.


Here are cases I have run into.

Case 1: A DELETE statement was executed by the application but never committed or rolled back, and sat stuck for almost a day. The application also appeared to have died, so the resource was never released. It was discovered in the middle of the night while the database was running its Automated Maintenance Tasks (Optimizer Statistics Gathering): the stuck DELETE kept statistics gathering on that table hanging, which saturated system resources and blew the session count up.

Resolution: kill the DELETE session. In my case, maddeningly, it could not be killed, and PMON could not release the resource either, so I temporarily disabled Optimizer Statistics Gathering. A month later, after going that long without gathering, some execution plans drifted badly; in the end a database restart resolved it. If a restart is not an option, gather the statistics manually. A sketch for toggling the auto task follows below.
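A minimal sketch of turning the auto task off and back on with DBMS_AUTO_TASK_ADMIN ('auto optimizer stats collection' is the standard client name in 11.2):

-- Disable the nightly Optimizer Statistics Gathering auto task.
begin
  dbms_auto_task_admin.disable(
    client_name => 'auto optimizer stats collection',
    operation   => null,
    window_name => null);
end;
/

-- Re-enable it once the blocker is gone.
begin
  dbms_auto_task_admin.enable(
    client_name => 'auto optimizer stats collection',
    operation   => null,
    window_name => null);
end;
/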


Case 2: Before the ORA-32701 appeared, EM showed a flood of INSERT INTO wrh$_sql_bind_metadata.

It looked like wrh$_sql_bind_metadata itself was the problem.

Use the following command:

alter system set "_awr_disabled_flush_tables" = 'wrh$_sql_bind_metadata';

The command is dynamic and does not require an instance restart; to disable flushing for multiple tables, separate the table names with commas (see the example below).

This disables the flushing of the statistics related to the table above into AWR (the ASH-to-AWR flush for that table).
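For example, a sketch disabling two tables at once (the second table name is purely illustrative):

alter system set "_awr_disabled_flush_tables" = 'wrh$_sql_bind_metadata,wrh$_active_session_history';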