Sunday, September 10, 2017

Redis cluster keeps crashing







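In a Redis cluster test environment, I noticed that the cluster state was not very stable. Checking the log, I found the following messages appearing in large numbers: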


5009:S 11 Sep 00:09:00.086 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
5009:S 11 Sep 00:11:11.086 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
5009:S 11 Sep 00:16:05.098 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
5009:S 11 Sep 00:16:30.030 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
5009:S 11 Sep 00:21:37.687 # Cluster state changed: fail
5009:S 11 Sep 00:21:38.772 # Cluster state changed: ok
5009:S 11 Sep 00:21:38.810 * FAIL message received from 903ab5e1fb428082550d14de7b1bf97830bf8ad9 about 00b03db3d55d9c9045023d5e620c6f5d68ccab01
5009:S 11 Sep 00:21:41.368 # Start of election delayed for 754 milliseconds (rank #0, offset 33496646).
5009:S 11 Sep 00:21:41.368 # Cluster state changed: fail
5009:S 11 Sep 00:21:42.169 # Starting a failover election for epoch 72.
5009:S 11 Sep 00:21:42.248 # Failover election won: I'm the new master.
5009:S 11 Sep 00:21:42.248 # configEpoch set to 72 after successful failover
5009:M 11 Sep 00:21:42.248 # Connection with master lost.
5009:M 11 Sep 00:21:42.248 * Caching the disconnected master state.
5009:M 11 Sep 00:21:42.248 * Discarding previously cached master state.
5009:M 11 Sep 00:21:42.845 * Slave 10.22.21.186:8101 asks for synchronization
5009:M 11 Sep 00:21:42.845 * Full resync requested by slave 10.22.21.186:8101
5009:M 11 Sep 00:21:42.845 * Starting BGSAVE for SYNC with target: disk
5009:M 11 Sep 00:21:42.846 * Background saving started by pid 25548



After this message had appeared repeatedly:

5009:S 11 Sep 00:16:30.030 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.

Redis started a failover and the nodes switched their master/slave roles.
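
To see how often the slow-fsync warning occurs relative to failovers, the log can be summarized with a short script. This is only a minimal sketch: the function name `summarize` and the counter keys are made up for illustration, and the regular expression assumes the log line layout shown in the excerpt above (pid:role, date, time, level character, message).

```python
import re
from collections import Counter

# Assumed layout, taken from the excerpt: "pid:role DD Mon HH:MM:SS.mmm level message"
LOG_LINE = re.compile(
    r"^(?P<pid>\d+):(?P<role>[A-Z]) "
    r"(?P<day>\d{1,2}) (?P<month>[A-Za-z]{3}) "
    r"(?P<time>\d{2}:\d{2}:\d{2})\.\d{3} "
    r"(?P<level>[.\-*#]) (?P<message>.*)$"
)


def summarize(lines):
    """Count slow-fsync warnings, cluster failures and won failover elections."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that do not look like Redis log lines
        message = match.group("message")
        if message.startswith("Asynchronous AOF fsync is taking too long"):
            counts["slow_fsync"] += 1
        elif message.startswith("Cluster state changed: fail"):
            counts["cluster_fail"] += 1
        elif message.startswith("Failover election won"):
            counts["failover_won"] += 1
    return counts


# A few lines copied from the log excerpt above.
SAMPLE = """\
5009:S 11 Sep 00:16:30.030 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
5009:S 11 Sep 00:21:37.687 # Cluster state changed: fail
5009:S 11 Sep 00:21:38.772 # Cluster state changed: ok
5009:S 11 Sep 00:21:42.248 # Failover election won: I'm the new master.
"""

if __name__ == "__main__":
    print(dict(summarize(SAMPLE.splitlines())))
```

Running it over the full log file (for example `summarize(open(path))`, where `path` is wherever your instance writes its log) gives a rough idea of whether the warnings cluster just before each failover.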


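Possible causes:
1. Slow queries -- I checked, and the slowest query took only 0.064 seconds, so this is not the problem.
2. AOF file too large -- I ran bgrewriteaof to shrink it, but the error message appeared again about 10 minutes later, although less frequently.
3. Busy CPU -- the host was still 98% idle at the time, so this is ruled out.
4. Slow disk -- this Redis is installed on a VM, and all the Redis cluster instances run on the same host, so they compete for the same disk I/O.
For a slow disk, the AOF sync frequency can be adjusted.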

Parameter: appendfsync
It accepts three values:

always
everysec
no

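The official explanation is:

always: writes everything in the aof_buf buffer to the AOF file and syncs it to disk.
everysec: writes everything in the aof_buf buffer to the AOF file. If more than one second has passed since the AOF file was last synced, the file is synced again, and this sync is performed by a dedicated thread.
no: writes everything in the aof_buf buffer to the AOF file but does not sync it. When the sync happens is left to the operating system.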
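
The difference between the three values comes down to when fsync is called after each write. The following is a rough Python sketch of that idea, not Redis's actual implementation: the function `append_commands` and its `now` clock parameter are invented for illustration, and the everysec branch calls fsync inline, whereas Redis hands it to a background thread.

```python
import os
import tempfile
import time


def append_commands(path, commands, policy, now=time.monotonic):
    """Append each command to path and return how many fsync calls were made."""
    if policy not in ("always", "everysec", "no"):
        raise ValueError("unknown policy: %r" % (policy,))
    fsync_calls = 0
    last_sync = now()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        for command in commands:
            os.write(fd, command)  # data reaches the OS page cache only
            if policy == "always":
                os.fsync(fd)  # force it to disk after every write
                fsync_calls += 1
            elif policy == "everysec":
                current = now()
                if current - last_sync >= 1.0:
                    os.fsync(fd)  # at most about one sync per second
                    fsync_calls += 1
                    last_sync = current
            # policy == "no": never fsync, the OS decides when to flush
    finally:
        os.close(fd)
    return fsync_calls


if __name__ == "__main__":
    commands = [b"SET a 1\r\n", b"SET b 2\r\n", b"SET c 3\r\n"]
    with tempfile.TemporaryDirectory() as tmp:
        for policy in ("always", "everysec", "no"):
            path = os.path.join(tmp, policy + ".aof")
            print(policy, append_commands(path, commands, policy))
```

On a slow or shared disk each fsync can block for a long time, which is why always is the most expensive setting and no puts the least pressure on the disk.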


If you do not need strong data durability, you can try setting this value to no.
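
The value can also be changed on a running instance with CONFIG SET, without a restart. Below is a small sketch that assumes a redis-py style client object (anything exposing `config_set`, `config_get` and `config_rewrite`). The helper name `set_appendfsync` and the `persist` flag are made up for this example, and in a cluster the change would have to be applied to every node separately.

```python
VALID_POLICIES = ("always", "everysec", "no")


def set_appendfsync(client, policy="no", persist=False):
    """Set appendfsync on one node and return what the node reports afterwards.

    `client` is assumed to behave like a redis-py client connected to a single node.
    """
    if policy not in VALID_POLICIES:
        raise ValueError("appendfsync must be one of %s" % (VALID_POLICIES,))
    client.config_set("appendfsync", policy)  # takes effect immediately
    if persist:
        client.config_rewrite()  # write the running config back to redis.conf
    return client.config_get("appendfsync")


# Hypothetical usage with redis-py (host and ports are placeholders):
#
#   import redis
#   for port in (8100, 8101, 8102):
#       node = redis.Redis(host="127.0.0.1", port=port)
#       print(port, set_appendfsync(node, "no"))
```

Without `persist=True` the setting is lost on restart unless the same line is also placed in the instance's configuration file.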




Below is the official Redis troubleshooting guide and perspective on latency:

https://redis.io/topics/latency