在深圳帮人做网站清溪镇网站建设公司
1.问题背景
监控告警某套pg的两个备库延迟大于300s
2.pg备库延迟告警策略
通过postgres_exporter(version 0.13.2)监控生产pg,延迟告警策略如下:
pg_replication_lag_seconds{instance!~"host_ip_18801"} > 300 and ON(instance) pg_replication_is_replica == 1 
3.问题分析
1)查看备库上pg_replication_lag_seconds指标值(确实大于告警值300s)
curl http://127.0.0.1:19187/metrics|grep pg_replication_lag_seconds% Total    % Received % Xferd  Average Speed   Time    Time     Time  CurrentDload  Upload   Total   Spent    Left  Speed0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0# HELP pg_replication_lag_seconds Replication lag behind master in seconds
# TYPE pg_replication_lag_seconds gauge
pg_replication_lag_seconds 2059.643283
100 88926    0 88926    0     0  3503k      0 --:--:-- --:--:-- --:--:-- 3618k
 
2)查看主备库日志
pg主备库日志无任何异常
3)检查主库当前lsn及备库应用到的lsn
多次查看主库当前lsn不变,从库应用lsn无延迟,说明主库无数据写入
postgres=# SELECT * FROM pg_stat_replication \gx
-[ RECORD 1 ]----+------------------------------
pid              | 26226
usesysid         | 16384
usename          | repl
application_name | xxx.xx.xx.59_18801
client_addr      | xxx.xx.xx.59
client_hostname  | 
client_port      | 60049
backend_start    | 2023-08-04 09:39:42.059505+08
backend_xmin     | 
state            | streaming
sent_lsn         | 3633/F10891E8
write_lsn        | 3633/F10891E8
flush_lsn        | 3633/F10891E8
replay_lsn       | 3633/F10891E8
write_lag        | 
flush_lag        | 
replay_lag       | 
sync_priority    | 1
sync_state       | quorum
reply_time       | 2024-05-23 16:16:03.274172+08
-[ RECORD 2 ]----+------------------------------
pid              | 15343
usesysid         | 16384
usename          | repl
application_name | xxx_xx_xx_65_18801
client_addr      | xxx_xx_xx_65
client_hostname  | 
client_port      | 53700
backend_start    | 2023-08-05 04:30:26.862464+08
backend_xmin     | 
state            | streaming
sent_lsn         | 3633/F10891E8
write_lsn        | 3633/F10891E8
flush_lsn        | 3633/F10891E8
replay_lsn       | 3633/F10891E8
write_lag        | 
flush_lag        | 
replay_lag       | 
sync_priority    | 1
sync_state       | quorum
reply_time       | 2024-05-23 16:16:03.273835+08postgres=# 
postgres=# 
postgres=# select pg_current_wal_lsn();pg_current_wal_lsn 
--------------------3633/F10891E8
(1 row)
 
4.查看对应版本postgres_exporter源码中如何监控从库延迟
查看对应版本postgres_exporter源码中如何监控从库延迟
https://github.com/prometheus-community/postgres_exporter/commit/0b6d9860abb2340412a15be1fc2b4d25e154b8c3
  

  
5.问题解决
升级postgres_exporter到0.14或以上版本
