内部構造から学ぶPostgreSQL 設計・運用計画の鉄則 ch10 (高可用化と負荷分散) (2/6)
ストリーミングレプリケーション(つづき)
【補】構築
- じっさいにdockerでmaster-standby構成を(雑に)組んだ
- ストリーミングレプリケーション環境 (docker)
- HBAはザル
- PG12での変更点
- キータ
recovery.conf
がない- かわりに
standby.signal
ファイルを作る必要があるtouch
でいい
/docker:267138f94e60:/var/lib/postgresql/data $ ls -lA standby.signal -rw------- 1 postgres postgres 0 2020-02-10 04:00 standby.signal
pg_basebackup
でmasterからベースバックアップを持ってくるときの-R
オプションは続投standby.signal
ファイルが生成される
Usage: pg_basebackup [OPTION]... Options controlling the output: ... -R, --write-recovery-conf write configuration for replication
- もう
recovery.conf
はないのに--write-recovery-conf
… -R
オプションつきでpg_basebackup
を呼び出すと、postgres.auto.conf
にprimary_conninfo
が自動生成される- プライマリ接続先情報
# Do not edit this file manually! # It will be overwritten by the ALTER SYSTEM command. primary_conninfo = 'user=replication_user passfile=''/var/lib/postgresql/.pgpass'' host=master port=5432 sslmode=prefer sslcompression=0 gssencmode=prefer krbsrvname=postgres target_session_attrs=any'
- これがあると
postgresql.conf
で何書いても効かないので注意する
レプリケーションの状況確認
- ストリーミングレプリケーションを安全に運用するために
サーバーログの確認
- master
- 同期レプリケーションだと出てくるやつ
master_1 | 2020-02-22 16:41:04.809 UTC [97] LOG: standby "sby" is now a synchronous standby with priority 1
- standby
standby1_1 | 2020-02-22 16:41:04.709 UTC [21] LOG: started streaming WAL from primary at 0/3000000 on timeline 1
プロセスの確認
- master
~ $ docker-compose exec master ps faxww | grep postgres | grep wal
91 ? Ss 0:00 postgres: walwriter 97 ? Ss 0:00 postgres: walsender replication_user 172.20.0.4(53194) streaming 0/301E060
- standby
~ $ docker-compose exec standby1 ps faxww | grep postgres | grep wal
21 ? Ss 0:00 \_ postgres: walreceiver streaming 0/301E060
レプリケーション遅延の確認
- プロセスの存在は確認できた
- ちゃんと動いているか確認する
- masterにて:
postgres=# SELECT * FROM pg_stat_replication; -[ RECORD 1 ]----+------------------------------ pid | 97 usesysid | 16384 usename | replication_user application_name | walreceiver client_addr | 172.22.0.3 client_hostname | client_port | 32834 backend_start | 2020-02-10 15:09:34.524869+00 backend_xmin | state | streaming sent_lsn | 0/3013310 write_lsn | 0/3013310 flush_lsn | 0/3013310 replay_lsn | 0/3013310 write_lag | flush_lag | replay_lag | sync_priority | 1 sync_state | sync reply_time | 2020-02-10 15:16:16.326192+00
- スタンバイ
walreceiver
がプライマリからデータを同期的に受け取る準備ができている
application_name | walreceiver sync_state | sync
- レプリケーション遅延
- 「同期」といっても、プライマリとスタンバイの状態が常に同じとは限らない
- masterの
synchronous_commit
設定がremote_apply
でない限りは遅延しうる
- masterの
- 「同期」といっても、プライマリとスタンバイの状態が常に同じとは限らない
- 遅延具合
sent_lsn | 0/3013310 write_lsn | 0/3013310 flush_lsn | 0/3013310 replay_lsn | 0/3013310
sent_lsn
: プライマリがスタンバイに送出したLSNwrite_lsn
: スタンバイが受け取ってバッファに書き込んだLSNflush_lsn
: スタンバイがディスクに書き込んだLSNreplay_lsn
: スタンバイがWALを適用したLSN- 今回は全部同じなので完全に同期できている
レプリケーションの管理
- プライマリが死んだら?
- 自動的に立ち直って欲しいところだが…
- PostgreSQL12時点では、手を加えてスタンバイを昇格してあげる必要あり
recovery.confのtrigger_fileを用いる方法
- standbyの設定
postgres=# SELECT name,setting,context FROM pg_settings WHERE name='promote_trigger_file'; name | setting | context ----------------------+---------+--------- promote_trigger_file | | sighup (1 row)
- 指定のパスにファイルがあるのを検出したらmasterに昇格する
- ためす
postgresql.conf
promote_trigger_file = 'trigger.file'
- reload必要
- レプリケーション状態でmasterを停止してみる
standby1_1 | 2020-02-10 15:36:49.623 UTC [21] LOG: replication terminated by primary server standby1_1 | 2020-02-10 15:36:49.623 UTC [21] DETAIL: End of WAL reached on timeline 1 at 0/3013388. standby1_1 | 2020-02-10 15:36:49.623 UTC [21] FATAL: could not send end-of-streaming message to primary: no COPY in progress standby1_1 | 2020-02-10 15:36:49.623 UTC [17] LOG: invalid record length at 0/3013388: wanted 24, got 0 master_1 | 2020-02-10 15:36:49.624 UTC [1] LOG: database system is shut down standby1_1 | 2020-02-10 15:36:49.625 UTC [34] FATAL: could not connect to the primary server: could not connect to server: Connection refused standby1_1 | Is the server running on host "master" (172.22.0.2) and accepting standby1_1 | TCP/IP connections on port 5432? streaming_replication_master_1 exited with code 0 standby1_1 | 2020-02-10 15:36:54.636 UTC [35] FATAL: could not connect to the primary server: could not translate host name "master" to address: Name or service not known standby1_1 | 2020-02-10 15:36:59.641 UTC [36] FATAL: could not connect to the primary server: could not translate host name "master" to address: Name or service not known
- standbyサーバーで所定のファイルを生成
postgres@a1ed78157801:~/data$ touch trigger.file
- ほどなくmasterに昇格する
- 若干のタイムラグあり
standby1_1 | 2020-02-10 15:37:04.649 UTC [37] FATAL: could not connect to the primary server: could not translate host name "master" to address: Name or service not known standby1_1 | 2020-02-10 15:37:09.653 UTC [38] FATAL: could not connect to the primary server: could not translate host name "master" to address: Name or service not known standby1_1 | 2020-02-10 15:37:14.653 UTC [17] LOG: promote trigger file found: trigger.file standby1_1 | 2020-02-10 15:37:14.653 UTC [17] LOG: redo done at 0/3013310 standby1_1 | 2020-02-10 15:37:14.653 UTC [17] LOG: last completed transaction was at log time 2020-02-10 15:10:05.098076+00 standby1_1 | 2020-02-10 15:37:14.661 UTC [17] LOG: selected new timeline ID: 2 standby1_1 | 2020-02-10 15:37:14.774 UTC [17] LOG: archive recovery complete standby1_1 | 2020-02-10 15:37:14.788 UTC [16] LOG: database system is ready to accept connections
pg_ctl promoteを用いる方法
standbyサーバーにログインしてコマンド実行
postgres@3a3af865b441:/$ pg_ctl promote
waiting for server to promote.... done server promoted
- ログ
master_1 | 2020-02-10 15:50:07.545 UTC [1] LOG: database system is shut down standby1_1 | 2020-02-10 15:50:07.547 UTC [22] FATAL: could not connect to the primary server: could not connect to server: Connection refused standby1_1 | Is the server running on host "master" (172.23.0.2) and accepting standby1_1 | TCP/IP connections on port 5432? streaming_replication_master_1 exited with code 0 standby1_1 | 2020-02-10 15:50:12.557 UTC [23] FATAL: could not connect to the primary server: could not translate host name "master" to address: Name or service not known standby1_1 | 2020-02-10 15:50:17.563 UTC [24] FATAL: could not connect to the primary server: could not translate host name "master" to address: Name or service not known standby1_1 | 2020-02-10 15:50:22.566 UTC [25] FATAL: could not connect to the primary server: could not translate host name "master" to address: Name or service not known standby1_1 | 2020-02-10 15:50:27.572 UTC [31] FATAL: could not connect to the primary server: could not translate host name "master" to address: Name or service not known standby1_1 | 2020-02-10 15:50:28.906 UTC [16] LOG: received promote request standby1_1 | 2020-02-10 15:50:28.906 UTC [16] LOG: redo done at 0/3000060 standby1_1 | 2020-02-10 15:50:28.909 UTC [16] LOG: selected new timeline ID: 2 standby1_1 | 2020-02-10 15:50:29.017 UTC [16] LOG: archive recovery complete standby1_1 | 2020-02-10 15:50:29.028 UTC [15] LOG: database system is ready to accept connections
- こちらはシグナル送信を起点に即昇格
【補】pg_promote
関数呼び出し
postgres@7054e6e2b74f:/$ psql -h standby1 -c "SELECT pg_promote();"
pg_promote ------------ t (1 row)
- ログ
master_1 | 2020-02-10 15:52:42.567 UTC [1] LOG: database system is shut down standby1_1 | 2020-02-10 15:52:42.568 UTC [23] FATAL: could not connect to the primary server: could not connect to server: Connection refused standby1_1 | Is the server running on host "master" (172.24.0.2) and accepting standby1_1 | TCP/IP connections on port 5432? streaming_replication_master_1 exited with code 0 standby1_1 | 2020-02-10 15:52:47.578 UTC [24] FATAL: could not connect to the primary server: could not translate host name "master" to address: Name or service not known standby1_1 | 2020-02-10 15:52:52.584 UTC [25] FATAL: could not connect to the primary server: could not translate host name "master" to address: Name or service not known standby1_1 | 2020-02-10 15:52:57.588 UTC [26] FATAL: could not connect to the primary server: could not translate host name "master" to address: Name or service not known standby1_1 | 2020-02-10 15:53:02.592 UTC [27] FATAL: could not connect to the primary server: could not translate host name "master" to address: Name or service not known standby1_1 | 2020-02-10 15:53:07.598 UTC [28] FATAL: could not connect to the primary server: could not translate host name "master" to address: Name or service not known standby1_1 | 2020-02-10 15:53:12.604 UTC [29] FATAL: could not connect to the primary server: could not translate host name "master" to address: Name or service not known standby1_1 | 2020-02-10 15:53:15.994 UTC [30] LOG: connection received: host=172.24.0.5 port=60854 standby1_1 | 2020-02-10 15:53:15.994 UTC [30] LOG: connection authorized: user=postgres database=postgres application_name=psql standby1_1 | 2020-02-10 15:53:15.998 UTC [17] LOG: received promote request standby1_1 | 2020-02-10 15:53:15.998 UTC [17] LOG: redo done at 0/3000060 standby1_1 | 2020-02-10 15:53:16.028 UTC [17] LOG: selected new timeline ID: 2 standby1_1 | 2020-02-10 15:53:16.139 UTC [17] LOG: archive recovery complete standby1_1 | 2020-02-10 15:53:16.155 UTC [16] LOG: database system is ready to accept connections standby1_1 | 2020-02-10 15:53:16.201 UTC [30] LOG: disconnection: session time: 0:00:00.207 user=postgres database=postgres host=172.24.0.5 port=60854
共通
- 昇格したstandbyのタイムラインIDは1つ繰り上がる
昇格前
standby1_1 | 2020-02-10 15:36:49.623 UTC [21] DETAIL: End of WAL reached on timeline 1 at 0/3013388.
昇格時
standby1_1 | 2020-02-10 15:37:14.661 UTC [17] LOG: selected new timeline ID: 2
設定手順の整理
- docker環境(再掲)
- ストリーミングレプリケーション環境 (docker)
Column: 循環するレプリケーション
- 全部スタンバイにできちゃう
- 設定をしくじるとプライマリがない状態になっちゃう。気をつけよう