ストリーミングレプリケーション(つづき)

ストリーミングレプリケーション(つづき)

【補】構築

じっさいにdockerでmaster-standby構成を(雑に)組んだ
- ストリーミングレプリケーション環境 (docker)
- HBAはザル
PG12での変更点
- キータ
- recovery.confがない
- かわりにstandby.signalファイルを作る必要がある
  - touchでいい

/docker:267138f94e60:/var/lib/postgresql/data $ ls -lA standby.signal 
-rw-------   1 postgres       postgres        0 2020-02-10 04:00 standby.signal

pg_basebackupでmasterからベースバックアップを持ってくるときの-Rオプションは続投
- standby.signalファイルが生成される

Usage:
  pg_basebackup [OPTION]...

Options controlling the output:
...
  -R, --write-recovery-conf
                         write configuration for replication

もうrecovery.confはないのに--write-recovery-conf…
-Rオプションつきでpg_basebackupを呼び出すと、postgres.auto.confにprimary_conninfoが自動生成される
- プライマリ接続先情報

# Do not edit this file manually!
# It will be overwritten by the ALTER SYSTEM command.
primary_conninfo = 'user=replication_user passfile=''/var/lib/postgresql/.pgpass'' host=master port=5432 sslmode=prefer sslcompression=0 gssencmode=prefer krbsrvname=postgres target_session_attrs=any'

これがあるとpostgresql.confで何書いても効かないので注意する

レプリケーションの状況確認

ストリーミングレプリケーションを安全に運用するために

サーバーログの確認

master
- 同期レプリケーションだと出てくるやつ

master_1    | 2020-02-22 16:41:04.809 UTC [97] LOG:  standby "sby" is now a synchronous standby with priority 1

standby

standby1_1  | 2020-02-22 16:41:04.709 UTC [21] LOG:  started streaming WAL from primary at 0/3000000 on timeline 1

プロセスの確認

master

~ $ docker-compose exec master ps faxww | grep postgres | grep wal

   91 ?        Ss     0:00 postgres: walwriter   
   97 ?        Ss     0:00 postgres: walsender replication_user 172.20.0.4(53194) streaming 0/301E060

standby

~ $ docker-compose exec standby1 ps faxww | grep postgres | grep wal

   21 ?        Ss     0:00  \_ postgres: walreceiver   streaming 0/301E060

レプリケーション遅延の確認

プロセスの存在は確認できた
ちゃんと動いているか確認する
masterにて:

postgres=# SELECT * FROM pg_stat_replication;

-[ RECORD 1 ]----+------------------------------
pid              | 97
usesysid         | 16384
usename          | replication_user
application_name | walreceiver
client_addr      | 172.22.0.3
client_hostname  | 
client_port      | 32834
backend_start    | 2020-02-10 15:09:34.524869+00
backend_xmin     | 
state            | streaming
sent_lsn         | 0/3013310
write_lsn        | 0/3013310
flush_lsn        | 0/3013310
replay_lsn       | 0/3013310
write_lag        | 
flush_lag        | 
replay_lag       | 
sync_priority    | 1
sync_state       | sync
reply_time       | 2020-02-10 15:16:16.326192+00

スタンバイwalreceiverがプライマリからデータを同期的に受け取る準備ができている

application_name | walreceiver
sync_state       | sync

レプリケーション遅延
- 「同期」といっても、プライマリとスタンバイの状態が常に同じとは限らない
  - masterのsynchronous_commit設定がremote_applyでない限りは遅延しうる
遅延具合

sent_lsn         | 0/3013310
write_lsn        | 0/3013310
flush_lsn        | 0/3013310
replay_lsn       | 0/3013310

sent_lsn: プライマリがスタンバイに送出したLSN
write_lsn: スタンバイが受け取ってバッファに書き込んだLSN
flush_lsn: スタンバイがディスクに書き込んだLSN
replay_lsn: スタンバイがWALを適用したLSN
今回は全部同じなので完全に同期できている

レプリケーションの管理

プライマリが死んだら？
自動的に立ち直って欲しいところだが…
PostgreSQL12時点では、手を加えてスタンバイを昇格してあげる必要あり

recovery.confのtrigger_fileを用いる方法

standbyの設定

postgres=# SELECT name,setting,context FROM pg_settings WHERE name='promote_trigger_file';

         name         | setting | context 
----------------------+---------+---------
 promote_trigger_file |         | sighup
(1 row)

指定のパスにファイルがあるのを検出したらmasterに昇格する
ためす

postgresql.conf

promote_trigger_file = 'trigger.file'

reload必要
レプリケーション状態でmasterを停止してみる

standby1_1  | 2020-02-10 15:36:49.623 UTC [21] LOG:  replication terminated by primary server
standby1_1  | 2020-02-10 15:36:49.623 UTC [21] DETAIL:  End of WAL reached on timeline 1 at 0/3013388.
standby1_1  | 2020-02-10 15:36:49.623 UTC [21] FATAL:  could not send end-of-streaming message to primary: no COPY in progress
standby1_1  | 2020-02-10 15:36:49.623 UTC [17] LOG:  invalid record length at 0/3013388: wanted 24, got 0
master_1    | 2020-02-10 15:36:49.624 UTC [1] LOG:  database system is shut down
standby1_1  | 2020-02-10 15:36:49.625 UTC [34] FATAL:  could not connect to the primary server: could not connect to server: Connection refused
standby1_1  |       Is the server running on host "master" (172.22.0.2) and accepting
standby1_1  |       TCP/IP connections on port 5432?
streaming_replication_master_1 exited with code 0
standby1_1  | 2020-02-10 15:36:54.636 UTC [35] FATAL:  could not connect to the primary server: could not translate host name "master" to address: Name or service not known
standby1_1  | 2020-02-10 15:36:59.641 UTC [36] FATAL:  could not connect to the primary server: could not translate host name "master" to address: Name or service not known

standbyサーバーで所定のファイルを生成

postgres@a1ed78157801:~/data$ touch trigger.file

ほどなくmasterに昇格する
- 若干のタイムラグあり

standby1_1  | 2020-02-10 15:37:04.649 UTC [37] FATAL:  could not connect to the primary server: could not translate host name "master" to address: Name or service not known
standby1_1  | 2020-02-10 15:37:09.653 UTC [38] FATAL:  could not connect to the primary server: could not translate host name "master" to address: Name or service not known
standby1_1  | 2020-02-10 15:37:14.653 UTC [17] LOG:  promote trigger file found: trigger.file
standby1_1  | 2020-02-10 15:37:14.653 UTC [17] LOG:  redo done at 0/3013310
standby1_1  | 2020-02-10 15:37:14.653 UTC [17] LOG:  last completed transaction was at log time 2020-02-10 15:10:05.098076+00
standby1_1  | 2020-02-10 15:37:14.661 UTC [17] LOG:  selected new timeline ID: 2
standby1_1  | 2020-02-10 15:37:14.774 UTC [17] LOG:  archive recovery complete
standby1_1  | 2020-02-10 15:37:14.788 UTC [16] LOG:  database system is ready to accept connections

pg_ctl promoteを用いる方法

standbyサーバーにログインしてコマンド実行

postgres@3a3af865b441:/$ pg_ctl promote

waiting for server to promote.... done
server promoted

ログ

master_1    | 2020-02-10 15:50:07.545 UTC [1] LOG:  database system is shut down
standby1_1  | 2020-02-10 15:50:07.547 UTC [22] FATAL:  could not connect to the primary server: could not connect to server: Connection refused
standby1_1  |       Is the server running on host "master" (172.23.0.2) and accepting
standby1_1  |       TCP/IP connections on port 5432?
streaming_replication_master_1 exited with code 0
standby1_1  | 2020-02-10 15:50:12.557 UTC [23] FATAL:  could not connect to the primary server: could not translate host name "master" to address: Name or service not known
standby1_1  | 2020-02-10 15:50:17.563 UTC [24] FATAL:  could not connect to the primary server: could not translate host name "master" to address: Name or service not known
standby1_1  | 2020-02-10 15:50:22.566 UTC [25] FATAL:  could not connect to the primary server: could not translate host name "master" to address: Name or service not known
standby1_1  | 2020-02-10 15:50:27.572 UTC [31] FATAL:  could not connect to the primary server: could not translate host name "master" to address: Name or service not known
standby1_1  | 2020-02-10 15:50:28.906 UTC [16] LOG:  received promote request
standby1_1  | 2020-02-10 15:50:28.906 UTC [16] LOG:  redo done at 0/3000060
standby1_1  | 2020-02-10 15:50:28.909 UTC [16] LOG:  selected new timeline ID: 2
standby1_1  | 2020-02-10 15:50:29.017 UTC [16] LOG:  archive recovery complete
standby1_1  | 2020-02-10 15:50:29.028 UTC [15] LOG:  database system is ready to accept connections

こちらはシグナル送信を起点に即昇格

【補】`pg_promote`関数呼び出し

公式
PostgreSQL12で追加されたやつ
サーバーにログインしなくていいのが嬉しい
クライアントからSQL文発行

postgres@7054e6e2b74f:/$ psql -h standby1 -c "SELECT pg_promote();"

 pg_promote 
------------
 t
(1 row)

ログ

master_1    | 2020-02-10 15:52:42.567 UTC [1] LOG:  database system is shut down
standby1_1  | 2020-02-10 15:52:42.568 UTC [23] FATAL:  could not connect to the primary server: could not connect to server: Connection refused
standby1_1  |       Is the server running on host "master" (172.24.0.2) and accepting
standby1_1  |       TCP/IP connections on port 5432?
streaming_replication_master_1 exited with code 0
standby1_1  | 2020-02-10 15:52:47.578 UTC [24] FATAL:  could not connect to the primary server: could not translate host name "master" to address: Name or service not known
standby1_1  | 2020-02-10 15:52:52.584 UTC [25] FATAL:  could not connect to the primary server: could not translate host name "master" to address: Name or service not known
standby1_1  | 2020-02-10 15:52:57.588 UTC [26] FATAL:  could not connect to the primary server: could not translate host name "master" to address: Name or service not known
standby1_1  | 2020-02-10 15:53:02.592 UTC [27] FATAL:  could not connect to the primary server: could not translate host name "master" to address: Name or service not known
standby1_1  | 2020-02-10 15:53:07.598 UTC [28] FATAL:  could not connect to the primary server: could not translate host name "master" to address: Name or service not known
standby1_1  | 2020-02-10 15:53:12.604 UTC [29] FATAL:  could not connect to the primary server: could not translate host name "master" to address: Name or service not known
standby1_1  | 2020-02-10 15:53:15.994 UTC [30] LOG:  connection received: host=172.24.0.5 port=60854
standby1_1  | 2020-02-10 15:53:15.994 UTC [30] LOG:  connection authorized: user=postgres database=postgres application_name=psql
standby1_1  | 2020-02-10 15:53:15.998 UTC [17] LOG:  received promote request
standby1_1  | 2020-02-10 15:53:15.998 UTC [17] LOG:  redo done at 0/3000060
standby1_1  | 2020-02-10 15:53:16.028 UTC [17] LOG:  selected new timeline ID: 2
standby1_1  | 2020-02-10 15:53:16.139 UTC [17] LOG:  archive recovery complete
standby1_1  | 2020-02-10 15:53:16.155 UTC [16] LOG:  database system is ready to accept connections
standby1_1  | 2020-02-10 15:53:16.201 UTC [30] LOG:  disconnection: session time: 0:00:00.207 user=postgres database=postgres host=172.24.0.5 port=60854

共通

昇格したstandbyのタイムラインIDは1つ繰り上がる

昇格前

standby1_1  | 2020-02-10 15:36:49.623 UTC [21] DETAIL:  End of WAL reached on timeline 1 at 0/3013388.

昇格時

standby1_1  | 2020-02-10 15:37:14.661 UTC [17] LOG:  selected new timeline ID: 2

設定手順の整理

docker環境(再掲)
ストリーミングレプリケーション環境 (docker)

Column: 循環するレプリケーション

全部スタンバイにできちゃう
設定をしくじるとプライマリがない状態になっちゃう。気をつけよう

勉強日記

チラ裏

内部構造から学ぶPostgreSQL 設計・運用計画の鉄則 ch10 (高可用化と負荷分散) (2/6)

ストリーミングレプリケーション(つづき)

【補】構築

レプリケーションの状況確認

サーバーログの確認

プロセスの確認

レプリケーション遅延の確認

レプリケーションの管理

recovery.confのtrigger_fileを用いる方法

pg_ctl promoteを用いる方法

【補】`pg_promote`関数呼び出し

共通

設定手順の整理

Column: 循環するレプリケーション

ストリーミングレプリケーション(つづき)

【補】構築

レプリケーションの状況確認

サーバーログの確認

プロセスの確認

レプリケーション遅延の確認

レプリケーションの管理

recovery.confのtrigger_fileを用いる方法

pg_ctl promoteを用いる方法

【補】pg_promote関数呼び出し

共通

設定手順の整理

Column: 循環するレプリケーション

【補】`pg_promote`関数呼び出し