勉強日記

チラ裏

内部構造から学ぶPostgreSQL 設計・運用計画の鉄則 ch10 (高可用化と負荷分散) (2/6)

gihyo.jp


ストリーミングレプリケーション(つづき)

【補】構築

/docker:267138f94e60:/var/lib/postgresql/data $ ls -lA standby.signal 
-rw-------   1 postgres       postgres        0 2020-02-10 04:00 standby.signal
  • pg_basebackupでmasterからベースバックアップを持ってくるときの-Rオプションは続投
    • standby.signalファイルが生成される
Usage:
  pg_basebackup [OPTION]...

Options controlling the output:
...
  -R, --write-recovery-conf
                         write configuration for replication
  • もうrecovery.confはないのに--write-recovery-conf
  • -Rオプションつきでpg_basebackupを呼び出すと、postgres.auto.confprimary_conninfoが自動生成される
    • プライマリ接続先情報
# Do not edit this file manually!
# It will be overwritten by the ALTER SYSTEM command.
primary_conninfo = 'user=replication_user passfile=''/var/lib/postgresql/.pgpass'' host=master port=5432 sslmode=prefer sslcompression=0 gssencmode=prefer krbsrvname=postgres target_session_attrs=any'
  • これがあるとpostgresql.confで何書いても効かないので注意する

レプリケーションの状況確認

サーバーログの確認

master_1    | 2020-02-22 16:41:04.809 UTC [97] LOG:  standby "sby" is now a synchronous standby with priority 1
  • standby
standby1_1  | 2020-02-22 16:41:04.709 UTC [21] LOG:  started streaming WAL from primary at 0/3000000 on timeline 1

プロセスの確認

  • master
~ $ docker-compose exec master ps faxww | grep postgres | grep wal
   91 ?        Ss     0:00 postgres: walwriter   
   97 ?        Ss     0:00 postgres: walsender replication_user 172.20.0.4(53194) streaming 0/301E060
  • standby
~ $ docker-compose exec standby1 ps faxww | grep postgres | grep wal
   21 ?        Ss     0:00  \_ postgres: walreceiver   streaming 0/301E060

レプリケーション遅延の確認

  • プロセスの存在は確認できた
  • ちゃんと動いているか確認する
  • masterにて:
postgres=# SELECT * FROM pg_stat_replication;

-[ RECORD 1 ]----+------------------------------
pid              | 97
usesysid         | 16384
usename          | replication_user
application_name | walreceiver
client_addr      | 172.22.0.3
client_hostname  | 
client_port      | 32834
backend_start    | 2020-02-10 15:09:34.524869+00
backend_xmin     | 
state            | streaming
sent_lsn         | 0/3013310
write_lsn        | 0/3013310
flush_lsn        | 0/3013310
replay_lsn       | 0/3013310
write_lag        | 
flush_lag        | 
replay_lag       | 
sync_priority    | 1
sync_state       | sync
reply_time       | 2020-02-10 15:16:16.326192+00
  • スタンバイwalreceiverがプライマリからデータを同期的に受け取る準備ができている
application_name | walreceiver
sync_state       | sync
  • レプリケーション遅延
    • 「同期」といっても、プライマリとスタンバイの状態が常に同じとは限らない
      • masterのsynchronous_commit設定がremote_applyでない限りは遅延しうる
  • 遅延具合
sent_lsn         | 0/3013310
write_lsn        | 0/3013310
flush_lsn        | 0/3013310
replay_lsn       | 0/3013310
  • sent_lsn: プライマリがスタンバイに送出したLSN
  • write_lsn: スタンバイが受け取ってバッファに書き込んだLSN
  • flush_lsn: スタンバイがディスクに書き込んだLSN
  • replay_lsn: スタンバイがWALを適用したLSN
  • 今回は全部同じなので完全に同期できている

レプリケーションの管理

  • プライマリが死んだら?
  • 自動的に立ち直って欲しいところだが…
  • PostgreSQL12時点では、手を加えてスタンバイを昇格してあげる必要あり

recovery.confのtrigger_fileを用いる方法

  • standbyの設定
postgres=# SELECT name,setting,context FROM pg_settings WHERE name='promote_trigger_file';

         name         | setting | context 
----------------------+---------+---------
 promote_trigger_file |         | sighup
(1 row)
  • 指定のパスにファイルがあるのを検出したらmasterに昇格する
  • ためす

postgresql.conf

promote_trigger_file = 'trigger.file'
standby1_1  | 2020-02-10 15:36:49.623 UTC [21] LOG:  replication terminated by primary server
standby1_1  | 2020-02-10 15:36:49.623 UTC [21] DETAIL:  End of WAL reached on timeline 1 at 0/3013388.
standby1_1  | 2020-02-10 15:36:49.623 UTC [21] FATAL:  could not send end-of-streaming message to primary: no COPY in progress
standby1_1  | 2020-02-10 15:36:49.623 UTC [17] LOG:  invalid record length at 0/3013388: wanted 24, got 0
master_1    | 2020-02-10 15:36:49.624 UTC [1] LOG:  database system is shut down
standby1_1  | 2020-02-10 15:36:49.625 UTC [34] FATAL:  could not connect to the primary server: could not connect to server: Connection refused
standby1_1  |       Is the server running on host "master" (172.22.0.2) and accepting
standby1_1  |       TCP/IP connections on port 5432?
streaming_replication_master_1 exited with code 0
standby1_1  | 2020-02-10 15:36:54.636 UTC [35] FATAL:  could not connect to the primary server: could not translate host name "master" to address: Name or service not known
standby1_1  | 2020-02-10 15:36:59.641 UTC [36] FATAL:  could not connect to the primary server: could not translate host name "master" to address: Name or service not known
  • standbyサーバーで所定のファイルを生成
postgres@a1ed78157801:~/data$ touch trigger.file
  • ほどなくmasterに昇格する
    • 若干のタイムラグあり
standby1_1  | 2020-02-10 15:37:04.649 UTC [37] FATAL:  could not connect to the primary server: could not translate host name "master" to address: Name or service not known
standby1_1  | 2020-02-10 15:37:09.653 UTC [38] FATAL:  could not connect to the primary server: could not translate host name "master" to address: Name or service not known
standby1_1  | 2020-02-10 15:37:14.653 UTC [17] LOG:  promote trigger file found: trigger.file
standby1_1  | 2020-02-10 15:37:14.653 UTC [17] LOG:  redo done at 0/3013310
standby1_1  | 2020-02-10 15:37:14.653 UTC [17] LOG:  last completed transaction was at log time 2020-02-10 15:10:05.098076+00
standby1_1  | 2020-02-10 15:37:14.661 UTC [17] LOG:  selected new timeline ID: 2
standby1_1  | 2020-02-10 15:37:14.774 UTC [17] LOG:  archive recovery complete
standby1_1  | 2020-02-10 15:37:14.788 UTC [16] LOG:  database system is ready to accept connections

pg_ctl promoteを用いる方法

standbyサーバーにログインしてコマンド実行

postgres@3a3af865b441:/$ pg_ctl promote
waiting for server to promote.... done
server promoted
  • ログ
master_1    | 2020-02-10 15:50:07.545 UTC [1] LOG:  database system is shut down
standby1_1  | 2020-02-10 15:50:07.547 UTC [22] FATAL:  could not connect to the primary server: could not connect to server: Connection refused
standby1_1  |       Is the server running on host "master" (172.23.0.2) and accepting
standby1_1  |       TCP/IP connections on port 5432?
streaming_replication_master_1 exited with code 0
standby1_1  | 2020-02-10 15:50:12.557 UTC [23] FATAL:  could not connect to the primary server: could not translate host name "master" to address: Name or service not known
standby1_1  | 2020-02-10 15:50:17.563 UTC [24] FATAL:  could not connect to the primary server: could not translate host name "master" to address: Name or service not known
standby1_1  | 2020-02-10 15:50:22.566 UTC [25] FATAL:  could not connect to the primary server: could not translate host name "master" to address: Name or service not known
standby1_1  | 2020-02-10 15:50:27.572 UTC [31] FATAL:  could not connect to the primary server: could not translate host name "master" to address: Name or service not known
standby1_1  | 2020-02-10 15:50:28.906 UTC [16] LOG:  received promote request
standby1_1  | 2020-02-10 15:50:28.906 UTC [16] LOG:  redo done at 0/3000060
standby1_1  | 2020-02-10 15:50:28.909 UTC [16] LOG:  selected new timeline ID: 2
standby1_1  | 2020-02-10 15:50:29.017 UTC [16] LOG:  archive recovery complete
standby1_1  | 2020-02-10 15:50:29.028 UTC [15] LOG:  database system is ready to accept connections
  • こちらはシグナル送信を起点に即昇格

【補】pg_promote関数呼び出し

  • 公式
  • PostgreSQL12で追加されたやつ
  • サーバーにログインしなくていいのが嬉しい
  • クライアントからSQL文発行
postgres@7054e6e2b74f:/$ psql -h standby1 -c "SELECT pg_promote();"
 pg_promote 
------------
 t
(1 row)
  • ログ
master_1    | 2020-02-10 15:52:42.567 UTC [1] LOG:  database system is shut down
standby1_1  | 2020-02-10 15:52:42.568 UTC [23] FATAL:  could not connect to the primary server: could not connect to server: Connection refused
standby1_1  |       Is the server running on host "master" (172.24.0.2) and accepting
standby1_1  |       TCP/IP connections on port 5432?
streaming_replication_master_1 exited with code 0
standby1_1  | 2020-02-10 15:52:47.578 UTC [24] FATAL:  could not connect to the primary server: could not translate host name "master" to address: Name or service not known
standby1_1  | 2020-02-10 15:52:52.584 UTC [25] FATAL:  could not connect to the primary server: could not translate host name "master" to address: Name or service not known
standby1_1  | 2020-02-10 15:52:57.588 UTC [26] FATAL:  could not connect to the primary server: could not translate host name "master" to address: Name or service not known
standby1_1  | 2020-02-10 15:53:02.592 UTC [27] FATAL:  could not connect to the primary server: could not translate host name "master" to address: Name or service not known
standby1_1  | 2020-02-10 15:53:07.598 UTC [28] FATAL:  could not connect to the primary server: could not translate host name "master" to address: Name or service not known
standby1_1  | 2020-02-10 15:53:12.604 UTC [29] FATAL:  could not connect to the primary server: could not translate host name "master" to address: Name or service not known
standby1_1  | 2020-02-10 15:53:15.994 UTC [30] LOG:  connection received: host=172.24.0.5 port=60854
standby1_1  | 2020-02-10 15:53:15.994 UTC [30] LOG:  connection authorized: user=postgres database=postgres application_name=psql
standby1_1  | 2020-02-10 15:53:15.998 UTC [17] LOG:  received promote request
standby1_1  | 2020-02-10 15:53:15.998 UTC [17] LOG:  redo done at 0/3000060
standby1_1  | 2020-02-10 15:53:16.028 UTC [17] LOG:  selected new timeline ID: 2
standby1_1  | 2020-02-10 15:53:16.139 UTC [17] LOG:  archive recovery complete
standby1_1  | 2020-02-10 15:53:16.155 UTC [16] LOG:  database system is ready to accept connections
standby1_1  | 2020-02-10 15:53:16.201 UTC [30] LOG:  disconnection: session time: 0:00:00.207 user=postgres database=postgres host=172.24.0.5 port=60854

共通

  • 昇格したstandbyのタイムラインIDは1つ繰り上がる

昇格前

standby1_1  | 2020-02-10 15:36:49.623 UTC [21] DETAIL:  End of WAL reached on timeline 1 at 0/3013388.

昇格時

standby1_1  | 2020-02-10 15:37:14.661 UTC [17] LOG:  selected new timeline ID: 2

設定手順の整理

Column: 循環するレプリケーション

  • 全部スタンバイにできちゃう
  • 設定をしくじるとプライマリがない状態になっちゃう。気をつけよう