If the connection between the master and standby drops, it will take some time for that to be noticed across an indirect network. To ensure that a dropped connection is noticed as soon as possible, you may wish to adjust the timeout settings.
The standby will notice that the connection to the master has dropped after wal_receiver_timeout milliseconds. Once the connection is dropped, the standby will retry the connection to the sending server every wal_retrieve_retry_interval milliseconds. Set these parameters in the postgresql.conf file on the standby.
A sending server will notice that the connection has dropped after wal_sender_timeout milliseconds, which is set in the postgresql.conf file on the sender. Once the connection is dropped, the standby is responsible for re-establishing it.
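To get a feel for how these two standby settings combine, here is a minimal sketch in Python. It assumes the simplified model described above: the drop is noticed after wal_receiver_timeout, and a reconnection attempt follows within one wal_retrieve_retry_interval. The function name is hypothetical, and the default arguments are the values PostgreSQL normally ships with (60 seconds and 5 seconds); check your own postgresql.conf before relying on them.

```python
def worst_case_reconnect_ms(wal_receiver_timeout_ms: int = 60_000,
                            wal_retrieve_retry_interval_ms: int = 5_000) -> int:
    """Rough upper bound, in milliseconds, between a connection drop and the
    standby's next reconnection attempt.

    Simplified model (an assumption): the standby notices the drop after
    wal_receiver_timeout, then retries within wal_retrieve_retry_interval.
    """
    # A timeout of 0 disables the timeout mechanism in PostgreSQL, so the
    # bound is meaningless in that case; reject it along with negatives.
    if wal_receiver_timeout_ms <= 0:
        raise ValueError("wal_receiver_timeout must be positive for this estimate")
    if wal_retrieve_retry_interval_ms < 0:
        raise ValueError("wal_retrieve_retry_interval cannot be negative")
    return wal_receiver_timeout_ms + wal_retrieve_retry_interval_ms


if __name__ == "__main__":
    print(worst_case_reconnect_ms())               # with the default settings
    print(worst_case_reconnect_ms(10_000, 1_000))  # with tighter settings
```

Lowering both values shortens the window, at the cost of more false alarms on a slow or congested network.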
You may also wish to increase max_wal_senders to one or two more than the current number of nodes so that it will be possible to reconnect even before a dropped connection is noticed. This allows a manual restart to re-establish connections more easily. If you do this, then also increase the connection limit for the replication user. Changing max_wal_senders requires a server restart.
Data transfer may stop if the connection drops or if the standby server or the standby system is shut down. If replication data transfer stops for any reason, the standby will attempt to restart from the point of the last transfer. Will that data still be available? Let's see.
For streaming replication, the master keeps a number of WAL files that is at least equal to wal_keep_segments. If the standby database server has been down for long enough, the master will have moved on and will no longer have the data for the last point of transfer. If that should occur, then the standby needs to be reconfigured using the same procedure with which we started.
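How long is "long enough"? A back-of-the-envelope estimate can be sketched in Python as follows. It assumes the default 16 MB WAL segment size and a steady WAL generation rate that you have measured yourself; the function name and the sample numbers are purely illustrative.

```python
WAL_SEGMENT_BYTES = 16 * 1024 * 1024  # assumes the default 16 MB segment size


def max_outage_seconds(wal_keep_segments: int,
                       wal_bytes_per_second: float) -> float:
    """Estimate how long a standby can stay disconnected before the master
    may have recycled the WAL it needs.

    Assumes a constant WAL generation rate; real workloads are bursty, so
    treat the result as optimistic.
    """
    if wal_keep_segments < 0:
        raise ValueError("wal_keep_segments cannot be negative")
    if wal_bytes_per_second <= 0:
        raise ValueError("WAL generation rate must be positive")
    return wal_keep_segments * WAL_SEGMENT_BYTES / wal_bytes_per_second


if __name__ == "__main__":
    # Illustrative numbers: 1,000 retained segments, 1 MiB of WAL per second.
    seconds = max_outage_seconds(1000, 1024 * 1024)
    print(f"{seconds:.0f} s, about {seconds / 3600:.1f} hours")
```

Because the master keeps at least wal_keep_segments files, the real window may be somewhat longer, but it is safer to plan around the lower figure.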
You should plan to use pg_basebackup --wal-method=stream. If you choose not to, you should note that the standby database server will not be streaming during the initial base backup. So, if the base backup takes long enough, we might end up with a situation where replication will never start because the desired starting point is no longer available on the master. This is the error that you'll get:
FATAL: requested WAL segment 000000010000000000000002 has already been removed
It's very annoying, and there's no way out of it; you need to start over. So start with a very high value of wal_keep_segments. Don't pick this value at random; set it to the available disk space on pg_wal divided by 16 MB (the default WAL segment size), or less if it is a shared disk. If you still get that error, then you need to increase wal_keep_segments and try again, possibly also using techniques to speed up the base backup, which are discussed in Chapter 11, Backup and Recovery.
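The sizing rule above can be sketched in a few lines of Python. This is only an illustration: the function names and the share parameter (the fraction of free space you are willing to give to WAL on a shared disk) are assumptions, and the 16 MB figure is the default segment size, which your installation may have changed.

```python
import shutil

WAL_SEGMENT_BYTES = 16 * 1024 * 1024  # assumes the default 16 MB segment size


def suggested_wal_keep_segments(free_bytes: int, share: float = 1.0) -> int:
    """Return free space divided by the segment size, scaled by `share`.

    `share` is a hypothetical knob: use 1.0 for a disk dedicated to pg_wal
    and something lower when the disk is shared with other data.
    """
    if free_bytes < 0:
        raise ValueError("free_bytes cannot be negative")
    if not 0 < share <= 1:
        raise ValueError("share must be in the range (0, 1]")
    return int(free_bytes * share) // WAL_SEGMENT_BYTES


def free_bytes_on(path: str) -> int:
    """Free bytes on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free


if __name__ == "__main__":
    # "." is a placeholder; point this at your actual pg_wal directory.
    print(suggested_wal_keep_segments(free_bytes_on("."), share=0.5))
```

For example, with 100 GiB free and half of it reserved for WAL, the function suggests 3,200 segments.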
If you can't set wal_keep_segments high enough, there is an alternative. You can configure a third server or storage pool with increased disk storage capacity, which you can use as an archive. The master will need to have an archive_command that places files on the archive server, rather than the dummy command shown in the preceding procedure, in addition to the parameter settings that allow streaming to take place. The standby will need to retrieve files from the archive using restore_command, as well as streaming using primary_conninfo. Thus, both the master and standby have two modes for sending and receiving, and they can switch between them should failures occur. This is the typical configuration for large databases. Note that this means that the WAL data will be copied twice, once to the archive and once directly to the standby. Two copies are more expensive, but also more robust.