Bulk loading data into PGD clusters
Bulk loading data into PGD clusters
This guidance is specifically for environments where there's no direct access to the PGD nodes, only PGD Proxy endpoints, such as BigAnimal's distributed high availability deployments of PGD.
Without using care, bulk loading data into a PGD cluster can cause a lot of replication load on a cluster. With that in mind, this content describes a process to mitigate that replication load.
Provision or prepare a PGD cluster
You must provision a PGD cluster, either manually, using TPA, or on BigAnimal. This will be the target database for the migration. Ensure that you provision it with sufficient storage capacity to hold the migrated data.
We recommend that, when provisioning or, if needed, after provisioning, you set the following Postgres GUC variables.
GUC variable | Setting |
---|---|
maintenance_work_mem | 1GB |
wal_sender_timeout | 60min |
wal_receiver_timeout | 60min |
max_wal_size | Set to either: • A multiple (2 or 3) of your largest table or • More than one third of the capacity of your dedicated WAL disk (if configured) |
Make note of the target's proxy hostname (target-proxy) and port (target-port). You also need a user (target-user) and password (target-password) for the target cluster.
The following instructions give examples for a cluster named ab-cluster
with an ab-group
subgroup and three nodes: ab-node-1
, ab-node-2
, and ab-node3
. The cluster is accessed through a host named ab-proxy
(the target-proxy).
On BigAnimal, a cluster is configured, by default, with an edb_admin
user (the target-user) that can be used for the bulk upload.
The target-password for the target-user will be available from the BigAnimal dashboard for the cluster.
A database named bdrdb
(the target-dbname) will also have been created.
Identify your data source
You need the source hostname (source-host), port (source-port), database name (source-dbname), user , and password for your source database.
Also, you currently need a list of tables in the database that you want to migrate to the target database.
Prepare a bastion server
Create a virtual machine with your preferred operating system in the cloud to orchestrate your bulk loading.
- Use your EDB account.
- Obtain your EDB repository token from the EDB Repos 2.0 page.
- Set environment variables.
- Set the
EDB_SUBSCRIPTION_TOKEN
environment variable to the repository token.
- Set the
- Configure the repositories.
- Run the automated installer to install the repositories.
- Install the required software.
- Install and configure:
- psql
- PGD CLI
- Migration Toolkit
- LiveCompare
- Install and configure:
Use your EDB account
Go to the EDB Repos 2.0 page and log in with your EDB account. Make a note of the repository token that you will use to configure the repositories on the bastion server.
Set environment variables
Set the EDB_SUBSCRIPTION_TOKEN
environment variable to the repository token you obtained from the EDB Repos 2.0 page.
Configure repositories
The required software is available from the EDB repositories. You need to install the EDB repositories on your bastion server.
- Red Hat
- Ubuntu/Debian
Install the required software
Once the repositories are configured, you can install the required software.
Installing psql
and pg_dump
/pg_restore
/pg_dumpall
The psql command is the interactive terminal for working with PostgreSQL. It's a client application and can be installed on any operating system. Packaged with psql are pg_dump and pg_restore, command-line utilities for dumping and restoring PostgreSQL databases.
- Ubuntu
- Red Hat
To simplify logging in to the databases, create a .pgpass file for both your source and target servers:
Create the file in your home directory and change its permissions to read/write only for the owner. Ensure that your passwords are appropriately escaped in the .pgpass file. If an entry needs to contain : or \, escape this character with \.
Installing PGD CLI
PGD CLI is a command-line interface for managing and monitoring PGD clusters. It's a Go application and can be installed on any operating system.
- Ubuntu
- Red Hat
Create a configuration file for the PGD CLI:
For the example ab-cluster
:
Save it as pgd-cli-config.yml
.
See also Installing PGD CLI.
Installing Migration Toolkit
EDB's Migration Toolkit (MTK) is a command-line tool that can be used to migrate data from a source database to a target database. It's a Java application and requires a Java runtime environment to be installed.
- Ubuntu
- Red Hat
See also Installing Migration Toolkit
Installing LiveCompare
EDB LiveCompare is an application that can be used to compare two databases and generate a report of the differences. It will be used later on in this process to verify the data migration.
- Ubuntu
- Red Hat
See also LiveCompare requirements.
Set up and tune the target cluster
On the target cluster and within the regional group required, select one node to be the destination for the data.
If you have a group ab-group
with ab-node-1
, ab-node-2
, and ab-node-3
, you can select ab-node-1
as the destination node.
Set up a fence
Fence off all other nodes except for the destination node.
Connect to any node on the destination group using the psql command.
Use bdr.alter_node_option
and turn the route_fence
option to true
for each node in the group apart from the destination node:
The next time you connect with psql, you're directed to the write leader, which should be the destination node. To ensure that it is, you need to send two more commands.
Make the destination node both write and raft leader
To minimize the possibility of disconnections, move the raft and write leader roles to the destination node.
Make the destination node the raft leader using bdr.raft_leadership_transfer
. You need to specify the node and the group name that the node is a member of.:
Because you fenced off the other nodes in the group, this command triggers a write leader election in the ab-group
that elects the ab-node-1
as write leader.
Record then clear default commit scopes
You need to make a record of the default commit scopes in the cluster. The next step overwrites the settings. (At the end of this process, you need to restore them.) Run:
This command produces an output similar to::
Record these values. You can now overwrite the settings:
Prepare to monitor the data migration
Check that the target cluster is healthy.
- To check the overall health of the cluster, run
pgd -f pgd-cli-config.yml check-health
:
(When the cluster is healthy, all checks pass.)
- To verify the configuration of the cluster, run
pgd -f pgd-cli-config.yml verify-cluster
:
(When the cluster is verified, all checks.)
- To check the status of the nodes, run
pgd -f pgd-cli-config.yml show-nodes
:
To confirm the raft leader, run
pgd -f pgd-cli-config.yml show-raft
.To confirm the replication slots, run
pgd -f pgd-cli-config.yml show-replslots
.To confirm the subscriptions, run
pgd -f pgd-cli-config.yml show-subscriptions
.To confirm the groups, run
pgd -f pgd-cli-config.yml show-groups
.
These commands provide a snapshot of the state of the cluster before the migration begins.
Migrating the data
Currently, you must migrate the data in four phases:
- Transferring the “pre-data” using pg_dump and pg_restore, which exports and imports all the data definitions.
- Transfer the role definitions using pg_dumpall and psql.
- Using MTK with the
--dataonly
option to transfer only the data from each table, repeating as necessary for each table. - Transferring the “post-data” using pg_dump and pg_restore, which completes the data transfer.
Transferring the pre-data
Use the pg_dump
utility against the source database to dump the pre-data section in directory format:
Consult predatadump.log
to ensure that the dump was successful. If it fails, you can repeat the dump after resolving the issue.
Once the pre-data is dumped into the predata directory, you can load it into the target cluster using pg_restore
:
The options=
section in the connection string to the server is important. The options disable DDL locking and set the commit scope to local
, overriding any default commit scopes. Using --section=pre-data
limits the restore to the configuration that precedes the data in the dump.
Consult predatarestore.log
to ensure that the restore was successful. If it fails, you can repeat the restore after resolving the issue.
Transferring role definitions
Use the pg_dumpall
utility to dump the role definitions from the source database:
Consult rolesdump.log
to ensure that the dump was successful. If it fails, you can repeat the dump after resolving the issue.
Then load the role definitions into the target cluster:
Consult rolesrestore.log
to ensure that the restore was successful. If it fails, you can repeat the restore after resolving the issue.
Transferring the data
In this step, Migration Toolkit is used to transfer the table data between the source and target.
Edit /usr/edb/migrationtoolkit/etc/toolkit.properties
. You need to use sudo to raise your privilege to do this, that is, sudo vi /usr/edb/migrationtoolkit/etc/toolkit.properties
.
Edit the relevant values in the settings.
Ensure that the configuration file is owned by the user you intend to run the data transfer as and read-write only for its owner.
Now, select sets of tables in the source database that must be transferred together, ideally grouping them for redundancy in case of failure:
This command uses the -truncLoad
option and drops indexes and constraints before the data is loaded. It then recreates them after the loading has completed.
You can run multiple instances of this command in parallel. To do so, add an &
to the end of the command. Ensure that you write the output from each to different files (for example, mtk_1.log
, mtk_2.log
).
For example:
This sets up four processes, each transferring a particular table or sets of tables as a background process.
While this is running, monitor the lag. Log into the destination node with psql, and monitor lag with:
Once the lag is consumed, return to the shell. You can now use tail
to monitor the progress of the data transfer by following the log files of each process:
You can also consult the error logs (mtkerr_1.log
, mtkerr_2.log
, mtkerr_3.log
, mtkerr_4.log
) to troubleshoot any issues that arise.
Transferring the post-data
Make sure there's no replication lag across the entire cluster before proceeding with post-data.
Now dump the post-data section of the source database:
Then load the post-data section into the target database:
If this step fails due to a disconnection, return to monitoring lag (as described previously). Then, when no synchronization lag is present, repeat the restore.
Resume the cluster
Remove the routing fences you set up earlier on the other nodes
Connect directly to the destination node using psql. Use bdr.alter_node_option
and turn off the route_fence
option for each node in the group except for the destination node, which is already off:
Proxies can now route to all the nodes in the group.
Reset commit scopes
You can now restore the default commit scopes to the cluster to allow PGD to manage the replication load. Set default_commit_scope
for the groups to the value for the groups that you recorded in an earlier step.
The cluster is now loaded and ready for production. For more assurance, you can run the pgd -f pgd-cli-config.yml check-health
command to check the overall health of the cluster and the other PGD commands from when you checked the cluster earlier.
Verify the data migration
Use LiveCompare to compare the source and target databases. Create a configuration file for LiveCompare:
This configuration file should be saved as migrationcheck.ini
. The [First Connection]
and [Second Connection]
sections should be updated with the appropriate values, with the [First Connection]
section pointing to the source database and the [Second Connection]
section pointing to the target database. The [Output Connection]
section defines a database where a livecompare
schema will be created to store the comparison results.
Run LiveCompare using the configuration file you created:
LiveCompare will compare the source and target databases and generate a report of the differences. Review the report to ensure that the data migration was successful.
Refer to the LiveCompare documentation for more information on using LiveCompare.