It seems like every web project eventually needs a way to clone production data for local development. In this post, I’ll walk through my latest shell scripts for backing up a remote Postgres database and loading it locally.
I’ve written various scripts like this for various tech stacks over the years. This time around, it’s a Postgres database in AWS [RDS](https://aws.amazon.com/rds/postgresql/). The database is not publicly accessible, but we have an EC2 instance that individual developers (with whitelisted IPs) can [SSH tunnel](https://tableplus.com/blog/2019/08/ssh-tunnel-secure-database-connection.html) through to access it.
## Backup Script
The first script backs up a remote database, leaving an SQL dump on disk. Because this script is checked into private source control and only intended for developers on this project, I don’t mind hardcoding project-specific strings. I don’t want to include sensitive secrets, though, so I left those out and read them through environment variables:
```bash
#!/usr/bin/env bash

if [[ -z "${DB_PASSWORD}" ]]; then
  echo "Please set a DB_PASSWORD environment variable in order to connect to the RDS database."
  echo "You may be able to retrieve it with:"
  echo "  aws secretsmanager get-secret-value --secret-id arn:aws:secretsmanager:xx:xx:xx:xx"
  exit 1
fi

if [[ -z "${DB_SSH_KEY}" ]]; then
  echo "Please set a DB_SSH_KEY environment variable to the path to the db-ssh-key.pem file. (You can find the file in the team password manager.)"
  exit 1
fi
```
Note the instructions to future developers about where to find the secrets. Please do this!
Next, we set up a few variables. The exported ones are for [Postgres](https://www.postgresql.org/docs/12/libpq-envars.html):
```bash
export PGPASSWORD=$DB_PASSWORD
export PGUSER=pguser
export PGDATABASE=project-db

db_host=project-db-dev.xx.xx.rds.amazonaws.com
ssh_host=xx.xx.compute.amazonaws.com
ssh_user=ec2-user
db_port=5432

date=$(date '+%Y-%m-%d.%H-%M-%S')
filename=$date.dev.dump.sql
```
Now comes an SSH tunnel, created using a cool trick I found in a gist from GitHub user scy. It opens a tunnel and waits (ten seconds) for something to connect to it:
```bash
# Forward local port 5433 through the EC2 host to the RDS instance.
# `-o ExitOnForwardFailure=yes` makes ssh fail fast if the forward can't be set up.
# `-f ... sleep 10` backgrounds the tunnel; it stays open for ten seconds waiting
# for a connection, and won't close while a forwarded connection is in use.
ssh -f -o ExitOnForwardFailure=yes -L localhost:5433:"$db_host":$db_port $ssh_user@"$ssh_host" -i "$DB_SSH_KEY" sleep 10
```
Check out [explainshell](https://explainshell.com/explain?cmd=ssh+-f+-o+ExitOnForwardFailure%3Dyes+-L+localhost%3A5433%3A%22%24PGHOST%22%3A%24db_port+%24ssh_user%40%22%24ssh_host%22+-i+%22%24DB_SSH_KEY%22+sleep+10) for more details on those arguments.
Finally, we connect through the tunnel and dump the database:
```bash
pg_dump --no-privileges -h localhost -p 5433 --file "$filename"

# The `--no-privileges` option is important to avoid an error upon restore like:
#   psql:2020-10-27.22-50-38.dev.dump.sql:575: ERROR: role "rdsadmin" does not exist
#
# ( https://www.postgresql.org/docs/12/app-pgdump.html#PG-DUMP-OPTIONS )
```
And that’s it! The script can be run like this:
```
> set -n |grep DB_
DB_PASSWORD
DB_SSH_KEY
> ./backup-remote-db.sh
> ls *.sql
2020-10-27.23-00-08.dev.dump.sql
```
## Restoration Script
The next script will take the SQL dump produced above and load it into a database running locally. This time, there will be no secrets to worry about. We’ll start by receiving an input parameter and setting a few uninteresting variables:
```bash
#!/usr/bin/env bash

if [ "$#" -ne 1 ]; then
  echo "Usage:"
  echo "  ./load-dump.sh backup.sql"
  exit 1
fi

DUMP_FILE=$1

export PGHOST=localhost
export PGPORT=5432
export PGUSER=root
export PGPASSWORD=password
DATABASE_NAME=my-project-dev
```
Next, we terminate any existing database connections:
psql -c "SELECT pg_terminate_backend(pg_stat_activity.pid) FROM pg_stat_activity WHERE pg_stat_activity.datname = '${DATABASE_NAME}' AND pid <> pg_backend_pid();"
# This avoids errors like:
# ERROR: database "my-project-dev" is being accessed by other users
# DETAIL: There is 1 other session using the database.`)
Then we drop and recreate the database:

```bash
psql -c "DROP DATABASE IF EXISTS \"${DATABASE_NAME}\""

# `\gexec` runs each row of the query result as its own SQL statement, so
# CREATE DATABASE only executes if the database doesn't already exist.
echo "SELECT 'CREATE DATABASE \"${DATABASE_NAME}\"' WHERE NOT EXISTS (SELECT FROM pg_database WHERE datname = '${DATABASE_NAME}')\gexec" | psql
```
psql -f "$DUMP_FILE" "$DATABASE_NAME"
And finally, we load the dump:
psql -f "$DUMP_FILE" "$DATABASE_NAME"
## Wrapping Up
That’s it! Complete versions of both scripts can be found [in this gist](https://gist.github.com/jrr/cc6feb5ed9f62aeffbcac3795edb0dd1).
These scripts are simplified a bit. Other things you might want from this kind of automation include:
- Sanitization of user data
- Convenient access to multiple named remote databases (e.g., dev/staging/prod; see the sketches below)
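To give a flavor of both: environment selection could be as simple as a `case` statement at the top of the backup script. This is a hypothetical sketch; none of these hostnames come from the real project:

```bash
# Hypothetical: choose connection details by environment name (default: dev).
# All hostnames here are placeholders.
env_name=${1:-dev}
case "$env_name" in
  dev)     db_host=project-db-dev.xx.xx.rds.amazonaws.com ;;
  staging) db_host=project-db-staging.xx.xx.rds.amazonaws.com ;;
  prod)    db_host=project-db-prod.xx.xx.rds.amazonaws.com ;;
  *)       echo "Unknown environment: $env_name" >&2; exit 1 ;;
esac

filename=$(date '+%Y-%m-%d.%H-%M-%S').$env_name.dump.sql
```

And a basic sanitization pass could run a few `UPDATE` statements after the restore (the `users` table and its columns here are invented for illustration):

```bash
psql "$DATABASE_NAME" -c "UPDATE users SET email = 'user' || id || '@example.com';"
```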
I’ve found local copies of prod data to be extremely useful when analyzing reported bugs, and you’ll be hard-pressed to synthesize a more representative set of data for general testing.
Hopefully, this post will save you a little effort on future projects. Read on for a couple of related thoughts.
### Appendix 1: Where to keep those secrets?
A common annoyance in web projects is that you often need to have a few secrets available locally, but you don’t want to check them into git (and I don’t want to leave them in unversioned files in the repo because I `git clean -fdx .` all the time).
I’ve taken up a pattern of solving this with [direnv](https://direnv.net/). It goes like this:
- Wrap an extra directory around your git repo (e.g., `~/repos/project-foo/repo-goes-here`).
- Put your secrets in an `.envrc` in the outer directory (e.g., `project-foo/.envrc`):

  ```bash
  export DB_PASSWORD=foobar
  ```

- In your git repo’s `.envrc`, conditionally include it like this:

  ```bash
  if [ -f ../.envrc ]; then
    source_up
  fi
  ```
This also works for persisting other not-even-a-secret variables without polluting your team’s shared files.
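For example, a harmless per-developer preference that would just be noise in shared dotfiles (the variable name here is made up):

```bash
# In the outer .envrc, alongside the secrets:
export DEV_SERVER_PORT=3001
```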
For more on direnv, see Joe’s post on Project-Specific Command Line Shortcuts.
### Appendix 2: Dockerized Database
Just in case you aren’t already running your database in Docker, _you totally should_. Check out this nine-line file that pins a specific Postgres version, can be rapidly started/stopped, and doesn’t conflict with other versions installed on your system for other projects:
version: "2"
services:
db:
image: postgres:12.3
ports:
- "5432:5432"
environment:
- POSTGRES_PASSWORD=password
- POSTGRES_USER=root
Put that in a `docker-compose.yml`, run `docker-compose up`, and you’re off to the races.
You can also do this with [MySQL](https://hub.docker.com/_/mysql) and now even [MS-SQL](https://hub.docker.com/_/microsoft-mssql-server).
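As a taste, here’s a one-off MySQL container using the official image’s documented `MYSQL_ROOT_PASSWORD` variable (a quick sketch, not part of the original scripts):

```bash
# Disposable MySQL 8 container; removed when stopped, data not persisted.
docker run --rm -p 3306:3306 -e MYSQL_ROOT_PASSWORD=password mysql:8.0
```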