Shell Scripts to Clone Prod Data Locally

It seems like every web project inevitably has a development need to clone production data locally. In this post, I’ll walk through my latest shell scripts for backing up a remote Postgres database and loading it locally.

I’ve written scripts like this for various tech stacks over the years. This time around, it’s a Postgres database in AWS RDS. The database is not publicly accessible, but we have an EC2 instance that individual developers (with whitelisted IPs) can SSH tunnel through to access it.

Backup Script

The first script backs up a remote database, leaving an SQL dump on disk. Because this script is checked into private source control and only intended for developers on this project, I don’t mind hardcoding project-specific strings. I don’t want to include sensitive secrets, though, so I left those out and read them through environment variables:


#! /usr/bin/env bash

if [[ -z "${DB_PASSWORD}" ]]; then
    echo "Please set a DB_PASSWORD environment variable in order to connect to the RDS Database."
    echo "You may be able to retrieve it with:"
    echo "  aws secretsmanager get-secret-value --secret-id arn:aws:secretsmanager:xx:xx:xx:xx"
    exit 1
fi

if [[ -z "${DB_SSH_KEY}" ]]; then
    echo "Please set an DB_SSH_KEY environment variable to the path to the db-ssh-key.pem file. (you can find the file in the team password manager)"
    exit 1
fi

Note the instructions to future developers about where to find the secrets. Please do this!

Next, we set up a few variables. The exported ones are for Postgres:


export PGPASSWORD=$DB_PASSWORD
export PGUSER=pguser
export PGDATABASE=project-db

db_host=project-db-dev.xx.xx.rds.amazonaws.com
ssh_host=xx.xx.compute.amazonaws.com
ssh_user=ec2-user
db_port=5432
date=$(date '+%Y-%m-%d.%H-%M-%S')
filename=$date.dev.dump.sql

Now comes an SSH tunnel, created using a cool trick I found on this gist from GitHub user scy. It opens a tunnel and waits (ten seconds) for something to connect to it:


ssh -f -o ExitOnForwardFailure=yes -L localhost:5433:"$db_host":$db_port $ssh_user@"$ssh_host" -i "$DB_SSH_KEY" sleep 10

Check out explainshell for more details on those arguments.
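
In rough terms, here is what each piece of that command does (my own annotations; see explainshell or the ssh man page for the authoritative details):


# -f                               go to the background once the connection is established
# -o ExitOnForwardFailure=yes      exit with an error instead of carrying on if the port forward can't be set up
# -L localhost:5433:"$db_host":$db_port   forward local port 5433 through the EC2 host to the database's port
# -i "$DB_SSH_KEY"                 authenticate with the .pem key
# sleep 10                         run for ten seconds and then exit; ssh stays alive past that
#                                  as long as a forwarded connection is still using the tunnel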

Finally, we connect through the tunnel and dump the database:


pg_dump --no-privileges -h localhost -p 5433 --file "$filename"

# The `--no-privileges` option is important to avoid an error upon restore like:
# psql:2020-10-27.22-50-38.dev.dump.sql:575: ERROR: role "rdsadmin" does not exist
#
# ( https://www.postgresql.org/docs/12/app-pgdump.html#PG-DUMP-OPTIONS )

And that’s it! The script can be run like this:

> set -n |grep DB_
DB_PASSWORD
DB_SSH_KEY
> ./backup-remote-db.sh
> ls *.sql
2020-10-27.23-00-08.dev.dump.sql
>

Restoration Script

The next script will take the SQL dump produced above and load it into a database running locally. This time, there will be no secrets to worry about. We’ll start by receiving an input parameter and setting a few uninteresting variables:


#! /usr/bin/env bash

if [ "$#" -ne 1 ]; then
    echo "Usage:"
    echo "./load-dump.sh backup.sql"
    exit 1
fi

DUMP_FILE=$1

export PGHOST=localhost
export PGPORT=5432
export PGUSER=root
export PGPASSWORD=password
DATABASE_NAME=my-project-dev

Next, we terminate any existing database connections:


psql -c "SELECT pg_terminate_backend(pg_stat_activity.pid) FROM pg_stat_activity WHERE pg_stat_activity.datname = '${DATABASE_NAME}' AND pid <> pg_backend_pid();"
# This avoids errors like:
# ERROR:  database "my-project-dev" is being accessed by other users
# DETAIL:  There is 1 other session using the database.

Then we drop and recreate the database. The \gexec at the end of the second command tells psql to execute each value returned by the SELECT as a new statement, so CREATE DATABASE only runs when the database doesn’t already exist:


psql -c "DROP DATABASE IF EXISTS \"${DATABASE_NAME}\""

echo "SELECT 'CREATE DATABASE \"${DATABASE_NAME}\"' WHERE NOT EXISTS (SELECT FROM pg_database WHERE datname = '${DATABASE_NAME}')\gexec" | psql

And finally, we load the dump:


psql -f "$DUMP_FILE" "$DATABASE_NAME"
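
As a quick sanity check that the restore worked, you could list the tables in the freshly loaded database (this isn’t part of the script, just an optional follow-up):


psql -d "$DATABASE_NAME" -c '\dt'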

Wrapping Up

That’s it! Complete versions of both scripts can be found in this gist.

These scripts are simplified a bit. Other things you might want from this kind of automation include:

  • Sanitization of user data (a rough sketch follows this list)
  • Convenient access to multiple named remote databases (e.g., dev/staging/prod)
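
As a rough sketch of that sanitization idea: you could run a scrubbing statement against the local copy right after the restore. The users table and email column here are made up for illustration; substitute whatever personal data lives in your schema:


psql -d "$DATABASE_NAME" -c "UPDATE users SET email = 'user-' || id::text || '@example.com';"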

I’ve found local copies of prod data to be extremely useful when analyzing reported bugs, and you’ll be hard-pressed to synthesize a more representative set of data for general testing.

Hopefully, this post will save you a little effort on future projects. Read on for a couple of related thoughts.

Appendix 1: Where to keep those secrets?

A common annoyance in web projects is that you often need a few secrets available locally, but you don’t want to check them into git (and I don’t want to leave them in unversioned files in the repo, because I git clean -fdx . all the time).

I’ve taken up a pattern of solving this with direnv. It goes like this:

  1. Wrap an extra directory around your git repo (e.g., ~/repos/project-foo/repo-goes-here).
  2. Put your secrets in an .envrc in the outer directory (e.g., project-foo/.envrc):
    
    export DB_PASSWORD=foobar
    
  3. In your git repo’s .envrc, conditionally include it like this:
    
    if [ -f ../.envrc ]; then
        source_up
    fi
    

This also works for persisting other not-even-a-secret variables without polluting your team’s shared files.
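
One practical note: direnv won’t load an .envrc until you’ve approved it, so after creating or editing these files, expect to run this from the directory containing the file:


direnv allow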

For more on direnv, see Joe’s post on Project-Specific Command Line Shortcuts.

Appendix 2: Dockerized Database

Just in case you aren’t already running your database in Docker, you totally should. Check out this nine-line file that pins a specific Postgres version, can be rapidly started/stopped, and doesn’t conflict with other versions installed on your system for other projects:


version: "2"
services:
  db:
    image: postgres:12.3
    ports:
      - "5432:5432"
    environment:
      - POSTGRES_PASSWORD=password
      - POSTGRES_USER=root

Put that in a docker-compose.yml, run docker-compose up, and you’re off to the races.
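
If you’d rather not keep a terminal tied up, the usual detached docker-compose commands apply here too (nothing specific to this setup):


docker-compose up -d    # start the database in the background
docker-compose stop     # stop it; docker-compose start brings it back
docker-compose down     # remove the container; the next up starts with an empty database,
                        # since this file doesn't declare a persistent volume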

You can also do this with MySQL, and these days even with MS SQL Server.