Shell Scripts to Clone Prod Data Locally

It seems like every web project inevitably has a development need to clone production data locally. In this post, I’ll walk through my latest shell scripts for backing up a remote Postgres database and loading it locally.

I’ve written scripts like this for various tech stacks over the years. This time around, it’s a Postgres database in AWS RDS. The database is not publicly accessible, but we have an EC2 instance that individual developers (with whitelisted IPs) can SSH tunnel through to access it.

## Backup Script

The first script backs up a remote database, leaving an SQL dump on disk. Because this script is checked into private source control and only intended for developers on this project, I don’t mind hardcoding project-specific strings. I don’t want to include sensitive secrets, though, so I left those out and read them through environment variables:

```shell
#! /usr/bin/env bash

if [[ -z "${DB_PASSWORD}" ]]; then
    echo "Please set a DB_PASSWORD environment variable in order to connect to the RDS Database."
    echo "You may be able to retrieve it with:"
    echo "  aws secretsmanager get-secret-value --secret-id arn:aws:secretsmanager:xx:xx:xx:xx"
    exit 1
fi

if [[ -z "${DB_SSH_KEY}" ]]; then
    echo "Please set a DB_SSH_KEY environment variable to the path to the db-ssh-key.pem file. (You can find the file in the team password manager.)"
    exit 1
fi
```
Note the instructions to future developers about where to find the secrets. Please do this!

Next, we set up a few variables. The exported ones are environment variables that the Postgres client tools (like `psql` and `pg_dump`) read automatically:

```shell
export PGUSER=pguser
export PGDATABASE=project-db
date=$(date '+%Y-%m-%d.%H-%M-%S')
```

Now comes an SSH tunnel, created using a cool trick I found in a gist from GitHub user scy. It opens a tunnel and waits (ten seconds) for something to connect to it:

```shell
ssh -f -o ExitOnForwardFailure=yes -L localhost:5433:"$PGHOST":$db_port $ssh_user@"$ssh_host" -i "$DB_SSH_KEY" sleep 10
```

Check out explainshell for more details on those arguments.
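While the tunnel is open, you can sanity-check it before dumping (a quick check, assuming the Postgres client tools are installed locally):

```shell
# Prints "localhost:5433 - accepting connections" when the tunnel
# and the remote database are both reachable.
pg_isready -h localhost -p 5433
```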

Finally, we connect through the tunnel and dump the database:

```shell
# Name the dump file after the current timestamp (assumed naming;
# built from the $date variable set earlier).
filename="backup-${date}.sql"

pg_dump --no-privileges -h localhost -p 5433 --file "$filename"

# The --no-privileges option is important to avoid an error upon restore like:
#   ERROR: role "rdsadmin" does not exist
```

And that’s it! The script can be run like this:

```shell
> set | grep DB_
> ./
> ls *.sql
```

## Restoration Script

The next script will take the SQL dump produced above and load it into a database running locally. This time, there will be no secrets to worry about. We’ll start by receiving an input parameter and setting a few uninteresting variables:

```shell
#! /usr/bin/env bash

if [ "$#" -ne 1 ]; then
    echo "Usage:"
    echo "./ backup.sql"
    exit 1
fi

export PGHOST=localhost
export PGPORT=5432
export PGUSER=root
export PGPASSWORD=password
DATABASE_NAME=my-project-dev
```

Next, we terminate any existing database connections:

```shell
psql --dbname postgres -c "SELECT pg_terminate_backend(pg_stat_activity.pid) FROM pg_stat_activity WHERE pg_stat_activity.datname = '${DATABASE_NAME}' AND pid <> pg_backend_pid();"
# This avoids errors like:
#   ERROR:  database "my-project-dev" is being accessed by other users
#   DETAIL:  There is 1 other session using the database.
```

Then we drop and recreate the database:
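A minimal sketch of that step, assuming the standard `dropdb`/`createdb` client tools and the `DATABASE_NAME` variable used in the terminate step above:

```shell
# Drop the local copy if present, then recreate it empty.
# --if-exists keeps this from failing the first time through.
dropdb --if-exists "$DATABASE_NAME"
createdb "$DATABASE_NAME"
```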




And finally, we load the dump:
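A sketch of the load, assuming the dump path arrives as `$1` (the script's single argument):

```shell
# $1 is the dump file passed on the command line.
# ON_ERROR_STOP makes psql exit nonzero on the first error
# instead of plowing through the rest of the file.
psql --set ON_ERROR_STOP=on --dbname "$DATABASE_NAME" --file "$1"
```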


## Wrapping Up

That’s it! Complete versions of both scripts can be found in this gist.

These scripts are simplified a bit. Other things you might want from this kind of automation include:

- Sanitization of user data
- Convenient access to multiple named remote databases (e.g., dev/staging/prod)
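For the first item, sanitization can be a quick SQL pass right after the restore. A sketch, assuming a hypothetical `users` table with `email` and `phone` columns:

```shell
# Hypothetical schema: replace real PII with deterministic fake
# values before anyone starts poking at the data.
psql --dbname my-project-dev <<'SQL'
UPDATE users
SET email = 'user-' || id || '@example.com',
    phone = NULL;
SQL
```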

I’ve found local copies of prod data to be extremely useful when analyzing reported bugs, and you’ll be hard-pressed to synthesize a more representative set of data for general testing.

Hopefully, this post will save you a little effort on future projects. Read on for a couple of related thoughts I have to share.

### Appendix 1: Where to keep those secrets?

A common annoyance in web projects is that you often need to have a few secrets available locally, but you don’t want to check them into git (and I don’t want to leave them in unversioned files in the repo because I `git clean -fdx .` all the time).

I’ve taken up a pattern of solving this with direnv. It goes like this:

  1. Wrap an extra directory around your git repo (e.g., ~/repos/project-foo/repo-goes-here).
  2. Put your secrets in a .envrc in the outer directory (e.g., project-foo/.envrc):
    export DB_PASSWORD=foobar
  3. In your git repo’s .envrc, conditionally include it like this:
    if [ -f ../.envrc ]; then
        source_env ../.envrc
    fi
This also works for persisting other not-even-a-secret variables without polluting your team’s shared files.

For more on direnv, see Joe’s post on Project-Specific Command Line Shortcuts.

### Appendix 2: Dockerized Database

Just in case you aren’t already running your database in Docker, _you totally should_. Check out this nine-line file that pins a specific Postgres version, can be rapidly started/stopped, and doesn’t conflict with other versions installed on your system for other projects:

```yaml
version: "2"
services:
  db:
    image: postgres:12.3
    ports:
      - "5432:5432"
    environment:
      - POSTGRES_PASSWORD=password
      - POSTGRES_USER=root
```

Put that in a `docker-compose.yml`, run `docker-compose up`, and you’re off to the races.
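A quick smoke test, assuming Docker is running and using the credentials from the compose file:

```shell
# Start the database container in the background.
docker-compose up -d

# Connect through the published port with the configured credentials.
PGPASSWORD=password psql -h localhost -p 5432 -U root -c 'SELECT 1;'
```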

You can also do this with MySQL, and these days even with MS-SQL.