In order to properly test new features and bug fixes, developers often maintain a recent copy of production-level data. This allows them to test accurately and validate the effect new changes will have on customers.
Database Dump & Restore Problem
Depending on the size of the database, this can be quite a tedious task! Above about 50 GB, I’ve seen dump and restore times range from 30 minutes to an hour, even when using the most effective command-line options.
This prompted me to run a series of experiments to determine which parameters have the biggest effect on the timing of a PostgreSQL database dump and restore.
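As a baseline, here is a minimal sketch of the kind of command I was timing (the database name `mydb` and the file paths are placeholders for your own values):

```bash
# Time a custom-format dump of a single database.
time pg_dump -Fc -f /tmp/mydb.dump mydb

# Time the corresponding restore into a scratch database
# (assumes mydb_test already exists).
time pg_restore -d mydb_test /tmp/mydb.dump
```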
Compression Level
PostgreSQL lets you set the compression level of a database dump (0-9). This primarily affects how much disk space the dump takes up temporarily before you restore it. In my tests, the compression level had little impact on overall speed, though running with zero compression usually came out a little ahead.
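As a sketch (again assuming a database named `mydb`), the level is set with `pg_dump`’s `-Z` flag on a custom-format dump:

```bash
# No compression: larger file on disk, but often slightly faster end to end.
pg_dump -Fc -Z 0 -f /tmp/mydb_z0.dump mydb

# Maximum compression: smallest file, more CPU time spent compressing.
pg_dump -Fc -Z 9 -f /tmp/mydb_z9.dump mydb
```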
Number of Jobs & PIGZ
In order to optimize timing, I tested several different numbers of parallel jobs on my CPU. The useful range of the `-j` parameter is ultimately bounded by the cores available on the local hardware. Since my 2020 MBP only has 4 cores, that was the highest I could go before CPU throttling kicked in and started negatively affecting the time.
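A minimal sketch of both approaches, with database name and paths as placeholders: parallel dumps with `-j` require the directory output format, while `pigz` (a parallel gzip) can spread compression of a plain-format dump across cores:

```bash
# Parallel dump: -j requires the directory output format (-Fd).
pg_dump -Fd -j 4 -f /tmp/mydb_dir mydb

# Parallel restore from the same directory-format dump.
pg_restore -j 4 -d mydb_test /tmp/mydb_dir

# Alternative: pipe a plain-format dump through pigz with 4 threads.
pg_dump -Fp mydb | pigz -p 4 > /tmp/mydb.sql.gz
```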
This led me to ask: where can I get better hardware for cheap? Enter: AWS EC2 On-Demand.
Solution
AWS offers many services at low cost. One of these is EC2: virtual machines with customizable hardware specs. On-demand instances are priced at an hourly rate, but infrequent usage is billed only for the minutes the instance actually runs.
I started to look at how many cores I could get at the cheapest rate. If I was only running these instances for less than 10 minutes at a time, the cost would be well worth the hour I would otherwise spend waiting on my local machine.
I chose the `m4.4xlarge` instance class, since it offered 8 CPU cores at $0.80 an hour of on-demand pricing. The next step up was $2.00 an hour, so this felt like a good balance. The full on-demand pricing chart is available on AWS’s EC2 pricing page.
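A minimal sketch of launching and later terminating such an instance with the AWS CLI (the AMI ID, key pair name, and instance ID below are placeholders you would substitute with your own):

```bash
# Launch a single on-demand m4.4xlarge instance.
aws ec2 run-instances \
  --image-id ami-xxxxxxxx \
  --instance-type m4.4xlarge \
  --key-name my-key \
  --count 1

# Terminate it as soon as the dump is done, since billing is time-based.
aws ec2 terminate-instances --instance-ids i-xxxxxxxxxxxxxxxxx
```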
My assumptions were correct: running the same database dump with double the cores reduced my time from 30-40 minutes to just 4-6, depending on parameters. I then restored the dump over a personal connection string. However, I’ve also found success in simply copying the dump files to my machine using `scp` and continuing the workflow from there.
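Both paths might look something like this sketch (the hostname, connection string, and paths are placeholders):

```bash
# Option 1: restore from the EC2 instance straight over a connection string.
pg_restore -j 8 -d "postgres://user:password@my-host:5432/mydb_test" /tmp/mydb_dir

# Option 2: copy the directory-format dump down first, then restore locally.
scp -r ec2-user@<ec2-host>:/tmp/mydb_dir ./mydb_dir
pg_restore -j 4 -d mydb_test ./mydb_dir
```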
Database Dump and Restore Using EC2 On-Demand
By using AWS to rent better hardware, I was able to drastically reduce the time it took to create a database dump. This ultimately made testing easier and saved quite a bit of time that would otherwise have been spent waiting.