We recently launched a new website, replacing the venerable old website of 9 years. So as not to completely lose the content of our old website, we decided to archive it to disk so that we would be able to resurrect it at a moment’s notice, both for historical purposes and to ensure that we would be able to retrieve any content or files we had not migrated to our new website.
Our old website was built on a custom CMS that had been written in-house. While it worked well enough for the live website, it did not offer any sort of static HTML export. Additionally, exporting the database and the CMS code seemed a poor means of archival: the CMS was built on Ruby 1.8.6, and getting the entire stack running again in the future would be quite difficult.
After trying a few different utilities, including curl, I settled on HTTrack, a website copier.
HTTrack will download all accessible pages from a specified domain to a local directory, recursively copying all directories, images, and files. More importantly, it will also rewrite URLs in the downloaded HTML to use a relative link structure within the local directory. This allows all links to local pages and resources to work, even without a web server. That sets HTTrack apart from several other utilities that will readily mirror a site but will not rewrite URLs to be usable without a web server and an appropriate domain configuration.
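To make the idea of relative link rewriting concrete, here is a small sketch. It is not HTTrack's actual implementation; it only computes the relative path one mirrored file would need in order to reach another. The paths under `/mirror` are hypothetical, and the function assumes GNU coreutils `realpath` (for `-m` and `--relative-to`), which may not be available on every system.

```shell
#!/usr/bin/env bash
# Illustration only: work out the relative link a page needs to reach
# another file in the same local mirror. All paths are hypothetical.
# Assumes GNU coreutils realpath.

relative_link() {
  local page="$1"    # file that contains the link
  local target="$2"  # file the link points to
  # -m: the paths do not have to exist on disk.
  # --relative-to: express the target relative to the page's directory.
  realpath -m --relative-to="$(dirname "$page")" "$target"
}

relative_link /mirror/services/index.html /mirror/images/logo.png
# prints ../images/logo.png
```

Because the result depends only on where the two files sit relative to each other, the mirrored directory can be moved or opened directly from disk and the link still resolves.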
HTTrack has a GUI available, but kicking off the process to archive a site from the command line is quite easy. Here, I’ve raised the number of allowed connections and the maximum transfer rate so that our website downloads quickly:
httrack http://atomicobject.com --sockets 16 --max-rate=1024000
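The command above can also be wrapped in a small function so the run is repeatable. This is a sketch under a few assumptions: the function name `mirror_site`, the `DRY_RUN` variable, and the output directory `./old-website-archive` are all made up for illustration. It adds HTTrack's `-O` option to choose where the mirror is written, and writes the two tuning options in the `--option=N` form listed in HTTrack's option reference.

```shell
#!/usr/bin/env bash
# Sketch: wrap the mirror command in a function.
# mirror_site, DRY_RUN, and the output directory are hypothetical names.

mirror_site() {
  local url="$1"      # site to copy
  local out_dir="$2"  # where the mirror, logs, and cache are written
  # If DRY_RUN is set to any non-empty value, the command is printed
  # instead of executed.
  # --sockets=16        number of simultaneous connections
  # --max-rate=1024000  maximum transfer rate, in bytes per second
  ${DRY_RUN:+echo} httrack "$url" -O "$out_dir" --sockets=16 --max-rate=1024000
}

# Review the command without touching the network:
DRY_RUN=1 mirror_site http://atomicobject.com ./old-website-archive
```

Calling `mirror_site` without `DRY_RUN` set runs the actual download, which requires `httrack` to be installed.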
For our old website, this provided a directory structure corresponding to the apparent page hierarchy of our website:
To load up a full copy of the downloaded website, all that is necessary is to open a web browser and point it at
All of these files can now be zipped up and stored on a local file server, or committed to source control, for a handy copy of the old website. We don’t need to rely on Google’s cache or the Wayback Machine to try to retrieve old copy or files.
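As one way to bundle the files, the mirrored directory can be packed into a single compressed archive. The sketch below uses `tar` with gzip rather than a zip file, and the function name `archive_site` and the paths in the usage note are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: pack a mirrored site directory into one .tar.gz file.
# archive_site is a hypothetical helper name.

archive_site() {
  local site_dir="$1"  # directory produced by the mirror run
  local archive="$2"   # destination .tar.gz file
  # -C switches to the parent directory first, so the archive stores
  # paths relative to it (site/index.html rather than an absolute path).
  tar -czf "$archive" -C "$(dirname "$site_dir")" "$(basename "$site_dir")"
}
```

For example, `archive_site ./old-website-archive ./old-website-archive.tar.gz` would produce the archive, and `tar -tzf ./old-website-archive.tar.gz` lists its contents as a quick check before the file is copied to a file server.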