How to Archive a Website with HTTrack

We recently launched a new website, replacing the venerable old one that had served us for nine years. Rather than lose its content entirely, we decided to archive it to disk so that we could resurrect it at a moment’s notice, both for historical purposes and to retrieve any content or files we had not migrated to the new website.

Our old website was built on a custom CMS that had been written in-house. While it worked well enough for the live website, it did not offer any sort of static HTML export. Additionally, exporting the database and the CMS code seemed a poor means of archival: the CMS was built on Ruby 1.8.6, and getting the entire stack running again in the future would be quite difficult.

After trying a few different utilities, including wget and curl, I settled on HTTrack, a website copier.

HTTrack will download all accessible pages from a specified domain to a local directory, recursively copying all directories, images, and files. More importantly, it will also rewrite URLs in the downloaded HTML to use a relative link structure within the local directory, so links to local pages and resources work even without a web server. This sets HTTrack apart from several other utilities that will readily mirror a site but leave its URLs unusable unless the copy is hosted on a web server with an appropriate domain configuration.
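
Once a mirror exists, one way to check that the rewriting did its job is to search the downloaded HTML for links that still point at the live domain. The following is only a minimal sketch: find_absolute_links is a hypothetical helper (not part of HTTrack), and it assumes links appear as quoted href or src attributes. Any hits are likely pages or files that were not captured, since links to anything outside the mirror are expected to stay absolute.

```shell
# Hypothetical helper, not part of HTTrack: list mirrored HTML files that
# still contain href/src attributes pointing at the live domain.
find_absolute_links() {
    mirror_dir=$1   # directory holding the mirror, e.g. ./atomicobject.com
    domain=$2       # live domain to look for, e.g. atomicobject.com
    # grep -l prints only the names of matching files; -E enables (a|b) syntax.
    # Note: dots in the domain are treated as regex wildcards, which is
    # slightly looser than an exact match but fine for a quick check.
    find "$mirror_dir" -name '*.html' \
        -exec grep -lE "(href|src)=[\"']https?://${domain}" {} +
}
```

Calling find_absolute_links ./atomicobject.com atomicobject.com prints one file name per line; no output means no leftover absolute links were found.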

HTTrack has a GUI, but kicking off an archive from the command line is quite easy. Here, I’ve raised the number of simultaneous connections and the maximum transfer rate (given in bytes per second) so that our website downloads quickly:

httrack https://atomicobject.com --sockets 16 --max-rate=1024000
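
By default the mirror lands in the current directory. If you would rather keep it somewhere specific, HTTrack’s -O option sets the output path. Below is a small sketch wrapping the command above; archive_site and the example directory name are hypothetical, and the connection and rate values are simply the ones used in this post.

```shell
# Hypothetical wrapper around the command shown above.
# -O sets the directory where HTTrack writes the mirror, its cache, and logs.
archive_site() {
    url=$1       # site to mirror
    out_dir=$2   # destination directory for the mirror
    httrack "$url" -O "$out_dir" --sockets=16 --max-rate=1024000
}

# Example (directory name is an assumption):
#   archive_site https://atomicobject.com ./atomicobject-archive
```

Keeping the output directory explicit makes it easier to re-run the archive later without scattering files across whatever directory you happen to be in.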

For our old website, this provided a directory structure corresponding to the apparent page hierarchy of our website:

├── atomicobject.com
│   ├── alive
│   ├── files
│   ├── images
│   ├── index.html
│   ├── javascripts
│   ├── new
│   ├── news
│   ├── pages
│   ├── stylesheets
│   └── talks
├── backblue.gif
├── fade.gif
└── index.html

To load up a full copy of the downloaded website, all that is necessary is to open a web browser and point it at the top-level index.html.

All of these files can now be zipped up and stored on a local file server, or committed to source control, for a handy copy of the old website. We don’t need to rely on Google’s cache or the Wayback Machine to retrieve old copy or files.
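
As a rough sketch of the zipping-up step, the helper below packs a mirror directory into a compressed tarball. archive_mirror and the file names in the example are hypothetical; a zip tool or a plain commit to source control would work just as well.

```shell
# Hypothetical helper: pack a mirror directory into a gzip-compressed tarball.
archive_mirror() {
    mirror_dir=$1   # directory produced by HTTrack
    archive=$2      # tarball to create, e.g. old-website.tar.gz
    # -C switches to the parent directory first so the archive stores
    # relative paths (mirror/index.html rather than /full/path/mirror/...).
    tar -czf "$archive" -C "$(dirname "$mirror_dir")" "$(basename "$mirror_dir")"
}

# Example (paths are assumptions):
#   archive_mirror ./atomicobject-archive "$PWD/old-website.tar.gz"
```

Running tar -tzf on the resulting file lists its contents, which is a quick way to confirm that index.html and the site directories made it in.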
