I’ve recently been experimenting with HTTrack, an open-source utility that makes it possible to download a full copy of any website. HTTrack is essentially a web crawler, allowing users to retrieve every page of a website merely by pointing the tool to the site’s homepage.
From the HTTrack homepage:
“[HTTrack] allows you to download a World Wide Web site from the Internet to a local directory, building recursively all directories, getting HTML, images, and other files from the server to your computer. HTTrack arranges the original site’s relative link-structure.”
I thought I’d share my experience with it.
There are a couple of different ways to install HTTrack:
- HTTrack Website: Download and install HTTrack manually. The download contains a README with detailed directions.
- Homebrew: Users of Homebrew can easily install HTTrack with the formula `brew install httrack`.
The syntax of HTTrack is quite simple. You specify the URLs you wish to start the process from, any options you might want to add ([-option]), any filters specifying places the crawler should ([+]) and should not ([-]) go, and end the command line by pressing Enter. HTTrack then goes off and does your bidding.
At its most basic, HTTrack can be run by specifying just a single URL:
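Using the example.com domain from the explanation that follows:

```shell
httrack http://example.com
```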
This will unleash the program on the `http://example.com` domain with default settings. HTTrack retrieves this URL, then parses the page for more links. Any links found within the page are downloaded next and parsed for additional links. The process continues until the crawler cannot find any links it hasn't already downloaded.
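That crawl loop can be sketched as a simple breadth-first traversal. The following is an illustration of the idea, not HTTrack's actual code; `fetch_page` stands in for the HTTP request, and the fake three-page site at the bottom is purely hypothetical:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch_page):
    """Download a page, parse it for links, queue any link not seen
    before, and repeat until nothing new turns up."""
    seen = {start_url}
    queue = [start_url]
    downloaded = []
    while queue:
        url = queue.pop(0)
        html = fetch_page(url)      # stand-in for the actual HTTP GET
        downloaded.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:    # skip anything already downloaded
                seen.add(link)
                queue.append(link)
    return downloaded

# A tiny fake site to demonstrate the traversal:
site = {
    "/":  '<a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<a href="/b">B again</a>',
    "/b": '<a href="/">home</a>',
}
print(crawl("/", site.__getitem__))  # ['/', '/a', '/b']
```

Note that `/b` is linked from two pages but downloaded only once, and the loop terminates when the link back to `/` turns up nothing new.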
You can also add options to the basic command to customize HTTrack's behavior. For example, you can specify forbidden URLs and directories, alter download speeds, and limit downloads to a certain filetype. HTTrack has a huge number of options, accessible via `httrack --help` and at the project website.
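For instance, reusing only the flags that appear elsewhere in this post (check `httrack --help` for the full list on your version), a run that skips PDFs, keeps CSS, and saves to a named folder might look like:

```shell
httrack http://example.com \
  "-example.com/*.pdf" \
  "+example.com/*.css" \
  --path "~/mirrors/example/" \
  --verbose
```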
Through trial and error, I came up with the following formula (broken out by line to make it more readable):

```shell
httrack https://atomicobject.com \
  -atomicobject.com/assets/* \
  +atomicobject.com/*.css \
  +atomicobject.com/*.js \
  -atomicobject.com/documents/* \
  -atomicobject.com/uploadedImages/* \
  --path "~/httrack-copies/atomicobject/" \
  --verbose
```
Let’s take a detailed look at what each option in the command does:
`https://atomicobject.com` — as we saw in the basic syntax above, this points HTTrack at the site we want to copy.
`-atomicobject.com/assets/*`, `-atomicobject.com/documents/*`, `-atomicobject.com/uploadedImages/*` — a rule that begins with a minus sign indicates something that we don't want HTTrack to download. In this case, we've specified three paths not to download, because this is where all of our image and other non-HTML assets are located.
Note that each URL includes a wildcard symbol (“*”) at the end of the path. The use of the wildcard means that any file located within these three directories will match the rule, effectively disallowing the crawler from the entire directory.
`+atomicobject.com/*.css`, `+atomicobject.com/*.js` — a rule preceded by a plus (+) sign indicates something we do want to download; here, the site's CSS and JavaScript files.
`--path "~/httrack-copies/atomicobject/"` — the `--path` option specifies where we want HTTrack to save downloaded files. Without this option, files are downloaded to the current working directory.
`--verbose` — the verbose option tells HTTrack to print its log to the terminal, allowing us to monitor the program as it runs.
With the above settings, I can create a full copy of all HTML, CSS, and JS files on the Atomic website in just under four minutes. If you’re looking for an efficient tool to create a copy of a website, make sure to check out HTTrack.