Downloading the WordPress Codex
My company is shipping me off to Japan on Saturday for a month (possibly more) to train one of our partners on how best to use our system. It’s going to be an interesting experience. Not only will I be conversing regularly with people whose language I do not speak, but I’ll be completely immersed in a different culture about which I know very little. I mean, until now Japanese culture has played only two roles in my life: Pokémon and Dance Dance Revolution. Obviously, there’s much more to it, and I’m totally excited to learn all about it.
So far, there’s just one thing I’m nervous about: that dreaded plane ride.
In preparing for this trip, I’ve been creating massive to-do lists (Astrid is an amazing tool for this, by the way!). And I’ve identified some technical tasks that would be ideal for the 12+ hours I’ll be spending in the air. One of those tasks: convert a friend’s website from static HTML into a WordPress site. It looks pretty simple at the outset… However, WordPress has some complexity when it comes to templates and plug-ins, so I’d like to have the WordPress documentation available to me while I’m offline.
The problem: the WordPress documentation is a set of HTML pages and cannot be easily downloaded. I searched around a bit, and it seems this question has been asked before. The usual answer: use a tool to download a local copy of the website. The tool of choice: HTTrack. I’m on Ubuntu, so installing was easy:
tkelley:~/> sudo apt-get install httrack
Once installed, I gave it a shot:
tkelley:~/> httrack "http://codex.wordpress.org/"
Mirror launched on Wed, 09 May 2012 13:48:30 by HTTrack Website Copier/3.44-1+libhtsjava.so.2 [XR&CO'2010]
mirroring http://codex.wordpress.org/ with the wizard help..
Done.codex.wordpress.org/ (162 bytes) - 403
Thanks for using HTTrack!
…but that exited pretty quickly. Not what I was expecting for a full site download. During the execution, HTTrack generated a log. The log ended with this:
tkelley:~/> tail hts-log.txt | tail -n2
13:52:56 Error: "Forbidden" (403) at link codex.wordpress.org/ (from primary/primary)
13:52:56 Info: No data seems to have been transfered during this session! : restoring previous one!
403 (“forbidden”) error? That’s weird… I’m able to view it in my browser and via wget/curl without problems:
tkelley:~/> curl -G codex.wordpress.org --write-out %{http_code}"\n" -s -o /dev/null
200
tkelley:~/> wget codex.wordpress.org 2>&1 | egrep HTTP
HTTP request sent, awaiting response... 200 OK
HTTrack must be sending something that WordPress doesn’t like. Of the usual suspects, the most common is the User-Agent header. In this case, it turns out that HTTrack sends its own user agent, “Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)”, and… wouldn’t you know… WordPress isn’t a fan:
tkelley:~/> wget -U "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)" codex.wordpress.org 2>&1 | egrep HTTP
HTTP request sent, awaiting response... 403 Forbidden
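If you want to repeat that check for several user agents, a tiny helper saves some typing. This is only a sketch: the function name status_for_ua is made up for illustration, and it simply wraps curl with -A (set the User-Agent header) and --write-out, the same option used in the curl test earlier.

```shell
# Print the HTTP status code a URL returns for a given User-Agent string.
# Usage: status_for_ua "<user agent>" "<url>"
# (status_for_ua is a hypothetical helper name, not part of curl or HTTrack.)
status_for_ua() {
    # -s            : silent, no progress meter
    # -o /dev/null  : throw away the response body
    # -A "$1"       : send "$1" as the User-Agent header
    # --write-out   : print only the status code, followed by a newline
    curl -s -o /dev/null -A "$1" --write-out '%{http_code}\n' "$2"
}
```

Calling status_for_ua "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)" codex.wordpress.org and then again with a browser-style string should make the difference visible, assuming the server still filters on user agent the way it did when I tried it.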
Now that we know what’s causing it, all we need to do is play make-believe. Pass a “good” user agent string (from a browser that WordPress accepts, say, Chromium) using the -F flag, and we’re good to go:
tkelley:~/> httrack "http://codex.wordpress.org/" -F "Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.19 (KHTML, like Gecko) Ubuntu/11.10 Chromium/18.0.1025.168"
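Since I’ll probably run this more than once, here is the same command wrapped in a small function. Treat it as a sketch: mirror_site and the example directory are names I made up, and the only addition beyond the command above is -O, HTTrack’s option for choosing where the mirror and its log files are written.

```shell
# Mirror a site with HTTrack into a chosen directory, using a custom user agent.
# Usage: mirror_site "<url>" "<output dir>" "<user agent>"
# (mirror_site is a hypothetical wrapper name, not an HTTrack command.)
mirror_site() {
    # -O "$2" : write the mirror, cache and hts-log.txt under this directory
    # -F "$3" : send this string as the User-Agent header
    httrack "$1" -O "$2" -F "$3"
}
```

For example: mirror_site "http://codex.wordpress.org/" "$HOME/codex-mirror" "Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.19 (KHTML, like Gecko) Ubuntu/11.10 Chromium/18.0.1025.168". Keeping the mirror in its own directory also keeps hts-log.txt out of my home directory.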
I honestly don’t know why WordPress would block HTTrack. After all, WordPress is licensed GPLv2, and I imagine its documentation is as well (derivative work?). So anyone should be able to download the entire thing for their own use. Perhaps it becomes a strain on their servers? If so, there are certainly better ways of blocking repeated requests from the same IP address. Anyone out there have any ideas? I’d love to hear them!









