Index
- Mirror site for local viewing
- Mirror site with external dependencies
- wget for cron job / routine tasks
- References
Mirror site for local viewing
You can use wget to spider / crawl a site and take a mirror of it for local viewing:
wget -m -p http://example.com
The mirror is placed in a new directory named after the host (“example.com” in the above example).
- -m
- Mirror. This is equivalent to -r -N -l inf --no-remove-listing, which means that it is recursive, checks timestamps (won’t overwrite files unless they’ve changed), recurses to an infinite depth/level and won’t remove the temporary directory listings file (.listing).
- -p
- Download “page requisites” i.e. all files necessary for proper display.
I sometimes like to convert pages for local viewing, which changes links in the markup so that they point to the local pages:
wget -m -p -k -K -E http://example.com
- -k
- Convert links for local viewing.
- -K
- Backup original when converting.
- -E
- Add .html extension if not already present.
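As a rough sketch of what the -K backups look like on disk: wget keeps the pre-conversion copy next to each converted file, with a .orig suffix. The helper below lists the files that have such a backup. The name list_converted is invented here, and the example assumes the mirror landed in a directory named after the host.

```shell
#!/bin/sh
# list_converted DIR: for every .orig backup under DIR, print the name of the
# converted file it belongs to (the backup's name without the .orig suffix).
# (list_converted is a hypothetical helper, not part of wget.)
list_converted() {
  find "$1" -type f -name '*.orig' | sort | while IFS= read -r backup; do
    # Strip the .orig suffix to get the converted file's name.
    printf '%s\n' "${backup%.orig}"
  done
}

# Example: list_converted example.com
```

Comparing a listed file with its .orig copy (e.g. with diff) shows exactly which links -k rewrote.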
Interesting options
- -np
- No parent. Do not ascend to parent directory.
- -w [n]
- Wait [n] seconds between requests.
- --random-wait
- Vary the wait randomly between 0.5 and 1.5 times [n] seconds (used together with -w). This makes it harder for a site to spot wget by analysing its access log for evenly spaced requests and then block it.
- -R [filelist]
- Reject (aka ignore) files in [filelist] of comma-separated filenames e.g. file1,file2. Wildcards can be used. Only the filename (e.g. http://one/two/filename) is matched.
- -X [dirlist]
- Reject (aka ignore) directories in [dirlist] of comma-separated directory names e.g. dir1,dir2. Wildcards can be used. Each directory specified in [dirlist] must be the full path, e.g. /en/content or */content.
- --no-parent
- Long form of -np. Only download the starting directory and the content below it.
- -e robots=off
- Ignore robots.txt.
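The options above can be combined into a single "polite" mirror command. The following is only a sketch: polite_mirror, the RUN switch, the rejected suffixes (zip, iso) and the excluded directory (/docs/old) are all assumptions made up for illustration. By default the function prints the command so it can be checked before anything is downloaded.

```shell
#!/bin/sh
# polite_mirror URL [WAIT]: mirror URL without climbing above its directory,
# pausing between requests. Prints the command unless RUN=1 is set.
# (polite_mirror, RUN, the suffixes and /docs/old are illustrative assumptions.)
polite_mirror() {
  url=$1
  wait_secs=${2:-2}          # default to a 2 second base wait
  set -- wget -m -p -k -K -E -np \
    -w "$wait_secs" --random-wait \
    -R zip,iso \
    -X /docs/old \
    "$url"
  if [ "${RUN:-0}" = 1 ]; then
    "$@"                     # actually run wget
  else
    echo "$*"                # dry run: just show the command
  fi
}

polite_mirror http://example.com/docs/ 5
# prints: wget -m -p -k -K -E -np -w 5 --random-wait -R zip,iso -X /docs/old http://example.com/docs/
```

Calling it as RUN=1 polite_mirror http://example.com/docs/ 5 would perform the download instead of printing.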
Mirror site with external dependencies
If you’re mirroring a site whose pages reference CSS, JavaScript, etc. on a different domain, you can use the -p option to grab all files necessary to properly display the pages, along with the -H and -D options to mirror across the domains.
wget -m -p -H -D example.org,example.net example.com
The options used are as follows:
- -m
- Mirror.
- -p
- Download “page requisites” i.e. all files necessary for proper display.
- -D [domains]
- Set domains to be followed.
- -H
- Span across hosts when recursive (which mirroring is, because -m is equivalent to -r -N -l inf --no-remove-listing; see Mirror site for local viewing above).
NOTE on how -H and -D interact: -D does not turn on host spanning by itself, which is why ‘-D domain1,domain2’ retrieved nothing from those domains until -H was added. With -H present, -D limits the spanning to the listed domains; -H on its own lets wget follow links to any host.
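A small sketch of building the -H/-D command for several domains. span_cmd is a hypothetical helper that only prints the command. Keeping the start host in the -D list is a cautious assumption on my part rather than something tested here.

```shell
#!/bin/sh
# span_cmd START_HOST [EXTRA_DOMAIN...]: print a wget command that mirrors
# START_HOST and may also fetch from the listed extra domains.
# (span_cmd is a hypothetical helper; it only prints, it does not download.)
span_cmd() {
  start=$1
  shift
  domains=$start             # keep the start host in the -D list to be safe
  for d in "$@"; do
    domains="$domains,$d"    # -D takes a comma-separated list
  done
  echo "wget -m -p -H -D $domains http://$start/"
}

span_cmd example.com example.org example.net
# prints: wget -m -p -H -D example.com,example.org,example.net http://example.com/
```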
wget for cron job / routine tasks
The following will grab a page silently, so is useful for e.g. routine tasks that will be executed periodically by cron (Drupal uses a command in this form for its maintenance tasks):
wget -O - -q -t 1 http://example.com/whatever
- -O -
- Output results to stdout.
- -q
- Be quiet.
- -t 1
- Try once only (no retries). To set a timeout, use -T [n], which gives up after [n] seconds.
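For cron use it can help to wrap the command so that a failure is reported instead of silently ignored. This is a minimal sketch under stated assumptions: fetch_once, run_task, the overridable WGET variable, the 30 second -T timeout and the script path in the crontab comment are all invented for illustration.

```shell
#!/bin/sh
# Hypothetical cron wrapper: fetch a URL once, quietly, and report the result.
WGET=${WGET:-wget}           # overridable, e.g. to point at a specific binary

fetch_once() {
  # -O - : write the page to stdout (discarded here)
  # -q   : quiet
  # -t 1 : a single attempt, no retries
  # -T 30: give up after 30 seconds (an assumed value)
  "$WGET" -O - -q -t 1 -T 30 "$1" > /dev/null
}

run_task() {
  if fetch_once "$1"; then
    echo "ok: $1"
  else
    echo "failed: $1" >&2    # cron mails anything written to stderr
    return 1
  fi
}

# Example crontab entry (every 15 minutes), assuming this file is saved as
# /usr/local/bin/run-task.sh and ends with: run_task "$1"
# */15 * * * * /usr/local/bin/run-task.sh http://example.com/whatever > /dev/null
```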