wget notes

Mirror site for local viewing

You can use wget to spider/crawl a site and make a mirror of it for local viewing:

wget -m -p http://example.com

The mirror is placed in a new directory with the same name as the domain (example.com in the above example).

-m
Mirror. This is equivalent to -r -N -l inf --no-remove-listing: recursive retrieval to an infinite depth/level, with timestamp checking (files are only re-downloaded if they’ve changed on the server), and without removing the temporary directory listing files (.listing).
-p
Download “page requisites” i.e. all files necessary for proper display.
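
In other words, given the equivalence noted for -m, the command above could equally be written out long-hand:

wget -r -N -l inf --no-remove-listing -p http://example.com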

I sometimes like to convert pages for local viewing, which changes links in the markup so that they point to the local pages:

wget -m -p -k -K -E http://example.com

-k
Convert links for local viewing.
-K
Back up the original file (with a .orig suffix) before converting it.
-E
Add an .html extension to downloaded HTML pages that don’t already have one.
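
As a rough, hypothetical illustration of what the conversion does (page name made up): a link in the downloaded markup such as

<a href="http://example.com/about">

would be rewritten to point at the local copy, e.g.

<a href="about.html">

while -K keeps the pre-conversion version of each rewritten page around with a .orig suffix.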

Interesting options

-np
No parent. Do not ascend to parent directory.
-w [n]
Wait [n] seconds between requests (long form: --wait=[n]).
--random-wait
Wait a random period of time between 0 and 2*[n] seconds, where [n] is the -w value. This helps prevent sites that analyse access logs from detecting and blocking wget.
-R [filelist]
Reject (aka ignore) files in [filelist], a comma-separated list of filenames, e.g. file1,file2. Wildcards can be used. Only the filename part of the URL is matched (e.g. filename in http://one/two/filename).
-X [dirlist]
Reject (aka ignore) directories in [dirlist], a comma-separated list of directory names, e.g. dir1,dir2. Wildcards can be used. Each directory specified in [dirlist] must be a full path, e.g. /en/content or */content.
--no-parent
Only download the given folder and its sub-content (this is the long form of -np above).
-e robots=off
Ignore robots.txt.
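
Several of these can be combined. For example, a recursive grab that skips archives and a couple of directories might look something like this (the URL, file patterns and paths are made up for illustration):

wget -m -p -np -w 2 --random-wait -R "*.zip,*.iso" -X "/private,/tmp" -e robots=off http://example.com/docs/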

Mirror site with external dependencies

If you’re mirroring a site which has css/javascript/etc. references to a different domain, you can use the -p option to grab all files necessary to properly display the pages (e.g. css/javascript) along with the -D and -H options to mirror across the domains.

wget -m -p -H -D example.org,example.net example.com

The options used are as follows:

-m
Mirror.
-p
Download “page requisites” i.e. all files necessary for proper display.
-D [domains]
Set the domains to be followed.
-H
Span across hosts when recursive (which mirroring is, because -m is equivalent to -r -N -l inf --no-remove-listing - see Mirror site for local viewing above).

NOTE that I’m unsure exactly how -H and -D interact. Using ‘-D domain1,domain2’ on its own did not seem to result in resources being retrieved from those domains; the -H option had to be present as well. Whether adding -H then allows wget to wander onto any domain, or only those listed with -D, is unclear to me.
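
One way to probe this (a hedged sketch; cdn.example.net stands in for whatever host actually serves the assets) is to run the mirror into separate directories with and without -H and see whether a cdn.example.net directory appears:

wget -P without-span -m -p -D cdn.example.net http://example.com/
wget -P with-span -m -p -H -D cdn.example.net http://example.com/

-P just sets the directory prefix the downloads are saved under, so the two runs don’t overwrite each other.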

wget for cron jobs / routine tasks

The following will grab a page silently, so it is useful for routine tasks executed periodically by cron (Drupal uses a command of this form for its maintenance tasks):

wget -O - -q -t 1 http://example.com/whatever

-O -
Output results to stdout.
-q
Be quiet.
-t 1
Try only once, i.e. don’t retry on failure (-t sets the number of tries; the timeout option is -T).
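
A crontab entry in this style might look like the following (the cron.php path and the hourly schedule are assumptions for illustration, not anything wget requires):

0 * * * * wget -O - -q -t 1 http://example.com/cron.php > /dev/null 2>&1

The redirect throws the page body away so cron doesn’t email it every run, while -O - stops wget littering the filesystem with saved copies.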
