You can use wget to generate a list of the URLs on a website.

Spider the site and write the URLs to urls.txt, filtering out common static files (css, js, images); replace http://example.com/ with the site you want to crawl:

wget --spider -r http://example.com/ 2>&1 | grep '^--' | awk '{ print $3 }' | grep -v '\.\(css\|js\|png\|gif\|jpg\|JPG\)$' > urls.txt

Note that the resulting list contains duplicate URLs.
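
If the duplicates are a problem, one way to remove them is to pass the list through sort -u. This is a minimal sketch, not part of the original recipe: dedupe_urls and urls-unique.txt are hypothetical names chosen for illustration, and the output is sorted rather than kept in crawl order.

```shell
# Hypothetical helper: read URLs on stdin and print each distinct URL once.
# sort -u sorts the lines and drops exact duplicates.
dedupe_urls() {
    sort -u
}

# Example usage (urls-unique.txt is an assumed output file name):
#   dedupe_urls < urls.txt > urls-unique.txt
```

Writing to a separate file avoids truncating urls.txt while it is still being read.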

If you mirror the site instead of spidering it, you seem to get a more comprehensive list without duplicates:

wget -m http://example.com/ 2>&1 | grep '^--' | awk '{ print $3 }' | grep -v '\.\(css\|js\|png\|gif\|jpg\|JPG\)$' > urls.txt

Unlike the spider run, mirroring downloads all pages of the site into a directory with the same name as the domain.

