Archiving websites
I'm creating a script to archive websites, but I can't decide on something.
I want to recursively download every part of a site, not just a single page and whatever it needs to display correctly. But some sites, like Jow Forums, use different domain names for different things (images, for example), and if I let wget span hosts I'll end up trying to download the whole internet.
What should I do?
wget -r -H -D4chan.org,4cdn.org,boards.4chan.org
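and if you want the saved copy to actually open offline, roughly this (the target URL is just an example, and the extra flags beyond -H/-D are only a suggestion):

# span hosts, but only into the whitelisted domains; also grab the images,
# CSS and JS each page needs, and rewrite links so it works locally
wget -r -H -D4chan.org,4cdn.org,boards.4chan.org \
     --page-requisites --convert-links --adjust-extension \
     --wait=1 https://boards.4chan.org/g/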
Yeah, that's what I use for my specific Jow Forums thread script, but I want a generic one for other websites.
wget -r example.com
you are welcome
Those Google fonts, WordPress plugins, and Gravatars are known as "the botnet". Be like RMS and only browse webpages emailed to you as PDFs.
Another option: github.com
That doesn't even span hosts.
So in Jow Forums, for example, you'd download only a single html file with no images.
You have to work it out for each site, since every site is different. Or you could write your own logic to parse each html file and feed them individually into wget with --page-requisites.
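Rough sketch of that in shell, assuming the start URL comes in as an argument and that grepping hosts out of the html is good enough for a first pass (it isn't robust, a proper HTML parser would do better):

#!/bin/bash
# naive sketch: fetch the top page, collect every external host it references,
# then re-run wget restricted to those hosts plus the site's own host
START_URL="$1"                      # e.g. https://example.com/
SITE_HOST=$(echo "$START_URL" | sed -E 's#https?://([^/]+).*#\1#')

wget -q -O index.html "$START_URL"

# every host mentioned in the page, comma-separated for wget's -D
ALL_HOSTS=$(grep -oE 'https?://[^/" >]+' index.html \
            | sed -E 's#https?://##' | sort -u | paste -sd, -)

wget -r -H -D"$SITE_HOST,$ALL_HOSTS" \
     --page-requisites --convert-links --adjust-extension \
     "$START_URL"

That only looks at the top page though; for a deeper crawl you'd repeat the parse for every html file wget saves.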
The same problem exists: if I parse it myself, how do I know which hosts contain parts of the website and which are completely unrelated?
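You can't know for certain, but one heuristic gets you most of the way there: hosts the page pulls in via src= (images, scripts, CSS) are part of the site, while hosts that only ever appear inside <a href="..."> are just outbound links you don't want to crawl. Crude version of that check, reusing the index.html from above and with the same naive-regex caveat:

# split external hosts into "requisites" (referenced by src=) and
# "link-only" (only referenced by <a href=...>); whitelist the former
grep -oE 'src="https?://[^/"]+' index.html \
    | sed -E 's#src="https?://##' | sort -u > requisite_hosts.txt
grep -oE '<a [^>]*href="https?://[^/"]+' index.html \
    | sed -E 's#.*href="https?://##' | sort -u > linked_hosts.txt
# hosts in linked_hosts.txt but not in requisite_hosts.txt are likely unrelated
comm -13 requisite_hosts.txt linked_hosts.txt

Still a heuristic, and anything loaded from inside CSS or JS won't show up, but it beats whitelisting domains by hand for every single site.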
So, like, that is how you hack Jow Forums? Let's hack Jow Forums then, and make Jow Forums great again :)
that's completely irrelevant
>script to archive websites
>download every part of the site, not just a single page