Archiving websites

I'm creating a script to archive websites but I can't decide on something,
I want to recursively download every part of the site, not just a single page and whatever it needs to display correctly. But some sites like Jow Forums use different domain names for different stuff like images, and if I enable wget to span hosts I will end up trying to download the whole internet.
What should I do?

Attached: 82.jpg (500x375, 159K)

wget -r -H -D4chan.org,4cdn.org,boards.4chan.org

yeah that's what i use for my specific Jow Forums thread script but i want a generic one for other websites

wget -r example.com

you are welcome

Those google fonts, wordpress plugins, and gravatars are known as "the botnet". Be like RMS and only browse webpages emailed to you as a PDF.

Another option: github.com/JonasCz/save-for-offline

that doesn't even span hosts
so in Jow Forums for example you would download only a single html file with no images

You have to work it out for each site since every site is different. Or you could write your own logic to parse each html file, then feed them individually into wget using page-requisites.
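A crude sketch of that (assumes the html files are already downloaded locally and that your grep supports -E and -o):

grep -ohE 'href="https?://[^"]*"' *.html | cut -d'"' -f2 | sort -u > URLs.txt
while read -r url; do wget --span-hosts --page-requisites "$url"; done < URLs.txt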

the same problem exists: if i parse it myself, how do i know which hosts contain parts of the website and which are completely unrelated?

so, like that is how you hack Jow Forums, lets hack Jow Forums then, and make Jow Forums great again :)

Attached: bYOn88w.png (1200x1074, 973K)

that's completely irrelevant
>script to archive websites
>download every part of the site so not just a single page

You only need to spider the site for html, forget everything else, and generate a simple list of URLs. Then feed each URL into wget using page-requisites and span hosts, but NOT with a mirror / recursive option. Wget will then only download each individual URL and the css / images needed to display it. It will not recurse into any other links.
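Something like this for the spider pass (rough sketch; the grep over the log is crude and you would probably want to narrow it down to html pages only):

wget --spider --recursive --level=inf --no-verbose --output-file=spider.log https://example.com/
grep -oE 'https?://[^ ]*' spider.log | sort -u > URLs.txt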

that sounds perfect but will wget be able to connect all the downloads into one coherent site?
I use --convert-links but that won't work if i run wget multiple times

For example

wget -e robots=off --content-on-error --no-check-certificate --span-hosts --page-requisites --convert-links URLs.txt


Now you just need to fill URLs.txt with every URL of the site you want to save, eg

example.com/ass
example.com/tits
example.com/cock
etc


and wget will download everything needed to display each link. But honestly this is more work than just working out the site structure and using regex filters to restrict wget to only the links that match;

gnu.org/software/wget/manual/html_node/Types-of-Files.html

So for Jow Forums for example you could just put

--accept-regex="Jow Forums.org|4cdn.org"


or if you wanted all media from any site but nothing else;

--accept-regex="css|jpg|gif|png"


etc

Oh, URLs.txt needs to be passed as --input-file=URLs.txt to work, sorry
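So the whole thing would be:

wget -e robots=off --content-on-error --no-check-certificate --span-hosts --page-requisites --convert-links --input-file=URLs.txt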

yeah but if site1, downloaded from URLs.txt, has a link to site2, won't that link point to the online version instead of the downloaded one?
is wget clever enough to fix it automatically with --input-file and --convert-links?

I don't know anymore, I gave up archiving websites years ago. But as someone who spent years archiving using wget, here's my advice: you need to work out the site structure and use the regex filters. It's not hard, and it's much more powerful than the other options. Also forget about converting the links, you should be using the warc option to store the data.
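For reference, a warc run looks roughly like this (it writes a .warc.gz and a cdx index alongside the normal mirror):

wget --mirror --page-requisites --warc-file=example --warc-cdx https://example.com/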

even if i store the warc data, how can i browse the mirror if the hrefs and stuff point to the online site?
thanks a lot for the help by the way

>you need to work out the site structure and use the regex filters
that means looking at the html data on each site?
i know it would be too much work and i will end up not archiving anything

have it prompt for each new domain encountered, or make a whitelist of common CDN or other service domains.
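Something along these lines; the CDN list is just a guess, adjust it to whatever the site actually pulls from:

wget -r --span-hosts --page-requisites --domains=example.com,cloudfront.net,gstatic.com,gravatar.com https://example.com/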

My peenus weenus ha ha ha!

ITT: archiving websites
Well my answer is: my peenus weenus hahaha!

Peenus weenus

It's very simple. Give me an example site and I can show you how I would do it.

well, a Jow Forums thread would be a great example, since even a vim script to make the single-line html readable takes too long to complete. but besides that, i don't see any problems with the automatic solution you said before
Also FUCK THE MODERN WEB

This thread, with both preview and full-sized images in same folder as html.

why would you want that specific file structure?
(i'm not )

Here you go

wget --adjust-extension --no-check-certificate -U "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.17 Safari/537.36" -e robots=off --retry-connrefused -m -H -p -P "./output/" --accept-regex="\.css|\.jpg|\.png|\.gif|\.webm" "THREAD_URL"

Add --no-directories

Based

Oh I forgot --convert-links if you want it browseable

>-H
>no -D
wouldn't that download the whole internet?

What about JS dependent content? I have used PhantomJS and some shitty qt web browsers for web scrapers on some tasks.

No because I specified regex filters only to download css and images

>--accept-regex="\.css|\.jpg|\.png|\.gif|\.webm"

It will only follow links that match those filters. It would download images and css from anywhere on the internet if Jow Forums linked to them, but Jow Forums doesn't.

You're fucked. Even httrack shits itself on js

yeah but that's a whitelist, an obscure site might use some file extension for something that i might want to archive
The first solution is perfect though isn't it?

>an obscure site might use some file extension for something that i might want to archive

Yes, so you adapt the regex for each site. There is no one size fits all.

>There is no one size fits all.
>The first solution is perfect though isn't it?
>i don't see any problems with the automatic solution you said before

As a purist I always favored regex but sure, use the other solution if you want. Just keep in mind there are many advanced cases where it will fail or fall into recursion loops.

>cases where it will fail
the only way i see it failing is if half the site's html is on one domain and the other half on a different one
>fall into recursion loops
how?
you get the urls, which can't fall into an infinite loop, and then you get all the page requisites, which can't fall into one either

That works pretty good... =) It just needs sed to rewrite links but otherwise works without a hitch.

So what's the site you're trying to archive, op?

>you get the urls

Easy for a small site, but very hard for a large or dynamically generated site. As you scale up you will have to embrace filters.

Should i maybe just not span hosts, and then check if a site i downloaded is broken and add -D exceptions until it works?
i don't know too much html, and i assume i'd need to parse many more languages than that to get such a thing working automatically and correctly
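i mean something like: first run

wget -m -p -k https://example.com/

and if images come out broken, re-run with span hosts plus the extra domain added, e.g.

wget -m -p -k -H -D example.com,img.examplecdn.net https://example.com/

(the second domain being whatever host the broken images point at, just a made-up name here)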

Peenus Weenus :)

i don't have something specific in mind, if i did i would just adjust --domains=domain-list for it specifically
The idea is to have a script that i can pull up with a keybinding, paste the url and it would download whatever site i want at that time
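something like this, i guess (rough sketch, assumes xclip is installed for the clipboard part):

#!/bin/sh
# rough sketch: take the url as an argument, or from the clipboard, or ask for it
url="$1"
[ -z "$url" ] && url="$(xclip -o -selection clipboard 2>/dev/null)"
[ -z "$url" ] && { printf 'url: '; read -r url; }
# mirror the site with its page requisites into ~/archive
wget --mirror --page-requisites --adjust-extension --convert-links -e robots=off -P "$HOME/archive" "$url"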

what the fuck is that meme kid?

A dirty little secret: 95% of the time something quick and dirty is good enough, with maybe a handful of special-case workarounds. For specific websites, that is! Archiving the whole internet would demand doing things right. :/

yeah that's what i'll do
every archived site is going to have its own download script for cron to run anyway, so i can just adjust that
i didn't know about warc, i'm going to use that too
thanks a lot man!
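the cron side is just one line per site anyway, something like (path made up obviously):

0 4 * * * /home/anon/archive/example.com.sh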