Archiving websites

I'm creating a script to archive websites but I can't decide on something,
I want to recursively download every part of the site, not just a single page and whatever it needs to display correctly. But some sites like Jow Forums use different domain names for different stuff like images, and if I enable wget to span hosts I will end up trying to download the whole internet.
What should I do?

Attached: 82.jpg (500x375, 159K)

wget -r -H -D4chan.org,4cdn.org,boards.4chan.org

yeah that's what i use for my specific Jow Forums thread script but i want a generic one for other websites

wget -r example.com

you are welcome

Those google fonts, wordpress plugins, and gravatars are known as "the botnet". Be like RMS and only browse webpages emailed to you as a PDF.

Another option: github.com/JonasCz/save-for-offline

that doesn't even span hosts
so in Jow Forums for example you would download only a single html file with no images

You have to work it out for each site since every site is different. Or you could write your own logic to parse each html file, then feed them individually into wget using page-requisites.
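A crude sketch of that (assumes the html files are already downloaded locally and that your grep supports -E and -o):

grep -ohE 'href="https?://[^"]*"' *.html | cut -d'"' -f2 | sort -u > URLs.txt
while read -r url; do wget --span-hosts --page-requisites "$url"; done < URLs.txt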

the same problem exists: if i parse it myself, how do i know which hosts contain parts of the website and which are completely unrelated?

so, like that is how you hack Jow Forums, lets hack Jow Forums then, and make Jow Forums great again :)

Attached: bYOn88w.png (1200x1074, 973K)

that's completely irrelevant
>script to archive websites
>download every part of the site so not just a single page

You only need to spider the site for html, forget everything else, and generate a simple list of URLs. Then feed each URL into wget using page-requisites and span hosts, but NOT with a mirror / recursive option. Wget will then only download each individual URL and the css / images needed to display it. It will not recurse into any other links.
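Something like this for the spider pass (rough sketch; the grep over the log is crude and you would probably want to narrow it down to html pages only):

wget --spider --recursive --level=inf --no-verbose --output-file=spider.log https://example.com/
grep -oE 'https?://[^ ]*' spider.log | sort -u > URLs.txt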

that sounds perfect but will wget be able to connect all the downloads into one coherent site?
I use --convert-links but that won't work if i run wget multiple times

For example

wget -e robots=off --content-on-error --no-check-certificate --span-hosts --page-requisites --convert-links URLs.txt


Now you just need to fill URLs.txt with every URL of the site you want to save, eg

example.com/ass
example.com/tits
example.com/cock
etc


and wget will download everything needed to display each link. But honestly this is more work than just working out the site structure and using regex filters to restrict wget to only the links that match;

gnu.org/software/wget/manual/html_node/Types-of-Files.html

So for Jow Forums for example you could just put

--accept-regex="Jow Forums.org|4cdn.org"


or if you wanted all media from any site but nothing else;

--accept-regex="css|jpg|gif|png"


etc

Oh, URLs.txt needs to be passed as --input-file=URLs.txt to work, sorry
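So the whole thing would be:

wget -e robots=off --content-on-error --no-check-certificate --span-hosts --page-requisites --convert-links --input-file=URLs.txt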

yeah but if site1, downloaded from URLs.txt, has a link to site2, won't that link point to the online version instead of the downloaded one?
is wget clever enough to fix it automatically with --input-file and --convert-links?

I don't know anymore, I gave up archiving websites years ago. But as someone who spent years archiving using wget, here's my advice: you need to work out the site structure and use the regex filters. It's not hard, and it's much more powerful than the other options. Also forget about converting the links, you should be using the warc option to store the data.
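For reference, a warc run looks roughly like this (it writes a .warc.gz and a cdx index alongside the normal mirror):

wget --mirror --page-requisites --warc-file=example --warc-cdx https://example.com/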

even if i store the warc data, how can i browse the mirror if the hrefs and stuff point to the online site?
thanks a lot for the help by the way

>you need to work out the site structure and use the regex filters
that means looking at the html data on each site?
i know it would be too much work and i will end up not archiving anything

have it prompt for each new domain encountered, or make a whitelist of common CDN or other service domains.
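Something along these lines; the CDN list is just a guess, adjust it to whatever the site actually pulls from:

wget -r --span-hosts --page-requisites --domains=example.com,cloudfront.net,gstatic.com,gravatar.com https://example.com/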

My peenus weenus ha ha ha!

ITT: archiving websites
Well my answer is: my peenus weenus hahaha!

Peenus weenus

It's very simple. Give me an example site and I can show you how I would do it.

well, a Jow Forums thread would be a great example, since even a vim script to make the single-line html readable takes too long to complete. but besides that, i don't see any problems with the automatic solution you said before
Also FUCK THE MODERN WEB

This thread, with both preview and full-sized images in same folder as html.

why would you want that specific file structure?
(i'm not )

Here you go

wget --adjust-extension --no-check-certificate -U "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.17 Safari/537.36" -e robots=off --retry-connrefused -m -H -p -P "./output/" --accept-regex="\.css|\.jpg|\.png|\.gif|\.webm" "THREAD_URL"

Add --no-directories

Based

Oh I forgot --convert-links if you want it browseable

>-H
>no -D
wouldn't that download the whole internet?

What about JS dependent content? I have used PhantomJS and some shitty qt web browsers for web scrapers on some tasks.

No because I specified regex filters only to download css and images

>--accept-regex="\.css|\.jpg|\.png|\.gif|\.webm"

It will only follow links that match those filters. It would download images and css from anywhere on the internet if Jow Forums linked to them, but Jow Forums doesn't.

You're fucked. Even httrack shits itself on js

yeah but that's a whitelist, an obscure site might use some file extension for something that i might want to archive
The first solution is perfect though isn't it?

>an obscure site might use some file extension for something that i might want to archive

Yes, so you adapt the regex for each site. There is no one size fits all.

>There is no one size fits all.
>The first solution is perfect though isn't it?
>i don't see any problems with the automatic solution you said before

As a purist I always favored regex but sure, use the other solution if you want. Just keep in mind there are many advanced cases where it will fail or fall into recursion loops.

>cases where it will fail
the only way i see it failing is if half the site's html is on one domain and the other half on a different one
>fall into recursion loops
how?
you get the urls, which can't fall into an infinite loop, and then you get all the page requisites, which can't fall into one either

That works pretty good... =) It just needs sed to rewrite links but otherwise works without a hitch.

So what's the site you're trying to archive, op?

>you get the urls

Easy for a small site, but very hard for a large or dynamically generated site. As you scale up you will have to embrace filters.

Should i maybe just not span hosts, and then check if a site i downloaded is broken and add -D exceptions until it works?
i don't know too much html, and i assume i'd need to parse many more languages than that to get such a thing working automatically and correctly
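i mean something like: first run

wget -m -p -k https://example.com/

and if images come out broken, re-run with span hosts plus the extra domain added, e.g.

wget -m -p -k -H -D example.com,img.examplecdn.net https://example.com/

(the second domain being whatever host the broken images point at, just a made-up name here)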

Peenus Weenus :)

i don't have something specific in mind, if i did i would just adjust --domains=domain-list for it specifically
The idea is to have a script that i can pull up with a keybinding, paste the url and it would download whatever site i want at that time
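something like this, i guess (rough sketch, assumes xclip is installed for the clipboard part):

#!/bin/sh
# rough sketch: take the url as an argument, or from the clipboard, or ask for it
url="$1"
[ -z "$url" ] && url="$(xclip -o -selection clipboard 2>/dev/null)"
[ -z "$url" ] && { printf 'url: '; read -r url; }
# mirror the site with its page requisites into ~/archive
wget --mirror --page-requisites --adjust-extension --convert-links -e robots=off -P "$HOME/archive" "$url"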

what the fuck is that meme kid?

A dirty little secret: 95% of the time something quick and dirty is good enough, with maybe a handful of special-case workarounds. For specific websites, that is! Archiving the whole internet would demand doing things right. :/

yeah that's what i'll do
every archived site is going to have its own download script for cron to run anyway, so i can just adjust that
i didn't know about warc, i'm going to use that too
thanks a lot man!
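the cron side is just one line per site anyway, something like (path made up obviously):

0 4 * * * /home/anon/archive/example.com.sh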