Web image crawler?

I found the only website that hosts high-quality images of Gustave Dore's Divine Comedy gravures.

The thing is, there's a lot of them and you have to click several times to get to the actual image, so it's pretty slow to go through - and something like DownThemAll wouldn't work since they're not on a single page.

How would I go about automating this process? Downloading the entire website, maybe?

Would any of the programs listed here work?
medium.com/@octoparse/top-20-web-crawler-tools-to-scrape-the-websites-9088a4b6618d

Attached: 567208689[1].jpg (1015x1256, 396K)

Here's the website, by the way:
gravures.ru/photo/gjustav_dore/bozhestvennaja_komedija_ad/30

>.ru

I aint clicking that shit, cia nigger

Attached: 1518886256305.gif (390x158, 1.83M)

Then find me an american site that hosts that shit, Terry.

Did you check out wget? It can recursively follow the links on every page and grab the images it finds there. Add some rules so it skips unrelated links and it will go all the way down looking for them.

Forgot to mention use lynx along with that. Should make your life much easier.

I'm a winfag and command prompt-illiterate... I'll still give it a try, though.

That's a nice task. I may give it a try in a few hours.

If you do, do you mind uploading the images somewhere afterwards? I'm not sure if I'll be able to figure it out.

Jow Forums should do :^)

>requires registration to view the full versions
Then upload all of those to the Internet Archive for posterity OP

Wouldn't it be easier to upload them all at once on MEGA or imgur or something?

You might also want to check the dedicated page about Gustave Doré at the French National Library (BnF)

expositions.bnf.fr/orsay-gustavedore/index.htm

Specifically for Hell (Inferno):
expositions.bnf.fr/orsay-gustavedore/albums/enfer/index.htm

I have a leatherbound version of Inferno and it has these illustrations in it and I love them. Dore really captures the look so well.

Disregard the greentext; have a sample image for you all

Attached: 806014016.jpg (1015x1267, 377K)

Nice! Great detail, but they're cropped for some reason?

Attached: dor_032[1].jpg (1260x835, 434K)

Here's an image from the russian website for comparison.

Attached: 28470342[1].jpg (1291x1015, 382K)

Disregard that, I'm retarded - the cropped version is a preview. There's a zoom-in option to the side.

Click the printer icon in the sidebar to retrieve the full-size picture.
You can get even more if you follow the "l'image sur Gallica" link.

Wow, there's a preview of the entire book in batshit insane quality... but once you download the entire thing, it's in shit quality...?
And it only lets you download the high quality images in cropped segments? What the fuck?
gallica.bnf.fr/ark:/12148/bpt6k10448149/f167.item.zoom

Attached: L'Enfer_de_Dante_Alighieri_avec_[...]Dante_Alighieri_bpt6k10448149_167.jpg (892x444, 101K)

You can get the HD version by modifying the URL from the first form below to the second:

gallica.bnf.fr/ark:/12148/bpt6k10448149/f167.item.zoom

gallica.bnf.fr/iiif/ark:/12148/bpt6k10448149/f167/full/full/0/native.jpg
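
A minimal Python sketch of that URL rewrite, assuming every viewer link has the same shape as the one above (`.../ark:/<number>/<document id>/f<page>.item...`). The function name `to_full_size_url` is made up for illustration; the `/iiif/.../full/full/0/native.jpg` pattern is simply copied from the working link in this post.

```python
import re

# Captures the document and page part of a Gallica viewer link, e.g.
# "ark:/12148/bpt6k10448149" and "167" in ".../ark:/12148/bpt6k10448149/f167.item.zoom".
_VIEWER_RE = re.compile(
    r"^(?P<prefix>.*?)(?P<ark>ark:/\d+/[^/]+)/f(?P<page>\d+)\.item.*$"
)


def to_full_size_url(viewer_url):
    """Turn a viewer link into the full-size JPEG link (hypothetical helper)."""
    match = _VIEWER_RE.match(viewer_url.strip())
    if match is None:
        raise ValueError("not a recognised viewer link: " + viewer_url)
    # Keep whatever came before "ark:" (host, optional scheme) and insert "iiif/".
    return (
        f"{match['prefix']}iiif/{match['ark']}"
        f"/f{match['page']}/full/full/0/native.jpg"
    )
```

Calling `to_full_size_url("gallica.bnf.fr/ark:/12148/bpt6k10448149/f167.item.zoom")` gives back the second link above.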

I think it does that because the images are in such high quality that it lazy loads it as a bunch of smaller images when you view it.

Nice! It works. Now if only Purgatorio and Paradiso were uploaded there as well...

Wait a minute, they are. Navigating a website in french is a bit difficult, admittedly.
If this thread is up by the time I'm done painstakingly downloading every image, I'll upload a pdf here and on archive.org so no one ever has to go through this shit again.

There are python scripts already made to do what you want to do automatically.
I'll help you a bit; it's easier for me since it's my native language.

github.com/Dorialexander/Pyllica

I think pyllicalabsjpg might be useful to you: it can download every page. You just have to provide the id (like /12148/btv1b86000454/) in actionpyllicalabsjpg.py.

Write a scraper in Python using the standard scraping modules: Requests, BeautifulSoup, Selenium, maybe even PyAutoGUI as a last resort if something is very hard. There's always a way. But you should be able to get this done with Requests and BeautifulSoup alone. Consider installing the lxml module too; it's a fast parser that BeautifulSoup can use (the built-in html.parser also works).
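
For anyone wondering what such a scraper looks like, here is a rough sketch. Note that it swaps in Python's standard library (`html.parser` and `urllib`) for Requests and BeautifulSoup so nothing has to be installed. The class and function names are made up, and it only shows the core step: pulling image and link URLs out of one page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects absolute URLs of <img src=...> and <a href=...> on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            # Resolve relative paths against the page address.
            self.images.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))


def collect(html, base_url):
    """Parse an HTML string and return the collector with its two URL lists."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector


def fetch(url):
    """Download one page as text (needs network access; not called here)."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")
```

A crawler would call `fetch` on a gallery page, pass the result to `collect`, save what is in `.images`, and repeat for the entries in `.links` that look like gallery pages.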

you also have to modify the 3000 in pyllicalabsjpg.py to full

I wouldn't really know how to use those... as I said before, I'm a winfag and command prompt-illiterate.
I'm just going through the pages and copying the numbers of the pages with illustrations on them - then I'm gonna just modify all the URLs at once in notepad, so they look like the full-size link posted above. Kind of primitive, but it should work.
Small problem though - I'll have a long-ass list of JPGs at the end, but I'll still have to open and download all of them manually.
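
The notepad step could also be sketched in a few lines of Python: given the document id and the page numbers you copied, it builds the full-size links and a distinct, zero-padded file name for each. The `https://` prefix, the function names and the three-digit padding are assumptions for illustration; the rest of the URL pattern is the one already posted in this thread.

```python
def full_size_url(document_id, page):
    """Build the full-size image link for one page (pattern taken from the thread)."""
    return (
        "https://gallica.bnf.fr/iiif/ark:/12148/"
        f"{document_id}/f{page}/full/full/0/native.jpg"
    )


def build_download_list(document_id, pages):
    """Pair each page with a unique file name such as 'f025.jpg'.

    Zero-padding keeps the files in page order when sorted by name.
    """
    return [(f"f{page:03d}.jpg", full_size_url(document_id, page)) for page in pages]
```

For example, `build_download_list("bpt6k10448149", [25, 167])` returns one `(file name, link)` pair per page, in the order the page numbers were given.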

>I'll have a long-ass list of JPGs at the end, but I'll still have to open and download all of them manually.
Nevermind that, the GUI version of youtube-dl that I use can download jpegs - I'm all set.

AutoHotKey would be great for this. Alternatively Selenium

>Nevermind that, the GUI version of youtube-dl that I use can download jpegs - I'm all set.

Nevermind that again - since all the jpegs have the same name it doesn't work.
I'm trying chrome url downloader addon now. It works, but it messed up the order of the images...

I tried installing wget, but when I try using it, it keeps asking for LIBEAY32.dll even though it's in the system folder with all the rest of the necessary fucking files.

How many galleries do you want to download?

Just write a scraper in python.

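If there aren't many Gallica galleries, I could try to put together a file list and OP could download them via wget:
wget -nd -i FILENAME
-nd saves everything into the current directory, and when several files share a name wget adds a numeric suffix (.1, .2, ...) instead of overwriting.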

Alright, the chrome addon sucks, I'm trying some open source download manager called uget now (ugetdm.com/) - it works.
The only problem is it fucks up the extensions and names files like:
native.jpg2
native.jpg3
native.jpg4
instead of a normal god damn filename.
I don't know how to write anything, I'm not a programmer.

Neither am I, I'm a musician. I just read a book

Fuck it, here's the list of inferno illustration jpegs from the french site. I can't figure out a way to download them in the proper order, so if anyone wouldn't mind sending me a zip file of them that would be great. We could do this for the other two books as well after I sort through them and upload another pastebin.
pastebin.com/myKhtHHi

send links to other 2 books

A bit of a late question, but how did you know how exactly to modify the url?
I tried inspecting the page's code but couldn't find a .jpg or anything.

The website has documentation in French; you can find it here:

api.bnf.fr/api-iiif-de-recuperation-des-images-de-gallica

OP, status report. Where are you now? Do you still need help?

nvm, see it:
Let's see what could be done.

downloading
at page 25

Purgatorio:
pastebin.com/rujLZsCw

uploading inferno to mega

> at page 25
So you're doing it?
Anyway, here's the bash one-liner:
while read -r line ; do name=$(echo "$line" | tr '/' ' ' | awk '{print $7}') ; echo "$line" ; echo "$name" ; wget -O "$name.jpg" "$line" ; sleep 5 ; done < jpegs.list

jpegs.list is the pastebin list after conversion; the raw download is unusable because of its CRLF line endings, so I had to convert it with
tr -d '\015' < rujLZsCw > jpegs.list
Let me know if you want any help.
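
Roughly the same loop in Python, for anyone without bash; a sketch, not a drop-in replacement. The function names are made up. Instead of counting fields like `awk '{print $7}'`, it takes the path segment right before the first `full`, so it doesn't matter whether the links start with `http://` or not.

```python
import time
from urllib.request import urlretrieve


def read_url_list(text):
    """Split a pasted list into URLs, dropping CRLF line endings and blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def page_name(url):
    """Return the page segment (e.g. 'f167') that sits right before '/full/full/'."""
    parts = url.split("/")
    return parts[parts.index("full") - 1]


def download_all(urls, delay_seconds=5):
    """Save each link as <page>.jpg, pausing between requests like the one-liner."""
    for url in urls:
        target = page_name(url) + ".jpg"
        print(url, "->", target)
        urlretrieve(url, target)  # needs network access
        time.sleep(delay_seconds)
```

Usage would be something like `download_all(read_url_list(open("rujLZsCw").read()))`, run from the folder where the images should end up.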

not OP but i downloaded that pastebin and uploading on mega

I was just about to post "I can't find Paradiso on the website wtf" but I realized it was included in the second book, so those two pastebins should be everything.
One day I'll learn how to do shit like this , then upload all of Dore's stuff from gallica to archive.org, but not tonight. I have so much work I'm neglecting right now.

So in 401 and onwards is Paradiso.

mega.nz/#F!3moTAQbZ!7uDeg92q1omIqwl9SmK9Xw

uploading second one

mega.nz/#F!jqo1zaaS!VAZRddSOKr-2CmyQjIX1bQ

second one
split paradiso yourself
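
If you'd rather not split it by hand, a small Python sketch could do it. It assumes two things that nobody in the thread has confirmed: that the files are named after their real Gallica page number (`f401.jpg` and so on), and that the "401" mentioned earlier refers to that page number. The function names are made up.

```python
import re


def page_number(filename):
    """Extract 401 from a name like 'f401.jpg' (assumed naming scheme)."""
    match = re.fullmatch(r"f(\d+)\.jpg", filename)
    if match is None:
        raise ValueError("unexpected file name: " + filename)
    return int(match.group(1))


def split_volume(filenames, paradiso_start=401):
    """Return (purgatorio, paradiso) lists, each sorted by page number."""
    purgatorio, paradiso = [], []
    for name in sorted(filenames, key=page_number):
        if page_number(name) >= paradiso_start:
            paradiso.append(name)
        else:
            purgatorio.append(name)
    return purgatorio, paradiso
```

If the names turn out not to match the real page numbers, the threshold is meaningless and the files need to be renamed first.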

Awesome! Thanks, I wouldn't have known about that website without you (assuming you're the french guy).
The order on inferno is messed up for some reason though - just like when I tried to download it. Guess it wasn't the addon's fault.
I don't know how that's possible - for example, "f83.jpg" in the MEGA is "f25" on the website (gallica.bnf.fr/ark:/12148/bpt6k10448149/f25.item).
I'll try to sort them out and upload a .pdf, though.

sorry
file order is not correct
didn't use that bash one liner

That might mean some of the illustrations could be missing... I didn't include the cover for example, but it's there in the MEGA (f491.jpg). Maybe it replaced one of the illustrations?

Nevermind, I'd included two useless lines (000.jpg) in the pastebin. The number of files is correct.

pastebin.com/mnPDch3d

The problem is I downloaded the files as native.jpg, native.jpg1, native.jpg2 and so on,
but my rename solution sorted them alphabetically as jpg, jpg1, jpg10, jpg11 instead of numerically.
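
That is the usual text-versus-number sorting trap. A minimal Python sketch of a "natural" sort that avoids it, assuming the files really were numbered in download order; `natural_key` and `ordered_names` are made-up names.

```python
import re


def natural_key(name):
    """Split a name into text and number chunks so 'jpg2' sorts before 'jpg10'."""
    chunks = re.split(r"(\d+)", name)
    # re.split with a capture group alternates text, digits, text, ...
    # so every odd position holds a run of digits.
    return [int(chunk) if index % 2 else chunk for index, chunk in enumerate(chunks)]


def ordered_names(names):
    """Pair each old name with a zero-padded new one, in numeric order."""
    ordered = sorted(names, key=natural_key)
    return [(old, f"{position:03d}.jpg") for position, old in enumerate(ordered, start=1)]
```

Each pair from `ordered_names` can then be fed to `os.rename(old, new)`; the zero-padded names sort the same way in any file manager.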

pastebin.com/N9pJdXHE

basically first 10-11 pages are not in correct order
on both books

Just finished organizing Inferno. Will upload .pdf after I'm done reinstalling Acrobat.

Just saw Purgatorio and Paradiso are in the wrong order as well. Fuck me...

you have to put
f105.jpg
f197.jpg
f305.jpg
f463.jpg
f595.jpg
f607.jpg
f625.jpg
f633.jpg
between f23.jpg and f35.jpg

How pricey was it? Is it in Italian?

Alright, I probably won't upload the pdfs tonight - I'll need to crop and color correct the pages.
Look around later on the /ic/ archive for posts with Dante/Gustave/whatever. It should be on archive.org as well.
Cheers, tonight was the beginning of the first easy-to-find, high quality collection of these gravures.