How can I scrape all images from a 4chan thread?

Attached: beatsMe.jpg (300x200, 9K)

I miss the days when at least two thirds of 4chan was able to do things like this in their sleep, and the other third was smart enough to keep their mouths shut and learn by lurking.

Attached: don'taskstupidquestions1454019846317.png (1280x720, 1.31M)

We're not all gone, user.

Attached: screenshot.png (1920x1080, 355K)

Python + beautiful soup + for loop.
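Something like this as a bare-bones sketch, assuming requests and bs4 are installed; the 'fileThumb' class and the example thread URL are assumptions about the page markup, not gospel:

import os
import requests
from bs4 import BeautifulSoup

thread_url = 'https://boards.4chan.org/g/thread/66419517'  # example thread, swap in your own
soup = BeautifulSoup(requests.get(thread_url).text, 'html.parser')

# 'fileThumb' anchors wrap the thumbnails and point at the full-size image
for a in soup.find_all('a', class_='fileThumb'):
    url = 'https:' + a['href']        # hrefs are protocol-relative (//i.4cdn.org/...)
    name = os.path.basename(url)
    with open(name, 'wb') as f:
        f.write(requests.get(url).content)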

t.StackOverflow fail.

This guy gets it.

i wish to learn. (not FAAGGOT OP)

You guys are so mean. I just don't want to configure wget and figure it out because I'm lazy :(

Just google "4chan thread scraper"...

If you saw the source you wouldn't think that. I know almost nothing about Python or good programming practices. It's a ~120 line script using banner-style tabs, dependent on the deprecated urllib.request.urlretrieve, and I never did get the chance to see if my hack-job OS identification worked. Other anons have posted single-line shell scripts that did the same thing, and I wasn't happy until I built one on my own.

While I'm at it, I'll mention the duplicate image check only works on a per-thread basis. If you have two identical images in two different threads it won't ID the duplicate, because their on-server filenames are different. Image comparison simply isn't feasible when your reaction pic folder has 10,000+ files.

comparing dimensions, then filesize down to the byte?
or too many false positives?

It's against Global Rule 14, so I wouldn't risk getting banned.

How do you link to rules, again? >>>rules/global/14

Let's try that again:
>>>/rules/global/14
>>>/global/rules/14

Attached: 72ece854-a917-4014-8f20-30f2448c3944.jpg (689x122, 18K)

Too many false positives. Gotta get into image comparison.
I have it worked out to a logical process, but getting it all written would take forever. Each step requires me typing "how to do XXX in python3" into a search engine and reading up on it. I manage one line every 10 minutes if I'm lucky. But if you're interested, here's what I've got so far (rough code sketch after the steps).

>Pull external image, we'll call it foo
>Get dimensions and filesize of foo
>Narrow list of all images in shitpost folder by size & dimensions to match foo, we'll call each of these possible matches bar
>We have narrowed the possibilities from 22,000 to 250
>For each bar, run a very small comparison. Compare a pixel area in the center of each image that's ~10% of the original width and height and eliminate any bar that doesn't match.
>We're down to 15 bars
>From here ramp up the size of comparison by 25% until you go through all the bars
>you'll either have a perfect match in which case you disregard foo, or nothing and you keep foo.
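Rough sketch of the per-pair check above, assuming Pillow is installed; the names are made up, and the filesize narrowing and the 25% ramp-up are left out:

from PIL import Image

def center_box(width, height, frac=0.10):
    # box covering ~frac of the width and height, centered in the image
    bw, bh = max(1, int(width * frac)), max(1, int(height * frac))
    left, top = (width - bw) // 2, (height - bh) // 2
    return (left, top, left + bw, top + bh)

def quick_match(foo_path, bar_path, frac=0.10):
    foo, bar = Image.open(foo_path), Image.open(bar_path)
    if foo.size != bar.size:     # dimensions already narrowed, but double-check
        return False
    box = center_box(*foo.size, frac=frac)
    # exact pixel equality; a tolerance would be needed for re-encoded jpegs
    return list(foo.crop(box).getdata()) == list(bar.crop(box).getdata())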

It can be done, but I consider that a huge amount of computing effort when I can just pull something from a local repo to check for duplicates once a month.

Ah I see, an interesting way to go about it but probably not worth the effort. The proof of concept is more than enough for me.

based floens

Attached: Screenshot_20180620-033629.png (720x1280, 224K)

once narrowed by filesize, why not just md5?

Store the md5 of each image you save in an sqlite database, with the md5 as the index. When downloading, check whether the base64-encoded md5 that 4chan includes with each image already exists in the database; if so, it's a duplicate, so skip it.
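Something like this as a sketch (table and function names are made up; the md5 field in the post JSON really is a base64 string):

import sqlite3

db = sqlite3.connect('seen_images.sqlite3')
db.execute('CREATE TABLE IF NOT EXISTS images (md5 TEXT PRIMARY KEY, filename TEXT)')

def already_have(md5_b64):
    # md5_b64 is the base64 md5 string straight out of the post JSON
    return db.execute('SELECT 1 FROM images WHERE md5 = ?', (md5_b64,)).fetchone() is not None

def remember(md5_b64, filename):
    db.execute('INSERT OR IGNORE INTO images VALUES (?, ?)', (md5_b64, filename))
    db.commit()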

Exact opposite reason of . Any jpg compression or stripped data or altered exif an image picks up after some idiot posts it to tumblr or reddit or wherever, before it finds its way back here, will give it a different md5. That, and I think once I got into megabyte-sized images, hash sums in general got really time consuming.

Make your own script. It's an easy and fun programming exercise. I'm about to add MD5 hash comparisons to mine.

No problem on the GNU operating system.
wget -nv -e robots=off -ER html,s.jpg -rHD i.4cdn.org

We're having fun and autism here, don't ruin it.

Use a search engine which respects your privacy.

just use the json api dude

so you want to compare files based on visual similarity but still compare filesize as the first step? That's a retarded mix, because the same picture saved as jpeg twice with compression of 95% and 96% will be basically identical while having totally different sizes.
also:
>Compare a pixel area in the center of each image
this is also bullshit. That means you'll have the same green square for 99% of different rare pepes, for example, or a white square on a shitload of BW images.

magic

>wtf
In what way is that different than casting a spell?

>That's a retarded mix, because the same picture saved as jpeg twice with compression of 95% and 96% will be basically identical while having totally different sizes.
It's a retarded mix that can be fine-tuned, user. That's why we have these discussions.

>That means you'll have the same green square for 99% of different rare pepes, for example, or a white square on a shitload of BW images.
You missed the part where I said to compare a square or rectangle equal to about 10% of the image's dimensions for the first run. If you do the math you'll realize that this cuts 99% of the image comparison out: a 100x100 pixel image would have a 10x10 square checked, which means 100 pixels out of a 10,000 pixel image. You'd have to do 100 of these to match the time it takes to compare the entirety of a single image, and in that time it can get rid of every image that isn't pepe before doing a full comparison on all the rest. Call it bullshit if you like, but it will in all likelihood finish in the time it takes a full comparison to go through 2, maybe 3 images.

findimagedupes on GNU/Linux finds 'visually' similar images. Saved a meme in two different dimensions? findimagedupes prints them for you. Saved a low-quality jpg and a high-quality png? You guessed right.
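Not what findimagedupes does internally, just a rough Python approximation assuming the third-party Pillow and ImageHash packages; a perceptual hash survives rescaling and mild recompression, unlike md5:

from PIL import Image
import imagehash

def looks_like_dupe(path_a, path_b, threshold=5):
    # small Hamming distance between perceptual hashes ~= visually similar
    ha = imagehash.phash(Image.open(path_a))
    hb = imagehash.phash(Image.open(path_b))
    return ha - hb <= threshold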

Attached: 1406361538901.jpg (900x675, 298K)

install gentoo

Damn this is pretty small
This is my attempt, hope you have curl installed
#!/bin/bash
# pull the thread HTML, extract the i.4cdn.org links, and keep only the
# full-size images: each full image URL appears twice in the markup and the
# thumbnail once, so `uniq -d` keeps the duplicated (full-size) ones
for x in $(curl -s "$1" | grep -o 'i.4cdn.org/[^"]*' | uniq -d)
do
    wget "$x"
done

retard here, can you explain what the -ER and -rHD flags do? I can't find them in the wget online manual

You can use wget to behave like curl (one less dependency in your script) like this:
wget -qO-

I'd just like to interject for a moment. What you're referring to as GNU, is in fact, GNU/Linux, or as I've recently taken to calling it, GNU plus Linux. GNU is not an operating system unto itself, but rather another free component of a fully functioning Linux system made useful by the GNU corelibs, shell utilities and vital system components comprising a full OS as defined by POSIX. Many computer users run a modified version of the Linux system every day, without realizing it. Through a peculiar turn of events, the version of Linux which is widely used today is often called GNU, and many of its users are not aware that it is basically the Linux system, developed by the GNU Project. There really is a GNU, and these people are using it, but it is just a part of the system they use. GNU is nothing but userland: the programs in the system that allocates the programs that you run to the machine's resources. The userland is an essential part of an operating system, but useless without a kernel; it can only function in the context of a complete operating system. GNU is normally used in combination with the Linux operating system: the whole system is basically Linux with GNU added, or GNU/Linux. All the so-called Linux distributions are really distributions of GNU/Linux!

Those are combined short options: -ER is -E -R (--adjust-extension plus --reject html,s.jpg), and -rHD is -r -H -D (--recursive, --span-hosts, and --domains i.4cdn.org).

Fucking amazing breh
Thanks for the tip

oh, you literally compare pixels one by one...
kek, I quit

1) there is an API, and docs for it: github.com/4chan/4chan-API

basically you can download a structured file (json), which gives you a list of posts

2) filter out the posts you don't need (if you look at the docs, every image post has a 'tim' field)

3) reconstruct the image link (see the docs for how an image link is formed), and then

4) download and save all the files

example:
open up your fav REPL, I use python 2.7 for example because it happens to be installed

import urllib, json

# grab the thread JSON from the read-only API
url = 'https://a.4cdn.org/g/thread/66419517.json'
response = urllib.urlopen(url)
data = json.loads(response.read())

# filter out the posts that actually have images
imposts = filter(lambda e: 'tim' in e, data["posts"])

# build the list of image urls (tim = renamed filename, ext = extension)
ims = map(lambda e: 'https://i.4cdn.org/g/' + str(e['tim']) + e['ext'], imposts)

# make a function that saves the pix
def imsav(url, name):
    r = urllib.urlopen(url)
    o = open(name, 'wb')
    o.write(r.read())
    o.close()

# save all the images under their on-server names
# (python 2: map is eager; on python 3 use a plain for loop here)
map(lambda u: imsav(u, u.split('/')[-1]), ims)

fourchan-dl still works and makes it easy to review the images you downloaded; you just need to make small tweaks to compile it with Qt 5 on Linux: sourceforge.net/projects/fourchan-dl/

My only gripe is that it doesn't seem to be maintained and it can't readily download images from other chans

if you look at the API there is an md5 hash already provided for you

I'm kinda reluctant to learn Python because I come from an embedded background where C rules all, and I hear Python is slow
desu I still want to try it out

Attached: 3fb85ce1580af24d83f6bde17301e0fe85ed913c7c0d704e81e9a0116cb82e49.png (1063x1063, 204K)

just use C then, it's the same shit: you have fwrite, some url lib and some json parser too, iirc.

I just used python because it happened to be installed at the moment.

Why doesn't 4chan use this? This would easily defeat my shitty bash script for spamming images.

Attached: 00-00.jpg (320x240, 32K)

Are you sure you're scaling to a point where the performance difference will be noticeable? Downloading a bunch of files is definitely not CPU-bound, and Python libraries like aiohttp make it easy to parallelize the actual "download the images" part within a single thread.
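Rough sketch of that, assuming aiohttp is installed; the URL list is whatever you pulled out of the JSON step earlier:

import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as resp:
        with open(url.rsplit('/', 1)[-1], 'wb') as f:
            f.write(await resp.read())

async def download_all(urls):
    # one thread, one connection pool, all downloads in flight at once
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(fetch(session, u) for u in urls))

# asyncio.run(download_all(image_urls))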

Because the spam filter is a joke right now.

Attached: 00-01.jpg (320x240, 32K)

who's paying for the processing power?

However, they already use imagedna or some shit, so I suppose it would be trivial. Maybe they don't have a license for it beyond CP.