How can I scrape all images from a 4chan thread?

Attached: beatsMe.jpg (300x200, 9K)

I miss the days when at least two thirds of 4chan was able to do things like this in their sleep, and the other third was smart enough to keep their mouths shut and learn by lurking.

Attached: don'taskstupidquestions1454019846317.png (1280x720, 1.31M)

We're not all gone, user.

Attached: screenshot.png (1920x1080, 355K)

Python + beautiful soup + for loop.
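Something like this as a bare-bones sketch, assuming requests and bs4 are installed; the 'fileThumb' class and the example thread URL are assumptions about the page markup, not gospel:

import os
import requests
from bs4 import BeautifulSoup

thread_url = 'https://boards.4chan.org/g/thread/66419517'  # example thread, swap in your own
soup = BeautifulSoup(requests.get(thread_url).text, 'html.parser')

# 'fileThumb' anchors wrap the thumbnails and point at the full-size image
for a in soup.find_all('a', class_='fileThumb'):
    url = 'https:' + a['href']        # hrefs are protocol-relative (//i.4cdn.org/...)
    name = os.path.basename(url)
    with open(name, 'wb') as f:
        f.write(requests.get(url).content)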

t.StackOverflow fail.

This guy gets it.

i wish to learn. (not FAAGGOT OP)

You guys are so mean. I just don't want to configure wget and figure it out because I'm lazy :(

Just google "4chan thread scraper"...

If you saw the source you wouldn't think that. I know almost nothing about Python or good programming practices. It's a ~120 line script using banner-style tabs, dependent on the deprecated urllib.request.urlretrieve, and I never did get the chance to see if my hack-job OS identification worked. Other anons have posted single-line shell scripts that did the same thing, and I wasn't happy until I built one on my own.

While I'm at it, I'll mention the duplicate image check only works on a per-thread basis. If you have two identical images in two different threads it won't ID the duplicate, because their on-server filenames are different. Image comparison simply isn't feasible when your reaction pic folder has 10,000+ files.

comparing dimensions, then filesize down to the byte?
or too many false positives?

It's against Global Rule 14, so I wouldn't risk getting banned.

How do you link to rules, again? >>>rules/global/14

Let's try that again:
>>>/rules/global/14
>>>/global/rules/14

Attached: 72ece854-a917-4014-8f20-30f2448c3944.jpg (689x122, 18K)

Too many false positives. Gotta get into image comparison.
I have it worked out to a logical process, but getting it all written would take forever. Each step requires me typing "how to do XXX in python3" into a search engine and reading up on it. I manage one line every 10 minutes if I'm lucky. But if you're interested, here's what I've got so far (rough code sketch after the steps).

>Pull external image, we'll call it foo
>Get dimensions and filesize of foo
>Narrow list of all images in shitpost folder by size & dimensions to match foo, we'll call each of these possible matches bar
>We have narrowed the possibilities from 22,000 to 250
>For each bar, run a very small comparison. Compare a pixel area in the center of each image that's ~10% of the original width and height and eliminate any bar that doesn't match.
>We're down to 15 bars
>From here ramp up the size of comparison by 25% until you go through all the bars
>you'll either have a perfect match in which case you disregard foo, or nothing and you keep foo.
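Rough sketch of the per-pair check above, assuming Pillow is installed; the names are made up, and the filesize narrowing and the 25% ramp-up are left out:

from PIL import Image

def center_box(width, height, frac=0.10):
    # box covering ~frac of the width and height, centered in the image
    bw, bh = max(1, int(width * frac)), max(1, int(height * frac))
    left, top = (width - bw) // 2, (height - bh) // 2
    return (left, top, left + bw, top + bh)

def quick_match(foo_path, bar_path, frac=0.10):
    foo, bar = Image.open(foo_path), Image.open(bar_path)
    if foo.size != bar.size:     # dimensions already narrowed, but double-check
        return False
    box = center_box(*foo.size, frac=frac)
    # exact pixel equality; a tolerance would be needed for re-encoded jpegs
    return list(foo.crop(box).getdata()) == list(bar.crop(box).getdata())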

It can be done, but I consider that a huge amount of computing effort when I can just pull something from a local repo to check for duplicates once a month.

Ah I see, an interesting way to go about it but probably not worth the effort. The proof of concept is more than enough for me.

based floens

Attached: Screenshot_20180620-033629.png (720x1280, 224K)

once narrowed by filesize, why not just md5?

Store the md5 of each image you save in an sqlite database, with the md5 as the index. When downloading, check whether the base64-encoded md5 that 4chan includes with each image already exists in the database; if so, it's a duplicate, so skip it.
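Something like this as a sketch (table and function names are made up; the md5 field in the post JSON really is a base64 string):

import sqlite3

db = sqlite3.connect('seen_images.sqlite3')
db.execute('CREATE TABLE IF NOT EXISTS images (md5 TEXT PRIMARY KEY, filename TEXT)')

def already_have(md5_b64):
    # md5_b64 is the base64 md5 string straight out of the post JSON
    return db.execute('SELECT 1 FROM images WHERE md5 = ?', (md5_b64,)).fetchone() is not None

def remember(md5_b64, filename):
    db.execute('INSERT OR IGNORE INTO images VALUES (?, ?)', (md5_b64, filename))
    db.commit()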

Exact opposite reason of . Any jpg compression or stripped data or altered exif an image picks up after some idiot posts it to tumblr or reddit or wherever, before it finds its way back here, will give it a different md5. That, and I think once I got into megabyte-sized images, hash sums in general got really time consuming.

Make your own script. It's an easy and fun programming exercise. I'm about to add MD5 hash comparisons to mine.

No problem on the GNU operating system.
wget -nv -e robots=off -ER html,s.jpg -rHD i.4cdn.org

We're having fun and autism here, don't ruin it.

Use a search engine which respects your privacy.

just use the json api dude

so you want to compare files based on visual similarity but still compare filesize as the first step? That's a retarded mix, because the same picture saved as jpeg twice with compression of 95% and 96% will be basically identical while having totally different sizes.
also:
>Compare a pixel area in the center of each image
this is also bullshit. That means you'll have the same green square for 99% of different rare pepes, for example, or a white square on a shitload of BW images.

magic

>wtf
In what way is that different than casting a spell?

>That's a retarded mix, because the same picture saved as jpeg twice with compression of 95% and 96% will be basically identical while having totally different sizes.
It's a retarded mix that can be fine-tuned, user. That's why we have these discussions.

>That means you'll have the same green square for 99% of different rare pepes, for example, or a white square on a shitload of BW images.
You missed the part where I said to compare a square or rectangle equal to about 10% of the image's dimensions for the first run. If you do the math you'll realize that this cuts 99% of the image comparison out: a 100x100 pixel image would have a 10x10 square checked, which means 100 pixels out of a 10,000 pixel image. You'd have to do 100 of these to match the time it takes to compare the entirety of a single image, and in that time it can get rid of every image that isn't pepe before doing a full comparison on all the rest. Call it bullshit if you like, but it will in all likelihood finish in the time it takes a full comparison to go through 2, maybe 3 images.

findimagedupes on GNU/Linux finds 'visually' similar images. Saved a meme in two different dimensions? findimagedupes prints them for you. Saved a low-quality jpg and a high-quality png? You guessed right.
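Not what findimagedupes does internally, just a rough Python approximation assuming the third-party Pillow and ImageHash packages; a perceptual hash survives rescaling and mild recompression, unlike md5:

from PIL import Image
import imagehash

def looks_like_dupe(path_a, path_b, threshold=5):
    # small Hamming distance between perceptual hashes ~= visually similar
    ha = imagehash.phash(Image.open(path_a))
    hb = imagehash.phash(Image.open(path_b))
    return ha - hb <= threshold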

Attached: 1406361538901.jpg (900x675, 298K)

install gentoo

Damn this is pretty small
This is my attempt, hope you have curl installed
#!/bin/bash
# pull the thread HTML, extract the i.4cdn.org links, and keep only the
# full-size images: each full image URL appears twice in the markup and the
# thumbnail once, so `uniq -d` keeps the duplicated (full-size) ones
for x in $(curl -s "$1" | grep -o 'i.4cdn.org/[^"]*' | uniq -d)
do
    wget "$x"
done

retard here, can you explain what the -ER and -rHD flags do? I can't find them in the wget online manual

You can use wget to behave like curl (one less dependency in your script) like this:
wget -qO-

I'd just like to interject for a moment. What you're referring to as GNU, is in fact, GNU/Linux, or as I've recently taken to calling it, GNU plus Linux. GNU is not an operating system unto itself, but rather another free component of a fully functioning Linux system made useful by the GNU corelibs, shell utilities and vital system components comprising a full OS as defined by POSIX. Many computer users run a modified version of the Linux system every day, without realizing it. Through a peculiar turn of events, the version of Linux which is widely used today is often called GNU, and many of its users are not aware that it is basically the Linux system, developed by the GNU Project. There really is a GNU, and these people are using it, but it is just a part of the system they use. GNU is nothing but userland: the programs in the system that allocates the programs that you run to the machine's resources. The userland is an essential part of an operating system, but useless without a kernel; it can only function in the context of a complete operating system. GNU is normally used in combination with the Linux operating system: the whole system is basically Linux with GNU added, or GNU/Linux. All the so-called Linux distributions are really distributions of GNU/Linux!

Those are combined short options: -ER is -E -R (--adjust-extension plus --reject html,s.jpg), and -rHD is -r -H -D (--recursive, --span-hosts, and --domains i.4cdn.org).

Fucking amazing breh
Thanks for the tip

oh, you literally compare pixels one by one...
kek, I quit

1) there is an API, and docs for it: github.com/4chan/4chan-API

basically you can download a structured file (json), which gives you a list of posts

2) filter out the posts you don't need (if you look at the docs, every image post has a 'tim' field)

3) reconstruct the image link (see the docs for how an image link is formed), and then

4) download and save all the files

example:
open up your fav REPL, I use python 2.7 for example because it happens to be installed

import urllib, json

# grab the thread JSON from the read-only API
url = 'https://a.4cdn.org/g/thread/66419517.json'
response = urllib.urlopen(url)
data = json.loads(response.read())

# filter out the posts that actually have images
imposts = filter(lambda e: 'tim' in e, data["posts"])

# build the list of image urls (tim = renamed filename, ext = extension)
ims = map(lambda e: 'https://i.4cdn.org/g/' + str(e['tim']) + e['ext'], imposts)

# make a function that saves the pix
def imsav(url, name):
    r = urllib.urlopen(url)
    o = open(name, 'wb')
    o.write(r.read())
    o.close()

# save all the images under their on-server names
# (python 2: map is eager; on python 3 use a plain for loop here)
map(lambda u: imsav(u, u.split('/')[-1]), ims)

fourchan-dl still works and makes it easy to review the images you downloaded; you just need to make small tweaks to compile it with Qt 5 on Linux: sourceforge.net/projects/fourchan-dl/

My only gripe is that it doesn't seem to be maintained and it can't readily download images from other chans

if you look at the API there is an md5 hash already provided for you

I'm kinda reluctant to learn Python because I come from an embedded background where C rules all, and I hear Python is slow
desu I still want to try it out

Attached: 3fb85ce1580af24d83f6bde17301e0fe85ed913c7c0d704e81e9a0116cb82e49.png (1063x1063, 204K)

just use C then, it's the same shit: you have fwrite, some url lib and some json parser too, iirc.

I just used python because it happened to be installed at the moment.

Why doesn't 4chan use this? This would easily defeat my shitty bash script for spamming images.

Attached: 00-00.jpg (320x240, 32K)

Are you sure you're scaling to a point where the performance difference will be noticeable? Downloading a bunch of files is definitely not CPU-bound, and Python libraries like aiohttp make it easy to parallelize the actual "download the images" part within a single thread.
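Rough sketch of that, assuming aiohttp is installed; the URL list is whatever you pulled out of the JSON step earlier:

import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as resp:
        with open(url.rsplit('/', 1)[-1], 'wb') as f:
            f.write(await resp.read())

async def download_all(urls):
    # one thread, one connection pool, all downloads in flight at once
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(fetch(session, u) for u in urls))

# asyncio.run(download_all(image_urls))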

Because the spam filter is a joke right now.

Attached: 00-01.jpg (320x240, 32K)

who's paying for the processing power?

However, they already use imagedna or some shit, so I suppose it would be trivial. Maybe they don't have a license for it beyond CP.