HTML Scraping

Question

Dylan Anderson

Take a look at this webpage. Should be easy enough to scrape the prices, right?
goatbots.com/prices-boosters
Wrong. After loading, the page makes an AJAX request that returns a giant image containing the prices (pic related) and style information that specifies the proper offset for each little div in the price table. I assume the intent is specifically to prevent HTML scraping.
Is this next level autism? I can still automate collection of the prices, but with the additional work of using OCR on the image. Meanwhile they've turned a few characters into an additional 40 KB request. Is the goal here to waste CPU cycles?
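To make the mechanism concrete: if the style information really is a set of per-div offsets into one big image, each price cell is just a rectangle in that image. Below is a minimal sketch, assuming the rules look like ordinary CSS sprite rules (`background-position` plus `width`/`height`). The selector names and the sample CSS are made up for illustration; the real response may be shaped differently.

```python
import re

# Hypothetical rule format: "#some-id { background-position: -Xpx -Ypx; width: Wpx; height: Hpx; }"
RULE = re.compile(r"#(?P<id>[\w-]+)\s*\{(?P<body>[^}]*)\}")
POSITION = re.compile(r"background-position:\s*(-?\d+)px\s+(-?\d+)px")
SIZE = re.compile(r"width:\s*(\d+)px;\s*height:\s*(\d+)px")


def crop_boxes(css: str) -> dict[str, tuple[int, int, int, int]]:
    """Map each element id to the (left, top, right, bottom) region of the image it displays."""
    boxes = {}
    for rule in RULE.finditer(css):
        pos = POSITION.search(rule["body"])
        size = SIZE.search(rule["body"])
        if not pos or not size:
            continue  # skip rules that are not sprite cells
        # A background-position of -120px -40px shifts the image left and up,
        # so the visible region starts at x=120, y=40 inside the image.
        left, top = -int(pos[1]), -int(pos[2])
        width, height = int(size[1]), int(size[2])
        boxes[rule["id"]] = (left, top, left + width, top + height)
    return boxes


# Made-up sample, not taken from the site.
sample_css = """
#price-1 { background-position: -120px -40px; width: 40px; height: 12px; }
#price-2 { background-position: 0px -52px; width: 40px; height: 12px; }
"""
print(crop_boxes(sample_css))
# {'price-1': (120, 40, 160, 52), 'price-2': (0, 52, 40, 64)}
```

Each box is in (left, upper, right, lower) order, which is the form an image library such as Pillow accepts for cropping, so every cell could be cut out and passed to OCR on its own instead of OCR-ing the whole table at once.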

Attached: autism.png (1368x720, 20K)

July 28, 2018 - 17:32

Bentley Rivera

>Is this next level autism?
It's pretty high up there.

Why not just scrape off one of the major sites? TCG, Goldfish, etc? Do you specifically want this site?

July 28, 2018 - 17:45

Gavin Barnes

I am a lurker, but I am curious: what are you doing, and for what purpose?

July 28, 2018 - 17:51

Alexander Foster

That's hilarious dude I love it.

July 28, 2018 - 17:54

Henry Rodriguez

Great way to deter script kiddies from automatically scraping your site

Attached: 1532503287981.gif (340x308, 1.97M)

July 28, 2018 - 17:57

Austin Evans

This is a specific vendor on Magic Online, so I don't think TCGPlayer is relevant, and Goldfish prices are often out of date because online prices change rapidly (and they don't list Goatbots prices at all).

I was just looking for a way to keep track of the Treasure Chest buy price, because I win them and want to know when I should sell them, and then I wanted to solve the puzzle of "why don't the prices appear in the HTML?".

July 28, 2018 - 17:59

Luke Nelson

>posts his own site
fuck off shill

July 28, 2018 - 18:26

Landon Sullivan

That would be pretty dumb, considering the intersection of "people who would care about Goatbots" (Magic Online players) and "people who haven't already heard of Goatbots" is minuscule.

July 28, 2018 - 18:43

James Fisher

OP, just get the HTML, render it, convert it to an image, and pass it through a program that converts images to text
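The rendering and OCR steps depend on external tools, so they are left out here. The last step of that pipeline, turning raw OCR text back into prices, might look roughly like this. It is only a sketch: the line layout ("name, whitespace, price") and the sample text are assumptions, and the prices are invented.

```python
import re

# Assumed line shape: an item name, some whitespace, then a decimal price at the end.
PRICE_LINE = re.compile(r"^(?P<name>.+?)\s+(?P<price>\d+\.\d{2,3})\s*$")


def parse_prices(ocr_text: str) -> dict[str, float]:
    """Pull (name, price) pairs out of OCR output, ignoring lines that do not fit."""
    prices = {}
    for line in ocr_text.splitlines():
        match = PRICE_LINE.match(line.strip())
        if match:
            prices[match["name"]] = float(match["price"])
    return prices


# Made-up OCR output for illustration only.
sample = """
Treasure Chest Booster   2.15
Example Booster 1.98
--- some noise the OCR picked up ---
"""
print(parse_prices(sample))
# {'Treasure Chest Booster': 2.15, 'Example Booster': 1.98}
```

Lines that do not end in something price-shaped are dropped silently, which is convenient for noisy OCR but also means a misread price disappears without warning; a real script would probably want to log the rejects.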

July 28, 2018 - 20:29

Brayden Allen

For a while my website had completely randomized divs to prevent scraping.

But eventually I figured the extra maintenance wasn't worth it.
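The post does not say how the randomization worked. One plausible reading is that class names were regenerated on every response so a scraper cannot rely on fixed selectors. A minimal sketch under that assumption follows; the `{row}`/`{price}` placeholder syntax and the function name are hypothetical.

```python
import secrets


def randomize_classes(html_template: str, logical_names: list[str]) -> tuple[str, dict[str, str]]:
    """Replace {name} placeholders with fresh random class names for this response."""
    # Prefix with a letter so the result is always a valid CSS class name.
    mapping = {name: "c" + secrets.token_hex(4) for name in logical_names}
    html = html_template
    for name, random_name in mapping.items():
        html = html.replace("{" + name + "}", random_name)
    return html, mapping


template = '<div class="{row}"><span class="{price}">2.15</span></div>'
html, mapping = randomize_classes(template, ["row", "price"])
print(html)  # class names differ on every call
```

The same mapping has to be applied to the stylesheet served with the page, which is presumably part of the extra maintenance the post complains about.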

July 28, 2018 - 20:31

Chase Flores

jesus fucking christ.

July 28, 2018 - 20:46

David Young

tl;dr ya. this guy. You could use Puppeteer, take a screen grab of the output, run it through a sufficiently good column-aware OCR, and pray, I guess.
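"Column aware" here roughly means using the position of each recognized word, not just its text. Most OCR engines can report word bounding boxes; given those, rebuilding the table is mostly bookkeeping. A rough sketch, assuming the boxes have already been extracted and the column start positions are known (the `Word` type, coordinates, and prices below are all made up):

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: int  # left edge in pixels
    y: int  # top edge in pixels


def to_table(words: list[Word], column_starts: list[int], row_tolerance: int = 5) -> list[list[str]]:
    """Group words into rows by y, then into cells by which column their x falls in."""
    rows: list[list[Word]] = []
    for word in sorted(words, key=lambda w: w.y):
        # A word joins the current row if it is vertically close to that row's first word.
        if rows and abs(rows[-1][0].y - word.y) <= row_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])

    table = []
    for row in rows:
        cells: list[list[str]] = [[] for _ in column_starts]
        for word in sorted(row, key=lambda w: w.x):
            # column_starts must be sorted; pick the last column starting at or before x.
            col = max(bisect_right(column_starts, word.x) - 1, 0)
            cells[col].append(word.text)
        table.append([" ".join(cell) for cell in cells])
    return table


words = [
    Word("Treasure", 10, 100), Word("Chest", 80, 101), Word("2.15", 300, 99),
    Word("Example", 10, 120), Word("Booster", 75, 121), Word("1.98", 300, 120),
]
print(to_table(words, column_starts=[0, 250]))
# [['Treasure Chest', '2.15'], ['Example Booster', '1.98']]
```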

this website is literally garbage though. to go out of the way to make such a poorly designed webapp. lmao

July 28, 2018 - 20:47

Wyatt Phillips

>I assume the intent is specifically to prevent HTML scraping.
don't attribute to malice what can be explained by ignorance

July 28, 2018 - 20:48

Ryan Jenkins

nah, this isn't ignorant. this is deliberate bullshit.
who the fuck is generating a raster server side and some css to lay out a fucking table?

July 28, 2018 - 20:50

David Roberts

This is masterful web dev

>Is this a trap?
>No, they must be stooopid
lel you'd make a great strategist wouldn't you...

July 28, 2018 - 20:51

Mason Sanchez

I also had random text that you have to filter through.

I still do that when I paste my email address with overly complex css.
Because the css was so random there was no way you could filter out the text in a consistent manner.

July 28, 2018 - 20:52

Thomas Collins

I'm almost compelled to actually make a webscraper for this site just to put this fag into an arms race of pure autism.

July 28, 2018 - 20:52

Joseph Gutierrez

Now I'm thinking about it. I'm considering making infinite text traps for shits and giggles.

July 28, 2018 - 21:00

William Smith

Why not get the prices from the card pages directly?

goatbots.com/card/treasure-chest-booster
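If the card pages do carry the price as plain text in the HTML, the standard library is enough to pull it out. This is a sketch only: the thread does not show the page markup, so the `price` class name and the sample HTML are hypothetical stand-ins, and the number is invented.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collect the text of every element carrying a given class (flat elements only)."""

    def __init__(self, price_class: str):
        super().__init__()
        self.price_class = price_class
        self._in_price = False
        self.prices: list[str] = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.price_class in classes:
            self._in_price = True

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the price element,
        # so markup nested inside the price element is not handled.
        self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(data.strip())


# Hypothetical markup; inspect the real page to find the actual element and class.
sample_html = '<div><span class="label">Buy</span><span class="price">2.15</span></div>'
parser = PriceParser("price")
parser.feed(sample_html)
print(parser.prices)  # ['2.15']
```

Fetching the page itself could be done with `urllib.request.urlopen` before feeding the decoded body to the parser.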

July 28, 2018 - 21:07

Robert Bell

that wouldn't stop even the most basic scrape

July 28, 2018 - 21:12

Brody Sanders

It would make sorting through the data much harder.

July 28, 2018 - 22:14

Isaac Wood

TCGplayer is protected by Incapsula

July 28, 2018 - 22:23

Robert Baker

That's what I'm doing now. I was poking around looking for some kind of API instead of scraping that page and saw that the OP page was making an AJAX request, and then uncovered this autism.

July 28, 2018 - 22:28
