Google scrapes every page on Amazon every second of every minute of every hour of every day

Google scrapes every page on Amazon every second of every minute of every hour of every day.
I wrote a Python script (admittedly in an amateurish fashion) to scrape an Amazon URL for ONE of my company’s (my company as in the company I work for) products and got my entire office building blacklisted for running the script against Amazon. Why is it that Google’s web crawling doesn’t overload their servers, but me taking literally less than 1 KB of text data from Amazon is enough for them to block me for “overloading” their site?
Shouldn’t I be allowed to scrape our own products off Amazon? Is it just because “fuck you, little guy”?
Does anyone know if there is a way around this? I tried adding a line to the code to disguise the traffic as coming from a browser, but I still got the same 503 error.
Any scraping experts here who know best practices for what I’m doing?
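
For reference, a minimal sketch of what that "disguise it as a browser" line usually amounts to, assuming the script uses the `requests` library; the URL/ASIN is a placeholder, not the actual product page:

```python
# Rough sketch of spoofing a browser User-Agent with requests.
# On its own this is often not enough: the 503 can also be triggered by
# request rate, missing cookies, and other bot-detection signals.
import requests

headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}

resp = requests.get("https://www.amazon.com/dp/EXAMPLEASIN", headers=headers)
print(resp.status_code)  # 503 here generally means the request was flagged as automated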


VPN or Tor before every script run?

learn to use the official api

or

learn to impersonate a browser better

and

stop sending requests every 5ms
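
On that last point, a loose sketch of what "not every 5 ms" might look like, again assuming `requests`; the ASINs and the 10-second delay are placeholders, not a known safe rate:

```python
# Space requests out by whole seconds instead of milliseconds, and reuse one
# session so headers and cookies persist. Values here are illustrative only.
import time
import requests

session = requests.Session()
session.headers.update({
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
})

for asin in ["EXAMPLEASIN1", "EXAMPLEASIN2"]:
    resp = session.get(f"https://www.amazon.com/dp/{asin}")
    print(asin, resp.status_code)
    time.sleep(10)  # seconds between requests, not 5 ms
```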

>Is it just because “fuck you, little guy”?
Absolutely. Google scraping them is fine - they need Amazon products to show up in Google search results - but you? What the fuck do you have to offer them?

It's impossible to overstate how much of a difference good code and bad code make in processes like this.

>why won't they let me trash their site?? wtf??? google does it and they only bring massive amounts of traffic to their site
>this is EXACTLY the same

Hey Jeff, how’s it going?

maybe start with headless chrome

Try using a real browser with Selenium; it usually works for me.
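
If anyone wants the concrete version of the headless Chrome / Selenium suggestion, a hedged sketch assuming Selenium 4+ and a local Chrome install; the URL is a placeholder:

```python
# Drive a real (headless) Chrome so the request comes from an actual browser.
# Selenium 4.6+ can fetch a matching chromedriver on its own.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")         # no visible window
options.add_argument("--window-size=1920,1080")

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://www.amazon.com/dp/EXAMPLEASIN")
    print(driver.title)                        # e.g. the product title
finally:
    driver.quit()
```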

Try using a rotating proxy, so that requests go out from a different IP each time. That way Amazon won't be able to track you.
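
A minimal sketch of the rotating-proxy idea, assuming `requests` and a pool of proxies you actually have access to; the proxy addresses are made up:

```python
# Cycle each request through a different proxy so they come from different IPs.
# Proxy URLs are placeholders; you still need to keep the request rate sane.
import itertools
import requests

PROXIES = itertools.cycle([
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
])

def fetch(url):
    proxy = next(PROXIES)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)

resp = fetch("https://www.amazon.com/dp/EXAMPLEASIN")
print(resp.status_code)
```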

Why?
• The traffic Google will send them is obviously more valuable to them than whatever you're doing.
• Google isn't scraping 'every page, every second, every minute'. Amazon is high-priority, so a change won't go unnoticed by Google for more than a day or so, but at least a few hours should be expected between re-scrapes.
• Google and Amazon have most likely come to a formal agreement regarding this data traffic, allowing Google to get its info without eating all the bandwidth.
• Amazon has an API for getting almost all the information you could possibly want from the webpage, and *really* wants you to use it rather than screwing with their page-view metrics.

You should probably disclose in the User-Agent that you're actually a bot and follow whatever rules they have in their robots.txt.
You are not good enough to pass yourself off as a normal user, and you will get fucked by them if that's what you're trying to do.
Also, stop sending a request every cycle, you dingus.
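
A sketch of that honest-bot approach: a User-Agent that identifies the script, a robots.txt check, and a hard throttle. The bot name, contact address, and delay are placeholders:

```python
# Identify yourself, respect robots.txt, and throttle hard.
import time
import urllib.robotparser
import requests

BOT_UA = "MyCompanyPriceChecker/0.1 (contact: someone@example.com)"

rp = urllib.robotparser.RobotFileParser("https://www.amazon.com/robots.txt")
rp.read()

url = "https://www.amazon.com/dp/EXAMPLEASIN"
if rp.can_fetch(BOT_UA, url):
    resp = requests.get(url, headers={"User-Agent": BOT_UA})
    print(resp.status_code)
    time.sleep(30)  # wait a long time before the next request
else:
    print("robots.txt disallows this path for this user agent")
```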

>got my entire office building blacklisted for running the script against Amazon
Run your script from AWS

This is why these cunts need to be broken up.

Google will most definitely NOT scrape data from Amazon. Amazon will give them a live pipe to their data instead.

a: literally do this from a browser (as a userscript or add-on or something, so your requests look like they're from a browser)
b: space your requests out properly (importantly, don't do things exactly every ___ interval; see the sketch below)
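
The sketch for point (b): jitter the interval so requests never land on an exact timer. The base delay is a guess, not a documented limit:

```python
# Randomize the wait so requests don't fire on a perfectly regular schedule.
import random
import time

BASE_DELAY = 60  # seconds; illustrative, not a known safe value

def polite_sleep():
    # +/- 50% jitter around the base delay
    time.sleep(BASE_DELAY * random.uniform(0.5, 1.5))
```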

>scrapes every page on Amazon every second of every minute of every hour of every day
They don't have that kind of power, or the need.

They probably don't scrape more than once per month on most sites.

>my company as in the company I work for
It's obvious you don't own the company...


At least half of the people here claim to be self employed

>claim to be self employed
While claiming unemployment, NEET-bucks, or both.

Impossible because there is none?

ahhh ha...

This is retarded.
Probably this.