Web scraping. How legal is it?

Jason Turner

I am looking for info and opinions about the current legal status of web scraping. Let's hear it.

September 2, 2019 - 15:20

Jacob Baker

if you don't scrape like an idiot and you aren't a business, there's a low probability anyone will even notice

September 2, 2019 - 15:30

Jackson Young

What do you mean by that, exactly? Scraping without a proxy or VPN? Or do you mean the amount of data scraped?

September 2, 2019 - 15:35

Cooper Bailey

Dumb shit like leaving the default user agent headers, so the admin sees 1000 log messages with the user agent "python requests 4.20", or making an absurd number of requests that no human would ever make
t. speaking from experience
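
For anyone wondering what "not leaving the default user agent" looks like in practice, here is a minimal sketch using only the standard library. The bot name, URL and contact address are made-up placeholders, and nothing is actually sent over the network; it only builds the request object.

```python
import urllib.request

# Hypothetical values: the bot name, URL and contact address are placeholders.
USER_AGENT = "example-crawler/0.1 (+https://example.com/bot; contact: ops@example.com)"


def build_request(url: str) -> urllib.request.Request:
    """Return a Request carrying an explicit User-Agent instead of the library default."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


req = build_request("https://example.com/page")
# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

Putting contact details in the string is a design choice, not a requirement: it gives an admin someone to email before reaching for a ban.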

September 2, 2019 - 16:03

Josiah Ward

Another mistake would be getting an API key or something similar and then changing IP. Looks fishy as fuck if you pay attention

>hmm looks like a human logged in via an American ip
>then all of a sudden we got thousands of requests from an Indian datacenter ip

September 2, 2019 - 16:06

Jayden Hill

So if, for example, I send an HTTP request (with request headers mimicking Chrome) only for the files that a standard web browser loads every time it opens the page, can that be detected as scraping? And even if it can be detected, would it be illegal to access public documents that any web browser could load?

September 2, 2019 - 16:23

Elijah Cooper

Basically impossible to detect and not illegal

You'd have to DDoS a site quite badly for it to even approach being illegal. Otherwise, on certain services you might violate their ToS by scraping, but since it's so hard to detect they probably won't do shit.

September 2, 2019 - 16:27

Jack Perez

Does the ToS have any weight whatsoever in court? Technically you don't need to "visit" the site to scrape it, so you would never see any pop up or domain.com/TermsOfService.

September 2, 2019 - 16:35

Leo Campbell

No service would ever waste their time bringing a ToS violation to court. They just ban you from the service and that's that.

September 2, 2019 - 16:37

Benjamin Hill

They ban your IP? Or are there other ways?

September 2, 2019 - 16:41

Josiah Evans

It's not that easy. If you are overloading the system there are ways to detect you, since you won't parse any of the code you download. Even something as simple as "body::after { content: ''; background: url('/not-a-bot') }" will catch you.
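
A rough sketch of how the server side of that trick might work, assuming the access log has already been parsed into (client IP, request path) tuples. The function name, the threshold and the sample log are invented for illustration; real log formats and real detection systems will differ.

```python
from collections import defaultdict

BEACON_PATH = "/not-a-bot"  # the URL referenced from the stylesheet


def suspected_bots(access_log, min_page_hits=3):
    """Return client IPs that fetched pages but never the CSS beacon.

    access_log is an iterable of (client_ip, request_path) tuples.
    """
    page_hits = defaultdict(int)
    saw_beacon = set()
    for ip, path in access_log:
        if path == BEACON_PATH:
            saw_beacon.add(ip)
        else:
            page_hits[ip] += 1
    return sorted(
        ip for ip, hits in page_hits.items()
        if hits >= min_page_hits and ip not in saw_beacon
    )


# Hypothetical sample log: the first client renders CSS, the second does not.
log = [
    ("10.0.0.1", "/"), ("10.0.0.1", "/not-a-bot"), ("10.0.0.1", "/a"), ("10.0.0.1", "/b"),
    ("10.0.0.2", "/"), ("10.0.0.2", "/a"), ("10.0.0.2", "/b"),
]
print(suspected_bots(log))  # ['10.0.0.2']
```

Note that any client that blocks CSS or images would be flagged by this check too, so it is a hint rather than proof.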

September 2, 2019 - 16:46

Jaxon Garcia

Make sure to check for robots.txt and you're good to go
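
Checking it doesn't have to be manual. A small sketch with Python's standard `urllib.robotparser`, parsing an inline example file instead of fetching one; the rules, the crawler name and the URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download the site's own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "example-crawler" is a placeholder user agent name.
print(parser.can_fetch("example-crawler", "https://example.com/public/page"))   # True
print(parser.can_fetch("example-crawler", "https://example.com/private/data"))  # False
print(parser.crawl_delay("example-crawler"))                                    # 5
```

For a live site you would call `set_url()` and `read()` on the parser instead of `parse()`.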

September 2, 2019 - 16:46

Ryder Phillips

selenium

September 2, 2019 - 16:59

Parker Bailey

and if they disallow any part of what I am scraping... what then?
I am not scraping HTML or CSS. Also, my requests are slower than a browser's requests.

September 2, 2019 - 17:00

Jace Fisher

I'm sure they must get plenty of false positives when autists like me use umatrix

September 2, 2019 - 17:00

Noah Baker

>if they disallow any part of what I am scraping?
What kind of question is this? Comply, obviously.

September 2, 2019 - 18:46

Landon Brooks

channel it through tor

September 2, 2019 - 18:51

Jacob Turner

>How legal is it?
who cares? nothing is illegal until you're in prison.

September 2, 2019 - 19:06

Zachary Howard

recent legal battles have favored the scraper
it's still trespassing and you can still get sued
if you scrape at a human rate no one will ever stop you
if you try to sell the data or in some way sell a product connected to the data, you'd better be working behind an LLC or you're getting sued into the ground
-t.actual professional scraper

September 2, 2019 - 19:22

Kayden Torres

what are you trying to accomplish? more often than not you don't really need to worry about legality

I both work for a company that does scraping and run a small business that relies on scraped data, which I repackage and sell as a subscription newsletter. making dank bucks and haven't been sued yet.

September 2, 2019 - 19:26

Easton Murphy

>Comply obviously!
The point is that there are laws. Just because I want my neighbor to not exist doesn't mean I can enforce it. All I am asking is which parts of scraping are actually against written law and which are simply wishful thinking on their part, so I don't break the law.

September 2, 2019 - 20:44

Caleb Taylor

if it's a smaller site you could probably just convince them to give you a data dump for a one-time fee

September 2, 2019 - 20:50

Carter Walker

Well. That's the best part. The data I am downloading is copyrighted, but not by the site I am downloading it from. Kinda like youtube. Now that I think about it, are youtube downloaders illegal? Because that's in the ballpark of what I am doing.

September 2, 2019 - 20:51

Jeremiah Walker

that answer was for you.

September 2, 2019 - 20:54

Dylan Scott

You don't comply with robots.txt
You do crime

Don't be a criminal!

September 2, 2019 - 21:01

Liam Nelson

robots.txt is advisory. You don't need to follow it

Anyway, I've been crawling and downloading around 7TB of yiff from all around the internet, and despite sometimes hammering websites and including my contact data in the user agent, nobody seems to care.
However, that's probably because I'm doing a lot of "If-Modified-Since" requests that usually don't even reach the backend.
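
For reference, a conditional request of that kind could look roughly like this with the standard library. This is a sketch under assumptions: the helper names and the URL are invented, and whether the server or a cache in front of it honors the header depends on the site.

```python
import urllib.error
import urllib.request
from email.utils import formatdate


def conditional_request(url, last_fetch_epoch):
    """Build a GET request that asks for the resource only if it changed since last_fetch_epoch."""
    http_date = formatdate(last_fetch_epoch, usegmt=True)
    return urllib.request.Request(url, headers={"If-Modified-Since": http_date})


def fetch_if_changed(url, last_fetch_epoch):
    """Return the body as bytes, or None if the server answers 304 Not Modified."""
    try:
        with urllib.request.urlopen(conditional_request(url, last_fetch_epoch)) as resp:
            return resp.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:  # urllib reports 304 as an HTTPError
            return None
        raise


# Only builds the request; nothing is sent here. The URL is a placeholder.
req = conditional_request("https://example.com/image.png", 0)
print(req.get_header("If-modified-since"))  # Thu, 01 Jan 1970 00:00:00 GMT
```

A 304 response carries no body, which is why repeat crawls done this way are cheap for both sides.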

September 2, 2019 - 22:26

Austin Collins

I write crawlers for a living, AMA
We only need proxies for less than a dozen websites (out of a hundred), most websites don't monitor their traffic at all.
reCAPTCHA is another concern.

September 2, 2019 - 23:03

Easton Gutierrez

>are youtube downloaders illegal?

when you watch a youtube video, you download its content to RAM, so it can't possibly be illegal. Redistributing it would be.

September 2, 2019 - 23:10

Christian Wright

Unless you start redistributing copyrighted material, there is nothing illegal about that.

September 2, 2019 - 23:16

Chase Miller

that's what I thought too, but I really wanted to make sure

September 2, 2019 - 23:44

Luke Gomez

I work for a company that does web scraping, it's perfectly legal. They've been doing it for years. The information is publicly accessible, and nowhere is it written that you can only access it through a browser (and if it were, good luck proving my selenium client isn't a browser). You can even use their public APIs and use cloudflare solvers.

The thing that gets you IP banned is doing 20k requests in a minute. Headers only matter for access, like putting a token into them or specifying the content type.
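
One way to stay well under that kind of request rate is a simple throttle between requests. A minimal sketch: the class name and the interval are arbitrary choices, and the injectable `clock` and `sleep` parameters exist only so the behavior can be checked without real waiting.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Demo with a short interval; a real crawler would pick something far larger.
throttle = Throttle(min_interval=0.05)
start = time.monotonic()
for _ in range(3):
    throttle.wait()  # a real crawler would issue one request here
elapsed = time.monotonic() - start
print(f"3 throttled iterations took {elapsed:.2f}s")
```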

September 2, 2019 - 23:45

Colton Harris

Does a video stream (like with youtube downloaders) count as one request, or do the requests pile up while streaming a video to a hard drive (from the server's perspective)?

September 2, 2019 - 23:51

Noah Reyes

I was thinking about web scraping sites with streaming audio to compile into something like a general internet radio

September 2, 2019 - 23:54

Easton Lopez

It counts as one huge request, what I said applied to single-part requests like html and json. If you're scraping video from a company with virtually infinite resources like google, they'll obviously be able to tell what you're doing unless it's something like a frontend to youtube where the user requests a video. Newpipe exists and works, so you're good there. But if you're downloading batchloads of videos, yeah I doubt they'll let it keep going long before you hit a captcha. There are solvers for those (see buster), but life will be suffering.

September 3, 2019 - 00:06

Chase Martin

>robots.txt is advisory

Attached: 1542925089494.jpg (848x480, 47K)

September 3, 2019 - 00:14

Cameron Robinson

If you aren't a "security" business working for the government*.

September 3, 2019 - 00:15

Anthony Mitchell

>I write crawlers for a living
What? There's a market for crawlers? Who buys them?
Isn't writing a crawler a trivial task?

September 3, 2019 - 00:16

Carson Jackson

>isn't writing a crawler a trivial task?
For getting basic html from a static page? Sure. But you're way out of your depth if you think that's how normal webpages work these days.
>fuck solving a captcha to get a token to get another token to open a websocket to finally maybe get data

September 3, 2019 - 00:44

Noah Diaz

youtube-dl grabs tons of shit from YouTube just fine for me. I've downloaded thousands of videos with it.

September 3, 2019 - 00:55

Jaxson Reed

I don't just write them, I manage clusters, databases and stuff.
Many times it's not trivial: authentication, rendering, proxies, captchas and retry policy are some things to think about. But a lot of the time it is trivial.
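
To give an idea of what a retry policy can mean, here is one common pattern, exponential backoff with jitter, as a sketch. This is not necessarily what the poster uses; the function name, the defaults and the simulated failure are all assumptions made for the example.

```python
import random
import time


def retry(fetch, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, backing off exponentially (plus jitter) between failures.

    Only OSError and its subclasses (connection errors, timeouts, urllib's URLError)
    are retried; the last failure is re-raised.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except OSError:
            if attempt == attempts - 1:
                raise
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))


# Simulated flaky download: fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network failure")
    return "ok"


# sleep is stubbed out so the demo finishes instantly.
result = retry(flaky, sleep=lambda seconds: None)
print(result, calls["n"])  # ok 3
```

The jitter keeps a cluster of crawlers from retrying in lockstep against the same site.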

September 3, 2019 - 01:14

Lincoln Smith

I'm sure there's a python library for that.
So... Who buys crawlers? For what purpose?

September 3, 2019 - 01:24

Logan Scott

nobody buys crawlers, they buy data.

September 3, 2019 - 01:24

Bentley Martin

What kind of data?

September 3, 2019 - 01:42

Charles Brown

social media, news, stock prices, competitor's retail prices, all kinds of shit.

September 3, 2019 - 01:48

Sebastian Martinez

>social media
Like, posts containing specific words?

September 3, 2019 - 01:50

Ryan Brown

like everything ever posted by any slightly famous person.

September 3, 2019 - 01:53

Ryan Richardson

Can anybody explain the process of web/screen scraping? Wikipedia's explanation isn't that good

September 3, 2019 - 02:07

Grayson Rogers

Oh, like that youtube downloader example? Does that count as scraping? How would that work on other sites?

September 3, 2019 - 02:10
