Web scraping. How legal is it?

I am looking for info and opinions about the current legal status of web scraping. Let's hear it.

if you don't scrape like an idiot and you aren't a business, there's a low probability anyone will even notice

What does that even mean? Do you mean scraping without a proxy or VPN, or do you mean the amount of data scraped?

Dumb shit like leaving the default User-Agent header, so the admin sees 1000 log lines with the user agent "python-requests/4.20", or making an absurd number of requests that no human would ever make
t. speaking from experience
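
For the user agent part, here's a rough sketch of setting the header explicitly instead of shipping the library default. It uses the stdlib `urllib` rather than `requests` so it runs anywhere; the bot name, contact address, and `build_request` helper are made up for illustration.

```python
import urllib.request

# Hypothetical bot name and contact address, purely for illustration.
USER_AGENT = "example-crawler/0.1 (+mailto:admin@example.com)"


def build_request(url: str) -> urllib.request.Request:
    """Return a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


req = build_request("https://example.com/page")
# urllib normalizes header names with str.capitalize(), hence "User-agent".
print(req.get_header("User-agent"))
```

Nothing is sent here; passing `req` to `urllib.request.urlopen` would perform the actual request.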

Another mistake would be getting an API key or something similar and then changing IP. Looks fishy as fuck if anyone is paying attention

>hmm looks like a human logged in via an American ip
>then all of a sudden we got thousands of requests from an Indian datacenter ip

So if, for example, I send an HTTP request (with request headers mimicking Chrome) for the files that a standard web browser loads every time it opens the page, and nothing more, can that be detected as scraping? And even if it can be detected, would it be illegal to access public documents that any web browser could load?

Basically impossible to detect and not illegal

You'd have to DDoS a site quite badly for it to even approach being illegal. Otherwise, on certain services you might violate their ToS by scraping, but since it's so hard to detect they probably won't do shit.

Does the ToS have any weight whatsoever in court? Technically you don't need to "visit" the site to scrape it, so you would never see any pop up or domain.com/TermsOfService.

No service would ever waste their time bringing a ToS violation to court. They just ban you from the service and that's that.

They ban your IP? Or are there other ways?

It's not that easy. If you are overloading the system, there are ways to detect you, because a scraper doesn't parse or render the code it downloads. Even something as simple as "body::after { content: ""; background: url("/not-a-bot") }" will catch you.
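
A rough sketch of how the server side of that CSS trap could work, assuming the access log is already reduced to (client, path) pairs. The function name, the log shape, and the `min_pages` threshold are all hypothetical; real setups will differ.

```python
from collections import defaultdict

TRAP_PATH = "/not-a-bot"  # the CSS-only resource from the post above


def clients_skipping_trap(log, min_pages=3):
    """Return clients that fetched at least min_pages pages but never the trap."""
    pages = defaultdict(int)
    hit_trap = set()
    for client, path in log:
        if path == TRAP_PATH:
            hit_trap.add(client)
        else:
            pages[client] += 1
    return sorted(c for c, n in pages.items() if n >= min_pages and c not in hit_trap)


log = [
    ("10.0.0.1", "/a"), ("10.0.0.1", "/not-a-bot"), ("10.0.0.1", "/b"), ("10.0.0.1", "/c"),
    ("10.0.0.2", "/a"), ("10.0.0.2", "/b"), ("10.0.0.2", "/c"),
]
print(clients_skipping_trap(log))  # ['10.0.0.2']
```

Clients that block CSS or images would get flagged the same way, so this is a hint rather than proof.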

Make sure to check robots.txt and you're good to go
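
Checking robots.txt doesn't have to be manual. A minimal sketch with the stdlib `urllib.robotparser`, run against a made-up robots.txt so it works offline; the `allowed` helper and the crawler name are just for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed(ROBOTS_TXT, "example-crawler", "https://example.com/index.html"))         # True
print(allowed(ROBOTS_TXT, "example-crawler", "https://example.com/private/data.json"))  # False
```

Against a live site you would call `set_url(...)` and `read()` on the parser instead of feeding it a string.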

selenium

And if they disallow any part of what I am scraping... what then?
I am not scraping HTML or CSS. Also, my requests are slower than a browser's requests.

I'm sure they must get plenty of false positives when autists like me use umatrix

>if they disallow any part of what I am scraping?
What kind of question is this? Comply, obviously.

channel it through tor

>How legal is it?
who cares? nothing is illegal until you're in prison.

recent legal battles have favored the scraper
it's still trespassing and you can still get sued
if you scrape at a human rate no one will ever stop you
if you try to sell the data or in some way sell a product connected to the data, you better be working behind an LLC or you're getting wrecked
-t.actual professional scraper
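
What "a human rate" might look like in code: a rough sketch that pauses a randomized couple of seconds between requests. The `crawl` helper and the 2 to 3 second default are assumptions for illustration, not a known safe threshold.

```python
import random
import time


def crawl(urls, fetch, base=2.0, jitter=1.0, sleep=time.sleep, rng=random.random):
    """Call fetch(url) for each URL, sleeping base..base+jitter seconds in between."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:  # no pause before the very first request
            sleep(base + jitter * rng())
        results.append(fetch(url))
    return results
```

`fetch` is whatever function actually downloads a URL; `sleep` and `rng` are parameters only so the pacing can be checked without really waiting.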

what are you trying to accomplish? more often than not you don't really need to worry about legality

I both work for a company that does scraping and run a small business that relies on scraped data, which I repackage and sell as a subscription newsletter. making dank bucks and haven't been sued yet.

>Comply obviously!
The point is that there are laws. Just because I want my neighbor to not exist doesn't mean I can enforce it. All I am asking is what part of scraping is actually against written law and what part is simply wishful thinking on their part, so I don't go against the law.

if it's a smaller site you could probably just convince them to give you a data dump for a one-time fee

Well. That's the best part. The data I am downloading is copyrighted, but not by the site I am downloading it from. Kinda like youtube. Now that I think about it, are youtube downloaders illegal? Because that's in the ballpark of what I am doing.

that answer was for you.

You don't comply with robots.txt
You do crime

Don't be a criminal!

robots.txt is advisory. You don't need to follow it

Anyway, been crawling and downloading around 7TB of yiff from all around the Internet, and despite sometimes hammering websites and including my contact data in the useragent, nobody seems to care.
However, that's probably because I'm doing a lot of "If-Modified-Since" requests that usually don't even reach the backend.
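
For anyone wondering what those conditional requests look like, a minimal stdlib sketch. The helper names are made up; the idea is to send `If-Modified-Since` with the time of the last fetch and treat a 304 response as "nothing new".

```python
import urllib.error
import urllib.request
from email.utils import formatdate


def conditional_request(url, last_fetch_ts):
    """Build a GET that asks for the body only if it changed since last_fetch_ts."""
    stamp = formatdate(last_fetch_ts, usegmt=True)  # HTTP-date format, always GMT
    return urllib.request.Request(url, headers={"If-Modified-Since": stamp})


def fetch_if_changed(url, last_fetch_ts):
    """Return the new body, or None when the server answers 304 Not Modified."""
    try:
        with urllib.request.urlopen(conditional_request(url, last_fetch_ts)) as resp:
            return resp.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:  # urllib surfaces 304 as an HTTPError
            return None
        raise
```

`last_fetch_ts` is a Unix timestamp, e.g. the `time.time()` value saved after the previous successful download.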

I write crawlers for a living, AMA
We only need proxies for less than a dozen websites (out of a hundred), most websites don't monitor their traffic at all.
Recaptcha is another concern.

>are youtube downloaders illegal?

when you watch a youtube video, you download its content to RAM, so it can't possibly be illegal. Redistributing it would be.

Unless you start redistributing copyrighted material, there is nothing illegal about that.

that's what I thought too, but I really wanted to make sure

I work for a company that does webscraping, it's perfectly legal. They've been doing it for years. The information is publicly accessible, nowhere is it written that you can only access it through a browser (and if it were, good luck proving my selenium client isn't a browser). You can even use their public APIs and use cloudflare solvers.

The thing that gets you IP banned is doing 20k requests in a minute. Headers only matter for access, like putting a token into them or specifying the content type.

From the server's perspective, does a video stream (like with youtube downloaders) count as one request, or do the requests pile up while the video streams to a hard drive?

I was thinking about web scraping sites with streaming audio to compile into something like a general internet radio

It counts as one huge request, what I said applied to single-part requests like html and json. If you're scraping video from a company with virtually infinite resources like google, they'll obviously be able to tell what you're doing unless it's something like a frontend to youtube where the user requests a video. Newpipe exists and works, so you're good there. But if you're downloading batchloads of videos, yeah I doubt they'll let it keep going long before you hit a captcha. There are solvers for those (see buster), but life will be suffering.

>robots.txt is advisory

Attached: 1542925089494.jpg (848x480, 47K)

If you aren't a "security" business working for the government*.

>I write crawlers for a living
What? There's a market for crawlers? Who buys them?
Isn't writing a crawler a trivial task?

>isn't writing a crawler a trivial task?
For getting basic html from a static page? Sure. But you're way out of your depth if you think that's how normal webpages work these days.
>fuck solving a captcha to get a token to get another token to open a websocket to finally maybe get data

youtube-dl grabs tons of shit from YouTube just fine for me. I've downloaded thousands of videos with it.

I don't just write them, I manage clusters, databases and stuff.
Many times it's not trivial: authentication, rendering, proxies, captchas and retry policy are some things to think about. But a lot of the time it is trivial.
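
To make "retry policy" concrete, here's one common shape, exponential backoff, as a rough sketch. The `with_retries` name and its defaults are made up for illustration; real crawlers usually also cap the delay and only retry on specific errors.

```python
import time


def with_retries(func, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(); on an exception wait base_delay * 2**n seconds and try again."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the last error
            sleep(base_delay * 2 ** attempt)
```

Usage would be something like `with_retries(lambda: download(url))`, where `download` stands in for whatever function fetches the page.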

I'm sure there's a python library for that.
So... Who buys crawlers? For what purpose?

nobody buys crawlers, they buy data.

What kind of data?

social media, news, stock prices, competitor's retail prices, all kinds of shit.

>social media
Like, posts containing specific words?

like everything ever posted by any slightly famous person.

Can anybody explain the process of web/screen-scraping? Wikipedia's explanation isn't that good
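
Very roughly, the process is: download the page, parse the markup, pull out the bits you care about. A minimal sketch of the parse-and-extract step using only the stdlib `html.parser`, run on a made-up HTML string; the class name, helper, and sample markup are for illustration only.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Return all link targets found in an HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


SAMPLE = '<html><body><a href="/a.mp3">A</a> <a href="/b.mp3">B</a> <a>no href</a></body></html>'
print(extract_links(SAMPLE))  # ['/a.mp3', '/b.mp3']
```

On a real site the HTML string would come from an HTTP response, and pages that build their content with JavaScript need a rendering step first.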

What about that youtube downloader example? Does that count as scraping? How would that work on other sites?