Web Scraping General

Can we get a web automation thread going?

I never really see this discussed on Jow Forums, but I am sure there are plenty of people who are also interested in this type of programming.

I enjoy using Python and Selenium to build web scrapers, account automation, and stuff like that. What kinds of stuff is Jow Forums automating?

Attached: selenium_logo_320x260-300x260.jpg (300x260, 13K)


seriously? all there is to it is HTML parsers and regular expressions
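To the "just HTML parsers" point: the Python standard library alone gets you most of the way. A minimal sketch of pulling links out of a page with `html.parser` (the sample HTML and hrefs are made up):

```python
# Minimal link extractor using only the stdlib's html.parser.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

html = '<p>see <a href="/docs">docs</a> and <a href="/faq">faq</a></p>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # ['/docs', '/faq']
```

Regex then handles whatever the parser hands you (hrefs, text content) without ever regexing raw HTML directly.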

If you teach me how, I'll give you three Cisco courses' worth of content for free

I scrape latest manga chapters because I like using comicrack

If you know a bit of Java and CSS selectors, you're there with jsoup.

Do you know how to run Selenium in Docker containers? I want to break the few web scrapers I have out into containers to make the logging cleaner.

Can I ask why you guys bother? Is there some profit in this? Are you scraping for content that interests you personally? What's the deal?

Me personally I've been writing my own crawler to self host a search engine. All part of my quest to de-google myself. I don't technically scrape, just crawl, scan for keywords in the doc, and index them.
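The crawl-scan-index loop is simple to sketch. A toy inverted index of the kind a self-hosted search engine builds: map each keyword to the set of pages containing it. The `pages` dict stands in for fetched documents (the URLs are hypothetical):

```python
# Toy inverted index: keyword -> set of URLs containing it.
import re
from collections import defaultdict

def build_index(pages):
    index = defaultdict(set)
    for url, text in pages.items():
        # set() so a word repeated on one page is indexed once
        for word in set(re.findall(r"[a-z]+", text.lower())):
            index[word].add(url)
    return index

pages = {
    "http://example.com/a": "Selenium drives real browsers",
    "http://example.com/b": "Crawlers index keywords per page",
}
index = build_index(pages)
print(sorted(index["index"]))
```

A query is then just set intersection over the keywords' URL sets.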

So I'm a student with a netacad account. I want to be able to review my course material 5 years from now, but I don't have faith I'll have perpetual access to the content. If I scrape the course content, it's like owning three Cisco textbooks I can keep and share. I downloaded a website scraper program, but none of the actual course content was retrieved.

>scrape content
>use content to train algorithm
>???

you could download the HTML manually

Yeah, I could. It will take a long time but since I can't scrape I guess I should just get started.

You should consider making a torrent of it. Free sharing of information and all.

I would do that. But it's just more likely to happen if I don't have to spend hours and hours manually downloading it and stitching it together.

github.com/zalando/zalenium
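Zalenium aside, the stock Selenium images work for the one-scraper-per-container setup. A sketch, assuming the standard image name and port (treat the remote URL path as a starting point, not gospel):

```shell
# One container per scraper; logs stay separated via `docker logs <name>`.
docker run -d --name scraper-chrome -p 4444:4444 --shm-size=2g \
    selenium/standalone-chrome
# your script then targets the remote endpoint instead of a local driver:
#   webdriver.Remote("http://localhost:4444/wd/hub", ...)
```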

Attached: DC1EF409-0A25-4E86-B976-8F941553DC37.jpg (683x1024, 366K)

nice, thank you

You deserve it.
Knowledge is to be shared so that we all can grow.

Attached: E11E326F-3D12-4F37-AC21-AF19080763F1.jpg (1242x694, 481K)

Web scrapers are great, man, but web robots in general are cool too. Most ideas require the ability to automate HTTP GET and POST requests and parse web data. I've been working on a couple of things in C recently. Having used jsoup, I really prefer the elegance and power of curl.
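In C that's libcurl territory; for comparison, here's the same GET/POST building block in the thread's lingua franca, Python, using only the stdlib. This just assembles the request without sending it (the httpbin.org endpoint is a placeholder):

```python
# Assemble (not send) a form-encoded POST with only the standard library.
from urllib.parse import urlencode
from urllib.request import Request

data = urlencode({"user": "anon", "action": "login"}).encode()
req = Request(
    "https://httpbin.org/post",
    data=data,  # presence of a body makes urllib treat this as POST
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)
print(req.get_method(), req.full_url)
```

Calling `urllib.request.urlopen(req)` would actually fire it off.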

Use REST Assured if you want to do that

I had an idea to build a web interface for a web scraper and sell it as a service

probably want to build it either containerized or serverless

I wrote an automated scraper to download images from a few certain popular social media image platforms, for content that is of a sexual interest to me.

lo and behold a startup is born

So an upgraded Selenium IDE?

I don't think it's a particularly revolutionary idea.

I first manually gave it a few usernames of accounts to scrape, and then it also finds, from those accounts, any references to other accounts. These are treated as "relations".

If for example User A and User B are related, then there's a chance that User B contains images of User A, which is... useful.

By default I use Python, specifically lxml's XPath and requests. It's enough for 90% of online content.
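lxml gives you full XPath; the stdlib's `xml.etree.ElementTree` supports a smaller subset, which is enough to sketch the idea on an inline snippet without installing anything (the markup and class names are invented):

```python
# ElementTree's limited XPath: descendant search plus attribute predicates.
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<html><body>"
    "<div class='post'><a href='/1'>one</a></div>"
    "<div class='post'><a href='/2'>two</a></div>"
    "</body></html>"
)
hrefs = [a.get("href") for a in doc.findall(".//div[@class='post']/a")]
print(hrefs)  # ['/1', '/2']
```

Caveat: `ET.fromstring` wants well-formed XML, which real-world HTML rarely is — that's exactly the gap lxml's HTML parser fills.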

I've written a handful of RSS aggregators but for the most part I just use it to build image scrapers from time to time.
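The RSS side is also stdlib-friendly, since feeds are well-formed XML. A bare-bones item extractor (the feed content is a stub):

```python
# Pull (title, link) pairs out of an RSS 2.0 feed with the stdlib.
import xml.etree.ElementTree as ET

feed = """<rss version="2.0"><channel>
<item><title>Chapter 12</title><link>http://example.com/12</link></item>
<item><title>Chapter 13</title><link>http://example.com/13</link></item>
</channel></rss>"""

root = ET.fromstring(feed)
items = [(i.findtext("title"), i.findtext("link")) for i in root.iter("item")]
print(items)
```

An aggregator is then just this over a list of feed URLs, deduplicating on the link.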

yeah, web-based for businesses trying to integrate old systems or something

I'm sure it already exists, but either way it's good practice. I figure if I were to figure out how to build an Office or G Suite plugin, it might actually sell in pretty good numbers.

Idk man, I'm trying to avoid Java as much as possible. But we'll see. Thanks for the resource. This isn't something that normies need access to, so I prefer the simplicity of just having it as one .c file with one library that I could probably run on a toaster from a tty.

Anyone work with headless automation?
I really want to get into it but need a good idea of where to learn/start

You're building a directed graph; check out Gephi if you want to visualize it.
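The account-relation structure is just an adjacency mapping, and Gephi imports plain edge-list CSVs with Source/Target headers. A sketch (usernames are made up):

```python
# Dump a directed graph of account relations as a Gephi-importable edge list.
import csv
import io

relations = {
    "user_a": ["user_b", "user_c"],
    "user_b": ["user_a"],
}

buf = io.StringIO()  # stand-in for a real file handle
writer = csv.writer(buf)
writer.writerow(["Source", "Target"])  # column names Gephi recognizes
for src, targets in relations.items():
    for dst in targets:
        writer.writerow([src, dst])
print(buf.getvalue())
```

Write `buf.getvalue()` to `edges.csv` and import it via Gephi's spreadsheet importer.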

I was referring to downloading ass pics off instagram as a decent idea

scraper for what exactly?

>Gephi
I'll check it out, thanks

Yeah, it's pretty nice. I'm pretty much a data hoarder, and I used to spend hours every day just manually downloading this shit, so I decided to spend some time automating it, and it's definitely been worth it.

Yeah, basically. I scrape sites of interest and related sites for content, then either sort it for later or just consume and delete it. To illustrate the former, I pile up useful reference material on a file server at home; for the latter, I pull daily from a collection of podcasts, vidcasts, and articles about my hobbies to my phone via Termux, which I usually delete after watching or listening over my lunch hour.

Obviously, you can simply use search engine(s) and a few other pieces of software to manually accomplish the same thing, but this saves a fairly significant chunk of time when you regularly check more than a few sites for content.

see

I used to use pure PHP, and then used PhantomJS.

Been a while since I did some menial work though.

This is one of the easiest tasks anyone can do if you're simply "automating", but it's a hassle if you're doing it to test your website; in that case, just fucking do it manually in an installed browser.

Can't they just migrate the backend?

You have no idea how many companies don't want to spend the money on that, but they'll spend maybe 1/3 or 1/4 of it on something like this just to "make it work".

There's more money in middleware software than anything, that's why SAP is such a giant.

Attached: 1533864420724.png (322x256, 27K)