How do you use regex to parse html?

Other urls found in this thread:

stackoverflow.com/a/1732454/2378146
github.com/PuerkitoBio/goquery
html.spec.whatwg.org/multipage/parsing.html#parsing

>How do you use regex to parse html?
don't do that

H̸̡̪̯ͨ͊̽̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̝͖ͭ̏ͮ͟O̮̪̝͍ͮM̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ

stackoverflow.com/a/1732454/2378146

if you absolutely need to parse html, use something like goquery:
github.com/PuerkitoBio/goquery
example:
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

func ExampleScrape() {
	// Request the HTML page. http.Get needs a full URL, scheme included.
	res, err := http.Get("http://metalsucks.net")
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()
	if res.StatusCode != http.StatusOK {
		log.Fatalf("status code error: %d %s", res.StatusCode, res.Status)
	}

	// Load the HTML document into a queryable tree.
	doc, err := goquery.NewDocumentFromReader(res.Body)
	if err != nil {
		log.Fatal(err)
	}

	// Find the review items with a CSS selector.
	doc.Find(".sidebar-reviews article .content-block").Each(func(i int, s *goquery.Selection) {
		// For each item found, get the band and title.
		band := s.Find("a").Text()
		title := s.Find("i").Text()
		fmt.Printf("Review %d: %s - %s\n", i, band, title)
	})
}

func main() {
	ExampleScrape()
}

haha le EPIC zalgo meme, upvoted my r/stackoverflow friends :^)

Writing an HTML parser (and validator) is a great exercise for anyone trying to learn programming. No regex required.

>The <center> cannot hold
gets me every time

Just use an HTML parser. Regex sucks for this task.

Good luck with that - the spec is ridiculously complex.
HTML is not XML. It's a really fault-tolerant language, and that's what makes it so difficult to parse properly.

Look at it: html.spec.whatwg.org/multipage/parsing.html#parsing
Do you think that is "a great exercise for anyone trying to learn programming"? That's a great exercise for a fucking whole team of experienced programmers.

HTML isn't a regular language, so it's literally impossible.
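For anyone who wants to see the nesting problem rather than take it on faith, here is a minimal sketch in Go. The pattern and the helper name `firstDivContent` are made up for illustration, and the input is a toy string, not real-world HTML. Go's regexp package has no recursion or backreferences, so the pattern has no way to count how deep it is.

```go
package main

import (
	"fmt"
	"regexp"
)

// innerDiv is a naive pattern (an assumption for this demo): it captures the
// text between a <div> and the first </div> that follows, ignoring nesting.
var innerDiv = regexp.MustCompile(`(?s)<div>(.*?)</div>`)

// firstDivContent returns what the naive pattern believes is the content of
// the first <div> in html, or "" if nothing matches.
func firstDivContent(html string) string {
	m := innerDiv.FindStringSubmatch(html)
	if m == nil {
		return ""
	}
	return m[1]
}

func main() {
	flat := "<div>hello</div>"
	nested := "<div>outer <div>inner</div> tail</div>"

	fmt.Printf("%q\n", firstDivContent(flat))   // "hello"
	fmt.Printf("%q\n", firstDivContent(nested)) // "outer <div>inner"
}
```

The lazy match stops at the first closing tag, so the outer div gets cut off in the middle. Making it greedy fixes this one input but then swallows sibling divs instead.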

you're mad. parsing html isn't as complex as you're making it out to be.

grep can parse binary files, which aren't a language, so why is it impossible?

>grep can parse binary files
No it can't. Learn what parsing is

$ dd if=/dev/urandom bs=1K count=10 > random.bin
$ grep '[a-zA-Z0-9]' random.bin
Binary file random.bin matches
$ cat random.bin

you're going down the wrong path.
try nokogiri or whatever the equivalent is in the language you're using

Pattern matching isn't parsing, son.

Attached: 1523809474937.png (500x700, 270K)

But it is a rite of passage

HTML requires recursion, so it can't be parsed purely with regex. You'd be best off using a recursive descent parser, since HTML is LL(1), as far as I'm aware.
That being said, it would be pretty easy to combine regex with a stack to accomplish the recursion. You could also probably cut down a tree with a scalpel. Why would you? Who knows?
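The regex-plus-stack idea can be sketched roughly like this in Go. The regex only finds tags and the stack does the nesting. Everything here is a simplifying assumption: the `tagRe` pattern and the `balanced` helper are invented for the demo, and it ignores attributes, comments, void elements like <br>, and all the error recovery a real HTML parser does.

```go
package main

import (
	"fmt"
	"regexp"
)

// tagRe matches a simplified opening or closing tag such as <div> or </div>.
// Assumption: no attributes, comments, void elements, or self-closing tags.
var tagRe = regexp.MustCompile(`<(/?)([a-zA-Z][a-zA-Z0-9]*)>`)

// balanced reports whether every simplified tag in s is closed in the
// reverse of the order in which it was opened.
func balanced(s string) bool {
	var stack []string
	for _, m := range tagRe.FindAllStringSubmatch(s, -1) {
		closing, name := m[1] == "/", m[2]
		if !closing {
			stack = append(stack, name) // opening tag: remember it
			continue
		}
		// Closing tag: it must match the most recently opened tag.
		if len(stack) == 0 || stack[len(stack)-1] != name {
			return false
		}
		stack = stack[:len(stack)-1]
	}
	return len(stack) == 0 // anything left open means unbalanced
}

func main() {
	fmt.Println(balanced("<div><p>hi</p></div>")) // true
	fmt.Println(balanced("<div><p>hi</div></p>")) // false
	fmt.Println(balanced("<div><div>x</div>"))    // false
}
```

At that point the stack is doing the parsing and the regex is just a tokenizer, which is more or less the scalpel being handed to someone with a chainsaw.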

>what is regex.h?

It's literally impossible OP, and you risk summoning extradimensional horrors

>Matching is parsing

If you can match, you can parse. Binary isn't a regular language either.

>If you can match, you can parse
Imagine being this retarded

Dear OP,

Regex (without extensions) can only parse regular languages. HTML is a context-free language. Please take CS classes, thank you.

True. Can you write a regex that will match if and only if the string is valid HTML?

Never, ever, ever do that. Instead use some XML parser and learn xpath, much better and comfier than shitty-ass regex patterns.

I wrote one in C for wiby search (which I also built). It was a fun exercise and improved my C skills quite a bit. The reason was that I found the html parsers out there more annoying to utilize than to just figure it out myself.

HTML is not guaranteed to be valid XML
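A quick way to check this, sketched with Go's standard encoding/xml decoder in its default strict mode. The helper name `wellFormedXML` is made up for the demo. An unclosed <br> is perfectly ordinary HTML, but a strict XML parser should reject it.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// wellFormedXML tokenizes s with encoding/xml in strict mode and returns
// the first error encountered, or nil if the whole input is well-formed.
func wellFormedXML(s string) error {
	dec := xml.NewDecoder(strings.NewReader(s))
	for {
		_, err := dec.Token()
		if err == io.EOF {
			return nil // reached the end with every element closed
		}
		if err != nil {
			return err
		}
	}
}

func main() {
	// Self-closed <br/> is well-formed XML, so no error is expected.
	fmt.Println(wellFormedXML("<p>one<br/>two</p>"))
	// A bare <br> is fine in HTML, but here </p> arrives while <br> is
	// still open, so the strict decoder reports an error.
	fmt.Println(wellFormedXML("<p>one<br>two</p>"))
}
```

So an XML parser plus xpath only works when the page happens to be well-formed, which you can't count on for pages in the wild.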