Ok so I am having a bit of trouble with regex

I have a string:
Basketball04!

I need to capture everything prior to any html tag there. I anticipated using ([^\s]+), but because the html tags are sometimes present, that captures the tags.

If I use ([^\s]+)(<[^>]*>), I get a situation in which I get two groups: {Basketball04!} and {}

Which is most curious to me, because I would have thought I'd get {Basketball04!} {}, which is precisely what I actually need (unless of course I can make it not capture the second group at all).
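
A rough sketch of what is probably happening here, in Python. It assumes the second group was meant to be a tag matcher like (<[^>]*>) and that the real string ends in more than one closing tag. The sample string and its tags are made up for illustration, not OP's actual data.

```python
import re

# Hypothetical input: the tags are invented for illustration.
sample = "Basketball04!</b></span>"

# Greedy [^\s]+ swallows the whole string, then backtracks only as far
# as it must, so group 1 keeps every tag except the last one.
greedy = re.match(r"([^\s]+)(<[^>]*>)", sample)
greedy_groups = greedy.groups()

# Excluding "<" from the class stops group 1 at the first tag, and
# (?:...) matches the tag without capturing it.
fixed = re.match(r"([^\s<]+)(?:<[^>]*>)", sample)
fixed_groups = fixed.groups()

print(greedy_groups)
print(fixed_groups)
```

If the tag does not need to be checked at all, the (?:...) part can simply be dropped, since [^\s<]+ already stops at the first "<".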

Other urls found in this thread:

grymoire.com/Unix/Regular.html
regex101.com/
twitter.com/AnonBabble

(\S+?)<
Learn about the difference between greedy and non-greedy quantifiers. Also [^\s] = \S
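
To make the greedy vs. non-greedy point concrete, here is a small Python sketch. The sample string is hypothetical (the tags are invented), and the only thing that changes between the two patterns is the ? after the +.

```python
import re

# Hypothetical input with two closing tags.
sample = "Basketball04!</b></span>"

# Greedy: \S+ grabs as much as possible, so it stops at the LAST "<".
greedy = re.match(r"(\S+)<", sample).group(1)

# Non-greedy: \S+? grabs as little as possible, so it stops at the FIRST "<".
lazy = re.match(r"(\S+?)<", sample).group(1)

# [^\s] and \S accept exactly the same characters.
same_class = all(
    bool(re.fullmatch(r"[^\s]", ch)) == bool(re.fullmatch(r"\S", ch))
    for ch in "aZ9!< \t\n"
)

print(greedy)
print(lazy)
print(same_class)
```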

'^[^

Or, for a shorter way of accomplishing the same thing: ([^

Beat me to it, and got dubs. Just fucking kill me

Also, I'm proud that nobody has started chimping out about "OMG HTML regex" yet

i was just about to post a chimpout but instead i will say good luck

use perl for its superior regex

The thing is, either one of two things will happen. Either:
>The task is simple enough that using regular expressions to extract the information will be faster and actually simpler, or
>The task is complex enough that the person will learn the lesson the proper way--by trying and failing

To me, this one sounds like a very simple task

These days Ruby does PCRE better than the language PCRE was named after

Here's perhaps my favorite regex that happens to do exactly what you need.
(.(?!))*.[/code

You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML. Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts. so many times but it is not getting to me. Even enhanced irregular regular expressions as used by Perl are not up to the task of parsing HTML. You will never make me crack. HTML is a language of sufficient complexity that it cannot be parsed by regular expressions. Even Jon Skeet cannot parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes. HTML-plus-regexp will liquify the nerves of the sentient whilst you observe, your psyche withering in the onslaught of horror.

Rege̿̔̉x-based HTML parsers are the cancer that is killing Jow Forums it is too late it is too late we cannot be saved the trangession of a chi͡ld ensures regex will consume all living tissue (except for HTML which it cannot, as previously prophesied) dear lord help us how can anyone survive this scourge using regex to parse HTML has doomed humanity to an eternity of dread torture and security holes using regex as a tool to process HTML establishes a breach between this world and the dread realm of c͒ͪo͛ͫrrupt entities (like SGML entities, but more corrupt) a mere glimpse of the world of regex parsers for HTML will instantly transport a programmer's consciousness into a world of ceaseless screaming, he comes, the pestilent slithy regex-infection will devour your HTML parser, application and existence for all time like Visual Basic only worse he comes he comes do not fight he com̡e̶s, ̕h̵is un̨ho͞ly radiańcé destroying all enli̍̈́̂̈́ghtenment, HTML tags lea͠ki̧n͘g fr̶ǫm ̡yo͟ur eye͢s̸ ̛l̕ike liquid pain, the song of re̸gular expression parsing will extinguish the voices of mortal man from the sphere I can see it can you see ͚̖͔̙î̩́t͎̩̱͔́̋̀ it is beautiful the final snuffing of the lies of Man ALL IS LOŚ͖̩͇̗̪̏̈́T ALL IS LOST the pon̷y he comes he c̶̮omes he comes the ichor permeates all MY FACE MY FACE ᵒh god no NO NOO̼OO NΘ stop the an*̶̙̤͑̾̾ͫg͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͨe̠s ͎a̧͈͖r̽̾̈́͒͑e not rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ IS̯͈͕̹̘̱ͮ̂ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚N̐Y̡ H̸̡̪̯ͨ͊̽̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̝͖ͭ̏ͮ͟O̮̪̝͍ͮM̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ

The superior .NET regex can in fact parse HTML with balancing groups. Dead meme.

This is still good shit, but people take it way too seriously and often cite it as gospel to criticize using regular expressions without really understanding the problems.
For example, if you're trying to do an actual *parsing* task, e.g. "match text within a block that's nested in a ", then you should definitely not use regular expressions.
However, if you're wanting to extract the URLs from a bunch of "" tags, then regular expressions should be fine.
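
A minimal sketch of the "just extract the URLs" case in Python, assuming the tags in question are anchor tags with double-quoted href attributes. The sample markup and the HREF_RE name are made up for illustration.

```python
import re

# Hypothetical page fragment.
html = (
    '<p><a href="https://example.com/a">one</a> and '
    '<a class="x" href="https://example.com/b">two</a></p>'
)

# "<a" followed by whitespace, then anything up to href="...".
# This deliberately ignores single-quoted or unquoted attributes.
HREF_RE = re.compile(r'<a\s[^>]*?href="([^"]*)"', re.IGNORECASE)

urls = HREF_RE.findall(html)
print(urls)
```

This is pattern extraction, not parsing: it knows nothing about nesting, comments, or scripts, which is fine as long as the input is consistently formatted.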

/$/g

JavaScript is your Daddy

OP here. Thanks, you guys. I've basically hidden from regex for a while, but holy shit does it ever make my projects work more smoothly, so I'm finally trying to wrap my head around it. I will get there.

So can ruby, or any other engine that supports recursion, but the syntax will look so shitty, and it will be so error prone, that it's usually a bad idea. At least if you're writing code that you're sharing with anybody.

On another note, I basically learned regular expressions by "parsing" HTML to mass rip videos from porn sites so I often get defensive when people say not to do that.

What the fuck is that syntax? Don't you need to escape the second / ? Also, why are you (apparently) trying to match closing tags when that's not what OP asked for at all?

Good luck bro. If you use IRC, #regex on freenode used to be a friendly place, at least when I lurked there a few years back. It may be a good way to ask questions if you need it.

Yeah you have to escape the second /. Then what did OP ask?

>What the fuck is that syntax?
JS syntax, / and /g are not part of the expression.
This matches everything between < and >. The opposite of what OP asked.

>syntax
It looked confusing because some engines use s/../../g syntax to do global substitutions.
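
For anyone mapping those syntaxes onto Python: the slashes and the g flag have no direct equivalent, you just pick the function. The sketch below assumes a generic tag pattern like <[^>]*>, since the expression being discussed got mangled, and the sample string is hypothetical.

```python
import re

# Assumed tag pattern: "<", anything that is not ">", then ">".
TAG = re.compile(r"<[^>]*>")

sample = "<b>Basketball04!</b>"  # hypothetical input

# Roughly what a JS match with /.../g gives you: every match in the string.
all_tags = TAG.findall(sample)

# Roughly what s/<[^>]*>//g does: replace every match with nothing.
no_tags = TAG.sub("", sample)

print(all_tags)
print(no_tags)
```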

thanks for that, I do use IRC, though not as much as I used to. Used to post on stackoverflow, but you basically get badgered there instead of taught.

>Copypasting from SO
The absolute state of Jow Forums

Well it's just a half of that because it's not replacing anything.

my $regex__cfg_names = qr"^(\/\*\+\*\/\s+\/\*\|)(.*)\*\/\r?$";


I miss working

Parse? Mayhap, but you can definitely tokenize it using most scripting language's "regexp" and easily throw the tokenized output through a primitive state machine to extract what you need.
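
A sketch of that tokenize-then-state-machine idea in Python. Everything here (TOKEN_RE, tokenize, text_inside, the sample page) is hypothetical, and the tokenizer is deliberately naive: it does not understand comments, scripts, or a ">" inside an attribute value.

```python
import re

# One token is either a tag (optional "/", tag name, rest of tag) or a run of text.
TOKEN_RE = re.compile(r"<(/?)(\w+)[^>]*>|([^<]+)")


def tokenize(html):
    """Yield ("open" | "close" | "text", value) pairs."""
    for m in TOKEN_RE.finditer(html):
        closing, tag, text = m.group(1), m.group(2), m.group(3)
        if text is not None:
            yield ("text", text)
        elif closing:
            yield ("close", tag.lower())
        else:
            yield ("open", tag.lower())


def text_inside(html, wanted):
    """Collect text that appears while at least one <wanted> tag is open."""
    depth = 0  # the entire "state" of the state machine
    out = []
    for kind, value in tokenize(html):
        if kind == "open" and value == wanted:
            depth += 1
        elif kind == "close" and value == wanted:
            depth = max(depth - 1, 0)
        elif kind == "text" and depth > 0:
            out.append(value)
    return out


page = "<div><span>Basketball04!</span> junk <span>Hockey99?</span></div>"
names = text_inside(page, "span")
print(names)
```

The regex only splits the input into pieces; the nesting logic lives in the depth counter, which is the part a single regular expression cannot do reliably.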

The real problem is the name regexp is just really shitty. Which is why I call it explicitly regexp and never a regular expression.

Why the fuck wouldn't you just use a browser engine and have it output the src attribute for the images in JSON?

OP here. If you want the actual context of this, the regex is being used in a python script, which pulls the html source from a series of pages in a forum thread.

Certain replies are nested or quoted, which adds the padded closing tags, which I don't need in my output. Once scraped, the list is sorted to remove duplicates. I was manually doing that, at first, but the and other closing elements were causing tons of duplicates to be missed. By not capturing the closing tags, now I am not getting the missed dupes. Yay science.
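
A minimal sketch of that cleanup step, assuming the scraped entries are plain strings that sometimes carry one or more trailing closing tags. The sample list, the clean helper, and the TRAILING_CLOSERS name are all hypothetical, not the actual script.

```python
import re

# One or more closing tags (with optional whitespace) at the end of the string.
TRAILING_CLOSERS = re.compile(r"(?:\s*</[^>]+>)+\s*$")


def clean(entry):
    """Drop trailing closing tags so identical entries compare equal."""
    return TRAILING_CLOSERS.sub("", entry).strip()


# Hypothetical scraped data with tag-padded duplicates.
scraped = [
    "Basketball04!",
    "Basketball04!</span>",
    "Hockey99?</span></div>",
    "Hockey99?",
]

# A set removes the duplicates; sorted() gives a stable order for output.
unique = sorted({clean(e) for e in scraped})
print(unique)
```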

Again, why the fuck wouldn't you just use a browser engine for this? Or at least a library designed for parsing HTML? Yay science? More like yay a waste of God damn time.

it's supposed to match anything that ends with a closing html tag hence the $

because you dont know the whole scope of what the script is for, and doing what you suggested is both autistic and unrelated to the task.

I don't need the HTML. I need data from the thread's pages, and the data is formatted specifically. So it is dumb to do anything other than take the pages, make one large-ass string of their HTML, and skim off the formatted text into a list. I can then input that list into a database.

No, what's autistic is wasting several hours and going onto an anonymous forum to get help with a simple problem you could have resolved by using a library, a browser engine or a simple fucking Google search. The fact you were even spoonfed your solution disgusts me.
>you don't know the w-whole scope!
What kind of fucking 10iq rebuttal is this? No one gives a shit about your scope, don't waste time autistically optimizing something you're clearly doing in a retarded way in the first place. Use a library next time.

Keep ignoring that retard, user. People like him are the ones responsible for """enterprise""" java and using the left-pad library in NPM. It is autistic as fuck to pull in a fucking browser engine to extract consistently formatted text from a larger chunk of consistently formatted text.

>using regex to parse html

This website is useful
grymoire.com/Unix/Regular.html

>regex
whoa it's like im in 1997

you could have just used beautifulsoup4.

this forum doesn't happen to be bitcointalk? if so... lol good luck with that

you mad bro? You sound mad, you mad though bro?

or the built in html parsing lib
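
For reference, a minimal sketch with the standard library's html.parser, which needs no extra install. The TextCollector class, the visible_text helper, and the sample markup are hypothetical; a real script would probably also check which tag it is inside before keeping the text.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect every non-empty run of text, ignoring the tags entirely."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def visible_text(html):
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks


chunks = visible_text("<div><span>Basketball04!</span></div><div>Hockey99?</div>")
print(chunks)
```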

ask cloudflare, they know regex

Epic

Inase your serious, that's just famous ancient pasta, probably the only well known pasta to originate on stackoverflow

*In case

You mean 1968

How so, user?

Nigger, perl is all regex

It's autistic as fuck to waste your time using regex to parse HTML instead of using a library or browser engine. Sorry, retard, but you're the irregular, the outcast who thinks it's okay to waste fucking time doing stupid shit like this instead of getting work done. You'd probably paint your house with a fucking toothbrush if you thought it could give you a slightly more even coat.

If you need help : regex101.com/
It's actually pretty good.

Listen here you dumb nigger. What do you think would solve OP's problem faster, a single fucking regular expression based on clearly visible patterns (he's filtering text that's already broken into strings), or pulling in a DOM parser and figuring out the entire hierarchy of the website, then traversing it? Not to mention needing to acquire the non-portable and less useful knowledge of how to use whichever of the many DOM parsing libraries are available. It's better to use regular expressions for simple text processing. Just because your autistic ass assumes "html parsing" as soon as you see a "

This is something you could do with Python's ElementTree or the SAX parser. Hell, you even have full-blown DOMs. ElementTree supports a subset of XPath as well. If you need a safe ElementTree, consider defusedxml, which wraps the standard library parsers built on expat. HTML specifically could probably be better parsed with meme shit like beautifulsoup.
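
A small sketch of the ElementTree route, assuming the input is well-formed XML/XHTML (real-world HTML often is not, and then this raises a ParseError). The snippet, the class names, and the query are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment.
snippet = """<div>
  <span class="name">Basketball04!</span>
  <span class="date">today</span>
  <blockquote><span class="name">Hockey99?</span></blockquote>
</div>"""

root = ET.fromstring(snippet)

# ElementTree's limited XPath: any descendant <span> with class="name".
names = [el.text for el in root.findall(".//span[@class='name']")]
print(names)
```

The nested span inside the blockquote is found without any special handling, which is exactly the case that gets awkward with a regex.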

XPath 2.0 or even css selectors are a lot easier to reason about and more terse than some shitty regexp pleb.

See
>People like you are the opposite of getting work done
And yet people like me created some of the most popularly used programs of the past decade, while people like you have nothing to show for yourselves. Performance doesn't matter outside of systems or programs that work with incredibly large datasets. I'm going to pull that 10mb library and take 10 seconds to find the functionality I need in the documentation for it. Job done in 5 minutes compared to this retard who probably spent an entire day writing some half-baked python script to scrape a forum.

I dont see why people care so much about 10MiBs of dependencies.

I think it's kind of cool now that dependent code is so easy to integrate outside of what is ambiently available on a system that people can easily pull 400 libs in no time flat. Bundling software has never been easier.