I need to capture everything prior to any html tag there. I anticipated using ([^\s]+), but because the html tags are sometimes present, that captures the tags.
If I use ([^\s]+)(<[^>]*>), I get a situation in which I get two groups: {Basketball04!} and {}
Which is most curious to me, because I would have thought I'd get {Basketball04!} {} which is precisely what I actually need. (Unless of course I can make it not capture the second group at all.)
(\S+?)< Learn about the difference between greedy and non-greedy quantifiers. Also [^\s] = \S
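To make the greedy vs non-greedy point concrete, here is a minimal Python sketch. The sample string and its tags are made up for illustration; OP's real markup isn't shown in the thread.

```python
import re

# Hypothetical sample: the wanted token followed by closing tags.
sample = "Basketball04!</span></div>"

# Greedy: \S+ takes everything, then backtracks to the LAST '<'.
greedy = re.match(r"(\S+)<", sample)

# Non-greedy: \S+? expands one character at a time and stops at the FIRST '<'.
lazy = re.match(r"(\S+?)<", sample)

# Alternative without lazy matching: never let the class consume '<' at all.
negated = re.match(r"([^\s<]+)", sample)

print(greedy.group(1))   # Basketball04!</span>
print(lazy.group(1))     # Basketball04!
print(negated.group(1))  # Basketball04!

# [^\s] and \S are the same character class.
same_class = re.findall(r"[^\s]+", "a b\tc") == re.findall(r"\S+", "a b\tc")
```

The negated class version also matches when no tag follows the token, which the `(\S+?)<` form does not.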
Colton Mitchell
'^[^
Eli Long
Or, for a shorter way of accomplishing the same thing: ([^
Adam Reyes
Beat me to it, and got dubs. Just fucking kill me
Also, I'm proud that nobody has started chimping out about "OMG HTML regex" yet
Adam Gonzalez
i was just about to post a chimpout but instead i will say good luck
Austin Price
use perl for its superior regex
Chase Bennett
The thing is, one of two things will happen. Either:
>The task is simple enough that using regular expressions to extract the information will be faster and actually simpler, or
>The task is complex enough that the person will learn the lesson the proper way--by trying and failing
To me, this one sounds like a very simple task
William Nelson
These days Ruby does PCRE better than the language PCRE was named after
Kayden Martinez
Here's perhaps my favorite regex that happens to do exactly what you need. (.(?!))*.
Jack Peterson
You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML. Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts. so many times but it is not getting to me. Even enhanced irregular regular expressions as used by Perl are not up to the task of parsing HTML. You will never make me crack. HTML is a language of sufficient complexity that it cannot be parsed by regular expressions. Even Jon Skeet cannot parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes. HTML-plus-regexp will liquify the nerves of the sentient whilst you observe, your psyche withering in the onslaught of horror.
Cooper Flores
Rege̿̔̉x-based HTML parsers are the cancer that is killing Jow Forums it is too late it is too late we cannot be saved the trangession of a chi͡ld ensures regex will consume all living tissue (except for HTML which it cannot, as previously prophesied) dear lord help us how can anyone survive this scourge using regex to parse HTML has doomed humanity to an eternity of dread torture and security holes using regex as a tool to process HTML establishes a breach between this world and the dread realm of c͒ͪo͛ͫrrupt entities (like SGML entities, but more corrupt) a mere glimpse of the world of regex parsers for HTML will instantly transport a programmer's consciousness into a world of ceaseless screaming, he comes, the pestilent slithy regex-infection will devour your HTML parser, application and existence for all time like Visual Basic only worse he comes he comes do not fight he com̡e̶s, ̕h̵is un̨ho͞ly radiańcé destroying all enli̍̈́̂̈́ghtenment, HTML tags lea͠ki̧n͘g fr̶ǫm ̡yo͟ur eye͢s̸ ̛l̕ike liquid pain, the song of re̸gular expression parsing will extinguish the voices of mortal man from the sphere I can see it can you see ͚̖͔̙î̩́t͎̩̱͔́̋̀ it is beautiful the final snuffing of the lies of Man ALL IS LOŚ͖̩͇̗̪̏̈́T ALL IS LOST the pon̷y he comes he c̶̮omes he comes the ichor permeates all MY FACE MY FACE ᵒh god no NO NOO̼OO NΘ stop the an*̶̙̤͑̾̾ͫg͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͨe̠s ͎a̧͈͖r̽̾̈́͒͑e not rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ IS̯͈͕̹̘̱ͮ̂ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚N̐Y̡ H̸̡̪̯ͨ͊̽̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̝͖ͭ̏ͮ͟O̮̪̝͍ͮM̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ
Robert Moore
The superior .NET regex can in fact parse HTML with balancing groups. Dead meme.
Cameron Martinez
This is still good shit, but people take it way too seriously and often cite it as gospel to criticize using regular expressions without really understanding the problems. For example, if you're trying to do an actual *parsing* task, e.g. "match text within a block that's nested in a ", then you should definitely not use regular expressions. However, if you're wanting to extract the URLs from a bunch of "" tags, then regular expressions should be fine.
Kevin Butler
/$/g
JavaScript is your Daddy
Aiden Morgan
OP here. Thanks, you guys. I've basically hidden from regex for a while, but holy shit does it ever make my projects work more smoothly, so I'm finally trying to wrap my head around it. I will get there.
Christopher Long
So can ruby, or any other engine that supports recursion, but the syntax will look so shitty, and it will be so error prone, that it's usually a bad idea. At least if you're writing code that you're sharing with anybody.
On another note, I basically learned regular expressions by "parsing" HTML to mass rip videos from porn sites so I often get defensive when people say not to do that.
John Jones
What the fuck is that syntax? Don't you need to escape the second / ? Also, why are you (apparently) trying to match closing tags when that's not what OP asked for at all?
Good luck bro. If you use IRC, #regex on freenode used to be a friendly place, at least when I lurked there a few years back. It may be a good way to ask questions if you need it.
Cameron Jackson
Yeah you have to escape the second /. Then what did OP ask?
Jaxon King
>What the fuck is that syntax? JS syntax, / and /g are not part of the expression. This matches everything between < and >. The opposite of what OP asked.
Jason Bennett
>syntax It looked confusing because some engines use s/../../g syntax to do global substitutions.
Christopher Perez
thanks for that, I do use IRC, though not as much as I used to. Used to post on stackoverflow, but you basically get badgered there instead of taught.
Christian Hill
>Copypasting from SO The absolute state of Jow Forums
Thomas Young
Well it's just a half of that because it's not replacing anything.
Easton Cooper
my $regex__cfg_names = qr"^(\/\*\+\*\/\s+\/\*\|)(.*)\*\/\r?$";
I miss working
Jacob Clark
Parse? Mayhap, but you can definitely tokenize it using most scripting languages' "regexp" and easily throw the tokenized output through a primitive state machine to extract what you need.
The real problem is the name regexp is just really shitty. Which is why I call it explicitly regexp and never a regular expression.
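A rough sketch of that tokenize-then-state-machine idea in Python. The `TOKEN` pattern, the `text_inside` helper and the sample markup are all made up for illustration. It assumes no `>` inside attribute values and no comments or scripts, so it's a toy, not a parser.

```python
import re

# One alternation: either a tag (opening or closing) or a run of text up to the next '<'.
TOKEN = re.compile(
    r"<(?P<close>/?)(?P<name>[a-zA-Z][a-zA-Z0-9]*)[^>]*>|(?P<text>[^<]+)"
)

def text_inside(html, tag):
    """Return non-empty text chunks seen while inside <tag>...</tag>."""
    depth = 0  # the whole "state machine": how many <tag> elements are open
    out = []
    for m in TOKEN.finditer(html):
        text = m.group("text")
        if text is not None:
            if depth > 0 and text.strip():
                out.append(text.strip())
        elif m.group("name").lower() == tag:
            depth += -1 if m.group("close") else 1
            depth = max(depth, 0)  # tolerate stray closing tags
    return out

sample = '<div>skip <span class="pw">Basketball04!</span> skip <span>Hunter2</span></div>'
print(text_inside(sample, "span"))  # ['Basketball04!', 'Hunter2']
```

The regex only splits the input into tokens; the nesting logic lives in the `depth` counter, which is the part a single regular expression can't do on its own.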
Asher Young
Why the fuck wouldn't you just use a browser engine and have it output the src attribute for the images in JSON?
James Roberts
OP here. If you want the actual context of this, the regex is being used in a Python script, which pulls the HTML source from a series of pages in a forum thread.
Certain replies are nested or quoted, which adds the padded closing tags, which I don't need in my output. Once scraped, the list is sorted to remove duplicates. I was manually doing that, at first, but the and other closing elements were causing tons of duplicates to be missed. By not capturing the closing tags, now I am not getting the missed dupes. Yay science.
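For anyone following along, the cleanup described above could look roughly like this. The `scraped` list, the tag names in it and the `strip_trailing_tags` helper are hypothetical; OP's actual script and data aren't posted.

```python
import re

# Hypothetical scraped fragments; some carry closing tags left over from quoted replies.
scraped = [
    "Basketball04!",
    "Basketball04!</span>",
    "Hunter2</div></blockquote>",
    "Hunter2",
]

# Take everything up to the first whitespace or '<', so trailing tags never get captured.
LEADING_TEXT = re.compile(r"[^\s<]+")

def strip_trailing_tags(item):
    m = LEADING_TEXT.match(item)
    return m.group(0) if m else None

cleaned = [c for c in map(strip_trailing_tags, scraped) if c]

# A set removes the duplicates; sorting just makes the output stable.
unique = sorted(set(cleaned))
print(unique)  # ['Basketball04!', 'Hunter2']
```

With the tags gone, entries that differed only by trailing markup collapse into one, which is the missed-dupes problem described above.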
Brody Taylor
Again, why the fuck wouldn't you just use a browser engine for this? Or at least a library designed for parsing HTML? Yay science? More like yay a waste of God damn time.
Chase Hall
it's supposed to match anything that ends with a closing html tag hence the $
Adrian Ward
because you don't know the whole scope of what the script is for, and doing what you suggested is both autistic and unrelated to the task.
I don't need the HTML. I need data from the thread's pages, and the data is formatted specifically. So it is retarded to do anything other than take the pages, make one large ass string of their HTML, and skim off the formatted text into a list. I can then input that list into a database.
Mason Baker
No, what's autistic is wasting several hours and going onto an anonymous forum to get help with a simple problem you could have resolved by using a library, a browser engine or a simple fucking Google search. The fact you were even spoonfed your solution disgusts me. >you don't know the w-whole scope! What kind of fucking 10iq rebuttal is this? No one gives a shit about your scope, don't waste time autistically optimizing something you're clearly doing in a retarded way in the first place. Use a library next time.
Sebastian Campbell
Keep ignoring that retard, user. People like him are the ones responsible for """enterprise""" java and using the left-pad library in NPM. It is autistic as fuck to pull in a fucking browser engine to extract consistently formatted text from a larger chunk of consistently formatted text.
this forum doesn't happen to be bitcointalk? if so... lol good luck with that
Jason James
you mad bro? You sound mad, you mad though bro?
Jackson Wright
or the built in html parsing lib
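Assuming "built in" means Python's standard `html.parser` module, a minimal sketch looks like this. The `TextCollector` class and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collect every non-empty text node and ignore the tags entirely."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

# Hypothetical markup standing in for a scraped forum page.
sample = "<blockquote><span>Basketball04!</span></blockquote><p>Hunter2</p>"

collector = TextCollector()
collector.feed(sample)
collector.close()
print(collector.chunks)  # ['Basketball04!', 'Hunter2']
```

No third-party dependency is needed, and closing tags never show up in the output because the parser delivers them to a separate callback.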
Angel Bennett
ask cloudflare, they know regex
Cameron Thompson
Epic
Blake Bell
Inase you're serious, that's just famous ancient pasta, probably the only well-known pasta to originate on stackoverflow
Andrew Hall
*In case
Levi Reed
You mean 1968
Easton Bell
How so, user?
Landon Carter
Nigger, perl is all regex
Evan Lee
It's autistic as fuck to waste your time using regex to parse HTML instead of using a library or browser engine. Sorry, retard, but you're the irregular, the outcast who thinks it's okay to waste fucking time doing stupid shit like this instead of getting work done. You'd probably paint your house with a fucking toothbrush if you thought it could give you a slightly more even coat.
Hunter Evans
If you need help: regex101.com/ It's actually pretty good.
Christopher Brown
Listen here you dumb nigger. What do you think would solve OP's problem faster, a single fucking regular expression based on clearly visible patterns (he's filtering text that's already broken into strings), or pulling in a DOM parser and figuring out the entire hierarchy of the website, then traversing it? Not to mention needing to acquire the non-portable and less useful knowledge of how to use whichever of the many DOM parsing libraries are available. It's better to use regular expressions for simple text processing. Just because your autistic ass assumes "html parsing" as soon as you see a "
Joshua Barnes
This is something you could do with Python's ElementTree or the SAX parser. Hell, you even have full-blown DOMs. etree supports XPath as well. If you need safe etree, consider defusedxml, which just wraps expat. HTML specifically could probably be better parsed with meme shit like BeautifulSoup
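A small sketch of the ElementTree route, using the standard library's `xml.etree.ElementTree`. It only supports a limited XPath subset, and it only works if the input is well-formed XML, which real forum HTML often is not. The sample markup and the `pw` class name are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment; malformed HTML would raise ParseError here.
sample = (
    '<div>'
    '<span class="pw">Basketball04!</span>'
    '<p>other</p>'
    '<span class="pw">Hunter2</span>'
    '</div>'
)

root = ET.fromstring(sample)

# ".//span[@class='pw']" = every <span> descendant whose class attribute equals "pw".
values = [el.text for el in root.findall(".//span[@class='pw']")]
print(values)  # ['Basketball04!', 'Hunter2']
```

For untrusted input, the defusedxml package mentioned above is the usual drop-in replacement for the parsing call.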
Liam King
XPath 2.0 or even css selectors are a lot easier to reason about and more terse than some shitty regexp pleb.
Luis Jenkins
See >People like you are the opposite of getting work done And yet people like me created some of the most popularly used programs of the past decade, while people like you have nothing to show for yourselves. Performance doesn't matter outside of systems or programs that work with incredibly large datasets. I'm going to pull that 10mb library and take 10 seconds to find the functionality I need in the documentation for it. Job done in 5 minutes compared to this retard who probably spent an entire day writing some half-baked python script to scrape a forum.
Charles Perry
I don't see why people care so much about 10 MiB of dependencies.
I think it's kind of cool now that dependent code is so easy to integrate outside of what is ambiently available on a system that people can easily pull 400 libs in no time flat. Bundling software has never been easier.