The dangers of single line regular expressions

85 points
11 days ago
by thunderbong

Comments


ufmace

Seems to me this is more about the danger of passing anything derived from user input into the TEMPLATE side of a templating engine. Why in the world would you ever do that?!?

Obviously if you pass data into the variable side of the engine, you hardly have to worry about it at all, since it's already going into a place that was designed for handling arbitrary and possibly-hostile input and been battle-tested at doing it correctly in Production for many years. If you pass it into the template side, you're betting that you can be as good as dozens of templating engine writers working for a decade at doing that, in exchange for, well, I can't really think of any possible legitimate advantage for doing that.

11 days ago

klysm

What if you want to allow users to regex search their documents?

11 days ago

ses1984

Do it on the client side?

Do it in a sandbox and have aggressive timeouts.

11 days ago

klysm

> Do it on the client side?

Not practical given a large number of documents.

> Do it in a sandbox and have aggressive timeouts.

Sure! I was just replying to this:

> the danger of passing anything derived from user input into the TEMPLATE side of a templating engine. Why in the world would you ever do that?!?

9 days ago

neilk

In my experience `$` does reliably mean end of string for regular expressions, unless you specifically ask for "multiline" mode.

Ruby seems to be in multiline mode all the time?

    $ python -c 'import re; print "yes" if re.match(r"^[a-z ]+$", "foobar") else "no"'
    yes
    $ python -c 'import re; print "yes" if re.match(r"^[a-z ]+$", "foo\nbar") else "no"'
    no
    $ python -c 'import re; print "yes" if re.match(r"^[a-z ]+$", "foo\nbar", re.M) else "no"'
    yes

    $ perl -le 'print "foobar" =~ /^[a-z ]+$/ ? "yes" : "no"'
    yes
    $ perl -le 'print "foo\nbar" =~ /^[a-z ]+$/ ? "yes" : "no"'
    no
    $ perl -le 'print "foo\nbar" =~ /^[a-z ]+$/m ? "yes" : "no"'
    yes

    $ node -e 'console.log(/^[a-z ]+$/.test("foobar") ? "yes" : "no")'
    yes
    $ node -e 'console.log(/^[a-z ]+$/.test("foo\nbar") ? "yes" : "no")'
    no
    $ node -e 'console.log(/^[a-z ]+$/m.test("foo\nbar") ? "yes" : "no")'
    yes

    $ ruby -e 'if "foobar" =~ /^[0-9a-z ]+$/i then puts "yes" else puts "no" end'
    yes
    $ ruby -e 'if "foo\nbar" =~ /^[0-9a-z ]+$/i then puts "yes" else puts "no" end'
    yes

EDIT: this is documented behavior for Ruby. What other languages call multiline mode is the default; you're supposed to use \A and \Z instead. They do have an `/m` but it only affects the interpretation of `.`

https://docs.ruby-lang.org/en/master/Regexp.html#class-Regex...

11 days ago

dwheeler

False. "$" does NOT mean end-of-string in Perl, Python, PHP, Ruby, Java, or .NET. In particular, a trailing newline (at least) is accepted in those languages.

A $ does mean end-of-string in Javascript, POSIX, Rust (if using its usual package), and Go.

I'm working with the OpenSSF best practices working group to create some guidance on this stuff. It's a very common misconception. Stay tuned.

If anyone knows of vulnerabilities caused by this, let me know.

11 days ago

sjrd

`$` does mean end of input in Java, unless you explicitly ask for multiline mode. In the latter case it means `(?=$|\n)` if also in Unix-lines mode, and the horrible `(?=$|(?<!\r)\n|[\r\u0085\u2028\u2029])` otherwise.

I wrote a compiler from Java regex to JavaScript RegExp, in which you'll find that particular compilation scheme [1].

Edit: also quoting from [2]:

> By default, the regular expressions ^ and $ ignore line terminators and only match at the beginning and the end, respectively, of the entire input sequence. If MULTILINE mode is activated then ^ matches at the beginning of input and after any line terminator except at the end of input. When in MULTILINE mode $ matches just before a line terminator or the end of the input sequence.

[1] https://github.com/scala-js/scala-js/blob/eb160f1ef113794999...

[2] https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pa...

11 days ago

sjrd

OK it seems they changed the doc since. In the docs for JDK 21 we read instead [1]:

> If MULTILINE mode is not activated, the regular expression ^ ignores line terminators and only matches at the beginning of the entire input sequence. The regular expression $ matches at the end of the entire input sequence, but also matches just before the last line terminator if this is not followed by any other input character. Other line terminators are ignored, including the last one if it is followed by other input characters.

Looks like I have some code to fix.

[1] https://docs.oracle.com/en%2Fjava%2Fjavase%2F21%2Fdocs%2Fapi...

11 days ago

thaumaturgy

    $ php -v
    PHP 8.2.7 (cli) (built: Jun  9 2023 19:37:27) (NTS)
    $ php -r 'var_dump(preg_match("/^[a-z0-9 ]+\$/", "hello world"));'
    int(1)
    $ php -r 'var_dump(preg_match("/^[a-z0-9 ]+\$/", "hello\nworld"));'
    int(0)
    $ php -r 'var_dump("hello\nworld");'
    string(11) "hello
    world"
    ...
    $ php -v
    PHP 7.2.26-1+0~20191218.33+debian8~1.gbpb5a34b (cli) (built: Dec 18 2019 16:09:52) ( NTS )
    $ php -r 'var_dump(preg_match("/^[a-z0-9 ]+\$/", "hello world"));'
    int(1)
    $ php -r 'var_dump(preg_match("/^[a-z0-9 ]+\$/", "hello\nworld"));'
    int(0)
    $ php -r 'var_dump("hello\nworld");'
    string(11) "hello
    world"

I'm not sure which version of PHP had the behavior you describe, or whether it misbehaves under more specific conditions, but preg_match() is one of the more commonly-used regex functions, all of which share the same engine. The behavior here seems to be "correct" for at least the last 5 years, for varying interpretations of "correct".

edit: https://3v4l.org/N4o8D suggests that the behavior here is identical for all versions of PHP from 4.3 to 8.3.6.

11 days ago

jasonlotito

So, you are technically correct when you say PHP accepts a trailing newline, but that doesn't refute the comment in the context we are discussing.

This is easily demonstrated with an example.

    <?php
    var_dump(preg_match("/^[a-z0-9 ]+\$/", "hello\n"));
    int(1)

versus

    <?php
    var_dump(preg_match("/^[a-z0-9 ]+\$/", "hello\nworld"));
    int(0)

versus

    <?php
    var_dump(preg_match("/^[a-z0-9 ]+\$/", "hello\n\n"));
    int(0)

Which all makes sense, as by default PHP doesn't operate in multiline mode. So, by default, PHP is not going to fall prey to the same problem being discussed here. In addition, the first \n would be a part of the first line it's on, so including it as a part of the string would make sense. More to the point, in this context, $ does mean end of the string in PHP. You can prove otherwise by getting the 2nd and 3rd example above to output a 1 instead of a 0 without going into multiline mode.

11 days ago

dwheeler

I think we're using different definitions for "end of string".

In PHP, the following is considered true:

> var_dump(preg_match("/^[a-z0-9 ]+\$/", "hello\n"));

That is clear proof that "$" does NOT just match the end of the string; it also accepts an extra newline at the end of the string. In PHP you need to use \z if you want to match the end of the string, or use the "D" flag when using "$".

That definition of "$" is often reasonable when you read a line at a time from a file, which is why Perl changed its definition. However, PHP is often used for server-side web applications. In this case, you are often NOT reading a line at a time from a file. In such cases, allowing an extra newline at the end could be disastrous. The MediaWiki code (written in PHP) deals with this by adding the "D" flag when it uses "$", but I'm not sure it always uses it, and I doubt all PHP programs use this flag when they should.

10 days ago

phyzome

Interesting that a trailing newline is accepted. Not as bad as what's in the post, at least. Definitely worth breaking out which languages do which of those, though! Python, for instance, only accepts a trailing newline but not additional chars beyond that.

I don't think Java should be in your first list, though? Pattern.matches("^foo$", "foo\n") returns false.

11 days ago

dwheeler

Which version of Java (JDK) are you using? Which implementation?

If that's true, then I fear the answer for Java may vary. The O'Reilly book on Regular Expressions, and the JDK documentation for version 21, say clearly that $ permits an optional \n at the end. The Java 8 documentation is murky, and maybe Java 8 is different.

10 days ago

xarope

yes that's correct. I came from perl and python, and got caught out a few times in Go(lang).

11 days ago

js2

The potential trouble with $ (even in single-line mode) is that it matches the end of a string BOTH with AND without a newline at the end. If you're using it to ensure the string has no newline before doing something with it, this can lead to trouble.

    $ python3 -c 'import re; print("yes" if re.search(r"^foo$", "foo") else "no")'
    yes

    $ python3 -c 'import re; print("yes" if re.search(r"^foo$", "foo\n") else "no")'
    yes

    $ python3 -c 'import re; print("yes" if re.search(r"\Afoo\Z", "foo") else "no")'
    yes

    $ python3 -c 'import re; print("yes" if re.search(r"\Afoo\Z", "foo\n") else "no")'
    no

Even if the newline is not problematic, using \A and \Z makes your intentions clearer to the reader, especially if you add re.X and place comments into the pattern.

Asides:

1. Based on syntax, you appear to be testing with python2.

2. With python, re.match is implicitly anchored to the start, so the ^ is redundant. Use re.search or omit the ^.
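
A related sketch, assuming Python 3.4 or later (where `re.fullmatch` is available): it requires the match to span the whole string, so the pattern needs no `^`, `$`, `\A`, or `\Z` at all. The helper name `is_exactly` is made up for illustration.

```python
import re


def is_exactly(pattern: str, text: str) -> bool:
    """Return True only if the whole of `text` matches `pattern`."""
    # re.fullmatch succeeds only when the match covers the entire string,
    # so a trailing newline is not silently tolerated.
    return re.fullmatch(pattern, text) is not None


print(is_exactly(r"[a-z ]+", "foo"))       # True
print(is_exactly(r"[a-z ]+", "foo\n"))     # False
print(is_exactly(r"[a-z ]+", "foo\nbar"))  # False
```

Whether this reads better than explicit \A and \Z is a matter of taste; the behavior for the inputs above should be the same.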

11 days ago

medstrom

Correct me if I'm wrong, but if you extract a capture group (^foo$), you would get "foo" without the "\n", right?

If so, it is not "matching the end of a string" at all. Just end of line. That's exactly as expected in single-line mode, so it's good. May mismatch your expectations in multi-line mode though.

11 days ago

js2

That's right. It all depends on what you're doing with the input string after the match. The point is to be aware of the nuance and to communicate that clearly in the code in cases where it matters.

11 days ago

interroboink

Yeah, my takeaway from this was more "the dangers of Ruby" rather than "the dangers of single line regular expressions" (:

I think the simplest fix would be to use "\Z" rather than "$", which means "match end of input" rather than "end of line." This is also Perl-compatible. So weird that the "$" default meaning is different in Ruby.

I guess one could argue that Ruby's way is better since "$" has a fixed meaning, rather than being context-dependent.

> Ruby seems to be in multiline mode all the time?

Ruby does have a "/m" for multiline mode, but it just makes "." match newline, rather than changing the meaning of "$", it seems.

[1] https://ruby-doc.org/3.2.2/Regexp.html#class-Regexp-label-An...

[2] https://perldoc.perl.org/perlre#Metacharacters

11 days ago

Borg3

In the case of Ruby, it would be best to actually use the result of the match for further computation, like this:

    # keep only the text the regex actually matched
    if !(m = /^[a-z0-9 ]+$/.match(str))
      return "Bad Input"
    end
    str = m[0]

11 days ago

neilk

looks like we both updated our answers as we looked up the docs :)

11 days ago

brobinson

Note that Ruby also has \z which is what you generally want instead of \Z.

(\Z allows a trailing newline, \z does not)

11 days ago

dwheeler

You want \Z in Python, and \z in most other languages, to match on end of string. But in some languages $ really does match end of string. As always, you must check your docs.

11 days ago

sfink

Alternatively, don't validate and then use the original. Instead, pull out the acceptable input and use that.

Even better, compare that to the original and fail validation if they're not identical, but that requires maintaining a higher level of paranoia than may be reasonable to expect.
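
A minimal Python sketch of both steps, assuming a lowercase letters, digits, and space allow-list like the one in the post. The names `ALLOWED` and `extract_clean` are hypothetical.

```python
import re

# Hypothetical allow-list: lowercase letters, digits, and spaces.
ALLOWED = re.compile(r"[a-z0-9 ]+")


def extract_clean(raw: str) -> str:
    """Return the accepted text, or raise if it is not the whole input."""
    m = ALLOWED.match(raw)
    if m is None:
        raise ValueError("no acceptable input found")
    clean = m.group(0)  # use the extracted text from here on, never `raw`
    if clean != raw:    # the paranoid step: anything left over is a failure
        raise ValueError("input contains text outside the allow-list")
    return clean


print(extract_clean("hello world"))  # hello world
```

With this shape, "hello\n<%= ... %>" is rejected because the extracted part ("hello") differs from the original, regardless of how the anchors in the pattern behave.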

11 days ago

Akronymus

> Alternatively, don't validate and then use the original. Instead, pull out the acceptable input and use that.

Parse, don't validate: https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...

11 days ago

sfink

Heh. I wrote up my comment, and then thought "hey, I bet that's what that 'Parse don't validate' article meant, the one I never quite got around to reading." So I pulled it up — great article! — but then didn't post the link because it uses the type system to record the results of the parse. Whereas here, you'd probably parse from a string into another string.

But philosophically I agree, that's exactly the relevant advice.

11 days ago

Akronymus

parsing from a string to a string runs the risk of erroneously assigning the original value to the new string. Which kinda defeats the whole parsing, not validating.

What would work is having a small object holding a readonly string which parses the original on creation, then becomes immutable.
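
Something like this, as a rough Python sketch. The class name `SafeText` and the allowed character set are assumptions for illustration, not anything from the article.

```python
import re
from dataclasses import dataclass

# Hypothetical allow-list: lowercase letters, digits, and spaces.
_SAFE = re.compile(r"[a-z0-9 ]+")


@dataclass(frozen=True)
class SafeText:
    """Holds a string that was checked once, at construction time."""

    value: str

    def __post_init__(self) -> None:
        # fullmatch must cover the entire string, newlines included.
        if _SAFE.fullmatch(self.value) is None:
            raise ValueError("input is not in the allowed character set")


title = SafeText("glow with the flow")
print(title.value)  # glow with the flow
```

Code further down can then ask for a `SafeText` instead of a `str`, which makes it harder to pass the raw input along by accident. A frozen dataclass only blocks ordinary attribute assignment, so this is a convention rather than a hard guarantee.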

11 days ago

wrsh07

This was interesting and new to me, but as other commenters indicate, part of the problem is that we're trying to find the bad thing rather than trying to verify it is the good thing

There's a related concept of "failing open vs failing closed" (fail open: fire exit, fail closed: ranch gate)

In Jurassic park (amazing book/film to understand system failures), when the power goes out, the fence is functionally an open gate

In this case, we shouldn't assume that we can enumerate all possible bad strings (even with a regex)

11 days ago

tsimionescu

I don't think this is a good example, because the regex does just that: it doesn't try to filter out bad input, it specifically only accepts known good input. If the regex did what it was meant to do, only allowing strings composed of ascii letters and numbers, and space, then the code would not have been exploitable.

11 days ago

floxy

Still seems like that is broken. Shouldn't they be escaping whatever control characters? Like if your user wanted to highlight "Now 75% off". Seems like it is reasonable to want to allow that.

11 days ago

tsimionescu

That's a completely different problem: it may be too closed. But it's definitely not a fail open system. It's a fail closed system with a bug.

11 days ago

wrsh07

Whoops you are absolutely right!! Good point, I totally misread the if/else.

11 days ago

floxy

Yeah but the real bug is trying to roll your own, instead of using a `ERB.escape_tainted_input` method or somesuch. Either that method doesn't exist, which seems like a major misfeature, or the author didn't know about it, or didn't want to use it.

11 days ago

wodenokoto

I think it is a surprise that a partial match returns true.

But I guess this is why Python has so many ways of matching a pattern against a string (match, search, fullmatch, findall, I think - they are hard to remember)

11 days ago

roywiggins

I have to look it up every time.

11 days ago

ec109685

Escape the output based on the context a string is being used in versus trying to sanitize for all use cases on input.

This will guarantee that you’re safe no matter how a piece of content is used tomorrow (just need a new escaping function for that content type), and prevent awkward things like not letting users use “unsafe” strings as input. JSX and XHP are example templating systems that understand context and escape appropriately.

If a user wants their title to be “hello%0a%3C%25%3D%20File.open%28%27flag.txt%27%29.read%20%25%3E”, so be it.

Use input validation / parsing to ensure data types aren’t violated, but not as an output safety mechanism.
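
A small Python sketch of what per-context escaping can look like, using only the standard library. The title string is an illustrative payload in the spirit of the one above, and the table name `t` is made up.

```python
import html
import json
import sqlite3

# The same untrusted title, handled differently per output context.
title = "hello\n<%= File.open('flag.txt').read %>"

as_html = html.escape(title)  # HTML text or quoted attribute value
as_json = json.dumps(title)   # a JSON string literal

# SQL: pass the value as a bound parameter instead of escaping by hand.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (title TEXT)")
conn.execute("INSERT INTO t VALUES (?)", (title,))
stored = conn.execute("SELECT title FROM t").fetchone()[0]

print(as_html)
print(as_json)
```

Note that none of this helps if the string is fed in as template source, which is what the post's ERB.new call did; escaping only applies once the value goes in as data.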

11 days ago

Jerrrry

>If a user wants their title to be “hello%0a%3C%25%3D%20File.open%28%27flag.txt%27%29.read%20%25%3E”, so be it.

that's a good way to horizontally propagate/reflect XSS and other Code As Data vulnerabilities.

better to strip the known-bad/problematic characters

https://en.wikipedia.org/wiki/Code_as_data

11 days ago

ec109685

The known problematic characters are different in json, xml, css, html content, html attributes, MySQL, etc. Unless you have output escaping, it is hard to ensure everything gets caught, no matter how the data enters the system.

11 days ago

tsimionescu

Sure, but there is a common set of safe characters that are guaranteed not to cause problems in any of these: the set described by the regex [a-zA-Z0-9 -]. If you can limit user input to this set, you'll drastically reduce the risk of code injection regardless of the stack below you.

11 days ago

phyzome

And that's how you end up pissing off users with apostrophes in their names.

11 days ago

Terr_

"Alright. If you’re gonna go ahead with it, I want to make sure you get one thing right. It’s “O’Neill,” with two L’s. There is another Colonel O’Neil with only one L and he has no sense of humor at all."

11 days ago

Jerrrry

one apostrophe? sure. more than two in a row? no.

Ku' 'Laangah't is valid.

11 days ago

tsimionescu

The output is not the problem here, it is the input. And, if you can get away with it, accepting a small set of known-safe characters is much safer than accepting any character and hoping it will be properly escaped at every level.

When the user hands you a string and you then pass this down to other bits of code, you can't know if it will be used in an SQL query, a regex, in an error message that will be rendered into HTML, etc.

Ideally all layers of your code would handle user input with the utmost care, but that is often very hard to achieve. If you take user input and use it in a regex, it's easy to regex-escape it, but it's much harder to remember that now this whole regex is user input and can't be safely used to, say, construct an SQL query. And even if you remember to properly escape it in the SQL query, it may show up in the returned result, and now if you display that result, you need to be careful to escape it before passing it to some HTML engine.

But then none of this works if you did intend to have some SQL syntax in the regex, or some HTML snippets in the DB: you'd need to make all of these technologies aware of which parts of the expressions are safe and which are tainted by user input.

And this is all just to prevent code injection type attacks. I haven't even discussed more subtle attacks, like using Unicode look-alike characters to confuse other users.
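
For the regex-escaping step mentioned above, a minimal Python sketch. The function name `find_literal` is hypothetical.

```python
import re


def find_literal(needle: str, haystack: str) -> list:
    """Find every literal occurrence of user-supplied `needle`."""
    # re.escape neutralises regex metacharacters such as . ( ) + and $,
    # so the user's text is matched literally, not interpreted.
    pattern = re.compile(re.escape(needle))
    return pattern.findall(haystack)


print(find_literal("a.c", "abc a.c"))  # ['a.c']
```

The escaped pattern is only safe as a regex. It still has to be treated as untrusted if it later ends up in SQL or HTML.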

11 days ago

int_19h

> accepting a small set of known-safe characters is much safer than accepting any character and hoping it will be properly escaped at every level

It's also how you end up with apps that people can't use because they reject their perfectly valid legal name, address etc.

11 days ago

tsimionescu

Absolutely, that's why I said "if you can get away with it". There are situations where this is ok - for example, a company user ID. Even for names, it is often perfectly fine to require someone to spell their name with Latin characters.

For example, even if your name is officially 鳥山 in Japan, you will have to spell it out as Toriyama when you leave Japan, both in formal and informal settings, on paper just as much as in electronic forms, since no one would be able to understand it otherwise. And similarly, if your name is Smith, in Japan you will often have to spell it (and sometimes even pronounce it) スミス.

11 days ago

ec109685

What was actually invalid about that input? Why shouldn’t that html escaped string be shown as is to the user?

I guess I would tweak my first comment and say input filtering is not enough. You must do output filtering to truly be safe.

11 days ago

tsimionescu

I don't get how you use the term "output" here. However you put it, the problem was feeding the user's input to ERB.new(). The most general solution would have been something that accepted any string, but escaped it properly for ERB.new, I think we all agree on that. But that escaping still needs to be done on the input, not the output of ERB.new. If they were able to inject code there, the output doesn't even matter: you've already lost by the time you get a return, the malicious payload has already run, it doesn't matter what its output was.

11 days ago

ec109685

My point was that the web page that the researcher compromised takes input from user and creates a neon version of it.

It should be just fine to pass in any character to the site, so a regex deny list is the wrong approach.

11 days ago

librasteve

Raku (perl6) was a chance for Larry Wall to fix some of the limitations of the perl regex syntax. As you would expect from the perl heritage, it behaves similarly.

    ~ > raku -e 'say "foobar"   ~~ /^ <[a..z ]> +$/ ?? "yes" !! "no"'    
    yes
    ~ > raku -e 'say "foo\nbar" ~~ /^ <[a..z ]> +$/ ?? "yes" !! "no"'  
    no
    ~ > raku -e 'say "foo\nbar" ~~ /^^<[a..z ]>+$$/ ?? "yes" !! "no"'
    yes

- ^^ and $$ are the raku flavour of multiline mode

- ~~ the smartmatch operator binds the regex to the matchee and much more

- character classes are now <[...]> (plain [...] does what (...) does in math)

- perl's triadic x ? y : z becomes x ?? y !! z

We can have whitespace in our regexen now (and comments and multiline regexen)

    my $regex =  rx/ \d ** 4            #`(match the year YYYY) 
                 '-'
                 \d ** 2                # ...the month MM 
                 '-'
                 \d ** 2 /;             # ...and the day DD 
 
    say '2015-12-25'.match($regex);     # OUTPUT: «「2015-12-25」␤»

11 days ago

btilly

Perl has supported whitespace and comments in regular expressions since approximately forever. Just use the /x modifier. All that Raku did was make that flag a default.

The same thing is available in many other languages. They copied it when they copied from Perl. For example Python's https://docs.python.org/3/library/re.html#flags documents that re.X, also called re.VERBOSE, does the same exact thing.

The fact that people don't use it is because few people care to learn regular expressions well enough to even know that it is an option. One of my favorite examples of astounding people with this was when I was writing a complex stored procedure in PostgreSQL. I read https://www.postgresql.org/docs/current/functions-matching.h.... I looked for flags. And yup, there is an x flag. It turns on "extended syntax". Which does the same exact thing. I needed a complex regular expression that I knew my coworkers couldn't have written themselves. So I commented the heck out of it. They couldn't believe that that was even a thing that you could do!
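
For comparison, here is roughly the same date pattern as the Raku example upthread, written as a Python sketch with re.VERBOSE. The group names and the `DATE` constant are my own choices.

```python
import re

# With re.VERBOSE, whitespace in the pattern is ignored and
# everything after an unescaped # is a comment.
DATE = re.compile(
    r"""
    \A
    (?P<year>  \d{4} )   # the year, YYYY
    -
    (?P<month> \d{2} )   # the month, MM
    -
    (?P<day>   \d{2} )   # the day, DD
    \Z
    """,
    re.VERBOSE,
)

m = DATE.match("2015-12-25")
print(m.group("year"), m.group("month"), m.group("day"))  # 2015 12 25
```

The \A and \Z anchors are there so a trailing newline does not slip through, per the rest of this thread.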

11 days ago

librasteve

I think it's fair to say(?) that Larry's adoption of regex as a first class aspect of perl was one of its unique strengths. My opinion is that none of the subsequent languages (Python, Go, Rust, etc) that incorporated regex really embraced it; it was more of a bolt-on. So there has been a syntactic barrier to incorporate or improve regex within those languages. Not so with Larry's raku which had the vision to build on the perl basis and to address many of the inconsistencies that have been baked in elsewhere.

4 days ago

librasteve

I was a full time perl coder with plenty of regex back in the day and never appreciated that (sorry) - so maybe making it the default is a good call

PS. raku has added quite a lot to the regex facilities we are familiar with, not least a straight line to using them in Grammars with rule and token methods that give you control over handling of whitespace in the target

4 days ago

jlv2

More like "the danger of thinking you can trivially validate user-supplied input" before evaluating the string.

11 days ago

cratermoon

Even non-trivially validating it can go wrong. See Log4Shell, e.g.

The bigger problem here is executing user input.

11 days ago

cedws

I pretty much always consider regex expressions as the wrong solution. They're notoriously hard to get right.

There's a whole lot of faulty expressions out there for validating email addresses. I prefer to do less validation and let it fail. If the email address is wrong, whatever service you're using for sending emails will just reject it. If you really do need to validate email addresses, use something somebody else wrote that does it properly.

If you're working with some exotic format for which there isn't already an open source library, do what this guy says: parse it, don't try to validate it with regex: https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...

11 days ago

tsimionescu

Regex works very well for what it was originally designed for: describing/validating regular languages. It can work ok if your language is simple and almost regular. They work very badly for validating non-regular languages, even when extensions are added Perl-style to support that. And, unfortunately, most structured formats you might care to validate are in fact not regular languages at all.

Email addresses in particular are surprisingly complicated and far from being regular languages. I don't know how commonly real servers support the full feature set, but even if they just support non-ascii names they quickly become a pain.

11 days ago

htek

Here, you have good advice: "I ... consider regex expressions as the wrong solution. They're notoriously hard to get right."

However, the conclusion of "use something somebody else wrote that does it properly", while valid, is asking a lot. As regex is hard to get right, don't assume the code you find on the web, in a book, or via some other means works correctly.

My rule is if I didn't write it and can't wrap my head around the code to convince myself it is the right solution, I don't use it. And as I think others have written, there are some interactive online tests for regex expressions that can help.

11 days ago

cedws

I think the average developer has a better chance of finding a robust, battle tested library to do what they need than cooking up some regex of their own. Preferably, the library does not use regex at all and checks data more intelligently.

11 days ago

scarmig

Sometimes valid email addresses will be rejected as invalid, and sometimes invalid email addresses are still successfully delivered. Validation guarantees nothing, and at most it should be a UI cue.

11 days ago

ezekg

Ruby 4 should do what every other sane programming language does and require users to opt into multi-line mode via the /m flag.

The fact that Ruby has this behavior at all is a major security issue.

11 days ago

banish-m4

I once had to explain this class of security vulnerability to IC5-IC7 senior engineers.

0. There is no universal regex language but many.

1. Perl-like ones (Ruby, Perl, and PCRE1/2) contain additional hidden traps.

2. You must rigorously match untrusted input and assume it may include invalid Unicode, control characters, and other oddities.

3. You should replicate frontend and backend validations to ensure they are always exactly consistent and correct, preferably through fuzzing and/or property testing.

11 days ago

hombre_fatal

If this is sufficient for rendering the text as neon:

    @neon = "Glow With The Flow"
    erb :'index'

What exactly is `@neon = ERB.new(params[:neon]).result(binding)` even supposed to be doing?

Why wouldn't it just be:

    @neon = params[:neon]
    erb :'index'

11 days ago

sublinear

> Hire me for a penetration test

When does the blogspam end?

11 days ago

_wire_

Aye.

"Consider every ambiguity of technology as a personal marketing opportunity."

The true topic at hand is that text substitution in scripted services is an eternal hazard of code injection.

The point that regex-based input sanitization doesn't work because everyone misunderstands the token semantics for string termination is made to look like a marvelous mitigation, but this teaching on regex is distracting from an unavoidable hazard of scripting.

Good news for the contractor: he appears like Jesus to shine the Lord's light on the sin of the fathers while dancing by the moral hazard of the priesthood.

Elsewhere another instance of the OP is a service provider pushing business solutions based on the ease of use of scripted service frameworks ("Input sanitization is as simple as a regex!").

These hazards are going to get much worse as AI merges the causes of and solutions to these ambiguities into the same semantic mush.

11 days ago

JonChesterfield

Regular expressions make me sad about our industry.

If you read the early papers, you get a very clear language for pattern matching on sequences. They have really nice properties - the compilation to finite automata gives you decidable equality and decidable minimisation. As in you can compile equivalent regex to exactly the same state machine however they were expressed.

At some point perl happened and that seems to have sent us down a path to encoding the regular expression in an illegible subset of ascii. The backtracking implementation cost us negation and intersection. What should be linear time matching becomes exponential.

Emacs will let you write regex in s-expressions at which point they're much easier to read. Everywhere else has gone with "looks like Perl but has different semantics, which we kind of document, be lucky".

I started writing tests to check that regex I'd begrudgingly converted to the perl style behaved the same under different engines and the divergence is rough. Granted I was parsing regex with regex which is possibly a path to insanity but things like a literal [ were a real puzzle to match on different implementations.

I don't know that the horrible syntax on semantic beauty is due to perl but it looks likely from a superficial standpoint.

11 days ago

jacobolus

If you read the early papers you get a very limiting mathematical tool of mainly theoretical interest. At some point perl happened and regular expressions became a ubiquitous practical tool saving programmers collectively millions of hours of labor.

11 days ago

wlesieutre

Swift's RegexBuilder DSL from a couple years ago gets away from the illegible subset of ASCII.

Easy to explode into a lot of lines, but I'd rather have a 50 line RegexBuilder implementation than try to keep track of what the equivalent single-line version is doing. Especially if you ever have to come back to it later and understand it again.

And if you ever make revisions in RegexBuilder you have useful diffs instead of "the one line that does everything is different than before."

https://developer.apple.com/documentation/regexbuilder

Are there similar tools in any other languages?

11 days ago

JonChesterfield

An alternative to seeking better language APIs.

Parsing regex then pretty-printing the parse tree as s-expressions is very legible. You can also print the parse tree as the original syntax. Postfix will work better for some people, I like the lispy look for parse trees.

Most regex are similar syntax over a parse tree with different parts missing, if you keep track of roughly what features the current engine has in your head the sema checking a real compiler should do could be deferred or incomplete.

Some coding standards will want redundant escapes because that is considered more readable, could put that logic in the pretty-printer.

That's sort of suggesting using your IDE to translate the thing back and forth on the fly instead of persuading colleagues to stop writing in the obfuscated format.

11 days ago

phyzome

Am I just unusual in really liking the usual regex syntax? (I mean, other than how every engine has a slightly different variation on it.) This might just be a matter of familiarity, but I find the s-expression versions harder to read, despite having worked in a Lisp for more than 10 years.

11 days ago
