Regex Performance

I was intrigued by a recent comment from a Microsoft Hotmail developer on the ptifalls they've run into while upgrading Hotmail to .NET 2.0:

Regular Expressions can be very expensive. Certain (unintended and intended) strings may cause RegExes to exhibit exponential behavior. We've taken several hotfixes for this. RegExes are so handy, but devs really need to understand how they work; we've gotten bitten by them.

I'm definitely guilty of this. When I throw a regex together, I never worry about performance; I know the target strings will generally be far too small to ever cause a problem. But you may not need a large string to cause a major performance bottleneck – it's entirely possible to formulate regular expression patterns that consume exponentially more CPU time for each additional input character, as noted in this python mailing list posting:

I'm acutely aware of that one because it burns people regularly. These aren't cases of hostile input, they're cases of innocently "erroneous" input. After maybe a year of experience, people using a backtracking regexp engine usually figure out how to write a regexp that doesn't go resource-crazy when parsing strings that *do* match. Those are the inputs the program expects. But all inputs can suffers errors, and a regexp that works well when the input matches can still go nuts trying to match a non-matching string, consuming an exponential amount of time trying an exponential number of futile backtracking possibilities.

The example provided in that email is a pattern of

(x+x+)+y

Which means, in regex-ese:

  • One or more of the character X
  • One or more of the character X
  • One or more of the previous two matches combined
  • Followed by a single character Y

I tried this pattern in RegexBuddy's debugger first, with a simple 20 character test string:

xxxxxxxxxxxxxxxxxxxx

regexbuddy-debug-screenshot.png

This is no problem in RegexBuddy, because it has an advanced regular expression engine that throttles regex expressions before they hang your machine. Now let's try it in .NET and see what happens:

Dim pattern As String = "(x+x+)+y"
Dim sw As New Stopwatch
sw.Start()
Regex.Match("xxxxxxxxxxxxxy", pattern)
sw.Stop()
Console.Write("Valid match took ")
Console.Write(sw.ElapsedMilliseconds)
Console.WriteLine(" ms")
Dim s As String = "xx"
For n As Integer = 1 To 24
sw.Start()
Regex.Match(s, pattern)
sw.Stop()
Console.Write("Invalid match of ")
Console.Write(s.Length)
Console.Write(" chars took ")
Console.Write(sw.ElapsedMilliseconds)
Console.WriteLine(" ms")
s = s + "x"
Next

Here's the output from this console app:

Valid match took 0 ms
Invalid match of 2 chars took 0 ms
Invalid match of 3 chars took 0 ms
Invalid match of 4 chars took 0 ms
Invalid match of 5 chars took 0 ms
Invalid match of 6 chars took 0 ms
Invalid match of 7 chars took 0 ms
Invalid match of 8 chars took 0 ms
Invalid match of 9 chars took 0 ms
Invalid match of 10 chars took 1 ms
Invalid match of 11 chars took 2 ms
Invalid match of 12 chars took 3 ms
Invalid match of 13 chars took 6 ms
Invalid match of 14 chars took 13 ms
Invalid match of 15 chars took 25 ms
Invalid match of 16 chars took 50 ms
Invalid match of 17 chars took 100 ms
Invalid match of 18 chars took 202 ms
Invalid match of 19 chars took 403 ms
Invalid match of 20 chars took 808 ms
Invalid match of 21 chars took 1624 ms
Invalid match of 22 chars took 3257 ms
Invalid match of 23 chars took 6432 ms
Invalid match of 24 chars took 12857 ms
Invalid match of 25 chars took 25657 ms

Looks like n-squared behavior to me.

I can see where the Hotmail devs would be freaking out about this, considering matching against only 20 characters eats almost a full second of server CPU time! It's too bad the .NET regex engine doesn't implement some kind of throttling or exception behavior when the cost of the regex grows too large.

The author of RegexBuddy calls this catastrophic backtracking, and he has a page describing how to avoid catastrophic backtracking in your regular expressions:

The solution is simple. When nesting repetition operators, make absolutely sure that there is only one way to match the same match. If repeating the inner loop 4 times and the outer loop 7 times results in the same overall match as repeating the inner loop 6 times and the outer loop 2 times, you can be sure that the regex engine will try all those combinations.

The essential part of the customer example he gives is converting a dot (any character) match to a [^,rn] (not comma, not carriage return, not newline) match. In other words – be as specific as possible!

Related posts

Parsing Html The Cthulhu Way

Among programmers of any experience, it is generally regarded as A Bad Ideatm to attempt to parse HTML with regular expressions. How bad of an idea? It apparently drove one Stack Overflow user to the brink of madness: You can't parse [X]HTML with regex. Because HTML can&

By Jeff Atwood ·
Comments

The Problem With URLs

URLs are simple things. Or so you'd think. Let's say you wanted to detect an URL in a block of text and convert it into a bona fide hyperlink. No problem, right? Visit my website at http://www.example.com, it's awesome! To locate

By Jeff Atwood ·
Comments

The Visual Studio IDE and Regular Expressions

The Visual Studio IDE supports searching and replacing with regular expressions, right? Sure it does. It's right there in grey and black in the find and replace dialog. Just tick the "use Regular expressions" checkbox and we're off to the races. However, you'

By Jeff Atwood ·
Comments

Excluding Matches With Regular Expressions

Here's an interesting regex problem [http://techreport.com/forums/viewtopic.php?t=34782]: > I seem to have stumbled upon a puzzle that evidently is not new, but for which no (simple) solution has yet been found. I am trying to find a way to exclude an entire

By Jeff Atwood ·
Comments

Recent Posts

Stay Gold, America

Stay Gold, America

We are at an unprecedented point in American history, and I'm concerned we may lose sight of the American Dream.

By Jeff Atwood ·
Comments
The Great Filter Comes For Us All

The Great Filter Comes For Us All

With a 13 billion year head start on evolution, why haven’t any other forms of life in the universe contacted us by now? (Arrival is a fantastic movie. Watch it, but don’t stop there – read the Story of Your Life novella it was based on for so much

By Jeff Atwood ·
Comments
I Fight For The Users

I Fight For The Users

If you haven’t been able to keep up with my blistering pace of one blog post per year, I don’t blame you. There’s a lot going on right now. It’s a busy time. But let’s pause and take a moment to celebrate that Elon Musk

By Jeff Atwood ·
Comments
The 2030 Self-Driving Car Bet

The 2030 Self-Driving Car Bet

It’s my honor to announce that John Carmack and I have initiated a friendly bet of $10,000* to the 501(c)(3) charity of the winner’s choice: By January 1st, 2030, completely autonomous self-driving cars meeting SAE J3016 level 5 will be commercially available for passenger use

By Jeff Atwood ·
Comments