7 minute read

Sometimes you hear and accept advice but you don’t really know the details behind it. That has always been the case with me and high CPU situations caused by badly formatted regular expressions. The reason I haven’t really bothered to find out why is because I’ve never had a case on it until today.

It’s actually not that I haven’t been interested, in fact regular expressions fascinate me in a geeky way. My favorite class in college was a class on automata theory and formal languages a class about finite state machines, turing machines, and regular expression like functions and constructs.

The problem that I got involved a very complex regular expression, but rather than using that one for this case study, I will use a very simple regular expression that the BCL (Base Class Library) team blogged about here, since it shows the same problem.

Disclaimer: Before I go on, I just want to say that I am by no means an expert on regular expressions, so I may be missing some important points in the discussion or have some details wrong, and if you have regex specific questions you’re probably better of directing them to the BCL blog.

Overview of problem

The code I’m using looks like this

Regex slowRegex = new Regex("([a-z]+)*=");
Match m = slowRegex.Match("helloworldhelloworld");

The problem and solution is already given in the BCL post but here is a short description of what happens

The .NET Regular Expression Engine is an NFA (non-deterministic finite automation engine) which is great since it gives developers a great deal of flexibility when creating regular expressions. The drawbacks of an NFA vs. a DFA is that some regular expressions like the one above can take an extremely long time to run.

As David writes on the BCL blog the expression above will

  1. match any character a-z, one or more times
  2. match step #1 zero or more times
  3. match an equals

It will quickly go through steps 1 and 2 and then find out that it can’t match 3, then it has to backtrack and retry steps 1 and 2 in a different way, and because of the nested quantifiers ((..+)) there is an almost endless number of combinations it has to backtrack a lot, and the string will match to all of these combinations of ([a-z]+) so you have to go through the whole string for each combination to realize that the string isn’t a match to the whole regexp.

There are a few things to point out here

  • If it would have found a match early on the call would have been really quick, but since it doesn’t find a match it will go through all combinations of (one or more characters)*(zero or more times)
  • The number of combination grows exponentially with the length of the string so it will be fairly quick for a 3 char string, but painfully slow for a 20 char string.
  • The expression above could easily be changed to [a-z]*= resulting in a very performant regexp, and although this is fairly easy to spot on this particular expression, it is way harder to spot the transformation to a regexp without nested quantifiers for a more complex regexp. To quickly explain backtracking, consider the following regular expression and string
Regexp: (abe|abc)(dg|dc|de)
String: abcde

This regular expression will match any string that starts with abe or abc, and ends with any of dg, dc, or de, which could be rewritten as ab(e|c)d(g|c|e) to avoid the backtracks below.

The order of execution here will be something like

  1. a matches a in abe
  2. b matches b in abe
  3. c doesnt match e in abe, backtrack to try with the next expression
  4. c matches c in abc - done with first part
  5. d matches d in dg
  6. e doesn’t match g in dg, backtrack to try with the next expression
  7. e doesn’t match c in dc, backtrack to try with the next expression
  8. e matches e in de - done with the regular expression

In other words we backtracked 3 times to match the expression. The reason I bring this up is because the number of backtracks is essential to figuring out how performant the regexp is.

Debugging the issue

What I want to show you in this post is how to debug it and identify it with windbg and sos along with some information that may help you figure out how to optimize your regular expressions.

If I test with a fairly short string (about 20 chars) the regular expression will chug away for a good few minutes before returning to me that there are no matches.

If I attach windbg when CPU usage is high and run !clrstack on the only active thread (thread 0 in my case) I get this.

OS Thread Id: 0x90c (0)
ESP       EIP
0016eb50 7a4c5b28 System.Text.RegularExpressions.RegexInterpreter.Go()
0016ec84 7a4c5135 System.Text.RegularExpressions.RegexRunner.Scan(System.Text.RegularExpressions.Regex, System.String, Int32, Int32, Int32, Int32, Boolean)
0016ecb0 7a4b12b3 System.Text.RegularExpressions.Regex.Run(Boolean, Int32, System.String, Int32, Int32, Int32)
0016ece0 7a4b147c System.Text.RegularExpressions.Regex.Match(System.String)
0016ecec 0093050f BadRegexp.Form1.btnRun_Click(System.Object, System.EventArgs)
0016ed0c 7b062e97 System.Windows.Forms.Control.OnClick(System.EventArgs)
0016ed1c 7b11f4f9 System.Windows.Forms.Button.OnClick(System.EventArgs)
0016ed28 7b11f608 System.Windows.Forms.Button.OnMouseUp(System.Windows.Forms.MouseEventArgs)
0016ed50 7b0eaba2 System.Windows.Forms.Control.WmMouseUp(System.Windows.Forms.Message ByRef, System.Windows.Forms.MouseButtons, Int32)
0016ed9c 7b07687f System.Windows.Forms.Control.WndProc(System.Windows.Forms.Message ByRef)
0016eda0 7b08549e [InlinedCallFrame: 0016eda0]
0016ee40 7b0853bb System.Windows.Forms.Button.WndProc(System.Windows.Forms.Message ByRef)
0016ee48 7b07e43d System.Windows.Forms.Control+ControlNativeWindow.OnMessage(System.Windows.Forms.Message ByRef)
0016ee4c 7b07e416 System.Windows.Forms.Control+ControlNativeWindow.WndProc(System.Windows.Forms.Message ByRef)
0016ee60 7b07e21a System.Windows.Forms.NativeWindow.Callback(IntPtr, Int32, IntPtr, IntPtr)
0016f008 00940a0c [NDirectMethodFrameStandalone: 0016f008] System.Windows.Forms.UnsafeNativeMethods.DispatchMessageW(MSG ByRef)
0016f018 7b0882e9 System.Windows.Forms.Application+ComponentManager.System.Windows.Forms.UnsafeNativeMethods.IMsoComponentManager.FPushMessageLoop(Int32, Int32, Int32)
0016f0b8 7b087ea9 System.Windows.Forms.Application+ThreadContext.RunMessageLoopInner(Int32, System.Windows.Forms.ApplicationContext)
0016f130 7b087ce7 System.Windows.Forms.Application+ThreadContext.RunMessageLoop(Int32, System.Windows.Forms.ApplicationContext)
0016f160 7b0688ba System.Windows.Forms.Application.Run(System.Windows.Forms.Form)
0016f170 009300a8 BadRegexp.Program.Main()
0016f394 79e79dd3 [GCFrame: 0016f394]

This stack tells us that we are in the middle of matching a regular expression, and that the reason we are here is because we called Regex.Match in btnRun_Click

If I let it continue running and break in a while later I see the exact same stack but the instruction pointer (EIP) in Go() has changed so i can tell that it is moving but appears to be stuck in some type of loop inside the Go function or potentially around it.

To figure out what regular expression we are running and what string we are parsing we can run !dso and dump out some of the objects on the stack

0:000> !dso
OS Thread Id: 0x90c (0)
ESP/REG  Object   Name
...
0016ec90 0152b7ac System.Text.RegularExpressions.Regex
0016ecac 0152c434 System.String    helloworldhelloworld
...
0:000> !do 0152b7ac
...
Fields:
      MT    Field   Offset                 Type VT     Attr    Value Name
790fcb30  400060a        4        System.String  0 instance 0152b784 pattern
...
0:000> !do 0152b784
...
String: ([a-z]+)*=
...

A few iterations later when i break in I see the exact same stack but this time it has a call to System.Text.RegularExpressions.RegexInterpreter.Backtrack on top, and based on the discussion above this looks extremely interesting.

In order to figure out if we are looping inside the Go function or outside of it I set a breakpoint on the Go function and on the Backtrack function.

0:000> !do 0152c47c
Name: System.Text.RegularExpressions.RegexInterpreter
MethodTable: 7a764e3c
...
0:000> !dumpmt -md 7a764e3c
...
MethodDesc Table
   Entry MethodDesc      JIT Name
...
7a4c5610   7a826100   PreJIT System.Text.RegularExpressions.RegexInterpreter.Go()
...
7a4c7a10   7a826040   PreJIT System.Text.RegularExpressions.RegexInterpreter.Backtrack()
...
0:000> !bpmd -md 7a826100
MethodDesc = 7a826100
Setting breakpoint: bp 7A4C5610 [System.Text.RegularExpressions.RegexInterpreter.Go()]
0:000> !bpmd -md 7a826040
MethodDesc = 7a826040
Setting breakpoint: bp 7A4C7A10 [System.Text.RegularExpressions.RegexInterpreter.Backtrack()]

Then I modify the breakpoints so that we just log rather than breaking in,

0:000> bl
 0 e 7a4c5610     0001 (0001)  0:**** System_ni!System.Text.RegularExpressions.RegexInterpreter.Go()
 1 e 7a4c7a10     0001 (0001)  0:**** System_ni!System.Text.RegularExpressions.RegexInterpreter.Backtrack()
0:000> bp 7a4c5610 ".echo Go;g"
breakpoint 0 redefined
0:000> bp 7a4c7a10 ".echo Backtrack;g"
breakpoint 1 redefined

and finally, here are the results (I hope I got all the calculations correct when creating the formulas :))…

Number of characters Go breakpoint (# times) Backtrack breakpoint (# times)
1 1 10
2 2 10+19=29
3 3 10+19+37=66
n n (9*2^n)+n-9

so the number of backtracks grows exponentially with the length of the string

If we set the same breakpoints for the regexp ([a-z])*= we get the following results

Number of characters Go breakpoint (# times) Backtrack breakpoint (# times)
1 1 10
2 2 10+14
3 3 10+14+18
n n 14n-4

So here the number of backtracks doesn’t grow exponentially which is why it performs a lot better even for long strings

Interestingly enough, for the slightly modified but equivalent regexp [a-z]*= the number of backtracks is 4n-1. It doesn’t really make a huge difference and I don’t know why there is a difference but still I found it interesting:)

Summary

When you work with regular expressions that use an NDA engine you should take extra care to try to avoid or decrease the following 3 elements

  • nested quantifiers (*, +, {m,n})
  • backtracks
  • ambiguous matches (i.e. in this case the number of combinations that cause a positive match for part of the string)

On a separate note, the technique I used to print out the number of backtracks using breakpoints can be used for other issues too if you need to determine where a tight loop exists.

Until next time, Tess