Testing With "The Force"

Markdown was one of the humane markup languages that we evaluated and adopted for Stack Overflow. I've been pretty happy with it, overall. So much so that I wanted to implement a tiny, lightweight subset of Markdown for comments as well.

I settled on these three commonly used elements:

*italic* or _italic_
**bold** or __bold__
`code`

I loves me some regular expressions and this is exactly the stuff regex was born to do! It doesn't look very tough. So I dusted off my copy of RegexBuddy and began.

I typed some test data in the test window, and whipped up a little regex in no time at all. This isn't my first time at the disco.

regexbuddy-naive-regex.png

Bam! Yes! Done and done! By gum, I must be a genius programmer!

Despite my obvious genius, I began to have some small, nagging doubts. Is the test phrase...

I would like this to be *italic* please.

... really enough testing?

Sure it is! I can feel in my bones that this thing freakin' works! It's almost like I'm being pulled toward shipping this code by some inexorable, dark, testing ... force. It's so seductively easy!

come-to-the-dark-side-vader.jpg

But wait. I have this whole database of real world comments that people have entered on Stack Overflow. shouldn't I perhaps try my awesome regular expression on that corpus of data to see what happens? Oh, fine. If we must. Just to humor you, nagging doubt. Let's run a query and see.

select Text from PostComments
where dbo.RegexIsMatch(Text, '*(.*?)*') = 1

Which produced this list of matches, among others:

Interesting fact about math: x * 7 == x + (x * 2) + (x * 4), or x + x >> 1 + x >> 2. Integer addition is usually pretty cheap.

Thanks. What I needed was to turn on Singleline mode too, and use .*? instead of .*.

yeah, see my edit - change select * to select RESULT.* one row - are sure you have more than one row item with the same InstanceGUID?

Not your main problem, but you are mix and matching wchar_t and TCHAR. mbstowcs() converts from char * to wchar_t *.

aawwwww.... Brainf**k is not valid. :/

Thank goodness I listened to my midichlorians and let the light side of the testing force prevail here!

come-to-the-light-side-skywalker.jpg

So how do we fix this regex? We use the light side of the force -- brute force, that is, against a ton of test cases! My job here is relatively easy because I have over 20,000 test cases sitting in a database. You may not have that luxury. Maybe you'll need to go out and find a bunch of test data on the internet somewhere. Or write a function that generates random strings to feed to the routine, also known as fuzz testing.

I wanted to leave the rest of this regular expression as an exercise for the reader, as I'm a sick guy who finds that sort of thing entertaining. If you don't -- well, what the heck is wrong with you, man? But I digress. I've been criticized for not providing, you know, "the answer" in my blog posts. Let's walk through some improvements to our italic regex pattern.

First, let's make sure we have at least one non-whitespace character inside the asterisks. And more than one character in total so we don't match the ** case. We'll use positive lookahead and lookbehind to do that.

*(?=S)(.+?)(?<=S)*

That helps a lot, but we can test against our data to discover some other problems. We get into trouble when there are unexpected characters in front of or behind the asterisks, like, say, p*q*r. So let's specify that we only want certain characters outside the asterisks.

(?<=[s^,(])*(?=S)(.+?)(?<=S)*(?=[s$,.?!])

Run this third version against the data corpus, and wow, that's starting to look pretty darn good! There are undoubtedly some edge conditions, particularly since we're unlucky enough to be talking about code in a lot of our comments, which has wacky asterisk use.

This regex doesn't have to be (and probably cannot be, given the huge possible number of human inputs) perfect, but running it against a large set of input test data gives me reasonable confidence that I'm not totally screwing up.

So by all means, test your code with the force -- brute force! It's good stuff! Just be careful not to get sloppy, and let the dark side of the testing force prevail. If you think one or two simple test cases covers it, that's taking the easy (and most likely, buggy and incorrect) way out.

Read more

Stay Gold, America

We are at an unprecedented point in American history, and I'm concerned we may lose sight of the American Dream.

By Jeff Atwood · · Comments

The Great Filter Comes For Us All

With a 13 billion year head start on evolution, why haven't any other forms of life in the universe contacted us by now? (Arrival is a fantastic movie. Watch it, but don't stop there - read the Story of Your Life novella it was based on

By Jeff Atwood · · Comments

I Fight For The Users

If you haven't been able to keep up with my blistering pace of one blog post per year, I don't blame you. There's a lot going on right now. It's a busy time. But let's pause and take a moment

By Jeff Atwood · · Comments

The 2030 Self-Driving Car Bet

It's my honor to announce that John Carmack and I have initiated a friendly bet of $10,000* to the 501(c)(3) charity of the winner’s choice: By January 1st, 2030, completely autonomous self-driving cars meeting SAE J3016 level 5 will be commercially available for passenger

By Jeff Atwood · · Comments