Coding Horror

programming and human factors

If You Like Regular Expressions So Much, Why Don't You Marry Them?

All right... I will!

Pee-Wee likes fruit salad so much, he married it

I'm continually amazed at how useful regular expressions are in my daily coding. I'm still working on the MhtBuilder refactoring, and I needed a function to convert all URLs in a page of HTML from relative to absolute:

''' <summary>
''' converts all relative url references
'''    href="myfolder/mypage.htm"
''' into absolute url references
'''    href="http://mywebsite/myfolder/mypage.htm"
''' </summary>
Private Function ConvertRelativeToAbsoluteRefs(ByVal html As String) As String
    Dim r As Regex
    Dim urlPattern As String = _
        "(?<attrib>\shref|\ssrc|\sbackground)\s*?=\s*?" & _
        "(?<delim1>[""']{0,2})(?!#|http|ftp|mailto|javascript)" & _
        "/(?<url>[^""'>]+)(?<delim2>[""']{0,2})"
    Dim cssPattern As String = _
        "@import\s+?(url)*['""(]{1,2}" & _
        "(?!http)\s*/(?<url>[^""')]+)['"")]{1,2}"

    '-- href="/anything" to href="http://www.web.com/anything"
    r = New Regex(urlPattern, _
        RegexOptions.IgnoreCase Or RegexOptions.Multiline)
    html = r.Replace(html, "${attrib}=${delim1}" & _HtmlFile.UrlRoot & "/${url}${delim2}")

    '-- href="anything" to href="http://www.web.com/folder/anything"
    r = New Regex(urlPattern.Replace("/", ""), _
        RegexOptions.IgnoreCase Or RegexOptions.Multiline)
    html = r.Replace(html, "${attrib}=${delim1}" & _HtmlFile.UrlFolder & "/${url}${delim2}")

    '-- @import(/anything) to @import url(http://www.web.com/anything)
    r = New Regex(cssPattern, _
        RegexOptions.IgnoreCase Or RegexOptions.Multiline)
    html = r.Replace(html, "@import url(" & _HtmlFile.UrlRoot & "/${url})")

    '-- @import(anything) to @import url(http://www.web.com/folder/anything)
    r = New Regex(cssPattern.Replace("/", ""), _
        RegexOptions.IgnoreCase Or RegexOptions.Multiline)
    html = r.Replace(html, "@import url(" & _HtmlFile.UrlFolder & "/${url})")

    Return html
End Function

Each regex is repeated because I have to resolve relative URLs starting with forward slashes to the webroot first--and then all remaining relative URLs to the current web folder.
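To see why the order matters, here is a stripped-down sketch of the same two-pass idea. The pattern is deliberately simplified (href only, double quotes only), and the module name, the Resolve function and its urlRoot and urlFolder parameters are made up for illustration; they stand in for _HtmlFile.UrlRoot and _HtmlFile.UrlFolder and are not part of MhtBuilder.

```vb
Imports System.Text.RegularExpressions

Module TwoPassSketch
    ' Simplified, hypothetical pattern: matches href="/something"
    ' unless the value starts with # or http.
    Private Const RootPattern As String = "href=""(?!#|http)/(?<url>[^""]+)"""

    Public Function Resolve(ByVal html As String, _
                            ByVal urlRoot As String, _
                            ByVal urlFolder As String) As String
        ' Pass 1: href="/x" becomes href="<urlRoot>/x"
        html = Regex.Replace(html, RootPattern, _
            "href=""" & urlRoot & "/${url}""")

        ' Pass 2: the same pattern with the slash removed now matches any
        ' remaining relative URL. The URLs rewritten in pass 1 start with
        ' "http", so the lookahead skips them.
        html = Regex.Replace(html, RootPattern.Replace("/", ""), _
            "href=""" & urlFolder & "/${url}""")

        Return html
    End Function
End Module
```

With a root of http://www.web.com and a folder of http://www.web.com/folder, href="/a.htm" ends up as href="http://www.web.com/a.htm" and href="b.htm" ends up as href="http://www.web.com/folder/b.htm". Run the passes in the opposite order and the slash-less pattern grabs "/a.htm" first, producing http://www.web.com/folder//a.htm.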

One of the BCL team recently recommended pretty-printing regular expressions, that is, using whitespace to make regexes more readable with RegexOptions.IgnorePatternWhitespace. I agree completely. We do this all the time with SQL. I can think of a half-dozen tools that will take a block of SQL and pretty-format it -- but I am not aware of any regex tools that offer this functionality. I guess I'll email the author of RegexBuddy and see what he has to say.
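As a rough idea of what that could look like, here is the urlPattern from the function above laid out one piece per line. This is only a sketch: the module name, the ToAbsolute function and its urlRoot parameter are invented for the example. One thing to watch for is that IgnorePatternWhitespace also turns on # comments, so the literal # in the lookahead has to be escaped as \#.

```vb
Imports Microsoft.VisualBasic
Imports System.Text.RegularExpressions

Module PrettyRegexSketch
    ' Each array element is one line of the pattern. The lines are joined
    ' with line feeds so that every "# ..." comment ends at its own line.
    Private ReadOnly PrettyUrlPattern As String = String.Join(vbLf, New String() {
        "(?<attrib>\shref|\ssrc|\sbackground)  # attribute name with leading whitespace",
        "\s*?=\s*?                             # equals sign, optional whitespace",
        "(?<delim1>[""']{0,2})                 # opening quote, if any",
        "(?!\#|http|ftp|mailto|javascript)     # leave anchors and absolute URLs alone",
        "/(?<url>[^""'>]+)                     # root-relative path",
        "(?<delim2>[""']{0,2})                 # closing quote, if any"
    })

    ' Hypothetical helper: rewrites root-relative references only.
    Public Function ToAbsolute(ByVal html As String, ByVal urlRoot As String) As String
        Dim r As New Regex(PrettyUrlPattern, _
            RegexOptions.IgnoreCase Or RegexOptions.IgnorePatternWhitespace)
        Return r.Replace(html, "${attrib}=${delim1}" & urlRoot & "/${url}${delim2}")
    End Function
End Module
```

Whitespace inside a character class still counts as a literal, so the quote classes behave exactly as they do in the one-line version.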

And here's an interesting bit of trivia: did you know that the ASP.NET page parser uses regular expressions?

Written by Jeff Atwood