Cleaning Word’s Nasty HTML

I recently wrote a Word 2003 document that I later turned into a blog post. The transition between Word doc and HTML presented some problems. Word offers two HTML options in its save dialog: “Save as HTML” and “Save as Filtered HTML.” In practice, that means you get to choose between totally nasty HTML and slightly less nasty HTML.

I searched around for existing Word cleanup solutions and found the Textism Word HTML Cleaner and Tim Mackey’s set of regular expressions. The Textism solution is great, but requires a subscription for files over 20 KB. And I wasn’t quite happy with Tim’s regular expressions, either. So I created my own Word HTML cleanup solution.

This C# 2.0 code removes the unnecessary cruft from Word documents saved as HTML, stripping the HTML down to the bare-bones basics:

    using System;
    using System.Collections.Specialized;
    using System.IO;
    using System.Text.RegularExpressions;

    class Program
    {
        static void Main(string[] args)
        {
            if (args.Length == 0 || String.IsNullOrEmpty(args[0]))
            {
                Console.WriteLine("No filename provided.");
                return;
            }
            string filepath = args[0];
            // A bare filename is resolved against the current directory
            if (Path.GetFileName(filepath) == args[0])
            {
                filepath = Path.Combine(Environment.CurrentDirectory, filepath);
            }
            if (!File.Exists(filepath))
            {
                Console.WriteLine("File doesn't exist.");
                return;
            }
            string html = File.ReadAllText(filepath);
            Console.WriteLine("input html is " + html.Length + " chars");
            html = CleanWordHtml(html);
            html = FixEntities(html);
            // The output name has no directory part, so it lands in the current directory
            filepath = Path.GetFileNameWithoutExtension(filepath) + ".modified.htm";
            File.WriteAllText(filepath, html);
            Console.WriteLine("cleaned html is " + html.Length + " chars");
        }

        static string CleanWordHtml(string html)
        {
            StringCollection sc = new StringCollection();
            // Get rid of unnecessary tag spans (comments and title)
            sc.Add(@"<!--(\w|\W)+?-->");
            sc.Add(@"<title>(\w|\W)+?</title>");
            // Get rid of classes and styles
            sc.Add(@"\s?class=\w+");
            sc.Add(@"\s+style='[^']+'");
            // Get rid of unnecessary tags
            sc.Add(
                @"<(meta|link|/?o:|/?style|/?div|/?st\d|/?head|/?html|body|/?body|/?span|!\[)[^>]*?>");
            // Get rid of empty paragraph tags
            sc.Add(@"(<[^>]+>)+&nbsp;(</\w+>)+");
            // Remove bizarre v: element attached to <img> tag
            sc.Add(@"\s+v:\w+=""[^""]+""");
            // Remove extra lines
            sc.Add(@"(\n\r){2,}");
            // Patterns are applied in order; each match is replaced with nothing
            foreach (string s in sc)
            {
                html = Regex.Replace(html, s, "", RegexOptions.IgnoreCase);
            }
            return html;
        }

        static string FixEntities(string html)
        {
            NameValueCollection nvc = new NameValueCollection();
            nvc.Add("\u201C", "&ldquo;"); // left curly double quote
            nvc.Add("\u201D", "&rdquo;"); // right curly double quote
            nvc.Add("\u2014", "&mdash;"); // em dash
            foreach (string key in nvc.Keys)
            {
                html = html.Replace(key, nvc[key]);
            }
            return html;
        }
    }
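To see what the individual patterns do, here is a small sketch that runs four of them against a made-up fragment of Word-style markup. The sample string and the names `PatternDemo` and `Strip` are invented for illustration; the patterns are taken from `CleanWordHtml` (the last one trimmed down to two alternatives), written with their usual regex escapes such as `\s` and `\w`.

```csharp
using System;
using System.Text.RegularExpressions;

static class PatternDemo
{
    // Applies a subset of the cleanup patterns, in order, to the input.
    public static string Strip(string html)
    {
        string[] patterns = new string[]
        {
            @"<!--(\w|\W)+?-->",      // comments, including conditional comments
            @"\s?class=\w+",          // unquoted class attributes
            @"\s+style='[^']+'",      // single-quoted style attributes
            @"<(/?o:|/?span)[^>]*?>"  // Office namespace tags and spans
        };
        foreach (string p in patterns)
        {
            html = Regex.Replace(html, p, "", RegexOptions.IgnoreCase);
        }
        return html;
    }

    static void Main()
    {
        // Hypothetical sample, loosely modeled on Word's HTML output.
        string sample =
            "<p class=MsoNormal style='margin:0in'><!--[if gte mso 9]>x<![endif]-->" +
            "<span style='font-size:10pt'>Hello<o:p></o:p></span></p>";
        Console.WriteLine(Strip(sample)); // prints <p>Hello</p>
    }
}
```

Note that the class and style patterns only match the quoting shown here (unquoted class values, single-quoted styles). Markup that quotes those attributes differently would pass through untouched.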

Some caveats:

  1. I haven’t tested this with anything but Word 2003 documents saved as HTML, and only a handful of those, though it worked fine on the ones I tried. No guarantees on Word 97, Word 2000, Word XP, et cetera.
  2. Tables, basic formatting, and images are preserved as simple HTML.
  3. This requires .NET 2.0. I used .NET 2.0 because it’s less code: helpers such as File.ReadAllText and String.IsNullOrEmpty are new in that release.
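
Given how lightly this is tested, one way to check a document of your own is to list which tags survive cleaning and look for anything unexpected. The sketch below is only an assumption about how you might do that and is not part of the tool itself; `TagInventory` and `CountTags` are hypothetical names, and the sample string stands in for the contents of a cleaned file.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TagInventory
{
    // Counts opening tags by lowercased name. Closing tags, comments, and
    // declarations are skipped because they do not start with "<" + letter.
    public static SortedDictionary<string, int> CountTags(string html)
    {
        SortedDictionary<string, int> counts = new SortedDictionary<string, int>();
        foreach (Match m in Regex.Matches(html, @"<([a-zA-Z][\w:]*)"))
        {
            string name = m.Groups[1].Value.ToLowerInvariant();
            int n;
            counts.TryGetValue(name, out n); // n stays 0 when the name is new
            counts[name] = n + 1;
        }
        return counts;
    }

    static void Main()
    {
        // Hypothetical cleaned output; in practice, read the .modified.htm file.
        string cleaned =
            "<p>Intro</p><table><tr><td>1</td><td>2</td></tr></table>" +
            "<p><img src=\"a.png\"></p>";
        foreach (KeyValuePair<string, int> kv in CountTags(cleaned))
        {
            Console.WriteLine(kv.Key + ": " + kv.Value);
        }
    }
}
```

If names like `o:p`, `span`, or `st1:place` show up in the counts, one of the patterns missed something in that document.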

If you’re feeling frisky, you can cut and paste the code above to build it yourself. Or you can just download it, lazyweb style:
