Thursday, July 15, 2010

Tips & Tricks: How to clean Word generated HTML?

While trying to create tabulated content for one of my articles, I realized that Blogger doesn't have an option to create tables. I thought of hand-coding the HTML markup for the table and styling it with CSS, but I knew that would be an expensive effort. So I opted for the obvious solution: our very own MS Word, which does a fantastic job of creating & formatting content for any kind of documentation one can think of.

Soon, I was documenting the content in the table that I had just created, and once I was done, all I had to do was copy & paste. Oops!! Copy/paste sounds too easy, right? Yes, it does when you are not worried about what's under the hood: the underlying HTML markup & loads of redundant CSS styles which Word generates to present the content. So, before I called it done, I decided to clean out the unnecessary markup, as it makes the HTML very bulky. Another reason to take this pain was to ensure a consistent styling setup for the tables I would be creating in the future.

This led to a series of unsuccessful & tiring attempts which exhausted me to death. That was the point when I was missing Adobe Dreamweaver like hell, as I had used it several times in the past to tidy up Word-generated HTML. I also didn't spare myself for never having learned regular expressions, which come in really handy during such painful moments. But since I had neither Dreamweaver installed nor the regular expression skills, all I was left with was Google.

With a bit of Googling, I came across a few nifty online tools for cleaning Word HTML. They let me paste the Word content and press convert to generate no-frills HTML, without charging a penny. Here is a list of links to a couple of such tools:
For people interested in going the 'regular expression' way, Tim Mackey has put up a great article along with the source code and step-by-step instructions.
And for those who do not want to lose the formatting and are fine with a bit of extra baggage from Word, follow the steps below:
  1. Choose Save As in the Word document that you want to save as HTML.
  2. Select Other Formats.
  3. In the dialog box that opens, select Web Page, Filtered (*.htm; *.html) from the Save as type drop-down menu.
  4. Word will warn you about possible formatting loss, but you may choose to proceed by pressing Yes. Please ensure that you have a copy of the original Word document before confirming.
I did a 'Save As' using both the Web Page, Filtered (*.htm; *.html) and the Web Page (*.htm; *.html) types, with varied results. To my surprise, the filtered HTML turned out to be just 20% of the regular HTML's size, which is really awesome. Hope you will also save some bytes, and do suggest other options you may have tried.
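
For anyone curious about what the 'regular expression' way mentioned above might look like, here is a minimal Python sketch. It is not Tim Mackey's code; the function name clean_word_html and the particular set of patterns are my own assumptions about the clutter Word typically emits (conditional comments, Office namespace tags like <o:p>, class/style/lang attributes and wrapper spans). Regular expressions are not a real HTML parser, so treat this as a rough first pass and check the result by eye.

```python
import re


def clean_word_html(html: str) -> str:
    """Strip common Word-specific clutter from an HTML fragment.

    Hypothetical helper for illustration; the patterns are assumptions
    and may also hit look-alike text outside of tags.
    """
    # Drop HTML comments, including <!--[if gte mso 9]> ... <![endif]--> blocks.
    html = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)
    # Drop namespaced Office tags such as <o:p> and </o:p>, keeping inner text.
    html = re.sub(r"</?\w+:[^>]*>", "", html)
    # Drop class, style and lang attributes (quoted or unquoted values).
    html = re.sub(
        r"""\s+(?:class|style|lang)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""",
        "",
        html,
        flags=re.IGNORECASE,
    )
    # Drop span tags but keep the text they wrap.
    html = re.sub(r"</?span\b[^>]*>", "", html, flags=re.IGNORECASE)
    return html.strip()


if __name__ == "__main__":
    sample = (
        '<!--[if gte mso 9]><xml>junk</xml><![endif]-->'
        '<p class=MsoNormal style="margin:0in">'
        '<span lang=EN-US>Hello<o:p></o:p></span></p>'
    )
    print(clean_word_html(sample))  # prints <p>Hello</p>
```

Structural attributes such as border or width on table cells are left alone, so a pasted table keeps its shape while losing the Mso classes and inline styles.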