Recently, I put together a Ruby script to convert a batch of poorly formed HTML fragments into clean XHTML fragments.
My first, naive implementation used Nokogiri’s DocumentFragment class to parse the HTML and output XHTML. This approach worked fairly well for fragments whose tags were not properly closed.
However, in other cases, such as fragments containing RHTML or JavaScript, this approach proved too aggressive.
Next, I tried Hpricot.
This was a little better, but the JavaScript was still not properly escaped, so the result was not XHTML-compatible.
Finally, I tried Tidy. In this implementation, I called the command-line tool directly.
Tidy solved all my problems!
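Calling the command-line tool from Ruby can be sketched as below. The exact options I would reach for are an assumption here, not a record of the original script: `-asxhtml` asks for XHTML output, `--show-body-only yes` keeps the result a fragment instead of a full document, and the helper names `tidy_command` and `tidy_fragment` are hypothetical. The sketch also assumes a `tidy` executable is available on the PATH.

```ruby
require "open3"

# Options handed to the tidy command-line tool (an assumed, typical set).
TIDY_ARGS = [
  "-asxhtml",                 # convert the input to XHTML
  "-quiet",                   # suppress nonessential output
  "--show-body-only", "yes",  # emit only the fragment, not a full document
  "--show-warnings", "no",    # keep warnings out of stderr
  "-utf8"                     # treat input and output as UTF-8
].freeze

# Hypothetical helper: build the argument vector for the tidy process.
def tidy_command(executable = "tidy")
  [executable, *TIDY_ARGS]
end

# Hypothetical helper: pipe an HTML fragment through tidy and return the XHTML.
def tidy_fragment(html, executable = "tidy")
  stdout, stderr, status = Open3.capture3(*tidy_command(executable), stdin_data: html)
  # tidy exits with 0 on success, 1 if there were warnings, and 2 on errors.
  code = status.exitstatus
  raise "tidy failed: #{stderr}" if code.nil? || code > 1
  stdout
end
```

With this in place, `tidy_fragment("<p>one<br>two")` returns the cleaned-up fragment as a string. Passing the arguments as an array rather than a single shell string avoids quoting problems, since the fragment itself is sent over standard input.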

