Converting HTML to XHTML using Hpricot, Nokogiri, and Tidy

Recently, I put together a Ruby script to convert a batch of poorly formed HTML fragments into clean XHTML fragments.

My initial naive implementation used Nokogiri’s DocumentFragment class to parse the HTML and output XHTML. This approach worked fairly well for cases that involved tags that were not properly closed.

However, for fragments containing RHTML (ERB) tags or inline JavaScript, this approach proved too aggressive.

Next, I tried Hpricot.

This was a little better, but the JavaScript was still not properly escaped, so the result was not valid XHTML.

Finally, I tried Tidy. In this implementation, I called the command-line tool directly.
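Calling the command-line tool from Ruby could be sketched roughly as follows. This assumes a `tidy` binary is available on the `PATH`. The helper name `tidy_to_xhtml` and the particular set of options are illustrative choices, not necessarily the ones used in the original script.

```ruby
require "open3"

# Hypothetical helper: pipe an HTML fragment through the `tidy` CLI and
# return the XHTML it writes to standard output.
def tidy_to_xhtml(html, tidy_bin: "tidy")
  args = [
    "-quiet",                   # suppress the summary banner
    "-asxhtml",                 # emit XHTML instead of HTML
    "--show-body-only", "yes",  # output just the fragment, no <html>/<head> wrapper
    "--show-warnings", "no"     # keep warnings out of stderr
  ]
  stdout, stderr, status = Open3.capture3(tidy_bin, *args, stdin_data: html)

  # Tidy exits with 0 on success, 1 if there were warnings, 2 if there were errors.
  code = status.exitstatus
  raise "tidy failed: #{stderr}" if code.nil? || code >= 2

  stdout
end
```

A call such as `tidy_to_xhtml("<p>Hello<br>world")` would return the repaired fragment as a string. Treating exit status 1 as success is deliberate here, since malformed input almost always produces warnings.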

Tidy solved all my problems!
