Printing HTML as Text in Python with Unicode

I just spent longer than I would like to admit trying to track down why my code was barfing on Unicode characters (like the é in café). It turns out that Python 2.3.5 (the version shipped with Mac OS X Tiger) converts HTML charrefs (e.g. &#233; for é) to str characters rather than to unicode characters. Apparently, sgmllib (which is used by htmllib) uses chr() instead of unichr().
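To make the two kinds of reference concrete: a charref is numeric (&#233;) and an entityref is named (&eacute;), and both should come out as the same character. The snippet below is only an illustration and assumes a Python 3 interpreter with its html module, not the Python 2.3.5 setup described above.

```python
import html
from html.entities import name2codepoint

# Numeric character reference: the number is the code point itself.
print(html.unescape("caf&#233;"))      # café

# Named entity reference: the name is looked up in a table first.
print(html.unescape("caf&eacute;"))    # café

# The lookup table maps entity names to code points.
print(name2codepoint["eacute"])        # 233
```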

For various reasons, I need to be able to take HTML-formatted text and convert it to plain text. The Python standard library includes htmllib, which makes that a fairly easy task. I just had to modify it slightly to get it to handle Unicode. Here’s my solution.

Update: Added a fix for entityrefs, too (e.g. &eacute;).

import codecs
import cStringIO
import formatter
import htmlentitydefs
import htmllib

class UnicodeHTMLParser(htmllib.HTMLParser):
    """HTMLParser that can handle unicode charrefs and entityrefs."""

    # Map entity names (e.g. eacute) to unicode characters.
    entitydefs = dict([(k, unichr(v)) for k, v in htmlentitydefs.name2codepoint.items()])

    def handle_charref(self, name):
        """Override builtin version to return unicode instead of binary strings for 8-bit chars."""
        try:
            n = int(name)
        except ValueError:
            self.unknown_charref(name)
            return
        if not 0 <= n <= 255:
            self.unknown_charref(name)
            return
        if 0 <= n <= 127:
            self.handle_data(chr(n))
        else:
            self.handle_data(unichr(n))
            
def prettyPrintHTML(html):
    """Strip HTML formatting to produce plain text suitable for printing."""
    sio = cStringIO.StringIO()
    # cStringIO doesn't like Unicode, so wrap with a utf8 encoder/decoder.
    encoder, decoder, reader, writer = codecs.lookup('utf8')
    utf8io = codecs.StreamReaderWriter(sio, reader, writer, 'replace')
    writer = formatter.DumbWriter(utf8io)
    prettifier = formatter.AbstractFormatter(writer)
    parser = UnicodeHTMLParser(prettifier)
    parser.feed(html)
    parser.close()
    utf8io.seek(0)
    result = utf8io.read()
    sio.close()
    utf8io.close()
    return result
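
For readers on Python 3, where htmllib, sgmllib, and cStringIO no longer exist, here is a rough sketch of the same idea using html.parser from the standard library. It is not a line-for-line port of the code above: TextExtractor, BLOCK_TAGS, and html_to_text are names made up for illustration, and the output is much cruder than what formatter.DumbWriter produces (no line wrapping, just a newline before a few block-level tags).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text content of an HTML document (illustrative helper)."""

    # Assumption: these tags are enough to keep paragraphs apart.
    BLOCK_TAGS = {"p", "br", "div", "li", "tr", "h1", "h2", "h3"}

    def __init__(self):
        # convert_charrefs=True makes the parser decode both &#233; and
        # &eacute; to ordinary str characters before handle_data() runs.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Return the text of an HTML string with the markup stripped."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


print(html_to_text("<p>caf&eacute; &#233;</p>"))
```

Python 3 strings are Unicode throughout, so the chr()/unichr() split that caused the original problem does not come up here.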

5 Comments

  1. vimal Said,

    May 25, 2006 @ 1:43 am

    Sir,
    I am new to Python and was searching for code that can convert my HTML pages to text. I finally found yours, but I am confused about one thing. I am familiar with C programming. I cannot see where this code takes the HTML page as input. For example, if I have an HTML page on my computer at d:/vimal/test.htm, how do I convert that page to text?

  2. vimal Said,

    May 25, 2006 @ 2:35 am

    One more thing: I have thousands of pages in one folder, and I want to convert all of them to text in a separate folder.

  3. James Eagan Said,

    May 25, 2006 @ 10:09 am

    vimal: The code above just defines a function that implements that functionality. You would need to include it in a (simple wrapper) application to make it useful. The prettyPrintHTML function takes a string containing the HTML in the document as its only parameter, and returns the resulting plain text as a string.

    If you’re still having trouble, you might want to check out the Python tutorial.
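
A minimal sketch of such a wrapper, covering the folder-to-folder case asked about above. convert_folder is a hypothetical helper, not part of the post's code. It takes the converter as a parameter, so any function that maps an HTML string to a text string can be passed in. It assumes the input files are UTF-8 and a Python recent enough to have io.open and the with statement (2.6 or later), so it will not run on the 2.3.5 interpreter the post targets.

```python
import io
import os


def convert_folder(src_dir, dst_dir, convert):
    """Convert every .htm/.html file in src_dir to a .txt file in dst_dir.

    `convert` is any callable that takes an HTML string and returns text.
    Returns the list of output paths that were written.
    """
    if not os.path.isdir(dst_dir):
        os.makedirs(dst_dir)
    written = []
    for name in sorted(os.listdir(src_dir)):
        base, ext = os.path.splitext(name)
        if ext.lower() not in (".htm", ".html"):
            continue  # leave non-HTML files alone
        # Assumption: the pages are UTF-8; change the encoding if they are not.
        with io.open(os.path.join(src_dir, name), encoding="utf-8") as f:
            html = f.read()
        out_path = os.path.join(dst_dir, base + ".txt")
        with io.open(out_path, "w", encoding="utf-8") as f:
            f.write(convert(html))
        written.append(out_path)
    return written
```

With the function from the post in scope, a call might look like convert_folder("d:/vimal", "d:/vimal/txt", prettyPrintHTML), where the output folder name is only an example.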

  4. Denne Reed Said,

    August 12, 2006 @ 7:44 pm

    Need to import formatter for this to work, no?

  5. James Eagan Said,

    August 13, 2006 @ 9:32 am

    Denne: Yes, you do. Thanks for pointing that out. That’s what I get for not testing after excerpting the code.
