Printing HTML as Text in Python with Unicode
I just spent longer than I would like to admit trying to track down why my code was barfing on Unicode characters (like the é in café). It turns out that Python 2.3.5 (the version shipped with Mac OS X Tiger) converts HTML charrefs (e.g. &#233; for é) to str characters rather than to unicode characters. Apparently, sgmllib (which is used by htmllib) uses chr() instead of unichr().
For various reasons, I need to be able to take HTML-formatted text and convert it to plain text. The Python standard library includes htmllib, which makes that a fairly easy task. I just had to modify it slightly to get it to handle Unicode. Here’s my solution.
Update: Added a fix for entityrefs, too (e.g. &eacute;).
import codecs
import cStringIO
import formatter
import htmlentitydefs
import htmllib

class UnicodeHTMLParser(htmllib.HTMLParser):
    """ HTMLParser that can handle unicode charrefs """

    entitydefs = dict([ (k, unichr(v)) for k, v in htmlentitydefs.name2codepoint.items() ])

    def handle_charref(self, name):
        """Override builtin version to return unicode instead of binary strings for 8-bit chars."""
        try:
            n = int(name)
        except ValueError:
            self.unknown_charref(name)
            return
        if not 0 <= n <= 255:
            self.unknown_charref(name)
            return
        if 0 <= n <= 127:
            self.handle_data(chr(n))
        else:
            self.handle_data(unichr(n))

def prettyPrintHTML(html):
    """ Strip HTML formatting to produce plain text suitable for printing. """
    sio = cStringIO.StringIO()
    # cStringIO doesn't like Unicode, so wrap with a utf8 encoder/decoder.
    encoder, decoder, reader, writer = codecs.lookup('utf8')
    utf8io = codecs.StreamReaderWriter(sio, reader, writer, 'replace')
    writer = formatter.DumbWriter(utf8io)
    prettifier = formatter.AbstractFormatter(writer)
    parser = UnicodeHTMLParser(prettifier)
    parser.feed(html)
    parser.close()
    utf8io.seek(0)
    result = utf8io.read()
    sio.close()
    utf8io.close()
    return result
vimal Said,
May 25, 2006 @ 1:43 am
Sir,
I am new to Python and was searching for some code that can convert my HTML pages to text. I finally found your code, but I am a bit confused. I am familiar with C programming. In this code I am not able to understand where the HTML page comes in as input. For example, if I have an HTML page on my computer at d:/vimal/test.htm and I want to convert it to text, how do I do that?
vimal Said,
May 25, 2006 @ 2:35 am
One more thing: I have thousands of pages in one folder and I want to convert all of them to text in a separate folder.
James Eagan Said,
May 25, 2006 @ 10:09 am
vimal: The code above just defines a function that implements that functionality. You would need to include it in a (simple wrapper) application to make it useful. The prettyPrintHTML function takes a string containing the HTML in the document as its only parameter, and returns the resulting plain text as a string.
If you’re still having trouble, you might want to check out the Python tutorial.
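For the folder case, a wrapper could look something like the sketch below. It is only a rough outline under a few assumptions: the name convert_folder is made up for illustration, the files are assumed to be UTF-8 (change the encoding argument if they are not), and it uses io.open, which needs a newer Python than 2.3. The conversion function is passed in as a parameter, so prettyPrintHTML or any other function that takes HTML text and returns plain text should fit.

```python
import io
import os


def convert_folder(src_dir, dst_dir, convert, encoding="utf-8"):
    """Write a .txt version of every .htm/.html file in src_dir to dst_dir.

    convert is a function that takes HTML text and returns plain text
    (for example prettyPrintHTML). The name convert_folder and the UTF-8
    default are illustrative assumptions, not part of the original code.
    Returns the paths of the files that were written.
    """
    if not os.path.isdir(dst_dir):
        os.makedirs(dst_dir)
    names = os.listdir(src_dir)
    names.sort()
    written = []
    for name in names:
        base, ext = os.path.splitext(name)
        if ext.lower() not in (".htm", ".html"):
            continue  # skip anything that is not an HTML file
        # Read the whole page as text.
        infile = io.open(os.path.join(src_dir, name), "r", encoding=encoding)
        try:
            html = infile.read()
        finally:
            infile.close()
        # Write the converted text next to the same base name.
        out_path = os.path.join(dst_dir, base + ".txt")
        outfile = io.open(out_path, "w", encoding=encoding)
        try:
            outfile.write(convert(html))
        finally:
            outfile.close()
        written.append(out_path)
    return written
```

With your example, a call along the lines of convert_folder("d:/vimal", "d:/vimal/text", prettyPrintHTML) would read each page in the first folder and write a matching .txt file into the second (the output folder name here is just a placeholder).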
Denne Reed Said,
August 12, 2006 @ 7:44 pm
Need to import formatter for this to work, no?
James Eagan Said,
August 13, 2006 @ 9:32 am
Denne: Yes, you do. Thanks for pointing that out. That’s what I get for not testing after excerpting the code.