How do you convert a Word document into very simple HTML in Python?


A good solution is to upload the document to Google Docs and export the HTML version from there. (The Google Drive API supports both the upload and the HTML export, so this can be scripted.)

Google Docs does a lot of cleanup on export, and Beautiful Soup can be used afterwards to make any further changes, as appropriate. It is one of the most powerful and elegant HTML parsing libraries around.

This is a well-known workflow at news organizations.



I found this web page: http://www.textfixer.com/html/convert-word-to-html.php

It converts formatted text to simple HTML markup, preserving bold, italics, links and paragraphs, without adding tags for font sizes and faces. Exactly what I needed to save some time.
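The same kind of reduction can be sketched locally in Python. The following is a minimal sketch, not how textfixer.com itself works: it uses only the standard library's html.parser, keeps a small whitelist of tags, and drops every attribute except href on links. The names ALLOWED, SKIPPED, SimpleHTML and simplify are made up for this illustration, and the whitelist is an assumption chosen to match "bold, italics, links and paragraphs".

```python
from html import escape
from html.parser import HTMLParser

# Tags kept in the output (an assumed whitelist); any other tag is dropped
# but the text inside it is preserved.
ALLOWED = {"p", "b", "strong", "i", "em", "a"}
# Elements whose text content is discarded entirely.
SKIPPED = {"style", "script", "title"}


class SimpleHTML(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a SKIPPED element

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1
        elif tag in ALLOWED and not self.skip_depth:
            href = dict(attrs).get("href") if tag == "a" else None
            if href:
                # href is the only attribute that survives
                self.out.append('<a href="%s">' % escape(href, quote=True))
            else:
                self.out.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in ALLOWED and not self.skip_depth:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            # re-escape text, since the parser has already decoded entities
            self.out.append(escape(data, quote=False))


def simplify(html):
    """Return html reduced to the whitelisted tags, without styling attributes."""
    parser = SimpleHTML()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling simplify() on Word-exported markup strips the span and font wrappers and the class and style attributes while leaving the text in place. Extending ALLOWED (for example with list or heading tags) is the obvious knob to turn.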



My super-simple app WordOff has an API for cleaning up cruft from Word-exported HTML. You could override the save method of your flatpages model to pipe your HTML through the API the first time it gets saved. Something like this:

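# Python 3 version (urllib2 was merged into urllib.request)
from urllib.parse import urlencode
from urllib.request import Request, urlopen

def decruft(html):
    # POST the HTML to the WordOff API and return the cleaned markup
    data = urlencode({'html': html}).encode('utf-8')
    req = Request('http://wordoff.org/api/clean', data)
    with urlopen(req) as response:
        # assumes the API answers in UTF-8
        return response.read().decode('utf-8')

# save() goes on your flatpages model class (FlatPage here)
def save(self, **kwargs):
    if not self.pk: # only de-cruft when content is first added
        self.content = decruft(self.content)
    super(FlatPage, self).save(**kwargs)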


It depends on how much formatting and how many images you’re dealing with. I do one of two things:

  • Google Docs: Probably the closest you’ll get to the original formatting and usable HTML.
  • Markdown: Abandon formatting. Paste it into a plain text editor, run it through Markdown and fix the rest by hand.


You can also use Abiword or wvWare to convert a Word document to XHTML and then parse it with BeautifulSoup, ElementTree, etc. if you need to process it further. In my experience, Abiword does a pretty good job of converting Word files and produces relatively clean XHTML.
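As an illustration of the parsing step, here is a minimal sketch using the standard library's ElementTree. It assumes the converter's output is well-formed XHTML in the usual XHTML namespace; the function name strip_presentation and the choice of attributes to remove are made up for this example. Real converter output may contain named entities such as &nbsp; that ElementTree will not resolve on its own, so treat this as a starting point only.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def strip_presentation(xhtml):
    """Remove style/class attributes and the XHTML namespace from a document."""
    root = ET.fromstring(xhtml)
    prefix = "{" + XHTML_NS + "}"
    for el in root.iter():
        # Drop the namespace so tags serialize as plain <p>, <b>, ...
        if isinstance(el.tag, str) and el.tag.startswith(prefix):
            el.tag = el.tag[len(prefix):]
        # Remove presentational attributes (an assumed, minimal list)
        for attr in ("style", "class"):
            el.attrib.pop(attr, None)
    return ET.tostring(root, encoding="unicode")
```

The result is the same tree with the styling hooks gone, which is usually enough before handing the markup to a template or a CMS field.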

I should mention that Abiword can be run on the command line, so it’s easy to integrate it in an automated process.
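A minimal sketch of that integration, assuming the abiword binary is installed and on PATH and that its --to=FORMAT option is available in your version (check abiword --help). The helper names abiword_command and convert are made up for this example. As far as I know, AbiWord writes the converted file next to the input with the new extension, but verify that on your system.

```python
import shutil
import subprocess


def abiword_command(doc_path, fmt="html"):
    """Build the argument list for a command-line AbiWord conversion."""
    return ["abiword", "--to=%s" % fmt, doc_path]


def convert(doc_path, fmt="html"):
    """Run the conversion; raises if abiword is missing or exits non-zero."""
    if shutil.which("abiword") is None:
        raise RuntimeError("abiword is not on PATH")
    subprocess.run(abiword_command(doc_path, fmt), check=True)
```

Passing the arguments as a list (rather than a shell string) avoids quoting problems with file names that contain spaces.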


Word 2010 has the ability to “save as filtered web page”. This will eliminate the overwhelming majority of the HTML that Word inserts.
