How do you convert a Word Document into very simple html in Python?

6👍

✅

A good solution is to upload the document to Google Docs and export the HTML version from it. (There must be an API for that?)

Google Docs does a lot of “clean-up” on its own; Beautiful Soup can be used afterwards to make any further changes, as appropriate. Beautiful Soup is the most powerful and elegant HTML parsing library on the planet.

This is a well-known standard workflow at journalism companies.

👤lprsd

4👍

I found this web page: http://www.textfixer.com/html/convert-word-to-html.php

It converts formatted text to simple HTML markup, preserving bold, italic, links and paragraphs, but without adding tags for font sizes and faces. Exactly what I needed to save some time.
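
If you would rather do the same kind of clean-up locally, here is a minimal sketch of the idea. It is not how the textfixer page actually works. It uses the standard library's html.parser (swapped in here to avoid extra dependencies; Beautiful Soup would do the job too). The names ALLOWED_TAGS, SimpleHtmlCleaner and simplify_html are made up for this example, and the whitelist is an assumption based on the description above.

```python
from html import escape
from html.parser import HTMLParser

# Assumed whitelist: paragraphs, bold, italic, links and line breaks.
ALLOWED_TAGS = {"p", "b", "strong", "i", "em", "a", "br"}
VOID_TAGS = {"br"}  # tags that never get a closing tag
# Tags whose text content is thrown away along with the markup.
SKIPPED_CONTENT = {"style", "script", "title"}


class SimpleHtmlCleaner(HTMLParser):
    """Keeps whitelisted tags (without attributes) and all visible text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return
        href = dict(attrs).get("href") if tag == "a" else None
        if href:
            # href is the only attribute that survives the clean-up.
            self.parts.append('<a href="%s">' % escape(href, quote=True))
        else:
            self.parts.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in SKIPPED_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag not in ALLOWED_TAGS or tag in VOID_TAGS:
            return
        self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            # Re-escape text, since the parser has already decoded entities.
            self.parts.append(escape(data, quote=False))


def simplify_html(html):
    """Return html with everything outside the whitelist stripped."""
    cleaner = SimpleHtmlCleaner()
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.parts)
```

For example, simplify_html('<p class="MsoNormal"><font face="Arial">Hello <b>bold</b></font></p>') gives '<p>Hello <b>bold</b></p>'. Tags that are not whitelisted are dropped but their text is kept, so no content is lost.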

👤DerVO

3👍

My super-simple app WordOff has an API for cleaning up cruft from Word-exported HTML. You could override the save method of your flatpages model to pipe your HTML through the API the first time it gets saved. Something like this:

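import urllib.parse
import urllib.request

def decruft(html):
    # POST the raw HTML to the WordOff API and return the cleaned markup.
    data = urllib.parse.urlencode({'html': html}).encode('utf-8')
    req = urllib.request.Request('http://wordoff.org/api/clean', data)
    with urllib.request.urlopen(req) as response:
        # Assumes the API answers in UTF-8.
        return response.read().decode('utf-8')

# This goes inside your FlatPage model class.
def save(self, **kwargs):
    if not self.pk:  # only de-cruft when content is first added
        self.content = decruft(self.content)
    super(FlatPage, self).save(**kwargs)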
👤tomd

2👍

It depends on how much formatting and how many images you’re dealing with. I do one of a couple of things:

  • Google Docs: Probably the closest you’ll get to the original formatting and usable HTML.
  • Markdown: Abandon formatting. Paste it into a plain text editor, run it through Markdown and fix the rest by hand.
👤Chris Amico

2👍

You can also use Abiword/wvWare to convert a Word document to XHTML and then parse it with BeautifulSoup/ElementTree/etc. to preprocess it if you need to. In my experience, Abiword does a pretty good job of converting Word files and produces relatively clean XHTML files.

I should mention that Abiword can be run on the command line, so it’s easy to integrate it in an automated process.
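
A rough sketch of how that might look from Python, assuming the abiword binary is installed and on your PATH. The --to and --to-name options are as I remember them from the AbiWord man page, so check abiword --help on your version before relying on them. build_abiword_command and word_to_html are hypothetical helper names made up for this example.

```python
import subprocess
from pathlib import Path


def build_abiword_command(source, target):
    """Build the argument list for a headless AbiWord conversion.

    Assumed flags: --to picks the output format by extension,
    --to-name sets the output file path.
    """
    return ["abiword", "--to=html", "--to-name=%s" % target, str(source)]


def word_to_html(source):
    """Convert a Word file to an .html file next to it; return the new path."""
    source = Path(source)
    target = source.with_suffix(".html")
    # check=True raises CalledProcessError if abiword exits with an error.
    subprocess.run(build_abiword_command(source, target), check=True)
    return target
```

Calling word_to_html("report.doc") should leave report.html beside the original, which you can then hand to whatever parser you use for the post-processing step.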

👤Etienne

2👍

Word 2010 has the ability to “save as filtered web page”. This will eliminate the overwhelming majority of the HTML that Word inserts.

👤Greg Burdett
