6👍
A good solution is to upload the document to Google Docs and export the HTML version from there. (There must be an API for that?)
Google Docs does a lot of the "clean-up" for you; Beautiful Soup can then be used to make any further changes, as appropriate. In my opinion it is the most powerful and elegant HTML parsing library around.
This is a well-known approach at journalism companies.
4👍
I found this web page: http://www.textfixer.com/html/convert-word-to-html.php
It converts formatted text to simple HTML markup, preserving bold, italics, links and paragraphs, but without adding tags for font sizes and faces. Exactly what I needed to save some time.
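If you would rather not depend on a web page for this, the same idea (keep bold, italics, links and paragraphs, drop everything else) can be sketched with Python's standard library. This is only a rough illustration, not how that site works: the names `WhitelistCleaner`, `simplify_html`, `ALLOWED_TAGS` and `ALLOWED_ATTRS` are made up for the example, and the whitelist is an assumption you would tune for your own content.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: this whitelist mirrors the tags mentioned above
# (bold, italics, links, paragraphs); adjust it to taste.
ALLOWED_TAGS = {"p", "b", "strong", "i", "em", "a"}
ALLOWED_ATTRS = {"a": {"href"}}
# Tags whose text content should be discarded, not just the tag itself.
SKIP_CONTENT = {"style", "script"}


class WhitelistCleaner(HTMLParser):
    """Rebuild HTML keeping only whitelisted tags and attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0  # > 0 while inside <style> or <script>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self._skip += 1
            return
        if tag not in ALLOWED_TAGS:
            return  # drop the tag itself, keep its text content
        kept = [
            (name, value)
            for name, value in attrs
            if name in ALLOWED_ATTRS.get(tag, set()) and value is not None
        ]
        attr_text = "".join(f' {n}="{escape(v, quote=True)}"' for n, v in kept)
        self.parts.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            if self._skip:
                self._skip -= 1
            return
        if tag in ALLOWED_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(escape(data, quote=False))


def simplify_html(html):
    """Return html with only whitelisted tags and attributes kept."""
    cleaner = WhitelistCleaner()
    cleaner.feed(html)
    cleaner.close()  # flush any buffered text
    return "".join(cleaner.parts)
```

Comments (including Word's conditional comments) are dropped because the parser ignores them by default. Real Word output can be messier than this handles, so treat it as a starting point.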
3👍
My super-simple app WordOff has an API for cleaning up cruft from Word-exported HTML. You could override the save method of your flatpages model to pipe your HTML through the API the first time it gets saved. Something like this:
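    # Python 3 version (urllib2 was merged into urllib.request)
    import urllib.parse
    import urllib.request

    def decruft(html):
        """Send Word-exported HTML to the WordOff API and return the cleaned markup."""
        data = urllib.parse.urlencode({'html': html}).encode('utf-8')
        req = urllib.request.Request('http://wordoff.org/api/clean', data)
        with urllib.request.urlopen(req) as response:
            return response.read().decode('utf-8')  # assumes a UTF-8 response

    # On your FlatPage model:
    def save(self, *args, **kwargs):
        if not self.pk:  # only de-cruft when content is first added
            self.content = decruft(self.content)
        super(FlatPage, self).save(*args, **kwargs)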
2👍
It depends on how much formatting and how many images you're dealing with. I do one of a couple of things:
- Google Docs: Probably the closest you'll get to the original formatting and usable HTML.
- Markdown: Abandon formatting. Paste it into a plain text editor, run it through Markdown and fix the rest by hand.
2👍
You can also use AbiWord/wvWare to convert the Word document to XHTML and then parse it with BeautifulSoup/ElementTree/etc. to preprocess it if you need to. In my experience, AbiWord does a pretty good job of converting Word files and produces relatively clean XHTML.
I should mention that AbiWord can be run from the command line, so it's easy to integrate into an automated process.
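As a rough sketch of what that automation might look like from Python: the helper names `build_abiword_command` and `convert_to_html` are made up for this example, and the `--to` and `--to-name` options are as I understand AbiWord's command-line interface, so check `abiword --help` on your install before relying on them.

```python
import subprocess
from pathlib import Path


def build_abiword_command(source, target):
    """Return the argv list for a headless AbiWord conversion to HTML.

    Assumption: the installed AbiWord accepts --to and --to-name.
    """
    return ["abiword", "--to=html", f"--to-name={target}", str(source)]


def convert_to_html(source):
    """Convert a Word file to HTML next to it and return the new path."""
    source = Path(source)
    target = source.with_suffix(".html")
    # check=True raises CalledProcessError if AbiWord exits with an error.
    subprocess.run(build_abiword_command(source, target), check=True)
    return target
```

Calling `convert_to_html("report.doc")` would then write `report.html` alongside the original, ready to be cleaned up further with whichever parser you prefer.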
2👍
Word 2010 has the ability to "save as filtered web page". This will eliminate the overwhelming majority of the HTML that Word inserts.