How to pickle unicodes and save them in utf-8 databases


Pickle data is opaque, binary data, even when you use protocol version 0:

>>> import pickle
>>> pickled_data = pickle.dumps({1: u'\xe9'}, 0)
>>> pickled_data
'(dp0\nI1\nV\xe9\np1\ns.'

When you try to store that in a TextField, Django tries to decode the data as UTF-8 first. That fails because this is not UTF-8 encoded text; it is binary data:

>>> pickled_data.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 9: invalid continuation byte

The solution is to not try to store this in a TextField. Use a BinaryField instead. The Django documentation describes it as:

"A field to store raw binary data. It only supports bytes assignment. Be aware that this field has limited functionality. For example, it is not possible to filter a queryset on a BinaryField value."

The pickled data is exactly such a bytes value (Python 2 str objects are byte strings; the type is called bytes in Python 3).
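A BinaryField maps to a binary column in the database. As a minimal sketch of the same idea without a Django project, the snippet below uses the standard-library sqlite3 module with a BLOB column as a stand-in. The table name `pickles` and column name `payload` are made up for illustration.

```python
import pickle
import sqlite3

data = {1: u'\xe9'}

# In-memory database with a BLOB column, standing in for a BinaryField.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE pickles (id INTEGER PRIMARY KEY, payload BLOB)')

# Protocol 0 still produces bytes; store them untouched, with no decoding step.
pickled_data = pickle.dumps(data, 0)
conn.execute('INSERT INTO pickles (payload) VALUES (?)',
             (sqlite3.Binary(pickled_data),))

# Read the raw bytes back and unpickle them.
(stored,) = conn.execute('SELECT payload FROM pickles').fetchone()
restored = pickle.loads(bytes(stored))
conn.close()
```

Because the column is binary, no codec ever touches the pickle stream, so the UnicodeDecodeError shown earlier cannot occur.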

If you insist on storing the data in a text field, explicitly decode it as latin1; the Latin-1 codec maps bytes one-to-one to Unicode code points:

>>> pickled_data.decode('latin1')
u'(dp0\nI1\nV\xe9\np1\ns.'

and make sure you encode it back to bytes before unpickling:

>>> encoded = pickled_data.decode('latin1')
>>> pickle.loads(encoded)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mj/Development/Libraries/buildout.python/parts/opt/lib/python2.7/pickle.py", line 1381, in loads
    file = StringIO(str)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 9: ordinal not in range(128)
>>> pickle.loads(encoded.encode('latin1'))
{1: u'\xe9'}
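The decode and encode steps are easy to forget, so it may help to wrap them in a pair of helpers. This is only a sketch: the names `pickle_to_text` and `unpickle_from_text` are hypothetical, not part of pickle or Django.

```python
import pickle


def pickle_to_text(obj):
    """Pickle obj and return a Unicode string that is safe for a text column."""
    # Latin-1 maps every byte 0-255 to the code point with the same number.
    return pickle.dumps(obj, 0).decode('latin1')


def unpickle_from_text(text):
    """Reverse pickle_to_text: turn the text back into bytes, then unpickle."""
    return pickle.loads(text.encode('latin1'))


text = pickle_to_text({1: u'\xe9'})
restored = unpickle_from_text(text)
```

Both directions must use latin1. Mixing in any other codec on the way back changes the bytes and breaks the pickle stream.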

Note that if this value makes a round trip to the browser in a text field, the browser is likely to have altered characters in the data. Internet Explorer, for example, replaces \n characters with \r\n because it assumes it is dealing with text.
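To see why that matters, the sketch below simulates the line-ending substitution with a plain string replace (an assumption about what the browser does, for illustration only) and checks whether the original value survives.

```python
import pickle

original = {1: u'\xe9'}
text = pickle.dumps(original, 0).decode('latin1')

# Simulate a browser normalising line endings in a text field.
mangled = text.replace('\n', '\r\n')

try:
    roundtripped = pickle.loads(mangled.encode('latin1'))
except Exception:
    # The altered stream may fail to load at all.
    roundtripped = None

# Either loading failed or the value changed; the original is not recovered.
survived = roundtripped == original
```

Protocol 0 uses newlines as delimiters, so the extra \r either ends up inside the stored values or makes the stream invalid.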

Not that you should ever accept pickle data from a network connection in any case: unpickling untrusted data can execute arbitrary code, which makes it a security hole waiting to be exploited.
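A harmless sketch of the mechanism: an object's `__reduce__` method tells pickle which callable to invoke when the data is loaded. The `Payload` class is hypothetical, and the built-in `len` is used purely as a benign stand-in for whatever an attacker would choose.

```python
import pickle


class Payload(object):
    def __reduce__(self):
        # (callable, args): pickle.loads will call len('abc') at load time.
        return (len, ('abc',))


blob = pickle.dumps(Payload())

# Loading does not rebuild a Payload; it runs the callable and returns its result.
result = pickle.loads(blob)
```

The receiver never sees a Payload instance at all. It simply executes whatever call the sender encoded, which is why pickle data from an untrusted source should not be loaded.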
