Support for “bad” encodings

The ftfy.bad_codecs module gives Python the ability to decode some common, flawed encodings.

Python does not want you to be sloppy with your text. Its encoders and decoders (“codecs”) follow the relevant standards whenever possible, which means that when you get text that doesn’t follow those standards, you’ll probably fail to decode it. Or you might succeed at decoding it for implementation-specific reasons, which is perhaps worse.

There are some encodings out there that Python wishes didn’t exist, which are widely used outside of Python:

  • “utf-8-variants”, a family of not-quite-UTF-8 encodings, including the ever-popular CESU-8 and “Java modified UTF-8”.

  • “Sloppy” versions of character map encodings, where bytes that don’t map to anything will instead map to the Unicode character with the same number.

Simply importing this module, or in fact any part of the ftfy package, will make these new “bad codecs” available to Python through the standard Codecs API. You never have to actually call any functions inside ftfy.bad_codecs.

However, if you want to call something because your code checker insists on it, you can call ftfy.bad_codecs.ok().

A quick example of decoding text that’s encoded in CESU-8:

>>> import ftfy.bad_codecs
>>> print(b'\xed\xa0\xbd\xed\xb8\x8d'.decode('utf-8-variants'))
😍

“Sloppy” encodings

ftfy.bad_codecs.sloppy provides character-map encodings that fill their “holes” in a messy but common way: by outputting the Unicode codepoints with the same numbers.

This is incredibly ugly, and it’s also in the HTML5 standard.

A single-byte encoding maps each byte to a Unicode character, except that some bytes are left unmapped. In the commonly-used Windows-1252 encoding, for example, bytes 0x81 and 0x8D, among others, have no meaning.

Python, wanting to preserve some sense of decorum, will handle these bytes as errors. But Windows knows that 0x81 and 0x8D are possible bytes and they’re different from each other. It just hasn’t defined what they are in terms of Unicode.

Software that has to interoperate with Windows-1252 and Unicode – such as all the common Web browsers – will pick some Unicode characters for them to map to, and the characters they pick are the Unicode characters with the same numbers: U+0081 and U+008D. This is the same as what Latin-1 does, and the resulting characters tend to fall into a range of Unicode that’s set aside for obsolete Latin-1 control characters anyway.

These sloppy codecs let Python do the same thing, thus interoperating with other software that works this way. It defines a sloppy version of many single-byte encodings with holes. (There is no need for a sloppy version of an encoding without holes: for example, there is no such thing as sloppy-iso-8859-2 or sloppy-macroman.)

The following encodings will become defined:

  • sloppy-windows-1250 (Central European, sort of based on ISO-8859-2)

  • sloppy-windows-1251 (Cyrillic)

  • sloppy-windows-1252 (Western European, based on Latin-1)

  • sloppy-windows-1253 (Greek, sort of based on ISO-8859-7)

  • sloppy-windows-1254 (Turkish, based on ISO-8859-9)

  • sloppy-windows-1255 (Hebrew, based on ISO-8859-8)

  • sloppy-windows-1256 (Arabic)

  • sloppy-windows-1257 (Baltic, based on ISO-8859-13)

  • sloppy-windows-1258 (Vietnamese)

  • sloppy-cp874 (Thai, based on ISO-8859-11)

  • sloppy-iso-8859-3 (Maltese and Esperanto, I guess)

  • sloppy-iso-8859-6 (different Arabic)

  • sloppy-iso-8859-7 (Greek)

  • sloppy-iso-8859-8 (Hebrew)

  • sloppy-iso-8859-11 (Thai)

Aliases such as “sloppy-cp1252” for “sloppy-windows-1252” will also be defined.

Five of these encodings (sloppy-windows-1250 through sloppy-windows-1254) are used within ftfy.

Here are some examples, using ftfy.explain_unicode() to illustrate how sloppy-windows-1252 merges Windows-1252 with Latin-1:

>>> from ftfy import explain_unicode
>>> some_bytes = b'\x80\x81\x82'
>>> explain_unicode(some_bytes.decode('latin-1'))
U+0080  \x80    [Cc] <unknown>
U+0081  \x81    [Cc] <unknown>
U+0082  \x82    [Cc] <unknown>
>>> explain_unicode(some_bytes.decode('windows-1252', 'replace'))
U+20AC  €       [Sc] EURO SIGN
U+FFFD  �       [So] REPLACEMENT CHARACTER
U+201A  ‚       [Ps] SINGLE LOW-9 QUOTATION MARK
>>> explain_unicode(some_bytes.decode('sloppy-windows-1252'))
U+20AC  €       [Sc] EURO SIGN
U+0081  \x81    [Cc] <unknown>
U+201A  ‚       [Ps] SINGLE LOW-9 QUOTATION MARK

Variants of UTF-8

This file defines a codec called “utf-8-variants” (or “utf-8-var”), which can decode text that’s been encoded with a popular non-standard version of UTF-8. This includes CESU-8, the accidental encoding made by layering UTF-8 on top of UTF-16, as well as Java’s twist on CESU-8 that contains a two-byte encoding for codepoint 0.

This is particularly relevant in Python 3, which provides no other way of decoding CESU-8 1.

The easiest way to use the codec is to simply import ftfy.bad_codecs:

>>> import ftfy.bad_codecs
>>> result = b'here comes a null! \xc0\x80'.decode('utf-8-var')
>>> print(repr(result).lstrip('u'))
'here comes a null! \x00'

The codec does not at all enforce “correct” CESU-8. For example, the Unicode Consortium’s not-quite-standard describing CESU-8 requires that there is only one possible encoding of any character, so it does not allow mixing of valid UTF-8 and CESU-8. This codec does allow that, just like Python 2’s UTF-8 decoder does.

Characters in the Basic Multilingual Plane still have only one encoding. This codec still enforces the rule, within the BMP, that characters must appear in their shortest form. There is one exception: the sequence of bytes 0xc0 0x80, instead of just 0x00, may be used to encode the null character U+0000, like in Java.

If you encode with this codec, you get legitimate UTF-8. Decoding with this codec and then re-encoding is not idempotent, although encoding and then decoding is. So this module won’t produce CESU-8 for you. Look for that functionality in the sister module, “Breaks Text For You”, coming approximately never.

1

In a pinch, you can decode CESU-8 in Python 2 using the UTF-8 codec: first decode the bytes (incorrectly), then encode them, then decode them again, using UTF-8 as the codec every time. But Python 2 is dead, so use ftfy instead.