Support for “bad” encodings
The ftfy.bad_codecs module gives Python the ability to decode some common, flawed encodings.
Python does not want you to be sloppy with your text. Its encoders and decoders (“codecs”) follow the relevant standards whenever possible, which means that when you get text that doesn’t follow those standards, you’ll probably fail to decode it. Or you might succeed at decoding it for implementation-specific reasons, which is perhaps worse.
There are some encodings out there that Python wishes didn’t exist, which are widely used outside of Python:
“utf-8-variants”, a family of not-quite-UTF-8 encodings, including the ever-popular CESU-8 and “Java modified UTF-8”.
“Sloppy” versions of character map encodings, where bytes that don’t map to anything will instead map to the Unicode character with the same number.
Simply importing this module, or in fact any part of the ftfy package, will make these new “bad codecs” available to Python through the standard codecs API. You never have to actually call any functions inside ftfy.bad_codecs. However, if you want to call something because your code checker insists on it, you can call ftfy.bad_codecs.ok().
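The registration mechanism at work here is the standard library’s codec search hook. As a minimal sketch, the following registers a toy codec under an invented name (`toy-rot13`, which just reuses the stdlib rot13 codec) the same general way ftfy.bad_codecs hooks its codecs in; note that codec names are normalized (lowercased, hyphens turned into underscores) before the search function sees them:

```python
import codecs

def _toy_search(name):
    # The search function receives the normalized codec name.
    if name == 'toy_rot13':
        return codecs.lookup('rot13')  # reuse an existing CodecInfo
    return None  # let other search functions handle unknown names

codecs.register(_toy_search)

# The codec is now available anywhere the codecs API is used.
print(codecs.encode('hello', 'toy-rot13'))  # uryyb
```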
A quick example of decoding text that’s encoded in CESU-8:
>>> import ftfy.bad_codecs
>>> print(b'\xed\xa0\xbd\xed\xb8\x8d'.decode('utf-8-variants'))
😍
Sloppy codecs

ftfy.bad_codecs.sloppy provides character-map encodings that fill their “holes” in a messy but common way: by outputting the Unicode codepoints with the same numbers as the bytes.
This is incredibly ugly, and it’s also in the HTML5 standard.
A single-byte encoding maps each byte to a Unicode character, except that some bytes are left unmapped. In the commonly-used Windows-1252 encoding, for example, bytes 0x81 and 0x8D, among others, have no meaning.
Python, wanting to preserve some sense of decorum, will handle these bytes as errors. But Windows knows that 0x81 and 0x8D are possible bytes and they’re different from each other. It just hasn’t defined what they are in terms of Unicode.
Software that has to interoperate with Windows-1252 and Unicode – such as all the common Web browsers – will pick some Unicode characters for them to map to, and the characters they pick are the Unicode characters with the same numbers: U+0081 and U+008D. This is the same as what Latin-1 does, and the resulting characters tend to fall into a range of Unicode that’s set aside for obsolete Latin-1 control characters anyway.
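This behavior is easy to observe with only the standard library; a quick demonstration of the hole at 0x81:

```python
# 0x81 is a "hole" in Windows-1252: strict decoding rejects it.
try:
    b'\x81'.decode('windows-1252')
except UnicodeDecodeError:
    print('windows-1252 has no mapping for 0x81')

# Latin-1 maps every byte to the codepoint with the same number.
print(repr(b'\x81'.decode('latin-1')))   # '\x81', i.e. U+0081

# A byte Windows-1252 *does* define decodes normally.
print(b'\x80'.decode('windows-1252'))    # €
```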
These sloppy codecs let Python do the same thing, and thus interoperate with other software that works this way. This module defines a sloppy version of many single-byte encodings with holes. (There is no need for a sloppy version of an encoding without holes: for example, there is no such thing as sloppy-iso-8859-2 or sloppy-macroman.)
The following encodings will become defined:
sloppy-windows-1250 (Central European, sort of based on ISO-8859-2)
sloppy-windows-1252 (Western European, based on Latin-1)
sloppy-windows-1253 (Greek, sort of based on ISO-8859-7)
sloppy-windows-1254 (Turkish, based on ISO-8859-9)
sloppy-windows-1255 (Hebrew, based on ISO-8859-8)
sloppy-windows-1257 (Baltic, based on ISO-8859-13)
sloppy-cp874 (Thai, based on ISO-8859-11)
sloppy-iso-8859-3 (Maltese and Esperanto, I guess)
sloppy-iso-8859-6 (different Arabic)
Aliases such as “sloppy-cp1252” for “sloppy-windows-1252” will also be defined.
Five of these encodings are used within ftfy.
Here are some examples, using ftfy.explain_unicode() to illustrate how sloppy-windows-1252 merges Windows-1252 with Latin-1:
>>> from ftfy import explain_unicode
>>> some_bytes = b'\x80\x81\x82'
>>> explain_unicode(some_bytes.decode('latin-1'))
U+0080  \x80    [Cc] <unknown>
U+0081  \x81    [Cc] <unknown>
U+0082  \x82    [Cc] <unknown>

>>> explain_unicode(some_bytes.decode('windows-1252', 'replace'))
U+20AC  €       [Sc] EURO SIGN
U+FFFD  �       [So] REPLACEMENT CHARACTER
U+201A  ‚       [Ps] SINGLE LOW-9 QUOTATION MARK

>>> explain_unicode(some_bytes.decode('sloppy-windows-1252'))
U+20AC  €       [Sc] EURO SIGN
U+0081  \x81    [Cc] <unknown>
U+201A  ‚       [Ps] SINGLE LOW-9 QUOTATION MARK
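The hole-filling itself can be approximated without ftfy, using a custom error handler registered through the standard codecs module. A minimal sketch (the handler name 'sloppy-demo' and the function sloppy_fallback are invented for this illustration; they are not part of ftfy):

```python
import codecs

def sloppy_fallback(err):
    # Map each undecodable byte to the Unicode character with the same
    # number -- the same hole-filling the sloppy-* codecs perform.
    byte = err.object[err.start]
    return chr(byte), err.start + 1

codecs.register_error('sloppy-demo', sloppy_fallback)

print(repr(b'\x80\x81\x82'.decode('windows-1252', 'sloppy-demo')))  # '€\x81‚'
```

Unlike the real sloppy codecs, this handler applies to any encoding you pair it with, but for Windows-1252 it reproduces the output shown in the example above.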
Variants of UTF-8
This file defines a codec called “utf-8-variants” (or “utf-8-var”), which can decode text that’s been encoded with a popular non-standard version of UTF-8. This includes CESU-8, the accidental encoding made by layering UTF-8 on top of UTF-16, as well as Java’s twist on CESU-8 that contains a two-byte encoding for codepoint 0.
This is particularly relevant in Python 3, which provides no other way of decoding CESU-8.
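The layering that produces CESU-8 can be sketched directly: split each astral character into a UTF-16 surrogate pair, then UTF-8-encode each surrogate on its own. (cesu8_encode is a hypothetical helper written for this illustration, not part of ftfy, and it does not produce Java’s two-byte null.)

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8: each astral character becomes six bytes."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Split the codepoint into a UTF-16 surrogate pair...
            cp -= 0x10000
            high, low = 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)
            # ...then UTF-8-encode each surrogate as if it were a character.
            out += chr(high).encode('utf-8', 'surrogatepass')
            out += chr(low).encode('utf-8', 'surrogatepass')
        else:
            out += ch.encode('utf-8')  # BMP characters: ordinary UTF-8
    return bytes(out)

print(cesu8_encode('😍'))  # b'\xed\xa0\xbd\xed\xb8\x8d'
```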
The easiest way to use the codec is to simply import ftfy.bad_codecs:
>>> import ftfy.bad_codecs
>>> result = b'here comes a null! \xc0\x80'.decode('utf-8-var')
>>> print(repr(result).lstrip('u'))
'here comes a null! \x00'
The codec does not at all enforce “correct” CESU-8. For example, the Unicode Consortium’s not-quite-standard describing CESU-8 requires that there is only one possible encoding of any character, so it does not allow mixing of valid UTF-8 and CESU-8. This codec does allow that, just like Python 2’s UTF-8 decoder does.
Characters in the Basic Multilingual Plane still have only one encoding. This codec still enforces the rule, within the BMP, that characters must appear in their shortest form. There is one exception: the sequence of bytes 0xC0 0x80, instead of just 0x00, may be used to encode the null character U+0000.
If you encode with this codec, you get legitimate UTF-8. Decoding with this codec and then re-encoding is not idempotent, although encoding and then decoding is. So this module won’t produce CESU-8 for you. Look for that functionality in the sister module, “Breaks Text For You”, coming approximately never.
In a pinch, you can decode CESU-8 in Python 2 using the UTF-8 codec: first decode the bytes (incorrectly), then encode them, then decode them again, using UTF-8 as the codec every time. But Python 2 is dead, so use ftfy instead.
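The Python 3 analogue of that trick uses the surrogatepass error handler: decode the bytes while letting lone surrogates through, then pair the surrogates up by bouncing the string through UTF-16. A minimal sketch (cesu8_decode is a hypothetical helper for illustration; unlike utf-8-variants, it does not accept the 0xC0 0x80 null):

```python
def cesu8_decode(data: bytes) -> str:
    # Keep the lone surrogates from the CESU-8 three-byte sequences...
    with_surrogates = data.decode('utf-8', 'surrogatepass')
    # ...then join them: the UTF-16 decoder combines valid surrogate pairs.
    return with_surrogates.encode('utf-16', 'surrogatepass').decode('utf-16')

print(cesu8_decode(b'\xed\xa0\xbd\xed\xb8\x8d'))  # 😍
```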