ftfy: fixes text for you¶
ftfy fixes Unicode that’s broken in various ways. It works in Python 2.7,
Python 3.2, or later.
The goal of ftfy is to take in bad Unicode and output good Unicode, for use in your Unicode-aware code. This is different from taking in non-Unicode and outputting Unicode, which is not a goal of ftfy. It also isn’t designed to protect you from having to write Unicode-aware code. ftfy helps those who help themselves.
Of course you’re better off if your input is decoded properly and has no glitches. But you often don’t have any control over your input; it’s someone else’s mistake, but it’s your problem now.
ftfy will do everything it can to fix the problem.
Time is marching on. ftfy 4.x supports Python 2.7 and 3.x, but when ftfy 5.0 is released, it will probably only support Python 3.
If you’re running on Python 2, ftfy 4.x will keep working for you. You
don’t have to upgrade to 5.0. You can save yourself a headache by adding
ftfy < 5 to your requirements, making sure you stay on version 4.
See Future versions of ftfy for why this needs to happen.
The most interesting kind of brokenness that ftfy will fix is when someone has encoded Unicode with one standard and decoded it with a different one. This often shows up as characters that turn into nonsense sequences (called “mojibake”):
- The word
schönmight appear as
- An em dash (
—) might appear as
- Text that was meant to be enclosed in quotation marks might end up
instead enclosed in
<9d>represents an unprintable character.
This causes your Unicode-aware code to end up with garbage text because someone else (or maybe “someone else”) made a mistake.
This happens very often to real text. It’s often the fault of software that
makes it difficult to use UTF-8 correctly, such as Microsoft Office and some
programming languages. The
ftfy.fix_encoding() function will look for
evidence of mojibake and, when possible, it will undo the process that produced
it to get back the text that was supposed to be there.
Does this sound impossible? It’s really not. UTF-8 is a well-designed encoding that makes it obvious when it’s being misused, and a string of mojibake usually contains all the information we need to recover the original string.
When ftfy is tested on multilingual data from Twitter, it has a false positive rate of less than 1 per million tweets.
Any given text string might have other irritating properties, possibly even
interacting with the erroneous decoding. The main function of ftfy,
ftfy.fix_text(), will fix other problems along the way, such as:
- The text could contain HTML entities such as
&in place of certain characters, when you would rather see what the characters actually are.
- For that matter, it could contain instructions for a text terminal to
do something like change colors, but you are not sending the text to a
terminal, so those instructions are just going to look like
^[[30m;or something in the middle of the text.
- The text could write words in non-standard ways for display purposes,
such as using the three characters
pfor the word “flop”. This can happen when you copy text out of a PDF, for example.
- It might not be in NFC normalized form. You generally want your text to be NFC-normalized, to avoid situations where unequal sequences of codepoints can represent exactly the same text. You can also opt for ftfy to use the more aggressive NFKC normalization.
Before version 4.0, ftfy used NFKC normalization by default. This covered a
lot of helpful fixes at once, such as expanding ligatures and replacing
“fullwidth” characters with their standard form. However, it also performed
transformations that lose information, such as converting
The default, starting in ftfy 4.0, is to use NFC normalization unless told to use NFKC normalization (or no normalization at all). The more helpful parts of NFKC are implemented as separate, limited fixes.
There are other interesting things that
ftfy can do that aren’t part of
ftfy.fix_text() pipeline, such as:
ftfy.explain_unicode(): show you what’s going on in a string, for debugging purposes
ftfy.fixes.decode_escapes(): do what everyone thinks the built-in
unicode_escapecodec does, but this one doesn’t cause mojibake
Encodings ftfy can handle¶
ftfy can’t fix all possible mix-ups. Its goal is to cover the most common
encoding mix-ups while keeping false positives to a very low rate.
ftfy can understand text that was decoded as any of these single-byte
- Latin-1 (ISO-8859-1)
- Windows-1252 (cp1252 – used in Microsoft products)
- Windows-1251 (cp1251 – the Russian version of cp1252)
- MacRoman (used on Mac OS 9 and earlier)
- cp437 (used in MS-DOS and some versions of the Windows command prompt)
when it was actually intended to be decoded as one of these variable-length encodings:
- CESU-8 (what some people think is UTF-8)
It can also understand text that was intended as Windows-1252 but decoded as Latin-1. That’s the very common case where things like smart-quotes and bullets turn into single weird control characters.
However, ftfy cannot understand other mixups between single-byte encodings, because it is extremely difficult to detect which mixup in particular is the one that happened.
We also can’t handle the non-UTF encodings used for Chinese, Japanese, and
Korean, such as
gb18030. See issue #34 for why this is so
But remember that the input to
ftfy is Unicode, so it handles actual
CJK text just fine. It just can’t discover that a CJK encoding introduced
mojibake into the text.
The main function,
ftfy.fix_text(), will run text through a sequence of
fixes. If the text changed, it will run them through again, so that you can be
sure the output ends up in a standard form that will be unchanged by
All the fixes are on by default, but you can pass options to turn them off. Check that the default fixes are appropriate for your use case. For example:
- You should set
fix_entitiesto False if the output is meant to be interpreted as HTML.
- You should set
fix_character_widthto False if you want to preserve the spacing of CJK text.
- You should set
uncurl_quotesto False if you want to preserve quotation marks with nice typography. You could even consider doing quite the opposite of
uncurl_quotes, running smartypants on the result to make all the punctuation nice.
If the only fix you need is to detect and repair decoding errors (mojibake), then
you should use
Changed in version 4.0: The default normalization was changed from
'NFC'. The new options
fix_latin_ligatures and fix_character_width were added to implement some
of the less lossy parts of NFKC normalization on top of NFC.
fix_text(text, fix_entities='auto', remove_terminal_escapes=True, fix_encoding=True, fix_latin_ligatures=True, fix_character_width=True, uncurl_quotes=True, fix_line_breaks=True, fix_surrogates=True, remove_control_chars=True, remove_bom=True, normalization='NFC', max_decode_length=1000000)¶
Given Unicode text as input, fix inconsistencies and glitches in it, such as mojibake.
Let’s start with some examples:
>>> print(fix_text('uÌˆnicode')) ünicode
>>> print(fix_text('Broken text… it’s ﬂubberiﬁc!', ... normalization='NFKC')) Broken text... it's flubberific!
>>> print(fix_text('HTML entities <3')) HTML entities <3
>>> print(fix_text('<em>HTML entities <3</em>')) <em>HTML entities <3</em>
>>> print(fix_text('\001\033[36;44mI’m blue, da ba dee da ba ' ... 'doo…\033[0m', normalization='NFKC')) I'm blue, da ba dee da ba doo...
>>> # This example string starts with a byte-order mark, even if >>> # you can't see it on the Web. >>> print(fix_text('\ufeffParty like\nit’s 1999!')) Party like it's 1999!
>>> print(fix_text('ＬＯＵＤ ＮＯＩＳＥＳ')) LOUD NOISES
>>> len(fix_text('ﬁ' * 100000)) 200000
>>> len(fix_text('')) 0
Based on the options you provide, ftfy applies these steps in order:
fix_entitiesis True, replace HTML entities with their equivalent characters. If it’s “auto” (the default), then consider replacing HTML entities, but don’t do so in text where you have seen a pair of actual angle brackets (that’s probably actually HTML and you shouldn’t mess with the entities).
remove_terminal_escapesis True, remove sequences of bytes that are instructions for Unix terminals, such as the codes that make text appear in different colors.
fix_encodingis True, look for common mistakes that come from encoding or decoding Unicode text incorrectly, and fix them if they are reasonably fixable. See
uncurl_quotesis True, replace various curly quotation marks with plain-ASCII straight quotes.
fix_latin_ligaturesis True, then ligatures made of Latin letters, such as
ﬁ, will be separated into individual letters. These ligatures are usually not meaningful outside of font rendering, and often represent copy-and-paste errors.
fix_character_widthis True, half-width and full-width characters will be replaced by their standard-width form.
fix_line_breaksis true, convert all line breaks to Unix style (CRLF and CR line breaks become LF line breaks).
fix_surrogatesis true, ensure that there are no UTF-16 surrogates in the resulting string, by converting them to the correct characters when they’re appropriately paired, or replacing them with ufffd otherwise.
fix_control_charactersis true, remove all C0 control characters except the common useful ones: TAB, CR, LF, and FF. (CR characters may have already been removed by the
remove_bomis True, remove the Byte-Order Mark if it exists. (This is a decoded Unicode string. It doesn’t have a “byte order”.)
normalizationis not None, apply the specified form of Unicode normalization, which can be one of ‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’.
- The default normalization, NFC, combines characters and diacritics that are written using separate code points, such as converting “e” plus an acute accent modifier into “é”, or converting “ka” (か) plus a dakuten into the single character “ga” (が). Unicode can be converted to NFC form without any change in its meaning.
- If you ask for NFKC normalization, it will apply additional normalizations that can change the meanings of characters. For example, ellipsis characters will be replaced with three periods, all ligatures will be replaced with the individual characters that make them up, and characters that differ in font style will be converted to the same character.
- If anything was changed, repeat all the steps, so that the function is idempotent. “&amp;” will become “&”, for example, not “&”.
fix_textwill work one line at a time, with the possibility that some lines are in different encodings, allowing it to fix text that has been concatenated together from different sources.
When it encounters lines longer than
max_decode_length(1 million codepoints by default), it will not run the
fix_encodingstep, to avoid unbounded slowdowns.
If you’re certain that any decoding errors in the text would have affected the entire text in the same way, and you don’t mind operations that scale with the length of the text, you can use
fix_text_segmentdirectly to fix the whole string in one batch.
fix_text_segment(text, fix_entities='auto', remove_terminal_escapes=True, fix_encoding=True, fix_latin_ligatures=True, fix_character_width=True, uncurl_quotes=True, fix_line_breaks=True, fix_surrogates=True, remove_control_chars=True, remove_bom=True, normalization='NFC')¶
Apply fixes to text in a single chunk. This could be a line of text within a larger run of
fix_text, or it could be a larger amount of text that you are certain is in a consistent encoding.
fix_textfor a description of the parameters.
Fix text with incorrectly-decoded garbage (“mojibake”) whenever possible.
This function looks for the evidence of mojibake, formulates a plan to fix it, and applies the plan. It determines whether it should replace nonsense sequences of single-byte characters that were really meant to be UTF-8 characters, and if so, turns them into the correctly-encoded Unicode character that they were meant to represent.
The input to the function must be Unicode. If you don’t have Unicode text, you’re not using the right tool to solve your problem.
fix_encodingdecodes text that looks like it was decoded incorrectly. It leaves alone text that doesn’t.
>>> print(fix_encoding('Ãºnico')) único
>>> print(fix_encoding('This text is fine already :þ')) This text is fine already :þ
Because these characters often come from Microsoft products, we allow for the possibility that we get not just Unicode characters 128-255, but also Windows’s conflicting idea of what characters 128-160 are.
>>> print(fix_encoding('This â€” should be an em dash')) This — should be an em dash
We might have to deal with both Windows characters and raw control characters at the same time, especially when dealing with characters like 0x81 that have no mapping in Windows. This is a string that Python’s standard
.decodemethods cannot correct.
>>> print(fix_encoding('This text is sad .â\x81”.')) This text is sad .⁔.
However, it has safeguards against fixing sequences of letters and punctuation that can occur in valid text. In the following example, the last three characters are not replaced with a Korean character, even though they could be.
>>> print(fix_encoding('not such a fan of Charlotte Brontë…”')) not such a fan of Charlotte Brontë…”
This function can now recover some complex manglings of text, such as when UTF-8 mojibake has been normalized in a way that replaces U+A0 with a space:
>>> print(fix_encoding('The more you know ðŸŒ ')) The more you know 🌠
Cases of genuine ambiguity can sometimes be addressed by finding other characters that are not double-encoded, and expecting the encoding to be consistent:
>>> print(fix_encoding('AHÅ™, the new sofa from IKEA®')) AHÅ™, the new sofa from IKEA®
Finally, we handle the case where the text is in a single-byte encoding that was intended as Windows-1252 all along but read as Latin-1:
>>> print(fix_encoding('This text was never UTF-8 at all\x85')) This text was never UTF-8 at all…
The best version of the text is found using
fix_file(input_file, encoding=None, fix_entities='auto', remove_terminal_escapes=True, fix_encoding=True, fix_latin_ligatures=True, fix_character_width=True, uncurl_quotes=True, fix_line_breaks=True, fix_surrogates=True, remove_control_chars=True, remove_bom=True, normalization='NFC')¶
Fix text that is found in a file.
If the file is being read as Unicode text, use that. If it’s being read as bytes, then we hope an encoding was supplied. If not, unfortunately, we have to guess what encoding it is. We’ll try a few common encodings, but we make no promises. See the
guess_bytesfunction for how this is done.
The output is a stream of fixed lines of text.
A utility method that’s useful for debugging mysterious Unicode.
It breaks down a string, showing you for each codepoint its number in hexadecimal, its glyph, its category in the Unicode standard, and its name in the Unicode standard.
>>> explain_unicode('(╯°□°)╯︵ ┻━┻') U+0028 ( [Ps] LEFT PARENTHESIS U+256F ╯ [So] BOX DRAWINGS LIGHT ARC UP AND LEFT U+00B0 ° [So] DEGREE SIGN U+25A1 □ [So] WHITE SQUARE U+00B0 ° [So] DEGREE SIGN U+0029 ) [Pe] RIGHT PARENTHESIS U+256F ╯ [So] BOX DRAWINGS LIGHT ARC UP AND LEFT U+FE35 ︵ [Ps] PRESENTATION FORM FOR VERTICAL LEFT PARENTHESIS U+0020 [Zs] SPACE U+253B ┻ [So] BOX DRAWINGS HEAVY UP AND HORIZONTAL U+2501 ━ [So] BOX DRAWINGS HEAVY HORIZONTAL U+253B ┻ [So] BOX DRAWINGS HEAVY UP AND HORIZONTAL
When first using ftfy, you might be confused to find that you can’t give it a
bytestring (the type of object called
str in Python 2).
ftfy fixes text. Treating bytestrings as text is exactly the kind of thing that causes the Unicode problems that ftfy has to fix. So if you don’t give it a Unicode string, ftfy will point you to the Python Unicode HOWTO.
Reasonable ways that you might exchange data, such as JSON or XML, already have perfectly good ways of expressing Unicode strings. Given a Unicode string, ftfy can apply fixes that are very likely to work without false positives.
A note on encoding detection¶
If your input is a mess of unmarked bytes, you might want a tool that can just statistically analyze those bytes and predict what encoding they’re in.
ftfy is not that tool. The
ftfy.guess_bytes() function it contains will
do this in very limited cases, but to support more encodings from around the
world, something more is needed.
You may have heard of
chardet. Chardet is admirable, but it doesn’t
completely do the job either. Its heuristics are designed for multi-byte
encodings, such as UTF-8 and the language-specific encodings used in East Asian
languages. It works badly on single-byte encodings, to the point where it will
output wrong answers with high confidence.
ftfy.guess_bytes() doesn’t even try the East Asian encodings, so the
ideal thing would combine the simple heuristic of
the multibyte character set detection of
chardet. This ideal thing doesn’t
ftfy uses Twitter’s streaming API as an endless source of realistic sample data. Twitter is massively multilingual, and despite that it’s supposed to be uniformly UTF-8, in practice, any encoding mistake that someone can make will be made by someone’s Twitter client.
We check what ftfy’s
fix_encoding() heuristic does to this data, and we
aim to have the rate of false positives be indistinguishable from zero.
A pre-release version of ftfy was evaluated on 30,880,000 tweets received from Twitter’s streaming API in April 2015. There was 1 false positive, and it was due to a bug that has now been fixed.
When looking at the changes ftfy makes, we found:
ftfy.fix_text(), with all default options, will change about 1 in 18 tweets.
- With stylistic changes (
uncurl_quotes) turned off,
ftfy.fix_text()will change about 1 in every 300 tweets.
ftfy.fix_encoding()alone will change about 1 in every 8500 tweets.
We sampled 1000 of these
ftfy.fix_encoding() changes for further
evaluation, and found:
- 980 of them correctly restored the text.
- 12 of them incompletely or incorrectly restored the text, when a sufficiently advanced heuristic might have been able to fully recover the text.
- 8 of them represented text that had lost too much information to be fixed.
- None of those 1000 changed correct text to incorrect text (these would be false positives).
In all the data we’ve sampled, including from previous versions of ftfy, there are only three false positives that remain that we know of:
fix_encoding('├┤a┼┐a┼┐a┼┐a┼┐a') == 'ôaſaſaſaſa' fix_encoding('ESSE CARA AI QUEM É¿') == 'ESSE CARA AI QUEM ɿ' fix_encoding('``hogwarts nao existe, voce nao vai pegar o trem pra lá´´') == '``hogwarts nao existe, voce nao vai pegar o trem pra lᴴ'
ftfy can be used from the command line. By default, it takes UTF-8 input and writes it to UTF-8 output, fixing problems in its Unicode as it goes.
Here’s the usage documentation for the
usage: ftfy [-h] [-o OUTPUT] [-g] [-e ENCODING] [-n NORMALIZATION] [--preserve-entities] [filename] ftfy (fixes text for you), version 4.0.0 positional arguments: filename The file whose Unicode is to be fixed. Defaults to -, meaning standard input. optional arguments: -h, --help show this help message and exit -o OUTPUT, --output OUTPUT The file to output to. Defaults to -, meaning standard output. -g, --guess Ask ftfy to guess the encoding of your input. This is risky. Overrides -e. -e ENCODING, --encoding ENCODING The encoding of the input. Defaults to UTF-8. -n NORMALIZATION, --normalization NORMALIZATION The normalization of Unicode to apply. Defaults to NFC. Can be "none". --preserve-entities Leave HTML entities as they are. The default is to decode them, as long as no HTML tags have appeared in the file.
ftfy.fixes: how individual fixes are implemented¶
This module contains the individual fixes that the main fix_text function can perform.
Decode all three types of HTML entities/character references.
Code by Fredrik Lundh of effbot.org. Rob Speer made a slight change to it for efficiency: it won’t match entities longer than 8 characters, because there are no valid entities like that.
>>> print(unescape_html('<tag>')) <tag>
Strip out “ANSI” terminal escape sequences, such as those that produce colored text on Unix.
>>> print(remove_terminal_escapes( ... "\033[36;44mI'm blue, da ba dee da ba doo...\033[0m" ... )) I'm blue, da ba dee da ba doo...
Replace curly quotation marks with straight equivalents.
>>> print(uncurl_quotes('\u201chere\u2019s a test\u201d')) "here's a test"
Replace single-character ligatures of Latin letters, such as ‘ﬁ’, with the characters that they contain, as in ‘fi’. Latin ligatures are usually not intended in text strings (though they’re lovely in rendered text). If you have such a ligature in your string, it is probably a result of a copy-and-paste glitch.
We leave ligatures in other scripts alone to be safe. They may be intended, and removing them may lose information. If you want to take apart nearly all ligatures, use NFKC normalization.
>>> print(fix_latin_ligatures("ﬂuﬃeﬆ")) fluffiest
The ASCII characters, katakana, and Hangul characters have alternate “halfwidth” or “fullwidth” forms that help text line up in a grid.
If you don’t need these width properties, you probably want to replace these characters with their standard form, which is what this function does.
Note that this replaces the ideographic space, U+3000, with the ASCII space, U+20.
>>> print(fix_character_width("ＬＯＵＤ ＮＯＩＳＥＳ")) LOUD NOISES >>> print(fix_character_width("Ｕﾀｰﾝ")) # this means "U-turn" Uターン
Convert all line breaks to Unix style.
This will convert the following sequences into the standard \n line break:
- CRLF (\r\n), used on Windows and in some communication protocols
- CR (\r), once used on Mac OS Classic, and now kept alive by misguided software such as Microsoft Office for Mac
- LINE SEPARATOR (\u2028) and PARAGRAPH SEPARATOR (\u2029), defined by Unicode and used to sow confusion and discord
- NEXT LINE (\x85), a C1 control character that is certainly not what you meant
The NEXT LINE character is a bit of an odd case, because it usually won’t show up if
fix_encodingis also being run. \x85 is very common mojibake for \u2026, HORIZONTAL ELLIPSIS.
>>> print(fix_line_breaks( ... "This string is made of two things:\u2029" ... "1. Unicode\u2028" ... "2. Spite" ... )) This string is made of two things: 1. Unicode 2. Spite
For further testing and examples, let’s define a function to make sure we can see the control characters in their escaped form:
>>> def eprint(text): ... print(text.encode('unicode-escape').decode('ascii'))
>>> eprint(fix_line_breaks("Content-type: text/plain\r\n\r\nHi.")) Content-type: text/plain\n\nHi.
>>> eprint(fix_line_breaks("This is how Microsoft \r trolls Mac users")) This is how Microsoft \n trolls Mac users
>>> eprint(fix_line_breaks("What is this \x85 I don't even")) What is this \n I don't even
Replace 16-bit surrogate codepoints with the characters they represent (when properly paired), or with � otherwise.
>>> high_surrogate = unichr(0xd83d) >>> low_surrogate = unichr(0xdca9) >>> print(fix_surrogates(high_surrogate + low_surrogate)) 💩 >>> print(fix_surrogates(low_surrogate + high_surrogate)) ��
The above doctest had to be very carefully written, because even putting the Unicode escapes of the surrogates in the docstring was causing various tools to fail, which I think just goes to show why this fixer is necessary.
Remove all ASCII control characters except for the important ones.
This removes characters in these ranges:
- U+0000 to U+0008
- U+000E to U+001F
It leaves alone these characters that are commonly used for formatting:
- TAB (U+0009)
- LF (U+000A)
- FF (U+000C)
- CR (U+000D)
Feel free to object that FF isn’t “commonly” used for formatting. I’ve at least seen it used.
Remove a byte-order mark that was accidentally decoded as if it were part of the text.
>>> print(remove_bom("\ufeffWhere do you want to go today?")) Where do you want to go today?
Decode backslashed escape sequences, including \x, \u, and \U character references, even in the presence of other Unicode.
This is what Python’s “string-escape” and “unicode-escape” codecs were meant to do, but in contrast, this actually works. It will decode the string exactly the same way that the Python interpreter decodes its string literals.
>>> factoid = '\\u20a1 is the currency symbol for the colón.' >>> print(factoid[1:]) u20a1 is the currency symbol for the colón. >>> print(decode_escapes(factoid)) ₡ is the currency symbol for the colón.
Even though Python itself can read string literals with a combination of escapes and literal Unicode – you’re looking at one right now – the “unicode-escape” codec doesn’t work on literal Unicode. (See http://stackoverflow.com/a/24519338/773754 for more details.)
Instead, this function searches for just the parts of a string that represent escape sequences, and decodes them, leaving the rest alone. All valid escape sequences are made of ASCII characters, and this allows “unicode-escape” to work correctly.
This fix cannot be automatically applied by the
ftfy.fix_textfunction, because escaped text is not necessarily a mistake, and there is no way to distinguish text that’s supposed to be escaped from text that isn’t.
Performs a single step of re-decoding text that’s been decoded incorrectly.
Returns the decoded text, plus a “plan” for how to reproduce what it did.
Apply a plan for fixing the encoding of text.
The plan is a list of tuples of the form (operation, encoding, cost):
operationis ‘encode’ if it turns a string into bytes, ‘decode’ if it turns bytes into a string, and ‘transcode’ if it keeps the type the same.
encodingis the name of the encoding to use, such as ‘utf-8’ or ‘latin-1’, or the function name in the case of ‘transcode’.
costdoes not affect how the plan itself works. It’s used by other users of plans, namely
fix_encoding_and_explain, which has to decide which plan to use.
Some mojibake has been additionally altered by a process that said “hmm, byte A0, that’s basically a space!” and replaced it with an ASCII space. When the A0 is part of a sequence that we intend to decode as UTF-8, changing byte A0 to 20 would make it fail to decode.
This process finds sequences that would convincingly decode as UTF-8 if byte 20 were changed to A0, and puts back the A0. For the purpose of deciding whether this is a good idea, this step gets a cost of twice the number of bytes that are changed.
This is used as a step within
This function identifies sequences where information has been lost in a “sloppy” codec, indicated by byte 1A, and if they would otherwise look like a UTF-8 sequence, it replaces them with the UTF-8 sequence for U+FFFD.
A further explanation:
ftfy can now fix text in a few cases that it would previously fix incompletely, because of the fact that it can’t successfully apply the fix to the entire string. A very common case of this is when characters have been erroneously decoded as windows-1252, but instead of the “sloppy” windows-1252 that passes through unassigned bytes, the unassigned bytes get turned into U+FFFD (�), so we can’t tell what they were.
This most commonly happens with curly quotation marks that appear
â€œ like this â€�.
We can do better by building on ftfy’s “sloppy codecs” to let them handle less-sloppy but more-lossy text. When they encounter the character
�, instead of refusing to encode it, they encode it as byte 1A – an ASCII control code called SUBSTITUTE that once was meant for about the same purpose. We can then apply a fixer that looks for UTF-8 sequences where some continuation bytes have been replaced by byte 1A, and decode the whole sequence as �; if that doesn’t work, it’ll just turn the byte back into � itself.
As a result, the above text
â€œ like this â€�will decode as
“ like this �.
If U+1A was actually in the original string, then the sloppy codecs will not be used, and this function will not be run, so your weird control character will be left alone but wacky fixes like this won’t be possible.
This is used as a step within
Fix particular characters that seem to be found in the wild encoded in UTF-8 and decoded in Latin-1 or Windows-1252, even when this fix can’t be consistently applied.
For this function, we assume the text has been decoded in Windows-1252. If it was decoded in Latin-1, we’ll call this right after it goes through the Latin-1-to-Windows-1252 fixer.
This is used as a step within
ftfy.badness: measures the “badness” of text¶
Heuristics to determine whether re-encoding text is actually making it more reasonable.
Determine how often a text has unexpected characters or sequences of characters. This metric is used to disambiguate when text should be re-decoded or left as is.
We start by normalizing text in NFC form, so that penalties for diacritical marks don’t apply to characters that know what to do with them.
The following things are deemed weird:
Lowercase letters followed by non-ASCII uppercase letters
Non-Latin characters next to Latin characters
Un-combined diacritical marks, unless they’re stacking on non-alphabetic characters (in languages that do that kind of thing a lot) or other marks
C1 control characters
Adjacent symbols from any different pair of these categories:
- Modifier marks
- Letter modifiers
- Non-digit numbers
- Symbols (including math and currency)
The return value is the number of instances of weirdness.
An overall cost function for text. Weirder is worse, but all else being equal, shorter strings are better.
The overall cost is measured as the “weirdness” (see
sequence_weirdness()) plus the length.
ftfy.bad_codecs: support some encodings Python doesn’t like¶
Give Python the ability to decode some common, flawed encodings.
Python does not want you to be sloppy with your text. Its encoders and decoders (“codecs”) follow the relevant standards whenever possible, which means that when you get text that doesn’t follow those standards, you’ll probably fail to decode it. Or you might succeed at decoding it for implementation-specific reasons, which is perhaps worse.
There are some encodings out there that Python wishes didn’t exist, which are widely used outside of Python:
- “utf-8-variants”, a family of not-quite-UTF-8 encodings, including the ever-popular CESU-8 and “Java modified UTF-8”.
- “Sloppy” versions of character map encodings, where bytes that don’t map to anything will instead map to the Unicode character with the same number.
Simply importing this module, or in fact any part of the
ftfy package, will
make these new “bad codecs” available to Python through the standard Codecs
API. You never have to actually call any functions inside
However, if you want to call something because your code checker insists on it,
you can call
A quick example of decoding text that’s encoded in CESU-8:
>>> import ftfy.bad_codecs >>> print(b'\xed\xa0\xbd\xed\xb8\x8d'.decode('utf-8-variants')) 😍
Decodes single-byte encodings, filling their “holes” in the same messy way that everyone else does.
A single-byte encoding maps each byte to a Unicode character, except that some bytes are left unmapped. In the commonly-used Windows-1252 encoding, for example, bytes 0x81 and 0x8D, among others, have no meaning.
Python, wanting to preserve some sense of decorum, will handle these bytes as errors. But Windows knows that 0x81 and 0x8D are possible bytes and they’re different from each other. It just hasn’t defined what they are in terms of Unicode.
Software that has to interoperate with Windows-1252 and Unicode – such as all the common Web browsers – will pick some Unicode characters for them to map to, and the characters they pick are the Unicode characters with the same numbers: U+0081 and U+008D. This is the same as what Latin-1 does, and the resulting characters tend to fall into a range of Unicode that’s set aside for obselete Latin-1 control characters anyway.
These sloppy codecs let Python do the same thing, thus interoperating with other software that works this way. It defines a sloppy version of many single-byte encodings with holes. (There is no need for a sloppy version of an encoding without holes: for example, there is no such thing as sloppy-iso-8859-2 or sloppy-macroman.)
The following encodings will become defined:
- sloppy-windows-1250 (Central European, sort of based on ISO-8859-2)
- sloppy-windows-1251 (Cyrillic)
- sloppy-windows-1252 (Western European, based on Latin-1)
- sloppy-windows-1253 (Greek, sort of based on ISO-8859-7)
- sloppy-windows-1254 (Turkish, based on ISO-8859-9)
- sloppy-windows-1255 (Hebrew, based on ISO-8859-8)
- sloppy-windows-1256 (Arabic)
- sloppy-windows-1257 (Baltic, based on ISO-8859-13)
- sloppy-windows-1258 (Vietnamese)
- sloppy-cp874 (Thai, based on ISO-8859-11)
- sloppy-iso-8859-3 (Maltese and Esperanto, I guess)
- sloppy-iso-8859-6 (different Arabic)
- sloppy-iso-8859-7 (Greek)
- sloppy-iso-8859-8 (Hebrew)
- sloppy-iso-8859-11 (Thai)
Aliases such as “sloppy-cp1252” for “sloppy-windows-1252” will also be defined.
Only sloppy-windows-1251 and sloppy-windows-1252 are used by the rest of ftfy; the rest are rather uncommon.
Here are some examples, using
ftfy.explain_unicode to illustrate how
sloppy-windows-1252 merges Windows-1252 with Latin-1:
>>> from ftfy import explain_unicode >>> some_bytes = b'\x80\x81\x82' >>> explain_unicode(some_bytes.decode('latin-1')) U+0080 \x80 [Cc] <unknown> U+0081 \x81 [Cc] <unknown> U+0082 \x82 [Cc] <unknown>
>>> explain_unicode(some_bytes.decode('windows-1252', 'replace')) U+20AC € [Sc] EURO SIGN U+FFFD � [So] REPLACEMENT CHARACTER U+201A ‚ [Ps] SINGLE LOW-9 QUOTATION MARK
>>> explain_unicode(some_bytes.decode('sloppy-windows-1252')) U+20AC € [Sc] EURO SIGN U+0081 \x81 [Cc] <unknown> U+201A ‚ [Ps] SINGLE LOW-9 QUOTATION MARK
Variants of UTF-8¶
This file defines a codec called “utf-8-variants” (or “utf-8-var”), which can decode text that’s been encoded with a popular non-standard version of UTF-8. This includes CESU-8, the accidental encoding made by layering UTF-8 on top of UTF-16, as well as Java’s twist on CESU-8 that contains a two-byte encoding for codepoint 0.
This is particularly relevant in Python 3, which provides no other way of decoding CESU-8 .
The easiest way to use the codec is to simply import
>>> import ftfy.bad_codecs >>> result = b'here comes a null! \xc0\x80'.decode('utf-8-var') >>> print(repr(result).lstrip('u')) 'here comes a null! \x00'
The codec does not at all enforce “correct” CESU-8. For example, the Unicode Consortium’s not-quite-standard describing CESU-8 requires that there is only one possible encoding of any character, so it does not allow mixing of valid UTF-8 and CESU-8. This codec does allow that, just like Python 2’s UTF-8 decoder does.
Characters in the Basic Multilingual Plane still have only one encoding. This
codec still enforces the rule, within the BMP, that characters must appear in
their shortest form. There is one exception: the sequence of bytes
instead of just
0x00, may be used to encode the null character
If you encode with this codec, you get legitimate UTF-8. Decoding with this codec and then re-encoding is not idempotent, although encoding and then decoding is. So this module won’t produce CESU-8 for you. Look for that functionality in the sister module, “Breaks Text For You”, coming approximately never.
|||In a pinch, you can decode CESU-8 in Python 2 using the UTF-8 codec: first decode the bytes (incorrectly), then encode them, then decode them again, using UTF-8 as the codec every time.|
ftfy.formatting: justify Unicode text in a monospaced terminal¶
Determine the width that a character is likely to be displayed as in a monospaced terminal. The width will always be 0, 1, or 2.
We are assuming that the character will appear in a modern, Unicode-aware terminal emulator, where CJK characters – plus characters designated “fullwidth” by CJK encodings – will span two character cells. (Maybe it’s an over-simplification to call these displays “monospaced”.)
When a character width is “Ambiguous”, we assume it to span one character cell, because we assume that the monospaced display you’re using is designed mostly to display Western characters with CJK characters as the exception. This assumption will go wrong, for example, if your display is actually on a Japanese flip-phone... but let’s assume it isn’t.
Combining marks and formatting codepoints do not advance the cursor, so their width is 0.
If the character is a particular kind of control code – the kind represented by the lowest bytes of ASCII, with Unicode category Cc – the idea of it having a “width” probably doesn’t even apply, because it will probably have some other effect on your terminal. For example, there’s no sensible width for a line break. We return the default of 1 in the absence of anything sensible to do there.
>>> character_width('車') 2 >>> character_width('A') 1 >>> character_width('') 0
display_center(text, width, fillchar=' ')¶
textcentered in a Unicode string whose display width, in a monospaced terminal, should be at least
widthcharacter cells. The rest of the string will be padded with
fillchar, which must be a width-1 character.
>>> lines = ['Table flip', '(╯°□°)╯︵ ┻━┻', 'ちゃぶ台返し'] >>> for line in lines: ... print(display_center(line, 20, '▒')) ▒▒▒▒▒Table flip▒▒▒▒▒ ▒▒▒(╯°□°)╯︵ ┻━┻▒▒▒▒ ▒▒▒▒ちゃぶ台返し▒▒▒▒
display_ljust(text, width, fillchar=' ')¶
textleft-justified in a Unicode string whose display width, in a monospaced terminal, should be at least
widthcharacter cells. The rest of the string will be padded with
fillchar, which must be a width-1 character.
“Left” here means toward the beginning of the string, which may actually appear on the right in an RTL context. This is similar to the use of the word “left” in “left parenthesis”.
>>> lines = ['Table flip', '(╯°□°)╯︵ ┻━┻', 'ちゃぶ台返し'] >>> for line in lines: ... print(display_ljust(line, 20, '▒')) Table flip▒▒▒▒▒▒▒▒▒▒ (╯°□°)╯︵ ┻━┻▒▒▒▒▒▒▒ ちゃぶ台返し▒▒▒▒▒▒▒▒
This example, and the similar ones that follow, should come out justified correctly when viewed in a monospaced terminal. It will probably not look correct if you’re viewing this code or documentation in a Web browser.
display_rjust(text, width, fillchar=' ')¶
textright-justified in a Unicode string whose display width, in a monospaced terminal, should be at least
widthcharacter cells. The rest of the string will be padded with
fillchar, which must be a width-1 character.
“Right” here means toward the end of the string, which may actually be on the left in an RTL context. This is similar to the use of the word “right” in “right parenthesis”.
>>> lines = ['Table flip', '(╯°□°)╯︵ ┻━┻', 'ちゃぶ台返し'] >>> for line in lines: ... print(display_rjust(line, 20, '▒')) ▒▒▒▒▒▒▒▒▒▒Table flip ▒▒▒▒▒▒▒(╯°□°)╯︵ ┻━┻ ▒▒▒▒▒▒▒▒ちゃぶ台返し
Return the number of character cells that this string is likely to occupy when displayed in a monospaced, modern, Unicode-aware terminal emulator. We refer to this as the “display width” of the string.
This can be useful for formatting text that may contain non-spacing characters, or CJK characters that take up two character cells.
>>> monospaced_width('ちゃぶ台返し') 12 >>> len('ちゃぶ台返し') 6
ftfy.chardata and ftfy.build_data: trivia about characters¶
These files load information about the character properties in Unicode 7.0. Yes, even if your version of Python doesn’t support Unicode 7.0. This ensures that ftfy’s behavior is consistent across versions.
This gives other modules access to the gritty details about characters and the encodings that use them.
Convert each Unicode character to a letter indicating which of many classes it’s in.
See build_data.py for where this data comes from and what it means.
Given text and a single-byte encoding, check whether that text could have been decoded from that single-byte encoding.
In other words, check whether it can be encoded in that encoding, possibly sloppily.
Build the compressed data file ‘char_classes.dat’ and write it to the current directory.
If you run this, run it in Python 3.5 or later, even though that requires an alpha version at the time of writing this code. It will run in earlier versions, but you won’t get the Unicode 7 standard, leading to inconsistent behavior.
To protect against this, running this in the wrong version of Python will raise an error unless you pass
Future versions of ftfy¶
ftfy has full support for Python 2.7, even including a backport of Unicode 7 character classes to Python 2. But given the sweeping changes to Unicode in Python, it’s getting inconvenient to add new features to ftfy that work the same on both versions.
ftfy 5.0, when it is released, will probably only support Python 3.
If you want to see examples of why ftfy is particularly difficult to maintain
on two versions of Python (which is more like three versions because of Python
2’s “wide” and “narrow” builds), take a look at functions such as
We’re following the lead of venerable projects such as jQuery. jQuery dropped support for IE 6-8 in version 2.0, and in doing so, it became much simpler and faster for people who didn’t need to support old versions of IE in their web applications. Meanwhile, jQuery 1.9 remained there for those who needed it.
Similarly, ftfy 5.0 will reduce the size and complexity of the code greatly,
but ftfy 4.x will remain there. If you’re running on Python 2, please make sure
ftfy < 5 is in your requirements list, not just