Two of these functions are particularly useful on their own, as more robust versions of functions in the Python standard library:
Decode backslashed escape sequences, including \x, \u, and \U character references, even in the presence of other Unicode.
This function has to be called specifically. It’s not run automatically by ftfy, because escaped text is not necessarily a mistake, and there is no way to distinguish when it is.
This is what Python’s “string-escape” and “unicode-escape” codecs were meant to do, but in contrast, this actually works. It will decode the string exactly the same way that the Python interpreter decodes its string literals.
>>> factoid = '\\u20a1 is the currency symbol for the colón.'
>>> print(factoid[1:])
u20a1 is the currency symbol for the colón.
>>> print(decode_escapes(factoid))
₡ is the currency symbol for the colón.
Even though Python itself can read string literals with a combination of escapes and literal Unicode – you’re looking at one right now – the “unicode-escape” codec doesn’t work on literal Unicode. (See http://stackoverflow.com/a/24519338/773754 for more details.)
Instead, this function searches for just the parts of a string that represent escape sequences, and decodes them, leaving the rest alone. All valid escape sequences are made of ASCII characters, and this allows “unicode-escape” to work correctly.
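That search-and-decode approach can be sketched with the standard library. The regex and function name here are illustrative, not ftfy's actual code, and octal escapes are omitted for brevity:

```python
import codecs
import re

# Matches the escape sequences Python itself understands (a sketch;
# octal escapes and a few rarer forms are omitted).
ESCAPE_RE = re.compile(
    r"\\(?:x[0-9A-Fa-f]{2}|u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8}|[\\'\"abfnrtv])"
)

def decode_escapes_sketch(text: str) -> str:
    # Each match is pure ASCII, so "unicode-escape" decodes it safely
    # even when the surrounding text contains literal Unicode.
    return ESCAPE_RE.sub(
        lambda match: codecs.decode(match.group(0), "unicode-escape"), text
    )
```

Because each match is pure ASCII, the “unicode-escape” codec never sees the surrounding literal Unicode, which is what makes this safe.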
Sometimes, text from one encoding ends up embedded within text from a different one. This is common enough that we need to be able to fix it.
This is used as a transcoder within fix_encoding.
If text still contains C1 control characters, treat them as their Windows-1252 equivalents. This matches what Web browsers do.
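As a rough illustration, the browser behavior can be approximated with the standard library; the function name here is hypothetical, not ftfy's API:

```python
def fix_c1_controls_sketch(text: str) -> str:
    # Sketch of the browser behavior: a C1 control character
    # (U+0080 to U+009F) is reinterpreted as the Windows-1252 character
    # occupying the same byte value. Not ftfy's actual implementation.
    def fix_char(ch: str) -> str:
        if '\x80' <= ch <= '\x9f':
            try:
                return ch.encode('latin-1').decode('windows-1252')
            except UnicodeDecodeError:
                return ch  # five byte values have no Windows-1252 mapping
        return ch
    return ''.join(fix_char(ch) for ch in text)
```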
The ASCII characters, katakana, and Hangul characters have alternate “halfwidth” or “fullwidth” forms that help text line up in a grid.
If you don’t need these width properties, you probably want to replace these characters with their standard form, which is what this function does.
Note that this replaces the ideographic space, U+3000, with the ASCII space, U+20.
>>> print(fix_character_width("ＬＯＵＤ ＮＯＩＳＥＳ"))
LOUD NOISES
>>> print(fix_character_width("Ｕﾀｰﾝ"))   # this means "U-turn"
Uターン
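If you just need the width folding and don't have ftfy handy, NFKC normalization from the standard library does it too, though it is a blunter instrument; a sketch:

```python
import unicodedata

def fix_width_nfkc(text: str) -> str:
    # NFKC folds the "Halfwidth and Fullwidth Forms" block into the
    # standard characters (and U+3000 into U+0020), but it is broader
    # than a width-only fixer: it also rewrites ligatures, superscripts,
    # and other compatibility characters.
    return unicodedata.normalize('NFKC', text)
```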
Replace single-character ligatures of Latin letters, such as ‘ﬁ’, with the characters that they contain, as in ‘fi’. Latin ligatures are usually not intended in text strings (though they’re lovely in rendered text). If you have such a ligature in your string, it is probably a result of a copy-and-paste glitch.
We leave ligatures in other scripts alone to be safe. They may be intended, and removing them may lose information. If you want to take apart nearly all ligatures, use NFKC normalization.
>>> print(fix_latin_ligatures("ﬂuﬃeﬆ"))
fluffiest
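As noted above, NFKC normalization is the standard-library way to take apart nearly all ligatures; a one-function sketch:

```python
import unicodedata

def take_apart_ligatures(text: str) -> str:
    # NFKC decomposes compatibility characters, Latin ligatures included;
    # unlike fix_latin_ligatures, it also rewrites fullwidth forms,
    # superscripts, and ligatures in other scripts.
    return unicodedata.normalize('NFKC', text)
```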
Convert all line breaks to Unix style.
This will convert the following sequences into the standard \n line break:
CRLF (\r\n), used on Windows and in some communication protocols
CR (\r), once used on Mac OS Classic, and now kept alive by misguided software such as Microsoft Office for Mac
LINE SEPARATOR (\u2028) and PARAGRAPH SEPARATOR (\u2029), defined by Unicode and used to sow confusion and discord
NEXT LINE (\x85), a C1 control character that is certainly not what you meant
The NEXT LINE character is a bit of an odd case, because it usually won't show up if fix_encoding is also being run. \x85 is very common mojibake for \u2026, HORIZONTAL ELLIPSIS.
>>> print(fix_line_breaks(
...     "This string is made of two things:\u2029"
...     "1. Unicode\u2028"
...     "2. Spite"
... ))
This string is made of two things:
1. Unicode
2. Spite
For further testing and examples, let’s define a function to make sure we can see the control characters in their escaped form:
>>> def eprint(text):
...     print(text.encode('unicode-escape').decode('ascii'))
>>> eprint(fix_line_breaks("Content-type: text/plain\r\n\r\nHi."))
Content-type: text/plain\n\nHi.

>>> eprint(fix_line_breaks("This is how Microsoft \r trolls Mac users"))
This is how Microsoft \n trolls Mac users

>>> eprint(fix_line_breaks("What is this \x85 I don't even"))
What is this \n I don't even
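The rules above amount to a single substitution. A minimal sketch (the name is hypothetical, not ftfy's code):

```python
import re

# Order matters: CRLF must be matched before bare CR, so a Windows
# line break becomes one \n instead of two.
LINE_BREAK_RE = re.compile('\r\n|\r|\u2028|\u2029|\x85')

def fix_line_breaks_sketch(text: str) -> str:
    return LINE_BREAK_RE.sub('\n', text)
```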
Replace 16-bit surrogate codepoints with the characters they represent (when properly paired), or with � otherwise.
>>> high_surrogate = chr(0xd83d)
>>> low_surrogate = chr(0xdca9)
>>> print(fix_surrogates(high_surrogate + low_surrogate))
💩
>>> print(fix_surrogates(low_surrogate + high_surrogate))
��
The above doctest had to be very carefully written, because even putting the Unicode escapes of the surrogates in the docstring was causing various tools to fail, which I think just goes to show why this fixer is necessary.
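The pairing arithmetic can be sketched in a few lines; this is illustrative, not ftfy's implementation:

```python
import re

# A high surrogate followed by a low surrogate, and any lone surrogate.
SURROGATE_PAIR_RE = re.compile('[\ud800-\udbff][\udc00-\udfff]')
SURROGATE_RE = re.compile('[\ud800-\udfff]')

def fix_surrogates_sketch(text: str) -> str:
    def combine(match):
        high, low = match.group(0)
        # Standard UTF-16 pairing arithmetic
        return chr(0x10000 + (ord(high) - 0xd800) * 0x400 + (ord(low) - 0xdc00))
    text = SURROGATE_PAIR_RE.sub(combine, text)
    # Anything left over is an unpaired surrogate: replace it with U+FFFD
    return SURROGATE_RE.sub('\ufffd', text)
```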
Remove various control characters that you probably didn’t intend to be in your text. Many of these characters appear in the table of “Characters not suitable for use with markup” at http://www.unicode.org/reports/tr20/tr20-9.html.
ASCII control characters, except for the important whitespace characters (U+00 to U+08, U+0B, U+0E to U+1F, U+7F)
Deprecated Arabic control characters (U+206A to U+206F)
Interlinear annotation characters (U+FFF9 to U+FFFB)
The Object Replacement Character (U+FFFC)
The byte order mark (U+FEFF)
However, these similar characters are left alone:
Control characters that produce whitespace (U+09, U+0A, U+0C, U+0D, U+2028, and U+2029)
C1 control characters (U+80 to U+9F) – even though they are basically never used intentionally, they are important clues about what mojibake has happened
Control characters that affect glyph rendering, such as joiners and right-to-left marks (U+200C to U+200F, U+202A to U+202E)
Musical notation control characters (U+1D173 to U+1D17A) because wow if you’re using those you probably have a good reason
Tag characters, because they are now used in emoji sequences such as “Flag of Wales”
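The categories to remove translate directly into a deletion table; a sketch (ftfy builds its table differently, and this one omits the tag characters):

```python
# Map each removable code point to None so str.translate deletes it.
REMOVABLE_CONTROL_CHARS = dict.fromkeys(
    list(range(0x00, 0x09)) + [0x0b] + list(range(0x0e, 0x20)) + [0x7f]
    + list(range(0x206a, 0x2070))   # deprecated Arabic controls
    + list(range(0xfff9, 0xfffd))   # interlinear annotation + U+FFFC
    + [0xfeff]                      # byte order mark
)

def remove_control_chars_sketch(text: str) -> str:
    return text.translate(REMOVABLE_CONTROL_CHARS)
```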
Strip out “ANSI” terminal escape sequences, such as those that produce colored text on Unix.
>>> print(remove_terminal_escapes(
...     "\033[36;44mI'm blue, da ba dee da ba doo...\033[0m"
... ))
I'm blue, da ba dee da ba doo...
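A common way to do this with the standard library is a regex over CSI sequences. This narrower pattern covers typical color and formatting codes; ftfy's is more thorough:

```python
import re

# ESC [ ... letter: the CSI sequences that produce colors and styles.
ANSI_CSI_RE = re.compile(r'\033\[[0-9;]*[A-Za-z]')

def remove_terminal_escapes_sketch(text: str) -> str:
    return ANSI_CSI_RE.sub('', text)
```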
This function identifies sequences where information has been lost in a “sloppy” codec, as indicated by byte 1A. If such a sequence would otherwise look like UTF-8, it replaces the whole sequence with the UTF-8 encoding of U+FFFD.
A further explanation:
ftfy can now fix text in a few cases that it would previously fix incompletely, because it couldn’t successfully apply the fix to the entire string. A very common case is text that has been erroneously decoded as windows-1252, but not by the “sloppy” windows-1252 codec that passes unassigned bytes through: instead, the unassigned bytes were turned into U+FFFD (�), so we can’t tell what they originally were.
This most commonly happens with curly quotation marks, which appear â€œ like this â€�.
We can do better by building on ftfy’s “sloppy codecs” to let them handle less-sloppy but more-lossy text. When they encounter the character �, instead of refusing to encode it, they encode it as byte 1A – an ASCII control code called SUBSTITUTE that once was meant for about the same purpose. We can then apply a fixer that looks for UTF-8 sequences where some continuation bytes have been replaced by byte 1A, and decode the whole sequence as �; if that doesn’t work, it’ll just turn the byte back into � itself.
As a result, the above text, â€œ like this â€�, will decode as “ like this �.
If U+1A was actually in the original string, then the sloppy codecs will not be used, and this function will not be run, so your weird control character will be left alone but wacky fixes like this won’t be possible.
This is used as a transcoder within fix_encoding.
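A toy version of the mechanism, handling only three-byte UTF-8 sequences; the names and pattern are illustrative, not ftfy's code:

```python
import re

# A three-byte UTF-8 lead byte followed by two bytes that are each
# either a valid continuation byte or the SUBSTITUTE byte 1A.
LOSSY_3BYTE_RE = re.compile(b'[\xe0-\xef][\x80-\xbf\x1a]{2}')

def decode_lossy_utf8_sketch(data: bytes) -> str:
    def fix_seq(match):
        seq = match.group(0)
        try:
            seq.decode('utf-8')
            return seq              # already valid UTF-8: leave it alone
        except UnicodeDecodeError:
            return b'\xef\xbf\xbd'  # the UTF-8 encoding of U+FFFD
    return LOSSY_3BYTE_RE.sub(fix_seq, data).decode('utf-8', 'replace')
```

Here b'\xe2\x80\x9c' (a curly quote) survives intact, while b'\xe2\x80\x1a' (its lossy counterpart with a SUBSTITUTE byte) collapses to a single �.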
Some mojibake has been additionally altered by a process that said “hmm, byte A0, that’s basically a space!” and replaced it with an ASCII space. When the A0 is part of a sequence that we intend to decode as UTF-8, changing byte A0 to 20 would make it fail to decode.
This process finds sequences that would convincingly decode as UTF-8 if byte 20 were changed to A0, and puts back the A0. For the purpose of deciding whether this is a good idea, this step gets a cost of twice the number of bytes that are changed.
This is used as a step within fix_encoding.
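A toy version of the byte-level substitution, for two-byte UTF-8 sequences only; it is illustrative, and omits the cost heuristic described above:

```python
import re

# A two-byte UTF-8 lead byte (C2-DF) followed by a space where the
# continuation byte A0 (NO-BREAK SPACE) should be.
ALTERED_UTF8_RE = re.compile(b'[\xc2-\xdf] ')

def restore_byte_a0_sketch(data: bytes) -> bytes:
    # Keep the lead byte, put back the A0 that was "helpfully" replaced.
    return ALTERED_UTF8_RE.sub(lambda m: m.group(0)[:1] + b'\xa0', data)
```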
Replace curly quotation marks with straight equivalents.
>>> print(uncurl_quotes('\u201chere\u2019s a test\u201d'))
"here's a test"
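The mapping is small enough to write as a translation table; a sketch of the idea:

```python
# The four curly quotation marks and their straight replacements.
QUOTE_TABLE = str.maketrans({
    '\u2018': "'",  # LEFT SINGLE QUOTATION MARK
    '\u2019': "'",  # RIGHT SINGLE QUOTATION MARK
    '\u201c': '"',  # LEFT DOUBLE QUOTATION MARK
    '\u201d': '"',  # RIGHT DOUBLE QUOTATION MARK
})

def uncurl_quotes_sketch(text: str) -> str:
    return text.translate(QUOTE_TABLE)
```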
Decode HTML entities and character references, including some nonstandard ones written in all-caps.
Python has a built-in function, html.unescape, that can decode HTML escapes, including a bunch of messy edge cases such as escapes without semicolons, like “&amp”.
If you know you’ve got HTML-escaped text, applying html.unescape is the right way to convert it to plain text. But in ambiguous situations, that would create false positives. For example, the informally written text “this&not that” should not automatically be decoded as “this¬ that”.
In this function, we decode the escape sequences that appear in the html.entities.html5 dictionary, as long as they are the unambiguous ones that end in semicolons.
We also decode all-caps versions of Latin letters and common symbols. If a database contains the name ‘P&EACUTE;REZ’, we can read that and intuit that it was supposed to say ‘PÉREZ’. This is limited to a smaller set of entities, because there are many instances where entity names are case-sensitive in complicated ways.
>>> unescape_html('&lt;tag&gt;')
'<tag>'

>>> unescape_html('&Jscr;ohn &HilbertSpace;ancock')
'𝒥ohn ℋancock'

>>> unescape_html('&checkmark;')
'✓'

>>> unescape_html('P&eacute;rez')
'Pérez'

>>> unescape_html('P&EACUTE;REZ')
'PÉREZ'

>>> unescape_html('BUNDESSTRA&SZLIG;E')
'BUNDESSTRASSE'

>>> unescape_html('&ntilde; &Ntilde; &NTILDE; &nTILDE;')
'ñ Ñ Ñ &nTILDE;'
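The all-caps fallback can be sketched with a deliberately tiny allowlist; the names here are illustrative, and ftfy's entity handling is more complete:

```python
import html
import re

# A deliberately tiny allowlist: entity names are case-sensitive in
# complicated ways, so only unambiguous uppercase forms belong here.
SAFE_UPPERCASE = {'EACUTE', 'NTILDE', 'SZLIG', 'AGRAVE', 'CCEDIL'}

ALLCAPS_ENTITY_RE = re.compile(r'&([A-Z]+);')

def unescape_caps_sketch(text: str) -> str:
    def fix(match):
        name = match.group(1)
        if name in SAFE_UPPERCASE:
            # Decode the lowercase form, then restore the capitalization
            # (this is how 'SZLIG' can become 'SS').
            return html.unescape('&' + name.lower() + ';').upper()
        return match.group(0)
    # Note: unlike ftfy, plain html.unescape also decodes entities that
    # lack semicolons; this sketch ignores that distinction.
    return html.unescape(ALLCAPS_ENTITY_RE.sub(fix, text))
```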