Fixing problems and getting explanations¶
Ode to a Shipping Label¶
A poem about mojibake, whose original author might be Carlos Bueno on Facebook, shows a shipping label that serves as an excellent example for this section, addressed to the surname
We can use ftfy not only to fix the text that was on the label, but to show us what happened to it (like the poem does):
>>> from ftfy import fix_and_explain, apply_plan >>> shipping_label = "L&AMP;ATILDE;&AMP;SUP3;PEZ" >>> fixed, explanation = fix_and_explain(shipping_label) >>> fixed 'LóPEZ' >>> explanation [('apply', 'unescape_html'), ('apply', 'unescape_html'), ('apply', 'unescape_html'), ('encode', 'latin-1'), ('decode', 'utf-8')]
The capitalization is inconsistent because the encoding of a lowercase “ó” is in there, but everything was printed in capital letters.
The explanation may even be able to be applied to different text with the same problem:
>>> label2 = "CARR&AMP;ATILDE;&AMP;COPY;" >>> apply_plan(label2, explanation) 'CARRé'
Functions that fix text¶
The function that you’ll probably use most often is
ftfy.fix_text(), which applies all the fixes it can to every line of text, and returns the fixed text.
fix_text(text: str, config: Optional[ftfy.TextFixerConfig] = None, **kwargs) → str¶
Given Unicode text as input, fix inconsistencies and glitches in it, such as mojibake (text that was decoded in the wrong encoding).
Let’s start with some examples:
>>> fix_text('âœ” No problems') '✔ No problems'
>>> print(fix_text("¯\\_(ã\x83\x84)_/¯")) ¯\_(ツ)_/¯
>>> fix_text('Broken text… it’s ﬂubberiﬁc!') "Broken text... it's flubberific!"
>>> fix_text('ＬＯＵＤ ＮＯＩＳＥＳ') 'LOUD NOISES'
ftfy applies a number of different fixes to the text, and can accept configuration to select which fixes to apply.
The configuration takes the form of a
TextFixerConfigobject, and you can see a description of the options in that class’s docstring or in the full documentation at ftfy.readthedocs.org.
For convenience and backward compatibility, the configuration can also take the form of keyword arguments, which will set the equivalently-named fields of the TextFixerConfig object.
For example, here are two ways to fix text but skip the “uncurl_quotes” step:
fix_text(text, TextFixerConfig(uncurl_quotes=False)) fix_text(text, uncurl_quotes=False)
This function fixes text in independent segments, which are usually lines of text, or arbitrarily broken up every 1 million codepoints (configurable with
config.max_decode_length) if there aren’t enough line breaks. The bound on segment lengths helps to avoid unbounded slowdowns.
ftfy can also provide an ‘explanation’, a list of transformations it applied to the text that would fix more text like it. This function doesn’t provide explanations (because there may be different fixes for different segments of text).
To get an explanation, use the
fix_and_explain()function, which fixes the string in one segment and explains what it fixed.
fix_and_explain(text: str, config: Optional[ftfy.TextFixerConfig] = None, **kwargs) → ftfy.ExplainedText¶
Fix text as a single segment, returning the fixed text and an explanation of what was fixed.
The explanation is a list of steps that can be applied with
apply_plan(), or if config.explain is False, it will be None.
ftfy.fix_and_explain() doesn’t separate the text into lines that it fixes separately – because it’s looking for a unified explanation of what happened to the text, not a different one for each line.
A more targeted function is
ftfy.fix_encoding_and_explain(), which only fixes problems that can be solved by encoding and decoding the text, not other problems such as HTML entities:
fix_encoding_and_explain(text: str, config: Optional[ftfy.TextFixerConfig] = None, **kwargs) → ftfy.ExplainedText¶
Apply the steps of ftfy that detect mojibake and fix it. Returns the fixed text and a list explaining what was fixed.
This includes fixing text by encoding and decoding it in different encodings, as well as the subordinate fixes
>>> fix_encoding_and_explain("sÃ³") ExplainedText(text='só', explanation=[('encode', 'latin-1'), ('decode', 'utf-8')]) >>> result = fix_encoding_and_explain("voilÃ le travail") >>> result.text 'voilà le travail' >>> result.explanation [('encode', 'latin-1'), ('transcode', 'restore_byte_a0'), ('decode', 'utf-8')]
This function has a counterpart that returns just the fixed string, without the explanation. It still fixes the string as a whole, not line by line.
fix_encoding(text: str, config: Optional[ftfy.TextFixerConfig] = None, **kwargs)¶
Apply just the encoding-fixing steps of ftfy to this text. Returns the fixed text, discarding the explanation.
>>> fix_encoding("Ã³") 'ó' >>> fix_encoding("&ATILDE;&SUP3;") '&ATILDE;&SUP3;'
The return type of the
..._and_explain functions is a kind of NamedTuple called
ExplainedText(text: str, explanation: Optional[List[Tuple[str, str]]])¶
The return type from ftfy’s functions that provide an “explanation” of which steps it applied to fix the text, such as
When the ‘explain’ option is disabled, these functions return the same type, but the
explanationwill be None.
These explanations can be re-applied to text using
apply_plan(text: str, plan: List[Tuple[str, str]])¶
Apply a plan for fixing the encoding of text.
The plan is a list of tuples of the form (operation, arg).
operationis one of:
'encode': convert a string to bytes, using
argas the encoding
'decode': convert bytes to a string, using
argas the encoding
'transcode': convert bytes to bytes, using the function named
'apply': convert a string to a string, using the function named
The functions that can be applied by ‘transcode’ and ‘apply’ are specifically those that appear in the dictionary named
FIXERS. They can also can be imported from the
>>> mojibake = "schÃ¶n" >>> text, plan = fix_and_explain(mojibake) >>> apply_plan(mojibake, plan) 'schön'
Showing the characters in a string¶
A different kind of explanation you might need is simply a breakdown of what Unicode characters a string contains. For this, ftfy provides a utility function,
A utility method that’s useful for debugging mysterious Unicode.
It breaks down a string, showing you for each codepoint its number in hexadecimal, its glyph, its category in the Unicode standard, and its name in the Unicode standard.
>>> explain_unicode('(╯°□°)╯︵ ┻━┻') U+0028 ( [Ps] LEFT PARENTHESIS U+256F ╯ [So] BOX DRAWINGS LIGHT ARC UP AND LEFT U+00B0 ° [So] DEGREE SIGN U+25A1 □ [So] WHITE SQUARE U+00B0 ° [So] DEGREE SIGN U+0029 ) [Pe] RIGHT PARENTHESIS U+256F ╯ [So] BOX DRAWINGS LIGHT ARC UP AND LEFT U+FE35 ︵ [Ps] PRESENTATION FORM FOR VERTICAL LEFT PARENTHESIS U+0020 [Zs] SPACE U+253B ┻ [So] BOX DRAWINGS HEAVY UP AND HORIZONTAL U+2501 ━ [So] BOX DRAWINGS HEAVY HORIZONTAL U+253B ┻ [So] BOX DRAWINGS HEAVY UP AND HORIZONTAL
A command-line utility that provides similar information, and even more detail, is lunasorcery’s utf8info.