Heuristics for detecting mojibake¶
The “badness” heuristic¶
ftfy.badness contains a heuristic that detects likely mojibake.
This heuristic signals to ftfy which segments of text need to be fixed, and also indicates when the text can stop being fixed.
The design of this heuristic is that we categorize the approximately 400 Unicode characters that occur in UTF-8 mojibake, specifically the characters that come from mixing up UTF-8 with the other encodings we support. We identify sequences and contexts of these characters that are much more likely to be mojibake than intended strings, such as lowercase accented letters followed immediately by currency symbols.
Get the ‘badness’ of a sequence of text, counting the number of unlikely character sequences. A badness greater than 0 indicates that some of it seems to be mojibake.
Returns true iff the given text looks like it contains mojibake.
This can be faster than
badness, because it returns when the first match is found to a regex instead of counting matches. Note that as strings get longer, they have a higher chance of returning True for
The “UTF-8 detector” heuristic¶
A more narrow heuristic is defined in
UTF8_DETECTOR_RE. This heuristic looks for specific sequences of mojibake characters that come from the decoding of common UTF-8 sequences.
Text that matches this regular expression can be partially fixed by
ftfy.fixes.decode_inconsistent_utf8(), even when the string as a whole doesn’t decode consistently.
Because of this, the expression requires that the match isn’t preceded by likely UTF-8 characters – if this were allowed, then it might pick two or three characters out of a larger mess of mojibake to decode as another character while leaving the rest untouched. This makes the problem more confusing, doesn’t really solve anything, and can even pile up recursively to decode as entirely arbitrary characters.
The “lossy UTF-8” heuristic¶
chardata.py also includes
LOSSY_UTF8_RE, which is used similarly to the “UTF-8 detector” heuristic. This regular expression matches sequences that look like they were incorrectly decoded from UTF-8, but with characters replaced by question marks or the Unicode replacement character
Characters that match this heuristic will be replaced by
� in the