Configuring ftfy

The main functions of ftfy – ftfy.fix_text() and ftfy.fix_and_explain() – run text through a sequence of fixes. If the text changed, it will run them through again, so that you can be sure the output ends up in a standard form that will be unchanged by ftfy.

All the fixes are on by default, but you can pass in a configuration object or keyword options to turn them off. Check that the default fixes are appropriate for your use case. For example:

  • You should set fix_entities to False if the output is meant to be interpreted as HTML.

  • You should set fix_character_width to False if you want to preserve the spacing of CJK text.

  • You should set uncurl_quotes to False if you want to preserve quotation marks with nice typography. You could even consider doing the opposite of uncurl_quotes, running smartypants on the result to make all the punctuation typographically nice.

  • To be cautious and only fix mojibake when it can be fixed with a consistent sequence of encoding and decoding steps, you should set decode_inconsistent_utf8 to False.

If the only fix you need is to detect and repair decoding errors (mojibake), use the ftfy.fix_encoding() function directly. However, note that mojibake is often entangled with other issues such as the curliness of quotation marks, so limiting the process to this step might make some mojibake unfixable.

The TextFixerConfig object

The top-level functions of ftfy take a config argument that is an instance of ftfy.TextFixerConfig. If this argument is None, the configuration will use its default values.

class ftfy.TextFixerConfig[source]

A TextFixerConfig object stores configuration options for ftfy.

It’s implemented as a namedtuple with defaults, so you can instantiate it by providing the values to change from their defaults as keyword arguments. For example, to disable ‘unescape_html’ and keep the rest of the defaults:

TextFixerConfig(unescape_html=False)

Here are the options and their default values:

  • unescape_html: “auto”

    Configures whether to replace HTML entities such as &amp; with the character they represent. “auto” says to do this by default, but disable it when a literal < character appears, indicating that the input is actual HTML and entities should be preserved. The value can be True, to always enable this fixer, or False, to always disable it.

  • remove_terminal_escapes: True

    Removes “ANSI” terminal escapes, such as for changing the color of text in a terminal window.

  • fix_encoding: True

    Detect mojibake and attempt to fix it by decoding the text in a different encoding standard.

    The following four options affect fix_encoding works, and do nothing if fix_encoding is False:

    • restore_byte_a0: True

      Allow a literal space (U+20) to be interpreted as a non-breaking space (U+A0) when that would make it part of a fixable mojibake string.

      Because spaces are very common characters, this could lead to false positives, but we try to apply it only when there’s strong evidence for mojibake. Disabling restore_byte_a0 is safer from false positives, but creates false negatives.

    • replace_lossy_sequences: True

      Detect mojibake that has been partially replaced by the characters ‘�’ or ‘?’. If the mojibake could be decoded otherwise, replace the detected sequence with ‘�’.

    • decode_inconsistent_utf8: True

      When we see sequences that distinctly look like UTF-8 mojibake, but there’s no consistent way to reinterpret the string in a new encoding, replace the mojibake with the appropriate UTF-8 characters anyway.

      This helps to decode strings that are concatenated from different encodings.

    • fix_c1_controls: True

      Replace C1 control characters (the useless characters U+80 - U+9B that come from Latin-1) with their Windows-1252 equivalents, like HTML5 does, even if the whole string doesn’t decode as Latin-1.

  • fix_latin_ligatures: True

    Replace common Latin-alphabet ligatures, such as , with the letters they’re made of.

  • fix_character_width: True

    Replace fullwidth Latin characters and halfwidth Katakana with their more standard widths.

  • uncurl_quotes: True

    Replace curly quotes with straight quotes.

  • fix_line_breaks: True

    Replace various forms of line breaks with the standard Unix line break, \n.

  • fix_surrogates: True

    Replace sequences of UTF-16 surrogate codepoints with the character they were meant to encode. This fixes text that was decoded with the obsolete UCS-2 standard, and allows it to support high-numbered codepoints such as emoji.

  • remove_control_chars: True

    Remove certain control characters that have no displayed effect on text.

  • normalization: “NFC”

    Choose what kind of Unicode normalization is applied. Usually, we apply NFC normalization, so that letters followed by combining characters become single combined characters.

    Changing this to “NFKC” applies more compatibility conversions, such as replacing the ‘micro sign’ with a standard Greek lowercase mu, which looks identical. However, some NFKC normalizations change the meaning of text, such as converting “10³” to “103”.

normalization can be None, to apply no normalization.

  • max_decode_length: 1_000_000

    The maximum size of “segment” that ftfy will try to fix all at once.

  • explain: True

    Whether to compute ‘explanations’, lists describing what ftfy changed. When this is False, the explanation will be None, and the code that builds the explanation will be skipped, possibly saving time.

    Functions that accept TextFixerConfig and don’t return an explanation will automatically set explain to False.

Keyword arguments

The top-level functions also accept keyword arguments in place of a config argument. Given these keyword arguments, they will pass them to the ftfy.TextFixerConfig constructor, overriding the default values of those configuration options.