How to use the sacrebleu.UnicodeRegex.nondigit_punct_re function in sacrebleu

To help you get started, we’ve selected a few sacrebleu examples, based on popular ways it is used in public projects.

Secure your code as it's written. Use Snyk Code to scan source code in minutes - no build needed - and fix issues immediately.

github mjpost / sacreBLEU / sacrebleu.py View on Github external
So we just tokenize on punctuation and symbols,
    except when a punctuation is preceded and followed by a digit
    (e.g. a comma/dot as a thousand/decimal separator).

    Note that a number (e.g., a year) followed by a dot at the end of sentence is NOT tokenized,
    i.e. the dot stays with the number because `s/(\p{P})(\P{N})/ $1 $2/g`
    does not match this case (unless we add a space after each sentence).
    However, this error is already in the original mteval-v14.pl
    and we want to be consistent with it.
    The error is not present in the non-international version,
    which uses `$norm_text = " $norm_text "` (or `norm = " {} ".format(norm)` in Python).

    :param string: the input string
    :return: a list of tokens
    """
    string = UnicodeRegex.nondigit_punct_re().sub(r'\1 \2 ', string)
    string = UnicodeRegex.punct_nondigit_re().sub(r' \1 \2', string)
    string = UnicodeRegex.symbol_re().sub(r' \1 ', string)
    return string.strip()