Secure your code as it's written. Use Snyk Code to scan source code in minutes - no build needed - and fix issues immediately.
So we just tokenize on punctuation and symbols,
except when a punctuation is preceded and followed by a digit
(e.g. a comma/dot as a thousand/decimal separator).
Note that a number (e.g., a year) followed by a dot at the end of sentence is NOT tokenized,
i.e. the dot stays with the number because `s/(\p{P})(\P{N})/ $1 $2/g`
does not match this case (unless we add a space after each sentence).
However, this error is already in the original mteval-v14.pl
and we want to be consistent with it.
The error is not present in the non-international version,
which uses `$norm_text = " $norm_text "` (or `norm = " {} ".format(norm)` in Python).
:param string: the input string
:return: a list of tokens
"""
string = UnicodeRegex.nondigit_punct_re().sub(r'\1 \2 ', string)
string = UnicodeRegex.punct_nondigit_re().sub(r' \1 \2', string)
string = UnicodeRegex.symbol_re().sub(r' \1 ', string)
return string.strip()