Cyrillic text tagged as Serbian ('sr') is transliterated to the Latin
alphabet:

>>> preprocess_text('культуры', 'sr')
"kul'tury"
Azerbaijani (Azeri) has a similar transliteration step to Serbian,
and then the Latin-alphabet text is handled similarly to Turkish.
>>> preprocess_text('бағырты', 'az')
'bağırtı'
We don't transliterate Traditional to Simplified Chinese in this step.
There are some steps where we unify them internally: see chinese.py
for more information.
"""
# NFC or NFKC normalization, as needed for the language
info = get_language_info(language)
text = unicodedata.normalize(info['normal_form'], text)
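# (For example, NFKC folds compatibility characters such as the ligature
# 'ﬁ' (U+FB01) to 'fi', while NFC leaves them unchanged.)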
# Transliteration of multi-script languages
if info['transliteration'] is not None:
    text = transliterate(info['transliteration'], text)
# Abjad mark removal
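# (For abjad scripts such as Arabic and Hebrew, this strips optional vowel
# marks like Arabic harakat and Hebrew niqqud, so marked and unmarked
# spellings are treated alike.)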
if info['remove_marks']:
    text = remove_marks(text)
# Case folding
if info['dotless_i']:
    text = casefold_with_i_dots(text)
else:
    text = text.casefold()
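
# Illustrative usage (not from the original source), assuming the dotless-i
# handling described above: Turkish case folding maps 'İ' to 'i' and 'I' to
# 'ı' instead of leaving a combining dot behind.
#
#     >>> preprocess_text('İSTANBUL', 'tr')
#     'istanbul'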
# Frequencies for multiple tokens are combined using the formula
# 1 / f = 1 / f1 + 1 / f2 + ...
# Thus the resulting frequency is less than any individual frequency, and
# the smallest frequency dominates the sum.
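# For example, combining frequencies of 1e-4 and 2e-4 gives
# 1 / f = 10000 + 5000 = 15000, so f = 1 / 15000, or about 6.7e-5.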
freqs = get_frequency_dict(lang, wordlist)
one_over_result = 0.0
for token in tokens:
    if token not in freqs:
        # If any word is missing, just return the default value
        return minimum
    one_over_result += 1.0 / freqs[token]
freq = 1.0 / one_over_result
if get_language_info(lang)['tokenizer'] == 'jieba':
    # If we used the Jieba tokenizer, we could tokenize anything to match
    # our wordlist, even nonsense. To counteract this, we multiply by a
    # probability for each word break that was inferred.
    freq /= INFERRED_SPACE_FACTOR ** (len(tokens) - 1)
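    # For example, if Jieba split the text into 3 tokens, 2 word breaks were
    # inferred, so the frequency is divided by INFERRED_SPACE_FACTOR ** 2.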
# All our frequency data is only precise to within 1% anyway, so round
# it to 3 significant digits
unrounded = max(freq, minimum)
if unrounded == 0.:
    return 0.
else:
    leading_zeroes = math.floor(-math.log(unrounded, 10))
    return round(unrounded, leading_zeroes + 3)
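
# Illustrative sketch (not part of wordfreq) of the rounding rule used above;
# round_to_3_significant_digits is a hypothetical helper name.
import math

def round_to_3_significant_digits(unrounded):
    # Same logic as above: keep roughly 3 significant digits of the frequency.
    if unrounded == 0.:
        return 0.
    leading_zeroes = math.floor(-math.log(unrounded, 10))
    return round(unrounded, leading_zeroes + 3)

# round_to_3_significant_digits(0.000123456) -> 0.000123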
If `external_wordlist` is True, then wordfreq will not use its own Chinese
wordlist for tokenization.
Instead, it will use the large wordlist packaged with the Jieba tokenizer,
and it will leave Traditional Chinese characters as is. This will probably
give more accurate tokenization, but the resulting tokens won't necessarily
have word frequencies that can be looked up.
If you end up seeing tokens that are entire phrases or sentences glued
together, that probably means you passed in CJK text with the wrong
language code.
"""
# Use globals to load CJK tokenizers on demand, so that we can still run
# in environments that lack the CJK dependencies
global _mecab_tokenize, _jieba_tokenize
language = langcodes.get(lang)
info = get_language_info(language)
text = preprocess_text(text, language)
if info['tokenizer'] == 'mecab':
    from wordfreq.mecab import mecab_tokenize as _mecab_tokenize
    # Get just the language code out of the Language object, so we can
    # use it to select a MeCab dictionary
    tokens = _mecab_tokenize(text, language.language)
    if not include_punctuation:
        tokens = [token for token in tokens if not PUNCT_RE.match(token)]
elif info['tokenizer'] == 'jieba':
    from wordfreq.chinese import jieba_tokenize as _jieba_tokenize
    tokens = _jieba_tokenize(text, external_wordlist=external_wordlist)
    if not include_punctuation:
        tokens = [token for token in tokens if not PUNCT_RE.match(token)]
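
# Illustrative usage (not from the original source): the tokenizer backend is
# selected by language. Exact token boundaries depend on the MeCab or Jieba
# dictionaries that are installed, so the outputs are not shown here.
#
#     tokenize('これはテストです', 'ja')    # Japanese text, MeCab backend
#     tokenize('谢谢你', 'zh')              # Chinese text, Jieba with wordfreq's wordlist
#     tokenize('谢谢你', 'zh', external_wordlist=True)   # Jieba's packaged wordlist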
In particular:
- Any sequence of 2 or more adjacent digits, possibly with intervening
  punctuation such as a decimal point, will have each of its digits replaced
  with '0', so that frequencies for numbers don't have to be counted
  separately. (A sketch of this rule appears after the function body below.)
This is similar to but not quite identical to the word2vec Google News
data, which replaces digits with '#' in tokens with more than one digit.
- In Chinese, unless Traditional Chinese is specifically requested using
'zh-Hant', all characters will be converted to Simplified Chinese.
"""
global _simplify_chinese
info = get_language_info(lang)
tokens = tokenize(text, lang, include_punctuation, external_wordlist)
if info['lookup_transliteration'] == 'zh-Hans':
    from wordfreq.chinese import simplify_chinese as _simplify_chinese
    tokens = [_simplify_chinese(token) for token in tokens]
return [smash_numbers(token) for token in tokens]
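
# Minimal sketch (an assumption, not wordfreq's actual code) of the digit
# handling described in the docstring above: any run of 2 or more digits,
# possibly with intervening punctuation such as a decimal point, has every
# digit replaced with '0'; single digits are left alone.
import re

_MULTI_DIGIT_RE = re.compile(r'\d+(?:[.,]\d+)+|\d{2,}')

def smash_numbers_sketch(token):
    return _MULTI_DIGIT_RE.sub(lambda match: re.sub(r'\d', '0', match.group()), token)

# smash_numbers_sketch('3.14')  -> '0.00'
# smash_numbers_sketch('2024')  -> '0000'
# smash_numbers_sketch('7')     -> '7'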