wordfreq


README

wordfreq

A really terrible port of the great Python library https://github.com/rspeer/wordfreq to Go, specifically for Chinese and Japanese.

The original library was authored by Robyn Speer.

Please see the README of the original project for more information, as not everything is fully documented here. What is documented was mostly adapted from the original README.

Reasons not to trust this port:

  1. It doesn't use the same tokenizers -- rather than Jieba and MeCab for Chinese and Japanese respectively, this port uses gse and Kagome. These are great tokenizers, but because they don't match the tokenizers used to compile the original wordfreq data, the results won't line up exactly, even though roughly the same dictionaries are used. I don't know how large the impact is, but it is safe to assume it makes the results somewhat inaccurate.
  2. I didn't put a lot of effort into testing it other than using it to determine frequencies for my dictionary project. It's literally just an AI port that I did my best to clean up and remove all the junk from.
  3. Probably others

Sources and supported languages

This data comes from a Luminoso project called Exquisite Corpus, whose goal is to download good, varied, multilingual corpus data, process it appropriately, and combine it into unified resources such as wordfreq.

Exquisite Corpus compiles 8 different domains of text, some of which themselves come from multiple sources:

  • Wikipedia, representing encyclopedic text
  • Subtitles, from OPUS OpenSubtitles 2018 and SUBTLEX
  • News, from NewsCrawl 2014 and GlobalVoices
  • Books, from Google Books Ngrams 2012
  • Web text, from OSCAR
  • Twitter, representing short-form social media
  • Reddit, representing potentially longer Internet comments
  • Miscellaneous word frequencies: in Chinese, we import a free wordlist that comes with the Jieba word segmenter, whose provenance we don't really know

The following languages are supported, with reasonable tokenization and at least 3 different sources of word frequencies:

Language   Code    #  Large? │ WP    Subs  News  Books Web   Twit. Redd. Misc.
─────────────────────────────┼────────────────────────────────────────────────
Chinese    zh [1]  7  Yes    │ Yes   Yes   Yes   Yes   Yes   Yes   -     Jieba
Japanese   ja      5  Yes    │ Yes   Yes   -     -     Yes   Yes   Yes   -

[1] This data represents text written in both Simplified and Traditional Chinese, with primarily Mandarin Chinese vocabulary.

License

The code is freely redistributable under the same Apache license as the original (see LICENSE.txt), and it includes data files from the original that may be redistributed under a Creative Commons Attribution-ShareAlike 4.0 license (https://creativecommons.org/licenses/by-sa/4.0/).

wordfreq (Go port) contains data extracted from Google Books Ngrams (http://books.google.com/ngrams) and Google Books Syntactic Ngrams (http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html). The terms of use of this data are:

Ngram Viewer graphs and data may be freely used for any purpose, although
acknowledgement of Google Books Ngram Viewer as the source, and inclusion
of a link to http://books.google.com/ngrams, would be appreciated.

wordfreq (Go port) also contains data derived from the following Creative Commons-licensed sources:

It contains data from OPUS OpenSubtitles 2018 (http://opus.nlpl.eu/OpenSubtitles.php), whose data originates from the OpenSubtitles project (http://www.opensubtitles.org/) and may be used with attribution to OpenSubtitles.

This Go port of wordfreq contains data derived from various SUBTLEX word lists: SUBTLEX-US, SUBTLEX-UK, SUBTLEX-CH, SUBTLEX-DE, and SUBTLEX-NL, created by Marc Brysbaert et al. (see citations below) and available at http://crr.ugent.be/programs-data/subtitle-frequencies.

The original wordfreq author (Robyn Speer) obtained permission by e-mail from Marc Brysbaert to distribute these wordlists in the original wordfreq, to be used for any purpose, not just for academic use, under these conditions:

  • Wordfreq and code derived from it must credit the SUBTLEX authors.
  • It must remain clear that SUBTLEX is freely available data.

As this Go port is code derived from the original wordfreq, it operates under the same conditions and credits the SUBTLEX authors accordingly.

These terms are similar to the Creative Commons Attribution-ShareAlike license.

Some additional data was collected by a custom application that watches the streaming Twitter API, in accordance with Twitter's Developer Agreement & Policy. This software gives statistics about words that are commonly used on Twitter; it does not display or republish any Twitter content.

Citations to work that the original wordfreq (and hence this port) is built on

  • Bojar, O., Chatterjee, R., Federmann, C., Haddow, B., Huck, M., Hokamp, C., Koehn, P., Logacheva, V., Monz, C., Negri, M., Post, M., Scarton, C., Specia, L., & Turchi, M. (2015). Findings of the 2015 Workshop on Statistical Machine Translation. http://www.statmt.org/wmt15/results.html

  • Brysbaert, M. & New, B. (2009). Moving beyond Kucera and Francis: A Critical Evaluation of Current Word Frequency Norms and the Introduction of a New and Improved Word Frequency Measure for American English. Behavior Research Methods, 41 (4), 977-990. http://sites.google.com/site/borisnew/pub/BrysbaertNew2009.pdf

  • Brysbaert, M., Buchmeier, M., Conrad, M., Jacobs, A.M., Bölte, J., & Böhl, A. (2011). The word frequency effect: A review of recent developments and implications for the choice of frequency estimates in German. Experimental Psychology, 58, 412-424.

  • Cai, Q., & Brysbaert, M. (2010). SUBTLEX-CH: Chinese word and character frequencies based on film subtitles. PLoS One, 5(6), e10729. http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0010729

  • Davis, M. (2012). Unicode text segmentation. Unicode Standard Annex, 29. http://unicode.org/reports/tr29/

  • Halácsy, P., Kornai, A., Németh, L., Rung, A., Szakadát, I., & Trón, V. (2004). Creating open language resources for Hungarian. In Proceedings of the 4th international conference on Language Resources and Evaluation (LREC2004). http://mokk.bme.hu/resources/webcorpus/

  • Keuleers, E., Brysbaert, M. & New, B. (2010). SUBTLEX-NL: A new frequency measure for Dutch words based on film subtitles. Behavior Research Methods, 42(3), 643-650. http://crr.ugent.be/papers/SUBTLEX-NL_BRM.pdf

  • Kudo, T. (2005). Mecab: Yet another part-of-speech and morphological analyzer. http://mecab.sourceforge.net/

  • Lin, Y., Michel, J.-B., Aiden, E. L., Orwant, J., Brockman, W., and Petrov, S. (2012). Syntactic annotations for the Google Books Ngram Corpus. Proceedings of the ACL 2012 system demonstrations, 169-174. http://aclweb.org/anthology/P12-3029

  • Lison, P. and Tiedemann, J. (2016). OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016). http://stp.lingfil.uu.se/~joerg/paper/opensubs2016.pdf

  • Ortiz Suárez, P. J., Sagot, B., and Romary, L. (2019). Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures. In Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. https://oscar-corpus.com/publication/2019/clmc7/asynchronous/

  • ParaCrawl (2018). Provision of Web-Scale Parallel Corpora for Official European Languages. https://paracrawl.eu/

  • van Heuven, W. J., Mandera, P., Keuleers, E., & Brysbaert, M. (2014). SUBTLEX-UK: A new and improved word frequency database for British English. The Quarterly Journal of Experimental Psychology, 67(6), 1176-1190. http://www.tandfonline.com/doi/pdf/10.1080/17470218.2013.850521

Documentation

Index

Constants

const (
	// CacheSize for frequency lookups
	CacheSize = 100000

	// InferredSpaceFactor is applied for each inferred word boundary in Chinese
	InferredSpaceFactor = 10.0
)

Constants from the original implementation
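
For context: in the original wordfreq, a phrase that the tokenizer splits into several tokens gets a combined frequency (the reciprocal of the sum of the tokens' reciprocal frequencies), which is then divided by InferredSpaceFactor once for each inferred word boundary. A minimal sketch of that step, assuming this port mirrors the original behavior; combineFreqs is a hypothetical helper, not part of this package's API:

// combineFreqs is a hypothetical illustration of how InferredSpaceFactor
// is applied in the original wordfreq; it is not part of this package.
func combineFreqs(tokenFreqs []float64) float64 {
	oneOver := 0.0
	for _, f := range tokenFreqs {
		oneOver += 1.0 / f // harmonic-style combination of token frequencies
	}
	freq := 1.0 / oneOver
	// Divide by 10 for each word boundary the tokenizer had to infer.
	for i := 1; i < len(tokenFreqs); i++ {
		freq /= InferredSpaceFactor
	}
	return freq
}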

Variables

This section is empty.

Functions

func CBToFreq

func CBToFreq(cB int) float64

CBToFreq converts centibels to frequency proportion (0-1)
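
In the original wordfreq's cBpack format, a value in centibels is 100 times the base-10 logarithm of the frequency, so converting back is freq = 10^(cB/100). Assuming this port keeps that convention:

freq := CBToFreq(-500)
// presumably 1e-5, since 10^(-500/100) = 10^-5
_ = freq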

func DigitFreq

func DigitFreq(text string) float64

DigitFreq gets the relative frequency of a string of digits, using our estimates

func FreqToZipf

func FreqToZipf(freq float64) float64

FreqToZipf converts frequency proportion to Zipf scale

func HasDigitSequence

func HasDigitSequence(text string) bool

HasDigitSequence returns true if the text has a digit sequence that will be normalized out and handled with DigitFreq

func SmashNumbers

func SmashNumbers(text string) string

SmashNumbers replaces sequences of multiple digits with zeroes, so we don't need to distinguish the frequencies of thousands of numbers
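
Together with HasDigitSequence and DigitFreq, this reproduces the original wordfreq's number handling: multi-digit runs are collapsed to zeroes before lookup, and the digits' own frequency is estimated separately. A hedged illustration (the outputs are assumptions based on the Python implementation):

text := "room 101"
if HasDigitSequence(text) {
	smashed := SmashNumbers(text) // presumably "room 000"
	digits := DigitFreq(text)     // estimated relative frequency of "101"
	_, _ = smashed, digits
}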

func ZipfToFreq

func ZipfToFreq(zipf float64) float64

ZipfToFreq converts Zipf scale to frequency proportion
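
The Zipf scale, as the original wordfreq defines it, is the base-10 logarithm of a word's frequency per billion words, i.e. zipf = log10(freq) + 9. Assuming the same definition here, the two conversions round-trip:

z := FreqToZipf(1e-6) // presumably 3.0: log10(1e-6) + 9
f := ZipfToFreq(3.0)  // presumably 1e-6: 10^(3.0 - 9)
_, _ = z, f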

Types

type ChineseProcessor

type ChineseProcessor struct {
	// contains filtered or unexported fields
}

ChineseProcessor handles Chinese text processing and tokenization

func NewChineseProcessor

func NewChineseProcessor(dataLoader *DataLoader) *ChineseProcessor

NewChineseProcessor creates a new Chinese processor

func (*ChineseProcessor) SimplifyChinese

func (cp *ChineseProcessor) SimplifyChinese(text string) (string, error)

SimplifyChinese converts Chinese text character-by-character to Simplified Chinese

func (*ChineseProcessor) Tokenize

func (cp *ChineseProcessor) Tokenize(text string) ([]string, error)

Tokenize tokenizes Chinese text using GSE (Jieba-equivalent)
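
A minimal usage sketch based on the signatures above; the sample text and its segmentation are illustrative only, and the snippet assumes the fmt and log packages are imported:

dl := NewDataLoader()
cp := NewChineseProcessor(dl)

simplified, err := cp.SimplifyChinese("漢字") // Traditional -> Simplified, e.g. "汉字"
if err != nil {
	log.Fatal(err)
}
tokens, err := cp.Tokenize(simplified) // segmented with gse
if err != nil {
	log.Fatal(err)
}
fmt.Println(tokens)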

type DataLoader

type DataLoader struct {
	// contains filtered or unexported fields
}

DataLoader handles loading and caching of embedded wordfreq data files

func NewDataLoader

func NewDataLoader() *DataLoader

NewDataLoader creates a new data loader

func (*DataLoader) GetFrequencyDict

func (dl *DataLoader) GetFrequencyDict(lang Language, wordlist WordlistType) (map[string]float64, error)

GetFrequencyDict converts frequency list to a map for faster lookups
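
A sketch of pulling a whole wordlist into a map, assuming it is keyed by word with frequency proportions (0-1) as values:

dl := NewDataLoader()
freqs, err := dl.GetFrequencyDict(LanguageJapanese, WordlistSmall)
if err != nil {
	log.Fatal(err)
}
fmt.Println(len(freqs))    // number of entries in the wordlist
fmt.Println(freqs["日本"]) // frequency proportion; 0 if absent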

func (*DataLoader) LoadChineseMapping

func (dl *DataLoader) LoadChineseMapping() (map[rune]string, error)

LoadChineseMapping loads the Traditional->Simplified Chinese character mapping from embedded data

func (*DataLoader) ReadCBPack

func (dl *DataLoader) ReadCBPack(filename string) ([][]string, error)

ReadCBPack reads a cBpack file from embedded data and returns the frequency data

func (*DataLoader) ReadTextFile

func (dl *DataLoader) ReadTextFile(filename string) (string, error)

ReadTextFile reads a text file from embedded data

type JapaneseProcessor

type JapaneseProcessor struct {
	// contains filtered or unexported fields
}

JapaneseProcessor handles Japanese text processing and tokenization

func NewJapaneseProcessor

func NewJapaneseProcessor() (*JapaneseProcessor, error)

NewJapaneseProcessor creates a new Japanese processor

func (*JapaneseProcessor) Tokenize

func (jp *JapaneseProcessor) Tokenize(text string) ([]string, error)

Tokenize tokenizes Japanese text using Kagome morphological analyzer
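
A minimal usage sketch; the segmentation shown is what a MeCab-style analyzer such as Kagome would typically produce, but it is illustrative only:

jp, err := NewJapaneseProcessor()
if err != nil {
	log.Fatal(err)
}
tokens, err := jp.Tokenize("私は学生です")
if err != nil {
	log.Fatal(err)
}
fmt.Println(tokens) // e.g. [私 は 学生 です]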

type Language

type Language string

Language represents a language code

const (
	LanguageChinese  Language = "zh"
	LanguageJapanese Language = "ja"
)

type LanguageInfo

type LanguageInfo struct {
	Tokenizer             TokenizerType
	LookupTransliteration string
}

LanguageInfo contains metadata about how to handle a language

func GetLanguageInfo

func GetLanguageInfo(lang Language) LanguageInfo

GetLanguageInfo returns metadata about how to handle text in a given language

type Tokenizer

type Tokenizer struct {
	// contains filtered or unexported fields
}

Tokenizer handles text tokenization for Chinese and Japanese

func NewTokenizer

func NewTokenizer(dataLoader *DataLoader) (*Tokenizer, error)

NewTokenizer creates a new tokenizer

func (*Tokenizer) LossyTokenize

func (t *Tokenizer) LossyTokenize(text string, lang Language) ([]string, error)

LossyTokenize performs lossy tokenization for frequency lookup, matching the original wordfreq behavior
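
Tokenizer routes text to the appropriate processor by language; a hedged sketch:

t, err := NewTokenizer(NewDataLoader())
if err != nil {
	log.Fatal(err)
}
tokens, err := t.LossyTokenize("自然言語処理", LanguageJapanese)
if err != nil {
	log.Fatal(err)
}
fmt.Println(tokens)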

type TokenizerType

type TokenizerType string

TokenizerType represents different tokenization methods

const (
	TokenizerGse    TokenizerType = "gse"
	TokenizerKagome TokenizerType = "kagome"
)

type WordFreq

type WordFreq struct {
	// contains filtered or unexported fields
}

WordFreq is the main interface for word frequency lookups

func New

func New() (*WordFreq, error)

New creates a new WordFreq instance

func (*WordFreq) WordFrequency

func (wf *WordFreq) WordFrequency(word string, lang Language, wordlist WordlistType, minimum float64) (float64, error)

WordFrequency gets the frequency of a word in the specified language. Returns a value between 0 and 1, where 1 means the word appears in every token

func (*WordFreq) ZipfFrequency

func (wf *WordFreq) ZipfFrequency(word string, lang Language, wordlist WordlistType, minimum float64) (float64, error)

ZipfFrequency gets the frequency of a word on the Zipf scale
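
Putting it together, a complete usage sketch. The import path is a placeholder, since the real module path is not shown here, and the minimum argument is presumably the floor value returned for unknown words:

package main

import (
	"fmt"
	"log"

	wordfreq "example.com/wordfreq" // placeholder: substitute the module's real import path
)

func main() {
	wf, err := wordfreq.New()
	if err != nil {
		log.Fatal(err)
	}

	// Frequency as a proportion of all tokens (0-1).
	freq, err := wf.WordFrequency("猫", wordfreq.LanguageJapanese, wordfreq.WordlistBest, 0)
	if err != nil {
		log.Fatal(err)
	}

	// The same word on the human-friendlier Zipf scale.
	zipf, err := wf.ZipfFrequency("猫", wordfreq.LanguageJapanese, wordfreq.WordlistBest, 0)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("freq=%g zipf=%.2f\n", freq, zipf)
}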

type WordlistType

type WordlistType string

WordlistType represents different wordlist sizes

const (
	WordlistSmall WordlistType = "small"
	WordlistLarge WordlistType = "large"
	WordlistBest  WordlistType = "best"
)
