wordfreq


README

wordfreq

A really terrible port of the great Python library https://github.com/rspeer/wordfreq to Go, specifically for Chinese and Japanese.

The original library was authored by Robyn Speer.

Please see the README of the original project for more information, as not everything is fully documented here. What is documented was mostly adapted from the original README.

Reasons not to trust this port:

  1. It doesn't use the same tokenizers -- rather than Jieba and MeCab for Chinese and Japanese respectively, this port uses gse and Kagome. These are great tokenizers, but because they don't match the tokenizers used to compile the original wordfreq data, the results won't line up exactly, even though roughly the same dictionaries are used. I don't know how large the impact is, but it is safe to assume it makes the results somewhat inaccurate.
  2. I didn't put a lot of effort into testing it other than using it to determine frequencies for my dictionary project. It's literally just an AI port that I did my best to clean up and remove all the junk from.
  3. Probably others

Sources and supported languages

This data comes from a Luminoso project called Exquisite Corpus, whose goal is to download good, varied, multilingual corpus data, process it appropriately, and combine it into unified resources such as wordfreq.

Exquisite Corpus compiles 8 different domains of text, some of which themselves come from multiple sources:

  • Wikipedia, representing encyclopedic text
  • Subtitles, from OPUS OpenSubtitles 2018 and SUBTLEX
  • News, from NewsCrawl 2014 and GlobalVoices
  • Books, from Google Books Ngrams 2012
  • Web text, from OSCAR
  • Twitter, representing short-form social media
  • Reddit, representing potentially longer Internet comments
  • Miscellaneous word frequencies: in Chinese, we import a free wordlist that comes with the Jieba word segmenter, whose provenance we don't really know

The following languages are supported, with reasonable tokenization and at least 3 different sources of word frequencies:

Language   Code    #  Large? │ WP    Subs  News  Books Web   Twit. Redd. Misc.
─────────────────────────────┼────────────────────────────────────────────────
Chinese    zh [1]  7  Yes    │ Yes   Yes   Yes   Yes   Yes   Yes   -     Jieba
Japanese   ja      5  Yes    │ Yes   Yes   -     -     Yes   Yes   Yes   -

[1] This data represents text written in both Simplified and Traditional Chinese, with primarily Mandarin Chinese vocabulary.

License

The code is freely redistributable under the same Apache license as the original (see LICENSE.txt), and it includes data files from the original that may be redistributed under a Creative Commons Attribution-ShareAlike 4.0 license (https://creativecommons.org/licenses/by-sa/4.0/).

wordfreq (Go port) contains data extracted from Google Books Ngrams (http://books.google.com/ngrams) and Google Books Syntactic Ngrams (http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html). The terms of use of this data are:

Ngram Viewer graphs and data may be freely used for any purpose, although
acknowledgement of Google Books Ngram Viewer as the source, and inclusion
of a link to http://books.google.com/ngrams, would be appreciated.

wordfreq (Go port) also contains data derived from the following Creative Commons-licensed sources:

It contains data from OPUS OpenSubtitles 2018 (http://opus.nlpl.eu/OpenSubtitles.php), whose data originates from the OpenSubtitles project (http://www.opensubtitles.org/) and may be used with attribution to OpenSubtitles.

This Go port of wordfreq contains data derived from various SUBTLEX word lists: SUBTLEX-US, SUBTLEX-UK, SUBTLEX-CH, SUBTLEX-DE, and SUBTLEX-NL, created by Marc Brysbaert et al. (see citations below) and available at http://crr.ugent.be/programs-data/subtitle-frequencies.

The original wordfreq author (Robyn Speer) obtained permission by e-mail from Marc Brysbaert to distribute these wordlists in the original wordfreq, to be used for any purpose, not just for academic use, under these conditions:

  • Wordfreq and code derived from it must credit the SUBTLEX authors.
  • It must remain clear that SUBTLEX is freely available data.

As this Go port is code derived from the original wordfreq, it operates under the same conditions and credits the SUBTLEX authors accordingly.

These terms are similar to the Creative Commons Attribution-ShareAlike license.

Some additional data was collected by a custom application that watches the streaming Twitter API, in accordance with Twitter's Developer Agreement & Policy. This software gives statistics about words that are commonly used on Twitter; it does not display or republish any Twitter content.

Citations to work that the original wordfreq (and hence this port) is built on

  • Bojar, O., Chatterjee, R., Federmann, C., Haddow, B., Huck, M., Hokamp, C., Koehn, P., Logacheva, V., Monz, C., Negri, M., Post, M., Scarton, C., Specia, L., & Turchi, M. (2015). Findings of the 2015 Workshop on Statistical Machine Translation. http://www.statmt.org/wmt15/results.html

  • Brysbaert, M. & New, B. (2009). Moving beyond Kucera and Francis: A Critical Evaluation of Current Word Frequency Norms and the Introduction of a New and Improved Word Frequency Measure for American English. Behavior Research Methods, 41 (4), 977-990. http://sites.google.com/site/borisnew/pub/BrysbaertNew2009.pdf

  • Brysbaert, M., Buchmeier, M., Conrad, M., Jacobs, A.M., Bölte, J., & Böhl, A. (2011). The word frequency effect: A review of recent developments and implications for the choice of frequency estimates in German. Experimental Psychology, 58, 412-424.

  • Cai, Q., & Brysbaert, M. (2010). SUBTLEX-CH: Chinese word and character frequencies based on film subtitles. PLoS One, 5(6), e10729. http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0010729

  • Davis, M. (2012). Unicode text segmentation. Unicode Standard Annex, 29. http://unicode.org/reports/tr29/

  • Halácsy, P., Kornai, A., Németh, L., Rung, A., Szakadát, I., & Trón, V. (2004). Creating open language resources for Hungarian. In Proceedings of the 4th international conference on Language Resources and Evaluation (LREC2004). http://mokk.bme.hu/resources/webcorpus/

  • Keuleers, E., Brysbaert, M. & New, B. (2010). SUBTLEX-NL: A new frequency measure for Dutch words based on film subtitles. Behavior Research Methods, 42(3), 643-650. http://crr.ugent.be/papers/SUBTLEX-NL_BRM.pdf

  • Kudo, T. (2005). Mecab: Yet another part-of-speech and morphological analyzer. http://mecab.sourceforge.net/

  • Lin, Y., Michel, J.-B., Aiden, E. L., Orwant, J., Brockman, W., and Petrov, S. (2012). Syntactic annotations for the Google Books Ngram Corpus. Proceedings of the ACL 2012 system demonstrations, 169-174. http://aclweb.org/anthology/P12-3029

  • Lison, P. and Tiedemann, J. (2016). OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016). http://stp.lingfil.uu.se/~joerg/paper/opensubs2016.pdf

  • Ortiz Suárez, P. J., Sagot, B., and Romary, L. (2019). Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures. In Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. https://oscar-corpus.com/publication/2019/clmc7/asynchronous/

  • ParaCrawl (2018). Provision of Web-Scale Parallel Corpora for Official European Languages. https://paracrawl.eu/

  • van Heuven, W. J., Mandera, P., Keuleers, E., & Brysbaert, M. (2014). SUBTLEX-UK: A new and improved word frequency database for British English. The Quarterly Journal of Experimental Psychology, 67(6), 1176-1190. http://www.tandfonline.com/doi/pdf/10.1080/17470218.2013.850521

Documentation

Index

Constants

const (
	// CacheSize for frequency lookups
	CacheSize = 100000

	// InferredSpaceFactor is applied for each inferred word boundary in Chinese
	InferredSpaceFactor = 10.0
)

Constants from the original implementation
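
For context: in the original wordfreq, a phrase that the tokenizer splits into several tokens gets a combined frequency (the reciprocal of the sum of the tokens' reciprocal frequencies), which is then divided by InferredSpaceFactor once for each inferred word boundary. A minimal sketch of that step, assuming this port mirrors the original behavior; combineFreqs is a hypothetical helper, not part of this package's API:

// combineFreqs is a hypothetical illustration of how InferredSpaceFactor
// is applied in the original wordfreq; it is not part of this package.
func combineFreqs(tokenFreqs []float64) float64 {
	oneOver := 0.0
	for _, f := range tokenFreqs {
		oneOver += 1.0 / f // harmonic-style combination of token frequencies
	}
	freq := 1.0 / oneOver
	// Divide by 10 for each word boundary the tokenizer had to infer.
	for i := 1; i < len(tokenFreqs); i++ {
		freq /= InferredSpaceFactor
	}
	return freq
}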

Variables

This section is empty.

Functions

func CBToFreq

func CBToFreq(cB int) float64

CBToFreq converts centibels to frequency proportion (0-1)
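
In the original wordfreq's cBpack format, a value in centibels is 100 times the base-10 logarithm of the frequency, so converting back is freq = 10^(cB/100). Assuming this port keeps that convention:

freq := CBToFreq(-500)
// presumably 1e-5, since 10^(-500/100) = 10^-5
_ = freq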

func DigitFreq

func DigitFreq(text string) float64

DigitFreq gets the relative frequency of a string of digits, using our estimates

func FreqToZipf

func FreqToZipf(freq float64) float64

FreqToZipf converts frequency proportion to Zipf scale

func HasDigitSequence

func HasDigitSequence(text string) bool

HasDigitSequence returns true if the text has a digit sequence that will be normalized out and handled with DigitFreq

func SmashNumbers

func SmashNumbers(text string) string

SmashNumbers replaces sequences of multiple digits with zeroes, so we don't need to distinguish the frequencies of thousands of numbers
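
Together with HasDigitSequence and DigitFreq, this reproduces the original wordfreq's number handling: multi-digit runs are collapsed to zeroes before lookup, and the digits' own frequency is estimated separately. A hedged illustration (the outputs are assumptions based on the Python implementation):

text := "room 101"
if HasDigitSequence(text) {
	smashed := SmashNumbers(text) // presumably "room 000"
	digits := DigitFreq(text)     // estimated relative frequency of "101"
	_, _ = smashed, digits
}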

func ZipfToFreq

func ZipfToFreq(zipf float64) float64

ZipfToFreq converts Zipf scale to frequency proportion
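
The Zipf scale, as the original wordfreq defines it, is the base-10 logarithm of a word's frequency per billion words, i.e. zipf = log10(freq) + 9. Assuming the same definition here, the two conversions round-trip:

z := FreqToZipf(1e-6) // presumably 3.0: log10(1e-6) + 9
f := ZipfToFreq(3.0)  // presumably 1e-6: 10^(3.0 - 9)
_, _ = z, f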

Types

type ChineseProcessor

type ChineseProcessor struct {
	// contains filtered or unexported fields
}

ChineseProcessor handles Chinese text processing and tokenization

func NewChineseProcessor

func NewChineseProcessor(dataLoader *DataLoader) *ChineseProcessor

NewChineseProcessor creates a new Chinese processor

func (*ChineseProcessor) SimplifyChinese

func (cp *ChineseProcessor) SimplifyChinese(text string) (string, error)

SimplifyChinese converts Chinese text character-by-character to Simplified Chinese

func (*ChineseProcessor) Tokenize

func (cp *ChineseProcessor) Tokenize(text string) ([]string, error)

Tokenize tokenizes Chinese text using GSE (Jieba-equivalent)
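
A minimal usage sketch based on the signatures above; the sample text and its segmentation are illustrative only, and the snippet assumes the fmt and log packages are imported:

dl := NewDataLoader()
cp := NewChineseProcessor(dl)

simplified, err := cp.SimplifyChinese("漢字") // Traditional -> Simplified, e.g. "汉字"
if err != nil {
	log.Fatal(err)
}
tokens, err := cp.Tokenize(simplified) // segmented with gse
if err != nil {
	log.Fatal(err)
}
fmt.Println(tokens)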

type DataLoader

type DataLoader struct {
	// contains filtered or unexported fields
}

DataLoader handles loading and caching of embedded wordfreq data files

func NewDataLoader

func NewDataLoader() *DataLoader

NewDataLoader creates a new data loader

func (*DataLoader) GetFrequencyDict

func (dl *DataLoader) GetFrequencyDict(lang Language, wordlist WordlistType) (map[string]float64, error)

GetFrequencyDict converts frequency list to a map for faster lookups
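
A sketch of pulling a whole wordlist into a map, assuming it is keyed by word with frequency proportions (0-1) as values:

dl := NewDataLoader()
freqs, err := dl.GetFrequencyDict(LanguageJapanese, WordlistSmall)
if err != nil {
	log.Fatal(err)
}
fmt.Println(len(freqs))    // number of entries in the wordlist
fmt.Println(freqs["日本"]) // frequency proportion; 0 if absent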

func (*DataLoader) LoadChineseMapping

func (dl *DataLoader) LoadChineseMapping() (map[rune]string, error)

LoadChineseMapping loads the Traditional->Simplified Chinese character mapping from embedded data

func (*DataLoader) ReadCBPack

func (dl *DataLoader) ReadCBPack(filename string) ([][]string, error)

ReadCBPack reads a cBpack file from embedded data and returns the frequency data

func (*DataLoader) ReadTextFile

func (dl *DataLoader) ReadTextFile(filename string) (string, error)

ReadTextFile reads a text file from embedded data

type JapaneseProcessor

type JapaneseProcessor struct {
	// contains filtered or unexported fields
}

JapaneseProcessor handles Japanese text processing and tokenization

func NewJapaneseProcessor

func NewJapaneseProcessor() (*JapaneseProcessor, error)

NewJapaneseProcessor creates a new Japanese processor

func (*JapaneseProcessor) Tokenize

func (jp *JapaneseProcessor) Tokenize(text string) ([]string, error)

Tokenize tokenizes Japanese text using Kagome morphological analyzer
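
A minimal usage sketch; the segmentation shown is what a MeCab-style analyzer such as Kagome would typically produce, but it is illustrative only:

jp, err := NewJapaneseProcessor()
if err != nil {
	log.Fatal(err)
}
tokens, err := jp.Tokenize("私は学生です")
if err != nil {
	log.Fatal(err)
}
fmt.Println(tokens) // e.g. [私 は 学生 です]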

type Language

type Language string

Language represents a language code

const (
	LanguageChinese  Language = "zh"
	LanguageJapanese Language = "ja"
)

type LanguageInfo

type LanguageInfo struct {
	Tokenizer             TokenizerType
	LookupTransliteration string
}

LanguageInfo contains metadata about how to handle a language

func GetLanguageInfo

func GetLanguageInfo(lang Language) LanguageInfo

GetLanguageInfo returns metadata about how to handle text in a given language

type Tokenizer

type Tokenizer struct {
	// contains filtered or unexported fields
}

Tokenizer handles text tokenization for Chinese and Japanese

func NewTokenizer

func NewTokenizer(dataLoader *DataLoader) (*Tokenizer, error)

NewTokenizer creates a new tokenizer

func (*Tokenizer) LossyTokenize

func (t *Tokenizer) LossyTokenize(text string, lang Language) ([]string, error)

LossyTokenize performs lossy tokenization for frequency lookup, matching the original wordfreq behavior
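
Tokenizer routes text to the appropriate processor by language; a hedged sketch:

t, err := NewTokenizer(NewDataLoader())
if err != nil {
	log.Fatal(err)
}
tokens, err := t.LossyTokenize("自然言語処理", LanguageJapanese)
if err != nil {
	log.Fatal(err)
}
fmt.Println(tokens)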

type TokenizerType

type TokenizerType string

TokenizerType represents different tokenization methods

const (
	TokenizerGse    TokenizerType = "gse"
	TokenizerKagome TokenizerType = "kagome"
)

type WordFreq

type WordFreq struct {
	// contains filtered or unexported fields
}

WordFreq is the main interface for word frequency lookups

func New

func New() (*WordFreq, error)

New creates a new WordFreq instance

func (*WordFreq) WordFrequency

func (wf *WordFreq) WordFrequency(word string, lang Language, wordlist WordlistType, minimum float64) (float64, error)

WordFrequency gets the frequency of a word in the specified language. Returns a value between 0 and 1, where 1 means the word appears in every token

func (*WordFreq) ZipfFrequency

func (wf *WordFreq) ZipfFrequency(word string, lang Language, wordlist WordlistType, minimum float64) (float64, error)

ZipfFrequency gets the frequency of a word on the Zipf scale
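
Putting it together, a complete usage sketch. The import path is a placeholder, since the real module path is not shown here, and the minimum argument is presumably the floor value returned for unknown words:

package main

import (
	"fmt"
	"log"

	wordfreq "example.com/wordfreq" // placeholder: substitute the module's real import path
)

func main() {
	wf, err := wordfreq.New()
	if err != nil {
		log.Fatal(err)
	}

	// Frequency as a proportion of all tokens (0-1).
	freq, err := wf.WordFrequency("猫", wordfreq.LanguageJapanese, wordfreq.WordlistBest, 0)
	if err != nil {
		log.Fatal(err)
	}

	// The same word on the human-friendlier Zipf scale.
	zipf, err := wf.ZipfFrequency("猫", wordfreq.LanguageJapanese, wordfreq.WordlistBest, 0)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("freq=%g zipf=%.2f\n", freq, zipf)
}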

type WordlistType

type WordlistType string

WordlistType represents different wordlist sizes

const (
	WordlistSmall WordlistType = "small"
	WordlistLarge WordlistType = "large"
	WordlistBest  WordlistType = "best"
)
