Chinese word segmentation for node in pure javascript.
> @ig3/node-jieba-js A Chinese word segmentation tool
> in pure javascript, based on Python Jieba package
This package provides segmentation of Chinese text using essentially the
same algorithm as the Python Jieba cut function: without the Hidden Markov
Model and without Paddle.
It is compatible with dictionary files used with Python Jieba.
```
npm install @ig3/node-jieba-js
```
```
import jiebaFactory from '@ig3/node-jieba-js';

jiebaFactory({
  cacheFile: '/path/to/jieba-dictionary-cache.json',
})
.then(jiebaInstance => {
  const segments = jiebaInstance.cut("我爸新学会了一项解决日常烦闷的活动,就是把以前的照片抱回办公室扫描保存,弄成电子版的。更无法接受的是,还居然放到网上来,时不时给我两张。\n这些积尘的化石居然突然重现,简直是招架不住。这个怀旧的阀门一旦打开,那就直到意识模糊都没停下来。");
  console.log(segments);
});
```
Or, from a CJS script, use dynamic import:
```
import('@ig3/node-jieba-js')
.then(module => {
  module.jiebaFactory({
    cacheFile: '/path/to/jieba-dictionary-cache.json',
  })
  .then(jiebaInstance => {
    const segments = jiebaInstance.cut("我爸新学会了一项解决日常烦闷的活动,就是把以前的照片抱回办公室扫描保存,弄成电子版的。更无法接受的是,还居然放到网上来,时不时给我两张。\n这些积尘的化石居然突然重现,简直是招架不住。这个怀旧的阀门一旦打开,那就直到意识模糊都没停下来。");
    console.log(segments);
  });
});
```
jiebaFactory([options])
Returns a Promise that resolves to a jieba instance object.
* options \
  The dictionaryEntries option is for adding a small number of additional
  dictionary entries. For larger numbers of entries it is better to put them
  into a dictionary file and load the file.
The dictionary entries must be provided in 'internal' format, an array of
three elements: the word; the 'frequency', an integer number of occurrences
per 100 million words (a number, not text); and the part of speech, as text.
The value of the dictionaryEntries option must be an array of such arrays.
For example:
```
jiebaFactory({
  cacheFile: '/path/to/jieba-dictionary-cache.json',
  dictionaryEntries: [
    ['一', 217830, 'm'],
    ['一一二', 11, 'm']
  ]
})
.then(jiebaInstance => {
  // use jiebaInstance here
});
```
jiebaFactorySync([options])
Like jiebaFactory except that it returns the jieba instance object directly,
rather than returning a Promise.
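For example, a minimal sketch, assuming jiebaFactorySync is exported alongside jiebaFactory (adjust the import to suit how the package is loaded in your project):
```
import { jiebaFactorySync } from '@ig3/node-jieba-js';

// Same options as jiebaFactory; the instance is returned directly.
const jieba = jiebaFactorySync({
  cacheFile: '/path/to/jieba-dictionary-cache.json',
});
console.log(jieba.cut('我爸新学会了一项解决日常烦闷的活动。'));
```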
jieba Instance Methods
cut(text, [options])
* text \
  The text to be segmented.
* options \
  An object of options. The dict option may be a string, an array, or a
  function (see the sketch after this list).
  If dict is a string, it is appended to the list of dictionary files to be
  loaded. This must be done before initialization.
  If dict is an array, it is appended to the loaded dictionary data. Each
  element must be an array with three elements: word, frequency (occurrences
  per 100 million words) and part of speech, as in the dictionary text files
  except split into separate array elements, with the frequency as a number.
  If dict is a function, it is called with this set to the jieba instance
  object and two arguments: the loaded dictionary array and the jieba
  instance object. The return value is passed to useDict, unless it is a
  Promise, in which case the value it resolves to is passed to useDict.
  Where dict is a string, an array of strings, or a function that returns a
  string, an array of strings, or a Promise that resolves to one of these,
  the strings must be paths to dictionary files. They are appended to the
  list of dictionaries to be loaded.
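A sketch of the dict option, assuming the options described above are passed as the second argument to cut and that an array uses the internal format shown earlier; this is illustrative only, not a statement of the package's exact API:
```
import jiebaFactory from '@ig3/node-jieba-js';

jiebaFactory({
  cacheFile: '/path/to/jieba-dictionary-cache.json',
})
.then(jiebaInstance => {
  const words = jiebaInstance.cut('一一二', {
    dict: [
      // [word, occurrences per 100 million words, part of speech]
      ['一', 217830, 'm'],
      ['一一二', 11, 'm'],
    ],
  });
  console.log(words);
});
```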
Notes
Generator vs array
In Python Jieba, the cut method is a 'generator' but in @ig3/node-jieba-js
it returns an array of substrings. Why not return a generator function in
@ig3/node-jieba-js?
A generator would be advantageous if it were possible to process the input
sentence incrementally, but this is not possible. The algorithm to segment
the sentence is to determine the best path through the entire sentence.
This requires processing the entire sentence before the best segmentation of
any part of it can be determined.
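As an illustration of why the whole sentence is needed, here is a minimal sketch of a best-path calculation in the style of Python Jieba's calc: it works backwards from the end of the sentence, so no part of the result is final until every position has been processed. The names (calcRoute, dag, FREQ, total) are illustrative, not the package's internals; dag maps each start index to the candidate end indexes of dictionary words (see the next subsection).
```
// route[i] holds the best log probability for sentence[i..] and the end index
// of the word chosen at position i.
function calcRoute(sentence, dag, FREQ, total) {
  const n = sentence.length;
  const route = new Array(n + 1);
  route[n] = { logProb: 0, end: 0 };
  const logTotal = Math.log(total);
  for (let idx = n - 1; idx >= 0; idx--) {
    let best = null;
    for (const x of dag[idx]) {
      // Default of 1 when a candidate word is not in the FREQ index.
      const freq = FREQ.get(sentence.slice(idx, x + 1)) || 1;
      const logProb = Math.log(freq) - logTotal + route[x + 1].logProb;
      if (best === null || logProb > best.logProb) {
        best = { logProb, end: x };
      }
    }
    route[idx] = best;
  }
  return route;
}
```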
Trie vs prefix dictionary
Prior to commit 51df778 (2014-10-19), Python Jieba generated a trie from the
dictionary, used the trie to produce the DAG, then used the DAG to find the
possible routes and the 'best' route to segment the sentence. Commit 51df778
changed this: rather than generating a trie, pfdict is a set (set()) and
FREQ is a dictionary ({}). pfdict holds every prefix of every word in the
dictionary, including the full word. Both pfdict and FREQ are used to
generate the DAG: lookup in FREQ is used to identify words and failed lookup
in pfdict is used to terminate the search loop. Both FREQ and pfdict are
large indexes. The commit offers no explanation of why this change was made.
Subsequently the prefix dictionary has been merged into FREQ: FREQ contains
all prefixes with a 'frequency' of 0, while real words have a non-zero
frequency. When generating the DAG, the 'word' must exist in FREQ but it is
added to the DAG only if it has a non-zero frequency, so that prefixes are
not added. This avoids having two large indexes, but FREQ becomes larger
because it includes all the prefixes of words.
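A minimal sketch of DAG generation with such a merged prefix dictionary (illustrative, not the package's actual code): FREQ maps every word and every prefix of a word to a count, with prefixes that are not themselves words mapped to 0.
```
// Returns an object mapping each start index to the end indexes of dictionary
// words that begin there.
function buildDAG(sentence, FREQ) {
  const dag = {};
  for (let k = 0; k < sentence.length; k++) {
    const ends = [];
    let i = k;
    let frag = sentence[k];
    while (i < sentence.length && FREQ.has(frag)) {
      if (FREQ.get(frag) > 0) {
        ends.push(i); // a real word ends here
      }
      // A zero-frequency entry is only a prefix: keep extending the fragment.
      i += 1;
      frag = sentence.slice(k, i + 1);
    }
    if (ends.length === 0) {
      ends.push(k); // every character can at least stand alone
    }
    dag[k] = ends;
  }
  return dag;
}
```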
What are the advantages of the large index vs the trie? More or less
memory? More or less CPU? More or less time? There is nothing in the Python
Jieba commit logs to indicate why the implementation was changed.
But there is https://github.com/fxsjy/jieba/pull/187
Translation of the initial comment:
> For the get_DAG() function, employing a Trie data structure, particularly within a Python environment, results in excessive memory consumption. Experiments indicate that constructing a prefix set resolves this issue.
> This set stores words and their prefixes, e.g. set(["number", "data", "data structure", "data structure"]). When searching for words in a sentence, a forward lookup is performed within the prefix list until the word is not found in the prefix list or the search exceeds the sentence's boundaries. This approach increases the entry count by approximately 40% compared to the original lexicon.
> This version passed all tests, yielding identical segmentation results to the original version. Test: a 5.7MB novel, using the default dictionary, 64-bit Ubuntu, Python 2.7.6.
> Trie: initial load 2.8 seconds, cached load 1.1 seconds; memory 277.4MB, average rate 724kB/s.
> Prefix dictionary: initial load 2.1 seconds, cached load 0.4 seconds; memory 99.0MB, average rate 781kB/s.
> This approach resolves Trie's low space efficiency in pure Python implementations.
> Simultaneously refined code details, adhered to PEP8 formatting, and optimised several logical checks.
> Added __main__.py, enabling direct word segmentation via python -m jieba.
At least in Python, then, using the larger index reduced memory consumption
and processing time. Might the same be true in JavaScript?
Dictionary files
The input dictionary format is:
* One word per line
* Each line has three fields, separated by single spaces:
  * The word
  * Frequency
  * Part of speech
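For example, lines of a dictionary file in this format might look like the following (the entries mirror the dictionaryEntries example above):
```
一 217830 m
一一二 11 m
```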
In response to
fxsjy/jieba issue #3,
fxsjy describes the 'frequency' number in the dictionary:
> That number indicates how many times the word appears in my corpus.
And in response to
fxsjy/jieba issue #7,
fxsjy describes the sources:
> @feriely, the sources are primarily twofold: one is the segmented corpus from the 1998 People's Daily available for download online, along with the MSR segmented corpus. The other consists of some text novels I collected myself, which I segmented using ICTCLAS (though there might be some inaccuracies). Then I used a Python script to count word frequencies.
It isn't certain that this describes the sources for the dictionaries, as
the issue was regarding probabilities in 'finalseg/prob_*.py', but it seems
likely that the same corpus would be used.
I have not found the corpus used to build the dictionaries. It seems it is
not published, certainly not as part of the Python Jieba source.
It is some collection of texts, and the numbers in the dictionaries (small,
medium and big) are counts of occurrences in that unknown body of text. In
particular, the total numbers of characters and words in the corpus are
unknown.
Comparing dict.txt.big and dict.txt.small, the 'frequency' numbers for a
selection of common words are the same in both dictionaries, with one
exception: 的 has a frequency of 318825 in dict.txt.big and 3188252 in
dict.txt.small. In dict.txt it is 318825, so perhaps 318825 is correct and
3188252 is an error.
For every other word checked, the frequency was the same in dict.txt.big
and dict.txt.small.
I checked dict.txt.small in the Python Jieba source and it is 3188252 there
too, so it is not an error that I introduced.
的 is one of the most common words in every corpus I can find. One other
source gives a frequency of about 4,000,000 per 100 million words; another
gives 236,106 per million words, i.e. 23,610,600 per 100 million. So the
values are quite variable. Perhaps all that really matters is that it is one
of the most frequent words.
Presumably the numbers would be the same for all words in both files and
the difference is that the different dictionary files have different
subsets of the total number of words in the corpus. The dictionary
dict.txt.big must be the most complete, but whether that includes all words
in the corpus or only a subset is unknown. Assuming it is a large subset of
the total words, then the sum of the 'frequency' numbers in dict.txt.big
will be less than the total words in the corpus. Also, as some words
contain other words (e.g. '的' is contained in '目的', '真的', '的话',
etc.) it is not clear whether the count for '的' is a count for '的' alone
(i.e. not part of any larger word) or if it is the much larger count of how
many times '的' appears in the corpus, regardless of context (i.e.
including occurrences in 'longer' words).
According to Dictionary Formats, describing the dictionary formats for jieba-php, the parts of speech are:
* m = Numeral (数词)
* n = Noun (名词)
* v = Verb (动词)
* a = Adjective (形容词)
* d = Adverb (副词)
According to
Custom Dictionary Format, describing custom dictionaries for jieba-php, the frequency is optional with a default of 1.
I see no provision for this default frequency in the code that reads the
main dictionary in the current Python version of jieba: it appears that if
the frequency is missing, a ValueError will result, reporting the 'invalid'
dictionary entry. However, in the calculation of the route it does use a
default of 1 when a word is not in the FREQ lookup index
(log(self.FREQ.get(sentence[idx:x + 1]) or 1)).
In current Python jieba, when loading a user dictionary, the frequency and
part of speech may be omitted. The default frequency isn't necessarily 1:
there is a function, suggest_freq, that determines the default.
In this implementation, the frequency in the dictionaries is scaled to
occurrences per 100 million words. The scaling assumes that the sum of the
frequencies in dict.txt.big is a good approximation of the total number of
words in the corpus.
This allows words from other sources to be added, as long as occurrences
per 100 million words can be determined.
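A rough sketch of that scaling (the function and variable names are illustrative, not the package's internals):
```
// Scale raw corpus counts to occurrences per 100 million words, treating the
// sum of the counts in dict.txt.big as an approximation of the corpus size.
function scaleToPer100Million(entries) {
  // entries: [[word, rawCount, partOfSpeech], ...] as read from the dictionary
  const total = entries.reduce((sum, [, count]) => sum + count, 0);
  return entries.map(([word, count, pos]) => [
    word,
    Math.round((count / total) * 1e8),
    pos,
  ]);
}
```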
This began as a fork of
bluelovers/jieba-js
as at 2025-09-20. It has been substantially rewritten.
Some details of the algorithms and dictionaries are derived from
fxsjy/jieba.
- fxsjy/jieba
- bluelovers/jieba-js
- hermanschaaf/jieba-js
- pulipulichen/jieba-js
- 線上中文斷詞工具:Jieba-JS (Online Chinese word segmentation tool: Jieba-JS)
- 彙整中文與英文的詞性標註代號:結巴斷詞器與FastTag / Identify the Part of Speech in Chinese and English
- fxsjy/jieba - Customization & Advanced Usage