Pure TypeScript Korean Morphological Analyzer - serverless compatible, based on kuromoji.js and mecab-ko-dic
Pure JavaScript Korean Morphological Analyzer
A port of kuromoji.js adapted for Korean language processing using mecab-ko-dic.
- Pure JavaScript - runs in Node.js, browsers, and serverless (Vercel, Cloudflare Workers)
- No native dependencies - no compilation required
- Korean-optimized - uses mecab-ko-dic with Sejong tagset
- Viterbi algorithm - accurate morphological analysis
- Simple API - tokenize Korean text in a few lines
## Installation

```bash
npm install kuromoji-ko
```

## Quick Start

### MeCab-style API

```javascript
import { MeCab } from 'kuromoji-ko';

const mecab = await MeCab.create({ engine: 'ko', dictPath: './dict' });
const tokens = mecab.parse('안녕하세요');

for (const token of tokens) {
  console.log(token.surface, token.pos, token.lemma);
}
// 안녕 ['NNG'] 안녕
// 하 ['XSV'] 하다
// 세요 ['EF'] 세요
```
### Classic API

```javascript
import kuromoji from 'kuromoji-ko';

const tokenizer = await kuromoji.builder({
  dicPath: './dict'
}).build();
const tokens = tokenizer.tokenize('안녕하세요');

for (const token of tokens) {
  console.log(token.surface_form, token.pos, token.posDescription);
}
// 안녕 NNG 일반 명사
// 하 XSV 동사 파생 접미사
// 세요 EF 종결 어미
```
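A common next step after tokenizing is to keep only content words. The sketch below filters tokens by Sejong tag. It uses hand-written sample tokens in place of real tokenizer output so it runs without a dictionary; `filterByPos` and `CONTENT_TAGS` are hypothetical names, not part of kuromoji-ko, and the handling of compound tags joined by "+" is an assumption.

```javascript
// Hand-written sample tokens shaped like the classic API output above
// (only the fields used here); real tokens come from tokenizer.tokenize().
const sampleTokens = [
  { surface_form: '안녕', pos: 'NNG' },
  { surface_form: '하', pos: 'XSV' },
  { surface_form: '세요', pos: 'EF' },
];

// Sejong tags treated as "content words" in this sketch (an arbitrary choice).
const CONTENT_TAGS = new Set(['NNG', 'NNP', 'VV', 'VA']);

// Hypothetical helper: keep tokens whose tag is in the given set.
// Compound tags such as "VV+EP" are split on "+" and match if any part matches.
function filterByPos(tokens, tags) {
  return tokens.filter((t) => t.pos.split('+').some((tag) => tags.has(tag)));
}

const contentWords = filterByPos(sampleTokens, CONTENT_TAGS).map((t) => t.surface_form);
console.log(contentWords); // ['안녕']
```

Swapping the tag set (for example, particles only) changes what is kept without touching the helper.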
Before using kuromoji-ko, you need to build the dictionary files from mecab-ko-dic:
```bash
# Download mecab-ko-dic
git clone https://bitbucket.org/eunjeon/mecab-ko-dic.git
```

Building the dictionary creates binary dictionary files in the `./dict` directory.

## API

### MeCab API

#### `MeCab.create(options)`

Create a MeCab instance asynchronously.
```javascript
import { MeCab } from 'kuromoji-ko';

const mecab = await MeCab.create({
  engine: 'ko',        // Only 'ko' is supported
  dictPath: './dict'   // Path to dictionary directory
});
```

#### `mecab.parse(text)`

Parse text into an array of Token objects.

```javascript
const tokens = mecab.parse('아버지가방에들어가신다');
tokens.forEach(t => console.log(t.surface, t.pos));
```
### Token Object

| Property | Type | Description |
|----------|------|-------------|
| `surface` | string | How the token looks in the input text |
| `pos` | string[] | Parts of speech as array (split by "+") |
| `lemma` | string | Dictionary headword (adds "다" for verbs) |
| `pronunciation` | string \| null | How the token is pronounced |
| `hasBatchim` | boolean \| null | Whether token has final consonant (받침) |
| `hasJongseong` | boolean \| null | Alias for `hasBatchim` |
| `semanticClass` | string \| null | Semantic word class or category |
| `type` | string \| null | Token type (Inflect/Compound/Preanalysis) |
| `expression` | ExpressionToken[] \| null | Breakdown of compound/inflected tokens |
| `features` | string | Raw features string (comma-separated) |
| `raw` | string | Raw MeCab output format (`surface\tfeatures`) |
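The final-consonant flag matters because many Korean particles change form depending on it. As an independent illustration (not how kuromoji-ko derives `hasBatchim`, which presumably comes from the dictionary), the sketch below computes the same property from the Unicode layout of precomposed Hangul syllables and uses it to pick a topic particle. `endsWithBatchim` and `withTopicParticle` are hypothetical helpers.

```javascript
// Returns true/false for text ending in a Hangul syllable, or null when the
// last character is not a precomposed Hangul syllable (U+AC00..U+D7A3).
function endsWithBatchim(text) {
  if (text.length === 0) return null;
  const code = text.charCodeAt(text.length - 1);
  if (code < 0xac00 || code > 0xd7a3) return null;
  // Each syllable is 0xAC00 + (initial * 21 + medial) * 28 + final;
  // final index 0 means "no final consonant".
  return (code - 0xac00) % 28 !== 0;
}

// Pick the topic particle: 은 after a final consonant, 는 otherwise.
function withTopicParticle(noun) {
  const batchim = endsWithBatchim(noun);
  if (batchim === null) return noun; // leave non-Hangul endings untouched
  return noun + (batchim ? '은' : '는');
}

console.log(withTopicParticle('한국'));   // 한국은
console.log(withTopicParticle('한국어')); // 한국어는
```

With real tokens, the same decision could be driven by `token.hasBatchim` instead of recomputing it.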
### ExpressionToken

For compound or inflected words, `expression` returns an array of `ExpressionToken`:

| Property | Type | Description |
|----------|------|-------------|
| `morpheme` | string | The normalized token |
| `pos` | string | Part of speech |
| `lemma` | string | Dictionary form (adds "다" for verbs) |
| `semanticClass` | string \| null | Semantic category |
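To make the breakdown concrete, here is a minimal sketch of how a raw mecab-ko-dic style expression string could be decomposed into objects resembling the fields above. The layout (parts joined by "+", each part written as `morpheme/POS/semanticClass`, with "*" or an empty field for "none") is an assumption, and `parseExpression` is a hypothetical helper rather than the library's own parser. It does not compute `lemma`.

```javascript
// Hypothetical parser for a mecab-ko-dic style expression string.
// Assumed layout: parts joined by "+", each part "morpheme/POS/semanticClass",
// with "*" or an empty field meaning "no semantic class".
function parseExpression(expression) {
  if (!expression || expression === '*') return null;
  return expression.split('+').map((part) => {
    const [morpheme, pos, semanticClass] = part.split('/');
    return {
      morpheme,
      pos,
      semanticClass: !semanticClass || semanticClass === '*' ? null : semanticClass,
    };
  });
}

const parts = parseExpression('한국/NNG/*+어/NNG/*');
console.log(parts.map((p) => `${p.morpheme}:${p.pos}`)); // ['한국:NNG', '어:NNG']
```

A morpheme that itself contains "+" or "/" would break this naive split, so a production parser needs more care.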
### Classic API

#### `kuromoji.builder(options)`

Create a tokenizer builder.

```javascript
const builder = kuromoji.builder({
  dicPath: './dict',     // Path to dictionary directory
  loader: customLoader   // Optional custom file loader
});
```
#### `builder.build()`

Build and return the tokenizer (async).

```javascript
const tokenizer = await builder.build();
```

#### `tokenizer.tokenize(text)`

Tokenize Korean text into morphemes.

```javascript
const tokens = tokenizer.tokenize('한국어 형태소 분석');
```
#### `tokenizer.wakati(text)`

Get just the surface forms as an array.

```javascript
const words = tokenizer.wakati('한국어 형태소 분석');
// ['한국어', '형태소', '분석']
```
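Conceptually, the surface-form helpers are thin views over the token list. The sketch below shows that relationship with hand-written stand-in tokens so it runs without a dictionary. It assumes the helpers simply project `surface_form`; the library's actual implementation may differ.

```javascript
// Hand-written stand-ins for tokenizer.tokenize('한국어 형태소 분석') output.
const tokenized = [
  { surface_form: '한국어' },
  { surface_form: '형태소' },
  { surface_form: '분석' },
];

// Presumed equivalent of wakati(): surface forms only.
const surfaces = tokenized.map((t) => t.surface_form);

// Presumed equivalent of wakatiString(): the same forms joined by single spaces.
const joined = surfaces.join(' ');

console.log(surfaces, joined);
```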
#### `tokenizer.wakatiString(text)`

Get space-separated surface forms.

```javascript
const str = tokenizer.wakatiString('한국어 형태소 분석');
// '한국어 형태소 분석'
```

## KoreanToken Object (Classic API)
Each token from `tokenizer.tokenize()` has the following properties:

| Property | Description | Example |
|----------|-------------|---------|
| `surface_form` | Surface text | `'한국어'` |
| `word_position` | Position in text (1-indexed) | `1` |
| `word_id` | Dictionary word ID | `12345` |
| `word_type` | KNOWN or UNKNOWN | `'KNOWN'` |
| `pos` | POS tag (Sejong tagset) | `'NNG'` |
| `posDescription` | POS description | `'일반 명사'` |
| `semantic_class` | Semantic category | `'*'` |
| `has_final_consonant` | Ends with 받침? (T/F/*) | `'F'` |
| `reading` | Pronunciation | `'한국어'` |
| `type` | Inflect/Compound/Preanalysis | `'Compound'` |
| `first_pos` | First POS (compounds) | `'NNG'` |
| `last_pos` | Last POS (compounds) | `'NNG'` |
| `expression` | Decomposition | `'한국/NNG/+어/NNG/'` |

## Korean POS Tags (Sejong Tagset)
### Nouns

| Tag | Description |
|-----|-------------|
| NNG | 일반 명사 (General noun) |
| NNP | 고유 명사 (Proper noun) |
| NNB | 의존 명사 (Dependent noun) |
| NR | 수사 (Numeral) |
| NP | 대명사 (Pronoun) |

### Verbs and Adjectives

| Tag | Description |
|-----|-------------|
| VV | 동사 (Verb) |
| VA | 형용사 (Adjective) |
| VX | 보조 용언 (Auxiliary) |
| VCP | 긍정 지정사 (Copula 이다) |
| VCN | 부정 지정사 (Negative 아니다) |

### Particles

| Tag | Description |
|-----|-------------|
| JKS | 주격 조사 (Subject) |
| JKO | 목적격 조사 (Object) |
| JKB | 부사격 조사 (Adverbial) |
| JX | 보조사 (Auxiliary particle) |
### Endings

| Tag | Description |
|-----|-------------|
| EP | 선어말 어미 (Pre-final) |
| EF | 종결 어미 (Final) |
| EC | 연결 어미 (Connective) |
| ETN | 명사형 전성 어미 (Nominalizing) |
| ETM | 관형형 전성 어미 (Adnominalizing) |

### Symbols

| Tag | Description |
|-----|-------------|
| SL | 외국어 (Foreign) |
| SH | 한자 (Chinese characters) |
| SN | 숫자 (Numbers) |
| SW | 기타 기호 (Symbols) |

## Browser Usage

```html
```

## Serverless (Vercel) Usage
kuromoji-ko has no native dependencies, which makes it well suited to serverless platforms:

```javascript
// api/tokenize.js
import kuromoji from 'kuromoji-ko';

// Cache the build promise so warm invocations reuse the loaded dictionary.
let tokenizerPromise = null;

function getTokenizer() {
  if (!tokenizerPromise) {
    tokenizerPromise = kuromoji.builder({
      dicPath: './dict'
    }).build();
  }
  return tokenizerPromise;
}

export default async function handler(req, res) {
  const tokenizer = await getTokenizer();
  const tokens = tokenizer.tokenize(req.body.text);
  res.json(tokens);
}
```

## How It Works

kuromoji-ko implements morphological analysis using:
1. Double-Array TRIE - Efficient dictionary lookup for surface forms
2. Viterbi Algorithm - Dynamic programming to find the optimal segmentation
3. Connection Costs - Bigram model for morpheme transitions
4. Unknown Word Handling - Character-type based POS estimation
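The interplay of the first three steps can be sketched on a toy scale. The example below is not the library's implementation: it replaces the double-array trie with a naive substring lookup in a `Map`, uses made-up word and connection costs, and omits unknown-word handling. It only illustrates how Viterbi picks the cheapest path through the lattice of dictionary matches.

```javascript
// Toy dictionary: surface form -> candidate entries with made-up word costs.
const DICT = new Map([
  ['안녕', [{ pos: 'NNG', cost: 10 }]],
  ['안', [{ pos: 'MAG', cost: 30 }]],
  ['녕', [{ pos: 'NNG', cost: 80 }]],
  ['하', [{ pos: 'XSV', cost: 10 }, { pos: 'VV', cost: 25 }]],
  ['세요', [{ pos: 'EF', cost: 10 }]],
]);

// Toy bigram connection costs keyed by "leftPOS>rightPOS"; unlisted pairs cost 50.
const CONNECTION = new Map([
  ['BOS>NNG', 5], ['NNG>XSV', 5], ['XSV>EF', 5], ['VV>EF', 10], ['EF>EOS', 0],
]);
const connect = (left, right) => CONNECTION.get(`${left}>${right}`) ?? 50;

function viterbi(text) {
  // endingAt[i] holds lattice nodes whose surface ends at offset i.
  const endingAt = Array.from({ length: text.length + 1 }, () => []);
  endingAt[0].push({ surface: '', pos: 'BOS', total: 0, prev: null });

  for (let start = 0; start < text.length; start++) {
    if (endingAt[start].length === 0) continue; // unreachable offset
    // Naive prefix search; the real analyzer uses a double-array trie here.
    for (let end = start + 1; end <= text.length; end++) {
      const surface = text.slice(start, end);
      for (const entry of DICT.get(surface) ?? []) {
        // Keep only the cheapest predecessor for this node.
        let best = null;
        for (const prev of endingAt[start]) {
          const total = prev.total + connect(prev.pos, entry.pos) + entry.cost;
          if (best === null || total < best.total) {
            best = { surface, pos: entry.pos, total, prev };
          }
        }
        endingAt[end].push(best);
      }
    }
  }

  // Close the lattice with EOS and pick the cheapest complete path.
  let last = null;
  for (const node of endingAt[text.length]) {
    const total = node.total + connect(node.pos, 'EOS');
    if (last === null || total < last.total) last = { node, total };
  }
  if (last === null) return null; // no segmentation covers the whole text

  // Walk back through predecessors to recover the segmentation.
  const path = [];
  for (let n = last.node; n.prev !== null; n = n.prev) {
    path.unshift(`${n.surface}/${n.pos}`);
  }
  return { path, cost: last.total };
}

console.log(viterbi('안녕하세요'));
// { path: ['안녕/NNG', '하/XSV', '세요/EF'], cost: 45 }
```

With these costs, 하 is tagged XSV rather than VV because the NNG>XSV connection is much cheaper than the default, which is the kind of context sensitivity the bigram model provides.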
## Credits

- kuromoji.js - Original Japanese implementation
- mecab-ko-dic - Korean dictionary
- MeCab - Original C++ morphological analyzer

## License

Apache-2.0

Dictionary files (mecab-ko-dic) are also Apache-2.0 licensed.