Icelandic word form to lemma lookup for browser and Node.js
npm install lemma-isFast Icelandic lemmatization for JavaScript. Built for search indexing.
``typescript
import { BinaryLemmatizer, extractIndexableLemmas, buildSearchQuery, highlight } from "lemma-is";
lemmatizer.lemmatize("börnin"); // → ["barn"]
lemmatizer.lemmatize("keypti"); // → ["kaupa"]
lemmatizer.lemmatize("hestinum"); // → ["hestur"]
// Full pipeline for search
extractIndexableLemmas("Börnin keypti hestinn", lemmatizer);
// → ["barn", "kaupa", "hestur"]
// Query normalization (backend-agnostic)
buildSearchQuery("bílaleigur", lemmatizer);
// → { groups: [["bílaleiga"]], query: "bílaleiga" }
`
Icelandic is heavily inflected. A single noun like "hestur" (horse) has 16 forms:
``
hestur, hest, hesti, hests, hestar, hesta, hestum, hestanna...
If a user searches "hestur" but your document contains "hestinum", they won't find it—unless you normalize both to the lemma at index time.
Icelandic is underserved in the search ecosystem:
- PostgreSQL has no Icelandic stemmer (Snowball doesn't support it)
- Elasticsearch has no Icelandic analyzer in its 36 built-in languages
- Algolia lists Icelandic but only provides basic plurals—no morphological analysis
- Existing Icelandic NLP tools (Greynir, Nefnir) are Python-only
For comparison, Finnish has Voikko with PostgreSQL and Elasticsearch plugins. Icelandic has had nothing equivalent—until now.
lemma-is is the first npm package providing Icelandic lemmatization for search. It embeds the BÍN morphological database and runs anywhere JavaScript runs.
GreynirEngine remains the gold standard for sentence parsing and grammatical analysis in Icelandic. But full parsing is not forgiving: if a sentence doesn't parse, you don't get disambiguated lemmas. That makes it a poor fit for messy, real‑world search indexing where recall matters.
GreynirEngine also exposes a non‑parsing lemmatizer via its bintokenizer/simple_lemmatize pipeline, which can return all possible lemmas for a token. This is more forgiving but overindexes heavily without sentence‑level disambiguation.
lemma-is targets this gap: high‑recall lemmatization for search, tolerant of noise, with light disambiguation and compound splitting, and it runs anywhere JavaScript runs.
IFD benchmark summary (lemma recall + overindexing measured against gold lemmas in the Icelandic Frequency Dictionary corpus):
| | lemma-is core | lemma-is full | GreynirEngine (BÍN lookup) |
|---|---|---|---|
| Runtime | Node, Bun, Deno | Node, Bun, Deno | Python |
| Throughput | ~19.0M words/min | ~14.7M words/min | ~13.3K words/min |
| Recall (IFD) | 95.996% | 98.585% | 81.4% (parsed-only) |
| Avg candidates | 1.57 | 1.57 | 1.0 |
| Overindexing (extraRate) | 0.388 | 0.373 | 0.186 |
| Memory (load) | ~18.5 MB | ~182 MB | ~417 MB RSS |
| Parse failures | n/a | n/a | 27% (sample) |
| Disambiguation | Bigrams + grammar rules | Bigrams + grammar rules | Full grammar + BÍN |
| Use case | Search indexing | Search indexing | NLP analysis |
See BENCHMARKS.md for methodology and detailed results.
The IFD gold corpus is referenced here: https://repository.clarin.is/repository/xmlui/handle/20.500.12537/36.
GreynirEngine numbers are from full sentence parsing on a 1,000-sentence IFD sample; parse failures and tokenization mismatches lower measured recall. The bintokenizer-based lemmatizer is more forgiving but overindexes heavily when all lemmas are kept.
- Core memory: ~18.5 MB load (heap + ArrayBuffers) for lemma-is.core.binlemma-is.bin
- Full memory: ~182 MB load for
- Greynir full parser memory: ~417 MB RSS (sample run)
- Core speed: ~19.0M words/min; Full speed: ~14.7M words/min
- Core recall: 95.996% on IFD; Full recall: 98.585%
- Core recall boost: unknown‑word suffix fallback enabled only in core to raise recall without hurting full
- Lower memory compound lookup: Bloom filter known‑lemma lookup reduces RAM when splitting compounds
lemma-is returns all possible lemmas for ambiguous words:
`typescript`
lemmatizer.lemmatize("á");
// → ["á", "eiga"]
// Could be preposition "on", noun "river", or verb "owns"
GreynirEngine parses the sentence to return the single correct interpretation. For search, returning all candidates is often better—you'd rather show an extra result than miss a relevant document.
`bash`
npm install lemma-is
`typescript
import { readFileSync } from "fs";
import { BinaryLemmatizer, extractIndexableLemmas } from "lemma-is";
// Load the core binary (~9-11 MB, low memory, best for browser/edge)
const buffer = readFileSync("node_modules/lemma-is/data-dist/lemma-is.core.bin");
const lemmatizer = BinaryLemmatizer.loadFromBuffer(
buffer.buffer.slice(buffer.byteOffset, buffer.byteOffset + buffer.byteLength)
);
// Basic lemmatization
lemmatizer.lemmatize("börnin"); // → ["barn"]
lemmatizer.lemmatize("fóru"); // → ["fara", "fóra"]
// Full pipeline for search indexing
const lemmas = extractIndexableLemmas("Börnin fóru í bíó", lemmatizer);
// → ["barn", "fara", "fóra", "í", "bíó"]
`
The binary includes case, gender, and number for each word form:
`typescript`
lemmatizer.lemmatizeWithMorph("hestinum");
// → [{ lemma: "hestur", pos: "no", morph: { case: "þgf", gender: "kk", number: "et" } }]
// dative, masculine, singular
Shallow grammar rules use Icelandic case government to disambiguate prepositions:
`typescript
import { Disambiguator } from "lemma-is";
const disambiguator = new Disambiguator(lemmatizer, lemmatizer, { useGrammarRules: true });
// "á borðinu" - borðinu is dative, á governs dative → preposition
disambiguator.disambiguate("á", null, "borðinu");
// → { lemma: "á", pos: "fs", resolvedBy: "grammar_rules" }
`
Icelandic forms long compounds. Split them for better search coverage:
`typescript
import { CompoundSplitter, createKnownLemmaFilter } from "lemma-is";
const knownLemmas = createKnownLemmaFilter(lemmatizer.getAllLemmas());
const splitter = new CompoundSplitter(lemmatizer, knownLemmas);
splitter.split("landbúnaðarráðherra");
// → { isCompound: true, parts: ["landbúnaður", "ráðherra"] }
// "agriculture minister"
`
For production indexing, combine everything:
`typescript
import { extractIndexableLemmas, CompoundSplitter, createKnownLemmaSet } from "lemma-is";
const knownLemmas = createKnownLemmaSet(lemmatizer.getAllLemmas());
const splitter = new CompoundSplitter(lemmatizer, knownLemmas);
const text = "Ríkissjóður stendur í blóma ef milljarða arðgreiðsla er talin með.";
const lemmas = extractIndexableLemmas(text, lemmatizer, {
bigrams: lemmatizer,
compoundSplitter: splitter,
removeStopwords: true,
});
// Indexed: ríkissjóður, ríki, sjóður, standa, blómi, milljarður,
// arðgreiðsla, arður, greiðsla, telja
// Stopwords removed: í, ef, er, með
`
A search for "sjóður" or "arður" now finds this document.
Use the same lemmatization pipeline for search queries as for documents.
The helper returns grouped terms plus a boolean query string:
`typescript
import { buildSearchQuery } from "lemma-is";
const { groups, query } = buildSearchQuery("bílaleigur", lemmatizer, {
removeStopwords: true,
});
// groups: [["bílaleiga"]]
// query: "bílaleiga"
`
You can swap operators to match your backend:
`typescript
// SQLite FTS5 prefers AND/OR
const sqlite = buildSearchQuery("við fórum í bíó", lemmatizer, {
removeStopwords: true,
andOperator: " AND ",
orOperator: " OR ",
});
// Elasticsearch can use groups to build a bool query`
// (OR within a group, AND across groups)
PostgreSQL has no built-in Icelandic stemmer. Use lemma-is to pre-process:
`typescript
const lemmas = extractIndexableLemmas(text, lemmatizer, { removeStopwords: true });
await db.query(
INSERT INTO documents (title, body, search_vector)
VALUES ($1, $2, to_tsvector('simple', $3)),`
[title, body, Array.from(lemmas).join(" ")]
);
Use the simple configuration—it lowercases but doesn't stem, since our lemmas are already normalized.
Important: Don't use PostgreSQL's unaccent extension for Icelandic. Characters like á, ö, þ, ð are distinct letters, not accented variants.
For queries:
`typescriptSELECT * FROM documents WHERE search_vector @@ to_tsquery('simple', $1)
const { query } = buildSearchQuery(userQuery, lemmatizer, { removeStopwords: true });
const sql = ;`
await db.query(sql, [query]);
After finding matching documents, highlight the query terms in the original text:
`typescript
import { highlight } from "lemma-is";
const result = highlight("hestur", "Hestarnir eru á beitinni.", lemmatizer);
// result.matchCount = 1
// result.segments = [
// { text: "Hestarnir", highlight: true },
// { text: " eru á beitinni.", highlight: false }
// ]
`
The highlight function lemmatizes both query and document, finding matches across inflections. "hestur" matches "Hestarnir" because both normalize to the lemma "hestur".
Render the segments however fits your UI:
`typescript
// React example
result.segments.map((seg, i) =>
seg.highlight ? {seg.text} : seg.text
);
// HTML string
result.segments.map(seg =>
seg.highlight ? ${seg.text} : seg.text`
).join("");
For long documents, extract the most relevant snippets instead of highlighting the entire text:
`typescript
import { extractSnippets } from "lemma-is";
const doc = Langt áður fyrr bjó maður á bæ. Hestarnir voru margir og góðir.
Þeir gengu á fjöllum. Maðurinn unni hestunum sínum mjög.;
const result = extractSnippets("hestur", doc, lemmatizer, {
snippetWords: 10, // ~10 words per snippet
maxSnippets: 3, // up to 3 snippets
ellipsis: "…", // truncation marker
});
// result.totalMatches = 2
// result.snippets[0] = {
// text: "…Hestarnir voru margir og góðir.…",
// segments: [
// { text: "…", highlight: false },
// { text: "Hestarnir", highlight: true },
// { text: " voru margir og góðir.…", highlight: false }
// ],
// score: 11, // match density score
// start: 32, // offset in original
// end: 64
// }
`
Snippets are ranked by match density and selected to avoid overlap. Use segments for custom rendering or text for display.
This is an early effort with known limitations.
There are two binaries:
- Core (~9-11 MB): default, optimized for browser/edge/cold start
- Full (91 MB): maximum coverage and disambiguation
The full binary targets Node.js servers where data loads once at startup. Not recommended for:
- Serverless/edge — cold start loading 91 MB may be slow
- Browser — download size prohibitive
- Cloudflare Workers — fits 128 MB limit but cold starts are slow
For browser apps, use the core binary.
To use the full binary, build it locally:
`bash`
pnpm build:binary
Then load it from data-dist/lemma-is.bin.
For cold-start runtimes and the browser, you can build a compact core binary that trades accuracy for size by:
- Keeping only the most frequent word forms
- Dropping bigram data and morphological features
This reduces memory significantly at the cost of recall/precision on rare words.
`bash`
pnpm build:core
The output is written to data-dist/lemma-is.core.bin. Use it exactly like the full binary; it just covers fewer word forms.
This is a lookup table with shallow grammar rules, not a grammatical parser. It doesn't understand sentence structure, named entities, or semantic meaning. The grammar rules help with common patterns but can't handle all disambiguation.
For applications needing full grammatical analysis, use GreynirEngine.
Bigram disambiguation only works when the word pair exists in corpus data. Without context, ambiguous words return all candidates:
`typescript`
lemmatizer.lemmatize("á");
// → ["á", "eiga"] — no way to know which is more likely
For search indexing, use indexAllCandidates: true (the default) to index all lemmas.
You can go word → lemma but not lemma → words. If you need to show all inflected forms of a lemma, you'll need to build a reverse mapping from BÍN data.
Single binary file containing:
- 289K lemmas from BÍN
- 3M word form mappings
- 414K bigram frequencies
- Morphological features per word form
`bashDownload BÍN data from https://bin.arnastofnun.is/DMII/LTdata/k-LTdata/
Extract SHsnid.csv to data/
uv run python scripts/build-binary.py
`
`bash``
pnpm test # run tests
pnpm build # build dist/
pnpm typecheck # type check
- BÍN — Morphological database from the Árni Magnússon Institute
- Miðeind — GreynirEngine and foundational Icelandic NLP work
- tokenize-is — Icelandic tokenizer
MIT for the code.
The linguistic data is derived from BÍN © Árni Magnússon Institute for Icelandic Studies.
By using this package, you agree to BÍN's conditions:
- Credit the Árni Magnússon Institute in your product
- Do not redistribute the raw data separately
- Do not publish inflection paradigms without permission
Full terms: BÍN License Conditions