A WebAssembly library for efficient text deduplication and similarity detection using shingling and MinHash.
npm install shinglesbash
npm install shingles
or
yarn add shingles
or
pnpm add shingles
`
Usage (JavaScript / TypeScript)
shingles is designed to be used in a background thread (Web Worker) for large datasets, but the API is simple enough for main-thread use on smaller collections.
`javascript
import init, { Deduplicator } from 'shingles';
async function main() {
await init();
// 128 hashes divided into 16 bands (Threshold ~0.7-0.8)
// use a BigInt for the seed (n suffix required in JS)
const seed = 42n;
const dedup = new Deduplicator(128, 16, seed);
const privacySalt = "user-secret-key-123";
// add documents
const docs = [
{ id: 1, text: "The quick brown fox jumps over the lazy dog." },
{ id: 2, text: "The quick brown fox leaps over the lazy dog!" }, // near dupe
{ id: 3, text: "I love coding in Rust and WebAssembly." } // different
];
for (const doc of docs) {
const shingless = Deduplicator.get_shingless(doc.text, 5);
const signature = dedup.compute_minhash(shingless, privacySalt);
dedup.add_document(doc.id, signature);
}
// query for near-duplicates
const queryText = "A quick brown fox jumps over a lazy dog.";
const queryshingless = Deduplicator.get_shingless(queryText, 5);
const querySig = dedup.compute_minhash(queryshingless, privacySalt);
// LSH finds candidates in O(1) time instead of O(N)
const candidates = dedup.query_candidates(querySig);
console.log("Candidate IDs:", candidates); // Output: [1, 2]
// fine-grained similarity check
// compare the query against the most likely candidate (ID: 1)
const sig1 = dedup.compute_minhash(Deduplicator.get_shingless(docs[0].text, 5), privacySalt);
const similarity = dedup.estimate_jaccard(querySig, sig1);
console.log(Estimated Similarity: ${(similarity * 100).toFixed(2)}%);
}
main();
`
Why Privacy Salt?
When two parties use the same secret salt, their MinHash signatures become comparable without revealing the original text.
Different salts produce completely incomparable signatures → ideal for secure, private similarity checks (e.g., two users compare notes without sending plaintext).
Performance
- Tested on 1,000–5,000 short documents: signature computation + indexing < 500ms (browser, mid-range device)
- Scales to 10k+ docs with Web Workers (recommended for heavy use)
- Much faster than pure JS alternatives for large texts
Tuning the "Sensitivity" (S-Curve)
The relationship between Bands, Rows, and Similarity follows an S-curve. By adjusting the parameters in new Deduplicator(hashes, bands, seed), you control how "fuzzy" the matching is.
| Goal | Hashes | Bands | Threshold (T) |
|-------------|--------|-------|---------------|
| Strict | 128 | 8 | ≈ 0.84 |
| Balanced | 128 | 16 | ≈ 0.70 |
| Permissive | 128 | 32 | ≈ 0.55 |
Formula: $$T≈(1/b)1/r$$ where b is bands and r is rows per band
Building from Source
`bash
Install wasm-pack if needed
cargo install wasm-pack
Build for web target
wasm-pack build --target web
`
Roadmap
- Web Worker helpers for non-blocking heavy operations
- Exact similarity (Jaccard on shingles) only on candidates
- Streaming/large-text support
- Advanced PSI extensions for stronger privacy
- TypeScript types & better docs
Contributing
Contributions welcome! Feel free to open issues or PRs.
1. Fork the repo
2. cargo build / wasm-pack build`