# @localmode/webllm

WebLLM provider for @localmode - local-first LLM inference in the browser. Uses 4-bit quantized models for efficient browser-based text generation.


## Installation

```bash
pnpm install @localmode/webllm @localmode/core
```
## Usage

```typescript
import { generateText } from '@localmode/core';
import { webllm } from '@localmode/webllm';

// Generate text
const { text, usage } = await generateText({
  model: webllm.languageModel('Llama-3.2-1B-Instruct-q4f16'),
  prompt: 'Explain quantum computing in simple terms.',
});

console.log(text);
console.log(`Generated in ${usage.durationMs}ms`);
```
## Streaming

```typescript
import { streamText } from '@localmode/core';
import { webllm } from '@localmode/webllm';

const stream = await streamText({
  model: webllm.languageModel('Llama-3.2-1B-Instruct-q4f16'),
  prompt: 'Write a haiku about programming.',
});

for await (const { text } of stream) {
  process.stdout.write(text);
}
```
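Since WebLLM runs in the browser, you will often append streamed chunks to the DOM rather than stdout. A minimal sketch using the same `streamText` API, assuming a hypothetical `#output` element on your page:

```typescript
import { streamText } from '@localmode/core';
import { webllm } from '@localmode/webllm';

// Hypothetical target element; adjust the selector for your page.
const output = document.querySelector('#output')!;

const stream = await streamText({
  model: webllm.languageModel('Llama-3.2-1B-Instruct-q4f16'),
  prompt: 'Write a haiku about programming.',
});

// Append each streamed chunk as it arrives.
for await (const { text } of stream) {
  output.textContent += text;
}
```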
## Preloading Models

```typescript
import { preloadModel, isModelCached } from '@localmode/webllm';

// Check if the model is already cached
if (!(await isModelCached('Llama-3.2-1B-Instruct-q4f16'))) {
  // Preload with progress reporting
  await preloadModel('Llama-3.2-1B-Instruct-q4f16', {
    onProgress: (p) => console.log(`Loading: ${p.progress?.toFixed(1)}%`),
  });
}
```
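Preloading at app startup keeps the first generation call from blocking on a multi-hundred-megabyte download. A sketch combining the calls shown above, assuming `preloadModel` resolves once the weights are cached:

```typescript
import { generateText } from '@localmode/core';
import { webllm, preloadModel } from '@localmode/webllm';

const MODEL_ID = 'Llama-3.2-1B-Instruct-q4f16';

// Download (or hit the cache for) the weights up front...
await preloadModel(MODEL_ID);

// ...so this first generation starts immediately.
const { text } = await generateText({
  model: webllm.languageModel(MODEL_ID),
  prompt: 'Say hello.',
});
console.log(text);
```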
## Supported Models

| Model                       | Size  | Context | Best For           |
| --------------------------- | ----- | ------- | ------------------ |
| Llama-3.2-1B-Instruct-q4f16 | 700MB | 4K      | Simple tasks, fast |
| Llama-3.2-3B-Instruct-q4f16 | 1.8GB | 4K      | General purpose    |
| Phi-3.5-mini-instruct-q4f16 | 2.4GB | 4K      | Reasoning          |
| Qwen2.5-1.5B-Instruct-q4f16 | 1GB   | 4K      | Multilingual       |
| SmolLM2-1.7B-Instruct-q4f16 | 1.1GB | 2K      | Compact, fast      |
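Any model ID from the table can be passed to `webllm.languageModel(...)`; for example, trading download size for quality with the 3B model (same pattern as the usage examples above):

```typescript
import { generateText } from '@localmode/core';
import { webllm } from '@localmode/webllm';

// Larger model from the table above: slower first load, better answers.
const { text } = await generateText({
  model: webllm.languageModel('Llama-3.2-3B-Instruct-q4f16'),
  prompt: 'Summarize the trade-offs of model quantization.',
});
```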
## Custom Configuration

```typescript
import { createWebLLM } from '@localmode/webllm';

// Create a provider instance with a custom progress callback.
// updateLoadingBar is your own UI hook, not part of @localmode.
const myWebLLM = createWebLLM({
  onProgress: (p) => updateLoadingBar(p.progress),
});

const model = myWebLLM.languageModel('Llama-3.2-1B-Instruct-q4f16', {
  systemPrompt: 'You are a helpful coding assistant.',
  temperature: 0.5,
  maxTokens: 1024,
});
```
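The configured model plugs into the same `@localmode/core` helpers; a minimal sketch, assuming the `model` instance from the snippet above:

```typescript
import { generateText } from '@localmode/core';

// `model` is the configured instance from the snippet above.
const { text } = await generateText({
  model,
  prompt: 'Refactor this loop into a map call.',
});
console.log(text);
```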
## Requirements

- WebGPU support (Chrome 113+, Edge 113+); see the detection sketch below
- Sufficient GPU memory for the selected model
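WebGPU availability can be feature-detected before attempting a model load. A minimal sketch using the standard `navigator.gpu` API (part of the browser, not of @localmode; TypeScript may need `@webgpu/types` for the `navigator.gpu` typings):

```typescript
// Standard WebGPU feature detection; unrelated to @localmode APIs.
async function hasWebGPU(): Promise<boolean> {
  if (!('gpu' in navigator)) return false;
  try {
    const adapter = await navigator.gpu.requestAdapter();
    return adapter !== null;
  } catch {
    return false;
  }
}

if (!(await hasWebGPU())) {
  console.warn('WebGPU is unavailable; local inference will not run.');
}
```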