# ai-sdk-llama-cpp

A minimal llama.cpp provider for the Vercel AI SDK implementing `LanguageModelV3` and `EmbeddingModelV3`.

```bash
npm install @shipworthy/ai-sdk-llama-cpp
```

> Alpha Software - This package is in early development. The API may change between versions without notice.

> macOS Only - This package currently only supports macOS with Apple Silicon or Intel processors.
This package loads llama.cpp directly into Node.js memory via native C++ bindings, enabling local LLM inference without requiring an external server.
## Features

- Native Performance: Direct C++ bindings using node-addon-api (N-API)
- GPU Acceleration: Automatic Metal support on macOS
- Streaming & Non-streaming: Full support for both generateText and streamText
- Chat Templates: Automatic or configurable chat template formatting (llama3, chatml, gemma, etc.)
- ESM Only: Modern ECMAScript modules, no CommonJS
- GGUF Support: Load any GGUF-format model
## Prerequisites

Before installing, ensure you have the following:
- macOS (Apple Silicon or Intel)
- Node.js >= 18.0.0
- CMake >= 3.15
- Xcode Command Line Tools
```bash
# Install Xcode Command Line Tools (includes Clang)
xcode-select --install
```
## Installation
```bash
npm install ai-sdk-llama-cpp
```

The installation will automatically:
1. Detect macOS and verify platform compatibility
2. Compile llama.cpp as a static library with Metal support
3. Build the native Node.js addon
> Note: Installation on Windows or Linux will fail with an error. Only macOS is supported.
## Usage
### Basic Usage
```typescript
import { generateText } from "ai";
import { llamaCpp } from "ai-sdk-llama-cpp";

const model = llamaCpp({
  modelPath: "./models/llama-3.2-1b-instruct.Q4_K_M.gguf",
});

try {
  const { text } = await generateText({
    model,
    prompt: "Explain quantum computing in simple terms.",
  });
  console.log(text);
} finally {
  model.dispose();
}
```
### Streaming
```typescript
import { streamText } from "ai";
import { llamaCpp } from "ai-sdk-llama-cpp";

const model = llamaCpp({
  modelPath: "./models/llama-3.2-1b-instruct.Q4_K_M.gguf",
});

try {
  const { textStream } = await streamText({
    model,
    prompt: "Write a haiku about programming.",
  });

  for await (const chunk of textStream) {
    process.stdout.write(chunk);
  }
} finally {
  model.dispose();
}
```
### Embeddings
```typescript
import { embed, embedMany } from "ai";
import { llamaCpp } from "ai-sdk-llama-cpp";

const model = llamaCpp.embedding({
  modelPath: "./models/nomic-embed-text-v1.5.Q4_K_M.gguf",
});

try {
  const { embedding } = await embed({
    model,
    value: "Hello, world!",
  });

  const { embeddings } = await embedMany({
    model,
    values: ["Hello, world!", "Hello, ▲!"],
  });
} finally {
  model.dispose();
}
```
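The returned vectors are plain number arrays, so they can be compared with the AI SDK's `cosineSimilarity` helper, for example to rank texts by semantic similarity. A minimal sketch, reusing the embedding model file from the example above:

```typescript
import { cosineSimilarity, embedMany } from "ai";
import { llamaCpp } from "ai-sdk-llama-cpp";

const model = llamaCpp.embedding({
  modelPath: "./models/nomic-embed-text-v1.5.Q4_K_M.gguf",
});

try {
  const { embeddings } = await embedMany({
    model,
    values: ["The cat sat on the mat.", "A feline rested on the rug."],
  });

  // Values closer to 1 indicate more similar texts.
  console.log(cosineSimilarity(embeddings[0], embeddings[1]));
} finally {
  model.dispose();
}
```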
### Configuration

```typescript
const model = llamaCpp({
  // Required: Path to the GGUF model file
  modelPath: "./models/your-model.gguf",

  // Optional: Maximum context size (default: 2048)
  contextSize: 4096,

  // Optional: Number of layers to offload to GPU
  // Default: 99 (all layers). Set to 0 to disable GPU.
  gpuLayers: 99,

  // Optional: Number of CPU threads (default: 4)
  threads: 8,

  // Optional: Enable verbose debug output from llama.cpp (default: false)
  debug: true,

  // Optional: Chat template to use for formatting messages
  // - "auto" (default): Use the template embedded in the GGUF model file
  // - Template name: Use a specific built-in template (e.g., "llama3", "chatml", "gemma")
  chatTemplate: "auto",
});
```

#### Chat Templates
The `chatTemplate` option controls how messages are formatted before being sent to the model. Available templates include:

- `chatml`, `llama2`, `llama2-sys`, `llama3`, `llama4`
- `mistral-v1`, `mistral-v3`, `mistral-v7`
- `phi3`, `phi4`, `gemma`, `falcon3`, `zephyr`
- `deepseek`, `deepseek2`, `deepseek3`, `command-r`
- And more (see the llama.cpp documentation for the full list)
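For example, if a model's embedded template is missing or unsuitable, a built-in template can be selected explicitly. A minimal sketch, assuming a ChatML-formatted model (the path is a placeholder):

```typescript
import { generateText } from "ai";
import { llamaCpp } from "ai-sdk-llama-cpp";

// Override the template embedded in the GGUF file and format messages as ChatML.
const model = llamaCpp({
  modelPath: "./models/your-chatml-model.gguf", // placeholder path
  chatTemplate: "chatml",
});

try {
  const { text } = await generateText({ model, prompt: "Hello!" });
  console.log(text);
} finally {
  model.dispose();
}
```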
### Generation Parameters

The standard AI SDK generation parameters are supported:
```typescript
try {
  const { text } = await generateText({
    model,
    prompt: "Hello!",
    maxTokens: 256, // Maximum tokens to generate
    temperature: 0.7, // Sampling temperature (0-2)
    topP: 0.9, // Nucleus sampling threshold
    topK: 40, // Top-k sampling
    stopSequences: ["\n"], // Stop generation at these sequences
  });
} finally {
  model.dispose();
}
```

## Model Downloads
You'll need to download GGUF-format models separately. Popular sources:
- Hugging Face - Search for GGUF models
- TheBloke's Models - Popular quantized models
Example download:
```bash
# Create models directory
mkdir -p models

# Download a model (example: Llama 3.2 1B)
wget -P models/ https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_K_M.gguf
```
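Since the model file is loaded from disk when the provider is constructed, it can help to check that the downloaded path exists first, so a typo fails with a readable message rather than a lower-level loading error. A small sketch using the file downloaded above:

```typescript
import { existsSync } from "node:fs";
import { llamaCpp } from "ai-sdk-llama-cpp";

const modelPath = "./models/Llama-3.2-1B-Instruct-Q4_K_M.gguf";

// Fail fast with a clear error if the file is missing.
if (!existsSync(modelPath)) {
  throw new Error(`Model file not found: ${modelPath}`);
}

const model = llamaCpp({ modelPath });
```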
## API Reference

### `llamaCpp(config)`
Creates a new llama.cpp language model instance.
Parameters:

- `config.modelPath` (string, required): Path to the GGUF model file
- `config.contextSize` (number, optional): Maximum context size. Default: 2048
- `config.gpuLayers` (number, optional): GPU layers to offload. Default: 99
- `config.threads` (number, optional): CPU threads. Default: 4
- `config.debug` (boolean, optional): Enable verbose llama.cpp output. Default: false
- `config.chatTemplate` (string, optional): Chat template to use for formatting messages. Default: "auto"

Returns:

`LlamaCppLanguageModel` - A language model compatible with the Vercel AI SDK
### `LlamaCppLanguageModel`

Implements the `LanguageModelV3` interface from `@ai-sdk/provider`.

Methods:

- `doGenerate(options)`: Non-streaming text generation
- `doStream(options)`: Streaming text generation
- `dispose()`: Unload the model and free GPU/CPU resources. Always call this when done to prevent memory leaks, especially when loading multiple models (see the sketch after this list)
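When several models are used in one process, release each one before loading the next. A minimal sketch with a hypothetical `generateWithModel` helper:

```typescript
import { generateText } from "ai";
import { llamaCpp } from "ai-sdk-llama-cpp";

// Hypothetical helper: load a model, run one prompt, and always release its resources.
async function generateWithModel(modelPath: string, prompt: string): Promise<string> {
  const model = llamaCpp({ modelPath });
  try {
    const { text } = await generateText({ model, prompt });
    return text;
  } finally {
    // Free GPU/CPU memory before the next model is loaded.
    model.dispose();
  }
}

const answer = await generateWithModel(
  "./models/llama-3.2-1b-instruct.Q4_K_M.gguf",
  "Hi!"
);
console.log(answer);
```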
## Limitations

This is a minimal implementation with the following limitations:
- macOS only: Windows and Linux are not supported
- No tool/function calling: Tool calls are not supported
- No image inputs: Only text prompts are supported
- No JSON mode: Structured output generation is not supported
## Development
### Building from Source
```bash
# Clone the repository
git clone https://github.com/lgrammel/ai-sdk-llama-cpp.git
cd ai-sdk-llama-cpp

# Initialize submodules
git submodule update --init --recursive

# Install dependencies
npm install

# Build the native addon and TypeScript
npm run build
```
### Scripts
- `npm run build` - Build everything (native + TypeScript)
- `npm run build:native` - Build only the native addon
- `npm run build:ts` - Build only TypeScript
- `npm run clean` - Remove build artifacts
- `npm run test` - Run tests in watch mode
- `npm run test:run` - Run all tests once
- `npm run test:unit` - Run unit tests
- `npm run test:integration` - Run integration tests
- `npm run test:e2e` - Run end-to-end tests
- `npm run test:coverage` - Run tests with coverage

## License

MIT

## Acknowledgments
- llama.cpp - The underlying inference engine
- Vercel AI SDK - The AI SDK framework
- node-addon-api - N-API C++ wrapper