# llama-cpp-capacitor

A native Capacitor plugin that embeds llama.cpp directly into mobile apps, enabling offline AI inference with a chat-first API design. Complete iOS and Android support: text generation, chat, multimodal input, TTS, LoRA adapters, embeddings, and more.
> llama.cpp: Inference of LLaMA model in pure C/C++

## Features
- Offline AI Inference: Run large language models completely offline on mobile devices
- Text Generation: Complete text completion with streaming support
- Chat Conversations: Multi-turn conversations with context management
- Multimodal Support: Process images and audio alongside text
- Text-to-Speech (TTS): Generate speech from text using vocoder models
- LoRA Adapters: Fine-tune models with LoRA adapters
- Embeddings: Generate vector embeddings for semantic search
- Reranking: Rank documents by relevance to queries
- Session Management: Save and load conversation states
- Benchmarking: Performance testing and optimization tools
- Structured Output: Generate JSON with schema validation
- Cross-Platform: iOS and Android support with native optimizations
## Implementation Status

This plugin is now FULLY IMPLEMENTED with complete native integration of llama.cpp for both iOS and Android platforms. The implementation includes:
```
llama-cpp/
├── cpp/                     # Complete llama.cpp C++ library
│   ├── ggml.c               # GGML core
│   ├── gguf.cpp             # GGUF format support
│   ├── llama.cpp            # Main llama.cpp implementation
│   ├── rn-llama.cpp         # React Native wrapper (adapted)
│   ├── rn-completion.cpp    # Completion handling
│   ├── rn-tts.cpp           # Text-to-speech
│   └── tools/mtmd/          # Multimodal support
├── ios/
│   ├── CMakeLists.txt       # iOS build configuration
│   └── Sources/             # Swift implementation
├── android/
│   ├── src/main/
│   │   ├── CMakeLists.txt   # Android build configuration
│   │   ├── jni.cpp          # JNI implementation
│   │   └── jni-utils.h      # JNI utilities
│   └── build.gradle         # Android build config
├── src/
│   ├── definitions.ts       # Complete TypeScript interfaces
│   ├── index.ts             # Main plugin implementation
│   └── web.ts               # Web fallback
└── build-native.sh          # Automated build script
```

## Installation

```sh
npm install llama-cpp-capacitor
```

## Building the Native Library
The plugin includes a complete native implementation of llama.cpp. To build the native libraries:
### Prerequisites
- CMake (3.16+ for iOS, 3.10+ for Android)
- Xcode (for iOS builds, macOS only)
- Android Studio with NDK (for Android builds)
- Make or Ninja build system
### Build Commands

```bash
# Build for all platforms
npm run build:native

# Build for specific platforms
npm run build:ios       # iOS only
npm run build:android   # Android only

# Clean native builds
npm run clean:native
```

### Manual Builds
#### iOS Build
```bash
cd ios
cmake -B build -S .
cmake --build build --config Release
```

#### Android Build
```bash
cd android
./gradlew assembleRelease
```

### Build Outputs

- iOS: `ios/build/LlamaCpp.framework/`
- Android: `android/src/main/jniLibs/{arch}/libllama-cpp-{arch}.so`

### Syncing with Upstream llama.cpp
The native `cpp/` layer is based on llama.cpp. To pull in a newer upstream version (e.g. for vision model support) without overwriting the Capacitor adapter code, use the bootstrap script (included in this repo):

```bash
./scripts/bootstrap.sh [branch-or-tag-or-commit]
# Example: ./scripts/bootstrap.sh master
```

This syncs upstream into `cpp/` and keeps project-specific files (`cap-*.cpp/h`, `tools/mtmd/`, etc.) intact. After running it, reconcile any API changes in the adapter code, then rebuild with `npm run build:native` or `./build-native.sh`. See `cpp/README.md` and `docs/IOS_IMPLEMENTATION_GUIDE.md`.

### iOS Implementation Guide
For a step-by-step guide on how methods are implemented on the iOS side (Swift bridge → native framework, adding/updating C symbols, and updating the native layer for vision), see `docs/IOS_IMPLEMENTATION_GUIDE.md`.

### iOS Setup
1. Install the plugin:
```sh
npm install llama-cpp-capacitor
```

2. Add to your iOS project:

```sh
npx cap add ios
npx cap sync ios
```

3. Open the project in Xcode:

```sh
npx cap open ios
```

### Android Setup
1. Install the plugin:
```sh
npm install llama-cpp-capacitor
```

2. Add to your Android project:

```sh
npx cap add android
npx cap sync android
```

3. Open the project in Android Studio:

```sh
npx cap open android
```

## Quick Start

### Basic Text Generation
```typescript
import { initLlama } from 'llama-cpp-capacitor';

// Initialize a model
const context = await initLlama({
  model: '/path/to/your/model.gguf',
  n_ctx: 2048,
  n_threads: 4,
  n_gpu_layers: 0,
});

// Generate text
const result = await context.completion({
  prompt: "Hello, how are you today?",
  n_predict: 50,
  temperature: 0.8,
});

console.log('Generated text:', result.text);
```

### Chat Conversations
```typescript
const result = await context.completion({
  messages: [
    { role: "system", content: "You are a helpful AI assistant." },
    { role: "user", content: "What is the capital of France?" },
    { role: "assistant", content: "The capital of France is Paris." },
    { role: "user", content: "Tell me more about it." }
  ],
  n_predict: 100,
  temperature: 0.7,
});

console.log('Chat response:', result.content);
```

### Streaming Responses
```typescript
let fullText = '';

const result = await context.completion({
  prompt: "Write a short story about a robot learning to paint:",
  n_predict: 150,
  temperature: 0.8,
}, (tokenData) => {
  // Called for each token as it's generated
  fullText += tokenData.token;
  console.log('Token:', tokenData.token);
});

console.log('Final result:', result.text);
```

## Mobile-Optimized Speculative Decoding
Achieve 2-8x faster inference with significantly reduced battery consumption!
Speculative decoding uses a smaller "draft" model to predict multiple tokens ahead, which are then verified by the main model. This results in dramatic speedups with identical output quality.
### Quick Start with Speculative Decoding
```typescript
import { initLlama } from 'llama-cpp-capacitor';

// Initialize with speculative decoding
const context = await initLlama({
  model: '/path/to/your/main-model.gguf',        // Main model (e.g., 7B)
  draft_model: '/path/to/your/draft-model.gguf', // Draft model (e.g., 1.5B)

  // Speculative decoding parameters
  speculative_samples: 3,   // Number of tokens to predict speculatively
  mobile_speculative: true, // Enable mobile optimizations

  // Standard parameters
  n_ctx: 2048,
  n_threads: 4,
});

// Use normally - speculative decoding is automatic
const result = await context.completion({
  prompt: "Write a story about AI:",
  n_predict: 200,
  temperature: 0.7,
});

console.log('Generated with speculative decoding:', result.text);
```

### Mobile-Optimized Configuration
```typescript
// Recommended mobile setup for best performance/battery balance
const mobileContext = await initLlama({
  // Quantized models for mobile efficiency
  model: '/models/llama-2-7b-chat.q4_0.gguf',
  draft_model: '/models/tinyllama-1.1b-chat.q4_0.gguf',

  // Conservative mobile settings
  n_ctx: 1024,        // Smaller context for mobile
  n_threads: 3,       // Conservative threading
  n_batch: 64,        // Smaller batch size
  n_gpu_layers: 24,   // Utilize mobile GPU

  // Optimized speculative decoding
  speculative_samples: 3,   // 2-3 tokens ideal for mobile
  mobile_speculative: true, // Enables mobile-specific optimizations

  // Memory optimizations
  use_mmap: true,   // Memory mapping for efficiency
  use_mlock: false, // Don't lock memory on mobile
});
```

### Benefits
- 2-8x faster inference - Dramatically reduced time to generate text
- 50-80% battery savings - Less time computing = longer battery life
- Identical output quality - Same text quality as regular decoding
- Automatic fallback - Falls back to regular decoding if draft model fails
- Mobile optimized - Specifically tuned for mobile device constraints
### Recommended Models
| Model Type  | Recommended Size  | Quantization | Example                         |
|-------------|-------------------|--------------|---------------------------------|
| Main Model  | 3-7B parameters   | Q4_0 or Q4_1 | `llama-2-7b-chat.q4_0.gguf`     |
| Draft Model | 1-1.5B parameters | Q4_0         | `tinyllama-1.1b-chat.q4_0.gguf` |

### Automatic Fallback
```typescript
// Robust setup with automatic fallback
let context;
try {
  context = await initLlama({
    model: '/models/main-model.gguf',
    draft_model: '/models/draft-model.gguf',
    speculative_samples: 3,
    mobile_speculative: true,
  });
  console.log('Speculative decoding enabled');
} catch (error) {
  console.warn('Falling back to regular decoding');
  context = await initLlama({
    model: '/models/main-model.gguf',
    // No draft_model = regular decoding
  });
}
```

## API Reference

### Core Functions
#### `initLlama(params: ContextParams, onProgress?: (progress: number) => void): Promise<LlamaContext>`

Initialize a new llama.cpp context with a model.

Parameters:
- `params`: Context initialization parameters
- `onProgress`: Optional progress callback (0-100)

Returns: Promise resolving to a `LlamaContext` instance

#### `releaseAllLlama(): Promise<void>`

Release all contexts and free memory.

#### `toggleNativeLog(enabled: boolean): Promise<void>`

Enable or disable native logging.

#### `addNativeLogListener(listener: (level: string, text: string) => void): { remove: () => void }`

Add a listener for native log messages.
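
As a quick sketch of the core lifecycle (the model path is illustrative, and this assumes `releaseAllLlama` is imported from the same package as `initLlama`):

```typescript
import { initLlama, releaseAllLlama } from 'llama-cpp-capacitor';

// Initialize a context and watch load progress (0-100)
const context = await initLlama(
  { model: '/path/to/model.gguf', n_ctx: 1024 },
  (progress) => console.log(`Loading model: ${progress}%`)
);

// ... run completions with `context` ...

// Free every native context when the app is done with inference
await releaseAllLlama();
```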
### Context Methods

#### `completion(params: CompletionParams, callback?: (data: TokenData) => void): Promise`

Generate text completion.

Parameters:
- `params`: Completion parameters including prompt or messages
- `callback`: Optional callback for token-by-token streaming

#### `tokenize(text: string, options?: { media_paths?: string[] }): Promise`

Tokenize text or text with images.

#### `detokenize(tokens: number[]): Promise<string>`

Convert tokens back to text.

#### `embedding(text: string, params?: EmbeddingParams): Promise`

Generate embeddings for text.

#### `rerank(query: string, documents: string[], params?: RerankParams): Promise`

Rank documents by relevance to a query.

#### `bench(pp: number, tg: number, pl: number, nr: number): Promise`

Benchmark model performance.
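
A short sketch of these methods used together; the result shapes (e.g. a `tokens` array on the tokenize result) and the `bench` argument values are assumptions for illustration:

```typescript
// Tokenize / detokenize round trip
const tokenized = await context.tokenize('Hello from llama.cpp!');
const restored = await context.detokenize(tokenized.tokens); // assumes the result exposes a `tokens` array

// Embeddings (context should be initialized with `embedding: true`)
const embedding = await context.embedding('Offline AI on mobile devices');

// Rank candidate documents against a query
const ranked = await context.rerank('What is the capital of France?', [
  'Paris is the capital of France.',
  'Berlin is the capital of Germany.',
]);

// Benchmark run: prompt-processing 512, text-generation 128, parallel 1, 3 repetitions (illustrative values)
const benchResult = await context.bench(512, 128, 1, 3);
console.log({ restored, embedding, ranked, benchResult });
```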
### Multimodal Methods

#### `initMultimodal(params: { path: string; use_gpu?: boolean }): Promise`

Initialize multimodal support with a projector file.

#### `isMultimodalEnabled(): Promise<boolean>`

Check if multimodal support is enabled.

#### `getMultimodalSupport(): Promise<{ vision: boolean; audio: boolean }>`

Get multimodal capabilities.

#### `releaseMultimodal(): Promise<void>`

Release multimodal resources.
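
For example, a minimal sketch that checks multimodal capabilities before sending media (the projector path is illustrative):

```typescript
// Load a multimodal projector (mmproj) file for the current model
await context.initMultimodal({ path: '/path/to/mmproj.gguf', use_gpu: true });

if (await context.isMultimodalEnabled()) {
  const support = await context.getMultimodalSupport();
  console.log('Vision:', support.vision, 'Audio:', support.audio);
}

// Release projector resources when media input is no longer needed
await context.releaseMultimodal();
```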
### TTS Methods

#### `initVocoder(params: { path: string; n_batch?: number }): Promise`

Initialize TTS with a vocoder model.

#### `isVocoderEnabled(): Promise<boolean>`

Check if TTS is enabled.

#### `getFormattedAudioCompletion(speaker: object | null, textToSpeak: string): Promise<{ prompt: string; grammar?: string }>`

Get the formatted audio completion prompt.

#### `getAudioCompletionGuideTokens(textToSpeak: string): Promise`

Get guide tokens for audio completion.

#### `decodeAudioTokens(tokens: number[]): Promise`

Decode audio tokens to audio data.

#### `releaseVocoder(): Promise<void>`

Release TTS resources.
### LoRA Methods

#### `applyLoraAdapters(loraList: Array<{ path: string; scaled?: number }>): Promise<void>`

Apply LoRA adapters to the model.

#### `removeLoraAdapters(): Promise<void>`

Remove all LoRA adapters.

#### `getLoadedLoraAdapters(): Promise`

Get the list of loaded LoRA adapters.
### Session Methods

#### `saveSession(filepath: string, options?: { tokenSize: number }): Promise`

Save the current session to a file.

#### `loadSession(filepath: string): Promise`

Load a session from a file.
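
A minimal sketch of persisting and restoring conversation state (the file path and `tokenSize` value are illustrative):

```typescript
// Persist the current session (e.g. after a long prompt has been processed)
await context.saveSession('/data/chat-session.bin', { tokenSize: 1024 });

// Later, with the same model initialized again, restore the saved state
// instead of re-processing the original prompt
await context.loadSession('/data/chat-session.bin');
```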
## Configuration

### Context Parameters
```typescript
interface ContextParams {
  model: string;          // Path to GGUF model file
  n_ctx?: number;         // Context size (default: 512)
  n_threads?: number;     // Number of threads (default: 4)
  n_gpu_layers?: number;  // GPU layers (iOS only)
  use_mlock?: boolean;    // Lock memory (default: false)
  use_mmap?: boolean;     // Use memory mapping (default: true)
  embedding?: boolean;    // Embedding mode (default: false)
  cache_type_k?: string;  // KV cache type for K
  cache_type_v?: string;  // KV cache type for V
  pooling_type?: string;  // Pooling type
  // ... more parameters
}
```

### Completion Parameters
```typescript
interface CompletionParams {
  prompt?: string;       // Text prompt
  messages?: Message[];  // Chat messages
  n_predict?: number;    // Max tokens to generate
  temperature?: number;  // Sampling temperature
  top_p?: number;        // Top-p sampling
  top_k?: number;        // Top-k sampling
  stop?: string[];       // Stop sequences
  // ... more parameters
}
```

## Platform Support
| Feature | iOS | Android | Web |
|---------|-----|---------|-----|
| Text Generation | ✅ | ✅ | ❌ |
| Chat Conversations | ✅ | ✅ | ❌ |
| Streaming | ✅ | ✅ | ❌ |
| Multimodal | ✅ | ✅ | ❌ |
| TTS | ✅ | ✅ | ❌ |
| LoRA Adapters | ✅ | ✅ | ❌ |
| Embeddings | ✅ | ✅ | ❌ |
| Reranking | ✅ | ✅ | ❌ |
| Session Management | ✅ | ✅ | ❌ |
| Benchmarking | ✅ | ✅ | ❌ |
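
Because the web platform does not run on-device inference (see the table above), you may want to guard native calls by platform; a small sketch using Capacitor's `Capacitor.getPlatform()`:

```typescript
import { Capacitor } from '@capacitor/core';
import { initLlama } from 'llama-cpp-capacitor';

// Only attempt on-device inference on iOS and Android
if (Capacitor.getPlatform() !== 'web') {
  const context = await initLlama({ model: '/path/to/model.gguf' });
  // ... run completions ...
} else {
  console.warn('On-device inference is not available on the web platform');
}
```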
## Advanced Examples

### Multimodal Processing
```typescript
// Initialize multimodal support
await context.initMultimodal({
  path: '/path/to/mmproj.gguf',
  use_gpu: true,
});

// Process image with text
const result = await context.completion({
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "What do you see in this image?" },
        { type: "image_url", image_url: { url: "file:///path/to/image.jpg" } }
      ]
    }
  ],
  n_predict: 100,
});

console.log('Image analysis:', result.content);
```

### Text-to-Speech (TTS)
```typescript
// Initialize TTS
await context.initVocoder({
  path: '/path/to/vocoder.gguf',
  n_batch: 512,
});

// Generate audio
const audioCompletion = await context.getFormattedAudioCompletion(
  null, // Speaker configuration
  "Hello, this is a test of text-to-speech functionality."
);

const guideTokens = await context.getAudioCompletionGuideTokens(
  "Hello, this is a test of text-to-speech functionality."
);

const audioResult = await context.completion({
  prompt: audioCompletion.prompt,
  grammar: audioCompletion.grammar,
  guide_tokens: guideTokens,
  n_predict: 1000,
});

const audioData = await context.decodeAudioTokens(audioResult.audio_tokens);
```

### LoRA Adapters
```typescript
// Apply LoRA adapters
await context.applyLoraAdapters([
  { path: '/path/to/adapter1.gguf', scaled: 1.0 },
  { path: '/path/to/adapter2.gguf', scaled: 0.5 }
]);

// Check loaded adapters
const adapters = await context.getLoadedLoraAdapters();
console.log('Loaded adapters:', adapters);

// Generate with adapters
const result = await context.completion({
  prompt: "Test prompt with LoRA adapters:",
  n_predict: 50,
});

// Remove adapters
await context.removeLoraAdapters();
```

### Structured Output
#### JSON Schema (Auto-converted to GBNF)
```typescript
const result = await context.completion({
  prompt: "Generate a JSON object with a person's name, age, and favorite color:",
  n_predict: 100,
  response_format: {
    type: 'json_schema',
    json_schema: {
      strict: true,
      schema: {
        type: 'object',
        properties: {
          name: { type: 'string' },
          age: { type: 'number' },
          favorite_color: { type: 'string' }
        },
        required: ['name', 'age', 'favorite_color']
      }
    }
  }
});

console.log('Structured output:', result.content);
```

#### Direct GBNF Grammar
```typescript
// Define GBNF grammar directly for maximum control
// (example grammar: constrain output to a flat JSON object with name and age)
const grammar = `
root   ::= "{" ws "\\"name\\":" ws string "," ws "\\"age\\":" ws number ws "}"
string ::= "\\"" [a-zA-Z ]* "\\""
number ::= [0-9]+
ws     ::= [ ]*
`;

const result = await context.completion({
  prompt: "Generate a person's profile:",
  grammar: grammar,
  n_predict: 100
});

console.log('Grammar-constrained output:', result.text);
```

#### Manual JSON Schema to GBNF Conversion
```typescript
import { convertJsonSchemaToGrammar } from 'llama-cpp-capacitor';

const schema = {
  type: 'object',
  properties: {
    name: { type: 'string' },
    age: { type: 'number' }
  },
  required: ['name', 'age']
};

// Convert schema to GBNF grammar
const grammar = await convertJsonSchemaToGrammar(schema);
console.log('Generated grammar:', grammar);

const result = await context.completion({
  prompt: "Generate a person:",
  grammar: grammar,
  n_predict: 100
});
```

## Model Compatibility
This plugin supports GGUF format models, which are compatible with llama.cpp. You can find GGUF models on Hugging Face by searching for the "GGUF" tag.
### Popular Models

- Llama 2: Meta's open language model
- Mistral: High-performance open model
- Code Llama: Specialized for code generation
- Phi-2: Microsoft's efficient model
- Gemma: Google's open model
### Model Quantization
For mobile devices, consider using quantized models (Q4_K_M, Q5_K_M, etc.) to reduce memory usage and improve performance.
## Performance Considerations

### Memory Management
- Use quantized models for better memory efficiency
- Adjust `n_ctx` based on your use case
- Monitor memory usage with `use_mlock: false`

### GPU Acceleration

- iOS: Set `n_gpu_layers` to use Metal GPU acceleration
- Android: GPU acceleration is automatically enabled when available

### Threading

- Adjust `n_threads` based on device capabilities
- More threads may improve performance but increase memory usage

## Troubleshooting
### Common Issues
1. Model not found: Ensure the model path is correct and the file exists
2. Out of memory: Try using a quantized model or reducing `n_ctx`
3. Slow performance: Enable GPU acceleration or increase `n_threads`
4. Multimodal not working: Ensure the mmproj file is compatible with your model

### Debug Logging
Enable native logging to see detailed information:
```typescript
import { toggleNativeLog, addNativeLogListener } from 'llama-cpp-capacitor';

await toggleNativeLog(true);

const logListener = addNativeLogListener((level, text) => {
  console.log(`[${level}] ${text}`);
});
```

## Publishing
To publish the package to npm:
1. Build (runs automatically on `npm publish` via `prepublishOnly`): `npm run build` produces `dist/` (plugin bundles, ESM, docs).
2. Optional, to include native libs in the tarball: `npm run build:all` (requires macOS/NDK) builds the iOS framework and Android `.so` files into `ios/Frameworks` and `android/src/main/jniLibs`.
3. Verify the pack: `npm run pack` (JS only) or `npm run pack:full` (JS + native) lists the files that would be published.
4. Publish: `npm publish`.

See `NPM_PUBLISH_GUIDE.md` for 2FA/token setup and troubleshooting.
## Contributing

We welcome contributions! Please see our Contributing Guide for details.

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Acknowledgements

- llama.cpp - The core inference engine
- Capacitor - The cross-platform runtime
- llama.rn - Inspiration for this plugin's native implementation

## Support

- Email: support@arusatech.com
- Issues: GitHub Issues
- Documentation: GitHub Wiki