CLI for LLM inference, benchmarking, and model management - run local LLMs with Metal/CUDA acceleration
```bash
npm install @ruvector/ruvllm-cli
```




Command-line interface for local LLM inference and benchmarking - run AI models on your machine with Metal, CUDA, and CPU acceleration.
- Hardware Acceleration - Metal (macOS), CUDA (NVIDIA), Vulkan, Apple Neural Engine
- GGUF Support - Load quantized models (Q4, Q5, Q6, Q8) for efficient inference
- Interactive Chat - Terminal-based chat sessions with conversation history
- Benchmarking - Measure tokens/second, memory usage, time-to-first-token
- HTTP Server - OpenAI-compatible API server for integration
- Model Management - Download, list, and manage models from HuggingFace
- Streaming Output - Real-time token streaming for responsive UX
## Installation

```bash
# Install globally
npm install -g @ruvector/ruvllm-cli
```

For full native performance, install the Rust binary:

```bash
cargo install ruvllm-cli
```

## Quick Start

### Run Inference

```bash
# Basic inference
ruvllm run --model ./llama-7b-q4.gguf --prompt "Explain quantum computing"

# With options
ruvllm run \
--model ./model.gguf \
--prompt "Write a haiku about Rust" \
--temperature 0.8 \
--max-tokens 100 \
--backend metal
```

### Interactive Chat

```bash
# Start chat session
ruvllm chat --model ./model.gguf

# With system prompt
ruvllm chat --model ./model.gguf --system "You are a helpful coding assistant"
```

### Benchmarking

```bash
# Run benchmark
ruvllm bench --model ./model.gguf --iterations 20

# Compare backends
ruvllm bench --model ./model.gguf --backend metal
ruvllm bench --model ./model.gguf --backend cpu
```

### HTTP Server

```bash
# OpenAI-compatible API server
ruvllm serve --model ./model.gguf --port 8080

# Then use with any OpenAI client
curl http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{"prompt": "Hello", "max_tokens": 50}'
```
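
Because the server speaks the OpenAI wire format, existing OpenAI SDKs can be pointed at it. Below is a minimal sketch using the official `openai` npm package (a separate dependency, not part of this package); the `baseURL`, `apiKey`, and `model` values are placeholders for a local, unauthenticated server.

```typescript
// Sketch: calling the local ruvllm server with the `openai` npm package.
// baseURL targets the local server; apiKey and model are placeholders,
// since a local server typically ignores both.
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:8080/v1',
  apiKey: 'not-needed',
});

async function main() {
  const completion = await client.completions.create({
    model: 'local-model', // placeholder; the server answers with whatever model it loaded
    prompt: 'Hello',
    max_tokens: 50,
  });
  console.log(completion.choices[0].text);
}

main().catch(console.error);
```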
### Model Management

```bash
# List available models
ruvllm list

# Download from HuggingFace
ruvllm download TheBloke/Llama-2-7B-GGUF

# Download specific quantization
ruvllm download TheBloke/Llama-2-7B-GGUF --quant q4_k_m
```

## CLI Reference

| Command | Description |
|---------|-------------|
| `run` | Run inference on a prompt |
| `chat` | Interactive chat session |
| `bench` | Benchmark model performance |
| `serve` | Start HTTP server |
| `list` | List downloaded models |
| `download` | Download model from HuggingFace |

### Common Options

| Option | Description | Default |
|--------|-------------|---------|
| `--model, -m` | Path to GGUF model file | - |
| `--backend, -b` | Acceleration backend (metal, cuda, cpu) | auto |
| `--threads, -t` | Number of CPU threads | auto |
| `--gpu-layers` | Layers to offload to GPU | all |
| `--context-size` | Context window size | 2048 |
| `--verbose, -v` | Enable verbose logging | false |

### Generation Options

| Option | Description | Default |
|--------|-------------|---------|
| `--temperature` | Sampling temperature (0-2) | 0.7 |
| `--top-p` | Nucleus sampling threshold | 0.9 |
| `--top-k` | Top-k sampling | 40 |
| `--max-tokens` | Maximum tokens to generate | 256 |
| `--repeat-penalty` | Repetition penalty | 1.1 |
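
For intuition, the sketch below shows how temperature, top-k, and top-p typically combine to shape the next-token distribution. It is a generic illustration of these sampling parameters, not ruvllm's actual sampler; the function and variable names are invented for the example.

```typescript
// Illustrative only: generic temperature / top-k / top-p (nucleus) sampling.
function sampleNextToken(
  logits: number[],
  temperature = 0.7,
  topK = 40,
  topP = 0.9,
): number {
  // Temperature scaling: lower values sharpen the distribution, higher values flatten it.
  const scaled = logits.map((l) => l / Math.max(temperature, 1e-6));

  // Softmax over the scaled logits.
  const maxLogit = Math.max(...scaled);
  const exps = scaled.map((l) => Math.exp(l - maxLogit));
  const sum = exps.reduce((a, b) => a + b, 0);
  let probs = exps.map((e, i) => ({ token: i, p: e / sum }));

  // Top-k: keep only the k most likely tokens.
  probs.sort((a, b) => b.p - a.p);
  probs = probs.slice(0, topK);

  // Top-p: keep the smallest prefix whose cumulative probability reaches topP.
  let cumulative = 0;
  const nucleus: typeof probs = [];
  for (const entry of probs) {
    nucleus.push(entry);
    cumulative += entry.p;
    if (cumulative >= topP) break;
  }

  // Renormalize and draw one token from the truncated distribution.
  const total = nucleus.reduce((a, b) => a + b.p, 0);
  let r = Math.random() * total;
  for (const entry of nucleus) {
    r -= entry.p;
    if (r <= 0) return entry.token;
  }
  return nucleus[nucleus.length - 1].token;
}
```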
## Programmatic Usage

```typescript
import {
  parseArgs,
  formatBenchmarkTable,
  getAvailableBackends,
  ModelConfig,
  BenchmarkResult,
} from '@ruvector/ruvllm-cli';

// Parse CLI arguments
const args = parseArgs(['--model', './model.gguf', '--temperature', '0.8']);
console.log(args); // { model: './model.gguf', temperature: '0.8' }

// Check available backends
const backends = getAvailableBackends();
console.log('Available:', backends); // ['cpu', 'metal'] on macOS

// Format benchmark results
const results: BenchmarkResult[] = [
  {
    model: 'llama-7b',
    backend: 'metal',
    promptTokens: 50,
    generatedTokens: 100,
    promptTime: 120,
    generationTime: 2500,
    promptTPS: 416.7,
    generationTPS: 40.0,
    memoryUsage: 4200,
    peakMemory: 4800,
  },
];
console.log(formatBenchmarkTable(results));
```

## Performance

Benchmarks on Apple M2 Pro with Q4_K_M quantization:

| Model | Prompt TPS | Gen TPS | Memory |
|-------|------------|---------|--------|
| Llama-2-7B | 450 | 42 | 4.2 GB |
| Mistral-7B | 480 | 45 | 4.1 GB |
| Phi-2 | 820 | 85 | 1.8 GB |
| TinyLlama-1.1B | 1200 | 120 | 0.8 GB |
## Configuration

Create `~/.ruvllm/config.json`:

```json
{
"defaultBackend": "metal",
"modelsDir": "~/.ruvllm/models",
"cacheDir": "~/.ruvllm/cache",
"streaming": true,
"logLevel": "info"
}
```

## Environment Variables

| Variable | Description |
|----------|-------------|
| `RUVLLM_MODELS_DIR` | Models directory |
| `RUVLLM_CACHE_DIR` | Cache directory |
| `RUVLLM_BACKEND` | Default backend |
| `RUVLLM_THREADS` | CPU threads |
| `HF_TOKEN` | HuggingFace token for gated models |

## Related Packages

- @ruvector/ruvllm - LLM orchestration library
- @ruvector/ruvllm-wasm - Browser LLM inference
- ruvector - All-in-one vector database

## Documentation

- RuvLLM Documentation
- CLI Crate
- API Reference

## License

MIT OR Apache-2.0

---
Part of the RuVector ecosystem - High-performance vector database with self-learning capabilities.