CLI for LLM inference, benchmarking, and model management - run local LLMs with Metal/CUDA acceleration
```bash
npm install @ruvector/ruvllm-cli
```




Command-line interface for local LLM inference and benchmarking - run AI models on your machine with Metal, CUDA, and CPU acceleration.
- Hardware Acceleration - Metal (macOS), CUDA (NVIDIA), Vulkan, Apple Neural Engine
- GGUF Support - Load quantized models (Q4, Q5, Q6, Q8) for efficient inference
- Interactive Chat - Terminal-based chat sessions with conversation history
- Benchmarking - Measure tokens/second, memory usage, time-to-first-token
- HTTP Server - OpenAI-compatible API server for integration
- Model Management - Download, list, and manage models from HuggingFace
- Streaming Output - Real-time token streaming for responsive UX
## Installation

```bash
# Install globally
npm install -g @ruvector/ruvllm-cli
```

For full native performance, install the Rust binary:

```bash
cargo install ruvllm-cli
```

## Quick Start

### Run Inference

```bash
# Basic inference
ruvllm run --model ./llama-7b-q4.gguf --prompt "Explain quantum computing"

# With options
ruvllm run \
--model ./model.gguf \
--prompt "Write a haiku about Rust" \
--temperature 0.8 \
--max-tokens 100 \
--backend metal
```

### Interactive Chat

```bash
# Start chat session
ruvllm chat --model ./model.gguf

# With system prompt
ruvllm chat --model ./model.gguf --system "You are a helpful coding assistant"
```

### Benchmarking

```bash
# Run benchmark
ruvllm bench --model ./model.gguf --iterations 20

# Compare backends
ruvllm bench --model ./model.gguf --backend metal
ruvllm bench --model ./model.gguf --backend cpu
```

### HTTP Server

```bash
# OpenAI-compatible API server
ruvllm serve --model ./model.gguf --port 8080

# Then use with any OpenAI client
curl http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{"prompt": "Hello", "max_tokens": 50}'
```
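
Because the server speaks the OpenAI wire format, existing OpenAI SDKs can be pointed at it. Below is a minimal sketch using the official `openai` npm package (a separate dependency, not part of this package); the `baseURL`, `apiKey`, and `model` values are placeholders for a local, unauthenticated server.

```typescript
// Sketch: calling the local ruvllm server with the `openai` npm package.
// baseURL targets the local server; apiKey and model are placeholders,
// since a local server typically ignores both.
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:8080/v1',
  apiKey: 'not-needed',
});

async function main() {
  const completion = await client.completions.create({
    model: 'local-model', // placeholder; the server answers with whatever model it loaded
    prompt: 'Hello',
    max_tokens: 50,
  });
  console.log(completion.choices[0].text);
}

main().catch(console.error);
```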
### Model Management

```bash
# List available models
ruvllm list

# Download from HuggingFace
ruvllm download TheBloke/Llama-2-7B-GGUF

# Download specific quantization
ruvllm download TheBloke/Llama-2-7B-GGUF --quant q4_k_m
```

## CLI Reference

| Command | Description |
|---------|-------------|
| `run` | Run inference on a prompt |
| `chat` | Interactive chat session |
| `bench` | Benchmark model performance |
| `serve` | Start HTTP server |
| `list` | List downloaded models |
| `download` | Download model from HuggingFace |

### Common Options

| Option | Description | Default |
|--------|-------------|---------|
| `--model, -m` | Path to GGUF model file | - |
| `--backend, -b` | Acceleration backend (metal, cuda, cpu) | auto |
| `--threads, -t` | Number of CPU threads | auto |
| `--gpu-layers` | Layers to offload to GPU | all |
| `--context-size` | Context window size | 2048 |
| `--verbose, -v` | Enable verbose logging | false |

### Generation Options

| Option | Description | Default |
|--------|-------------|---------|
| `--temperature` | Sampling temperature (0-2) | 0.7 |
| `--top-p` | Nucleus sampling threshold | 0.9 |
| `--top-k` | Top-k sampling | 40 |
| `--max-tokens` | Maximum tokens to generate | 256 |
| `--repeat-penalty` | Repetition penalty | 1.1 |
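
For intuition, the sketch below shows how temperature, top-k, and top-p typically combine to shape the next-token distribution. It is a generic illustration of these sampling parameters, not ruvllm's actual sampler; the function and variable names are invented for the example.

```typescript
// Illustrative only: generic temperature / top-k / top-p (nucleus) sampling.
function sampleNextToken(
  logits: number[],
  temperature = 0.7,
  topK = 40,
  topP = 0.9,
): number {
  // Temperature scaling: lower values sharpen the distribution, higher values flatten it.
  const scaled = logits.map((l) => l / Math.max(temperature, 1e-6));

  // Softmax over the scaled logits.
  const maxLogit = Math.max(...scaled);
  const exps = scaled.map((l) => Math.exp(l - maxLogit));
  const sum = exps.reduce((a, b) => a + b, 0);
  let probs = exps.map((e, i) => ({ token: i, p: e / sum }));

  // Top-k: keep only the k most likely tokens.
  probs.sort((a, b) => b.p - a.p);
  probs = probs.slice(0, topK);

  // Top-p: keep the smallest prefix whose cumulative probability reaches topP.
  let cumulative = 0;
  const nucleus: typeof probs = [];
  for (const entry of probs) {
    nucleus.push(entry);
    cumulative += entry.p;
    if (cumulative >= topP) break;
  }

  // Renormalize and draw one token from the truncated distribution.
  const total = nucleus.reduce((a, b) => a + b.p, 0);
  let r = Math.random() * total;
  for (const entry of nucleus) {
    r -= entry.p;
    if (r <= 0) return entry.token;
  }
  return nucleus[nucleus.length - 1].token;
}
```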
## Programmatic Usage

```typescript
import {
  parseArgs,
  formatBenchmarkTable,
  getAvailableBackends,
  ModelConfig,
  BenchmarkResult,
} from '@ruvector/ruvllm-cli';

// Parse CLI arguments
const args = parseArgs(['--model', './model.gguf', '--temperature', '0.8']);
console.log(args); // { model: './model.gguf', temperature: '0.8' }

// Check available backends
const backends = getAvailableBackends();
console.log('Available:', backends); // ['cpu', 'metal'] on macOS

// Format benchmark results
const results: BenchmarkResult[] = [
  {
    model: 'llama-7b',
    backend: 'metal',
    promptTokens: 50,
    generatedTokens: 100,
    promptTime: 120,
    generationTime: 2500,
    promptTPS: 416.7,
    generationTPS: 40.0,
    memoryUsage: 4200,
    peakMemory: 4800,
  },
];
console.log(formatBenchmarkTable(results));
```

## Performance

Benchmarks on Apple M2 Pro with Q4_K_M quantization:

| Model | Prompt TPS | Gen TPS | Memory |
|-------|------------|---------|--------|
| Llama-2-7B | 450 | 42 | 4.2 GB |
| Mistral-7B | 480 | 45 | 4.1 GB |
| Phi-2 | 820 | 85 | 1.8 GB |
| TinyLlama-1.1B | 1200 | 120 | 0.8 GB |
## Configuration

Create `~/.ruvllm/config.json`:

```json
{
"defaultBackend": "metal",
"modelsDir": "~/.ruvllm/models",
"cacheDir": "~/.ruvllm/cache",
"streaming": true,
"logLevel": "info"
}
```

## Environment Variables

| Variable | Description |
|----------|-------------|
| `RUVLLM_MODELS_DIR` | Models directory |
| `RUVLLM_CACHE_DIR` | Cache directory |
| `RUVLLM_BACKEND` | Default backend |
| `RUVLLM_THREADS` | CPU threads |
| `HF_TOKEN` | HuggingFace token for gated models |

## Related Packages

- @ruvector/ruvllm - LLM orchestration library
- @ruvector/ruvllm-wasm - Browser LLM inference
- ruvector - All-in-one vector database

## Documentation

- RuvLLM Documentation
- CLI Crate
- API Reference

## License

MIT OR Apache-2.0

---
Part of the RuVector ecosystem - High-performance vector database with self-learning capabilities.