# pi

CLI tool for managing vLLM deployments on GPU pods. Deploy and manage LLMs with automatic vLLM configuration for agentic workloads.

```bash
npm install -g @mariozechner/pi
```
pi simplifies running large language models on remote GPU pods. It automatically:
- Sets up vLLM on fresh Ubuntu pods
- Configures tool calling for agentic models (Qwen, GPT-OSS, GLM, etc.)
- Manages multiple models on the same pod with "smart" GPU allocation
- Provides OpenAI-compatible API endpoints for each model
- Includes an interactive agent with file system tools for testing
```bash
# Set required environment variables
export HF_TOKEN=your_huggingface_token # Get from https://huggingface.co/settings/tokens
export PI_API_KEY=your_api_key # Any string you want for API authentication
```

## Prerequisites
- Node.js 18+
- HuggingFace token (for model downloads)
- GPU pod with:
  - Ubuntu 22.04 or 24.04
  - SSH root access
  - NVIDIA drivers installed
  - Persistent storage for models
## Supported Providers

### Recommended

**DataCrunch** - Best for shared model storage
- NFS volumes shareable across multiple pods in the same region
- Models download once, use everywhere
- Ideal for teams or multiple experiments
**RunPod** - Good persistent storage
- Network volumes persist independently
- Cannot share between running pods simultaneously
- Good for single-pod workflows
### Also Supported
- Vast.ai (volumes locked to specific machine)
- Prime Intellect (no persistent storage)
- AWS EC2 (with EFS setup)
- Any Ubuntu machine with NVIDIA GPUs, CUDA driver, and SSH

## Commands
### Pod Management
```bash
pi pods setup <name> "<ssh-command>" [options]   # Setup new pod
  --mount "<command>"                # Run mount command during setup
  --models-path <path>               # Override extracted path (optional)
  --vllm release|nightly|gpt-oss     # vLLM version (default: release)

pi pods                              # List all configured pods
pi pods active <name>                # Switch active pod
pi pods remove <name>                # Remove pod from local config
pi shell [name]                      # SSH into pod
pi ssh [name] "<command>"            # Run command on pod
```

Note: When using `--mount`, the models path is automatically extracted from the mount command's target directory. You only need `--models-path` if not using `--mount` or to override the extracted path.

#### vLLM Version Options
- `release` (default): Stable vLLM release, recommended for most users
- `nightly`: Latest vLLM features, needed for newest models like GLM-4.5
- `gpt-oss`: Special build for OpenAI's GPT-OSS models only

### Model Management
```bash
pi start <model> --name <name> [options]   # Start a model
  --memory <percent>    # GPU memory: 30%, 50%, 90% (default: 90%)
  --context <size>      # Context window: 4k, 8k, 16k, 32k, 64k, 128k
  --gpus <count>        # Number of GPUs to use (predefined models only)
  --pod <name>          # Target specific pod (overrides active)
  --vllm <args>         # Pass custom args directly to vLLM

pi stop [name]          # Stop model (or all if no name given)
pi list                 # List running models with status
pi logs <name>          # Stream model logs (tail -f)
```

### Agent
```bash
pi agent "" # Single message to model
pi agent "" "" # Multiple messages in sequence
pi agent -i # Interactive chat mode
pi agent -i -c # Continue previous sessionStandalone OpenAI-compatible agent (works with any API)
pi-agent --base-url http://localhost:8000/v1 --model llama-3.1 "Hello"
pi-agent --api-key sk-... "What is 2+2?" # Uses OpenAI by default
pi-agent --json "What is 2+2?" # Output event stream as JSONL
pi-agent -i # Interactive mode
```

The agent includes tools for file operations (read, list, bash, glob, rg) to test agentic capabilities, particularly useful for code navigation and analysis tasks.
## Predefined Model Configurations

pi includes predefined configurations for popular agentic models, so you do not have to specify `--vllm` arguments manually. pi also checks whether the selected model can actually run on your pod, given the number of GPUs and available VRAM. Run `pi start` without additional arguments to see a list of predefined models that can run on the active pod.

### Qwen Models
```bash
# Qwen2.5-Coder-32B - Excellent coding model, fits on single H100/H200
pi start Qwen/Qwen2.5-Coder-32B-Instruct --name qwen

# Qwen3-Coder-30B - Advanced reasoning with tool use
pi start Qwen/Qwen3-Coder-30B-A3B-Instruct --name qwen3

# Qwen3-Coder-480B - State-of-the-art on 8xH200 (data-parallel mode)
pi start Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 --name qwen-480b
```

### GPT-OSS Models
```bash
# Requires special vLLM build during setup
pi pods setup gpt-pod "ssh root@1.2.3.4" --models-path /workspace --vllm gpt-oss

# GPT-OSS-20B - Fits on 16GB+ VRAM
pi start openai/gpt-oss-20b --name gpt20

# GPT-OSS-120B - Needs 60GB+ VRAM
pi start openai/gpt-oss-120b --name gpt120
```

### GLM Models
```bash
# GLM-4.5 - Requires 8-16 GPUs, includes thinking mode
pi start zai-org/GLM-4.5 --name glm

# GLM-4.5-Air - Smaller version, 1-2 GPUs
pi start zai-org/GLM-4.5-Air --name glm-air
```

### Custom Models
For models not in the predefined list, use `--vllm` to pass arguments directly to vLLM:

```bash
# DeepSeek with custom settings
pi start deepseek-ai/DeepSeek-V3 --name deepseek --vllm \
--tensor-parallel-size 4 --trust-remote-code

# Mistral with pipeline parallelism
pi start mistralai/Mixtral-8x22B-Instruct-v0.1 --name mixtral --vllm \
--tensor-parallel-size 8 --pipeline-parallel-size 2

# Any model with specific tool parser
pi start some/model --name mymodel --vllm \
--tool-call-parser hermes --enable-auto-tool-choice
```

## DataCrunch Setup
DataCrunch offers the best experience with shared NFS storage across pods:
### Create Shared Filesystem (SFS)
- Go to DataCrunch dashboard → Storage → Create SFS
- Choose size and datacenter
- Note the mount command (e.g., `sudo mount -t nfs -o nconnect=16 nfs.fin-02.datacrunch.io:/hf-models-fin02-8ac1bab7 /mnt/hf-models-fin02`)

### Create Instance
- Create instance in same datacenter as SFS
- Share the SFS with the instance
- Get SSH command from dashboard

### Setup with pi
```bash
# Get mount command from DataCrunch dashboard
pi pods setup dc1 "ssh root@instance.datacrunch.io" \
--mount "sudo mount -t nfs -o nconnect=16 nfs.fin-02.datacrunch.io:/your-pseudo /mnt/hf-models"Models automatically stored in /mnt/hf-models (extracted from mount command)
```

### Benefits
- Models persist across instance restarts
- Share models between multiple instances in same datacenter
- Download once, use everywhere
- Pay only for storage, not compute time during downloads

## RunPod Setup
RunPod offers good persistent storage with network volumes:
### Create Network Volume
- Go to RunPod dashboard → Storage → Create Network Volume
- Choose size and region

### Create Pod
- Select "Network Volume" during pod creation (if using)
- Attach your volume to /runpod-volume
- Get SSH command from pod details

### Setup with pi
```bash
# With network volume
pi pods setup runpod "ssh root@pod.runpod.io" --models-path /runpod-volume

# Or use workspace (persists with pod but not shareable)
pi pods setup runpod "ssh root@pod.runpod.io" --models-path /workspace
```
## Multi-GPU Support

### Automatic GPU Assignment
When running multiple models, pi automatically assigns them to different GPUs:
```bash
pi start model1 --name m1 # Auto-assigns to GPU 0
pi start model2 --name m2 # Auto-assigns to GPU 1
pi start model3 --name m3 # Auto-assigns to GPU 2
```

### Controlling GPU Count
For predefined models with multiple configurations, use --gpus to control GPU usage:
```bash
# Run Qwen on 1 GPU instead of all available
pi start Qwen/Qwen2.5-Coder-32B-Instruct --name qwen --gpus 1

# Run GLM-4.5 on 8 GPUs (if it has an 8-GPU config)
pi start zai-org/GLM-4.5 --name glm --gpus 8
```

If the model doesn't have a configuration for the requested GPU count, you'll see the available options.
### Tensor and Data Parallelism
For models that don't fit on a single GPU:
```bash
# Use all available GPUs
pi start meta-llama/Llama-3.1-70B-Instruct --name llama70b --vllm \
--tensor-parallel-size 4

# Specific GPU count
pi start Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 --name qwen480 --vllm \
--data-parallel-size 8 --enable-expert-parallel
```

## API Integration
All models expose OpenAI-compatible endpoints:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://your-pod-ip:8001/v1",
    api_key="your-pi-api-key"
)

# Chat completion with tool calling
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-32B-Instruct",
    messages=[
        {"role": "user", "content": "Write a Python function to calculate fibonacci"}
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "execute_code",
            "description": "Execute Python code",
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {"type": "string"}
                },
                "required": ["code"]
            }
        }
    }],
    tool_choice="auto"
)
```
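The response can then be checked for tool calls before continuing the conversation. A minimal sketch using the same `openai` SDK (standard Chat Completions response fields; the `execute_code` tool above is only illustrative, so the execution step is just indicated):

```python
import json

message = response.choices[0].message

if message.tool_calls:
    for call in message.tool_calls:
        # Each tool call carries a function name and JSON-encoded arguments
        args = json.loads(call.function.arguments)
        print(f"Model requested {call.function.name} with {args}")
        # A real agent loop would run the tool, append a
        # {"role": "tool", "tool_call_id": call.id, "content": ...} message,
        # and call the API again with the extended message list.
else:
    print(message.content)
```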
## Standalone Agent CLI

pi includes a standalone OpenAI-compatible agent that can work with any API:

```bash
# Install globally to get the pi-agent command
npm install -g @mariozechner/pi

# Use with OpenAI
pi-agent --api-key sk-... "What is machine learning?"

# Use with local vLLM
pi-agent --base-url http://localhost:8000/v1 \
--model meta-llama/Llama-3.1-8B-Instruct \
--api-key dummy \
"Explain quantum computing"Interactive mode
pi-agent -iContinue previous session
pi-agent --continue "Follow up question"Custom system prompt
pi-agent --system-prompt "You are a Python expert" "Write a web scraper"Use responses API (for GPT-OSS models)
pi-agent --api responses --model openai/gpt-oss-20b "Hello"
```

The agent supports:
- Session persistence across conversations
- Interactive TUI mode with syntax highlighting
- File system tools (read, list, bash, glob, rg) for code navigation
- Both Chat Completions and Responses API formats
- Custom system prompts
## Tool Calling Support

pi automatically configures appropriate tool calling parsers for known models:

- Qwen models: `hermes` parser (Qwen3-Coder uses `qwen3_coder`)
- GLM models: `glm4_moe` parser with reasoning support
- GPT-OSS models: use the `/v1/responses` endpoint, since tool calling (function calling in OpenAI parlance) is still a work in progress on the `/v1/chat/completions` endpoint
- Custom models: specify with `--vllm --tool-call-parser <parser>`

To disable tool calling:
```bash
pi start model --name mymodel --vllm --disable-tool-call-parser
```
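For GPT-OSS models, which are served through `/v1/responses` as noted above, requests go through the Responses API rather than Chat Completions. A rough sketch with the `openai` Python SDK, reusing the placeholder pod address, port, and API key from the API Integration section:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://your-pod-ip:8001/v1",  # placeholder pod address/port
    api_key="your-pi-api-key",
)

# The Responses API takes a single `input` instead of a messages list
response = client.responses.create(
    model="openai/gpt-oss-20b",
    input="Write a Python function to calculate fibonacci.",
)

print(response.output_text)
```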
## Memory and Context Management

### GPU Memory (--memory)
Controls how much GPU memory vLLM pre-allocates:
- --memory 30%: High concurrency, limited context
- --memory 50%: Balanced (default)
- --memory 90%: Maximum context, low concurrency

### Context Window (--context)
Sets maximum input + output tokens:
- --context 4k: 4,096 tokens total
- --context 32k: 32,768 tokens total
- --context 128k: 131,072 tokens total

Example for coding workload:
```bash
# Large context for code analysis, moderate concurrency
pi start Qwen/Qwen2.5-Coder-32B-Instruct --name coder \
--context 64k --memory 70%
```

Note: When using `--vllm`, the `--memory`, `--context`, and `--gpus` parameters are ignored. You'll see a warning if you try to use them together.

## Session Persistence
The interactive agent mode (`-i`) saves sessions for each project directory:

```bash
# Start new session
pi agent qwen -i

# Continue previous session (maintains chat history)
pi agent qwen -i -c
```

Sessions are stored in `~/.pi/sessions/`, organized by project path, and include:
- Complete conversation history
- Tool call results
- Token usage statistics

## Architecture & Event System
The agent uses a unified event-based architecture where all interactions flow through `AgentEvent` types. This enables:
- Consistent UI rendering across console and TUI modes
- Session recording and replay
- Clean separation between API calls and UI updates
- JSON output mode for programmatic integration

Events are automatically converted to the appropriate API format (Chat Completions or Responses) based on the model type.
### JSON Output Mode
Use the `--json` flag to output the event stream as JSONL (JSON Lines) for programmatic consumption:
```bash
pi-agent --api-key sk-... --json "What is 2+2?"
```

Each line is a complete JSON object representing an event:
```jsonl
{"type":"user_message","text":"What is 2+2?"}
{"type":"assistant_start"}
{"type":"assistant_message","text":"2 + 2 = 4"}
{"type":"token_usage","inputTokens":10,"outputTokens":5,"totalTokens":15,"cacheReadTokens":0,"cacheWriteTokens":0}
```
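The JSONL stream makes it straightforward to drive pi-agent from another program. A small sketch of a consumer (assumes `pi-agent` is on PATH and an API key is available via `OPENAI_API_KEY` or the flags shown earlier; the event fields follow the example above):

```python
import json
import subprocess

# Run pi-agent in JSON mode and read events line by line as they stream
proc = subprocess.Popen(
    ["pi-agent", "--json", "What is 2+2?"],
    stdout=subprocess.PIPE,
    text=True,
)

for line in proc.stdout:
    line = line.strip()
    if not line:
        continue
    event = json.loads(line)
    if event["type"] == "assistant_message":
        print("assistant:", event["text"])
    elif event["type"] == "token_usage":
        print("total tokens:", event["totalTokens"])

proc.wait()
```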
## Troubleshooting

### Out of Memory Errors
- Reduce --memory percentage
- Use smaller model or quantized version (FP8)
- Reduce --context size

### Model Won't Start
```bash
# Check GPU usage
pi ssh "nvidia-smi"

# Check if port is in use
pi list

# Force stop all models
pi stop
```

### Tool Calling Not Working
- Not all models support tool calling reliably
- Try different parser: --vllm --tool-call-parser mistral
- Or disable: --vllm --disable-tool-call-parser

### Gated Models
Some models (Llama, Mistral) require HuggingFace access approval. Visit the model page and click "Request access".

### vLLM Nightly Issues
If using --vllm nightly fails, try:
- Use --vllm release for stable version
- Check CUDA compatibility with pi ssh "nvidia-smi"

### Agent Shows Configuration Instead of Your Message
If the agent shows configuration instead of your message, ensure quotes around messages with special characters:
```bash
# Good
pi agent qwen "What is this file about?"

# Bad (shell might interpret special chars)
pi agent qwen What is this file about?
```

## Advanced Usage
### Working with Multiple Pods
```bash
# Override active pod for any command
pi start model --name test --pod dev-pod
pi list --pod prod-pod
pi stop test --pod dev-pod
```

### Custom vLLM Arguments
```bash
# Pass any vLLM argument after --vllm
pi start model --name custom --vllm \
--quantization awq \
--enable-prefix-caching \
--max-num-seqs 256 \
--gpu-memory-utilization 0.95
```

### Monitoring
```bash
# Watch GPU utilization
pi ssh "watch -n 1 nvidia-smi"

# Check model downloads
pi ssh "du -sh ~/.cache/huggingface/hub/*"

# View all logs
pi ssh "ls -la ~/.vllm_logs/"

# Check agent session history
ls -la ~/.pi/sessions/
```

## Environment Variables
- `HF_TOKEN` - HuggingFace token for model downloads
- `PI_API_KEY` - API key for vLLM endpoints
- `PI_CONFIG_DIR` - Config directory (default: `~/.pi`)
- `OPENAI_API_KEY` - Used by pi-agent when no `--api-key` is provided

## License

MIT