Building reliable AI agents is an incredibly challenging endeavor. Unlike traditional software where inputs and outputs are deterministic, AI agents operate in a complex, non-deterministic environment where the smallest changes can have unexpected and far-reaching consequences. A minor prompt modification, a slight adjustment to system instructions, or even a change in model parameters can ripple through the entire system, causing subtle failures that are difficult to detect and diagnose.
Testing AI agents presents unique challenges that traditional testing frameworks cannot adequately address. How do you validate that an agent correctly interprets user intent? How do you ensure tool invocations are appropriate and executed in the right sequence? How do you catch regressions when a prompt change breaks edge cases you didn't anticipate? These questions become exponentially more complex when dealing with multi-turn conversations, code generation, and complex routing decisions.
We built vibe-check internally to rigorously test and validate poof.new. As we iterated on prompts, refined agent behaviors, and added new capabilities, we needed a systematic way to ensure our changes didn't break existing functionality—and to catch issues before they reached production. Traditional testing approaches fell short, so we created a framework specifically designed for AI agent evaluation.
After using vibe-check extensively in our own development process, we're now open-sourcing it to help the broader AI agent development community. We believe that robust testing and evaluation frameworks are essential for building production-ready AI systems, and we hope vibe-check will help others navigate the complexities of agent development with more confidence.
Why vibe-check?
Building reliable AI agents is hard. Traditional testing approaches fall short when evaluating LLM behavior, tool usage, and multi-turn interactions. vibe-check provides a comprehensive framework specifically designed for AI agent evaluation:
- Agent-Native Testing: Evaluate tool calls, code generation, routing decisions, and conversational flows
- Learning from Failures: Built-in learning system analyzes failures and suggests prompt improvements
- Production-Ready: Parallel execution, retries, isolated workspaces, and detailed reporting
- Framework Agnostic: Works with Claude SDK (TypeScript & Python), custom agents, or any LLM-powered system
- Developer-First: TypeScript-native with full type safety and intuitive APIs
### Use Cases
- 🤖 Agent Development: Validate your AI agent meets requirements before shipping
- 📊 Regression Testing: Catch regressions when updating prompts or models
- 🔄 A/B Testing: Compare agent performance across different configurations
- 📈 Continuous Improvement: Use the learning system to systematically improve prompts
- 🎯 Benchmarking: Measure and track agent performance over time
- 🔍 Pre-deployment Validation: Gate production deployments on eval results
### Key Features

- 5 Eval Categories: Tool usage, code generation, routing, multi-turn conversations, and basic evaluations
- 7 Built-in Judges: File existence, tool invocation, pattern matching, syntax validation, skill invocation, and 4 LLM-based judges with rubric support
- Automatic Tool Extraction: For claude-code agents, tool calls are automatically extracted from JSONL logs
- Extensible Judge System: Create custom judges for specialized validation
- Parallel Execution: Run evaluations concurrently with configurable concurrency
- Retry Logic: Automatic retries with exponential backoff for flaky tests
- Flaky Test Detection: Automatically identifies tests that pass on retry
- Isolated Workspaces: Each eval runs in its own temporary directory
- Multi-trial Support: Run multiple trials per eval with pass thresholds
- Per-Turn Judges: Evaluate each turn independently in multi-turn conversations
- Learning System: Analyze failures and generate improvement rules
- TypeScript First: Full type safety with comprehensive type exports
Installation
```bash
# Using bun (recommended)
bun add @poofnew/vibe-check

# Using npm
npm install @poofnew/vibe-check

# Using pnpm
pnpm add @poofnew/vibe-check
```
Quick Start
### Initialize a Project
```bash
bunx vibe-check init
```
This creates:
- vibe-check.config.ts - Configuration file with agent function stub
- __evals__/example.eval.json - Example evaluation case
### Implement Your Agent Function
Edit vibe-check.config.ts to implement your agent function:

```typescript
import { defineConfig } from "@poofnew/vibe-check";
```
The agent function receives a prompt and context, and must return an AgentResult:

```typescript
interface AgentContext {
  workingDirectory: string; // Isolated temp directory for this eval
  evalId: string;           // Unique eval case ID
  evalName: string;         // Eval case name
  sessionId?: string;       // For multi-turn sessions
  timeout?: number;         // Eval timeout in ms
}

interface AgentResult {
  output: string;         // Agent's text output
  success: boolean;       // Whether agent completed successfully
  toolCalls?: ToolCall[]; // Record of tool invocations
  sessionId?: string;     // Session ID for multi-turn
  error?: Error;          // Error if failed
  duration?: number;      // Execution time in ms
  usage?: {
    inputTokens: number;
    outputTokens: number;
    totalCostUsd?: number;
  };
}

interface ToolCall {
  toolName: string;
  input: unknown;
  output?: unknown;
  isError?: boolean;
  timestamp?: number; // When the tool was called
  duration?: number;  // How long the call took (ms)
}
```
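To make the contract concrete, here is a hedged sketch of a config that wires in a trivial agent. The agent key and the exported type names are assumptions based on the interfaces above and on the stub that vibe-check init generates; swap the placeholder echo for a call to your real agent or LLM.

```typescript
import { defineConfig, type AgentContext, type AgentResult } from "@poofnew/vibe-check";

export default defineConfig({
  // Assumed key name; the generated stub shows where the agent function goes.
  agent: async (prompt: string, context: AgentContext): Promise<AgentResult> => {
    const start = Date.now();
    try {
      // Placeholder: call your agent or LLM here instead of echoing the prompt.
      const output = `Echo from ${context.evalName}: ${prompt}`;
      return {
        output,
        success: true,
        toolCalls: [], // record tool invocations if your agent makes any
        duration: Date.now() - start,
      };
    } catch (error) {
      return {
        output: "",
        success: false,
        error: error as Error,
        duration: Date.now() - start,
      };
    }
  },
});
```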
Learning System

The learning system automatically analyzes test failures and generates prompt improvements to enhance your agent's performance over time.
### How It Works
1. Collect Failures: Gathers failed evals from test runs or JSONL logs
2. Generate Explanations: Uses an LLM to analyze why each failure occurred
3. Detect Patterns: Groups similar failures into patterns
4. Propose Rules: Generates actionable prompt rules to fix systemic issues
5. Human Review: Allows manual approval before integrating rules
6. Iterate: Re-run evals to validate improvements
### Configuration
Enable learning in your config:
```typescript
export default defineConfig({
  learning: {
    enabled: true,
    ruleOutputDir: "./prompts",  // Where to save rules
    minFailuresForPattern: 2,    // Min failures to form a pattern
    autoApprove: false,          // Require manual review
  },
  // ...
});
```
### CLI Usage
```bash
# Run full learning iteration
vibe-check learn run --source eval

# Analyze failures without generating rules
vibe-check learn analyze --source both

# Review pending rules
vibe-check learn review

# Show learning statistics
vibe-check learn stats

# Auto-approve high-confidence rules (use with caution)
vibe-check learn run --auto-approve
```
### Data Sources
- eval: Analyze failures from recent eval runs
- jsonl: Load failures from JSONL files
- both: Combine both sources
### Best Practices

- Keep autoApprove: false to review all rules manually
- Run learning iterations after accumulating 10+ failures
- Set minFailuresForPattern: 2 to catch recurring issues
- Review rules before integration to avoid over-fitting
- Use the JSONL source for production failure logs
CLI Commands
### vibe-check run
Run the evaluation suite.
```bash
vibe-check run [options]

Options:
  -c, --config     Path to config file
  --category       Filter by category (tool, code-gen, routing, multi-turn, basic)
  --tag            Filter by tag
  --id             Filter by eval ID
  -v, --verbose    Verbose output
```
Examples:
```bash
# Run all evals
vibe-check run

# Run only code-gen evals
vibe-check run --category code-gen

# Run evals with specific tags
vibe-check run --tag critical --tag regression

# Run specific evals by ID
vibe-check run --id create-file --id read-file

# Verbose output
vibe-check run -v
```
### vibe-check list
List available eval cases.
```bash
vibe-check list [options]

Options:
  -c, --config     Path to config file
  --category       Filter by category
  --tag            Filter by tag
  --json           Output as JSON
```
### vibe-check learn

Learning system commands for analyzing failures and generating rules.
```bash
# Run full learning iteration
vibe-check learn run [options]
  --source         Data source (eval, jsonl, both)
  --auto-approve   Auto-approve high-confidence rules
  --save-pending   Save rules for later review

# Analyze failures without generating rules
vibe-check learn analyze [options]
  --source         Data source (eval, jsonl, both)

# Review pending rules
vibe-check learn review

# Show learning statistics
vibe-check learn stats
```
Programmatic API
Use vibe-check programmatically in your code:
```typescript
import {
  defineConfig,
  EvalRunner,
  loadConfig,
  loadEvalCases,
} from "@poofnew/vibe-check";

// Load and run
const config = await loadConfig("./vibe-check.config.ts");
const runner = new EvalRunner(config);

const result = await runner.run({
  categories: ["code-gen"],
  tags: ["critical"],
});

// Adapters (for multi-language support)
export { PythonAgentAdapter } from "@poofnew/vibe-check/adapters";
export type {
  AgentRequest,
  AgentResponse,
  PythonAdapterOptions,
} from "@poofnew/vibe-check/adapters";
```
Examples
Explore complete working examples in the examples/ directory:
### Basic
Simple agent integration with minimal configuration:
```bash
cd examples/basic
bun install
bun run vibe-check run
```
Use case: Quick start template, testing custom agents
### Claude Agent SDK (TypeScript)
Full-featured Claude SDK integration with tool tracking (TypeScript):
```bash
cd examples/claude-agent-sdk
bun install
export ANTHROPIC_API_KEY=your_key
bun run vibe-check run
```
Use case: Production Claude agents, comprehensive testing
### Python Agent SDK Integration
Python SDK integration using the process-based adapter:
```bash
cd examples/python-agent
bun install
./setup.sh # Creates Python venv and installs claude-agent-sdk
export ANTHROPIC_API_KEY=your_key
bun run vibe-check run
```
Use case: Python-based Claude agents, multi-language support
The Python adapter uses a JSON protocol over stdin/stdout to communicate with Python agent scripts:
```typescript
import { PythonAgentAdapter } from "@poofnew/vibe-check/adapters";
```
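The snippet above only shows the import; how the adapter is constructed and wired into a config is not spelled out here, so the following is a loose, hypothetical sketch. The option names and the runAgent call are assumptions; the real API is defined by PythonAdapterOptions and demonstrated in examples/python-agent.

```typescript
import { defineConfig, type AgentContext } from "@poofnew/vibe-check";
import { PythonAgentAdapter } from "@poofnew/vibe-check/adapters";

// Hypothetical sketch: the option names below are assumptions, not the documented API.
const adapter = new PythonAgentAdapter({
  scriptPath: "./agent.py",         // assumed: the Python agent script to spawn
  pythonPath: "./.venv/bin/python", // assumed: interpreter from setup.sh's venv
});

export default defineConfig({
  // Assumed wiring: the adapter supplies the (prompt, context) => AgentResult function
  // vibe-check expects, exchanging JSON with the script over stdin/stdout.
  agent: (prompt: string, context: AgentContext) => adapter.runAgent(prompt, context),
});
```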
### Custom Judges

```bash
cd examples/custom-judges
bun install
bun run vibe-check run
```
Use case: Domain-specific validation, custom metrics
### Multi-Turn Conversations
Multi-turn conversation testing with session persistence:
```bash
cd examples/multi-turn
bun install
bun run vibe-check run
```
Use case: Conversational agents, iterative refinement flows
### Learning System Example
Demonstrates the learning system with a mock agent that has deliberate flaws:
```bash
cd examples/learning
bun install
bun run vibe-check run           # Runs evals (some will fail by design)
bun run vibe-check learn stats   # Shows learning system status
bun run vibe-check learn analyze # Analyzes failures (requires ANTHROPIC_API_KEY)
```
Use case: Understanding the learning system, testing failure analysis pipeline
The example includes:
- A mock agent with predictable flaws (uses Read instead of Write, refuses delete operations, etc.); a rough sketch follows this list
- Pre-configured eval cases designed to fail
- Pre-generated results so learning commands work immediately
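For orientation, a deliberately flawed agent of this kind could look roughly like the hedged sketch below; the agent key is assumed as in the earlier sketch, and the real mock agent lives in examples/learning.

```typescript
import { defineConfig, type AgentContext, type AgentResult } from "@poofnew/vibe-check";

export default defineConfig({
  agent: async (prompt: string, context: AgentContext): Promise<AgentResult> => {
    // Deliberate flaw: refuse any delete operation outright.
    if (/delete|remove/i.test(prompt)) {
      return { output: "Refusing to delete anything.", success: false, toolCalls: [] };
    }

    // Deliberate flaw: reach for Read even when the eval expects a Write.
    return {
      output: `Inspected the workspace at ${context.workingDirectory}.`,
      success: true,
      toolCalls: [
        {
          toolName: "Read", // a Write call was expected here
          input: { path: `${context.workingDirectory}/example.txt` },
        },
      ],
    };
  },
});
```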
Performance Tips
Optimize your eval suite for speed and reliability:
### Parallel Execution
```typescript
export default defineConfig({
  parallel: true,
  maxConcurrency: 5, // Balance between speed and resource usage
});
```
Tip: Higher concurrency = faster but more memory/API usage. Start with 3-5.
### Filter Tests
```bash
# Run only critical tests during development
vibe-check run --tag critical

# Run specific categories
vibe-check run --category tool code-gen

# Run single test for debugging
vibe-check run --id my-test-id
```
### Timeouts and Retries
```typescript
export default defineConfig({
  timeout: 60000, // Default for all tests
});

// Override per eval case
{
  "id": "quick-test",
  "timeout": 10000, // Fast tests
  // ...
}
```
Tip: Enable retries for flaky network/API tests, disable for deterministic tests.
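The retry behavior this tip refers to is configured in the same config file; a minimal sketch using the maxRetries option that also appears in the Troubleshooting section below:

```typescript
import { defineConfig } from "@poofnew/vibe-check";

export default defineConfig({
  // Retries use exponential backoff, and tests that only pass on retry are flagged as flaky.
  maxRetries: 3,
});
```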
### Trials
```typescript
export default defineConfig({
  trials: 3,                // Run each test 3 times
  trialPassThreshold: 0.67, // Pass if 2/3 succeed
});

// Or per eval
{
  "id": "flaky-test",
  "trials": { "count": 5, "passThreshold": 0.8 },
  // ...
}
```
Tip: Use trials for non-deterministic agent behavior, but avoid over-reliance.
### Workspace Management
By default, vibe-check creates temporary workspaces and cleans them up after each eval. Use preserveWorkspaces: true for debugging:

```typescript
export default defineConfig({
  preserveWorkspaces: true, // Keep workspaces for inspection
  // ...
});
```
### Custom Workspace Hooks
For full control over workspace lifecycle, use createWorkspace and cleanupWorkspace hooks:

```typescript
import { defineConfig, type EvalWorkspace } from "@poofnew/vibe-check";
import * as fs from "fs/promises";
import * as path from "path";
import { execFile } from "child_process";
import { promisify } from "util";
```
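The hook bodies themselves are not shown above, so here is a hedged sketch of how they might be implemented (imports repeated for completeness, plus node:os). The hook signatures, the { path } shape of the workspace object, and the ./workspace-template path are assumptions; check the exported EvalWorkspace type for the real contract.

```typescript
import { defineConfig, type EvalWorkspace } from "@poofnew/vibe-check";
import * as fs from "fs/promises";
import * as os from "os";
import * as path from "path";
import { execFile } from "child_process";
import { promisify } from "util";

const execFileAsync = promisify(execFile);

export default defineConfig({
  // Assumed signature: createWorkspace returns the handle cleanupWorkspace later receives.
  createWorkspace: async (): Promise<EvalWorkspace> => {
    // Copy a pre-built template into a fresh temporary directory.
    const dir = await fs.mkdtemp(path.join(os.tmpdir(), "vibe-check-"));
    await fs.cp("./workspace-template", dir, { recursive: true }); // assumed template location

    // Pre-install dependencies so each eval starts from a ready workspace.
    await execFileAsync("bun", ["install"], { cwd: dir });

    return { path: dir } as unknown as EvalWorkspace; // field name is an assumption
  },

  cleanupWorkspace: async (workspace: EvalWorkspace) => {
    // Remove the temporary directory once the eval has finished.
    const dir = (workspace as unknown as { path: string }).path;
    await fs.rm(dir, { recursive: true, force: true });
  },

  // ...
});
```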
- Use any package manager (npm, yarn, pnpm, bun)
- Pre-install dependencies in the template for faster workspace setup
- Custom setup logic per workspace
- Full control over cleanup behavior
Troubleshooting
### Common Issues
#### "Cannot find config file"
```bash
# Ensure config exists
ls vibe-check.config.ts

# Or specify path
vibe-check run --config ./path/to/config.ts
```
#### "No eval cases found"
```typescript
// Check testDir and testMatch in config
export default defineConfig({
  testDir: "./__evals__",        // Must exist
  testMatch: ["**/*.eval.json"], // Must match file names
});
```
#### Tests timing out

```typescript
// Increase the timeout per eval case
{
  "timeout": 600000 // 10 minutes for slow tests
}
```
#### "Module not found" errors with Claude SDK
```bash
# Install peer dependencies
bun add @anthropic-ai/sdk @anthropic-ai/claude-agent-sdk

# Verify installation
bun pm ls | grep anthropic
```
#### Test failures in CI/CD
```typescript
// Reduce concurrency for stability
export default defineConfig({
  parallel: true,
  maxConcurrency: 2, // Lower for CI environments
  maxRetries: 3,     // More retries for flaky CI networks
});
```
#### Out of memory errors
```bash
# Reduce concurrency
vibe-check run --config config-with-lower-concurrency.ts

# Or run tests in batches
vibe-check run --category tool
vibe-check run --category code-gen
```
### Debugging
```bash
# Enable verbose output
vibe-check run -v

# Preserve workspaces for inspection
vibe-check run --config config-with-preserve-workspaces.ts
```
FAQ

Q: Which agents and frameworks does vibe-check support?
A: Any agent you can wrap in an async (prompt, context) => AgentResult function. Built-in support for Claude SDK (TypeScript and Python via adapters), but it works with LangChain, custom agents, or any LLM framework.
Q: Can I use this with other LLMs (OpenAI, Gemini, etc.)?
A: Yes! The framework is LLM-agnostic. Just implement the agent function to call your preferred LLM.
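For instance, a hedged sketch of an agent function backed by the official openai package; the agent config key follows the earlier sketches and the model name is illustrative.

```typescript
import OpenAI from "openai";
import { defineConfig, type AgentResult } from "@poofnew/vibe-check";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

export default defineConfig({
  agent: async (prompt: string): Promise<AgentResult> => {
    const completion = await client.chat.completions.create({
      model: "gpt-4o-mini", // illustrative model name
      messages: [{ role: "user", content: prompt }],
    });

    return {
      output: completion.choices[0]?.message?.content ?? "",
      success: true,
      toolCalls: [], // populate if your agent exposes tool calls
      usage: completion.usage
        ? {
            inputTokens: completion.usage.prompt_tokens,
            outputTokens: completion.usage.completion_tokens,
          }
        : undefined,
    };
  },
});
```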
Q: Do I need Bun or can I use Node/npm?
A: While optimized for Bun, vibe-check works with Node.js 18+ and npm/pnpm. Bun is recommended for best performance.
Q: Can I use Python agents with vibe-check?
A: Yes! Use the PythonAgentAdapter from @poofnew/vibe-check/adapters. It spawns Python scripts as subprocesses and communicates via JSON over stdin/stdout. See the Python Agent SDK Integration example.
CI/CD Integration

### GitHub Actions

```yaml
- name: Run vibe-check
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
  run: bun run vibe-check run
```
### GitLab CI
```yaml
test:
  image: oven/bun:latest
  script:
    - bun install
    - bun run vibe-check run
  variables:
    ANTHROPIC_API_KEY: $ANTHROPIC_API_KEY
```
### CircleCI
```yaml
version: 2.1

jobs:
  test:
    docker:
      - image: oven/bun:latest
    steps:
      - checkout
      - run: bun install
      - run: bun run vibe-check run
```
### CI Best Practices
- Use maxConcurrency: 2-3 for stable CI runs (see the example config after this list)
- Set appropriate timeouts for the CI environment
- Cache dependencies for faster runs
- Store API keys in secrets/environment variables
- Consider running critical tests only on PRs, full suite on main
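Putting the points above together, a config tuned for CI might look like the following sketch; the option names are the ones shown earlier in this README, and the values are starting points rather than recommendations.

```typescript
import { defineConfig } from "@poofnew/vibe-check";

export default defineConfig({
  parallel: true,
  maxConcurrency: 2, // keep CI runs stable
  maxRetries: 3,     // absorb flaky CI networks
  timeout: 120000,   // CI machines are often slower than local dev
  // ...
});
```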