A benchmark suite for coding agents. Think pytest, but for evaluating AI assistants.
When you change your AI coding setup—switching models, adjusting prompts, or trying new tools—you're flying blind. Did it actually get better? Worse? Hard to say without data.
Sniffbench gives you that data. It runs your coding agent through evaluation tasks and measures what matters.
## Installation

Install from npm:

```bash
npm install sniffbench
```

Or clone and build from source:

```bash
# Clone and build
git clone https://github.com/answerlayer/sniffbench.git
cd sniffbench
npm install
npm run build
```
## What Works Now
### Comprehension Interview
Test how well your agent understands a codebase:
```bash
sniff interview
```

This runs your agent through 12 comprehension questions about the codebase architecture. You grade each answer on a 1-10 scale to establish baselines. Future runs compare against your baseline. Example output:
```
╭─ sniff interview ────────────────────────────────────────────────╮
│ Comprehension Interview                                          │
│                                                                  │
│ Test how well your agent understands this codebase.              │
│ You'll grade each answer on a 1-10 scale to establish baselines. │
╰──────────────────────────────────────────────────────────────────╯
✔ Found 12 comprehension questions
Questions to cover:
○ not graded comp-001: Project Overview
○ not graded comp-002: How to Add New Features
...
```
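Grades are kept as a baseline so later interviews can be compared against it. As a rough illustration of the idea (not sniffbench's actual data model; the field names and `compareToBaseline` helper below are hypothetical), a graded run boils down to something like this:

```typescript
// Illustrative sketch only; sniffbench's real storage format may differ.

// One graded answer: the question plus the 1-10 score you assigned.
interface GradedAnswer {
  questionId: string; // e.g. "comp-001"
  title: string;      // e.g. "Project Overview"
  grade: number;      // 1-10, assigned during `sniff interview`
}

// One full interview run for a given agent setup.
interface InterviewRun {
  agent: string;          // label for the setup under test
  gradedAt: string;       // ISO timestamp
  answers: GradedAnswer[];
}

// Average grade across all questions in a run.
function averageGrade(run: InterviewRun): number {
  const total = run.answers.reduce((sum, a) => sum + a.grade, 0);
  return total / run.answers.length;
}

// Positive delta: the new setup scored better than your baseline.
function compareToBaseline(baseline: InterviewRun, current: InterviewRun): number {
  return averageGrade(current) - averageGrade(baseline);
}
```

The point is simply that a graded baseline turns "did my setup get better?" into a number you can track across runs.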
### Case Management

```bash
# List all test cases
sniff cases

# Show details of a specific case
sniff cases show comp-001

# List categories
sniff cases categories
```
### Status & Diagnostics
```bash
# Check sniffbench configuration
sniff status

# Run diagnostics (Docker, dependencies)
sniff doctor
```

## What We Measure

Sniffbench evaluates agents on behaviors that matter for real-world development:
1. Style Adherence - Does the agent follow existing patterns in the repo?
2. Targeted Changes - Does it make specific, focused changes without over-engineering?
3. Efficient Navigation - Does it research the codebase efficiently?
4. Non-Regression - Do existing tests still pass?
We explicitly do NOT measure generic "best practices" divorced from project context. See VALUES.md for our full philosophy.
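As a rough sketch of how those four behaviors could roll up into one comparable number (the weights and names below are illustrative, not sniffbench's actual scoring code):

```typescript
// Hypothetical rubric: one 0-1 score per measured behavior.
interface BehaviorScores {
  styleAdherence: number;      // follows existing patterns in the repo
  targetedChanges: number;     // focused changes, no over-engineering
  efficientNavigation: number; // researched the codebase without thrashing
  nonRegression: number;       // existing tests still pass
}

// Example weights; a real rubric would tune these per project.
const WEIGHTS: BehaviorScores = {
  styleAdherence: 0.25,
  targetedChanges: 0.3,
  efficientNavigation: 0.15,
  nonRegression: 0.3,
};

// Weighted average in [0, 1]; higher is better.
function overallScore(scores: BehaviorScores): number {
  return (Object.keys(WEIGHTS) as (keyof BehaviorScores)[])
    .reduce((sum, key) => sum + WEIGHTS[key] * scores[key], 0);
}
```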
## Case Types

| Type | Description | Status |
|------|-------------|--------|
| Comprehension | Questions about codebase architecture | ✅ Ready |
| Bootstrap | Common tasks (fix linting, rename symbols) | 🚧 In Progress |
| Closed Issues | Real issues from your repo's history | 🚧 In Progress |
| Generated | LLM discovers improvement opportunities | 🚧 Planned |
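To give a feel for what a case might contain, here is a hypothetical shape for a comprehension case such as comp-001; the actual on-disk format is defined by sniffbench and may look different:

```typescript
// Hypothetical case shape, for illustration only.
interface ComprehensionCase {
  id: string;              // e.g. "comp-001"
  category: "comprehension";
  title: string;           // e.g. "Project Overview"
  question: string;        // what the agent is asked about the codebase
  gradingHints?: string[]; // optional notes for the human grader
}

// An invented example; the real comp-001 question lives in the sniffbench cases.
const example: ComprehensionCase = {
  id: "comp-001",
  category: "comprehension",
  title: "Project Overview",
  question: "Describe what this project does and how its main pieces fit together.",
  gradingHints: ["Names the CLI entry point", "Identifies the supported case types"],
};
```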
## Roadmap

We're building in phases:
1. ✅ Foundation - CLI, Docker sandboxing, case management
2. 🚧 Case Types - Comprehension, bootstrap, closed issues, generated
3. ⬜ Agent Integration - Claude Code, Cursor, Aider wrappers
4. ⬜ Metrics - Comprehensive scoring and comparison
5. ⬜ Multi-Agent - Cross-agent benchmarking
See ROADMAP.md for detailed phases.
## Contributing

We welcome contributions! Areas that need work:
- Agent wrappers - Integrate with OpenCode, Cursor, Gemini, or your favourite CLI-based coding agent (a sketch of one possible wrapper shape follows at the end of this section)
- Bootstrap cases - Detection and validation for common tasks
- Closed issues scanner - Extract cases from git history
- Documentation - Examples, tutorials, case studies
See CONTRIBUTING.md to get started.
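For the agent wrappers in particular, the rough idea is an adapter that takes a task prompt, drives the agent's CLI inside the sandbox, and reports what changed. The interface and the `agent-cli` command below are purely illustrative; the real extension points are described in CONTRIBUTING.md and the source.

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

// Illustrative wrapper contract; not sniffbench's actual interface.
interface AgentResult {
  transcript: string;     // what the agent reported, for later review
  changedFiles: string[]; // files the agent touched
  exitCode: number;
}

interface AgentWrapper {
  name: string; // e.g. "aider", "cursor"
  run(taskPrompt: string, workdir: string): Promise<AgentResult>;
}

// Example: shell out to a hypothetical "agent-cli" binary inside the sandbox.
const exampleWrapper: AgentWrapper = {
  name: "example-cli-agent",
  async run(taskPrompt, workdir) {
    const { stdout } = await execFileAsync("agent-cli", ["--prompt", taskPrompt], {
      cwd: workdir,
    });
    // A real wrapper would also diff the working tree to fill in changedFiles.
    return { transcript: stdout, changedFiles: [], exitCode: 0 };
  },
};
```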
## Existing Work

We researched existing solutions (SWE-Bench, CORE-Bench, Aider benchmarks). See existing_work.md for analysis.
## License

MIT - see LICENSE
## Questions?

Open an issue. We're building this in public and welcome feedback.