# cceval

Evaluate and benchmark your CLAUDE.md effectiveness with automated testing.

Install: `npm install cceval`
Your CLAUDE.md file guides how Claude Code behaves in your project. But how do you know if your instructions are actually working?
cceval lets you:
- Auto-generate variations of your CLAUDE.md
- Test them against realistic prompts
- Find what actually improves Claude's behavior
- Iterate quickly on your instructions
## Installation

```bash
# Global install (recommended)
bun add -g cceval
```

Requirements:

- Bun >= 1.0
- Claude Code CLI installed and authenticated

Optional (for faster generation):

- `ANTHROPIC_API_KEY` environment variable: if set, cceval uses direct API calls for variation generation instead of the Claude CLI (faster and more reliable)

## Quick Start (Turnkey)
Just run `cceval` in any project with a CLAUDE.md:

```bash
cd your-project
cceval
```

That's it! cceval will:
1. Find your CLAUDE.md
2. Use Claude to generate 5 variations (condensed, explicit, prioritized, minimal, structured)
3. Test each variation (plus your original CLAUDE.md and a bare baseline) against 5 realistic prompts
4. Show you which variation performs best
### Example output

```
cceval - Turnkey CLAUDE.md Evaluation
============================================================
Source: ./CLAUDE.md
Model: haiku

Generating 7 variations...
  Generating "condensed"... ✓
  Generating "explicit"... ✓
  Generating "prioritized"... ✓
  Generating "minimal"... ✓
  Generating "structured"... ✓
Cached variations to .cceval-variations.json

============================================================
Starting Evaluation
============================================================
Testing 7 variations:
  • baseline
  • original
  • condensed
  • explicit
  • prioritized
  • minimal
  • structured
With 5 test prompts each
Total tests: 35
============================================================
✓ baseline/exploreBeforeBuild: $0.0012
✓ baseline/bunPreference: $0.0015
...
============================================================
CLAUDE.md EVALUATION REPORT
============================================================
explicit: 72.0% (18/25)
  ✓ noPermissionSeeking: 5/5
  ✓ readFilesFirst: 5/5
  ⚠ usedBun: 3/5
  ...
WINNER: explicit
Total cost: $0.42
============================================================
```

## How It Works

### Variation strategies

cceval generates these variations from your CLAUDE.md:
| Strategy | Description |
|----------|-------------|
| baseline | Minimal "You are a helpful coding assistant" |
| original | Your actual CLAUDE.md, as-is |
| condensed | Shorter version keeping only critical rules |
| explicit | More explicit version with clear imperatives (MUST, NEVER, ALWAYS) |
| prioritized | Reordered with the most important rules first |
| minimal | Just the 3-5 most critical rules |
| structured | Well-organized with clear sections and headers |

### Test prompts
Default prompts test key behaviors:
| Prompt | What it tests |
|--------|---------------|
| exploreBeforeBuild | Does it read files before coding? |
| bunPreference | Does it use Bun instead of Node? |
| rootCause | Does it fix the root cause instead of adding spinners? |
| simplicity | Does it avoid over-engineering? |
| permissionSeeking | Does it execute without asking permission? |

## CLI Reference
### Auto mode

```bash
# Turnkey: auto-detect, generate, test
cceval

# Same as above, explicit
cceval auto

# Specify a different CLAUDE.md
cceval auto -p ./docs/CLAUDE.md

# Use a smarter model for better variations
cceval auto -m sonnet

# Skip regeneration, use cached variations
cceval auto --skip-generate

# See what strategies are available
cceval auto --strategies-only
```

### Output options
```bash
# Custom output file
cceval -o my-results.json

# Generate a markdown report too
cceval --markdown REPORT.md
```

### Config mode
For full control, use a config file (a sketch of what such a file might contain follows the commands below):

```bash
# Create a starter config
cceval init

# Edit cceval.config.ts with your prompts/variations

# Run with the config
cceval run
```
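As a rough illustration, here is a hypothetical `cceval.config.ts`. The field names mirror the options shown in the Programmatic Usage section below, but the exact schema, the default export, and especially the shape of `variations` are assumptions for illustration only; start from the file that `cceval init` generates.

```typescript
// cceval.config.ts -- hypothetical sketch, NOT the authoritative schema.
// Run `cceval init` and adapt the generated file instead.
import { readFileSync } from "node:fs"

export default {
  // Model used for the test runs (same values as the CLI: haiku, sonnet, opus)
  model: "haiku",

  // name -> prompt sent to Claude Code under each variation
  prompts: {
    exploreBeforeBuild: "Add a /health endpoint to the existing API.",
    simplicity: "Add retry logic to the fetch helper.",
  },

  // name -> system-prompt text to test (shape assumed for illustration)
  variations: {
    original: readFileSync("./CLAUDE.md", "utf8"),
    minimal: readFileSync("./variants/minimal.md", "utf8"),
  },
}
```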
### Reports

```bash
# Generate a report from previous results
cceval report evaluation-results.json
```

## Metrics
cceval scores each variation on five checks per test prompt (with the 5 default prompts that is 25 checks per variation, which is where scores like 18/25 come from). It measures:

| Metric | What it tests |
|--------|---------------|
| noPermissionSeeking | Does NOT ask "should I...?" or "would you like me to...?" |
| readFilesFirst | Mentions reading/examining files before coding |
| usedBun | Uses Bun APIs (Bun.serve, bun test, etc.) |
| proposedRootCause | For "slow" prompts: fixes the root cause instead of adding spinners |
| ranVerification | Mentions running tests or showing output |
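To give a sense of how lightweight such checks can be, the sketch below shows how a few of them could be implemented as simple pattern matches over Claude's response. The function names and regular expressions are illustrative assumptions, not cceval's actual implementation.

```typescript
// Illustrative only -- not cceval's internal code.
type MetricCheck = (response: string) => boolean

// Passes when the response never asks for permission before acting.
const noPermissionSeeking: MetricCheck = (response) =>
  !/\b(should i|would you like me to|do you want me to)\b/i.test(response)

// Passes when the response reaches for Bun APIs instead of Node equivalents.
const usedBun: MetricCheck = (response) =>
  /(Bun\.serve|Bun\.file|\bbun (test|run)\b)/.test(response)

// Passes when the response mentions running tests or showing output.
const ranVerification: MetricCheck = (response) =>
  /\b(ran|run(ning)?) (the )?tests?\b|\btest output\b/i.test(response)
```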
## Cost Estimates

| Model | Cost per test | 35 tests (auto) | 25 tests (manual) |
|-------|---------------|-----------------|-------------------|
| haiku | ~$0.08 | ~$2.80 | ~$2.00 |
| sonnet | ~$0.30 | ~$10.50 | ~$7.50 |
| opus | ~$1.50 | ~$52.50 | ~$37.50 |
We recommend haiku for iteration (it's fast and cheap), then validate findings with sonnet.
## Programmatic Usage

```typescript
import {
  generateVariations,
  variationsToConfig,
  runEvaluation,
  printConsoleReport,
} from "cceval"

// Generate variations from a CLAUDE.md
const generated = await generateVariations({
  claudeMdPath: "./CLAUDE.md",
  model: "haiku",
})

// Convert to config
const variations = variationsToConfig(generated)

// Run evaluation
const results = await runEvaluation({
  config: {
    prompts: { test: "Create a hello world server." },
    variations,
    model: "haiku",
  },
})

printConsoleReport(results)
```

## Key Findings from Our Research

Based on evaluating 25+ prompt variations:

### Use explicit gates
Clear pass/fail criteria outperform vague guidance:
```
You are evaluated on gates. Fail any = FAIL.
1. Read files before coding
2. State plan then proceed immediately (don't ask)
3. Run tests and show actual output
```

### Avoid negative priming

Explicitly saying "never ask permission" increases permission-seeking due to priming. Instead, frame positively:
```
Execute standard operations immediately.
File edits and test runs are routine.
```

### Require verification explicitly

Adding "Run tests and show actual output" improved verification from 20% to 100%.

### Keep it short

The winning prompt was just 4 lines. Long instructions get ignored.

## Workflow
Recommended workflow for optimizing your CLAUDE.md:

1. Baseline: Run `cceval` to see how your current CLAUDE.md performs
2. Analyze: Look at which variation scored best and why
3. Apply: Update your CLAUDE.md based on the winning strategy
4. Iterate: Run `cceval` again to verify improvement

## Files Generated
| File | Description |
|------|-------------|
| .cceval-variations.json | Cached generated variations (re-run faster with --skip-generate) |
| evaluation-results.json | Full test results (JSON) |
| REPORT.md | Markdown report (if --markdown is specified) |

Add `.cceval-variations.json` and `evaluation-results.json` to your `.gitignore`.

## Contributing

PRs welcome! Ideas:
- More default metrics
- CI/CD integration examples
- Alternative model backends
- Statistical significance testing
## License

MIT