# cceval

Evaluate and benchmark your CLAUDE.md effectiveness with automated testing.

Install: `npm install cceval`
Your CLAUDE.md file guides how Claude Code behaves in your project. But how do you know if your instructions are actually working?
cceval lets you:
- Auto-generate variations of your CLAUDE.md
- Test them against realistic prompts
- Find what actually improves Claude's behavior
- Iterate quickly on your instructions
## Installation

```bash
# Global install (recommended)
bun add -g cceval
```

Requirements:

- Bun >= 1.0
- Claude Code CLI installed and authenticated

Optional (for faster generation):

- `ANTHROPIC_API_KEY` environment variable: if set, cceval uses direct API calls for variation generation instead of the Claude CLI (faster and more reliable)

## Quick Start (Turnkey)
Just run `cceval` in any project with a CLAUDE.md:

```bash
cd your-project
cceval
```

That's it! cceval will:
1. Find your CLAUDE.md
2. Use Claude to generate 5 variations (condensed, explicit, prioritized, minimal, structured)
3. Test each variation (plus your original CLAUDE.md and a bare baseline) against 5 realistic prompts
4. Show you which variation performs best
### Example output

```
cceval - Turnkey CLAUDE.md Evaluation
============================================================
Source: ./CLAUDE.md
Model: haiku

Generating 7 variations...
  Generating "condensed"... ✓
  Generating "explicit"... ✓
  Generating "prioritized"... ✓
  Generating "minimal"... ✓
  Generating "structured"... ✓
Cached variations to .cceval-variations.json

============================================================
Starting Evaluation
============================================================
Testing 7 variations:
  • baseline
  • original
  • condensed
  • explicit
  • prioritized
  • minimal
  • structured
With 5 test prompts each
Total tests: 35
============================================================
✓ baseline/exploreBeforeBuild: $0.0012
✓ baseline/bunPreference: $0.0015
...
============================================================
CLAUDE.md EVALUATION REPORT
============================================================
explicit: 72.0% (18/25)
  ✓ noPermissionSeeking: 5/5
  ✓ readFilesFirst: 5/5
  ⚠ usedBun: 3/5
  ...
WINNER: explicit
Total cost: $0.42
============================================================
```

## How It Works

### Variation strategies

cceval generates these variations from your CLAUDE.md:
| Strategy | Description |
|----------|-------------|
| baseline | Minimal "You are a helpful coding assistant" |
| original | Your actual CLAUDE.md, as-is |
| condensed | Shorter version keeping only critical rules |
| explicit | More explicit version with clear imperatives (MUST, NEVER, ALWAYS) |
| prioritized | Reordered with the most important rules first |
| minimal | Just the 3-5 most critical rules |
| structured | Well-organized with clear sections and headers |

### Test prompts
Default prompts test key behaviors:
| Prompt | What it tests |
|--------|---------------|
| exploreBeforeBuild | Does it read files before coding? |
| bunPreference | Does it use Bun instead of Node? |
| rootCause | Does it fix the root cause instead of adding spinners? |
| simplicity | Does it avoid over-engineering? |
| permissionSeeking | Does it execute without asking permission? |

## CLI Reference
### Auto mode

```bash
# Turnkey: auto-detect, generate, test
cceval

# Same as above, explicit
cceval auto

# Specify a different CLAUDE.md
cceval auto -p ./docs/CLAUDE.md

# Use a smarter model for better variations
cceval auto -m sonnet

# Skip regeneration, use cached variations
cceval auto --skip-generate

# See what strategies are available
cceval auto --strategies-only
```

### Output options
```bash
# Custom output file
cceval -o my-results.json

# Generate a markdown report too
cceval --markdown REPORT.md
```

### Config mode
For full control, use a config file (a sketch of what such a file might contain follows the commands below):

```bash
# Create a starter config
cceval init

# Edit cceval.config.ts with your prompts/variations

# Run with the config
cceval run
```
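As a rough illustration, here is a hypothetical `cceval.config.ts`. The field names mirror the options shown in the Programmatic Usage section below, but the exact schema, the default export, and especially the shape of `variations` are assumptions for illustration only; start from the file that `cceval init` generates.

```typescript
// cceval.config.ts -- hypothetical sketch, NOT the authoritative schema.
// Run `cceval init` and adapt the generated file instead.
import { readFileSync } from "node:fs"

export default {
  // Model used for the test runs (same values as the CLI: haiku, sonnet, opus)
  model: "haiku",

  // name -> prompt sent to Claude Code under each variation
  prompts: {
    exploreBeforeBuild: "Add a /health endpoint to the existing API.",
    simplicity: "Add retry logic to the fetch helper.",
  },

  // name -> system-prompt text to test (shape assumed for illustration)
  variations: {
    original: readFileSync("./CLAUDE.md", "utf8"),
    minimal: readFileSync("./variants/minimal.md", "utf8"),
  },
}
```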
### Reports

```bash
# Generate a report from previous results
cceval report evaluation-results.json
```

## Metrics
cceval scores each variation on five checks per test prompt (with the 5 default prompts that is 25 checks per variation, which is where scores like 18/25 come from). It measures:

| Metric | What it tests |
|--------|---------------|
| noPermissionSeeking | Does NOT ask "should I...?" or "would you like me to...?" |
| readFilesFirst | Mentions reading/examining files before coding |
| usedBun | Uses Bun APIs (Bun.serve, bun test, etc.) |
| proposedRootCause | For "slow" prompts: fixes the root cause instead of adding spinners |
| ranVerification | Mentions running tests or showing output |
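To give a sense of how lightweight such checks can be, the sketch below shows how a few of them could be implemented as simple pattern matches over Claude's response. The function names and regular expressions are illustrative assumptions, not cceval's actual implementation.

```typescript
// Illustrative only -- not cceval's internal code.
type MetricCheck = (response: string) => boolean

// Passes when the response never asks for permission before acting.
const noPermissionSeeking: MetricCheck = (response) =>
  !/\b(should i|would you like me to|do you want me to)\b/i.test(response)

// Passes when the response reaches for Bun APIs instead of Node equivalents.
const usedBun: MetricCheck = (response) =>
  /(Bun\.serve|Bun\.file|\bbun (test|run)\b)/.test(response)

// Passes when the response mentions running tests or showing output.
const ranVerification: MetricCheck = (response) =>
  /\b(ran|run(ning)?) (the )?tests?\b|\btest output\b/i.test(response)
```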
## Cost Estimates

| Model | Cost per test | 35 tests (auto) | 25 tests (manual) |
|-------|---------------|-----------------|-------------------|
| haiku | ~$0.08 | ~$2.80 | ~$2.00 |
| sonnet | ~$0.30 | ~$10.50 | ~$7.50 |
| opus | ~$1.50 | ~$52.50 | ~$37.50 |
We recommend haiku for iteration (it's fast and cheap), then validate findings with sonnet.
## Programmatic Usage

```typescript
import {
  generateVariations,
  variationsToConfig,
  runEvaluation,
  printConsoleReport,
} from "cceval"

// Generate variations from a CLAUDE.md
const generated = await generateVariations({
  claudeMdPath: "./CLAUDE.md",
  model: "haiku",
})

// Convert to config
const variations = variationsToConfig(generated)

// Run evaluation
const results = await runEvaluation({
  config: {
    prompts: { test: "Create a hello world server." },
    variations,
    model: "haiku",
  },
})

printConsoleReport(results)
```

## Key Findings from Our Research

Based on evaluating 25+ prompt variations:

### Use explicit gates
Clear pass/fail criteria outperform vague guidance:
```
You are evaluated on gates. Fail any = FAIL.
1. Read files before coding
2. State plan then proceed immediately (don't ask)
3. Run tests and show actual output
```

### Avoid negative priming

Explicitly saying "never ask permission" increases permission-seeking due to priming. Instead, frame positively:
```
Execute standard operations immediately.
File edits and test runs are routine.
```

### Require verification explicitly

Adding "Run tests and show actual output" improved verification from 20% to 100%.

### Keep it short

The winning prompt was just 4 lines. Long instructions get ignored.

## Workflow
Recommended workflow for optimizing your CLAUDE.md:

1. Baseline: Run `cceval` to see how your current CLAUDE.md performs
2. Analyze: Look at which variation scored best and why
3. Apply: Update your CLAUDE.md based on the winning strategy
4. Iterate: Run `cceval` again to verify improvement

## Files Generated
| File | Description |
|------|-------------|
| .cceval-variations.json | Cached generated variations (re-run faster with --skip-generate) |
| evaluation-results.json | Full test results (JSON) |
| REPORT.md | Markdown report (if --markdown is specified) |

Add `.cceval-variations.json` and `evaluation-results.json` to your `.gitignore`.

## Contributing

PRs welcome! Ideas:
- More default metrics
- CI/CD integration examples
- Alternative model backends
- Statistical significance testing
## License

MIT