A friendly, developer-first CLI framework for evaluating agentic coding tools
## Installation

youBencha is distributed via npm (`npm install youbencha`). To run test cases you also need the CLI of at least one supported agent installed (e.g., `npm install -g @anthropic-ai/claude-code` for Claude Code).

```bash
# Install globally
npm install -g youbencha

# Or install locally in your project
npm install --save-dev youbencha
```
### Local Development
If you're developing youBencha locally:
```bash
# Build the project
npm run build

# Link globally to use the yb command
npm link
```
This creates a global symlink to your local package, making the yb command available system-wide. Any changes you make require rebuilding (npm run build) to take effect.
To unlink later:
```bash
npm unlink -g youbencha
```
## Quick Start
New to youBencha? Check out the Getting Started Guide for a detailed walkthrough.
### 1. Install youBencha
```bash
npm install -g youbencha
```
### 2. Create a Test Case
youBencha supports both YAML and JSON formats for configuration files.
**Option A: YAML format** (`testcase.yaml`)

```yaml
name: "README Comment Addition"
description: "Tests the agent's ability to add a helpful comment explaining the repository purpose"
repo: https://github.com/youbencha/hello-world.git
branch: main

agent:
  type: copilot-cli
  config:
    prompt: "Add a comment to README explaining what this repository is about"

evaluators:
  - name: git-diff
  - name: agentic-judge
    config:
      type: copilot-cli
      agent_name: agentic-judge
      assertions:
        readme_modified: "README.md was modified. Score 1 if true, 0 if false."
        helpful_comment_added: "A helpful comment was added to README.md. Score 1 if true, 0 if false."
```
**Option B: JSON format** (`testcase.json`)

```json
{
  "name": "README Comment Addition",
  "description": "Tests the agent's ability to add a helpful comment explaining the repository purpose",
  "repo": "https://github.com/youbencha/hello-world.git",
  "branch": "main",
  "agent": {
    "type": "copilot-cli",
    "config": {
      "prompt": "Add a comment to README explaining what this repository is about"
    }
  },
  "evaluators": [
    { "name": "git-diff" },
    {
      "name": "agentic-judge",
      "config": {
        "type": "copilot-cli",
        "agent_name": "agentic-judge",
        "assertions": {
          "readme_modified": "README.md was modified. Score 1 if true, 0 if false.",
          "helpful_comment_added": "A helpful comment was added to README.md. Score 1 if true, 0 if false."
        }
      }
    }
  ]
}
```
> 💡 **Tip**: Both formats support the same features and are validated using the same schema. Choose the format that best fits your workflow or existing tooling.
> 📝 **Prompt Files**: Instead of inline prompts, you can load them from external files using `prompt_file: ./path/to/prompt.md`. See the Prompt Files Guide for details.
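For example, here is a minimal sketch of an agent block that loads its prompt from a file, assuming `prompt_file` is accepted in the same position as `prompt` (the path is illustrative):

```yaml
agent:
  type: copilot-cli
  config:
    # Hypothetical prompt file, loaded instead of an inline prompt
    prompt_file: ./prompts/add-readme-comment.md
```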
### 3. Run the Test Case
```bash
# Using YAML format
yb run -c testcase.yaml

# Or using JSON format
yb run -c testcase.json

# See the examples directory for more configurations
yb run -c examples/testcase-simple.yaml
yb run -c examples/testcase-simple.json
```
The workspace is kept by default for inspection. Add --delete-workspace to clean up after completion.
### 4. Generate a Report
```bash
yb report --from .youbencha-workspace/run-*/artifacts/results.json
```
That's it! youBencha will clone the repo, run the agent, evaluate the output, and generate a comprehensive report.
---
## What Makes youBencha Different?
- Agent-Agnostic: Works with any AI coding agent through pluggable adapters
- Reproducible: Standardized logging captures complete execution context
- Flexible Evaluation: Use built-in evaluators or create custom ones
- Developer-Friendly: Clear error messages, helpful CLI, extensive examples
- Comprehensive Reports: From metrics to human-readable insights
## Configuration
youBencha supports configuration files to customize default behavior and define reusable variables.
### Quick Setup
```bash
# Create project-level config (.youbencharc)
yb config init

# Create user-level config (~/.youbencharc)
yb config init --global

# View current configuration
yb config list

# Set a configuration value
yb config set workspace_dir /tmp/my-workspace
```
### What You Can Configure
- Default Workspace Location: Set default workspace and output directories
- Variable Substitution: Define reusable variables for test case configs
- Default Timeouts: Configure default timeouts for agents and operations
- Environment-Specific Settings: Separate user and project configurations
See the Configuration Guide for complete documentation.
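For orientation, here is a sketch of what a `.youbencharc` might contain. The YAML syntax and the `variables` key are assumptions; `workspace_dir`, `log_level`, and `agent.timeout_ms` are the settings shown in the `yb config` examples, and the Configuration Guide remains the authoritative reference.

```yaml
# Hypothetical .youbencharc sketch; keys beyond the documented settings are illustrative
workspace_dir: /tmp/youbencha-workspaces   # default workspace location
log_level: info

agent:
  timeout_ms: 600000                       # default agent timeout (dot path: agent.timeout_ms)

variables:                                 # assumed key for reusable test case variables
  target_repo: https://github.com/youbencha/hello-world.git
```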
## Commands
### `yb run`
Run a test case with agent execution.
```bash
yb run -c <config-file>
```
### `yb eval`
Run evaluators on existing directories without executing an agent. Useful for re-evaluating outputs, testing evaluators, or evaluating manual changes.
```bash
yb eval -c <config-file>
```
Use cases:
- Re-evaluate agent outputs with different evaluator configurations
- Evaluate manual code changes using youBencha's evaluators
- Test custom evaluators during development
- CI/CD integration with other tools
- Comparative analysis of multiple outputs
See the Eval Command Guide for detailed documentation.
### `yb report`
Generate a report from evaluation results.
```bash
yb report --from <path> [--format <format>] [--output <path>]

Options:
  --from      Path to results JSON file (required)
  --format    Report format: json, markdown (default: markdown)
  --output    Output path (optional)
```
### `yb config`
Manage youBencha configuration files.
```bash
# Initialize configuration
yb config init             # Create .youbencharc in current directory
yb config init --global    # Create ~/.youbencharc in home directory

# View configuration
yb config list                   # Show all settings
yb config get workspace_dir      # Get specific setting
yb config get agent.timeout_ms   # Get nested setting (dot notation)

# Modify configuration
yb config set workspace_dir /tmp/ws   # Set a value
yb config set log_level debug         # Values auto-convert to correct type
yb config unset workspace_dir         # Remove a setting
```
Use Cases:
- Set default workspace location for all projects
- Define reusable variables for test case configurations
- Configure default timeouts and concurrency settings
- Separate user preferences from project settings
See the Configuration Guide for complete documentation.
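As an illustration of variable substitution, a test case could then reference a variable defined in configuration. The `${...}` syntax here is an assumption modeled on the post-evaluation examples later in this README; consult the Configuration Guide for the exact form.

```yaml
# testcase.yaml sketch; target_repo is a hypothetical variable defined in .youbencharc
repo: ${target_repo}
branch: main

agent:
  type: copilot-cli
  config:
    prompt: "Add a comment to README explaining what this repository is about"
```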
### `yb suggest-testcase`
Generate test case suggestions using AI agent interaction.
```bash
yb suggest-testcase --agent <agent-type> --output-dir <path> [--agent-file <path>]

Options:
  --agent       Agent tool to use (e.g., copilot-cli) (required)
  --output-dir  Path to successful agent output folder (required)
  --agent-file  Custom agent file (default: agents/suggest-testcase.agent.md)
  --save        Path to save generated test case (optional)
```
Interactive Workflow:
The suggest-testcase command launches an interactive AI agent session that:
1. Analyzes your agent's output folder
2. Asks about your baseline/source for comparison
3. Requests your original instructions/intent
4. Detects patterns in the changes (auth, tests, API, docs, etc.)
5. Recommends appropriate evaluators with reasoning
6. Generates a complete test case configuration
Example Session:
```bash
$ yb suggest-testcase --agent copilot-cli --output-dir ./my-feature

🤖 Launching interactive agent session...

Agent: What branch should I use as the baseline for comparison?
You: main

Agent: What were the original instructions you gave to the agent?
You: Add JWT authentication with rate limiting and comprehensive error handling

Agent: I've analyzed the changes and detected:
- Authentication/security code patterns
- New test files added
- Error handling patterns

Here's your suggested testcase.yaml:
[Generated test case configuration with reasoning]

To use this test case:
1. Save as 'testcase.yaml' in your project
2. Run: yb run -c testcase.yaml
3. Review evaluation results
```
Use Cases:
- After successful agent work - Generate test case for validation
- Quality assurance - Ensure agent followed best practices
- Documentation - Understand what evaluations are appropriate
- Learning - See how different changes map to evaluators
## Expected Reference Comparison
youBencha supports comparing agent outputs against an expected reference branch. This is useful when you have a "correct" or "ideal" implementation to compare against.
### Configuration
Add an expected reference to your test case configuration:
```yaml
name: "Feature Implementation"
description: "Tests the agent's ability to implement a feature matching the reference implementation"
repo: https://github.com/youbencha/hello-world.git
branch: main

expected_source: branch
expected: feature/completed   # The reference branch

agent:
  type: copilot-cli
  config:
    prompt: "Implement the feature"

evaluators:
  - name: expected-diff
    config:
      threshold: 0.80   # Require 80% similarity to pass
```
### Similarity Thresholds
The threshold determines how similar the agent output must be to the expected reference:
- 1.0 (100%) - Exact match (very strict)
- 0.9-0.99 - Very similar with minor differences (strict)
- 0.7-0.89 - Mostly similar with moderate differences (balanced)
- <0.7 - Significantly different (lenient)
Recommended thresholds:
- 0.95+ for generated files (e.g., migrations, configs)
- 0.80-0.90 for implementation code
- 0.70-0.80 for creative tasks with multiple valid solutions
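For instance, a strict check on a reference branch of generated files could look like this, with the threshold value taken from the guidance above:

```yaml
evaluators:
  - name: expected-diff
    config:
      # Strict: generated files (migrations, configs) should match the reference closely
      threshold: 0.95
```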
### Use Cases
1. Test-Driven Development

```yaml
expected: tests-implemented
# Compare agent implementation against expected test-driven approach
```

2. Refactoring Verification

```yaml
expected: refactored-solution
# Ensure agent refactoring matches expected improvements
```

3. Bug Fix Validation

```yaml
expected: bug-fixed
# Compare agent's bug fix with known correct fix
```
### Report Output
The expected-diff evaluator provides:
- Aggregate Similarity: Overall similarity score (0.0 to 1.0)
- File-level Details: Individual similarity for each file
- Status Counts: matched, changed, added, removed files
Example report section:
```markdown
#### Summary

| Metric | Value |
|--------|-------|
| Aggregate Similarity | 85.0% |
| Threshold | 80.0% |
| Files Matched | 5 |
| Files Changed | 2 |
| Files Added | 0 |
| Files Removed | 0 |

#### File-level Details

| File | Similarity | Status |
|------|-----------|--------|
| src/main.ts | 75.0% | 🔄 changed |
| src/utils.ts | 100.0% | ✅ matched |
```
## Built-in Evaluators
### git-diff
Analyzes Git changes made by the agent with assertion-based pass/fail thresholds.
Metrics: files_changed, lines_added, lines_removed, total_changes, change_entropy
Supported Assertions:
- max_files_changed - Maximum number of files that can be changed
- max_lines_added - Maximum number of lines that can be added
- max_lines_removed - Maximum number of lines that can be removed
- max_total_changes - Maximum total changes (additions + deletions)
- min_change_entropy - Minimum entropy (enforces distributed changes)
- max_change_entropy - Maximum entropy (enforces focused changes)
Example:
```yaml
evaluators:
  - name: git-diff
    config:
      assertions:
        max_files_changed: 5
        max_lines_added: 100
        max_change_entropy: 2.0   # Keep changes focused
```
### expected-diff
Compares agent output against expected reference branch.
Metrics: aggregate_similarity, threshold, files_matched, files_changed, files_added, files_removed, file_similarities
Requires: expected_source and expected configured in test case
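A minimal test case fragment wiring these together, mirroring the expected reference example above:

```yaml
expected_source: branch
expected: feature/completed   # reference branch to compare against

evaluators:
  - name: expected-diff
    config:
      threshold: 0.80           # pass at 80% aggregate similarity
```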
### agentic-judge
Uses an AI agent to evaluate code quality based on custom assertions. The agent reads files, searches for patterns, and makes judgments like a human reviewer.
Features:
- Evaluates custom assertions as pass/fail
- Supports multiple independent judges for different areas
- Each judge maintains focused context (1-3 assertions recommended)
Metrics: Custom metrics based on your assertions
Multiple Judges: You can define multiple agentic-judge evaluators to break down evaluation into focused areas:
```yaml
evaluators:
  # Judge 1: Error Handling
  - name: agentic-judge-error-handling
    config:
      type: copilot-cli
      agent_name: agentic-judge
      assertions:
        has_try_catch: "Code includes try-catch blocks. Score 1 if present, 0 if absent."
        errors_logged: "Errors are properly logged. Score 1 if logged, 0 if not."

  # Judge 2: Documentation
  - name: agentic-judge-documentation
    config:
      type: copilot-cli
      agent_name: agentic-judge
      assertions:
        functions_documented: "Functions have JSDoc. Score 1 if documented, 0 if not."
```
Naming Convention: Use `agentic-judge-<suffix>` or `agentic-judge:<suffix>` to create specialized judges.
See docs/multiple-agentic-judges.md for a detailed guide.
## Development
### Setup
```bash
# Clone repository
git clone https://github.com/yourusername/youbencha.git
cd youbencha

# Install dependencies
npm install

# Build
npm run build

# Run tests
npm test

# Run with coverage
npm test -- --coverage
```
## Post-Evaluators: Exporting and Analyzing Results
Post-evaluations run after evaluation completes, enabling you to export results to external systems, run custom analysis, or trigger downstream workflows.
### Built-in Post-Evaluators
1. Database Export - Append results to JSONL file for time-series analysis
```yaml
post_evaluation:
  - name: database
    config:
      type: json-file
      output_path: ./results-history.jsonl
      include_full_bundle: true
      append: true
```
2. Webhook - POST results to HTTP endpoint
```yaml
post_evaluation:
  - name: webhook
    config:
      url: ${SLACK_WEBHOOK_URL}
      method: POST
      headers:
        Content-Type: "application/json"
      retry_on_failure: true
      timeout_ms: 5000
```
3. Custom Script - Execute custom analysis or integration
```yaml
post_evaluation:
  - name: script
    config:
      command: ./scripts/notify-slack.sh
      args:
        - "${RESULTS_PATH}"
      env:
        SLACK_WEBHOOK_URL: "${SLACK_WEBHOOK_URL}"
      timeout_ms: 30000
```
### Analysis Patterns
Single Result: Immediate feedback on one evaluation
- Quick validation during prompt engineering
- Debugging agent failures
- Understanding scope of changes
Suite of Results: Cross-test comparison
- Identify difficult tasks
- Compare agent configurations
- Aggregate metrics and pass rates
Results Over Time: Regression detection and trends
- Track performance changes across model/prompt updates
- Cost optimization and ROI tracking
- Long-term quality trends
### Example Scripts
See examples/scripts/ for ready-to-use scripts:
- notify-slack.sh - Post results to Slack
- analyze-trends.sh - Analyze time-series data
- detect-regression.sh - Compare last two runs
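These scripts can be wired in as post-evaluations. For example, a sketch that appends every run to a history file and then checks the last two runs for regressions (the paths and the argument passed to the script are illustrative):

```yaml
post_evaluation:
  # Append each run to a JSONL history file for trend analysis
  - name: database
    config:
      type: json-file
      output_path: ./results-history.jsonl
      append: true

  # Then compare the last two runs (see examples/scripts/detect-regression.sh)
  - name: script
    config:
      command: ./examples/scripts/detect-regression.sh
      args:
        - "${RESULTS_PATH}"   # assumed argument; adjust to the script's interface
```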
### Documentation
- Getting Started Guide - Comprehensive walkthrough for new users
- Post-Evaluation Guide - Complete reference for post-evaluation hooks
- Analyzing Results Guide - Analysis patterns and best practices
- Prompt Files Guide - Loading prompts from external files
- Reusable Evaluators Guide - Sharing evaluator configurations
- Multiple Agentic Judges Guide - Using multiple focused evaluators
- Claude Code Adapter - Using Claude Code as an agent
### Project Structure
```
src/
  adapters/           - Agent adapters
  cli/                - CLI commands
  core/               - Core orchestration logic
  evaluators/         - Built-in evaluators
  post-evaluations/   - Post-evaluation exporters
  lib/                - Utility libraries
  reporters/          - Report generators
  schemas/            - Zod schemas for validation
tests/
  contract/           - Contract tests
  integration/        - Integration tests
  unit/               - Unit tests
```
## Architecture
youBencha follows a pluggable architecture:
- Agent-Agnostic: Agent-specific logic isolated in adapters
- Pluggable Evaluators: Add new evaluators without core changes
- Reproducible: Complete execution context captured
- youBencha Log Compliance: Normalized logging format across agents
## Security Considerations
### Important Warnings
Before running evaluations:
1. Test case configurations execute code: Only run test case configurations from trusted sources
2. Agent file system access: Agents have full access to the workspace directory
3. Isolation strongly recommended: Run evaluations in containers or VMs for untrusted code
4. Repository cloning: Repository URLs are validated, but exercise caution with private repos
### Recommended Isolation
We recommend running youBencha in isolated environments:
```bash
# Docker example
docker run -it --rm \
  -v $(pwd):/workspace \
  -w /workspace \
  node:20 \
  npx youbencha run -c testcase.yaml

# Or use dedicated CI/CD runners
```