Make AI agents catch the bugs they normally ship. 162 MCP tools across 30 domains: TOON encoding on by default (~40% token savings, opt out with --no-toon), pattern mining (session sequence analysis + risk prediction), git workflow compliance (branch validation, PR checklist review, merge gate enforcement), and more.
```bash
npm install nodebench-mcp
```

```bash
claude mcp add nodebench -- npx -y nodebench-mcp
```
---
Why — What Bare Agents Miss
We benchmarked 9 real production prompts — things like "The LinkedIn posting pipeline is creating duplicate posts" and "The agent loop hits budget but still gets new events" — comparing a bare agent vs one with NodeBench MCP.
| What gets measured | Bare Agent | With NodeBench MCP |
|---|---|---|
| Issues detected before deploy | 0 | 13 (4 high, 8 medium, 1 low) |
| Research findings before coding | 0 | 21 |
| Risk assessments | 0 | 9 |
| Test coverage layers | 1 | 3 (static + unit + integration) |
| Integration failures caught early | 0 | 4 |
| Regression eval cases created | 0 | 22 |
| Quality gate rules enforced | 0 | 52 |
| Deploys blocked by gate violations | 0 | 4 |
| Knowledge entries banked | 0 | 9 |
| Blind spots shipped to production | 26 | 0 |
The bare agent reads the code, implements a fix, runs tests once, and ships. The MCP agent researches first, assesses risk, tracks issues to resolution, runs 3-layer tests, creates regression guards, enforces quality gates, and banks everything as knowledge for next time.
Every additional tool call produces a concrete artifact — an issue found, a risk assessed, a regression guarded — that compounds across future tasks.
---
Who's Using It
Vision engineer — Built agentic vision analysis using GPT 5.2 with Set-of-Mark (SoM) prompting for bounding boxes, similar to Google Gemini 3 Flash's agentic code execution approach. Uses NodeBench's verification pipeline to validate detection accuracy across screenshot variants before shipping model changes.
QA engineer — Transitioned a manual QA workflow website into an AI agent-driven app for a pet care messaging platform. Uses NodeBench's quality gates, verification cycles, and eval runs to ensure the AI agent handles edge cases that manual QA caught but bare AI agents miss.
Both found different subsets of the 162 tools useful — which is why NodeBench ships with 4 --preset levels to load only what you need.
---
How It Works — 3 Real Examples
Example 1: Debug a stuck production queue
You type: "The content queue has 40 items stuck in 'judging' status for 6 hours"
Bare agent: Reads the queue code, finds a potential fix, runs tests, ships.
With NodeBench MCP: The agent runs structured recon and discovers 3 blind spots the bare agent misses:
- No retry backoff on OpenRouter rate limits (HIGH)
- JSON regex match(/\{[\s\S]*\}/) grabs the last } — breaks on multi-object responses (MEDIUM; see the sketch below)
- No timeout on LLM call — hung request blocks entire cron for 15+ min (not detected by unit tests)
All 3 are logged as gaps, resolved, regression-tested, and the patterns banked so the next similar bug is fixed faster.
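To make that MEDIUM finding concrete, here is a minimal sketch of why a greedy brace regex misparses multi-object LLM output. The sample response string is hypothetical; only the regex comes from the finding above.

```typescript
// Hypothetical multi-object LLM response: a reasoning object followed by the verdict.
const response = '{"thought": "checking item 12"} {"verdict": "pass"}';

// Greedy match: [\s\S]* runs to the LAST closing brace, so the capture spans
// both objects and JSON.parse throws on the glued-together text.
const greedy = response.match(/\{[\s\S]*\}/)?.[0];
// greedy === '{"thought": "checking item 12"} {"verdict": "pass"}' -> not valid JSON

// Lazy match stops at the FIRST closing brace, recovering the first well-formed
// object. Still fragile for nested braces; a robust fix is strict response
// formatting or an incremental JSON parser.
const lazy = response.match(/\{[\s\S]*?\}/)?.[0];
console.log(JSON.parse(lazy!)); // { thought: 'checking item 12' }
```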
Example 2: Coordinate parallel subagents
You type: "I launched 3 Claude Code subagents but they keep overwriting each other's changes"
Without NodeBench: Two of the three agents see the same bug and both implement a fix. The third re-investigates what agent 1 already solved. Agent 2 hits its context limit mid-fix and loses work.
With NodeBench MCP: Each subagent calls claim_agent_task to lock its work. Roles are assigned so they don't overlap. Context budget is tracked. Progress notes ensure handoff without starting from scratch.
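A minimal sketch of that lock-then-release discipline from one subagent's point of view, using the MCP TypeScript SDK. The tool names come from NodeBench; the argument names (taskId, role, note) are assumptions, not the published schemas.

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

const client = new Client({ name: "subagent-2", version: "1.0.0" });
await client.connect(
  new StdioClientTransport({ command: "npx", args: ["-y", "nodebench-mcp"] }),
);

// Lock the task so no sibling subagent duplicates the work.
await client.callTool({
  name: "claim_agent_task",
  arguments: { taskId: "fix-overwrite-bug" }, // taskId: assumed field name
});

// Specialize this agent so roles don't overlap.
await client.callTool({
  name: "assign_agent_role",
  arguments: { role: "implementer" }, // role: assumed field name
});

// ...do the work...

// Hand off with a progress note so the next agent doesn't start from scratch.
await client.callTool({
  name: "release_agent_task",
  arguments: { taskId: "fix-overwrite-bug", note: "Patch applied; integration tests pending." },
});
```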
Example 3: Knowledge compounds across tasks
Tasks 1-3 start with zero prior knowledge. By task 9, the agent finds 2+ relevant prior findings before writing a single line of code. Bare agents start from zero every time.
---
Quick Start
Install

```bash
# Claude Code CLI — all 162 tools (TOON encoding on by default for ~40% token savings)
claude mcp add nodebench -- npx -y nodebench-mcp

# Or start with discovery only — 5 tools, agents self-escalate to what they need
claude mcp add nodebench -- npx -y nodebench-mcp --preset meta

# Or start lean — 43 tools, ~70% less token overhead
claude mcp add nodebench -- npx -y nodebench-mcp --preset lite
```
Or add to ~/.claude/settings.json or .claude.json:
```json
{
  "mcpServers": {
    "nodebench": {
      "command": "npx",
      "args": ["-y", "nodebench-mcp"]
    }
  }
}
```
First commands

```
# See what's available
> Use getMethodology("overview") to see all workflows

# Before your next task — search for prior knowledge
> Use search_all_knowledge("what I'm about to work on")

# Run the full verification pipeline on a change
> Use getMethodology("mandatory_flywheel") and follow the 6 steps
```
Optional API keys

```bash
export GEMINI_API_KEY="your-key"   # Web search + vision (recommended)
export GITHUB_TOKEN="your-token"   # GitHub (higher rate limits)
```
GAIA capability benchmark (optional)

NodeBench MCP treats tools as access to capability. To measure the real lift, we benchmark baseline (LLM-only) vs tool-augmented accuracy on the GAIA benchmark (a gated dataset).
Notes:
- GAIA fixtures and attachments are written under .cache/gaia (gitignored). Do not commit GAIA content.
- Fixture generation requires HF_TOKEN or HUGGINGFACE_HUB_TOKEN.
Web lane (web_search + fetch_url):
```bash
npm run mcp:dataset:gaia:capability:refresh
NODEBENCH_GAIA_CAPABILITY_TASK_LIMIT=6 NODEBENCH_GAIA_CAPABILITY_CONCURRENCY=1 npm run mcp:dataset:gaia:capability:test
```
File-backed lane (PDF / XLSX / CSV / DOCX / PPTX / JSON / JSONL / TXT / ZIP via local_file tools):
```bash
npm run mcp:dataset:gaia:capability:files:refresh
NODEBENCH_GAIA_CAPABILITY_TASK_LIMIT=6 NODEBENCH_GAIA_CAPABILITY_CONCURRENCY=1 npm run mcp:dataset:gaia:capability:files:test
```
Modes:
- Stable: NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=rag
- More realistic: NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=agent
Note:
- ZIP attachments require NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=agent (multi-step extract → parse).
---
What You Get
Core methodology tools
| When you... | Use this | Impact |
|---|---|---|
| Start any task | search_all_knowledge | Find prior findings — avoid repeating past mistakes |
| Research before coding | run_recon + log_recon_finding | Structured research with surfaced findings |
| Assess risk before acting | assess_risk | Risk tier determines if action needs confirmation |
| Track implementation | start_verification_cycle + log_gap | Issues logged with severity, tracked to resolution |
| Test thoroughly | log_test_result (3 layers) | Static + unit + integration vs running tests once |
| Guard against regression | start_eval_run + record_eval_result | Eval cases that protect this fix in the future |
| Gate before deploy | run_quality_gate | Boolean rules enforced — violations block deploy |
| Bank knowledge | record_learning | Persisted findings compound across future sessions |
| Verify completeness | run_mandatory_flywheel | 6-step minimum — catches dead code and intent mismatches |
Parallel agent tools
| When you... | Use this | Impact |
|---|---|---|
| Prevent duplicate work | claim_agent_task / release_agent_task | Task locks — each task owned by exactly one agent |
| Specialize agents | assign_agent_role | 7 roles: implementer, test_writer, critic, etc. |
| Track context usage | log_context_budget | Prevents context exhaustion mid-fix |
| Validate against reference | run_oracle_comparison | Compare output against known-good oracle |
| Orient new sessions | get_parallel_status | See what all agents are doing and what's blocked |
| Bootstrap any repo | bootstrap_parallel_agents | Auto-detect gaps, scaffold coordination infra |
Research, web, and vision tools
| When you... | Use this | Impact |
|---|---|---|
| Search the web | web_search | Gemini/OpenAI/Perplexity — latest docs and updates |
| Fetch a URL | fetch_url | Read any page as clean markdown |
| Find GitHub repos | search_github + analyze_repo | Discover and evaluate libraries and patterns |
| Analyze screenshots | analyze_screenshot | AI vision (Gemini/GPT-4o/Claude) for UI QA |
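For example (call shapes are illustrative, not the published schemas):

```
> web_search("MCP toolset gating best practices")
> fetch_url("https://modelcontextprotocol.io/docs/concepts/tools")
> analyze_screenshot("checkout.png", "Is the primary CTA visible above the fold?")
```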
---
Progressive Discovery
162 tools is a lot. The progressive disclosure system helps agents find exactly what they need:
Tool discovery

```
> discover_tools("verify my implementation")
```
The discover_tools search engine scores tools using 10 parallel strategies:
| Strategy | What it does | Example |
|---|---|---|
| Keyword | Exact/partial word matching on name, tags, description | "benchmark" → benchmark_models |
| Fuzzy | Levenshtein distance — tolerates typos | "verifiy" → start_verification_cycle |
| N-gram | Trigram similarity for partial words | "screen" → capture_ui_screenshot |
| Prefix | Matches tool name starts | "cap" → capture_* tools |
| Semantic | Synonym expansion (30 word families) | "check" also finds "verify", "validate" |
| TF-IDF | Rare tags score higher than common ones | "c-compiler" scores higher than "test" |
| Regex | Pattern matching | "^run_.*loop$" → run_closed_loop |
| Bigram | Phrase matching | "quality gate" matched as unit |
| Domain boost | Related categories boosted together | verification + quality_gate cluster |
| Dense | TF-IDF cosine similarity for vector-like ranking | "audit compliance" surfaces related tools |
7 search modes: hybrid (default, all strategies), fuzzy, regex, prefix, semantic, exact, dense
Pass explain: true to see exactly which strategies contributed to each score.
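As a rough illustration of how two of these strategies can blend into one hybrid score, here is a toy scorer (keyword + fuzzy only; this is not NodeBench's implementation, and the weights are made up):

```typescript
// Toy hybrid scorer blending exact-keyword hits with fuzzy (edit-distance)
// similarity, the first two strategies in the table. Illustrative only.
function levenshtein(a: string, b: string): number {
  const d = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)),
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      d[i][j] = Math.min(
        d[i - 1][j] + 1, // deletion
        d[i][j - 1] + 1, // insertion
        d[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
      );
    }
  }
  return d[a.length][b.length];
}

function hybridScore(query: string, toolName: string, tags: string[]): number {
  const q = query.toLowerCase();
  const keyword = [toolName, ...tags].some((t) => t.includes(q)) ? 1 : 0;
  const fuzzy = 1 - levenshtein(q, toolName) / Math.max(q.length, toolName.length);
  return 0.6 * keyword + 0.4 * fuzzy; // weights are illustrative
}

// "verifiy" gets zero keyword signal but a nonzero fuzzy score,
// so start_verification_cycle still ranks above unrelated tools.
hybridScore("verifiy", "start_verification_cycle", ["verification"]);
```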
Per-tool quick reference
Every tool response auto-appends a _quickRef with:
- nextAction: What to do immediately after this tool
- nextTools: Recommended follow-up tools
- methodology: Which methodology guide to consult
- tip: Practical usage advice
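An illustrative (not verbatim) response tail:

```json
{
  "result": "...tool output...",
  "_quickRef": {
    "nextAction": "Log each finding with log_recon_finding",
    "nextTools": ["log_recon_finding", "assess_risk"],
    "methodology": "recon",
    "tip": "Run recon before writing any code."
  }
}
```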
Call get_tool_quick_ref("tool_name") for any tool's guidance.
Workflow chains
24 pre-built chains for common workflows:
| Chain | Steps | Use case |
|---|---|---|
| new_feature | 12 | End-to-end feature development |
| fix_bug | 6 | Structured debugging |
| ui_change | 7 | Frontend with visual verification |
| parallel_project | 7 | Multi-agent coordination |
| research_phase | 8 | Context gathering |
| academic_paper | 7 | Paper writing pipeline |
| c_compiler_benchmark | 10 | Autonomous capability test |
| security_audit | 9 | Comprehensive security assessment |
| code_review | 8 | Structured code review |
| deployment | 8 | Ship with full verification |
| migration | 10 | SDK/framework upgrade |
| coordinator_spawn | 10 | Parallel coordinator setup |
| self_setup | 8 | Agent self-onboarding |
| flicker_detection | 7 | Android flicker analysis |
| figma_flow_analysis | 5 | Figma prototype flow audit |
| agent_eval | 9 | Evaluate agent performance |
| contract_compliance | 5 | Check agent contract adherence |
| ablation_eval | 10 | Ablation experiment design |
| session_recovery | 6 | Recover context after compaction |
| attention_refresh | 4 | Reload bearings mid-session |
| task_bank_setup | 9 | Create evaluation task banks |
| pr_review | 5 | Pull request review |
| seo_audit | 6 | Full SEO audit |
| voice_pipeline | 6 | Voice pipeline implementation |
Call get_workflow_chain("new_feature") to get the step-by-step sequence.
Project boilerplate
Start new projects with everything pre-configured:
```bash
gh repo create my-project --template HomenShum/nodebench-boilerplate --clone
cd my-project && npm install
```
Or use the scaffold tool: scaffold_nodebench_project creates AGENTS.md, .mcp.json, package.json, CI, Docker, and parallel agent infra.
---
The Methodology Pipeline
NodeBench MCP isn't just a bag of tools — it's a pipeline. Each step feeds the next:
```
Research → Risk → Implement → Test (3 layers) → Eval → Gate → Learn → Ship
    ↑                                                                  │
    └──────────────────────── knowledge compounds ─────────────────────┘
```
Inner loop (per change): 6-phase verification ensures correctness.
Outer loop (over time): Eval-driven development ensures improvement.
Together: The AI Flywheel — every verification produces eval artifacts, every regression triggers verification.
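A hedged sketch of one inner-loop pass expressed as tool calls. The tool names come from the tables above; the argument shapes are assumptions, not the published schemas.

```typescript
import type { Client } from "@modelcontextprotocol/sdk/client/index.js";

// Assume an already-connected MCP client (see the parallel-agent sketch earlier).
declare const client: Client;

const call = (name: string, args: Record<string, unknown>) =>
  client.callTool({ name, arguments: args });

// Research → Risk: gather prior knowledge and assess before touching code.
await call("search_all_knowledge", { query: "content queue stuck in judging" });
await call("run_recon", { topic: "content queue pipeline" });
await call("assess_risk", { change: "add retry backoff to OpenRouter calls" });

// Implement → Test (3 layers): track the change through a verification cycle.
await call("start_verification_cycle", { task: "fix stuck queue" });
await call("log_test_result", { layer: "static" });
await call("log_test_result", { layer: "unit" });
await call("log_test_result", { layer: "integration" });

// Eval → Gate → Learn: guard the fix, gate the deploy, bank the finding.
await call("start_eval_run", { name: "queue-regression" });
await call("run_quality_gate", { preset: "deploy" });
await call("record_learning", { finding: "OpenRouter 429s need exponential backoff" });
```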
Ask the agent: Use getMethodology("overview") to see all 20 methodology topics.
---
Parallel Agents with Claude Code
Based on Anthropic's "Building a C Compiler with Parallel Claudes" (Feb 2026).
When to use: Only when running 2+ agent sessions. Single-agent workflows use the standard pipeline above.
How it works with Claude Code's Task tool:
1. COORDINATOR (your main session) breaks work into independent tasks
2. Each Task tool call spawns a subagent with instructions to:
- claim_agent_task — lock the task
- assign_agent_role — specialize (implementer, test_writer, critic, etc.)
- Do the work
- release_agent_task — handoff with progress note
3. Coordinator calls get_parallel_status to monitor all subagents
4. Coordinator runs run_quality_gate on the aggregate result
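A sketch of what each spawned Task prompt can look like (wording is illustrative):

```
You are subagent 2 of 3 working on the auth refactor.
Before touching code:
  1. claim_agent_task("auth-refactor/token-rotation")
  2. assign_agent_role("test_writer")
Do only your slice, then release_agent_task with a progress note.
Do not modify files outside your claimed task.
```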
MCP Prompts available:
- claude-code-parallel — Step-by-step Claude Code subagent coordination
- parallel-agent-team — Full team setup with role assignment
- oracle-test-harness — Validate outputs against known-good reference
- bootstrap-parallel-agents — Scaffold parallel infra for any repo
---
Toolset Gating
162 tools means tens of thousands of tokens of schema per API call. If you only need core methodology, gate the toolset:
Presets
| Preset | Tools | Domains | Use case |
|---|---|---|---|
| meta | 5 | 0 | Discovery-only front door — agents start here and self-escalate via discover_tools |
| lite | 43 | 8 | Core methodology — verification, eval, flywheel, learning, recon, security, boilerplate |
| core | 114 | 22 | Full workflow — adds bootstrap, self-eval, llm, platform, research_writing, flicker_detection, figma_flow, benchmark, session_memory, toon, pattern, git_workflow, seo, voice_bridge |
| full | 162 | 30 | Everything — adds vision, UI capture, web, GitHub, docs, parallel, local files, GAIA solvers |
```bash
# Meta — 5 tools (discovery-only: findTools, getMethodology, discover_tools, get_tool_quick_ref, get_workflow_chain)
# Agents start here and self-escalate to the tools they need
claude mcp add nodebench -- npx -y nodebench-mcp --preset meta

# Lite — 43 tools (verification, eval, flywheel, learning, recon, security, boilerplate + meta + discovery)
claude mcp add nodebench -- npx -y nodebench-mcp --preset lite

# Core — 114 tools (adds bootstrap, self-eval, llm, platform, research_writing, flicker_detection, figma_flow, benchmark, session_memory, toon, pattern, git_workflow, seo, voice_bridge + meta + discovery)
claude mcp add nodebench -- npx -y nodebench-mcp --preset core

# Full — all 162 tools (default, TOON encoding on by default)
claude mcp add nodebench -- npx -y nodebench-mcp
```
Or in config:
```json
{
  "mcpServers": {
    "nodebench": {
      "command": "npx",
      "args": ["-y", "nodebench-mcp", "--preset", "meta"]
    }
  }
}
```
Custom toolset selection

```bash
# Include only specific toolsets
npx nodebench-mcp --toolsets verification,eval,recon

# Exclude heavy optional-dep toolsets
npx nodebench-mcp --exclude vision,ui_capture,parallel

# See all toolsets and presets
npx nodebench-mcp --help
```
All 30 toolsets
| Toolset | Tools | What it covers |
|---|---|---|
| verification | 8 | Cycles, gaps, triple-verify, status |
| eval | 6 | Eval runs, results, comparison, diff |
| quality_gate | 4 | Gates, presets, history |
| learning | 4 | Knowledge, search, record |
| recon | 7 | Research, findings, framework checks, risk |
| flywheel | 4 | Mandatory flywheel, promote, investigate |
| bootstrap | 11 | Project setup, agents.md, self-implement, autonomous, test runner |
| self_eval | 9 | Trajectory analysis, health reports, task banks, grading, contract compliance |
| parallel | 13 | Task locks, roles, context budget, oracle, agent mailbox (point-to-point + broadcast) |
| vision | 4 | Screenshot analysis, UI capture, diff |
| ui_capture | 2 | Playwright-based capture |
| web | 2 | Web search, URL fetch |
| github | 3 | Repo search, analysis, monitoring |
| docs | 4 | Documentation generation, reports |
| local_file | 19 | Deterministic parsing (CSV/XLSX/PDF/DOCX/PPTX/ZIP/JSON/JSONL/TXT/OCR/audio) |
| llm | 3 | LLM calling, extraction, benchmarking |
| security | 3 | Dependency scanning, code analysis, terminal security scanning |
| platform | 4 | Convex bridge: briefs, funding, research, publish |
| research_writing | 8 | Academic paper polishing, translation, de-AI, logic check, captions, experiment analysis, reviewer simulation |
| flicker_detection | 5 | Android flicker detection + SSIM tooling |
| figma_flow | 4 | Figma flow analysis + rendering |
| boilerplate | 2 | Scaffold NodeBench projects + status |
| benchmark | 3 | Autonomous benchmark lifecycle (C-compiler pattern) |
| session_memory | 3 | Compaction-resilient notes, attention refresh, context reload |
| gaia_solvers | 6 | GAIA media image solvers (red/green deviation, polygon area, fraction quiz, bass clef, storage cost) |
| toon | 2 | TOON encode/decode — Token-Oriented Object Notation (~40% token savings) |
| pattern | 2 | Session pattern mining + risk prediction from historical sequences |
| git_workflow | 3 | Branch compliance, PR checklist review, merge gate enforcement |
| seo | 5 | Technical SEO audit, page performance, content analysis, WordPress detection + updates |
| voice_bridge | 4 | Voice pipeline design, config analysis, scaffold generation, latency benchmarking |
Always included (regardless of gating) — these 5 tools form the meta preset:
- Meta: findTools, getMethodology
- Discovery: discover_tools, get_tool_quick_ref, get_workflow_chain
The meta preset loads only these 5 tools (0 domain tools). Agents use discover_tools to find what they need and self-escalate.
TOON encoding
TOON (Token-Oriented Object Notation) is on by default since v2.14.1. Every tool response is TOON-encoded for ~40% fewer tokens vs JSON. Disable with --no-toon if your client can't handle non-JSON responses.
```bash
# TOON on (default)
claude mcp add nodebench -- npx -y nodebench-mcp

# TOON off
claude mcp add nodebench -- npx -y nodebench-mcp --no-toon
```
Use the toon_encode and toon_decode tools to convert between TOON and JSON in your own workflows.
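For intuition, here is a hedged round-trip sketch. The call shape is an assumption, and the TOON output shown is illustrative of the format's tabular style, not guaranteed byte-for-byte:

```
> toon_encode({"users":[{"id":1,"name":"Ada"},{"id":2,"name":"Bob"}]})

users[2]{id,name}:
  1,Ada
  2,Bob

> toon_decode("users[2]{id,name}: ...")   # recovers the original JSON
```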
---
Build from Source
```bash
git clone https://github.com/HomenShum/nodebench-ai.git
cd nodebench-ai/packages/mcp-local
npm install && npm run build
```
Then use absolute path:
```json
{
  "mcpServers": {
    "nodebench": {
      "command": "node",
      "args": ["/path/to/packages/mcp-local/dist/index.js"]
    }
  }
}
```
---
Troubleshooting
"No search provider available" — Set GEMINI_API_KEY, OPENAI_API_KEY, or PERPLEXITY_API_KEY
"GitHub API error 403" — Set GITHUB_TOKEN for higher rate limits
"Cannot find module" — Run npm run build in the mcp-local directory
MCP not connecting — Check path is absolute, run claude --mcp-debug`, ensure Node.js >= 18