# RBP Stack — Ralph + Beads + PAI

Autonomous task execution with test-gated verification.

```bash
npm install rbp-stack
```

The first autonomous Epic implementation system that prevents AI agents from lying about task completion.
---
You give an AI agent an Epic. It returns "done" with all checkboxes marked complete.
Then you look at the code.
- Tests were never run
- The UI doesn't render
- Half the subtasks were skipped
- There's no audit trail
Sound familiar?
You trusted the agent. The agent lied.
> "We spent 3 months building an AI-powered development workflow. 76 stories later, we discovered a painful truth: agents mark tasks 'complete' without doing the work. Checkboxes are just booleans. There's no proof."
---
After months of frustration, we discovered something simple:
A checkbox is self-reported. A test is objective verification.
If `bun test` fails, the lie is exposed. Period.

So we built a system around one unbreakable rule: no task closes without passing tests.
---
## Ralph + Beads + PAI

A verification-first autonomous development system.
| Component | Role |
|:----------|:-----|
| Ralph | Autonomous execution loop that never stops until done |
| Beads | Git-backed task graph — the single source of truth |
| Tests | The gatekeeper that agents cannot bypass |
```
Workflow A (BMAD):
Epic → BMAD Story → Beads → Ralph Loop → Verified Code

Workflow B (Quick-Plan):
Feature Idea → /quick-plan → Spec → Codex Review → Beads → Ralph Loop → Verified Code

Both workflows use the same gatekeeper:

close-with-proof.sh
        ↓
Tests pass? → Close task
Tests fail? → Keep trying
```

From requirements to verified code. No human intervention required.
---
## 📺 Demo

Watch Ralph implement a feature autonomously:

```bash
# 1. Convert your story to beads
./scripts/rbp/parse-story-to-beads.sh docs/stories/story-001.md
```

GIF coming soon — star the repo to get notified!
---
## Defense in Depth
We don't trust agents. We verify them at every layer.

| Layer | Mechanism | What It Prevents |
|:------|:----------|:-----------------|
| 1 | Objective Acceptance Criteria | Vague "it works" claims |
| 2 | Protocol Mandate | Skipping verification steps |
| 3 | Failure State Injection | "I don't remember what went wrong" |
| 4 | Test Gating (`bun test`) | Claims without passing tests |
| 5 | Playwright Verification | UI lies ("looks correct") |
| 6 | Human Code Review | Subtle implementation issues |
| 7 | Beads Audit Trail | Retroactive tampering |
An agent cannot game this system. Either the tests pass or they don't.
---
## Quick Start

### 1. Install Prerequisites
```bash
# Beads — Git-backed task tracker (one-time global install, pick one)
brew install steveyegge/beads/bd    # Homebrew (recommended)
# or: curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/install.sh | bash
# or: npm install -g @beads/bd
# or: go install github.com/steveyegge/beads/cmd/bd@latest

# Bun — JavaScript runtime (one-time global install)
curl -fsSL https://bun.sh/install | bash

# Claude Code CLI (one-time global install)
# https://claude.ai/download

# PAI Observability (optional, for real-time monitoring dashboard)
# https://github.com/danielmiessler/Personal_AI_Infrastructure.git
```

### 2. Install RBP into Your Project
```bash
# Clone the repository
git clone https://github.com/AojdevStudio/rbp-stack.git

# Install into your project
./rbp/install.sh /path/to/your/project

# Validate installation
/path/to/your/project/scripts/rbp/validate.sh
```

### 3. Run a Workflow
#### Workflow A: BMAD Stories (structured, story-driven)

```bash
# Create a story with BMAD
/bmad:bmm:workflows:create-story

# Convert to beads
./scripts/rbp/parse-story-to-beads.sh docs/stories/story-001.md

# Launch autonomous execution
./scripts/rbp/ralph.sh
```

#### Workflow B: Quick-Plan Specs (interview-driven)
```bash
# Create a spec through codebase analysis + interview
/quick-plan "add user authentication with JWT"

# Execute with optional Codex pre-flight review
./scripts/rbp/ralph-execute.sh specs/add-user-authentication.md

# Or skip the Codex review
./scripts/rbp/ralph-execute.sh specs/add-user-authentication.md --skip-review
```

#### Monitor Progress
```bash
bd status        # Task status
bd list --open   # Open tasks
bd tree          # Task hierarchy
```
---
## How It Works

### The Ralph Loop
```
while tasks_remain:
    task = bd ready        # Query Beads for next unblocked task
    implement(task)        # Agent implements the task
    close-with-proof.sh    # THE GATEKEEPER
    ├── bun test           # Unit tests must pass
    ├── playwright test    # UI tests must pass (if UI task)
    └── bd close           # Only now can the task close
```

### The Gatekeeper Script
```bash
#!/usr/bin/env bash
# close-with-proof.sh — the agent cannot bypass this

# Run tests
bun run test || exit 1

# Run Playwright for UI tasks (auto-detected)
if [[ "$TASK_TYPE" == "ui" ]]; then
  bunx playwright test || exit 1
fi

# Only close if all tests pass
bd close "$BEAD_ID"
echo "✅ Task verified and closed"
```

This is script-level enforcement. The agent has no way around it.
---
## Failure State Injection
When a task fails its test verification, Ralph automatically injects the failure context into the next attempt:
```
Task Iteration 1:
├── Run tests
├── Tests fail → Append failure notes to bead
└── Ralph continues to next task

Task Iteration 2 (when task becomes ready again):
├── Read previous failure notes from bead
├── Inject "Previous Attempt Failed" section into prompt
├── Agent sees exactly what went wrong
├── Agent fixes the issues
├── Run tests again
└── If pass → Close with proof
```

This prevents the agent from making the same mistake twice and enables autonomous error recovery.
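The injection step can be sketched in a few lines of shell. This is an illustrative sketch, not the shipped `ralph.sh` — the notes-file location, file names, and section wording here are assumptions:

```shell
#!/usr/bin/env bash
# Illustrative sketch of failure-state injection (NOT the shipped ralph.sh).
# Assumes failure notes from the previous attempt are cached in a local file.
set -euo pipefail

BEAD_ID="bd-42"                               # example task id
NOTES_FILE="/tmp/rbp-failures-${BEAD_ID}.md"  # hypothetical notes location
PROMPT_FILE="/tmp/rbp-prompt-${BEAD_ID}.md"

# Simulate what the gatekeeper would have left behind on a failed attempt
echo "bun test: 2 assertions failed in tests/user.test.ts" > "$NOTES_FILE"

# Base instructions for the agent
echo "Implement the task per its acceptance criteria." > "$PROMPT_FILE"

# Inject prior failure context so the retry sees exactly what went wrong
if [[ -s "$NOTES_FILE" ]]; then
  {
    printf '\n## Previous Attempt Failed\n'
    cat "$NOTES_FILE"
  } >> "$PROMPT_FILE"
fi

cat "$PROMPT_FILE"
```

The key property: the agent never has to remember anything — the loop rebuilds the prompt from persisted state on every iteration.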
---
## Atomic Subtasks
When a task contains subtasks, the parser creates them as separate child beads with explicit dependencies:
```
Task: "Create admin dashboard"
├── Subtask 1.1: Build layout structure (no dependencies)
│   └── Bead ID: bd-123.1.1
├── Subtask 1.2: Add sidebar (depends on 1.1)
│   └── Bead ID: bd-123.1.2
├── Subtask 1.3: Implement navigation (depends on 1.2)
│   └── Bead ID: bd-123.1.3
└── Task depends on final subtask (1.3)
```

Benefits:
- Clear sequencing: Each subtask has explicit dependencies
- Granular tracking: Each subtask is independently verifiable
- Failure recovery: If subtask 1.2 fails, only that subtask retries (not 1.1)
- Optimal context: Ralph executes one subtask per iteration
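The chaining logic itself is simple: each subtask depends on the one before it. A sketch of how a parser might derive that chain from a subtask list (the bead-id scheme mirrors the example above; the output format and variable names are ours, not `parse-spec-to-beads.sh`'s):

```shell
#!/usr/bin/env bash
# Illustrative: derive a sequential dependency chain for subtask beads.
set -euo pipefail

TASK_ID="bd-123"
SUBTASKS=("Build layout structure" "Add sidebar" "Implement navigation")
OUT="/tmp/rbp-subtask-chain.txt"

prev=""
for i in "${!SUBTASKS[@]}"; do
  child="${TASK_ID}.1.$((i + 1))"
  if [[ -z "$prev" ]]; then
    echo "$child: ${SUBTASKS[$i]} (no dependencies)"
  else
    echo "$child: ${SUBTASKS[$i]} (depends on $prev)"
  fi
  prev="$child"   # next subtask depends on this one
done > "$OUT"

cat "$OUT"
```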
---
## Quick-Plan Workflow

Don't have BMAD? Use the Quick-Plan workflow instead.

### The Pipeline
```
/quick-plan "feature description"
        ↓
Codebase Analysis (scans your project)
        ↓
Interview (asks clarifying questions until ZERO gaps remain)
        ↓
specs/feature-name.md (with mandatory Testing Strategy + Implementation Tasks)
        ↓
./ralph-execute.sh specs/feature-name.md
        ↓
[Optional] Codex Pre-Flight Review (GPT-5-Codex analyzes spec)
        ↓
Parse Spec → Beads (creates task graph with atomic subtasks)
        ↓
Ralph Loop (bd ready → implement → test → close, repeat)
        ↓
Verified Code
```

### Generated Spec Format
Quick-plan generates specs with two mandatory RBP sections:
```markdown
## Testing Strategy

### Test Framework
bun test (detected from package.json)

### Test Command
bun test

### Required Tests
- [ ] Test: User model validation → File: tests/user.test.ts
- [ ] Test: JWT token generation → File: tests/auth.test.ts

## Implementation Tasks

### Task 1
- ID: task-001
- Dependencies: none
- Files: src/models/user.ts
- Acceptance: User model with email, password hash, timestamps
- Tests: tests/user.test.ts
- Subtasks:
  - [ ] Define TypeScript interfaces
  - [ ] Implement validation logic
  - [ ] Add timestamp fields

### Task 2
- ID: task-002
- Dependencies: task-001
- Files: src/auth/jwt.ts, src/components/LoginForm.tsx
- Acceptance: Login returns valid JWT, stored in httpOnly cookie
- Tests: tests/auth.test.ts
```

### Codex Pre-Flight Review
Before executing, `ralph-execute.sh` optionally runs GPT-5-Codex to review the spec:

```bash
# With Codex review (default)
./scripts/rbp/ralph-execute.sh specs/feature.md

# Skip review
./scripts/rbp/ralph-execute.sh specs/feature.md --skip-review
```

Codex checks for:
- Missing edge cases
- Wrong technical approaches
- Missing task dependencies
- Incomplete testing strategy
- Security concerns
### UI Task Detection

Tasks tagged with `[UI]` or containing UI keywords automatically get the `requires-playwright` flag. The gatekeeper runs Playwright tests for these tasks.
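Detection like this amounts to matching the task title against a tag or keyword list. A minimal sketch — the keyword list here is illustrative, not the parser's actual list:

```shell
#!/usr/bin/env bash
# Illustrative UI auto-detection; the real parser's keyword list may differ.

is_ui_task() {
  # Match an explicit [UI] tag or common UI keywords (illustrative list)
  printf '%s' "$1" | grep -qiE '\[UI\]|component|render|form|page|dashboard'
}

OUT="/tmp/rbp-ui-flags.txt"
: > "$OUT"

for title in "[UI] Build LoginForm component" "Add JWT signing helper"; do
  if is_ui_task "$title"; then
    echo "$title → requires-playwright" >> "$OUT"
  else
    echo "$title → unit tests only" >> "$OUT"
  fi
done

cat "$OUT"
```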
---
## Key Decisions

### Beads as the Single Source of Truth

The agent queries `bd ready` instead of reading JSON files.

- No stale state — Beads is always current
- No sync issues — Single source of truth
- Git-backed — Full audit trail
### Context Budget
We analyzed 76 real BMAD stories:
| Metric | Value |
|:-------|:------|
| Average story size | 3,914 tokens |
| Largest story | 12,962 tokens |
| Context budget used | 12.9% of 100k |
All stories fit in a single context window. For larger stories, our Execution Sequencer groups subtasks into phases of 3-5.
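Phase grouping is plain chunking: walk the subtask list and cut it every `phase_size` items. A sketch of the idea (the subtask names and output format are illustrative; only `phase_size` comes from `rbp-config.yaml`):

```shell
#!/usr/bin/env bash
# Illustrative phase grouping: chunk subtasks into phases of at most PHASE_SIZE.
set -u

PHASE_SIZE=5   # corresponds to phase_size in rbp-config.yaml
subtasks=(auth-ui auth-api tokens cookies logout profile settings audit)

OUT="/tmp/rbp-phases.txt"
: > "$OUT"

phase=1
for ((i = 0; i < ${#subtasks[@]}; i += PHASE_SIZE)); do
  # Slice the next PHASE_SIZE subtasks into one phase
  echo "Phase $phase: ${subtasks[*]:i:PHASE_SIZE}" >> "$OUT"
  phase=$((phase + 1))
done

cat "$OUT"
```

Eight subtasks with a phase size of five yields two phases, keeping each Ralph iteration well inside the context budget.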
### Scripts, Not Instructions

Agents can be told "run tests before closing." They can ignore the instruction.

Scripts cannot be ignored. `close-with-proof.sh` runs the tests. Either they pass or the task stays open.
---
## What's Included

```
rbp/
├── scripts/
│   ├── ralph.sh                   # Main execution loop (with failure state injection)
│   ├── ralph-execute.sh           # Quick-plan execution (with Codex review)
│   ├── close-with-proof.sh        # Test-gated closure (failure notes appending)
│   ├── emit-event.sh              # PAI Observability event emitter
│   ├── parse-story-to-beads.sh    # BMAD Story → Beads conversion
│   ├── parse-spec-to-beads.sh     # Quick-plan Spec → Beads (atomic subtasks)
│   ├── sequencer.sh               # Phase grouping for large stories
│   ├── show-active-task.sh        # Display current task
│   └── save-progress-to-beads.sh  # Sync progress to bead notes
├── commands/rbp/
│   ├── start.md                   # /rbp:start command
│   ├── status.md                  # /rbp:status command
│   └── validate.md                # /rbp:validate command
├── templates/
│   ├── rbp-config.yaml            # Base configuration
│   ├── rbp-config.example.yaml    # Documented config
│   └── spec-template.md           # Spec format template
├── install.sh                     # One-line installation
├── validate.sh                    # Installation checker
├── docs/
│   └── rbp-stack-specification.md # Full technical specification
└── README.md                      # This file
```

Key recent features:

- `ralph.sh`: Failure state injection reads notes and injects "Previous Attempt Failed" context
- `close-with-proof.sh`: Appends test failure notes to beads for retry context
- `parse-spec-to-beads.sh`: Creates atomic subtasks as separate beads with dependency chaining
- `prompt.md`: The "Enforcement and Consequences" section explains the stakes of non-compliance
---
## Configuration

```yaml
# rbp-config.yaml
project:
  name: "your-project"

paths:
  stories: "docs/stories"   # BMAD stories
  specs: "specs"            # Quick-plan specs

execution:
  max_iterations: 10
  phase_size: 5

verification:
  require_tests: true
  require_playwright_for_ui: true
  test_command: "bun run test"

quick_plan:
  command: "/quick-plan"
  spec_template: "templates/spec-template.md"

codex:
  enabled: true    # Set false if Codex not installed
  model: "gpt-5-codex"
  reasoning_effort: "high"
  skip_by_default: false

observability:
  enabled: true       # Emit events to PAI dashboard
  auto_launch: true   # Auto-start dashboard with /rbp:start
```
---
## Observability

RBP integrates with PAI (Personal AI Infrastructure) for real-time observability of task execution.

### What You Get
| Feature | Description |
|:--------|:------------|
| Real-time Dashboard | Watch task progress in your browser |
| Event Stream | See RBP:TaskStart, RBP:TestRun, RBP:TestResult events live |
| Debug Visibility | Trace through test failures and errors |
| Multi-Session Support | Run multiple RBP sessions with distinct session IDs |
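Under the hood, events land in an append-only JSONL file. A minimal sketch of the emitter shape — the shipped `emit-event.sh` and PAI's actual field names may differ, and the file path below is a stand-in:

```shell
#!/usr/bin/env bash
# Illustrative event emitter: one JSON line per event (field names assumed).
set -euo pipefail

EVENTS_FILE="/tmp/rbp-events.jsonl"   # stand-in for the PAI events path
: > "$EVENTS_FILE"

emit_event() {
  local event="$1" detail="$2"
  # Append a single JSON object per line (JSONL), timestamped in UTC
  printf '{"event":"%s","detail":"%s","ts":"%s"}\n' \
    "$event" "$detail" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" >> "$EVENTS_FILE"
}

emit_event "RBP:TaskStart"  "bd-42"
emit_event "RBP:TestResult" "exit=0"

cat "$EVENTS_FILE"
```

Append-only JSONL keeps the emitter trivial and lets the dashboard tail the file without coordination.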
### Setup

```bash
# 1. Install PAI (if not already installed)
git clone https://github.com/danielmiessler/Personal_AI_Infrastructure.git ~/PAI
cd ~/PAI && ./install.sh

# 2. RBP auto-detects PAI and emits events automatically.
#    Events are written to: ~/.claude/history/raw-outputs/YYYY-MM/YYYY-MM-DD_all-events.jsonl

# 3. Launch dashboard with /rbp:start or manually:
~/.claude/observability/manage.sh start

# Dashboard: http://localhost:5172
```

### Event Types
| Event | Emitted When |
|:------|:-------------|
| RBP:LoopStart | Ralph begins execution |
| RBP:TaskStart | A task is picked from bd ready |
| RBP:TaskProgress | Task status changes (executing, iteration_complete) |
| RBP:TaskComplete | Task closed with proof |
| RBP:TestRun | Tests are about to run |
| RBP:TestResult | Tests complete (includes exit code, output) |
| RBP:Error | An error occurred |
| RBP:CodexReview | Codex pre-flight review starts/completes |
| RBP:SpecParsed | Spec parsed to Beads |
| RBP:LoopEnd | Ralph loop completes |

### Running Without PAI
RBP works without PAI — observability events are simply not emitted. You can still monitor progress via:
```bash
# File-based logs
tail -f scripts/rbp/progress.txt

# Beads activity
bd activity --follow

# Task status
bd status
```

---
## Why I Built This

I've been using the BMAD Method for a while now. It's probably the best tool I've found for building software projects with AI — structured stories, clear acceptance criteria, the whole workflow. I'm also an avid Claude Code user. These tools changed how I build.
But something was missing.
Every time I kicked off a BMAD story, I'd watch the AI work... then it would stop. Ask a question. Wait for me. I'd answer, it would continue... then stop again. The constant back-and-forth was killing my productivity. I wanted to give it an Epic and walk away. Come back to working code.
I wanted long-running autonomous processes.
Then I discovered Ralph — Geoffrey Huntley's pattern for relentless AI execution loops. And Beads — Steve Yegge's git-backed task graph. Something clicked.
What if I could combine BMAD's structured stories with Ralph's autonomous loops and Beads' persistent memory?
I started building. 76 stories later, I had a working system. But I also discovered something uncomfortable: AI agents lie. They mark tasks "complete" without running tests. They check boxes without doing the work.
The realization hit me: Checkboxes are self-reported. Tests are objective.
An agent can flip a boolean. It cannot fake a passing test.
So I added test-gated closure. No task closes without proof. The script runs the tests — either they pass or the task stays open. The agent has no say in the matter.
Then I realized: when a task fails, the agent needs to see what went wrong. So I added failure state injection. The previous attempt's notes are automatically injected into the retry prompt. Now agents can learn from their mistakes without human guidance.
Finally, I made subtasks atomic. Each subtask is a separate bead with explicit dependencies, not just checklist items. This lets Ralph execute them sequentially with test verification after each one.
The RBP Stack is the result.
What started as a productivity hack became a verification-first autonomous development system. BMAD creates the stories. Beads tracks the state. Ralph drives the execution. Tests guard the gates. Failure notes teach the next attempt.
Now I give it an Epic and walk away. Come back to verified, working code.
---
## Roadmap

- [x] Core execution loop (Ralph)
- [x] Test-gated closure
- [x] Story → Beads conversion (BMAD workflow)
- [x] Spec → Beads conversion (Quick-Plan workflow)
- [x] Codex pre-flight review integration
- [x] UI auto-detection (Playwright)
- [x] Execution sequencer for large stories
- [x] Real-time progress dashboard (PAI Observability integration)
- [x] Failure state injection (previous attempt context)
- [x] Atomic subtask creation with dependencies
- [ ] Parallel task execution
- [ ] Integration with more test frameworks
---
## Contributing

Contributions welcome! Please ensure:
1. All scripts have tests
2. Documentation is updated
3. The verification system is never bypassed
See CONTRIBUTING.md for guidelines.
---
## Acknowledgments

- Beads — Git-backed issue tracking by Steve Yegge
- BMAD — Structured story creation framework
- Claude Code — Execution environment
- Ralph Pattern — The original autonomous loop concept by Geoffrey Huntley
---
## License

MIT License — see LICENSE for details.
---
Built with frustration. Verified with tests.
If this helped you, ⭐ star the repo — it helps others find it.
