# Agent Evaluation and Observability Framework

```bash
npm install @goyamegh/agent-health
```



An evaluation and observability framework for AI agents. Features real-time trace visualization, "Golden Path" trajectory comparison, and LLM-based evaluation scoring.
- Evals: Real-time agent evaluation with trajectory streaming
- Experiments: Batch evaluation runs with configurable parameters
- Compare: Side-by-side trace comparison with aligned and merged views
- Agent Traces: Table-based trace view with latency histogram, filtering, and detailed flyout with input/output display
- Live Traces: Real-time trace monitoring with auto-refresh and filtering
- Trace Views: Timeline and Flow visualizations for debugging
- Reports: Evaluation reports with LLM judge reasoning
- Connectors: Pluggable protocol adapters for different agent types
For a detailed walkthrough, see Getting Started.

| Connector | Protocol | Description |
|-----------|----------|-------------|
| agui-streaming | AG-UI SSE | ML-Commons agents (default) |
| rest | HTTP POST | Non-streaming REST APIs |
| subprocess | CLI | Command-line tools |
| claude-code | Claude CLI | Claude Code agent comparison |
| mock | In-memory | Demo and testing |
For creating custom connectors, see docs/CONNECTORS.md.
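As a rough sketch of what a protocol adapter involves (the type and method names below are illustrative, not the framework's actual API; see docs/CONNECTORS.md for the real contract):

```ts
// Illustrative sketch only — names here are hypothetical; see docs/CONNECTORS.md
// for the framework's actual connector contract.

interface AgentRequest {
  input: string;                          // test-case input sent to the agent
  metadata?: Record<string, unknown>;
}

interface TraceEvent {
  type: "message" | "tool_call" | "tool_result" | "done";
  timestamp: number;
  payload: unknown;
}

// A connector adapts one agent protocol (SSE, REST, CLI, ...) into a common
// stream of trace events that the evaluator and UI can consume.
interface Connector {
  readonly name: string;
  execute(request: AgentRequest): AsyncIterable<TraceEvent>;
}

// Minimal REST-style adapter: POST the input, emit a single "done" event.
const restConnector: Connector = {
  name: "rest",
  async *execute(request: AgentRequest) {
    const response = await fetch("http://localhost:3000/run", { // placeholder URL
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ input: request.input }),
    });
    const event: TraceEvent = {
      type: "done",
      timestamp: Date.now(),
      payload: await response.json(),
    };
    yield event;
  },
};
```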
---
```bash
# Start the web UI
npx @opensearch-project/agent-health
```

### CLI Commands

```bash
# Check configuration
npx @opensearch-project/agent-health doctor

# List available agents and connectors
npx @opensearch-project/agent-health list agents
npx @opensearch-project/agent-health list connectors

# Run a test case against an agent
npx @opensearch-project/agent-health run -t demo-otel-001 -a demo

# Initialize a new project
npx @opensearch-project/agent-health init
```

For full CLI documentation, see docs/CLI.md.
---
## Authentication (Required)
AWS credentials are required for the Bedrock LLM Judge to score evaluations.
Create a .env file:

```bash
cp .env.example .env
```

Add your AWS credentials:

```bash
AWS_REGION=us-east-1
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_SESSION_TOKEN=your_session_token  # if using temporary credentials
```
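To confirm that these credentials are picked up before running an evaluation, a quick standalone check can help. This is only an illustrative sketch, assuming the dotenv and @aws-sdk/client-sts packages are available; it is not part of the framework:

```ts
// check-aws-credentials.ts — illustrative sketch, not part of this project.
// Assumes `dotenv` and `@aws-sdk/client-sts` are installed.
import "dotenv/config";
import { STSClient, GetCallerIdentityCommand } from "@aws-sdk/client-sts";

async function main(): Promise<void> {
  const sts = new STSClient({ region: process.env.AWS_REGION });
  // GetCallerIdentity succeeds with any valid credentials, so it is a cheap
  // way to catch missing or expired keys before the LLM Judge needs them.
  // (It does not verify Bedrock permissions specifically.)
  const identity = await sts.send(new GetCallerIdentityCommand({}));
  console.log(`Credentials OK for: ${identity.Arn}`);
}

main().catch((err) => {
  console.error("AWS credentials are missing or expired:", err);
  process.exit(1);
});
```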
---

## Configuration (Optional)
All optional settings have sensible defaults. Configure only what you need.
### Agent Endpoints
Agent endpoints default to localhost. Override if your agent runs elsewhere:
```bash
LANGGRAPH_ENDPOINT=http://localhost:3000
HOLMESGPT_ENDPOINT=http://localhost:5050/api/agui/chat
MLCOMMONS_ENDPOINT=http://localhost:9200/_plugins/_ml/agents/{agent_id}/_execute/stream
```

### OpenSearch Storage
Used for persisting test cases, experiments, and runs. These features degrade gracefully if storage is not configured.
```bash
OPENSEARCH_STORAGE_ENDPOINT=https://your-cluster.opensearch.amazonaws.com
OPENSEARCH_STORAGE_USERNAME=admin
OPENSEARCH_STORAGE_PASSWORD=your_password
OPENSEARCH_STORAGE_TLS_SKIP_VERIFY=false # Set to true for self-signed certificates
```

### OpenSearch Logs
Used for storing agent execution traces. These features degrade gracefully if logging is not configured.
```bash
OPENSEARCH_LOGS_ENDPOINT=https://your-logs-cluster.opensearch.amazonaws.com
OPENSEARCH_LOGS_USERNAME=admin
OPENSEARCH_LOGS_PASSWORD=your_password
OPENSEARCH_LOGS_TLS_SKIP_VERIFY=false # Set to true for self-signed certificates
```

See .env.example for all available options.
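If storage or trace features appear unavailable, a quick connectivity check against the configured cluster can confirm the endpoint and credentials. The sketch below targets the storage cluster and assumes the @opensearch-project/opensearch and dotenv packages; it is not part of the framework:

```ts
// check-storage.ts — illustrative sketch; assumes @opensearch-project/opensearch
// and dotenv are installed. Not part of this project.
import "dotenv/config";
import { Client } from "@opensearch-project/opensearch";

const client = new Client({
  node: process.env.OPENSEARCH_STORAGE_ENDPOINT ?? "https://localhost:9200",
  auth: {
    username: process.env.OPENSEARCH_STORAGE_USERNAME ?? "",
    password: process.env.OPENSEARCH_STORAGE_PASSWORD ?? "",
  },
  // Mirrors OPENSEARCH_STORAGE_TLS_SKIP_VERIFY: only skip verification for
  // self-signed certificates in development.
  ssl: {
    rejectUnauthorized: process.env.OPENSEARCH_STORAGE_TLS_SKIP_VERIFY !== "true",
  },
});

client.cluster
  .health()
  .then((res) => console.log("Cluster status:", res.body.status))
  .catch((err) => console.error("Cannot reach storage cluster:", err.message));
```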
---

## Development Commands
| Command | Description |
|---------|-------------|
| npm install | Install dependencies |
| npm run dev | Start frontend dev server (port 4000) |
| npm run dev:server | Start backend server (port 4001) |
| npm run build | TypeScript compile + Vite production build |
| npm test | Run all tests |
| npm run test:unit | Run unit tests only |
| npm run test:integration | Run integration tests only |
| npm run test:e2e | Run E2E tests with Playwright |
| npm run test:e2e:ui | Run E2E tests with Playwright UI |
| npm run test:all | Run all tests (unit + integration + e2e) |
| npm test -- --coverage | Run tests with coverage report |
| npm run build:all | Build UI + server + CLI |
| npm run build:cli | Build CLI only |

### Production Mode
```bash
npm run server # Build UI + start single server on port 4001
```

Open http://localhost:4001 in your browser.
### Running via npx
After publishing, run directly with npx:
```bash
npx @opensearch-project/agent-health # Start server on port 4001
npx @opensearch-project/agent-health --port 8080
npx @opensearch-project/agent-health --env-file .env
```

### Ports Summary
| Mode | Command | Port(s) |
|------|---------|---------|
| Dev (frontend) | npm run dev | 4000 |
| Dev (backend) | npm run dev:server | 4001 |
| Production | npm run server | 4001 |
| NPX | npx @opensearch-project/agent-health | 4001 (default) |

In development, the Vite dev server (4000) proxies /api requests to the backend (4001).
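The proxy behavior comes from the Vite configuration; a typical setup looks roughly like the following (the repository's actual vite.config.ts may differ):

```ts
// vite.config.ts — illustrative of the proxy described above; the actual
// configuration in this repository may differ.
import { defineConfig } from "vite";

export default defineConfig({
  server: {
    port: 4000,
    proxy: {
      // Forward API calls from the dev server (4000) to the backend (4001).
      "/api": {
        target: "http://localhost:4001",
        changeOrigin: true,
      },
    },
  },
});
```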
---

## Testing

Agent Health uses a comprehensive test suite with three layers:
### Test Types
| Type | Location | Command | Description |
|------|----------|---------|-------------|
| Unit | tests/unit/ | npm run test:unit | Fast, isolated function tests |
| Integration | tests/integration/ | npm run test:integration | Tests with real backend server |
| E2E | tests/e2e/ | npm run test:e2e | Browser-based UI tests with Playwright |

### Running Tests
```bash
# All tests
npm test                    # Unit + integration
npm run test:all            # Unit + integration + E2E

# By type
npm run test:unit           # Unit tests only
npm run test:integration    # Integration tests (starts server)
npm run test:e2e            # E2E tests (starts servers)
npm run test:e2e:ui         # E2E with Playwright UI for debugging

# With coverage
npm run test:unit -- --coverage

# Specific file
npm test -- path/to/file.test.ts
npx playwright test tests/e2e/dashboard.spec.ts
```

### E2E Tests
E2E tests use Playwright to test the UI in a real browser.
```bash
# First time: install browsers
npx playwright install

# Run all E2E tests
npm run test:e2e

# Interactive UI mode (recommended for debugging)
npm run test:e2e:ui

# View test report
npm run test:e2e:report
```

Writing E2E Tests:
- Place tests in tests/e2e/*.spec.ts
- Use data-testid attributes for reliable selectors
- Handle empty states gracefully (check if data exists before asserting)
- See existing tests for patterns (a minimal sketch follows below)
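As a reference for these guidelines, a spec might look like the sketch below; the route and data-testid values are placeholders rather than selectors from this repository:

```ts
// tests/e2e/example.spec.ts — hypothetical sketch; test IDs and routes are
// placeholders, not actual selectors from this repository.
import { test, expect } from "@playwright/test";

test("traces view renders data or an empty state", async ({ page }) => {
  await page.goto("/"); // baseURL comes from playwright.config.ts

  const table = page.getByTestId("traces-table");
  const emptyState = page.getByTestId("traces-empty-state");

  // Handle empty states gracefully: assert on whichever element is present
  // instead of assuming data exists.
  if ((await table.count()) > 0) {
    await expect(table).toBeVisible();
  } else {
    await expect(emptyState).toBeVisible();
  }
});
```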
### CI Checks

All PRs must pass these CI checks:

| Job | What it checks |
|-----|----------------|
| build-and-test | Build + unit tests + 90% coverage |
| lint-and-typecheck | TypeScript compilation |
| license-check | SPDX headers on all source files |
| integration-tests | Backend integration tests with coverage |
| e2e-tests | Playwright browser tests with pass/fail tracking |
| security-scan | npm audit for vulnerabilities |
| test-summary | Consolidated test results summary |

### Coverage Thresholds
| Test Type | Metric | Threshold |
|-----------|--------|-----------|
| Unit | Lines | ≥ 90% |
| Unit | Branches | ≥ 80% |
| Unit | Functions | ≥ 80% |
| Unit | Statements | ≥ 90% |
| Integration | Lines | Informational (no threshold) |
| E2E | Pass Rate | 100% |
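These unit-test thresholds are enforced by the test runner's coverage configuration. As a rough illustration only, assuming a Jest-style setup (the repository's actual runner and config may differ), the table above maps to settings like:

```ts
// jest.config.ts — illustrative only; assumes Jest is the unit-test runner,
// which may not match this repository's actual setup.
import type { Config } from "jest";

const config: Config = {
  collectCoverage: true,
  coverageThreshold: {
    global: {
      lines: 90,
      branches: 80,
      functions: 80,
      statements: 90,
    },
  },
};

export default config;
```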
### CI Artifacts
Each CI run produces these artifacts (downloadable from the Actions tab):

| Artifact | Contents |
|----------|----------|
| coverage-report | Unit test coverage (HTML, LCOV) |
| integration-coverage-report | Integration test coverage |
| playwright-report | E2E test report with screenshots/traces |
| test-badges | Badge data JSON for coverage visualization |

### Demo Mode Tests
The E2E test suite includes tests for the complete evaluation flow using mock modes:
- Demo Agent (mock://demo) - Simulated AG-UI streaming responses
- Demo Model (provider: "demo") - Simulated LLM judge evaluation

This allows testing the full Create Test Case → Create Benchmark → Run Evaluation → View Results flow without requiring AWS credentials or a live agent in CI.
---
## Agent Setup

Agent Health supports multiple agent types:

| Agent | Endpoint Variable | Setup |
|-------|-------------------|-------|
| LangGraph | LANGGRAPH_ENDPOINT | Simple localhost agent |
| HolmesGPT | HOLMESGPT_ENDPOINT | AG-UI compatible RCA agent |
| ML-Commons | MLCOMMONS_ENDPOINT | See ML-Commons Setup |

---
## Architecture

```
Browser (React UI)
        |
        v
Backend Server (4001) --> Bedrock LLM Judge
        |
        v
Agent Endpoint --> Tools --> OpenSearch Data
```

---
## Troubleshooting
| Issue | Solution |
|-------|----------|
| Cannot connect to backend | Run npm run dev:server, then check curl http://localhost:4001/health |
| AWS credentials expired | Refresh credentials in .env |
| Storage/Traces not working | Check OpenSearch endpoint and credentials in .env |

---
## Contributing
We welcome contributions! See CONTRIBUTING.md for guidelines.
### Development Workflow
1. Fork and clone the repository
2. Install dependencies: npm install
3. Create a feature branch: git checkout -b feature/your-feature
4. Make changes and add tests
5. Run tests: npm test
6. Commit with DCO signoff: git commit -s -m "feat: your message"

All commits require DCO signoff and all PRs must pass CI checks (tests, coverage, linting).
---
- Getting Started - Installation, demo mode, and usage walkthrough
- ML-Commons Agent Setup - Configure ML-Commons agent
- Development Guide - Architecture and coding conventions
- AG-UI Protocol