Assertion framework for AI agent evaluations - supports skill invocation checks, build validation, and LLM-based judging
A framework for evaluating AI agent outputs through configurable assertions. Supports skill invocation checks, build validation, and LLM-based judging.
```bash
npm install @wix/eval-assertions
# or
yarn add @wix/eval-assertions
```
- Skill Was Called: Verify that specific skills were invoked during agent execution
- Build Passed: Run build commands and verify exit codes
- LLM Judge: Use an LLM to evaluate agent outputs with customizable prompts and scoring
```typescript
import {
  evaluateAssertions,
  AssertionResultStatus,
  type Assertion,
  type AssertionContext,
  type EvaluationInput
} from '@wix/eval-assertions';

// Define your assertions
const assertions: Assertion[] = [
  {
    type: 'skill_was_called',
    skillName: 'my-skill'
  },
  {
    type: 'build_passed',
    command: 'npm test',
    expectedExitCode: 0
  },
  {
    type: 'llm_judge',
    prompt: 'Evaluate if the output correctly implements the requested feature:\n\n{{output}}',
    minScore: 70
  }
];

// Prepare your evaluation input
const input: EvaluationInput = {
  outputText: 'Agent output here...',
  llmTrace: {
    id: 'trace-1',
    steps: [...],
    summary: {...}
  },
  fileDiffs: [...]
};

// Set up context for assertions that need it
const context: AssertionContext = {
  workDir: '/path/to/working/directory',
  llmConfig: {
    baseUrl: 'https://api.anthropic.com',
    headers: { 'x-api-key': 'your-key' }
  }
};

// Run assertions
const results = await evaluateAssertions(input, assertions, context);

// Check results
for (const result of results) {
  console.log(`${result.assertionName}: ${result.status}`);
  if (result.status === AssertionResultStatus.FAILED) {
    console.log(`  Message: ${result.message}`);
  }
}
```
The `skill_was_called` assertion checks whether a specific skill was invoked by examining the LLM trace.
```typescript
{
  type: 'skill_was_called',
  skillName: 'commit' // Name of the skill that must have been called
}
```
The `build_passed` assertion runs a command in the working directory and checks the exit code. When the command fails, the result's `details` include stdout and stderr so you can see why the build failed.
```typescript
{
  type: 'build_passed',
  command: 'yarn build', // Command to run (default: 'yarn build')
  expectedExitCode: 0 // Expected exit code (default: 0)
}
```
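When the build fails, you can surface the captured output from the result. A minimal sketch, assuming `details` exposes the stdout and stderr mentioned above under `stdout` and `stderr` keys (the exact key names are an assumption):

```typescript
import { evaluateAssertions, AssertionResultStatus } from '@wix/eval-assertions';

const [buildResult] = await evaluateAssertions(
  {}, // build_passed does not read the agent output
  [{ type: 'build_passed', command: 'npm test', expectedExitCode: 0 }],
  { workDir: '/path/to/working/directory' }
);

if (buildResult.status === AssertionResultStatus.FAILED) {
  // Key names assumed from the description above.
  console.log('stdout:', buildResult.details?.stdout);
  console.log('stderr:', buildResult.details?.stderr);
}
```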
The `llm_judge` assertion uses an LLM to evaluate the output with a customizable prompt. The default system prompt instructs the judge to be strict about factual verification: when you ask it to verify a specific fact, the judge compares against the actual data and scores 0 or near 0 on a mismatch. If the judge returns invalid JSON, the evaluator retries up to 3 times before failing.
```typescript
{
  type: 'llm_judge',
  prompt: 'Evaluate the quality of this code:\n\n{{output}}',
  systemPrompt: 'You are a code reviewer...', // Optional custom system prompt
  minScore: 70, // Minimum passing score (0-100, default: 70)
  model: 'claude-3-5-haiku-20241022', // Model to use
  maxTokens: 1024, // Max output tokens
  temperature: 0 // Temperature (0-1)
}
```
Tip: When verifying file-related outcomes, include `{{changedFiles}}` in your prompt so the judge sees the actual files. Without it, the judge still receives this data in the system context, but making it explicit improves accuracy.
Available placeholders in prompts:
- `{{output}}` - The agent's final output text
- `{{cwd}}` - Working directory path
- `{{changedFiles}}` - List of files that were modified
- `{{trace}}` - Formatted LLM trace showing tool calls and completions
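For example, a single judge prompt can combine several placeholders. A sketch; the task description and score threshold here are illustrative:

```typescript
import { type Assertion } from '@wix/eval-assertions';

const judgeAssertion: Assertion = {
  type: 'llm_judge',
  prompt: [
    'The agent was asked to add a health-check endpoint.',
    'Changed files:\n{{changedFiles}}',
    'Trace:\n{{trace}}',
    'Final output:\n{{output}}',
    'Score how faithfully the change matches the request.'
  ].join('\n\n'),
  minScore: 70
};
```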
The `EvaluationInput` interface describes the input data for assertion evaluation:
```typescript
interface EvaluationInput {
  outputText?: string;
  llmTrace?: LLMTrace;
  fileDiffs?: Array<{ path: string; content?: string; status?: 'new' | 'modified' }>;
}
```
When `fileDiffs` items include `status`, the `{{modifiedFiles}}` and `{{newFiles}}` placeholders are populated for the LLM judge.
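A sketch of an input whose `fileDiffs` carry `status`, paired with a prompt that uses the conditional placeholders (paths and wording are illustrative):

```typescript
import { type Assertion, type EvaluationInput } from '@wix/eval-assertions';

const inputWithDiffs: EvaluationInput = {
  outputText: 'Added a health-check route and registered it in the router.',
  fileDiffs: [
    { path: 'src/routes/health.ts', status: 'new' },
    { path: 'src/router.ts', status: 'modified' }
  ]
};

const diffJudge: Assertion = {
  type: 'llm_judge',
  prompt:
    'New files:\n{{newFiles}}\n\n' +
    'Modified files:\n{{modifiedFiles}}\n\n' +
    'Did the agent add exactly one new route file and wire it into the router?',
  minScore: 70
};
```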
The `AssertionContext` interface provides optional context for assertions:
```typescript
interface AssertionContext {
  workDir?: string; // For build_passed
  llmConfig?: { // For llm_judge
    baseUrl: string;
    headers: Record<string, string>;
  };
  generateTextForLlmJudge?: (options) => Promise<{ text: string }>; // For testing
}
```
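In tests, `generateTextForLlmJudge` lets you stub the judge call instead of hitting a real LLM. A minimal sketch, assuming the judge response is a JSON string containing a 0-100 `score` (the exact response shape, including the `reasoning` field, is an assumption):

```typescript
import { evaluateAssertions, type AssertionContext } from '@wix/eval-assertions';

const testContext: AssertionContext = {
  generateTextForLlmJudge: async () => ({
    // Assumed response shape: a JSON payload with a numeric score.
    text: JSON.stringify({ score: 85, reasoning: 'Output matches the request.' })
  })
};

const testResults = await evaluateAssertions(
  { outputText: 'Agent output here...' },
  [{ type: 'llm_judge', prompt: 'Evaluate: {{output}}', minScore: 70 }],
  testContext
);
```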
Each assertion produces an `AssertionResult`:
```typescript
interface AssertionResult {
  id: string;
  assertionId: string;
  assertionType: string;
  assertionName: string;
  status: AssertionResultStatus; // 'passed' | 'failed' | 'skipped' | 'error'
  message?: string;
  expected?: string;
  actual?: string;
  duration?: number;
  details?: Record<string, unknown>;
}
```
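For example, a small helper that summarizes a run and surfaces the expected/actual values on failures (a sketch using only the fields above):

```typescript
import { AssertionResultStatus, type AssertionResult } from '@wix/eval-assertions';

function summarize(results: AssertionResult[]): void {
  const failed = results.filter((r) => r.status === AssertionResultStatus.FAILED);
  console.log(`${results.length - failed.length}/${results.length} assertions passed`);

  for (const r of failed) {
    console.log(`- ${r.assertionName} (${r.assertionType})`);
    if (r.expected || r.actual) {
      console.log(`  expected: ${r.expected}, actual: ${r.actual}`);
    }
    if (r.message) {
      console.log(`  ${r.message}`);
    }
  }
}
```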
You can extend the framework with custom assertion types:
```typescript
import { AssertionEvaluator, type AssertionResult, AssertionResultStatus } from '@wix/eval-assertions';
import { z } from 'zod';

// Define your assertion schema
export const MyAssertionSchema = z.object({
  type: z.literal('my_assertion'),
  customField: z.string()
});

export type MyAssertion = z.infer<typeof MyAssertionSchema>;

// Implement the evaluator
export class MyAssertionEvaluator extends AssertionEvaluator {
  readonly type = 'my_assertion' as const;

  evaluate(assertion, input, context): AssertionResult {
    // Your evaluation logic here
    return {
      id: crypto.randomUUID(),
      assertionId: crypto.randomUUID(),
      assertionType: 'my_assertion',
      assertionName: 'My Custom Assertion',
      status: AssertionResultStatus.PASSED,
      message: 'Assertion passed!'
    };
  }
}

// Register your evaluator
import { registerEvaluator } from '@wix/eval-assertions';

registerEvaluator('my_assertion', new MyAssertionEvaluator());
```
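Once registered, the custom assertion can be passed to `evaluateAssertions` alongside the built-in types. A sketch, assuming registration happens before evaluation; the cast is only needed if the `Assertion` union does not include custom types:

```typescript
import { evaluateAssertions, type Assertion } from '@wix/eval-assertions';

const customAssertion = {
  type: 'my_assertion',
  customField: 'expected-value'
} as unknown as Assertion; // cast needed if the Assertion union is closed

const customResults = await evaluateAssertions(
  { outputText: 'Agent output here...' },
  [customAssertion],
  {}
);

console.log(customResults[0].status);
```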
License: MIT