Assertion framework for AI agent evaluations - supports skill invocation checks, build validation, and LLM-based judging
A framework for evaluating AI agent outputs through configurable assertions. Supports skill invocation checks, build validation, and LLM-based judging.
```bash
npm install @wix/eval-assertions
# or
yarn add @wix/eval-assertions
```
- Skill Was Called: Verify that specific skills were invoked during agent execution
- Build Passed: Run build commands and verify exit codes
- LLM Judge: Use an LLM to evaluate agent outputs with customizable prompts and scoring
```typescript
import {
  evaluateAssertions,
  AssertionResultStatus,
  type Assertion,
  type AssertionContext,
  type EvaluationInput
} from '@wix/eval-assertions';

// Define your assertions
const assertions: Assertion[] = [
  {
    type: 'skill_was_called',
    skillName: 'my-skill'
  },
  {
    type: 'build_passed',
    command: 'npm test',
    expectedExitCode: 0
  },
  {
    type: 'llm_judge',
    prompt: 'Evaluate if the output correctly implements the requested feature:\n\n{{output}}',
    minScore: 70
  }
];

// Prepare your evaluation input
const input: EvaluationInput = {
  outputText: 'Agent output here...',
  llmTrace: {
    id: 'trace-1',
    steps: [...],
    summary: {...}
  },
  fileDiffs: [...]
};

// Set up context for assertions that need it
const context: AssertionContext = {
  workDir: '/path/to/working/directory',
  llmConfig: {
    baseUrl: 'https://api.anthropic.com',
    headers: { 'x-api-key': 'your-key' }
  }
};

// Run assertions
const results = await evaluateAssertions(input, assertions, context);

// Check results
for (const result of results) {
  console.log(`${result.assertionName}: ${result.status}`);
  if (result.status === AssertionResultStatus.FAILED) {
    console.log(`  Message: ${result.message}`);
  }
}
```
The `skill_was_called` assertion checks whether a specific skill was invoked by examining the LLM trace.
```typescript
{
  type: 'skill_was_called',
  skillName: 'commit' // Name of the skill that must have been called
}
```
The `build_passed` assertion runs a command in the working directory and checks the exit code. When the command fails, the result's `details` include stdout and stderr so you can see why the build failed.
```typescript
{
  type: 'build_passed',
  command: 'yarn build', // Command to run (default: 'yarn build')
  expectedExitCode: 0 // Expected exit code (default: 0)
}
```
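When the build fails, you can surface the captured output from the result. A minimal sketch, assuming `details` exposes the stdout and stderr mentioned above under `stdout` and `stderr` keys (the exact key names are an assumption):

```typescript
import { evaluateAssertions, AssertionResultStatus } from '@wix/eval-assertions';

const [buildResult] = await evaluateAssertions(
  {}, // build_passed does not read the agent output
  [{ type: 'build_passed', command: 'npm test', expectedExitCode: 0 }],
  { workDir: '/path/to/working/directory' }
);

if (buildResult.status === AssertionResultStatus.FAILED) {
  // Key names assumed from the description above.
  console.log('stdout:', buildResult.details?.stdout);
  console.log('stderr:', buildResult.details?.stderr);
}
```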
The `llm_judge` assertion uses an LLM to evaluate the output with a customizable prompt. The default system prompt instructs the judge to be strict about factual verification: when you ask it to verify a specific fact, the judge compares against the actual data and scores 0 or near 0 on a mismatch. If the judge returns invalid JSON, the evaluator retries up to 3 times before failing.
```typescript
{
  type: 'llm_judge',
  prompt: 'Evaluate the quality of this code:\n\n{{output}}',
  systemPrompt: 'You are a code reviewer...', // Optional custom system prompt
  minScore: 70, // Minimum passing score (0-100, default: 70)
  model: 'claude-3-5-haiku-20241022', // Model to use
  maxTokens: 1024, // Max output tokens
  temperature: 0 // Temperature (0-1)
}
```
Tip: When verifying file-related outcomes, include `{{changedFiles}}` in your prompt so the judge sees the actual files. Without it, the judge still receives this data in the system context, but making it explicit improves accuracy.
Available placeholders in prompts:
- `{{output}}` - The agent's final output text
- `{{cwd}}` - Working directory path
- `{{changedFiles}}` - List of files that were modified
- `{{trace}}` - Formatted LLM trace showing tool calls and completions
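For example, a single judge prompt can combine several placeholders. A sketch; the task description and score threshold here are illustrative:

```typescript
import { type Assertion } from '@wix/eval-assertions';

const judgeAssertion: Assertion = {
  type: 'llm_judge',
  prompt: [
    'The agent was asked to add a health-check endpoint.',
    'Changed files:\n{{changedFiles}}',
    'Trace:\n{{trace}}',
    'Final output:\n{{output}}',
    'Score how faithfully the change matches the request.'
  ].join('\n\n'),
  minScore: 70
};
```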
The `EvaluationInput` interface describes the input data for assertion evaluation:
```typescript
interface EvaluationInput {
  outputText?: string;
  llmTrace?: LLMTrace;
  fileDiffs?: Array<{ path: string; content?: string; status?: 'new' | 'modified' }>;
}
```
When `fileDiffs` items include `status`, the `{{modifiedFiles}}` and `{{newFiles}}` placeholders are populated for the LLM judge.
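A sketch of an input whose `fileDiffs` carry `status`, paired with a prompt that uses the conditional placeholders (paths and wording are illustrative):

```typescript
import { type Assertion, type EvaluationInput } from '@wix/eval-assertions';

const inputWithDiffs: EvaluationInput = {
  outputText: 'Added a health-check route and registered it in the router.',
  fileDiffs: [
    { path: 'src/routes/health.ts', status: 'new' },
    { path: 'src/router.ts', status: 'modified' }
  ]
};

const diffJudge: Assertion = {
  type: 'llm_judge',
  prompt:
    'New files:\n{{newFiles}}\n\n' +
    'Modified files:\n{{modifiedFiles}}\n\n' +
    'Did the agent add exactly one new route file and wire it into the router?',
  minScore: 70
};
```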
The `AssertionContext` interface provides optional context for assertions:
```typescript
interface AssertionContext {
  workDir?: string; // For build_passed
  llmConfig?: { // For llm_judge
    baseUrl: string;
    headers: Record<string, string>;
  };
  generateTextForLlmJudge?: (options) => Promise<{ text: string }>; // For testing
}
```
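In tests, `generateTextForLlmJudge` lets you stub the judge call instead of hitting a real LLM. A minimal sketch, assuming the judge response is a JSON string containing a 0-100 `score` (the exact response shape, including the `reasoning` field, is an assumption):

```typescript
import { evaluateAssertions, type AssertionContext } from '@wix/eval-assertions';

const testContext: AssertionContext = {
  generateTextForLlmJudge: async () => ({
    // Assumed response shape: a JSON payload with a numeric score.
    text: JSON.stringify({ score: 85, reasoning: 'Output matches the request.' })
  })
};

const testResults = await evaluateAssertions(
  { outputText: 'Agent output here...' },
  [{ type: 'llm_judge', prompt: 'Evaluate: {{output}}', minScore: 70 }],
  testContext
);
```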
Each assertion produces an `AssertionResult`:
```typescript
interface AssertionResult {
  id: string;
  assertionId: string;
  assertionType: string;
  assertionName: string;
  status: AssertionResultStatus; // 'passed' | 'failed' | 'skipped' | 'error'
  message?: string;
  expected?: string;
  actual?: string;
  duration?: number;
  details?: Record<string, unknown>;
}
```
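For example, a small helper that summarizes a run and surfaces the expected/actual values on failures (a sketch using only the fields above):

```typescript
import { AssertionResultStatus, type AssertionResult } from '@wix/eval-assertions';

function summarize(results: AssertionResult[]): void {
  const failed = results.filter((r) => r.status === AssertionResultStatus.FAILED);
  console.log(`${results.length - failed.length}/${results.length} assertions passed`);

  for (const r of failed) {
    console.log(`- ${r.assertionName} (${r.assertionType})`);
    if (r.expected || r.actual) {
      console.log(`  expected: ${r.expected}, actual: ${r.actual}`);
    }
    if (r.message) {
      console.log(`  ${r.message}`);
    }
  }
}
```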
You can extend the framework with custom assertion types:
```typescript
import { AssertionEvaluator, type AssertionResult, AssertionResultStatus } from '@wix/eval-assertions';
import { z } from 'zod';

// Define your assertion schema
export const MyAssertionSchema = z.object({
  type: z.literal('my_assertion'),
  customField: z.string()
});

export type MyAssertion = z.infer<typeof MyAssertionSchema>;

// Implement the evaluator
export class MyAssertionEvaluator extends AssertionEvaluator {
  readonly type = 'my_assertion' as const;

  evaluate(assertion, input, context): AssertionResult {
    // Your evaluation logic here
    return {
      id: crypto.randomUUID(),
      assertionId: crypto.randomUUID(),
      assertionType: 'my_assertion',
      assertionName: 'My Custom Assertion',
      status: AssertionResultStatus.PASSED,
      message: 'Assertion passed!'
    };
  }
}

// Register your evaluator
import { registerEvaluator } from '@wix/eval-assertions';

registerEvaluator('my_assertion', new MyAssertionEvaluator());
```
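Once registered, the custom assertion can be passed to `evaluateAssertions` alongside the built-in types. A sketch, assuming registration happens before evaluation; the cast is only needed if the `Assertion` union does not include custom types:

```typescript
import { evaluateAssertions, type Assertion } from '@wix/eval-assertions';

const customAssertion = {
  type: 'my_assertion',
  customField: 'expected-value'
} as unknown as Assertion; // cast needed if the Assertion union is closed

const customResults = await evaluateAssertions(
  { outputText: 'Agent output here...' },
  [customAssertion],
  {}
);

console.log(customResults[0].status);
```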
License: MIT