# PromptQL Latency Testing Suite

Script to measure PromptQL latency.

```bash
npm install promptql-latency-test
```

> Note: This suite now uses v2 of the Natural Language API with support for specific LLM model configuration.
This testing suite provides comprehensive performance and accuracy testing for PromptQL across different environments. It measures latency, component performance, and answer accuracy using Patronus judges.
## Features

- Multi-environment Testing: Run tests across dev, staging, and production environments
- Latency Measurement: Track response times and component-level performance
- Accuracy Evaluation: Assess answer quality using Patronus judges (optional)
- Detailed Reporting: Generate comprehensive markdown reports
- Parallel Execution: Run multiple test iterations concurrently
- Memory-Efficient Processing: Incremental results writing with 80-98% memory reduction
- Debug Mode: Enable detailed logging for troubleshooting
## Prerequisites

- Node.js environment
- Required environment variables:
```bash
# Development Environment
PROMPTQL_API_KEY_DEV="your-dev-api-key"
PROMPTQL_DATA_PLANE_URL_DEV="https://promptql.dev.private-ddn.hasura.app/api/query"
DDN_URL_DEV="https://app-dev.private-ddn.hasura.app/v1/sql"
DDN_AUTH_TOKEN_DEV="your-dev-ddn-auth-token"
HASURA_PAT_DEV="your-dev-hasura-pat"
# Staging Environment
PROMPTQL_API_KEY_STAGING="your-staging-api-key"
PROMPTQL_DATA_PLANE_URL_STAGING="https://promptql.staging.private-ddn.hasura.app/api/query"
DDN_URL_STAGING="https://app-staging.private-ddn.hasura.app/v1/sql"
DDN_AUTH_TOKEN_STAGING="your-staging-ddn-auth-token"
HASURA_PAT_STAGING="your-staging-hasura-pat"
# Production Environment
PROMPTQL_API_KEY_PRODUCTION="your-production-api-key"
PROMPTQL_DATA_PLANE_URL_PRODUCTION="https://promptql.production.private-ddn.hasura.app/api/query"
DDN_URL_PRODUCTION="https://app-production.private-ddn.hasura.app/v1/sql"
DDN_AUTH_TOKEN_PRODUCTION="your-production-ddn-auth-token"
HASURA_PAT_PRODUCTION="your-production-hasura-pat"
# LLM Configuration (optional, per environment)
# Both provider and model must be set for custom LLM configuration
# If not provided, defaults to anthropic/claude-sonnet-4-20250514
SPECIFIC_LLM_PROVIDER_DEV="openai"
SPECIFIC_LLM_MODEL_DEV="gpt-4"
SPECIFIC_LLM_PROVIDER_STAGING="openai"
SPECIFIC_LLM_MODEL_STAGING="gpt-4"
SPECIFIC_LLM_PROVIDER_PRODUCTION="anthropic"
SPECIFIC_LLM_MODEL_PRODUCTION="claude-3-opus-20240229"
# Patronus Configuration (optional, shared across environments)
# If not provided, the script will run latency tests only
PATRONUS_BASE_URL="patronus-backend.internal.example.com"
PATRONUS_API_KEY="your-patronus-api-key"
PATRONUS_PROJECT_ID="your-patronus-project-id"
# Database Configuration (optional)
# Specifies the database type for query ID extraction from spans
# If set to "redshift", will look for "redshift.query_id" in span attributes
# If not set or empty, will look for ".query_id" in span attributes
DATABASE="redshift"
```
> Important: Each environment now has its own set of authentication tokens and URLs for better security and isolation.
Note: The URLs follow these patterns:
- PromptQL Data Plane URLs: `https://promptql.{env}.private-ddn.hasura.app/api/query`
- DDN URLs: `https://app-{env}.private-ddn.hasura.app/v1/sql`
- Patronus URL: `patronus-backend.internal.{domain}`
> Note: Patronus configuration is optional. If not provided, the script will automatically run latency tests only. You can also explicitly skip accuracy testing using the --skip-accuracy flag even if Patronus configuration is available.
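To make the per-environment naming concrete, here is a minimal sketch (not the suite's actual loader) of how the variables above could be resolved for a given `--env` value:

```typescript
// Hypothetical config loader; variable names follow the scheme documented above.
interface EnvConfig {
  apiKey: string;
  dataPlaneUrl: string;
  ddnUrl: string;
  ddnAuthToken: string;
  hasuraPat: string;
}

function loadEnvConfig(env: "dev" | "staging" | "production"): EnvConfig {
  const suffix = env.toUpperCase(); // e.g. "dev" -> "DEV"
  const read = (name: string): string => {
    const value = process.env[`${name}_${suffix}`];
    if (!value) throw new Error(`Missing required variable ${name}_${suffix}`);
    return value;
  };
  return {
    apiKey: read("PROMPTQL_API_KEY"),
    dataPlaneUrl: read("PROMPTQL_DATA_PLANE_URL"),
    ddnUrl: read("DDN_URL"),
    ddnAuthToken: read("DDN_AUTH_TOKEN"),
    hasuraPat: read("HASURA_PAT"),
  };
}
```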
## Installation

1. Clone the repository
2. Install dependencies:
```bash
npm install
```
3. Create a `.env` file in the tests directory with the required environment variables (see `.env.example` for the template)
## Migration Guide

If you're upgrading from a previous version that used shared authentication tokens, you'll need to update your `.env` file.
Old configuration (shared across environments):

```bash
# Shared across environments
DDN_AUTH_TOKEN="your-shared-token"
HASURA_PAT="your-shared-pat"
PROMPTQL_DATA_PLANE_URL_MAIN="..."
PROMPTQL_DATA_PLANE_URL_SECONDARY="..."
```

New configuration (per-environment credentials):

```bash
# Each environment has its own tokens and URLs
DDN_AUTH_TOKEN_DEV="your-dev-token"
DDN_AUTH_TOKEN_STAGING="your-staging-token"
DDN_AUTH_TOKEN_PRODUCTION="your-production-token"
HASURA_PAT_DEV="your-dev-pat"
HASURA_PAT_STAGING="your-staging-pat"
HASURA_PAT_PRODUCTION="your-production-pat"
PROMPTQL_DATA_PLANE_URL_DEV="..."
PROMPTQL_DATA_PLANE_URL_STAGING="..."
PROMPTQL_DATA_PLANE_URL_PRODUCTION="..."
```

Benefits of the new configuration:
- Improved Security: Each environment has isolated credentials
- Better Access Control: Different permissions per environment
- Clearer Configuration: Environment-specific naming prevents confusion
- Safer Testing: No risk of accidentally using production tokens in development
## Local Development
For local development and testing, you can use `npm test` with the same command line options:

```bash
# Run a single question test
npm test -- --env "dev" --runs 1 --questions 1

# Run multiple questions
npm test -- --env "dev" --runs 3 --questions 1,2,3

# Run all questions
npm test -- --env "dev" --runs 3 --all
```

This is equivalent to using `npx promptql-latency-test` but runs directly from your local development environment. The `--` after `npm test` is required to pass arguments to the underlying script.

## System Prompt and Environment Setup
### System Prompts
1. System Prompt Files:
- Create system prompt files in the `system_prompts` directory
- Each file should contain domain-specific instructions that define the AI's role and behavior
- You can create multiple system prompt files to test different configurations
- Example structure:
```
## Domain specific instructions
[Your specific instructions here]
```
2. Directory Structure:
```
system_prompts/
├── dev.txt # Base environment prompt
├── dev(8a0a69ff30).txt # Build-specific prompt (optional)
├── staging.txt # Base environment prompt
├── production.txt # Base environment prompt
└── marketing.txt # Alternative system prompt
```
3. Build-Specific Prompts:
- When testing against a specific build version (e.g., `dev(8a0a69ff30)`), the system will:
1. First look for a build-specific prompt (e.g., dev(8a0a69ff30).txt)
2. If not found, fall back to the base environment prompt (e.g., dev.txt)
3. Fail if neither exists
- This allows for testing different prompt configurations across builds while maintaining a default fallback (see the sketch at the end of this section)
- Example usage:
```bash
# Will use dev(8a0a69ff30).txt if it exists, otherwise fall back to dev.txt
npx promptql-latency-test --env "dev(8a0a69ff30)" --runs 3 --all

# Will use dev.txt
npx promptql-latency-test --env dev --runs 3 --all
```
4. Best Practices:
- Keep system prompts focused and specific to the domain
- Include clear instructions about data sources and response formats
- Document any special handling for specific types of questions
- Include citation and artifact formatting requirements
- Specify how to handle edge cases and uncertainties
- Use build-specific prompts for testing prompt variations without affecting the base environment
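The build-specific fallback described in step 3 can be pictured with a short sketch (illustrative only; it assumes the `system_prompts` layout shown above):

```typescript
import * as fs from "fs";
import * as path from "path";

// Hypothetical resolver: "dev(8a0a69ff30)" -> dev(8a0a69ff30).txt if it exists,
// otherwise dev.txt, otherwise an error.
function resolveSystemPrompt(envArg: string, promptsDir = "system_prompts"): string {
  const buildSpecific = path.join(promptsDir, `${envArg}.txt`);
  if (fs.existsSync(buildSpecific)) return fs.readFileSync(buildSpecific, "utf8");

  const baseEnv = envArg.replace(/\(.*\)$/, ""); // strip the "(buildId)" suffix
  const baseFile = path.join(promptsDir, `${baseEnv}.txt`);
  if (fs.existsSync(baseFile)) return fs.readFileSync(baseFile, "utf8");

  throw new Error(`No system prompt found for "${envArg}" (tried ${buildSpecific} and ${baseFile})`);
}
```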
### Environment Configuration
1. Environment File:
- Create a single `.env` file in the root directory
- This file contains all environment variables for all environments (dev, staging, production)
- The script will automatically use the correct variables based on the `--env` flag
2. Required Variables:
```bash
# Development Environment
PROMPTQL_API_KEY_DEV="your-dev-api-key"
PROMPTQL_DATA_PLANE_URL_DEV="https://promptql.dev.private-ddn.hasura.app/api/query"
DDN_URL_DEV="https://app-dev.private-ddn.hasura.app/v1/sql"
DDN_AUTH_TOKEN_DEV="your-dev-ddn-auth-token"
HASURA_PAT_DEV="your-dev-hasura-pat"
# Staging Environment
PROMPTQL_API_KEY_STAGING="your-staging-api-key"
PROMPTQL_DATA_PLANE_URL_STAGING="https://promptql.staging.private-ddn.hasura.app/api/query"
DDN_URL_STAGING="https://app-staging.private-ddn.hasura.app/v1/sql"
DDN_AUTH_TOKEN_STAGING="your-staging-ddn-auth-token"
HASURA_PAT_STAGING="your-staging-hasura-pat"
# Production Environment
PROMPTQL_API_KEY_PRODUCTION="your-production-api-key"
PROMPTQL_DATA_PLANE_URL_PRODUCTION="https://promptql.production.private-ddn.hasura.app/api/query"
DDN_URL_PRODUCTION="https://app-production.private-ddn.hasura.app/v1/sql"
DDN_AUTH_TOKEN_PRODUCTION="your-production-ddn-auth-token"
HASURA_PAT_PRODUCTION="your-production-hasura-pat"
# LLM Configuration (optional, per environment)
# Both provider and model must be set for custom LLM configuration
# If not provided, defaults to anthropic/claude-sonnet-4-20250514
SPECIFIC_LLM_PROVIDER_DEV="openai"
SPECIFIC_LLM_MODEL_DEV="gpt-4"
SPECIFIC_LLM_PROVIDER_STAGING="openai"
SPECIFIC_LLM_MODEL_STAGING="gpt-4"
SPECIFIC_LLM_PROVIDER_PRODUCTION="anthropic"
SPECIFIC_LLM_MODEL_PRODUCTION="claude-3-opus-20240229"
# Patronus Configuration (optional, shared across environments)
# If not provided, the script will run latency tests only
PATRONUS_BASE_URL="patronus-backend.internal.example.com"
PATRONUS_API_KEY="your-patronus-api-key"
PATRONUS_PROJECT_ID="your-patronus-project-id"
# Database Configuration (optional)
# Specifies the database type for query ID extraction from spans
# If set to "redshift", will look for "redshift.query_id" in span attributes
# If not set or empty, will look for ".query_id" in span attributes
DATABASE="redshift"
```
3. Environment Selection:
- Use the `--env` flag to specify which environment(s) to test against
- You can test multiple environments in a single run
- Example: `--env dev,staging,production`
- You can also specify a specific build version: `--env production(3a3d68b8c8)`

### Evalset Configuration
1. Test Questions:
- The `evalset.csv` file contains the test questions and gold answers (see the loading sketch at the end of this section)
- Each row should have:
- question: The test question
- gold_answer: The expected answer with artifacts and citations
- Example format:
```
question,gold_answer
"What is love?","Baby don't hurt me Gold Artifacts: [{"id":"123","type":"websearch"}]"
```
2. Testing Different Configurations:
- You can test the same evalset with different system prompts
- Compare results to see which prompt performs better
- Use different environments to test against different data sources
- Analyze which configuration produces the most accurate and consistent results
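For illustration, the evalset could be loaded along these lines (a sketch that assumes a CSV parser such as `csv-parse` is installed; the suite's own loader may differ):

```typescript
import * as fs from "fs";
import { parse } from "csv-parse/sync"; // assumed dependency; any CSV parser works

interface EvalCase {
  question: string;    // the test question
  gold_answer: string; // expected answer with artifacts and citations
}

// Hypothetical loader for the evalset.csv format described above.
function loadEvalset(file = "evalset.csv"): EvalCase[] {
  return parse(fs.readFileSync(file, "utf8"), {
    columns: true,          // use the header row: question, gold_answer
    skip_empty_lines: true,
  }) as EvalCase[];
}
```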
## Usage
### Basic Usage
```bash
npx promptql-latency-test --env dev,staging,production --runs 3 --all
```

### Command Line Options
- `--env`, `-e`: Environment(s) to test (comma-separated)
- Valid values: dev, staging, production
- Can specify build version using parentheses: 'production(3a3d68b8c8)'
- Examples: 'dev', 'staging,production', 'production(3a3d68b8c8)'
- --runs, -r: Number of runs per question (default: 3)
- --questions, -q: Questions to run (see the selection sketch after this list). Can be:
- A single number (e.g. 1)
- A comma-separated list (e.g. 1,2,3)
- A range (e.g. 1-3)
- A search string to match against questions (e.g. "WorkPass")
- --all, -a: Test all available questions
- --output, -o: Output file for results (default: latency_results_[timestamp].json)
- --concurrency, -c: Maximum number of concurrent questions to run (default: 5)
- --batch-size, -b: Number of questions to process in each batch (default: 10)
- --rate-limit: Maximum requests per second (0 for no limit) (default: 0)
- --batch-delay: Delay in seconds between batches of runs (default: 0)
- --num-batches: Number of batches to run (default: 1)
- --skip-accuracy: Skip accuracy testing even if Patronus configuration is available (default: false)
- --keep-incremental: Keep incremental result files after completion (default: false - files are cleaned up)
- --simple: Additionally generate simple markdown output (questions and responses only) alongside full results (default: false)
- --timeout: API request timeout in seconds (default: 60)
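As a rough illustration of how the `--questions` selector shapes map onto the evalset (not the CLI's actual parser):

```typescript
// Hypothetical interpretation of --questions; returns zero-based indexes into the evalset.
function selectQuestions(selector: string, questions: string[]): number[] {
  const range = selector.match(/^(\d+)-(\d+)$/);
  if (range) {
    const [start, end] = [Number(range[1]), Number(range[2])]; // e.g. "1-3"
    return questions.map((_, i) => i).filter((i) => i + 1 >= start && i + 1 <= end);
  }
  if (/^\d+(,\d+)*$/.test(selector)) {
    return selector.split(",").map((n) => Number(n) - 1); // "1" or "1,2,3" (1-based)
  }
  // Anything else is treated as a case-insensitive search string, e.g. "WorkPass".
  const needle = selector.toLowerCase();
  return questions
    .map((q, i) => (q.toLowerCase().includes(needle) ? i : -1))
    .filter((i) => i >= 0);
}
```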
### Memory-Efficient Processing

The testing suite uses incremental writing to minimize memory usage during large test runs. Each question's results are written to disk immediately after completion, rather than storing everything in memory.
#### How It Works
- Incremental Files: Each question's results are saved to individual files (e.g., `latency_results_dev_0.json`, `latency_results_dev_1.json`)
- Memory Efficiency: Only the current question's data is kept in memory at any time
- Final Output: A combined results file is still generated with the same structure as before
- Automatic Cleanup: By default, incremental files are automatically removed after successful completion
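Conceptually, the flow looks like this (a simplified sketch; the incremental file names follow the pattern above):

```typescript
import * as fs from "fs";

// Write one question's results to its own file immediately after completion.
function writeIncremental(env: string, questionIndex: number, result: unknown): string {
  const file = `latency_results_${env}_${questionIndex}.json`;
  fs.writeFileSync(file, JSON.stringify(result, null, 2));
  return file;
}

// Assemble the combined output at the end, then clean up unless --keep-incremental was passed.
function combineIncremental(files: string[], output: string, keepIncremental: boolean): void {
  const combined = files.map((f) => JSON.parse(fs.readFileSync(f, "utf8")));
  fs.writeFileSync(output, JSON.stringify(combined, null, 2));
  if (!keepIncremental) {
    files.forEach((f) => fs.unlinkSync(f)); // default: remove incremental files
  }
}
```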
#### Cleanup Behavior

Default Behavior (Cleanup Enabled):
```bash
npx promptql-latency-test --env dev --runs 1 --all
```
- Incremental files are automatically cleaned up after completion
- Console output shows cleanup progress:
```
✅ Results saved: latency_results.json and latency_results_summary.md
🗑️ Cleaned up incremental file: latency_results_dev_0.json
🗑️ Cleaned up incremental file: latency_results_dev_1.json
🧹 Cleanup complete: Removed 2 incremental files
```

Preserve Incremental Files:
```bash
npx promptql-latency-test --env dev --runs 1 --all --keep-incremental
```
- Incremental files are preserved for debugging or analysis
- Console output indicates files are preserved:
```
✅ Results saved: latency_results.json and latency_results_summary.md
📁 Incremental files preserved: Use --keep-incremental=false to enable cleanup
```

#### Memory Usage Benefits
- Small Tests: 80% memory reduction
- Large Tests: 98% memory reduction (e.g., 50 questions, 10 runs each)
- Scalability: Peak memory usage remains constant regardless of test size
### Examples
Run all questions:
```bash
# Run against default environments
npx promptql-latency-test --env dev,staging,production --runs 3 --all

# Run against specific build version
npx promptql-latency-test --env "production(3a3d68b8c8)" --runs 3 --all

# Run against multiple environments including specific build
npx promptql-latency-test --env "dev,staging,production(3a3d68b8c8)" --runs 3 --all
```

Run specific questions:

```bash
# Run single question by number
npx promptql-latency-test --env dev --runs 1 --questions 1

# Run multiple questions by number
npx promptql-latency-test --env dev,staging --runs 3 --questions 1,2,3

# Run a range of questions
npx promptql-latency-test --env dev,staging --runs 3 --questions 1-3

# Run questions by company name or keyword
npx promptql-latency-test --env dev,staging --runs 3 --questions "WorkPass"
npx promptql-latency-test --env dev,staging --runs 3 --questions "founders"
```

### Accuracy Testing Modes
The script supports three modes for accuracy testing:
1. Full Accuracy Testing (default):
```bash
npx promptql-latency-test --env dev --runs 3 --all
```
- Runs both latency and accuracy tests
- Requires Patronus configuration in `.env`
2. Skip Accuracy Testing:
```bash
npx promptql-latency-test --env dev --runs 3 --all --skip-accuracy
```
- Runs only latency tests
- Ignores Patronus configuration even if available
3. Automatic Mode:
```bash
npx promptql-latency-test --env dev --runs 3 --all
```
- If Patronus configuration is missing, automatically skips accuracy testing
- If Patronus configuration is present, runs accuracy tests
- No need to specify any flags
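The mode selection reduces to a simple check, roughly (illustrative sketch):

```typescript
// Hypothetical decision logic for the three modes described above.
function shouldRunAccuracy(skipAccuracyFlag: boolean, patronusConfigured: boolean): boolean {
  if (skipAccuracyFlag) return false; // --skip-accuracy always wins
  return patronusConfigured;          // otherwise, depends on whether Patronus config is present
}
```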
### Example Output

Here's an example of running a single question test:
```bash
$ npx promptql-latency-test --questions 1 --runs 1 --env dev

Total questions available: 19
Requested questions: 1
Requested runs per question: 1
Will run 1 question:
1. [Question text]
ℹ️ Info: Starting latency tests with 1 question, 1 concurrent requests per batch, 1 batch (1 total runs) across environments: dev
ℹ️ Info: Running 1 concurrent requests per batch
================================================================================
🤔 Question: [Question text]
ℹ️ Info: Starting batch 1/1 with 1 concurrent runs
dev Q1/1 R1/1 [Progress bar]
ℹ️ Info: Evaluating accuracy for question: [Question text]
dev Q1/1 R1/1 ⏱️ 22.30s
SQL: 0.25s | LLM: 16.17s | Code: 5.87s
📊 Results for dev
⏱️ Performance: 22.30s avg (22.30s min, 22.30s max)
🔧 Components: SQL 0.25s | LLM 16.17s | Code 5.87s
✅ Accuracy: Fuzzy 100% | Data 100% | Combined 100%
================================================================================
🏁 Final Summary
================================================================================
📈 Overall: 1/0 runs (1 questions, 1 runs each)
💾 Memory Usage: Initial: 132MB, Current: 129MB, Peak: 133MB
🌍 Environment Comparison
--------------------------------------------------------------------------------
❓ Question: [Question text]
1. dev 🏆 22.30s (100% success) [Fuzzy: 100%, Data: 100%, Combined: 100%]
✅ Results saved: latency_results_[timestamp].json and latency_results_[timestamp]_summary.md
```

### Debug Mode
Enable detailed logging:
```bash
DEBUG=true npx promptql-latency-test --env dev --runs 3 --all
```

## Test Results
The script generates the following output files:
1. JSON file with raw results (always generated)
2. Markdown summary with formatted analysis (always generated)
3. Simple markdown with questions and responses only (generated when the `--simple` flag is used)

### Report Contents
1. Overall Statistics
- Total questions and runs
- Success/failure rates
- Performance metrics
2. Accuracy Results
- Per-environment statistics
- Fuzzy match and data accuracy scores
- Combined pass rates
- Detailed failure analysis
3. Performance Analysis
- Environment comparison
- Component breakdown (SQL, LLM, Code)
- Latency metrics
- Performance rankings
4. Per-Question Analysis
- Environment-specific results
- Accuracy metrics
- Component performance
- Failure details
### Simple Output
When using the `--simple` flag, the script generates an additional simplified markdown file that contains:
- Questions and their responses only
- No statistics, metrics, or technical details
- Clean, readable format for non-technical stakeholders

Usage:
```bash
# Generate full results PLUS simple markdown
npx promptql-latency-test --env dev --runs 3 --all --simple
```

Output files:
- `latency_results_[timestamp].json` - Full JSON results (always generated)
- `latency_results_[timestamp]_summary.md` - Detailed markdown summary with statistics (always generated)
- `latency_results_[timestamp]_simple.md` - Simple markdown with Q&A only (generated with the `--simple` flag)

This additive approach ensures you always have complete data for analysis while also providing a simplified view when needed.
## Data Structure
### Accuracy Result
```typescript
interface AccuracyResult {
  fuzzy_match: {
    passed: boolean;
    score: number;
    details: string;
  };
  data_accuracy: {
    passed: boolean;
    score: number;
    details: string;
  };
}
```

### Run Result
```typescript
interface RunResult {
  duration: number | null;
  timestamp: string;
  run_number: number;
  trace_id: string | null;
  iterations: number | null;
  span_durations: {
    sql_engine_execute_sql: number | null;
    call_llm_streaming: number | null;
    pure_code_execution: number | null;
  };
  query_ids: string[];
  accuracy: AccuracyResult | null;
  raw_request: any;
  raw_response: any;
}
```

## Error Handling
The script includes comprehensive error handling:
- Retry logic for API calls
- Detailed error logging
- Graceful fallbacks for missing data
- Environment validation
- Configuration checks
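The exact retry policy is not documented here, but retry logic of this kind typically looks something like the following sketch:

```typescript
// Illustrative retry helper with exponential backoff (not the suite's exact implementation).
async function withRetry<T>(fn: () => Promise<T>, attempts = 3, baseDelayMs = 1000): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < attempts) {
        // Wait 1s, 2s, 4s, ... before retrying.
        await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** (attempt - 1)));
      }
    }
  }
  throw lastError;
}
```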
## Contributing
1. Fork the repository
2. Create a feature branch
3. Commit your changes
4. Push to the branch
5. Create a Pull Request
## Technical Details
### API Integration
The suite integrates with:
- PromptQL API v2: For natural language query processing with configurable LLM models
- DDN (Data Delivery Network): For data access
- Hasura GraphQL: For trace data retrieval
- Patronus API: For accuracy evaluation (optional)
### Natural Language API v2
This suite uses v2 of the Natural Language API, which includes:
- Enhanced LLM configuration with `specificLlm` field support
- Bearer token authentication in Authorization header
- Improved performance and reliability
- Backward compatibility with v1 features

### Authentication
The v2 API requires authentication via Bearer token in the Authorization header:
- Header Format: `Authorization: Bearer ${PROMPTQL_API_KEY}`
- No longer in request body: The `promptql_api_key` field has been removed from the request payload
- Per-environment keys: Each environment uses its own API key (e.g., `PROMPTQL_API_KEY_DEV`, `PROMPTQL_API_KEY_STAGING`, `PROMPTQL_API_KEY_PRODUCTION`)
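Put together, a v2 request might be issued like the sketch below (illustrative; any payload fields beyond those documented here are assumptions):

```typescript
// Illustrative v2 call: the API key goes in the Authorization header, not the request body.
async function queryPromptQL(dataPlaneUrl: string, apiKey: string, body: unknown): Promise<unknown> {
  const response = await fetch(dataPlaneUrl, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`, // Bearer token authentication
    },
    body: JSON.stringify(body),
  });
  if (!response.ok) {
    throw new Error(`PromptQL request failed with status ${response.status}`);
  }
  return response.json();
}
```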
### LLM Configuration

The v2 API allows you to specify which LLM provider and model to use via the `specificLlm` field. This is configured per environment using separate provider and model environment variables:

Development Environment:
- `SPECIFIC_LLM_PROVIDER_DEV`: LLM provider (e.g., "openai", "anthropic")
- `SPECIFIC_LLM_MODEL_DEV`: Model name (e.g., "gpt-4", "claude-3-opus-20240229")

Staging Environment:
- `SPECIFIC_LLM_PROVIDER_STAGING`: LLM provider
- `SPECIFIC_LLM_MODEL_STAGING`: Model name

Production Environment:
- `SPECIFIC_LLM_PROVIDER_PRODUCTION`: LLM provider
- `SPECIFIC_LLM_MODEL_PRODUCTION`: Model name

Default Behavior:
- If no environment variables are set, the system defaults to: `anthropic/claude-sonnet-4-20250514`
- Both provider and model must be set together for custom configuration
- If only one is set (incomplete configuration), the default will be used
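That selection rule can be sketched as follows (illustrative; the fallback values match the documented default):

```typescript
interface SpecificLlm {
  provider: string;
  model: string;
}

// Hypothetical helper: use the per-environment provider/model pair only when both are set;
// otherwise fall back to the documented default.
function resolveSpecificLlm(envSuffix: "DEV" | "STAGING" | "PRODUCTION"): SpecificLlm {
  const provider = process.env[`SPECIFIC_LLM_PROVIDER_${envSuffix}`];
  const model = process.env[`SPECIFIC_LLM_MODEL_${envSuffix}`];
  if (provider && model) {
    return { provider, model };
  }
  return { provider: "anthropic", model: "claude-sonnet-4-20250514" }; // documented default
}
```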
Example Providers and Models:

- OpenAI: provider="openai", models="gpt-4", "gpt-4-turbo-preview", "gpt-3.5-turbo"
- Anthropic: provider="anthropic", models="claude-3-opus-20240229", "claude-3-sonnet-20240229"
- Azure: provider="azure", models="gpt-4", "gpt-35-turbo"
- Cohere: provider="cohere", models="command", "command-light"
Request Structure:
```json
{
  "version": "v2",
  "llm": {
    "provider": "hasura",
    "specificLlm": {
      "provider": "openai", // or defaults to "anthropic"
      "model": "gpt-4"      // or defaults to "claude-sonnet-4-20250514"
    }
  }
}
```

Default Configuration (when no env vars are set):
```json
{
  "version": "v2",
  "llm": {
    "provider": "hasura",
    "specificLlm": {
      "provider": "anthropic",
      "model": "claude-sonnet-4-20250514"
    }
  }
}
```

## Version History
### API v2 Update
New Features:
- API v2 Support: Updated to use v2 of the Natural Language API
- Specific LLM Configuration: Added support for the `specificLlm` field with provider and model properties
- Per-Environment LLM Settings: Configure different LLM providers and models for each environment
- Bearer Token Authentication: API key now sent as Bearer token in Authorization header

Technical Improvements:
- Enhanced Model Selection: Fine-grained control over both LLM provider and model
- Improved API Compatibility: Full support for v2 API features with proper `specificLlm` object structure
- Secure Authentication: API key moved from request body to Authorization header for better security
- Default LLM Configuration: Automatically uses anthropic/claude-sonnet-4-20250514 when no custom LLM is configured
- Backward Compatibility: Maintains compatibility with existing test suites (`specificLlm` is optional)

### Incremental Results Writing
New Features:
- Incremental Results Writing: Each question's results are written to disk immediately after completion
- Memory Usage Reduction: 80-98% reduction in peak memory usage for large test suites
- Automatic Cleanup: Incremental files are automatically cleaned up after successful completion
- Flexible Cleanup Control: `--keep-incremental` flag to preserve files for debugging

Technical Improvements:
- Scalability: Peak memory usage remains constant regardless of test size
- Reliability: Results are persisted incrementally, reducing risk of data loss
- Backward Compatibility: Same output format maintained
CLI Options:
- `--keep-incremental`: Keep incremental result files after completion (default: false)

### Initial Release

- Multi-environment testing support
- Latency and accuracy measurement
- Parallel execution capabilities
- Comprehensive reporting
- Debug mode functionality