Web Codegen Scorer is a tool for evaluating the quality of web code generated by Large Language Models (LLMs).
You can use this tool to make evidence-based decisions relating to AI-generated code. For example:
* 🔄 Iterate on a system prompt to find the most effective instructions for your project.
* ⚖️ Compare the quality of code produced by different models.
* 📈 Monitor generated code quality over time as models and agents evolve.
Web Codegen Scorer is different from other code benchmarks in that it focuses specifically on _web_
code and relies primarily on well-established measures of code quality.
* ⚙️ Configure your evaluations with different models, frameworks, and tools.
* ✍️ Specify system instructions and add MCP servers.
* 📋 Use built-in checks for build success, runtime errors, accessibility, security, LLM rating, and
coding best practices. (More built-in checks coming soon!)
* 🔧 Automatically attempt to repair issues detected during code generation.
* 📊 View and compare results with an intuitive report viewer UI.


1. Install the package:
```bash
npm install -g web-codegen-scorer
```
2. Set up your API keys:
In order to run an eval, you have to specify API keys for the relevant providers as
environment variables:

```bash
export GEMINI_API_KEY="YOUR_API_KEY_HERE"    # If you're using Gemini models
export OPENAI_API_KEY="YOUR_API_KEY_HERE"    # If you're using OpenAI models
export ANTHROPIC_API_KEY="YOUR_API_KEY_HERE" # If you're using Anthropic models
export XAI_API_KEY="YOUR_API_KEY_HERE"       # If you're using xAI Grok models
```
3. Run an eval:
You can run your first eval using our Angular example with the following command:
```bash
web-codegen-scorer eval --env=angular-example
```
4. (Optional) Set up your own eval:
If you want to set up a custom eval instead of using our built-in examples, you can run the
following command, which will guide you through the process:
```bash
web-codegen-scorer init
```
5. (Optional) Run an evaluated app locally:
Once you've evaluated an app, you can run it locally with the following command:
```bash
web-codegen-scorer run --env=angular-example --prompt=
```
You can customize the `web-codegen-scorer eval` script with the following flags (see the combined example after this list for how they fit together):
- `--env=` (alias: `--environment`): (Required) Specifies the path from which to load the
  environment config.
  - Example: `web-codegen-scorer eval --env=foo/bar/my-env.mjs`
- `--model=`: Specifies the model to use when generating code. Defaults to the value of
  `DEFAULT_MODEL_NAME`.
  - Example: `web-codegen-scorer eval --model=gemini-2.5-flash --env=`
- `--autorater-model=`: Specifies the model to use when automatically rating generated code.
  Defaults to the value of `DEFAULT_AUTORATER_MODEL_NAME`.
  - Example: `web-codegen-scorer eval --model=gemini-2.5-flash --autorater-model=gemini-2.5-flash --env=`
- `--runner=`: Specifies the runner to use to execute the eval. Supported runners are `ai-sdk`
  (default), `gemini-cli`, `claude-code`, or `codex`.
- `--local`: Runs the script in local mode for the initial code generation request. Instead of
  calling the LLM, it will attempt to read the initial code from a corresponding file in the
  `.web-codegen-scorer/llm-output` directory (e.g., `.web-codegen-scorer/llm-output/todo-app.ts`).
  This is useful for re-running assessments or debugging the build/repair process without incurring
  LLM costs for the initial generation.
  - Note: You typically need to run `web-codegen-scorer eval` once without `--local` to
    generate the initial files in `.web-codegen-scorer/llm-output`.
  - The `web-codegen-scorer eval:local` script is a shortcut for `web-codegen-scorer eval --local`.
- `--limit=`: Specifies the number of application prompts to process. Defaults to 5.
  - Example: `web-codegen-scorer eval --limit=10 --env=`
- `--output-directory=` (alias: `--output-dir`): Specifies which directory to output the
  generated code under, which is useful for debugging. By default, the code will be generated in a
  temporary directory.
  - Example: `web-codegen-scorer eval --output-dir=test-output --env=`
- `--concurrency=`: Sets the maximum number of concurrent AI API requests. Defaults to 5 (as
  defined by `DEFAULT_CONCURRENCY` in `src/config.ts`).
  - Example: `web-codegen-scorer eval --concurrency=3 --env=`
- `--report-name=`: Sets the name for the generated report directory. Defaults to a
  timestamp (e.g., `2023-10-27T10-30-00-000Z`). The name will be sanitized (non-alphanumeric
  characters replaced with hyphens).
  - Example: `web-codegen-scorer eval --report-name=my-custom-report --env=`
- `--rag-endpoint=`: Specifies a custom RAG (Retrieval-Augmented Generation) endpoint URL. The
  URL must contain a `PROMPT` substring, which will be replaced with the user prompt.
  - Example: `web-codegen-scorer eval --rag-endpoint="http://localhost:8080/my-rag-endpoint?query=PROMPT" --env=`
- `--prompt-filter=`: String used to filter which prompts should be run. By default, a random
  sample (controlled by `--limit`) will be taken from the prompts in the current environment.
  Setting this can be useful for debugging a specific prompt.
  - Example: `web-codegen-scorer eval --prompt-filter=tic-tac-toe --env=`
- `--skip-screenshots`: Whether to skip taking screenshots of the generated app. Defaults to
  `false`.
  - Example: `web-codegen-scorer eval --skip-screenshots --env=`
- `--labels=`: Metadata labels that will be attached to the run.
  - Example: `web-codegen-scorer eval --labels my-label another-label --env=`
- `--mcp`: Whether to start an MCP server for the evaluation. Defaults to `false`.
  - Example: `web-codegen-scorer eval --mcp --env=`
- `--max-build-repair-attempts`: Number of repair attempts when build errors are discovered. Defaults to 1 attempt.
- `--help`: Prints out usage information about the script.
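
For reference, here is a sketch of how several of these flags can be combined in one run; the flag values are illustrative and reuse the `angular-example` environment from the getting-started steps:

```bash
# Generate and assess code for up to 10 prompts, with a custom report name and labels.
web-codegen-scorer eval \
  --env=angular-example \
  --model=gemini-2.5-flash \
  --limit=10 \
  --concurrency=3 \
  --report-name=my-custom-report \
  --labels my-label another-label

# Later, re-run the assessment without calling the LLM again, reusing the initial
# code that the run above wrote to .web-codegen-scorer/llm-output.
web-codegen-scorer eval --env=angular-example --local
```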
- Environment config reference
- How to set up a new model?
If you've cloned this repo and want to work on the tool, you have to install its dependencies by
running `pnpm install`.
Once they're installed, you can run the following commands:
* `pnpm run release-build` - Creates a release build of the package in the `dist` directory.
* `pnpm run npm-publish` - Builds the package and publishes it to npm.
* `pnpm run eval` - Runs an eval from source.
* `pnpm run report` - Runs the report app from source.
* `pnpm run init` - Runs the init script from source.
* `pnpm run format` - Formats the source code using Prettier.
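
As a rough sketch of a local development loop (the environment path below is a hypothetical placeholder, and this assumes the source scripts forward flags the same way as the published CLI):

```bash
# One-time setup after cloning the repo
pnpm install

# Run an eval from source (replace the path with your own environment config)
pnpm run eval --env=path/to/my-env.mjs

# Inspect the results in the report viewer, also running from source
pnpm run report

# Format the source code before sending a change
pnpm run format
```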
This tool is built by the Angular team at Google.
No! You can use this tool with any web library or framework (or none at all) as well as any model.
As more and more developers reach for LLM-based tools to create and modify code, we wanted to be
able to empirically measure the effect of different factors on the quality of generated code. While
many LLM coding benchmarks exist, we found that these were often too broad and didn't measure the
specific quality metrics we cared about.
In the absence of such a tool, we found that many developers based their judgements about codegen
with different models, frameworks, and tools on loosely structured trial-and-error. In contrast,
Web Codegen Scorer gives us a platform to measure codegen across different configurations with
consistency and repeatability.
Yes! We plan to expand both the number of built-in checks and the variety of codegen scenarios.
Our roadmap includes:
* Including _interaction testing_ in the rating, to ensure the generated code performs any requested
behaviors.
* Measuring Core Web Vitals.
* Measuring the effectiveness of LLM-driven edits on an existing codebase.