Web Codegen Scorer is a tool for evaluating the quality of web code generated by Large Language Models (LLMs).
You can use this tool to make evidence-based decisions relating to AI-generated code. For example:
* 🔄 Iterate on a system prompt to find the most effective instructions for your project.
* ⚖️ Compare the quality of code produced by different models.
* 📈 Monitor generated code quality over time as models and agents evolve.
Web Codegen Scorer is different from other code benchmarks in that it focuses specifically on _web_
code and relies primarily on well-established measures of code quality.
* ⚙️ Configure your evaluations with different models, frameworks, and tools.
* ✍️ Specify system instructions and add MCP servers.
* 📋 Use built-in checks for build success, runtime errors, accessibility, security, LLM rating, and
coding best practices. (More built-in checks coming soon!)
* 🔧 Automatically attempt to repair issues detected during code generation.
* 📊 View and compare results with an intuitive report viewer UI.


1. Install the package:
```bash
npm install -g web-codegen-scorer
```
2. Set up your API keys:
In order to run an eval, you have to specify API keys for the relevant providers as
environment variables:

```bash
export GEMINI_API_KEY="YOUR_API_KEY_HERE"    # If you're using Gemini models
export OPENAI_API_KEY="YOUR_API_KEY_HERE"    # If you're using OpenAI models
export ANTHROPIC_API_KEY="YOUR_API_KEY_HERE" # If you're using Anthropic models
export XAI_API_KEY="YOUR_API_KEY_HERE"       # If you're using xAI Grok models
```
3. Run an eval:
You can run your first eval using our Angular example with the following command:
```bash
web-codegen-scorer eval --env=angular-example
```
4. (Optional) Set up your own eval:
If you want to set up a custom eval instead of using our built-in examples, you can run the
following command, which will guide you through the process:
```bash
web-codegen-scorer init
```
5. (Optional) Run an evaluated app locally:
Once you've evaluated an app, you can run it locally with the following command:
```bash
web-codegen-scorer run --env=angular-example --prompt=
```
You can customize the `web-codegen-scorer eval` script with the following flags (see the combined example after this list for how they fit together):
- `--env=` (alias: `--environment`): (Required) Specifies the path from which to load the
  environment config.
  - Example: `web-codegen-scorer eval --env=foo/bar/my-env.mjs`
- `--model=`: Specifies the model to use when generating code. Defaults to the value of
  `DEFAULT_MODEL_NAME`.
  - Example: `web-codegen-scorer eval --model=gemini-2.5-flash --env=`
- `--autorater-model=`: Specifies the model to use when automatically rating generated code.
  Defaults to the value of `DEFAULT_AUTORATER_MODEL_NAME`.
  - Example: `web-codegen-scorer eval --model=gemini-2.5-flash --autorater-model=gemini-2.5-flash --env=`
- `--runner=`: Specifies the runner to use to execute the eval. Supported runners are `ai-sdk`
  (default), `gemini-cli`, `claude-code`, or `codex`.
- `--local`: Runs the script in local mode for the initial code generation request. Instead of
  calling the LLM, it will attempt to read the initial code from a corresponding file in the
  `.web-codegen-scorer/llm-output` directory (e.g., `.web-codegen-scorer/llm-output/todo-app.ts`).
  This is useful for re-running assessments or debugging the build/repair process without incurring
  LLM costs for the initial generation.
  - Note: You typically need to run `web-codegen-scorer eval` once without `--local` to
    generate the initial files in `.web-codegen-scorer/llm-output`.
  - The `web-codegen-scorer eval:local` script is a shortcut for `web-codegen-scorer eval --local`.
- `--limit=`: Specifies the number of application prompts to process. Defaults to 5.
  - Example: `web-codegen-scorer eval --limit=10 --env=`
- `--output-directory=` (alias: `--output-dir`): Specifies which directory to output the
  generated code under, which is useful for debugging. By default, the code will be generated in a
  temporary directory.
  - Example: `web-codegen-scorer eval --output-dir=test-output --env=`
- `--concurrency=`: Sets the maximum number of concurrent AI API requests. Defaults to 5 (as
  defined by `DEFAULT_CONCURRENCY` in `src/config.ts`).
  - Example: `web-codegen-scorer eval --concurrency=3 --env=`
- `--report-name=`: Sets the name for the generated report directory. Defaults to a
  timestamp (e.g., `2023-10-27T10-30-00-000Z`). The name will be sanitized (non-alphanumeric
  characters replaced with hyphens).
  - Example: `web-codegen-scorer eval --report-name=my-custom-report --env=`
- `--rag-endpoint=`: Specifies a custom RAG (Retrieval-Augmented Generation) endpoint URL. The
  URL must contain a `PROMPT` substring, which will be replaced with the user prompt.
  - Example: `web-codegen-scorer eval --rag-endpoint="http://localhost:8080/my-rag-endpoint?query=PROMPT" --env=`
- `--prompt-filter=`: String used to filter which prompts should be run. By default, a random
  sample (controlled by `--limit`) will be taken from the prompts in the current environment.
  Setting this can be useful for debugging a specific prompt.
  - Example: `web-codegen-scorer eval --prompt-filter=tic-tac-toe --env=`
- `--skip-screenshots`: Whether to skip taking screenshots of the generated app. Defaults to
  `false`.
  - Example: `web-codegen-scorer eval --skip-screenshots --env=`
- `--labels=`: Metadata labels that will be attached to the run.
  - Example: `web-codegen-scorer eval --labels my-label another-label --env=`
- `--mcp`: Whether to start an MCP server for the evaluation. Defaults to `false`.
  - Example: `web-codegen-scorer eval --mcp --env=`
- `--max-build-repair-attempts`: Number of repair attempts when build errors are discovered. Defaults to 1 attempt.
- `--help`: Prints out usage information about the script.
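
For reference, here is a sketch of how several of these flags can be combined in one run; the flag values are illustrative and reuse the `angular-example` environment from the getting-started steps:

```bash
# Generate and assess code for up to 10 prompts, with a custom report name and labels.
web-codegen-scorer eval \
  --env=angular-example \
  --model=gemini-2.5-flash \
  --limit=10 \
  --concurrency=3 \
  --report-name=my-custom-report \
  --labels my-label another-label

# Later, re-run the assessment without calling the LLM again, reusing the initial
# code that the run above wrote to .web-codegen-scorer/llm-output.
web-codegen-scorer eval --env=angular-example --local
```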
- Environment config reference
- How to set up a new model?
If you've cloned this repo and want to work on the tool, you have to install its dependencies by
running `pnpm install`.
Once they're installed, you can run the following commands:
* `pnpm run release-build` - Creates a release build of the package in the `dist` directory.
* `pnpm run npm-publish` - Builds the package and publishes it to npm.
* `pnpm run eval` - Runs an eval from source.
* `pnpm run report` - Runs the report app from source.
* `pnpm run init` - Runs the init script from source.
* `pnpm run format` - Formats the source code using Prettier.
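
As a rough sketch of a local development loop (the environment path below is a hypothetical placeholder, and this assumes the source scripts forward flags the same way as the published CLI):

```bash
# One-time setup after cloning the repo
pnpm install

# Run an eval from source (replace the path with your own environment config)
pnpm run eval --env=path/to/my-env.mjs

# Inspect the results in the report viewer, also running from source
pnpm run report

# Format the source code before sending a change
pnpm run format
```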
This tool is built by the Angular team at Google.
No! You can use this tool with any web library or framework (or none at all) as well as any model.
As more and more developers reach for LLM-based tools to create and modify code, we wanted to be
able to empirically measure the effect of different factors on the quality of generated code. While
many LLM coding benchmarks exist, we found that these were often too broad and didn't measure the
specific quality metrics we cared about.
In the absence of such a tool, we found that many developers based their judgements about codegen
with different models, frameworks, and tools on loosely structured trial-and-error. In contrast,
Web Codegen Scorer gives us a platform to measure codegen across different configurations with
consistency and repeatability.
Yes! We plan to expand both the number of built-in checks and the variety of codegen scenarios.
Our roadmap includes:
* Including _interaction testing_ in the rating, to ensure the generated code performs any requested
behaviors.
* Measuring Core Web Vitals.
* Measuring the effectiveness of LLM-driven edits on an existing codebase.