Evaluation Harness

Test and compare provider and model performance with the built-in eval system.

Overview

The evaluation harness runs standardized benchmarks against any configured provider/model combination. Use it to compare performance, measure latency, and validate outputs across providers.
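As a rough illustration of what a harness of this kind does, the Python sketch below runs a fixed set of cases through a model callable, times each call, and checks the output. It is not the built-in implementation: EvalCase, EvalResult, run_eval, and the echo_model stand-in are hypothetical names chosen for this example, and the substring check is an assumed pass criterion.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    prompt: str
    expected: str  # substring the output must contain to pass (assumed criterion)


@dataclass
class EvalResult:
    prompt: str
    passed: bool
    latency_s: float


def run_eval(model: Callable[[str], str], cases: List[EvalCase]) -> List[EvalResult]:
    """Run every case through `model`, recording pass/fail and wall-clock latency."""
    results = []
    for case in cases:
        start = time.perf_counter()
        output = model(case.prompt)
        latency = time.perf_counter() - start
        results.append(EvalResult(case.prompt, case.expected in output, latency))
    return results


def echo_model(prompt: str) -> str:
    # Hypothetical stand-in for a real provider call; it just echoes the prompt.
    return f"echo: {prompt}"


if __name__ == "__main__":
    cases = [EvalCase("2 + 2", "2 + 2"), EvalCase("capital of France", "Paris")]
    for r in run_eval(echo_model, cases):
        print(f"{r.prompt!r}: passed={r.passed} latency={r.latency_s:.6f}s")
```

Because the same cases are passed to every model callable, the pass/fail flags and latencies from two runs can be compared directly.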

Usage

❯ /evals run                 # run the standard eval suite
❯ /evals list                # list available eval benchmarks
❯ /evals results             # show previous eval results

Available Benchmarks

Run /evals list to see the available benchmarks.

Comparing Providers

Switch the provider or model with /model and re-run the same eval under each, then view the results to compare:

❯ /model openai
❯ /evals run

❯ /model deepseek-v4-flash
❯ /evals run

❯ /evals results        # side-by-side comparison
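The output format of /evals results is not described here. As a sketch of what a side-by-side comparison might compute, the Python below reduces each run to a pass rate and a mean latency and prints one row per model. The summarize and side_by_side functions, the "model-a" and "model-b" labels, and all numbers are placeholders for illustration, not measured results or part of the tool.

```python
from statistics import mean
from typing import Dict, List, Tuple

# One run is a list of (passed, latency in seconds) pairs, one per eval case.
Run = List[Tuple[bool, float]]


def summarize(run: Run) -> Tuple[float, float]:
    """Return (pass rate, mean latency in seconds) for one run."""
    pass_rate = sum(1 for passed, _ in run if passed) / len(run)
    return pass_rate, mean(latency for _, latency in run)


def side_by_side(runs: Dict[str, Run]) -> str:
    """Format a header line plus one summary row per model."""
    lines = [f"{'model':<12}{'pass rate':>10}{'mean latency (s)':>18}"]
    for name, run in runs.items():
        pass_rate, latency = summarize(run)
        lines.append(f"{name:<12}{pass_rate:>10.0%}{latency:>18.3f}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder numbers for illustration only, not measured results.
    runs = {
        "model-a": [(True, 0.5), (True, 1.5)],
        "model-b": [(True, 1.0), (False, 2.0)],
    }
    print(side_by_side(runs))
```

A comparison like this is only meaningful when every run uses the same eval cases, which is why the same /evals run command is repeated after each switch.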