Evaluation Harness
Test and compare provider and model performance with the built-in eval system.
Overview
The evaluation harness runs standardized benchmarks against any configured provider/model combination, so you can compare performance, measure latency, and validate outputs across providers.
Usage
❯ /evals run # run the standard eval suite
❯ /evals list # list available eval benchmarks
❯ /evals results # show previous eval results
Available Benchmarks
- Code generation — function-level code synthesis
- Tool calling — accuracy of tool selection and argument generation
- Reasoning — multi-step logical reasoning
- Context comprehension — long-context understanding and recall
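To make the "tool calling" metric concrete, here is a minimal sketch of how selection and argument accuracy could be scored. It is not the harness's actual implementation: the `ToolCall` type, the `score_tool_calls` function, and the tool names are all hypothetical, and the sketch assumes that expected and actual calls are compared pairwise in order.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ToolCall:
    """Hypothetical record of one tool invocation (not a harness type)."""
    name: str
    arguments: Dict[str, Any] = field(default_factory=dict)


def score_tool_calls(expected: List[ToolCall], actual: List[ToolCall]) -> Dict[str, float]:
    """Compare calls pairwise and return two accuracy fractions.

    selection_accuracy: share of calls where the tool name matches.
    argument_accuracy: share of calls where name AND arguments match exactly.
    """
    if not expected or len(expected) != len(actual):
        raise ValueError("expected and actual must be non-empty and equal in length")

    pairs = list(zip(expected, actual))
    name_hits = sum(e.name == a.name for e, a in pairs)
    exact_hits = sum(e.name == a.name and e.arguments == a.arguments for e, a in pairs)
    total = len(pairs)
    return {
        "selection_accuracy": name_hits / total,
        "argument_accuracy": exact_hits / total,
    }


# Illustrative inputs only; the tool names are made up for this sketch.
expected = [
    ToolCall("read_file", {"path": "a.txt"}),
    ToolCall("search", {"query": "eval"}),
]
actual = [
    ToolCall("read_file", {"path": "a.txt"}),
    ToolCall("search", {"query": "evals"}),  # right tool, wrong argument
]
scores = score_tool_calls(expected, actual)
```

Keeping the two fractions separate shows whether a model picks the wrong tool or picks the right tool and fills in its arguments incorrectly.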
Comparing Providers
Switch the provider or model with /model, re-run the same eval suite after each switch, then view the results together:
❯ /model openai
❯ /evals run
❯ /model deepseek-v4-flash
❯ /evals run
❯ /evals results # side-by-side comparison
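For readers who want to picture what a side-by-side comparison involves, the sketch below lines up per-benchmark scores from two runs in a plain-text table. It is an illustration under stated assumptions, not the output format of `/evals results`: the `side_by_side` function, the run labels, and the scores are hypothetical placeholders, and the sketch assumes each run reduces to one numeric score per benchmark.

```python
from typing import Dict, List


def side_by_side(results: Dict[str, Dict[str, float]]) -> List[str]:
    """Build table rows from a mapping of run label -> {benchmark: score}.

    A benchmark missing from a run is shown as "n/a" rather than 0,
    so a skipped benchmark is not mistaken for a failed one.
    """
    labels = list(results)
    benchmarks = sorted({b for scores in results.values() for b in scores})
    width = max(len(name) for name in benchmarks + ["benchmark"])

    header = "benchmark".ljust(width) + "".join(f"  {label:>12}" for label in labels)
    rows = [header]
    for bench in benchmarks:
        cells = []
        for label in labels:
            score = results[label].get(bench)
            cells.append(f"  {'n/a':>12}" if score is None else f"  {score:>12.2f}")
        rows.append(bench.ljust(width) + "".join(cells))
    return rows


# Placeholder numbers for illustration only; these are not real eval results.
runs = {
    "run-a": {"code generation": 0.5, "reasoning": 0.75},
    "run-b": {"code generation": 0.25},
}
table = side_by_side(runs)
print("\n".join(table))
```

A comparison of this kind only means something when both runs use the same benchmark suite, which is why the commands above re-run the same eval after each switch.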