This feature is experimental. Flags, output shape, and behavior may change between releases; the report schema and scoring behavior may also evolve as eval workflows mature. Use it in controlled workflows and pin versions in CI.
Use the `eval` commands to benchmark translation quality on a representative eval set and to enforce quality thresholds in CI.
Usage
eval run
Run one or more experiment variants across your eval set.

What this command does
- Loads the eval dataset from `--eval-set`.
- Expands experiment variants from your selected profiles, providers, models, and prompt override.
- Executes every case against every experiment variant.
- Writes a JSON report when you pass `--output`.
- Prints a concise per-experiment summary table.
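As a sketch of the steps above, assuming the CLI is invoked as `eval` and that repeatable flags expand into a cross-product of variants (an assumption this document does not state explicitly; paths and model names are also illustrative):

```shell
# Hypothetical invocation: two providers x two models would expand to
# four experiment variants, each executed against every case in the set.
eval run \
  --eval-set evals/translation.jsonc \
  --provider provider-a --provider provider-b \
  --model model-a --model model-b \
  --output reports/candidate.json
```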
Flags
- `--eval-set`: path to eval dataset (`.json`, `.jsonc`, `.csv`) (required)
- `--profile`: profile name override (repeatable)
- `--provider`: provider override (repeatable)
- `--model`: model override (repeatable)
- `--prompt-file`: prompt file override
- `--prompt`: inline prompt override (mutually exclusive with `--prompt-file`)
- `--output`: JSON output report path
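For instance, overriding the prompt from a file while keeping a single profile (a sketch; the binary name, profile name, and paths are assumptions):

```shell
eval run \
  --eval-set evals/translation.csv \
  --profile default \
  --prompt-file prompts/candidate.txt \
  --output reports/candidate.json
```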
Summary table fields
- `score`: average weighted quality score for the experiment
- `pass_rate`: successful runs / total runs for the experiment
- `placeholder_violations`: count of placeholder integrity hard-fail violations
- `latency_ms`: average latency for the experiment
Examples
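A sketch of the simplest invocation, using only the required dataset flag and a report path (the `eval` binary name and file paths are assumptions):

```shell
eval run --eval-set evals/translation.json --output reports/candidate.json
```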
Run with defaults from your configuration and write a report.

eval compare
Compare a candidate report with a baseline report. Use this command in CI to prevent quality regressions.

Flags
- `--candidate`: candidate report JSON path (required)
- `--baseline`: baseline report JSON path (required)
- `--min-score`: minimum candidate weighted score
- `--max-regression`: maximum allowed score regression from baseline to candidate
CI behavior
The command exits with an error when:

- the candidate weighted score is below `--min-score`, or
- the score regression exceeds `--max-regression`.
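Because failure is signaled through the exit status, a CI script can gate the pipeline directly (a sketch; the binary name, paths, and thresholds are illustrative):

```shell
# Fails the job when the candidate score is below --min-score
# or regresses past --max-regression relative to the baseline.
eval compare \
  --candidate reports/candidate.json \
  --baseline reports/baseline.json \
  --min-score 0.80 \
  --max-regression 0.02 || exit 1
```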
Examples
Compare reports and print summary values only.

Fail when the candidate score falls below 0.82 or regresses by more than 0.02.
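Sketches of the two cases above (the `eval` binary name and report paths are assumptions):

```shell
# Compare reports and print summary values only.
eval compare --candidate reports/candidate.json --baseline reports/baseline.json

# Fail when the candidate scores below 0.82 or regresses by more than 0.02.
eval compare \
  --candidate reports/candidate.json \
  --baseline reports/baseline.json \
  --min-score 0.82 \
  --max-regression 0.02
```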