This feature is experimental. Flags, output shape, and behavior may change between releases. Use in controlled workflows and pin versions in CI. Report schema and scoring behavior may evolve as eval workflows mature.
Use the `eval` commands to benchmark translation quality on a representative eval set and enforce quality thresholds in CI.
Usage
eval run
Run one or more experiment variants across your eval set.

(Eval flow diagram)
What this command does
- Loads the eval dataset from `--eval-set`.
- Expands experiment variants from dataset `experiments` or from your selected profiles, providers, models, and prompt override.
- Executes every case against every experiment variant.
- Scores each translation with the built-in heuristic lane.
- Optionally adds an LLM judge lane when you pass both `--eval-provider` and `--eval-model`.
- Reconciles heuristic quality and successful judge scores into a per-run `finalScore`.
- Writes a JSON report when you pass `--output`.
- Prints a concise per-experiment summary table.
Flags
- `--eval-set`: path to eval dataset (`.yaml`, `.yml`) (required)
- `--profile`: profile name override (repeatable)
- `--provider`: provider override (repeatable)
- `--model`: model override (repeatable)
- `--prompt-file`: prompt file override
- `--prompt`: inline prompt override (mutually exclusive with `--prompt-file`)
- `--eval-provider`: provider for LLM evaluation. Defaults to `openai` when judge eval is requested.
- `--eval-model`: model for LLM evaluation. Defaults to `gpt-5.2` when judge eval is requested.
- `--eval-prompt-file`: evaluation prompt file override
- `--eval-prompt`: inline evaluation prompt override (mutually exclusive with `--eval-prompt-file`)
- `--assertion`: judge assertion to run (repeatable). Supported values: `llm-rubric`, `factuality`, `g-eval`, `model-graded-closedqa`, `answer-relevance`, `context-faithfulness`, `context-recall`
- `--baseline`: baseline eval report JSON path for comparison when using `--interactive`
- `--output`: JSON output report path
Judge evaluation is enabled explicitly when you pass any of `--eval-provider`, `--eval-model`, `--eval-prompt`, or `--assertion`, or implicitly through eval-set judge assertions.
If judge evaluation is requested and you omit provider/model, Hyperlocalise defaults to openai and gpt-5.2.
Your translation flags keep their current meaning. The --eval-* flags only control the judge lane.
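A hypothetical invocation that enables the judge lane might look like this (the `hyperlocalise` binary name, the eval-set path, and the report path are assumptions, not confirmed by this page):

```shell
# Hypothetical invocation; binary name and paths are assumptions.
hyperlocalise eval run \
  --eval-set evals/marketing.yaml \
  --eval-provider openai \
  --eval-model gpt-5.2 \
  --assertion factuality \
  --assertion context-faithfulness \
  --output reports/candidate.json
```

Passing only `--assertion` would also enable the judge lane, in which case provider and model fall back to the defaults above.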
Reference translations are optional in LLM evaluation mode. When present, the evaluator uses them as style and tone guidance.
In the YAML eval-set format, `reference` is the normal target-side field. `assert` is optional.
Dataset-level experiments and judge are optional. If you define them in YAML and do not pass CLI overrides, eval run uses those settings directly.
If you pass any of --profile, --provider, --model, or --prompt, the CLI experiment matrix overrides dataset experiments.
If you pass any of --eval-provider, --eval-model, --eval-prompt, or --assertion, those CLI values override dataset judge fields.
If you pass no --assertion, the default judge assertion is llm-rubric.
Unknown assertion names fail fast.
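Under those rules, a minimal eval set might look like the sketch below. The `experiments`, `judge`, `reference`, and `assert` fields come from this page; `cases`, `id`, `source`, and `targetLocale` are assumed names, and the exact schema may differ:

```yaml
# Hypothetical eval set; field names outside experiments/judge/reference/assert
# are assumptions, and the exact schema may differ.
experiments:
  - profile: default
    model: gpt-5.2
judge:
  provider: openai
  model: gpt-5.2
cases:
  - id: greeting-fr
    source: "Welcome back!"
    targetLocale: fr-FR
    reference: "Bon retour parmi nous !"  # optional style and tone guidance
    assert:                               # optional explicit expectations
      - type: llm-rubric
        value: "Keeps an informal, friendly tone."
```

With no CLI overrides, `eval run` would use the `experiments` and `judge` sections directly.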
How scoring works
- `quality.weightedAggregate` is the built-in heuristic score for the run.
- `judgeAggregateScore` is the average of successful judge assertions for that run.
- `finalScore` is the reconciled score used for diagnosis in the report.
- `decision` is a coarse outcome for the run: `pass`, `review`, or `fail`.
- Translation errors force `finalScore = 0` and `decision = fail`.
- Heuristic hard fails force `finalScore = 0` and `decision = fail`.
- When the judge lane is unavailable, `finalScore` falls back to the heuristic score.
- When both lanes are available, `finalScore = 0.65 * heuristic + 0.35 * judge`.
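The reconciliation rules above can be sketched in Python (a non-normative illustration using the weights stated on this page):

```python
def reconcile(heuristic, judge=None, translation_error=False, heuristic_hard_fail=False):
    """Illustrative finalScore reconciliation following the rules above."""
    if translation_error or heuristic_hard_fail:
        return 0.0              # forced fail: finalScore = 0
    if judge is None:           # judge lane unavailable
        return heuristic        # fall back to the heuristic score
    return 0.65 * heuristic + 0.35 * judge

print(reconcile(0.9, 0.8))                          # ~0.865
print(reconcile(0.9))                               # 0.9 (heuristic fallback)
print(reconcile(0.9, 0.8, translation_error=True))  # 0.0
```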
Summary table fields
- `score`: average weighted quality score for the experiment
- `pass_rate`: successful runs / total runs for the experiment
- `placeholder_violations`: count of placeholder integrity hard-fail violations
- `latency_ms`: average latency for the experiment
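As an illustration, the summary fields could be derived from per-run results like this (the per-run dict shape and values are assumptions, not the tool's actual schema):

```python
# Hypothetical per-run results for one experiment; the dict shape is an
# assumption used only to show how the summary fields aggregate.
runs = [
    {"weightedScore": 0.91, "success": True,  "placeholderViolations": 0, "latencyMs": 820},
    {"weightedScore": 0.78, "success": True,  "placeholderViolations": 1, "latencyMs": 610},
    {"weightedScore": 0.00, "success": False, "placeholderViolations": 2, "latencyMs": 950},
]

summary = {
    "score": sum(r["weightedScore"] for r in runs) / len(runs),          # average quality
    "pass_rate": sum(r["success"] for r in runs) / len(runs),            # successful / total
    "placeholder_violations": sum(r["placeholderViolations"] for r in runs),
    "latency_ms": sum(r["latencyMs"] for r in runs) / len(runs),         # average latency
}
print(summary)
```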
Examples
Run with defaults from your configuration and write a report with `--output`.
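A hypothetical invocation (binary name and paths are assumptions):

```shell
# Hypothetical invocation; binary name and paths are assumptions.
hyperlocalise eval run \
  --eval-set evals/core.yaml \
  --output reports/candidate.json
```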
- `experimentSummaries` is the fastest way to compare model variants by `finalScore`, `weightedScore`, or pass/review/fail mix.
- `aggregate.byLocale` is the fastest way to localize regressions.
- `llmEvaluation.averageScoreByName` shows which assertion family is dragging the judge lane down.
- `assertionResults` shows whether explicit eval-set expectations passed.
- `judgeResults` details explain assertion-specific failures, such as hallucinations, unsupported claims, or missing context facts.
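A trimmed, hypothetical report skeleton using the field names described above (the nesting, extra keys, and all values are illustrative assumptions, not the tool's actual output):

```json
{
  "experimentSummaries": [
    { "experiment": "openai/gpt-5.2", "finalScore": 0.87, "weightedScore": 0.84,
      "pass": 18, "review": 1, "fail": 1 }
  ],
  "aggregate": { "byLocale": { "fr-FR": 0.91, "de-DE": 0.79 } },
  "llmEvaluation": { "averageScoreByName": { "factuality": 0.92, "context-recall": 0.71 } },
  "assertionResults": [ { "case": "greeting-fr", "type": "llm-rubric", "passed": true } ],
  "judgeResults": [ { "case": "greeting-fr", "name": "factuality", "score": 0.9 } ]
}
```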
eval compare
Compare a candidate report with a baseline report. Use this command in CI to prevent quality regressions. The workflow stays the same: run `eval run` first, then run `eval compare`.
Flags
- `--candidate`: candidate report JSON path (required)
- `--baseline`: baseline report JSON path (required)
- `--min-score`: minimum candidate score
- `--max-regression`: maximum allowed score regression from baseline to candidate
CI behavior
eval compare prefers the LLM aggregate score when both reports include a usable LLM judge aggregate. Otherwise, it falls back to the heuristic weighted score.
This means eval compare currently gates on:
- LLM judge aggregate when LLM evaluation was enabled in both reports
- heuristic aggregate when no LLM judge aggregate is available
It does not gate on the reconciled `finalScore`.
The command exits with an error when:

- the candidate score is below `--min-score`, or
- the score regression exceeds `--max-regression`.
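The exit conditions can be sketched in Python (illustrative only; how unset thresholds are treated is an assumption):

```python
def gate(candidate, baseline, min_score=None, max_regression=None):
    """Return True when the candidate passes both CI gates (illustrative)."""
    if min_score is not None and candidate < min_score:
        return False        # candidate score below --min-score
    if max_regression is not None and (baseline - candidate) > max_regression:
        return False        # regression from baseline exceeds --max-regression
    return True

print(gate(0.84, 0.85, min_score=0.82, max_regression=0.02))  # True
print(gate(0.80, 0.85, min_score=0.82, max_regression=0.02))  # False
```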
Examples
Compare reports and print summary values only, or fail CI when the candidate score falls below 0.82 or regresses by more than 0.02.
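Hypothetical invocations matching those two scenarios (binary name and report paths are assumptions):

```shell
# Hypothetical invocations; binary name and paths are assumptions.

# Print summary values only.
hyperlocalise eval compare \
  --candidate reports/candidate.json \
  --baseline reports/baseline.json

# Gate CI: fail below 0.82 or on a regression greater than 0.02.
hyperlocalise eval compare \
  --candidate reports/candidate.json \
  --baseline reports/baseline.json \
  --min-score 0.82 \
  --max-regression 0.02
```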