Choose a file format
Use YAML for eval sets.

- group one shared source string under `tests`
- define one or more locale variants under `locales`
- optionally define model variants under `experiments`
- optionally define judge config under `judge`
- put trusted target-side text under `reference`
- add `assert` only when you need explicit pass/fail expectations
YAML example
Minimal example
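Per the format rules below, only `tests[].id`, `tests[].vars.source`, and at least one entry in `tests[].locales[]` with a `locale` are required. A minimal sketch (the id, source string, and locales are illustrative):

```yaml
tests:
  - id: welcome-banner
    vars:
      source: "Welcome back, {name}!"
    locales:
      - locale: de-DE
      - locale: fr-FR
```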
Format rules
- `experiments[]` is optional
- `experiments[].provider` and `experiments[].model` are required when an experiment is present
- `experiments[].id`, `experiments[].profile`, and `experiments[].prompt` are optional
- `judge` is optional
- `judge.provider`, `judge.model`, `judge.prompt`, and `judge.assertions[]` are optional
- `tests[].id` is required
- `tests[].vars.source` is required
- `tests[].locales[]` must contain at least one locale
- `tests[].locales[].locale` is required
- `vars.query` is accepted as an alias for `vars.source`
- `vars.context` is optional
- `locales[].reference` is optional but is the normal target-side field when you have a trusted translation
- top-level `assert` is optional and applies to every locale variant in the test
- locale-level `assert` is optional and is appended for that locale only
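The two `assert` levels can be sketched as follows. The test content is illustrative, and the `type`/`value` shape of each assertion entry is an assumption (a common eval-set convention, not confirmed by this page):

```yaml
tests:
  - id: logout-button
    vars:
      source: "Log out"
    # top-level assert: applies to every locale variant below
    assert:
      - type: not_contains
        value: "Log out"
    locales:
      - locale: de-DE
        reference: "Abmelden"
        # locale-level assert: appended for de-DE only
        assert:
          - type: contains
            value: "Abmelden"
      - locale: ja-JP
```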
Experiment rules
- use `experiments` when you want the eval set itself to define which models are run
- if CLI experiment flags are unset, dataset `experiments` are used
- if you pass CLI `--profile`, `--provider`, `--model`, or `--prompt`, the CLI overrides dataset `experiments`
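An `experiments` block using the fields listed above might look like this. The ids and prompt path are illustrative; the provider and model reuse the defaults named under "Judge rules":

```yaml
experiments:
  - id: baseline
    provider: openai   # required when an experiment is present
    model: gpt-5.2     # required when an experiment is present
  - id: new-prompt
    provider: openai
    model: gpt-5.2
    prompt: prompts/translate-v2.txt   # optional, as are id and profile
```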
Judge rules
- use `judge` when you want the eval set itself to define the LLM judge configuration
- `judge.assertions` accepts the same assertion names as CLI `--assertion`
- CLI `--eval-provider`, `--eval-model`, `--eval-prompt`, and `--assertion` override dataset `judge` field by field
- if judge evaluation is requested and neither CLI nor YAML sets provider/model, Hyperlocalise defaults to `openai` and `gpt-5.2`
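A `judge` block might be sketched like this. The provider and model match the documented defaults, and the assertion names come from the judge assertion list under "Use assertions intentionally":

```yaml
judge:
  provider: openai
  model: gpt-5.2
  assertions:
    - judge.translation_quality
    - judge.factuality
```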
Pick representative coverage
Include a mix of string types so you can detect regressions across different content shapes.

- short UI strings: buttons, labels, menu items, and concise error text
- long-form strings: onboarding steps, help text, legal copy, and transactional messages
- ICU and complex formatting: plural rules, gender variants, select statements, and date or number formatting placeholders
- placeholders and variables: tokens like `{name}`, `%s`, or `{{count}}` that must survive unchanged
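For example, an ICU plural case and a date-formatting case might look like this (ids, strings, and locale choices are illustrative):

```yaml
tests:
  - id: cart-item-count
    vars:
      source: "{count, plural, one {# item} other {# items}} in your cart"
    locales:
      - locale: pl-PL   # Polish exercises several plural categories
  - id: order-shipped-email
    vars:
      source: "Your order shipped on {shipDate, date, long}."
    locales:
      - locale: ja-JP
```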
Keep context close to each case
For each case, store a stable id and include enough context for reviewers.

- keep the shared source text in `vars.source`
- include screenshots, feature names, or intent notes in `vars.context`
- put locale-specific references under each locale entry when you already have trusted translations
- keep ids stable so expanded cases stay comparable across runs
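A case with context and a trusted reference might be sketched like this (the id, strings, and locale are illustrative):

```yaml
tests:
  - id: settings-save-button
    vars:
      source: "Save"
      context: "Button label on the account settings screen; keep it short"
    locales:
      - locale: de-DE
        reference: "Speichern"   # trusted translation, target-side
```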
Use assertions intentionally
`assert` is optional. If you omit it, the eval run still produces heuristic scores, optional judge scores, and report diagnostics.

Use deterministic assertions when you know exactly what must appear in the output:

- `contains`
- `not_contains`
- `equals`

Use judge assertions when you want model-graded evaluation:

- `judge.translation_quality`
- `judge.factuality`
- `judge.g_eval`
- `judge.model_graded_closedqa`
- `judge.answer_relevance`
- `judge.context_faithfulness`
- `judge.context_recall`
- `judge.context_relevance`
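Deterministic and judge assertions can be mixed on one locale variant. A sketch, again assuming each entry takes a `type` and optional `value` key (a common convention, not confirmed by this page):

```yaml
tests:
  - id: welcome-banner
    vars:
      source: "Welcome back, {name}!"
    locales:
      - locale: de-DE
        assert:
          - type: contains
            value: "{name}"   # placeholder must survive unchanged
          - type: judge.translation_quality
```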
Maintain quality over time
Treat the eval set as production test data.

- review and refresh the set when UI or product copy changes
- remove stale cases that no longer map to active features
- keep a balance of easy, medium, and difficult strings
- run the same set repeatedly to compare model or prompt changes fairly