Use an eval set to measure translation quality on the strings that matter most to you.

Choose a file format

Use YAML for eval sets.
  • define one entry per shared source string under tests
  • define one or more locale variants under locales
  • optionally define model variants under experiments
  • optionally define judge config under judge
  • put trusted target-side text under reference
  • add assert only when you need explicit pass/fail expectations

YAML example

version: "1"
metadata:
  owner: localization
  suite: release-gate
experiments:
  - id: ollama-translategemma
    provider: ollama
    model: translategemma
  - id: ollama-lfm2-24b
    provider: ollama
    model: lfm2:24b
judge:
  provider: openai
  model: gpt-5.2
  assertions:
    - llm-rubric
    - factuality
tests:
  - id: checkout-cta
    vars:
      source: "Save account settings"
      context: "Primary CTA on the checkout settings page"
    locales:
      - locale: fr-FR
        reference: "Enregistrer les paramètres du compte"
      - locale: de-DE
        reference: "Kontoeinstellungen speichern"

Minimal example

tests:
  - id: save-button
    vars:
      source: "Save"
    locales:
      - locale: fr-FR
        reference: "Enregistrer"

Format rules

  • experiments[] is optional
  • experiments[].provider and experiments[].model are required when an experiment is present
  • experiments[].id, experiments[].profile, and experiments[].prompt are optional
  • judge is optional
  • judge.provider, judge.model, judge.prompt, and judge.assertions[] are optional
  • tests[].id is required
  • tests[].vars.source is required
  • tests[].locales[] must contain at least one locale
  • tests[].locales[].locale is required
  • vars.query is accepted as an alias for vars.source
  • vars.context is optional
  • locales[].reference is optional, but it is the standard target-side field when you have a trusted translation
  • top-level assert is optional and applies to every locale variant in the test
  • locale-level assert is optional and is appended for that locale only
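
The last two rules can be sketched like this. The type/value shape of each assert entry is an assumption for illustration; this page does not define the exact entry schema.

tests:
  - id: greeting
    vars:
      query: "Welcome back"           # accepted alias for vars.source
    assert:                           # applies to every locale variant in this test
      - type: not_contains
        value: "Welcome back"         # translation should not echo the English source
    locales:
      - locale: de-DE
        reference: "Willkommen zurück"
        assert:                       # appended for de-DE only
          - type: contains
            value: "zurück"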

Experiment rules

  • use experiments when you want the eval set itself to define which models are run
  • if CLI experiment flags are unset, dataset experiments are used
  • if you pass CLI --profile, --provider, --model, or --prompt, the CLI overrides dataset experiments
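
The optional experiment fields can be sketched as below; the profile name and prompt path are placeholders, not values this page defines.

experiments:
  - id: ollama-translategemma-tuned
    provider: ollama               # required
    model: translategemma          # required
    profile: release-gate          # optional (placeholder name)
    prompt: prompts/translate.txt  # optional (placeholder path)

Passing --profile, --provider, --model, or --prompt on the CLI replaces these dataset experiments.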

Judge rules

  • use judge when you want the eval set itself to define the LLM judge configuration
  • judge.assertions accepts the same assertion names as CLI --assertion
  • CLI --eval-provider, --eval-model, --eval-prompt, and --assertion override the dataset judge configuration field by field
  • if judge evaluation is requested and neither CLI nor YAML sets provider/model, Hyperlocalise defaults to openai and gpt-5.2
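
A fully specified judge block might look like the sketch below; the prompt path is a placeholder. Each field is overridden individually by its CLI counterpart.

judge:
  provider: openai            # overridden by --eval-provider
  model: gpt-5.2              # overridden by --eval-model
  prompt: prompts/judge.txt   # overridden by --eval-prompt (placeholder path)
  assertions:                 # overridden by --assertion
    - llm-rubric
    - factuality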

Pick representative coverage

Include a mix of string types so you can detect regressions across different content shapes.
  • short UI strings: buttons, labels, menu items, and concise error text
  • long-form strings: onboarding steps, help text, legal copy, and transactional messages
  • ICU and complex formatting: plural rules, gender variants, select statements, and date or number formatting placeholders
  • placeholders and variables: tokens like {name}, %s, or {{count}} that must survive unchanged
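
A single case covering ICU plurals and a placeholder might look like this sketch:

tests:
  - id: cart-item-count
    vars:
      source: "{count, plural, one {# item in your cart} other {# items in your cart}}"
      context: "Cart badge label; {count} is an ICU plural argument and must survive unchanged"
    locales:
      - locale: fr-FR
        reference: "{count, plural, one {# article dans votre panier} other {# articles dans votre panier}}"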

Keep context close to each case

For each case, store a stable id and include enough context for reviewers.
  • keep the shared source text in vars.source
  • include screenshots, feature names, or intent notes in vars.context
  • put locale-specific references under each locale entry when you already have trusted translations
  • keep ids stable so expanded cases stay comparable across runs

Use assertions intentionally

assert is optional. If you omit it, the eval run still produces heuristic scores, optional judge scores, and report diagnostics. Use deterministic assertions when you know exactly what must appear in the output.
  • contains
  • not_contains
  • equals
Use judge assertions when you want threshold-based scoring.
  • judge.translation_quality
  • judge.factuality
  • judge.g_eval
  • judge.model_graded_closedqa
  • judge.answer_relevance
  • judge.context_faithfulness
  • judge.context_recall
  • judge.context_relevance
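
As a sketch, a deterministic check that a placeholder like {{count}} survives translation could look like this; the type/value entry shape is an assumption, since this page does not pin down the exact assert schema.

tests:
  - id: items-left
    vars:
      source: "{{count}} items left"
    locales:
      - locale: de-DE
        assert:
          - type: contains
            value: "{{count}}"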

Maintain quality over time

Treat the eval set as production test data.
  • review and refresh the set when UI or product copy changes
  • remove stale cases that no longer map to active features
  • keep a balance of easy, medium, and difficult strings
  • run the same set repeatedly to compare model or prompt changes fairly