
Evaluation

The practice of measuring whether an LLM's output is good. In production, an eval set is a curated collection of examples with expected outputs, run on every prompt change to catch regressions. The single most underrated discipline in agent engineering.

How it works

An eval set is a list of (input, expected behaviour) pairs. Each prompt change triggers a run of the eval set: the agent produces an output for each case, a grader (often another LLM, sometimes a deterministic check) judges it, and the pass rate determines whether the change ships. Mature teams maintain eval sets of 100-500+ cases.
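
A minimal sketch of that loop, assuming a Python harness. The names here (`EvalCase`, `run_eval_set`, the grader signature) are illustrative, not a specific framework's API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    input: str     # the prompt or scenario given to the agent
    expected: str  # description of the expected behaviour

def run_eval_set(
    agent: Callable[[str], str],          # input -> output
    grader: Callable[[str, str], bool],   # (output, expected) -> pass/fail
    cases: list[EvalCase],
) -> float:
    """Run every case through the agent, grade each output,
    and return the overall pass rate."""
    passed = 0
    for case in cases:
        output = agent(case.input)
        if grader(output, case.expected):
            passed += 1
    return passed / len(cases)
```

A deterministic grader can be as simple as a substring or regex check; an LLM grader prompts a second model to judge the output against the expected behaviour and return pass/fail.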

Example

A sales prospecting agent has an eval set of 200 lead profiles, each paired with the ideal personalised outreach. After every system prompt tweak, the eval set runs in CI and the pass rate is reported; a pass rate below 90% blocks deployment.
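
A sketch of that CI gate, assuming the harness above. The 90% threshold comes from the example; the non-zero exit code is the standard way to tell a CI system to fail the job:

```python
import sys

PASS_THRESHOLD = 0.90  # below 90% blocks deployment

def gate(pass_rate: float) -> None:
    """Report the pass rate and fail the CI job if it is too low."""
    print(f"Eval pass rate: {pass_rate:.1%}")
    if pass_rate < PASS_THRESHOLD:
        print("FAIL: below threshold, blocking deployment")
        sys.exit(1)  # non-zero exit fails the CI job
    print("PASS: change can ship")
```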

Need to actually use Evaluation?

We build production AI systems that put these concepts to work. In 30 minutes, we map your use case.