How it works
An eval set is a list of (input, expected behaviour) pairs. Every prompt change triggers a run of the eval set: the agent produces an output for each input, a grader (often another LLM, sometimes a deterministic check) judges each output, and the resulting pass rate determines whether the change ships. Mature teams maintain eval sets of 100 to 500 or more cases.
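The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `EvalCase`, `run_eval`, `contains_grader`, and `echo_agent` are hypothetical names, the agent is a stand-in function rather than a real model call, and the grader is a simple deterministic substring check where a real setup might call an LLM judge.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class EvalCase:
    input: str
    expected: str  # assumption: expected behaviour is encoded as a required phrase


def run_eval(
    agent: Callable[[str], str],
    grader: Callable[[str, str], bool],
    cases: Sequence[EvalCase],
) -> float:
    """Run every case through the agent, grade each output, return the pass rate."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(1 for case in cases if grader(agent(case.input), case.expected))
    return passed / len(cases)


def contains_grader(output: str, expected: str) -> bool:
    """Deterministic check: the output must mention the expected phrase."""
    return expected.lower() in output.lower()


def echo_agent(prompt: str) -> str:
    """Stand-in for the real agent; a real one would call a model."""
    return f"Hello, regarding {prompt}"


cases = [
    EvalCase(input="pricing", expected="pricing"),    # passes: phrase is present
    EvalCase(input="renewal", expected="discount"),   # fails: phrase is missing
]
pass_rate = run_eval(echo_agent, contains_grader, cases)
```

Passing the grader in as a function keeps the loop unchanged when a deterministic check is swapped for an LLM-based judge.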
Example
A sales prospecting agent has an eval set of 200 lead profiles, each paired with the ideal personalised outreach. After every system prompt tweak, the eval set runs in CI and the pass rate is reported. A pass rate below 90% blocks deployment.
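One way the deployment gate in this example might look, assuming the per-case grading results are already available as a list of booleans. The function name `gate` and the constant `PASS_THRESHOLD` are hypothetical; the 90% threshold comes from the example, and treating an empty result list as a failure is an added assumption.

```python
from typing import Sequence

PASS_THRESHOLD = 0.90  # from the example: below 90% blocks deployment


def gate(results: Sequence[bool], threshold: float = PASS_THRESHOLD) -> int:
    """Return a process exit code: 0 allows deployment, 1 blocks it."""
    if not results:
        # Assumption: an empty run is treated as a failure, not a pass.
        return 1
    passed = sum(results)
    pass_rate = passed / len(results)
    print(f"pass rate: {pass_rate:.1%} ({passed}/{len(results)})")
    return 0 if pass_rate >= threshold else 1


# Hypothetical run over 200 cases: 179 passes falls just under the threshold.
exit_code = gate([True] * 179 + [False] * 21)
```

In a CI job, the returned code would typically be passed to `sys.exit` so that a non-zero value fails the pipeline step.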
