How to evaluate AI prompts and models

November 1, 2025

Peter Hartree

I'll write a post on this soon. In the meantime:

Shreya uses Google Sheets to run her evals, but Airtable field agents are a better "no-code" option.

If you're a developer, try promptfoo.

Ideally, your evaluations are based on deterministic code (e.g. "pass" if the response contains an expected string, "fail" otherwise). But you'll often have to use an LLM to judge your outputs. They share an example at 48:50.
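
To make the deterministic option concrete, here is a minimal sketch in Python of the "pass if the response contains a string" check. The names (EvalCase, grade_contains, run_evals, fake_model) are all hypothetical, the canned responses are made up for illustration, and the case-insensitive matching is my assumption rather than anything from the sources above. In practice you would swap fake_model for a real call to your model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    """One test case: a prompt and a string the response must contain."""
    prompt: str
    must_contain: str


def grade_contains(response: str, must_contain: str) -> str:
    """Return "pass" if the response contains the expected string, else "fail".

    Matching is case-insensitive (an assumption made for this sketch).
    """
    return "pass" if must_contain.lower() in response.lower() else "fail"


def run_evals(cases: List[EvalCase], model: Callable[[str], str]) -> Dict[str, int]:
    """Run every case through the model and count passes and fails."""
    results = [grade_contains(model(case.prompt), case.must_contain) for case in cases]
    return {"pass": results.count("pass"), "fail": results.count("fail")}


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call, with canned answers."""
    canned = {
        "What is the capital of France?": "The capital of France is Paris.",
        "What is 2 + 2?": "It equals five.",
    }
    return canned.get(prompt, "")


cases = [
    EvalCase("What is the capital of France?", "Paris"),
    EvalCase("What is 2 + 2?", "4"),
]

summary = run_evals(cases, fake_model)
print(summary)  # {'pass': 1, 'fail': 1}
```

The same structure extends to LLM-judged evals: replace grade_contains with a function that sends the response and a rubric to a judge model and maps its verdict to "pass" or "fail".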