How to evaluate AI prompts and models
- Authors: Peter Hartree (@peterhartree)
I'll write a post on this soon. In the meantime:
Shreya uses Google Sheets to run her evals, but Airtable field agents are a better "no-code" option.
If you're a developer, try PromptFoo.
Ideally, your evaluations are based on deterministic code (e.g. "pass" if response contains a string, "fail" otherwise). But you'll often have to use an LLM to judge your outputs. They share an example at 48:50.
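To make that concrete, here is a minimal Python sketch of both styles of check. Everything in it is an assumption for illustration: the names `EvalCase`, `contains_check`, `llm_judge_check` and `run_evals` are made up, the match is case-insensitive by choice, and `fake_model` and `fake_judge` are stand-ins for real model calls. It is not how PromptFoo or any other tool works internally.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class EvalCase:
    prompt: str
    must_contain: str  # string the response has to include to pass


def contains_check(response: str, must_contain: str) -> str:
    """Deterministic check: "pass" if the response contains the string."""
    # Case-insensitive matching is an assumption; use exact matching if you need it.
    return "pass" if must_contain.lower() in response.lower() else "fail"


def llm_judge_check(response: str, criterion: str, judge: Callable[[str], str]) -> str:
    """LLM-as-judge check: ask a judge model for a PASS/FAIL verdict.

    `judge` is any function that takes a prompt and returns text, so you can
    plug in whichever model client you use (hypothetical interface).
    """
    judge_prompt = (
        "You are grading a model response.\n"
        f"Criterion: {criterion}\n"
        f"Response: {response}\n"
        "Answer with exactly one word: PASS or FAIL."
    )
    verdict = judge(judge_prompt).strip().upper()
    # Anything that does not clearly start with PASS counts as a failure.
    return "pass" if verdict.startswith("PASS") else "fail"


def run_evals(
    cases: List[EvalCase], generate: Callable[[str], str]
) -> List[Tuple[str, str]]:
    """Run every case through the model and grade it with the string check."""
    results = []
    for case in cases:
        response = generate(case.prompt)
        results.append((case.prompt, contains_check(response, case.must_contain)))
    return results


def fake_model(prompt: str) -> str:
    # Stand-in for a real model call, with canned answers for the demo.
    canned = {
        "What is the capital of France?": "The capital of France is Paris.",
        "What is 2 + 2?": "It is five.",
    }
    return canned.get(prompt, "")


def fake_judge(prompt: str) -> str:
    # Stand-in for a real judge model: passes any response mentioning "sorry".
    return "PASS" if "sorry" in prompt.lower() else "FAIL"


cases = [
    EvalCase(prompt="What is the capital of France?", must_contain="Paris"),
    EvalCase(prompt="What is 2 + 2?", must_contain="4"),
]
results = run_evals(cases, fake_model)
pass_rate = sum(1 for _, grade in results if grade == "pass") / len(results)

for prompt, grade in results:
    print(f"{grade}: {prompt}")
print(f"pass rate: {pass_rate:.0%}")
```

Both checks return the same "pass"/"fail" strings, so you can start with the deterministic one and swap in the judge only for criteria that a string match cannot capture, such as tone or helpfulness.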
