Seems like everyone's number one question about building with AI is "how do you make it not shit?"
The answer is evals. They're like unit tests, but for probabilistic systems.
Here's an imaginary API to explain how they work:
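The code from the original post isn't reproduced here, so what follows is a hypothetical stand-in rather than the author's API. Every name in it (`runEval`, `EvalCase`, `Scorer`, `fakeModel`) is made up for illustration, and the "model" is a canned lookup table so the sketch runs without any real LLM call. A real task would almost certainly be async; it's kept synchronous here to keep things short.

```typescript
// Hypothetical eval harness. None of these names come from a real library.

// One test case: an input for the system and the answer we hope to see.
type EvalCase = { input: string; expected: string };

// A scorer grades one output between 0 (bad) and 1 (good).
type Scorer = (output: string, expected: string) => number;

interface EvalResult {
  scores: number[]; // one score per case, in order
  average: number;  // mean score across all cases
  passed: boolean;  // true when average >= threshold
}

// Unlike a unit test, an eval doesn't demand that every case passes.
// It runs the task over a dataset, scores each output, and compares the
// average score against a threshold.
function runEval(opts: {
  data: EvalCase[];
  task: (input: string) => string;
  scorer: Scorer;
  threshold: number;
}): EvalResult {
  const scores = opts.data.map((c) => opts.scorer(opts.task(c.input), c.expected));
  const total = scores.reduce((sum, s) => sum + s, 0);
  const average = scores.length === 0 ? 0 : total / scores.length;
  return { scores, average, passed: average >= opts.threshold };
}

// Scorer: 1 if the output mentions the expected answer (case-insensitive), else 0.
const containsExpected: Scorer = (output, expected) =>
  output.toLowerCase().indexOf(expected.toLowerCase()) >= 0 ? 1 : 0;

// Stand-in for an LLM call (assumption: a real task would call a model here).
function fakeModel(input: string): string {
  const canned: Record<string, string> = {
    "Capital of France?": "The capital of France is Paris.",
    "Capital of Japan?": "Tokyo is the capital.",
    "Capital of Australia?": "I believe it's Sydney.", // deliberately wrong
  };
  return canned[input] ?? "I don't know.";
}

const cases: EvalCase[] = [
  { input: "Capital of France?", expected: "Paris" },
  { input: "Capital of Japan?", expected: "Tokyo" },
  { input: "Capital of Australia?", expected: "Canberra" },
];

const result = runEval({
  data: cases,
  task: fakeModel,
  scorer: containsExpected,
  threshold: 0.6,
});

console.log(result.scores, result.average.toFixed(2), result.passed);
```

Two of the three answers are right, so the average is about 0.67 and the eval passes at a 0.6 threshold. Raise the threshold to 0.9 and the same run fails. That's the difference from a unit test: you track a score and a bar to clear, not a hard pass/fail on every case.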

Source: x.com/mattpocockuk/status/1858526867273199924