Set up datasets and experiments in Arize
Updating your prompts can feel like guessing. You find a new prompting technique on arXiv or Twitter that works well on a few examples, only to run into issues later. The reality of AI engineering is that prompting is non-deterministic; it’s easy to make a small change and cause performance regressions in your product.
A better approach is evaluation-driven development. With Arize, you can curate a dataset of the key cases you want to test, run your LLM task against those cases, and score the outputs with code-based evaluators, LLM-as-a-judge, or user-generated annotations to get aggregate results. This lets you test as you build and verify experiments before you deploy to customers.
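To make that concrete, here is a minimal sketch of the workflow using the open-source Arize Phoenix experiments API (the hosted Arize platform exposes a similar datasets-and-experiments client). The dataset rows, the `draft_answer` task, and the `contains_keyword` evaluator are placeholders I made up for illustration, and exact parameter names may vary across SDK versions.

```python
import pandas as pd
import phoenix as px
from phoenix.experiments import run_experiment

# 1. Curate a small dataset of the cases you care about (placeholder rows).
df = pd.DataFrame(
    {
        "question": ["How do I reset my password?", "What plans do you offer?"],
        "expected_keyword": ["reset link", "pricing"],
    }
)
dataset = px.Client().upload_dataset(
    dataset_name="support-prompt-regression-set",
    dataframe=df,
    input_keys=["question"],
    output_keys=["expected_keyword"],
)

# 2. The task under test: swap in your real prompt / LLM call here.
def draft_answer(input) -> str:
    question = input["question"]
    return f"(model output for: {question})"  # placeholder response

# 3. A simple code evaluator; an LLM-as-a-judge or human annotations work here too.
def contains_keyword(output, expected) -> bool:
    return expected["expected_keyword"].lower() in output.lower()

# 4. Run the experiment; aggregate scores show up alongside the dataset in the UI.
experiment = run_experiment(
    dataset,
    draft_answer,
    evaluators=[contains_keyword],
    experiment_name="baseline-prompt",
)
```

From there, iterating means changing the prompt inside the task and re-running the experiment against the same dataset, so every run stays comparable.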
In the video below, I run through a quick demo and accompanying notebook where I build a user research AI and iterate on its prompts!