One question we sometimes get asked is whether Arize Phoenix – our open-source library for LLM tracing and evaluation – can help with A/B testing. The answer is yes, via Projects.
Phoenix uses Projects to group LLM traces. A project is a container for all the traces related to a single application or service, and you can run multiple projects side by side, each with its own traces.
Projects are useful for a variety of use cases: separating testing from production, keeping evaluation runs apart from application runs, comparing two different applications, and more.
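As a quick illustration of routing traces to a project: Phoenix picks up the project name from its environment, and one common way to set it is the `PHOENIX_PROJECT_NAME` environment variable (an assumption about your deployment; the project name `rag-chatbot-staging` is a made-up example). A minimal sketch:

```python
import os

# Phoenix groups incoming traces under the active project name.
# Setting it before instrumentation starts (hypothetical project name):
os.environ["PHOENIX_PROJECT_NAME"] = "rag-chatbot-staging"

# Any traces emitted after this point would land in that project.
print(os.environ["PHOENIX_PROJECT_NAME"])
```

Switching the variable (or using a project-scoped context, where available) before each run is enough to keep, say, staging traffic and production traffic in separate projects.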
In this quick tutorial, I dive into A/B testing an LLM application using Projects in Phoenix, with a RAG chatbot that answers questions against a pre-built index of Arize’s documentation. The same questions are asked in each project; what differs is the model – in this case, GPT-3.5 versus GPT-4 – and, as a result, the hallucination and QA correctness rates.
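The shape of the experiment can be sketched without any Phoenix-specific calls: same question set, one project per model variant. Here `ask` is a stand-in for the real RAG chain plus OpenAI call, and the project names are made up; in the actual run, Phoenix would group each variant's traces into its own project (e.g. via the `PHOENIX_PROJECT_NAME` environment variable set before each run).

```python
def ask(question: str, model: str) -> dict:
    # Placeholder for the real RAG pipeline: retrieve context from the
    # docs index, call the model, return the traced response.
    return {"question": question, "model": model, "answer": f"[{model}] ..."}

# Identical questions for both variants, so differences in hallucination
# and QA correctness rates are attributable to the model alone.
questions = [
    "How do I upload embeddings to Arize?",
    "What is drift monitoring?",
]

# One Phoenix project per model variant (hypothetical project names).
projects = {"rag-gpt-3.5": "gpt-3.5-turbo", "rag-gpt-4": "gpt-4"}

results = {
    project: [ask(q, model) for q in questions]
    for project, model in projects.items()
}
```

With traces split this way, Phoenix's evaluators can score each project independently and the hallucination / QA correctness rates can be compared side by side.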