# 06-technical-discussion
j
I'm busy building an evaluation framework that I can use to test prompt changes and compare different models (OpenAI / Google / Anthropic / OSS), and realized this probably already exists somewhere. Any good recommendations for model evaluation tools out there?
More info on the problem: I'm looking to compare different vision models (GPTV / Gemini 1.5 / Claude 3.0 / LLaVA etc.) for issue detection, on two fronts:
1. Was an issue of a specific category detected or not (compared to ground truth on a test set, giving true/false positives/negatives)?
2. Was the description of the issue similar to the ground-truth data (AI evaluated)?
I'm looking to run a fairly small test set (~100 samples) across different models/prompts to see what works best, iterate, and track which combination is winning.
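To make that concrete, this is roughly the shape of the harness I have in mind (just a sketch, not from any existing framework; `run_model` and `judge_similarity` are placeholder hooks for the actual model API call and an AI-judge prompt):
```python
# Minimal sketch of the two checks above: (1) per-category detection vs. ground
# truth, (2) an AI-judged description similarity score. `run_model` and
# `judge_similarity` are placeholders to be wired up to real model APIs.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Sample:
    image_path: str
    true_categories: set[str]   # ground-truth issue categories present
    true_description: str       # ground-truth description of the issue(s)

@dataclass
class Result:
    tp: int = 0
    fp: int = 0
    fn: int = 0
    tn: int = 0
    description_scores: list[float] = field(default_factory=list)

def evaluate(
    samples: list[Sample],
    all_categories: set[str],
    run_model: Callable[[str], tuple[set[str], str]],   # image -> (predicted categories, description)
    judge_similarity: Callable[[str, str], float],      # (prediction, ground truth) -> score in [0, 1]
) -> Result:
    """Score one model/prompt combination over the test set."""
    result = Result()
    for sample in samples:
        predicted_categories, predicted_description = run_model(sample.image_path)
        # (1) detection: count TP/FP/FN/TN per issue category
        for category in all_categories:
            predicted = category in predicted_categories
            actual = category in sample.true_categories
            if predicted and actual:
                result.tp += 1
            elif predicted and not actual:
                result.fp += 1
            elif not predicted and actual:
                result.fn += 1
            else:
                result.tn += 1
        # (2) AI-evaluated similarity between predicted and ground-truth descriptions
        result.description_scores.append(
            judge_similarity(predicted_description, sample.true_description)
        )
    return result
```
Swapping in a different `run_model` per model/prompt combination is the part I'd want a framework to help run, track, and compare over time.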
t
j
Thanks Tom! Helpful paper. Have you had any experience with commercial / OSS systems for this (e.g. W&B, LLM Spark, etc.)?
e
Braintrust seems solid. Great customers and investors, and the team is active on Discord helping out.
ty