# 06-technical-discussion
j
I'm busy building an evaluation framework that I can use to test prompt changes and compare different models (OpenAI / Google / Anthropic / OSS), and realized this probably already exists somewhere. Any good recommendations for model evaluation tools out there?
More info on the problem: I'm looking to compare different vision models (GPTV / Gemini 1.5 / Claude 3.0 / LLaVA etc.) for issue detection, on two fronts:
1. Was an issue of a specific category detected or not (compared to ground truth on a test set, giving true/false positives/negatives)?
2. Was the description of the issue similar to the ground-truth data (AI evaluated)?
I'm looking to run a fairly small test set (~100 samples) across different models/prompts to see what works best, iterate, and track which combination is winning.
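To make that concrete, this is roughly the shape of the harness I have in mind (just a sketch, not from any existing framework; `run_model` and `judge_similarity` are placeholder hooks for the actual model API call and an AI-judge prompt):
```python
# Minimal sketch of the two checks above: (1) per-category detection vs. ground
# truth, (2) an AI-judged description similarity score. `run_model` and
# `judge_similarity` are placeholders to be wired up to real model APIs.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Sample:
    image_path: str
    true_categories: set[str]   # ground-truth issue categories present
    true_description: str       # ground-truth description of the issue(s)

@dataclass
class Result:
    tp: int = 0
    fp: int = 0
    fn: int = 0
    tn: int = 0
    description_scores: list[float] = field(default_factory=list)

def evaluate(
    samples: list[Sample],
    all_categories: set[str],
    run_model: Callable[[str], tuple[set[str], str]],   # image -> (predicted categories, description)
    judge_similarity: Callable[[str, str], float],      # (prediction, ground truth) -> score in [0, 1]
) -> Result:
    """Score one model/prompt combination over the test set."""
    result = Result()
    for sample in samples:
        predicted_categories, predicted_description = run_model(sample.image_path)
        # (1) detection: count TP/FP/FN/TN per issue category
        for category in all_categories:
            predicted = category in predicted_categories
            actual = category in sample.true_categories
            if predicted and actual:
                result.tp += 1
            elif predicted and not actual:
                result.fp += 1
            elif not predicted and actual:
                result.fn += 1
            else:
                result.tn += 1
        # (2) AI-evaluated similarity between predicted and ground-truth descriptions
        result.description_scores.append(
            judge_similarity(predicted_description, sample.true_description)
        )
    return result
```
Swapping in a different `run_model` per model/prompt combination is the part I'd want a framework to help run, track, and compare over time.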
t
j
Thanks Tom! Helpful paper. Have you had any experience with commercial / OSS systems for this (e.g. W&B, LLM Spark, etc.)?
e
Braintrust seems solid. Great customers and investors, and the team is active on Discord helping out.
ty