Using LLMs to conduct numeric evals, while increasingly popular, is finicky and unreliable.
That's the main takeaway of my latest blog with Aparna Dhinakaran, which digs into how well three major LLMs -- OpenAI's GPT-4, Anthropic's Claude, and Mistral AI's Mixtral-8x7b -- handle numeric evaluations (in short, not great).
TL;DR research takeaways:
• Numeric score evaluations across LLMs are not consistent, and small differences in prompt templates can lead to massive discrepancies in results.
• Even holding all independent variables (model, prompt template, context) constant can lead to varying results across multiple rounds of testing. LLMs are not deterministic, and some are not at all consistent in their numeric judgments (see the sketch after this list).
• We don't believe GPT-4, Claude, or Mixtral (the three models we tested) handle continuous ranges well enough to rely on them for numeric score evals.
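
If you want to see the variance for yourself, here's a minimal sketch of the kind of consistency check involved: hold the model, prompt template, and context constant, ask the judge for a numeric score repeatedly, and look at the spread. It assumes the OpenAI Python SDK; the prompt template, question, and answer below are illustrative stand-ins, not the exact templates from our tests.

```python
# Minimal sketch: probe an LLM judge's numeric-score consistency by holding
# the model, prompt template, and context constant and repeating the call.
# Assumes the OpenAI Python SDK (`pip install openai`) and OPENAI_API_KEY set.
import re
import statistics

from openai import OpenAI

client = OpenAI()

# Illustrative prompt template: grade an answer's relevance on a 0-10 scale.
PROMPT_TEMPLATE = """You are grading an answer for relevance to a question.
Question: {question}
Answer: {answer}
Respond with a single integer score from 0 (irrelevant) to 10 (fully relevant)."""


def judge_score(question: str, answer: str, model: str = "gpt-4") -> int:
    """Ask the LLM judge for a numeric score and parse the first integer."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # even at temperature 0, outputs aren't guaranteed deterministic
        messages=[
            {
                "role": "user",
                "content": PROMPT_TEMPLATE.format(question=question, answer=answer),
            }
        ],
    )
    text = response.choices[0].message.content
    match = re.search(r"\d+", text)
    if match is None:
        raise ValueError(f"Judge returned no numeric score: {text!r}")
    return int(match.group())


if __name__ == "__main__":
    question = "What is the capital of France?"
    answer = "Paris is the capital and largest city of France."
    scores = [judge_score(question, answer) for _ in range(10)]
    print("scores:", scores)
    print("mean:", statistics.mean(scores), "stdev:", statistics.pstdev(scores))
```

Swapping in Claude or Mixtral only means swapping the client call; the repeated-scoring loop and the spread calculation stay the same.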
Read it:
https://arize.com/blog-course/numeric-evals-for-llm-as-a-judge/