# 07-self-promotion
For my latest piece, written in partnership with Arize AI co-founder Aparna Dhinakaran, we ran several experiments to evaluate and compare the generation capabilities of GPT-4, Claude 2.1, and Claude 3.0 Opus. A few big takeaways:

- Inherent model behaviors and prompt engineering matter A LOT in RAG systems.
- Simply adding "Please explain yourself then answer the question" to a prompt template significantly improves (more than 2x) GPT-4's performance on an array of tasks. When an LLM talks its answer out, unfolding its reasoning seems to help; it's possible that by explaining, a model is reinforcing the right answer in embedding/attention space. (A minimal sketch of this prompt tweak follows below.)
- The verbosity of a model's responses introduces a variable that can significantly influence its perceived performance. This nuance suggests that future model evaluations should track average response length as a noted factor, giving a better understanding of a model's capabilities and ensuring a fairer comparison.

Read it: https://arize.com/blog-course/research-techniques-for-better-retrieved-generation-rag/
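For illustration, here is a minimal sketch of the "explain yourself" prompt tweak described above, using the OpenAI Python client. The template wording, variable names, and helper function are my own assumptions for demonstration, not the exact setup from the article:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical RAG prompt template; the final line is the one-sentence
# addition discussed above.
RAG_TEMPLATE = """Answer the question using only the context below.

Context:
{context}

Question: {question}

Please explain yourself then answer the question."""


def answer(question: str, context: str) -> str:
    """Fill the template with retrieved context and query GPT-4."""
    prompt = RAG_TEMPLATE.format(context=context, question=question)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```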
A longer response is also going to be slower and more expensive.