# 07-self-promotion
For my latest piece, written in partnership with Arize AI co-founder Aparna Dhinakaran, we ran several experiments to evaluate and compare the generation capabilities of GPT-4, Claude 2.1, and Claude 3.0 Opus. A few big takeaways:

- Inherent model behaviors and prompt engineering matter A LOT in RAG systems.
- Simply adding "Please explain yourself then answer the question" to a prompt template significantly improves (more than 2x) GPT-4's performance on an array of tasks. When an LLM talks its answer out, unfolding its reasoning seems to help; it's possible that by explaining, a model is reinforcing the right answer in embedding/attention space. (A minimal sketch of this prompt tweak follows below.)
- The verbosity of a model's responses introduces a variable that can significantly influence its perceived performance. This nuance suggests that future model evaluations should track average response length as a noted factor, giving a better understanding of a model's capabilities and ensuring a fairer comparison.

Read it: https://arize.com/blog-course/research-techniques-for-better-retrieved-generation-rag/
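For illustration, here is a minimal sketch of the "explain yourself" prompt tweak described above, using the OpenAI Python client. The template wording, variable names, and helper function are my own assumptions for demonstration, not the exact setup from the article:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical RAG prompt template; the final line is the one-sentence
# addition discussed above.
RAG_TEMPLATE = """Answer the question using only the context below.

Context:
{context}

Question: {question}

Please explain yourself then answer the question."""


def answer(question: str, context: str) -> str:
    """Fill the template with retrieved context and query GPT-4."""
    prompt = RAG_TEMPLATE.format(context=context, question=question)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```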
A longer response is also going to be slower and more expensive.