A community of founders and builders creating the next generation of technology.

Cerebral Valley

Are there any better ways to weigh GPT-4 responses (e.g. get confidence scores) based on domain knowledge? I'm trying something along the lines of generating multiple messages and judging which one is better. Would love it if someone could share some resources, examples, or experiences.

Do you know what the right answers should be?

<https://twitter.com/alanpog/status/1641912565918838786>

gpt4 lacks logprobs but i guess you could try to hack around that haha
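One possible workaround when logprobs aren't available (an assumption about what "hack around" could mean here, not necessarily what the linked tweet describes) is to sample the same prompt several times and treat the share of samples that agree as a rough confidence proxy. A minimal sketch, assuming the answers are short enough to compare after simple normalization; `agreement_confidence` and the sample data are hypothetical:

```python
from collections import Counter


def agreement_confidence(samples):
    """Return (most_common_answer, share_of_samples_that_match_it)."""
    if not samples:
        raise ValueError("need at least one sample")
    # Normalize lightly so trivial differences in case or whitespace don't count as disagreement.
    normalized = [s.strip().lower() for s in samples]
    answer, count = Counter(normalized).most_common(1)[0]
    return answer, count / len(normalized)


# Hypothetical outputs from asking the same question five times at temperature > 0.
samples = ["Paris", "paris", "Paris ", "Lyon", "Paris"]
answer, confidence = agreement_confidence(samples)
print(answer, confidence)  # paris 0.8
```

This is not a calibrated probability, and exact-match voting breaks down for long free-form messages, where you'd need some other way to decide that two responses "agree".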

Yes <@U054RJ92TRR>, we do have human edits on those messages that can serve as ground truth. Curious to learn what you have in mind.

One quick fix might be to use vector embeddings to compute a cosine similarity score between your ground truth and what the model generates.
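Roughly, that could look like the sketch below. To keep it runnable without an API key, `toy_embed` is a hypothetical stand-in that uses plain word counts; in practice you'd swap in vectors from a real embedding model. The example messages are made up, and `rank_candidates` is just an illustrative helper name.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in embedding (assumption): lowercase word counts, not a real model."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction, so report no similarity
    return dot / (norm_a * norm_b)


def rank_candidates(ground_truth, candidates, embed=toy_embed):
    """Return (score, candidate) pairs sorted from most to least similar."""
    truth_vec = embed(ground_truth)
    scored = [(cosine_similarity(truth_vec, embed(c)), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical data: one human-edited message and two model generations.
human_edit = "Your refund was issued today and should arrive in 3 to 5 business days."
generations = [
    "We issued your refund today; it should arrive in 3 to 5 business days.",
    "Please contact support for help with your account.",
]
for score, text in rank_candidates(human_edit, generations):
    print(f"{score:.2f}  {text}")
```

Passing a different `embed` function lets you keep the ranking logic unchanged while switching to dense vectors. Keep in mind that a high similarity score only says the generation is close to the human edit, not that it is factually correct.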