Are there better ways to weigh GPT-4 responses (for example, by getting confidence scores) based on domain knowledge? I am trying something along the lines of generating multiple responses and judging which one is best. I would love it if someone could share resources, examples, or experiences.
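
To make the "generate several and judge" idea concrete, here is a minimal sketch of one possible approach: agreement-based scoring, where the confidence is the fraction of sampled responses that agree with the most common one. It assumes the responses have already been generated for the same prompt and are short enough to compare as strings. The names `normalize` and `agreement_confidence` and the sample data are hypothetical, not part of any library.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different answers match."""
    return " ".join(text.lower().split())


def agreement_confidence(candidates: list[str]) -> tuple[str, float]:
    """Return the most common normalized answer and the share of candidates that gave it."""
    if not candidates:
        raise ValueError("need at least one candidate response")
    counts = Counter(normalize(c) for c in candidates)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(candidates)


# Hypothetical responses sampled from the model for the same prompt.
samples = ["Paris", "paris ", "Lyon", "Paris"]
best, confidence = agreement_confidence(samples)
print(best, confidence)  # prints: paris 0.75
```

Exact string matching only works for short, closed-form answers. For longer free-text responses, the vote could be replaced by a domain-specific scoring function (a rubric, a rule check, or a second model acting as judge), which is where domain knowledge would come in.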