More info on the problem: I'm looking to compare different vision models (GPT-4V, Gemini 1.5, Claude 3.0, LLaVA, etc.) for issue detection, along two dimensions:
1. Was an issue of a specific category detected or not (compared against ground truth on a test set, yielding true/false positives and negatives)?
2. Was the description of the issue similar to the ground-truth description (AI-evaluated)?
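For the first dimension, a minimal sketch of the per-category tally might look like the following. All names here (`detection_counts`, `ground_truth`, `predictions`) are assumptions for illustration, not an existing API; the inputs are assumed to map each sample ID to the set of issue categories flagged for it.

```python
from collections import Counter

def detection_counts(ground_truth, predictions, categories):
    # Tally TP/FP/FN/TN per issue category across the test set.
    counts = {c: Counter() for c in categories}
    for sample_id, truth in ground_truth.items():
        pred = predictions.get(sample_id, set())
        for c in categories:
            if c in truth and c in pred:
                counts[c]["tp"] += 1
            elif c in truth:
                counts[c]["fn"] += 1
            elif c in pred:
                counts[c]["fp"] += 1
            else:
                counts[c]["tn"] += 1
    return counts

def precision_recall(counter):
    # Guard against empty denominators on small test sets.
    tp, fp, fn = counter["tp"], counter["fp"], counter["fn"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

With ~100 samples, reporting the raw counts alongside precision/recall per category helps spot categories where the sample size is too small to trust a ratio.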
I'm looking to run a fairly small test set (~100 samples) across different model/prompt combinations to see what works best, then iterate while tracking progress and which combination is winning.
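The model/prompt sweep itself could be sketched as below. `run_model` and `score_fn` are hypothetical stand-ins: the first for whatever client call actually queries each vision model, the second for a combined score (detection metrics plus the AI-evaluated description similarity).

```python
import itertools

def sweep(models, prompts, samples, run_model, score_fn):
    # Run every model/prompt combination over the test set,
    # average the per-sample scores, and return the best combo.
    results = {}
    for model, prompt in itertools.product(models, prompts):
        scores = [score_fn(run_model(model, prompt, s), s) for s in samples]
        results[(model, prompt)] = sum(scores) / len(scores)
    best = max(results, key=results.get)
    return best, results
```

Keeping the full `results` dict (rather than just the winner) makes it easy to log each run and track progress across iterations, e.g. by appending it to a CSV or an experiment tracker.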