Vincent Koc
05/20/2025, 11:40 AM
Tiny QA Benchmark++ is a micro QA dataset for evals, along with a synthetic data generator module for creating QA pairs to evaluate (and potentially train) your models:
• We found that a synthetic dataset of fewer than 50 items was enough to surface drift in model performance (Gemini, Mistral, Llama)
• We found that a well-rounded 10-20 item QA dataset can act as an initial smoke test before running a bigger eval (saves $$ and time)
• Accuracy drift differs by topic and/or language, BUT this can be used to quickly check whether a model has coverage of a topic (e.g. a specific body of knowledge in your domain) without complex testing or analysis
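To make the smoke-test idea concrete, here is a minimal sketch of what a 10-20 item exact-match eval loop might look like. The QA pairs and the model stub below are illustrative placeholders, not the benchmark's actual schema or API:

```python
# Minimal smoke-test eval: score a model with exact-match accuracy on a
# handful of QA pairs before committing to a larger benchmark run.
# The QA pairs and the model stub here are illustrative placeholders.

def normalize(text: str) -> str:
    """Lowercase and strip surrounding whitespace and trailing punctuation
    so that e.g. 'Paris.' and 'paris' count as a match."""
    return text.strip().lower().rstrip(".?!")

def exact_match_accuracy(qa_pairs, predict) -> float:
    """Fraction of questions whose prediction matches the gold answer."""
    hits = sum(normalize(predict(q)) == normalize(a) for q, a in qa_pairs)
    return hits / len(qa_pairs)

if __name__ == "__main__":
    # Hypothetical 3-item sample; a real smoke test would use 10-20 pairs.
    sample = [
        ("What is the capital of France?", "Paris"),
        ("How many days are in a week?", "7"),
        ("What color is the sky on a clear day?", "Blue"),
    ]
    # Stub standing in for an LLM call; swap in your model's API here.
    canned = {q: a for q, a in sample}
    acc = exact_match_accuracy(sample, lambda q: canned[q])
    print(f"smoke-test accuracy: {acc:.2f}")  # → smoke-test accuracy: 1.00
```

A sharp drop in this score on a new model or topic is the early drift signal described above, at a fraction of the cost of a full eval.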
Paper: https://huggingface.co/papers/2505.12058
Github: https://github.com/vincentkoc/tiny_qa_benchmark_pp
Hugging Face Datasets: https://huggingface.co/datasets/vincentkoc/tiny_qa_benchmark_pp