Vincent Koc
05/20/2025, 11:40 AM
Tiny QA Benchmark++ is a micro QA dataset for evals, along with a synthetic data generator module for creating QA pairs to evaluate (and potentially train) your models:
• We found that a synthetic dataset of fewer than 50 items was enough to surface drift in model performance (Gemini, Mistral, Llama)
• We found that a well-rounded 10-20 item QA dataset can act as an initial smoke test before running a bigger eval (saves $$ and time)
• Accuracy drift differs by topic and/or language, BUT this can be used to quickly check whether a model has coverage of a topic (e.g. a specific body of knowledge in your domain) without complex testing or analysis
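To make the smoke-test idea concrete, here is a minimal sketch of what a 10-20 item exact-match eval loop might look like. The QA pairs and the model stub below are illustrative placeholders, not the benchmark's actual schema or API:

```python
# Minimal smoke-test eval: score a model with exact-match accuracy on a
# handful of QA pairs before committing to a larger benchmark run.
# The QA pairs and the model stub here are illustrative placeholders.

def normalize(text: str) -> str:
    """Lowercase and strip surrounding whitespace and trailing punctuation
    so that e.g. 'Paris.' and 'paris' count as a match."""
    return text.strip().lower().rstrip(".?!")

def exact_match_accuracy(qa_pairs, predict) -> float:
    """Fraction of questions whose prediction matches the gold answer."""
    hits = sum(normalize(predict(q)) == normalize(a) for q, a in qa_pairs)
    return hits / len(qa_pairs)

if __name__ == "__main__":
    # Hypothetical 3-item sample; a real smoke test would use 10-20 pairs.
    sample = [
        ("What is the capital of France?", "Paris"),
        ("How many days are in a week?", "7"),
        ("What color is the sky on a clear day?", "Blue"),
    ]
    # Stub standing in for an LLM call; swap in your model's API here.
    canned = {q: a for q, a in sample}
    acc = exact_match_accuracy(sample, lambda q: canned[q])
    print(f"smoke-test accuracy: {acc:.2f}")  # → smoke-test accuracy: 1.00
```

A sharp drop in this score on a new model or topic is the early drift signal described above, at a fraction of the cost of a full eval.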
Paper: https://huggingface.co/papers/2505.12058
Github: https://github.com/vincentkoc/tiny_qa_benchmark_pp
Hugging Face Datasets: https://huggingface.co/datasets/vincentkoc/tiny_qa_benchmark_pp