Reported Scores
| Model | Score | Source paper / domain | Year |
|---|---|---|---|
| Llama-2 | 88.3% | GuiLoMo: Allocating Expert Number and Rank for LoRA-MoE via Bilevel Optimization with GuidedSelection Vectors / arxiv.org | 2025 |
| Llama-2 | 28.9% | SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models / arxiv.org | 2024 |
| Llama-2 | 19.2% | Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations / arxiv.org | 2023 |
| Llama-2 | 14.9% | DONOD: Efficient and Generalizable Instruction Fine-Tuning for LLMs via Model-Intrinsic Dataset Pruning / arxiv.org | 2025 |
Interpretation
This page groups score claims extracted from papers for the same model and benchmark label. A nonzero spread means the public literature reports different values for this cluster.
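As a rough illustration of what a nonzero spread looks like for the table above, the sketch below computes the gap between the highest and lowest reported score. The page does not define how spread is calculated, so max minus min is an assumption here, and the names `reported_scores` and `score_spread` are hypothetical.

```python
# Scores copied from the Reported Scores table, in percent,
# keyed by a short form of each source paper's title.
reported_scores = {
    "GuiLoMo": 88.3,
    "SmallToLarge (S2L)": 28.9,
    "Math-Shepherd": 19.2,
    "DONOD": 14.9,
}


def score_spread(scores):
    """Return max minus min in percentage points.

    Assumption: "spread" means the range of reported values.
    The page itself does not state the formula.
    """
    values = list(scores.values())
    # Round to one decimal to avoid floating-point noise in the difference.
    return round(max(values) - min(values), 1)


spread = score_spread(reported_scores)
print(f"spread: {spread} percentage points")
```

Under this assumed definition, a cluster where every paper reports the same value has a spread of zero.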
Differences are not automatically errors. They may come from prompt choices, dataset versions, evaluation protocol, scoring rule, preprocessing, fine-tuning, or reporting convention. Source papers remain authoritative for their own claims. See the quality guide for how to read evidence links, manifests, and automated assessment fields.
Source coverage is a conservative count of distinct public paper URLs or titles in the cluster. It measures coverage breadth, not correctness.
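One way to picture a conservative distinct-source count is sketched below. The record fields (`title`, `url`), the normalization steps, and the function name `source_coverage` are assumptions made for illustration; the page does not describe the actual counting procedure.

```python
# Records mirror the Reported Scores table. The page lists no paper URLs,
# so "url" is left as None and titles are used as the fallback key.
records = [
    {"title": "GuiLoMo: Allocating Expert Number and Rank for LoRA-MoE via "
              "Bilevel Optimization with GuidedSelection Vectors", "url": None},
    {"title": "SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large "
              "Language Models by Summarizing Training Trajectories of Small Models",
     "url": None},
    {"title": "Math-Shepherd: Verify and Reinforce LLMs Step-by-step without "
              "Human Annotations", "url": None},
    {"title": "DONOD: Efficient and Generalizable Instruction Fine-Tuning for LLMs "
              "via Model-Intrinsic Dataset Pruning", "url": None},
]


def source_coverage(records):
    """Count distinct sources, keyed by URL when present, else by title.

    Assumption: case and whitespace differences do not make a new source,
    so keys are lowercased and whitespace-collapsed before counting.
    """
    keys = set()
    for record in records:
        url = record.get("url")
        title = record.get("title")
        if url:
            keys.add(("url", url.strip().lower()))
        elif title:
            keys.add(("title", " ".join(title.lower().split())))
    return len(keys)


coverage = source_coverage(records)
print(f"source coverage: {coverage}")
```

Collapsing near-identical keys keeps the count from growing when the same paper is extracted twice, which is one reading of "conservative".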
Source profile reports public URL domains and publication years when they are available in extracted records. It is included for auditability only.
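A source profile of this kind can be approximated by tallying domains and years over the extracted records. The sketch below does this for the four rows in the table; the tuple layout and the name `source_profile` are hypothetical, and missing values are simply skipped.

```python
from collections import Counter

# (domain, year) pairs taken from the Reported Scores table.
records = [
    ("arxiv.org", 2025),
    ("arxiv.org", 2024),
    ("arxiv.org", 2023),
    ("arxiv.org", 2025),
]


def source_profile(records):
    """Tally URL domains and publication years, ignoring missing values."""
    domains = Counter(domain for domain, _ in records if domain)
    years = Counter(year for _, year in records if year is not None)
    return {"domains": dict(domains), "years": dict(years)}


profile = source_profile(records)
print(profile)
```

The tallies describe where and when the claims were published; they say nothing about which reported score is right.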