Field Audit Ledger
CONTESTED: factual. The published literature reports irreconcilable numbers for the same model + benchmark pair across ≥ 2 independent papers. No language model is involved; this is a direct observation of the data.
CHALLENGED: open reproducibility challenge (unrebutted). Three independent audit roles found substantial, multi-angle concerns about replicability or methodological scope. Framed as an open challenge; invites rebuttal. This is not a claim that the original paper is wrong.
Open Reproducibility Challenges (8)
Each record below is a CONTESTED benchmark cluster (factual score mismatch already confirmed) where three independent audit roles also found substantial, multi-angle reproducibility or scope concerns. These are unrebutted open challenges, not falsity verdicts.
Llama-3 / LongBench
Sources:
Open reproducibility challenge (unrebutted). Invites rebuttal:
Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.
Last updated: 2026-06-10 20:34 UTC
Qwen2.5 / RULER
Sources:
Open reproducibility challenge (unrebutted). Invites rebuttal:
Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.
Last updated: 2026-06-10 20:34 UTC
Llama-3.1-8B / RULER
Sources:
Open reproducibility challenge (unrebutted). Invites rebuttal:
Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.
Last updated: 2026-06-10 20:34 UTC
Qwen2.5 / DocVQA
Sources:
Open reproducibility challenge (unrebutted). Invites rebuttal:
Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.
Last updated: 2026-06-10 20:34 UTC
GPT-3.5 / ROUGE-L
Sources:
Open reproducibility challenge (unrebutted). Invites rebuttal:
Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.
Last updated: 2026-06-10 20:34 UTC
GPT-4o / SWE-bench
Sources:
Open reproducibility challenge (unrebutted). Invites rebuttal:
Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.
Last updated: 2026-06-10 20:34 UTC
Qwen3 / MATH
Sources:
Open reproducibility challenge (unrebutted). Invites rebuttal:
Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.
Last updated: 2026-06-10 20:36 UTC
mBERT / POS
Sources:
Open reproducibility challenge (unrebutted). Invites rebuttal:
Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.
Last updated: 2026-06-10 20:36 UTC
CONTESTED by the Literature (132 total)
Factual observation: ≥ 2 independent papers report irreconcilable scores for the same model + benchmark pair. No judgment is made about which paper is correct. No language model is involved in this determination.
| Sev. | Model | Benchmark | Spread (pp) | Updated (UTC) | Papers |
|---|---|---|---|---|---|
| HIGH | GPT-4o | HumanEval | 83.23pp | 2026-06-10 20:34 | |
| HIGH | DeepSeek-R1 | MBPP | 75.8pp | 2026-06-10 20:36 | |
| HIGH | GPT-4 | BABILong | 75.0pp | 2026-06-10 20:36 | |
| HIGH | DeepSeek-Coder | MBPP | 74.0pp | 2026-06-10 20:36 | |
| HIGH | Llama-2 | MATH | 73.43pp | 2026-06-10 20:39 | |
| HIGH | Qwen2.5 | GSM8K | 70.74pp | 2026-06-10 20:34 | |
| HIGH | GPT-3.5 | ROUGE-2 | 70.7pp | 2026-06-10 20:34 | |
| HIGH | DeepSeek-R1 | MATH | 67.3pp | 2026-06-10 20:34 | |
| HIGH | DeepSeek-R1 | SWE-bench | 67.3pp | 2026-06-10 20:34 | |
| HIGH | Llama-3.1-8B | F1 | 66.6pp | 2026-06-10 20:34 | |
| HIGH | GPT-4o | MATH | 65.98pp | 2026-06-10 20:34 | |
| HIGH | GPT-3.5 | MATH | 64.84pp | 2026-06-10 20:34 | |
| HIGH | GPT-4 | GSM8K | 63.9pp | 2026-06-10 20:34 | |
| HIGH | Qwen2.5 | HumanEval | 63.74pp | 2026-06-10 20:34 | |
| HIGH | Llama-3 | HumanEval | 63.27pp | 2026-06-10 20:34 | |
| HIGH | GPT-4 | Hlce | 61.8pp | 2026-06-10 20:34 | |
| HIGH | CodeLlama-7B | MBPP | 61.8pp | 2026-06-10 20:34 | |
| HIGH | Llama-3.1-70B | GSM8K | 61.5pp | 2026-06-10 20:34 | |
| HIGH | Qwen2.5 | MMLU | 61.11pp | 2026-06-10 20:34 | |
| HIGH | GPT-3.5 | MedQA | 59.68pp | 2026-06-10 20:34 | |
| HIGH | LLaVA-Video-7B | LongVideoBench | 59.42pp | 2026-06-10 20:34 | |
| HIGH | Qwen3 | GSM8K | 58.8pp | 2026-06-10 20:34 | |
| HIGH | LLaVA-Video-72B | LongVideoBench | 58.43pp | 2026-06-10 20:34 | |
| HIGH | Llama-3 | MATH | 58.01pp | 2026-06-10 20:34 | |
| HIGH | Llama-3.1-8B | MATH | 57.56pp | 2026-06-10 20:34 | |
| HIGH | DeepSeek-R1 | CoderEval | 57.3pp | 2026-06-10 20:34 | |
| HIGH | Llama-2 | SVAMP | 57.3pp | 2026-06-10 20:34 | |
| HIGH | Qwen3 | Math500 | 57.0pp | 2026-06-10 20:34 | |
| HIGH | GPT-5 | GSM8K | 55.86pp | 2026-06-10 20:34 | |
| HIGH | Gemini-1.5-Flash | HumanEval | 55.6pp | 2026-06-10 20:34 | |
| HIGH | Llama-3 | Belebele | 54.2pp | 2026-06-10 20:34 | |
| HIGH | Qwen2.5 | MATH | 54.0pp | 2026-06-10 20:34 | |
| HIGH | GPT-4o | GSM8K | 53.99pp | 2026-06-10 20:34 | |
| HIGH | GPT-4o | Loong | 53.65pp | 2026-06-10 20:34 | |
| HIGH | Llama-3 | MMLU | 53.0pp | 2026-06-10 20:34 | |
| HIGH | Gemini-1.5-Pro | HumanEval | 52.1pp | 2026-06-10 20:34 | |
| HIGH | Qwen2.5 | AMC | 50.51pp | 2026-06-10 20:34 | |
| HIGH | Llama-3.1-8B | ROUGE-L | 50.4pp | 2026-06-10 20:34 | |
| HIGH | Qwen2.5 | HellaSwag | 48.6pp | 2026-06-10 20:34 | |
| HIGH | DeepSeek-Coder | HumanEval | 48.56pp | 2026-06-10 20:34 | |
| HIGH | Llama-2 | StrategyQA | 47.0pp | 2026-06-10 20:34 | |
| HIGH | Qwen2.5 | MBPP | 46.6pp | 2026-06-10 20:34 | |
| HIGH | GPT-4o | MMLU | 46.49pp | 2026-06-10 20:34 | |
| HIGH | GPT-4o | Recall | 45.71pp | 2026-06-10 20:34 | |
| HIGH | DeepSeek-V3 | SWE-bench | 45.67pp | 2026-06-10 20:34 | |
| HIGH | GPT-4 | Leetcode | 45.33pp | 2026-06-10 20:34 | |
| HIGH | ResNet-50 | ImageNet | 44.61pp | 2026-06-10 20:34 | |
| HIGH | Llama-3.1-8B | Avg. Accuracy | 43.83pp | 2026-06-10 20:34 | |
| HIGH | Llama-2 | GSM8K | 43.2pp | 2026-06-10 20:34 | |
| HIGH | PaLM | BBH | 43.2pp | 2026-06-10 20:34 | |
| HIGH | Llama-3 | GSM8K | 41.47pp | 2026-06-10 20:34 | |
| HIGH | Llama-3 | MT-Bench | 40.98pp | 2026-06-10 20:34 | |
| HIGH | Llama-2 | MMLU | 40.23pp | 2026-06-10 20:34 | |
| HIGH | Llama-3.1-8B | Easy | 40.0pp | 2026-06-10 20:34 | |
| HIGH | GPT-4 | MATH | 39.89pp | 2026-06-10 20:34 | |
| HIGH | CodeLlama-7B | HumanEval | 38.3pp | 2026-06-10 20:34 | |
| HIGH | Llama-2 | MultiArith | 36.2pp | 2026-06-10 20:34 | |
| HIGH | DeepSeek-R1 | ROUGE-L | 36.0pp | 2026-06-10 20:34 | |
| HIGH | Gemini-1.5-Flash | MMLU | 35.43pp | 2026-06-10 20:34 | |
| HIGH | Llama-3.1-70B | MMLU | 35.2pp | 2026-06-10 20:34 | |
| HIGH | Phi-3 | MMLU | 34.8pp | 2026-06-10 20:34 | |
| HIGH | Qwen2 | GSM8K | 34.5pp | 2026-06-10 20:34 | |
| HIGH | Llama-3 | Average | 34.35pp | 2026-06-10 20:34 | |
| HIGH | Qwen2.5 | ROUGE-L | 34.0pp | 2026-06-10 20:34 | |
| HIGH | Llama-3.1-8B | LongBench | 33.53pp | 2026-06-10 20:34 | |
| HIGH | Llama-3.1-8B | MMLU | 33.32pp | 2026-06-10 20:34 | |
| HIGH | Gemini-1.5-Pro | GSM8K | 33.2pp | 2026-06-10 20:34 | |
| HIGH | Claude-3.5 | CoderEval | 33.1pp | 2026-06-10 20:34 | |
| HIGH | Llama-3.1-70B | BBH | 33.0pp | 2026-06-10 20:34 | |
| HIGH | GPT-4 | F1 Score | 32.63pp | 2026-06-10 20:34 | |
| HIGH | GPT-3.5 | GSM8K | 32.0pp | 2026-06-10 20:34 | |
| HIGH | Qwen3 | LongBench | 31.37pp | 2026-06-10 20:34 | |
| HIGH | GPT-4 | MMLU | 30.3pp | 2026-06-10 20:34 | |
| HIGH | Gemini-1.5-Pro | MMLU | 30.3pp | 2026-06-10 20:34 | |
| HIGH | Llama-3.1-8B | Safety | 30.22pp | 2026-06-10 20:34 | |
| HIGH | Llama-3.1-70B | HotpotQA | 30.2pp | 2026-06-10 20:34 | |
| HIGH | WizardCoder | HumanEval | 30.0pp | 2026-06-10 20:34 | |
| HIGH | Llama-2 | CSQA | 29.63pp | 2026-06-10 20:34 | |
| HIGH | Llama-3.1-8B | GSM8K | 29.46pp | 2026-06-10 20:34 | |
| HIGH | Llama-3 | WinoGrande | 28.8pp | 2026-06-10 20:34 | |
| HIGH | Llama-3.1-8B | Hard | 27.7pp | 2026-06-10 20:34 | |
| HIGH | Gemini-2.0 | GSM8K | 27.58pp | 2026-06-10 20:34 | |
| HIGH | GPT-4o | CoderEval | 25.4pp | 2026-06-10 20:34 | |
| HIGH | LLaVA-Video-7B | Video-MME | 24.7pp | 2026-06-10 20:34 | |
| HIGH | Qwen2 | MATH | 24.37pp | 2026-06-10 20:34 | |
| HIGH | GPT-3.5 | HumanEval | 24.27pp | 2026-06-10 20:34 | |
| HIGH | Llama-3.1-70B | MATH | 24.2pp | 2026-06-10 20:34 | |
| HIGH | Phi-3 | MMLU-Pro | 24.1pp | 2026-06-10 20:34 | |
| HIGH | Mistral-7B | ROUGE-L | 24.0pp | 2026-06-10 20:34 | |
| HIGH | Qwen3 | OlympiadBench | 23.4pp | 2026-06-10 20:34 | |
| HIGH | GPT-4 | MBPP | 22.8pp | 2026-06-10 20:34 | |
| HIGH | GPT-4o | LongVideoBench | 19.6pp | 2026-06-10 20:34 | |
| HIGH | Llama-3.1-70B | F1 | 19.3pp | 2026-06-10 20:34 | |
| HIGH | Qwen2.5 | BLEU | 19.21pp | 2026-06-10 20:34 | |
| HIGH | Qwen2.5 | AIME | 18.15pp | 2026-06-10 20:34 | |
| HIGH | Llama-3 | SVAMP | 17.87pp | 2026-06-10 20:34 | |
| HIGH | GPT-4 | HumanEval | 17.79pp | 2026-06-10 20:34 | |
| HIGH | Gemini-2.0 | MATH | 17.46pp | 2026-06-10 20:34 | |
| HIGH | Llama-2 | Coin Flip | 17.0pp | 2026-06-10 20:34 | |
| HIGH | Qwen3 | Accuracy | 17.0pp | 2026-06-10 20:34 | |
| HIGH | LLaMA | PPL | 15.95pp | 2026-06-10 20:34 | |
| HIGH | DistilBERT | SQuAD | 14.44pp | 2026-06-10 20:34 | |
| HIGH | Llama-2 | RULER | 14.4pp | 2026-06-10 20:34 | |
| HIGH | Llama-3 | MMLU-Pro | 14.2pp | 2026-06-10 20:34 | |
| HIGH | GPT-4o | Precision | 13.9pp | 2026-06-10 20:34 | |
| HIGH | Llama-2 | Accuracy | 13.33pp | 2026-06-10 20:34 | |
| HIGH | Llama-3 | IFEval | 12.88pp | 2026-06-10 20:34 | |
| HIGH | GLM-4 | SWE-bench | 11.9pp | 2026-06-10 20:34 | |
| HIGH | GPT-4 | MMLU-Pro | 11.3pp | 2026-06-10 20:34 | |
| HIGH | Gemini-1.5-Pro | EgoSchema | 10.7pp | 2026-06-10 20:34 | |
| HIGH | LLaMA-7B | C4 | 10.68pp | 2026-06-10 20:34 | |
| HIGH | LLaDA-MoE-7B-A1B | MATH | 10.4pp | 2026-06-10 20:34 | |
| HIGH | Qwen3 | HellaSwag | 10.4pp | 2026-06-10 20:34 | |
| MEDIUM | LLaMA-13B | C4 | 9.56pp | 2026-06-10 20:34 | |
| MEDIUM | LoRA-FAIR | DomainNet | 8.6pp | 2026-06-10 20:34 | |
| MEDIUM | DeepSeek-R1 | MMLU | 8.3pp | 2026-06-10 20:34 | |
| MEDIUM | IRCoT | MuSiQue | 8.3pp | 2026-06-10 20:34 | |
| MEDIUM | GPT-3.5 | MMLU | 7.7pp | 2026-06-10 20:34 | |
| MEDIUM | DeepSeek-R1 | HumanEval | 7.3pp | 2026-06-10 20:34 | |
| MEDIUM | LLaDA-MoE-7B-A1B | GSM8K | 7.0pp | 2026-06-10 20:34 | |
| MEDIUM | GPT-4 | SWE-bench | 6.8pp | 2026-06-10 20:34 | |
| MEDIUM | DeepSeek-33B | HumanEval | 6.7pp | 2026-06-10 20:34 | |
| MEDIUM | Llama-3 | Accuracy | 6.6pp | 2026-06-10 20:34 | |
| MEDIUM | Qwen3 | LiveCodeBench | 6.57pp | 2026-06-10 20:34 | |
| MEDIUM | Gemini-1.5-Pro | MATH | 6.5pp | 2026-06-10 20:34 | |
| MEDIUM | Llama-3.1-8B | RewardBench | 6.2pp | 2026-06-10 20:34 | |
| MEDIUM | Llama-3.1-8B | IFEval | 6.1pp | 2026-06-10 20:34 | |
| MEDIUM | Llama-3.1-70B | MuSiQue | 6.0pp | 2026-06-10 20:34 | |
| MEDIUM | LightGCL | Amazon-Book | 5.76pp | 2026-06-10 20:34 | |
| MEDIUM | Claude-3.5 | SWE-bench | 5.1pp | 2026-06-10 20:34 | |
| LOW | Qwen3 | MMLU | 4.8pp | 2026-06-10 20:34 | |
| LOW | Llama-3.1-8B | HellaSwag | 4.7pp | 2026-06-10 20:34 | |
| LOW | Llama-3.1-70B | RewardBench | 4.7pp | 2026-06-10 20:34 | |
| LOW | Claude-3.5 | HumanEval | 4.4pp | 2026-06-10 20:34 | |
| LOW | Qwen2.5 | SWE-bench | 3.5pp | 2026-06-10 20:34 | |
| LOW | Qwen3 | AIME | 3.3pp | 2026-06-10 20:34 | |
| LOW | Gemma-2-9B | MMLU-Pro | 3.1pp | 2026-06-10 20:34 | |
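
A minimal sketch of how a spread and severity tier like those in the table above could be derived from per-paper scores. Everything here is an assumption for illustration: the ledger does not publish its method, the function names `severity` and `contested_pairs` are hypothetical, and the 10pp and 5pp cut-offs are inferred only from the listed rows (the smallest HIGH is 10.4pp, MEDIUM runs from 5.1pp to 9.56pp, the largest LOW is 4.8pp). The sketch takes the spread to be the maximum minus the minimum reported score and does not model what makes two scores irreconcilable. The sample records are made up.

```python
from collections import defaultdict

# Assumed cut-offs, inferred from the table rows; not a documented rule.
HIGH_MIN_PP = 10.0
MEDIUM_MIN_PP = 5.0


def severity(spread_pp):
    """Map a spread in percentage points to an assumed severity tier."""
    if spread_pp >= HIGH_MIN_PP:
        return "HIGH"
    if spread_pp >= MEDIUM_MIN_PP:
        return "MEDIUM"
    return "LOW"


def contested_pairs(reports):
    """Compute the score spread for each (model, benchmark) pair.

    `reports` is an iterable of (paper, model, benchmark, score) tuples.
    Only pairs reported by at least two distinct papers are returned,
    sorted by spread in descending order.
    """
    by_pair = defaultdict(dict)
    for paper, model, benchmark, score in reports:
        # One score per paper per pair; a later record overwrites an earlier one.
        by_pair[(model, benchmark)][paper] = score

    rows = []
    for (model, benchmark), scores in by_pair.items():
        if len(scores) < 2:
            continue  # a single paper cannot disagree with itself
        spread = round(max(scores.values()) - min(scores.values()), 2)
        rows.append((severity(spread), model, benchmark, spread, len(scores)))
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows


# Made-up records for illustration only; not taken from the ledger.
reports = [
    ("paper-1", "model-a", "bench-x", 71.5),
    ("paper-2", "model-a", "bench-x", 60.0),
    ("paper-1", "model-b", "bench-x", 40.0),
    ("paper-3", "model-b", "bench-x", 43.5),
    ("paper-2", "model-c", "bench-y", 55.0),
]

for sev, model, bench, spread, n_papers in contested_pairs(reports):
    print(f"{sev} | {model} | {bench} | {spread}pp | {n_papers} papers")
```

Here `model-c` is dropped because only one paper reports it, which mirrors the "≥ 2 independent papers" condition stated for this section.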