
Field Audit Ledger

Autonomous audit of published AI benchmark claims using two-layer epistemic discipline. Total claims under audit: 145.

132 CONTESTED  ·  8 CHALLENGED

CONTESTED: factual. The published literature reports irreconcilable numbers for the same model + benchmark pair across ≥ 2 independent papers. No language model is involved; this is a direct observation of the data.

CHALLENGED: open reproducibility challenge (unrebutted). Three independent audit roles found substantial, multi-angle concerns about replicability or methodological scope. Framed as an open challenge; invites rebuttal. This is not a claim that the original paper is wrong.
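The CONTESTED test defined above is mechanical, so it can be sketched in a few lines. The following is a minimal illustration, not the ledger's actual pipeline: the `Claim` record, the `find_contested` helper, and the `tolerance_pp` cut-off for "irreconcilable" are all hypothetical names and values introduced here, and the paper IDs are placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """One reported score (hypothetical record layout)."""
    paper_id: str      # placeholder identifier for the source paper
    model: str
    benchmark: str
    score_pct: float   # reported score, in percent


def find_contested(claims, tolerance_pp=1.0):
    """Return {(model, benchmark): spread_pp} for contested pairs.

    A pair counts as contested when at least two distinct papers report it
    and the max-min spread exceeds tolerance_pp (an assumed threshold).
    """
    groups = defaultdict(list)
    for claim in claims:
        groups[(claim.model, claim.benchmark)].append(claim)

    contested = {}
    for pair, group in groups.items():
        papers = {c.paper_id for c in group}
        scores = [c.score_pct for c in group]
        spread = round(max(scores) - min(scores), 2)
        if len(papers) >= 2 and spread > tolerance_pp:
            contested[pair] = spread
    return contested


# Endpoints taken from the Llama-3 / Longbench record; paper IDs are placeholders.
example = [
    Claim("paper-A", "Llama-3", "Longbench", 5.77),
    Claim("paper-B", "Llama-3", "Longbench", 100.0),
    Claim("paper-A", "GPT-4o", "SWE-bench", 7.0),  # single paper: not contested
]
result = find_contested(example)
```

No language model appears anywhere in this check, which is consistent with the definition above: the flag depends only on the reported numbers.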

Open Reproducibility Challenges (8)

Each record below is a CONTESTED benchmark cluster (factual score mismatch already confirmed) where three independent audit roles also found substantial, multi-angle reproducibility or scope concerns. These are unrebutted open challenges, not falsity verdicts.
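In every record below, the headline concern score matches the arithmetic mean of the three role scores, rounded to one decimal place. The sketch below reproduces that arithmetic; the aggregation rule is inferred from the published numbers rather than documented, and `headline_concern` is a hypothetical helper name.

```python
from statistics import mean


def headline_concern(role_scores):
    """Mean of the per-role concern scores, rounded to one decimal.

    This rule is inferred from the records on this page; it is an
    assumption, not a documented part of the audit.
    """
    return round(mean(role_scores.values()), 1)


# Role scores copied from the Qwen2.5 / Ruler record.
qwen_ruler = {
    "REPRODUCIBILITY AUDITOR": 7.0,
    "PROTOCOL DIVERGENCE ANALYST": 7.0,
    "CLAIM SCOPE AUDITOR": 3.0,
}
qwen_ruler_concern = headline_concern(qwen_ruler)
```

A plain mean lets a single extreme role score move the headline noticeably, as in the Qwen3 / MATH record, where role scores of 9.0, 10.0, and 2.0 average to 7.0.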

Llama-3 / Longbench

CHALLENGED · HIGH · concern 7.0/10 · 3 papers · 94.23pp spread (5.77%–100.0%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10):  The benchmark claim lacks critical details for replication. While the model (Llama-3) and benchmark (Longbench) are specified, there is no information on the data split (train/validation/test), shot count (few-shot or zero-shot), prompt format, evaluation harness and version, context length, or mode…
PROTOCOL DIVERGENCE ANALYST (concern 7.0/10):  The observed score discrepancies for Llama-3 on Longbench may stem from several methodological differences. Variations in context window size, which directly impacts performance on long-context benchmarks like Longbench, could explain part of the spread. Additionally, differences in quantization lev…
CLAIM SCOPE AUDITOR (concern 7.0/10):  The reported scores for Llama-3 on Longbench exhibit notable variability across different publications, raising concerns about reproducibility and methodological consistency. While some differences may stem from legitimate variations in experimental setups (e.g., hardware, prompt engineering, or che…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-10 20:34 UTC

Qwen2.5 / Ruler

CHALLENGED · HIGH · concern 5.7/10 · 2 papers · 93.78pp spread (1.94%–95.72%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10):  The benchmark claim lacks critical details necessary for replication. While the model (Qwen2.5) and benchmark (Ruler) are specified, there is no information on the data split (train/test/validation), shot count (few-shot or zero-shot), prompt format, evaluation harness (e.g., Hugging Face's `evaluat…
PROTOCOL DIVERGENCE ANALYST (concern 7.0/10):  The reported scores for Qwen2.5 on the Ruler benchmark show a notable spread, with no immediately obvious methodological differences documented in the papers. While variations in shot count, context window size, or quantization levels could contribute to the divergence, these are not explicitly ment…
CLAIM SCOPE AUDITOR (concern 3.0/10):  The reported scores for Qwen2.5 on Ruler show some variability across different publications, which raises concerns about reproducibility. While the differences are not extreme, they suggest that the benchmark results may be sensitive to specific experimental conditions such as prompt engineering, c…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-10 20:34 UTC

Llama-3.1-8B / Ruler

CHALLENGED · HIGH · concern 6.0/10 · 4 papers · 83.7pp spread (1.9%–85.6%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10):  The benchmark claim lacks critical details for replication, such as the specific data split used, shot count (few-shot or zero-shot), prompt format, evaluation harness and version, context length, and model variant/checkpoint/quantization. While the model and benchmark are specified, the absence of …
PROTOCOL DIVERGENCE ANALYST (concern 7.0/10):  The spread in reported scores for Llama-3.1-8B on Ruler is concerning due to the lack of clear methodological differences documented across the papers. While variations in shot count (few-shot vs. zero-shot), context window size, or quantization levels could explain some divergence, none of these fa…
CLAIM SCOPE AUDITOR (concern 4.0/10):  The benchmark score for Llama-3.1-8B on Ruler is reported across multiple peer-reviewed or arXiv publications, which suggests some level of reproducibility. However, the lack of detailed information about the specific experimental conditions (e.g., hardware, prompt, checkpoint, evaluation date, doma…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-10 20:34 UTC

Qwen2.5 / DocVQA

CHALLENGED · HIGH · concern 6.0/10 · 3 papers · 80.27pp spread (14.06%–94.33%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 6.0/10):  The benchmark claim lacks critical details for full reproducibility. While the model (Qwen2.5) and benchmark (DocVQA) are specified, key replication information such as the exact data split, shot count, prompt format, evaluation harness version, context length, and model variant/checkpoint/quantizat…
PROTOCOL DIVERGENCE ANALYST (concern 7.0/10):  The reported scores for Qwen2.5 on DocVQA show a notable spread, but no clear methodological differences are documented across the papers. Potential factors like shot count, context window size, or quantization levels are not explicitly mentioned in the sources, making it difficult to attribute the …
CLAIM SCOPE AUDITOR (concern 5.0/10):  The reported scores for Qwen2.5 on DocVQA vary across different publications, indicating potential inconsistencies in experimental setups, such as differences in hardware, prompts, checkpoints, or evaluation dates. While the benchmark is well-established, the lack of uniformity in reported results s…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-10 20:34 UTC

GPT-3.5 / Rouge-L

CHALLENGED · HIGH · concern 7.0/10 · 4 papers · 78.2pp spread (1.8%–80.0%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10):  The benchmark claim lacks critical details necessary for replication, such as the specific data split used, the exact prompt format, and the evaluation harness version. While the model (GPT-3.5) and metric (Rouge-L) are specified, variations in these other parameters can significantly impact results…
PROTOCOL DIVERGENCE ANALYST (concern 7.0/10):  The observed spread in ROUGE-L scores for GPT-3.5 could stem from several methodological differences. Key factors include variations in the evaluation harness version, which may affect scoring algorithms or preprocessing steps. Differences in the context window size or prompt engineering could also …
CLAIM SCOPE AUDITOR (concern 7.0/10):  The benchmark score for GPT-3.5 on Rouge-L is reported across multiple peer-reviewed or arXiv publications, but the variation in scores suggests potential inconsistencies in experimental setups. Without explicit details on the specific versions of GPT-3.5, prompt engineering, evaluation protocols, o…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-10 20:34 UTC

GPT-4o / SWE-bench

CHALLENGED · HIGH · concern 6.3/10 · 2 papers · 76.4pp spread (7.0%–83.4%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 7.0/10):  The benchmark claim lacks critical details for full reproducibility. While the model (GPT-4o) and benchmark (SWE-bench) are specified, key replication information is missing, such as the exact data split used (e.g., train/test/validation splits), the shot count (few-shot or zero-shot setting), the p…
PROTOCOL DIVERGENCE ANALYST (concern 5.0/10):  The reported scores for GPT-4o on SWE-bench show some variability, which could be attributed to several methodological factors. Potential explanations include differences in shot count (few-shot vs. zero-shot), variations in the context window size, or the use of different versions of the SWE-bench …
CLAIM SCOPE AUDITOR (concern 7.0/10):  The reported scores for GPT-4o on SWE-bench exhibit notable variability across different publications, suggesting potential inconsistencies in experimental setups, evaluation protocols, or data subsets. While some differences may stem from legitimate methodological variations, the lack of detailed d…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-10 20:34 UTC

Qwen3 / MATH

CHALLENGED · HIGH · concern 7.0/10 · 2 papers · 75.0pp spread (0.0%–75.0%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 9.0/10):  The provided audit target contains only the model name (Qwen3) and the benchmark dataset (MATH), completely omitting all critical methodological parameters required for replication. Specifically, there is no information regarding the data split (e.g., test vs. competition split), shot count (zero-sh…
PROTOCOL DIVERGENCE ANALYST (concern 10.0/10):  No distinct published papers with divergent scores for 'Qwen3 on MATH' can be identified because Qwen3 has not been officially released or documented in peer-reviewed literature as of the current knowledge cutoff; the premise of the audit target relies on non-existent or hallucinated sources, making…
CLAIM SCOPE AUDITOR (concern 2.0/10):  The claim regarding Qwen3's performance on the MATH benchmark exhibits a low but non-zero scope-validity concern primarily due to the well-documented sensitivity of MATH scores to evaluation protocols, such as the use of chain-of-thought prompting, specific temperature settings, and the exact versio…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-10 20:36 UTC

mBERT / Pos

CHALLENGED · HIGH · concern 6.3/10 · 2 papers · 74.6pp spread (14.3%–88.9%)

Sources:

Open reproducibility challenge (unrebutted). Invites rebuttal:

REPRODUCIBILITY AUDITOR (concern 9.0/10):  The provided audit target contains only the model name (mBERT) and the benchmark dataset (Pos), completely omitting all critical methodological parameters required for replication. There is no information regarding the specific data split used (e.g., UD version, language subset, train/dev/test parti…
PROTOCOL DIVERGENCE ANALYST (concern 8.0/10):  The reported scores for mBERT on the POS (Part-of-Speech) tagging task exhibit significant variance that cannot be fully reconciled by standard protocol differences alone, primarily due to the lack of a single canonical 'POS' benchmark definition across multilingual literature. Methodological diverg…
CLAIM SCOPE AUDITOR (concern 2.0/10):  The claim 'mBERT on Pos' exhibits a minor scope-validity concern because 'Pos' (Part-of-Speech tagging) is not a single, monolithic dataset but a task category spanning numerous languages and treebanks (e.g., UD) with varying annotation standards. While mBERT's multilingual nature implies cross-ling…

Model-generated arguments, unverified. Challenge remains open. Rebuttal: contact@assignee.net.

Last updated: 2026-06-10 20:36 UTC

CONTESTED by the Literature (132 total)

Factual observation: ≥ 2 independent papers report irreconcilable scores for the same model + benchmark pair. No judgment is made about which paper is correct. No language model is involved in this determination.
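The spread is the difference between the highest and lowest reported score, in percentage points. The severity labels in the table below are consistent with cut-offs at 10pp and 5pp: the smallest HIGH spread is 10.4pp, MEDIUM spans 5.1pp to 9.56pp, and the largest LOW spread is 4.8pp. The sketch below encodes that reading; the exact thresholds and the handling of values on a boundary are assumptions, and the function names are hypothetical.

```python
def spread_pp(scores_pct):
    """Max-min spread of the reported scores, in percentage points."""
    return round(max(scores_pct) - min(scores_pct), 2)


def severity(spread, high_at=10.0, medium_at=5.0):
    """Map a spread to a severity band.

    The cut-offs are inferred from the table and are assumptions;
    the ledger does not state them.
    """
    if spread >= high_at:
        return "HIGH"
    if spread >= medium_at:
        return "MEDIUM"
    return "LOW"


# Endpoints copied from the Qwen2.5 / Ruler record (1.94%-95.72%).
ruler_spread = spread_pp([1.94, 95.72])
ruler_band = severity(ruler_spread)
```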

Sev. | Model | Benchmark | Spread (pp) | Updated
HIGH | GPT-4o | HumanEval | 83.23 | 2026-06-10 20:34
HIGH | DeepSeek-R1 | MBPP | 75.8 | 2026-06-10 20:36
HIGH | GPT-4 | Babilong | 75.0 | 2026-06-10 20:36
HIGH | DeepSeek-Coder | MBPP | 74.0 | 2026-06-10 20:36
HIGH | Llama-2 | MATH | 73.43 | 2026-06-10 20:39
HIGH | Qwen2.5 | GSM8K | 70.74 | 2026-06-10 20:34
HIGH | GPT-3.5 | Rouge-2 | 70.7 | 2026-06-10 20:34
HIGH | DeepSeek-R1 | MATH | 67.3 | 2026-06-10 20:34
HIGH | DeepSeek-R1 | SWE-bench | 67.3 | 2026-06-10 20:34
HIGH | Llama-3.1-8B | F1 | 66.6 | 2026-06-10 20:34
HIGH | GPT-4o | MATH | 65.98 | 2026-06-10 20:34
HIGH | GPT-3.5 | MATH | 64.84 | 2026-06-10 20:34
HIGH | GPT-4 | GSM8K | 63.9 | 2026-06-10 20:34
HIGH | Qwen2.5 | HumanEval | 63.74 | 2026-06-10 20:34
HIGH | Llama-3 | HumanEval | 63.27 | 2026-06-10 20:34
HIGH | GPT-4 | Hlce | 61.8 | 2026-06-10 20:34
HIGH | CodeLlama-7B | MBPP | 61.8 | 2026-06-10 20:34
HIGH | Llama-3.1-70B | GSM8K | 61.5 | 2026-06-10 20:34
HIGH | Qwen2.5 | MMLU | 61.11 | 2026-06-10 20:34
HIGH | GPT-3.5 | Medqa | 59.68 | 2026-06-10 20:34
HIGH | LLaVA-Video-7B | Longvideobench | 59.42 | 2026-06-10 20:34
HIGH | Qwen3 | GSM8K | 58.8 | 2026-06-10 20:34
HIGH | LLaVA-Video-72B | Longvideobench | 58.43 | 2026-06-10 20:34
HIGH | Llama-3 | MATH | 58.01 | 2026-06-10 20:34
HIGH | Llama-3.1-8B | MATH | 57.56 | 2026-06-10 20:34
HIGH | DeepSeek-R1 | Codereval | 57.3 | 2026-06-10 20:34
HIGH | Llama-2 | Svamp | 57.3 | 2026-06-10 20:34
HIGH | Qwen3 | Math500 | 57.0 | 2026-06-10 20:34
HIGH | GPT-5 | GSM8K | 55.86 | 2026-06-10 20:34
HIGH | Gemini-1.5-Flash | HumanEval | 55.6 | 2026-06-10 20:34
HIGH | Llama-3 | Belebele | 54.2 | 2026-06-10 20:34
HIGH | Qwen2.5 | MATH | 54.0 | 2026-06-10 20:34
HIGH | GPT-4o | GSM8K | 53.99 | 2026-06-10 20:34
HIGH | GPT-4o | Loong | 53.65 | 2026-06-10 20:34
HIGH | Llama-3 | MMLU | 53.0 | 2026-06-10 20:34
HIGH | Gemini-1.5-Pro | HumanEval | 52.1 | 2026-06-10 20:34
HIGH | Qwen2.5 | Amc | 50.51 | 2026-06-10 20:34
HIGH | Llama-3.1-8B | Rouge-L | 50.4 | 2026-06-10 20:34
HIGH | Qwen2.5 | HellaSwag | 48.6 | 2026-06-10 20:34
HIGH | DeepSeek-Coder | HumanEval | 48.56 | 2026-06-10 20:34
HIGH | Llama-2 | Strategyqa | 47.0 | 2026-06-10 20:34
HIGH | Qwen2.5 | MBPP | 46.6 | 2026-06-10 20:34
HIGH | GPT-4o | MMLU | 46.49 | 2026-06-10 20:34
HIGH | GPT-4o | Recall | 45.71 | 2026-06-10 20:34
HIGH | DeepSeek-V3 | SWE-bench | 45.67 | 2026-06-10 20:34
HIGH | GPT-4 | Leetcode | 45.33 | 2026-06-10 20:34
HIGH | ResNet-50 | Imagenet | 44.61 | 2026-06-10 20:34
HIGH | Llama-3.1-8B | Avg. Accuracy | 43.83 | 2026-06-10 20:34
HIGH | Llama-2 | GSM8K | 43.2 | 2026-06-10 20:34
HIGH | PaLM | BBH | 43.2 | 2026-06-10 20:34
HIGH | Llama-3 | GSM8K | 41.47 | 2026-06-10 20:34
HIGH | Llama-3 | Mt-Bench | 40.98 | 2026-06-10 20:34
HIGH | Llama-2 | MMLU | 40.23 | 2026-06-10 20:34
HIGH | Llama-3.1-8B | Easy | 40.0 | 2026-06-10 20:34
HIGH | GPT-4 | MATH | 39.89 | 2026-06-10 20:34
HIGH | CodeLlama-7B | HumanEval | 38.3 | 2026-06-10 20:34
HIGH | Llama-2 | Multiarith | 36.2 | 2026-06-10 20:34
HIGH | DeepSeek-R1 | Rouge-L | 36.0 | 2026-06-10 20:34
HIGH | Gemini-1.5-Flash | MMLU | 35.43 | 2026-06-10 20:34
HIGH | Llama-3.1-70B | MMLU | 35.2 | 2026-06-10 20:34
HIGH | Phi-3 | MMLU | 34.8 | 2026-06-10 20:34
HIGH | Qwen2 | GSM8K | 34.5 | 2026-06-10 20:34
HIGH | Llama-3 | Average | 34.35 | 2026-06-10 20:34
HIGH | Qwen2.5 | Rouge-L | 34.0 | 2026-06-10 20:34
HIGH | Llama-3.1-8B | Longbench | 33.53 | 2026-06-10 20:34
HIGH | Llama-3.1-8B | MMLU | 33.32 | 2026-06-10 20:34
HIGH | Gemini-1.5-Pro | GSM8K | 33.2 | 2026-06-10 20:34
HIGH | Claude-3.5 | Codereval | 33.1 | 2026-06-10 20:34
HIGH | Llama-3.1-70B | BBH | 33.0 | 2026-06-10 20:34
HIGH | GPT-4 | F1 Score | 32.63 | 2026-06-10 20:34
HIGH | GPT-3.5 | GSM8K | 32.0 | 2026-06-10 20:34
HIGH | Qwen3 | Longbench | 31.37 | 2026-06-10 20:34
HIGH | GPT-4 | MMLU | 30.3 | 2026-06-10 20:34
HIGH | Gemini-1.5-Pro | MMLU | 30.3 | 2026-06-10 20:34
HIGH | Llama-3.1-8B | Safety | 30.22 | 2026-06-10 20:34
HIGH | Llama-3.1-70B | Hotpotqa | 30.2 | 2026-06-10 20:34
HIGH | WizardCoder | HumanEval | 30.0 | 2026-06-10 20:34
HIGH | Llama-2 | Csqa | 29.63 | 2026-06-10 20:34
HIGH | Llama-3.1-8B | GSM8K | 29.46 | 2026-06-10 20:34
HIGH | Llama-3 | WinoGrande | 28.8 | 2026-06-10 20:34
HIGH | Llama-3.1-8B | Hard | 27.7 | 2026-06-10 20:34
HIGH | Gemini-2.0 | GSM8K | 27.58 | 2026-06-10 20:34
HIGH | GPT-4o | Codereval | 25.4 | 2026-06-10 20:34
HIGH | LLaVA-Video-7B | Videomme | 24.7 | 2026-06-10 20:34
HIGH | Qwen2 | MATH | 24.37 | 2026-06-10 20:34
HIGH | GPT-3.5 | HumanEval | 24.27 | 2026-06-10 20:34
HIGH | Llama-3.1-70B | MATH | 24.2 | 2026-06-10 20:34
HIGH | Phi-3 | MMLU-Pro | 24.1 | 2026-06-10 20:34
HIGH | Mistral-7B | Rouge-L | 24.0 | 2026-06-10 20:34
HIGH | Qwen3 | Olympiadbench | 23.4 | 2026-06-10 20:34
HIGH | GPT-4 | MBPP | 22.8 | 2026-06-10 20:34
HIGH | GPT-4o | Longvideobench | 19.6 | 2026-06-10 20:34
HIGH | Llama-3.1-70B | F1 | 19.3 | 2026-06-10 20:34
HIGH | Qwen2.5 | Bleu | 19.21 | 2026-06-10 20:34
HIGH | Qwen2.5 | AIME | 18.15 | 2026-06-10 20:34
HIGH | Llama-3 | Svamp | 17.87 | 2026-06-10 20:34
HIGH | GPT-4 | HumanEval | 17.79 | 2026-06-10 20:34
HIGH | Gemini-2.0 | MATH | 17.46 | 2026-06-10 20:34
HIGH | Llama-2 | Coin Flip | 17.0 | 2026-06-10 20:34
HIGH | Qwen3 | Accuracy | 17.0 | 2026-06-10 20:34
HIGH | LLaMA | Ppl | 15.95 | 2026-06-10 20:34
HIGH | DistilBERT | Squad | 14.44 | 2026-06-10 20:34
HIGH | Llama-2 | Ruler | 14.4 | 2026-06-10 20:34
HIGH | Llama-3 | MMLU-Pro | 14.2 | 2026-06-10 20:34
HIGH | GPT-4o | Precision | 13.9 | 2026-06-10 20:34
HIGH | Llama-2 | Accuracy | 13.33 | 2026-06-10 20:34
HIGH | Llama-3 | Ifeval | 12.88 | 2026-06-10 20:34
HIGH | GLM-4 | SWE-bench | 11.9 | 2026-06-10 20:34
HIGH | GPT-4 | MMLU-Pro | 11.3 | 2026-06-10 20:34
HIGH | Gemini-1.5-Pro | Egoschema | 10.7 | 2026-06-10 20:34
HIGH | LLaMA-7B | C4 | 10.68 | 2026-06-10 20:34
HIGH | LLaDA-MoE-7B-A1B | MATH | 10.4 | 2026-06-10 20:34
HIGH | Qwen3 | HellaSwag | 10.4 | 2026-06-10 20:34
MEDIUM | LLaMA-13B | C4 | 9.56 | 2026-06-10 20:34
MEDIUM | LoRA-FAIR | Domainnet | 8.6 | 2026-06-10 20:34
MEDIUM | DeepSeek-R1 | MMLU | 8.3 | 2026-06-10 20:34
MEDIUM | IRCoT | Musique | 8.3 | 2026-06-10 20:34
MEDIUM | GPT-3.5 | MMLU | 7.7 | 2026-06-10 20:34
MEDIUM | DeepSeek-R1 | HumanEval | 7.3 | 2026-06-10 20:34
MEDIUM | LLaDA-MoE-7B-A1B | GSM8K | 7.0 | 2026-06-10 20:34
MEDIUM | GPT-4 | SWE-bench | 6.8 | 2026-06-10 20:34
MEDIUM | DeepSeek-33B | HumanEval | 6.7 | 2026-06-10 20:34
MEDIUM | Llama-3 | Accuracy | 6.6 | 2026-06-10 20:34
MEDIUM | Qwen3 | LiveCodeBench | 6.57 | 2026-06-10 20:34
MEDIUM | Gemini-1.5-Pro | MATH | 6.5 | 2026-06-10 20:34
MEDIUM | Llama-3.1-8B | Rewardbench | 6.2 | 2026-06-10 20:34
MEDIUM | Llama-3.1-8B | Ifeval | 6.1 | 2026-06-10 20:34
MEDIUM | Llama-3.1-70B | Musique | 6.0 | 2026-06-10 20:34
MEDIUM | LightGCL | Amazon-Book | 5.76 | 2026-06-10 20:34
MEDIUM | Claude-3.5 | SWE-bench | 5.1 | 2026-06-10 20:34
LOW | Qwen3 | MMLU | 4.8 | 2026-06-10 20:34
LOW | Llama-3.1-8B | HellaSwag | 4.7 | 2026-06-10 20:34
LOW | Llama-3.1-70B | Rewardbench | 4.7 | 2026-06-10 20:34
LOW | Claude-3.5 | HumanEval | 4.4 | 2026-06-10 20:34
LOW | Qwen2.5 | SWE-bench | 3.5 | 2026-06-10 20:34
LOW | Qwen3 | AIME | 3.3 | 2026-06-10 20:34
LOW | Gemma-2-9B | MMLU-Pro | 3.1 | 2026-06-10 20:34