GPT-4 on MMLU

Evidence cluster: 11 reported scores
Source coverage: 11 distinct sources (broad)
Source profile: arxiv.org, doi.org, 2023 to 2026
Reported range: 57.0% to 87.3%
Spread: 30.3 pp (HIGH)

Reported Scores

Model	Score	Source paper / domain	Year
GPT-4	87.3%	Adaptive Self-Prompting in Agentic LLM Frameworks for Code Fault Detection / doi.org	2026
GPT-4	87.3%	ReST-KV: Robust KV Cache Eviction with Layer-wise Output Reconstruction and Spatial-Temporal Smoothing / arxiv.org	2026
GPT-4	87.3%	Vendi-RAG: Adaptively Trading-Off Diversity And Quality Significantly Improves Retrieval Augmented Generation With LLMs / arxiv.org	2025
GPT-4	87.3%	A Systematic Evaluation of On-Device LLMs: Quantization, Performance, and Resources / arxiv.org	2025
GPT-4	87.3%	MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / arxiv.org	2024
GPT-4	87.3%	Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities / doi.org	2023
GPT-4	87.3%	Capabilities of GPT-4 on Medical Challenge Problems / arxiv.org	2023
GPT-4	87.3%	Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models' Reasoning Performance / arxiv.org	2023
GPT-4	86.4%	Data Engineering for Scaling Language Models to 128K Context / arxiv.org	2024
GPT-4	83.9%	Bridging the Gap: Enhancing LLM Performance for Low-Resource African Languages with New Benchmarks, Fine-Tuning, and Cultural Adjustments / arxiv.org	2024
GPT-4	57.0%	Investigating Data Contamination in Modern Benchmarks for Large Language Models / doi.org	2023

Interpretation

This page groups score claims extracted from papers for the same model and benchmark label. Spread is the difference between the highest and lowest reported score, so a nonzero spread means the public literature reports different values for this cluster.
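
As a worked check on the figures above, the spread appears to be the highest reported score minus the lowest (87.3 - 57.0 = 30.3 pp). The sketch below recomputes it from the eleven scores in the table. The variable names are illustrative, and the page does not document its exact computation or the threshold behind the HIGH label.

```python
from collections import Counter

# Scores copied from the Reported Scores table, in percent.
scores = [87.3] * 8 + [86.4, 83.9, 57.0]

# Spread in percentage points, assumed here to be max minus min.
spread_pp = round(max(scores) - min(scores), 1)

# Most frequently reported value and how many sources report it.
modal_score, modal_count = Counter(scores).most_common(1)[0]

print(f"spread: {spread_pp} pp")
print(f"most common score: {modal_score}% ({modal_count} of {len(scores)} sources)")
```

Looking at the most common value alongside the spread helps show whether a large spread reflects broad disagreement or a single outlying report.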

Differences are not automatically errors. They may come from prompt choices, dataset versions, evaluation protocol, scoring rule, preprocessing, fine-tuning, or reporting convention. Source papers remain authoritative for their own claims. See the quality guide for how to read evidence links, manifests, and automated assessment fields.

Source coverage is a conservative count of distinct public paper URLs or titles in the cluster. It measures coverage breadth, not correctness.
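
One plausible reading of a conservative count is sketched below: records are keyed by URL when one is available and by a normalized title otherwise, so near-duplicate records collapse into a single source. The record fields (`url`, `title`), the helper names, and the sample records are all hypothetical; the page's actual deduplication rule is not stated.

```python
def normalize_title(title: str) -> str:
    """Lowercase and collapse whitespace so trivially different titles match."""
    return " ".join(title.lower().split())


def count_distinct_sources(records: list[dict]) -> int:
    """Count distinct sources, keyed by URL when present, else by title."""
    seen = set()
    for rec in records:
        url = (rec.get("url") or "").strip().lower().rstrip("/")
        title = normalize_title(rec.get("title") or "")
        if url:
            seen.add(("url", url))
        elif title:
            seen.add(("title", title))
        # Records with neither field are skipped rather than counted.
    return len(seen)


# Hypothetical records with placeholder URLs and titles.
records = [
    {"url": "https://example.org/paper-a", "title": "Paper A"},
    {"url": "https://example.org/paper-a/", "title": "Paper A"},  # same URL
    {"url": None, "title": "Paper  B"},
    {"url": None, "title": "paper b"},                            # same title
    {"url": None, "title": ""},                                   # unusable
]
distinct = count_distinct_sources(records)
print(distinct)
```

Merging on normalized keys can only lower the count, which is consistent with the count being described as conservative.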

Source profile reports public URL domains and publication years when they are available in extracted records. It is included for auditability only.
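
A source profile line of this kind could be assembled roughly as follows. This is a minimal sketch that assumes each extracted record carries an optional `url` and an optional `year`; the function name, field names, and placeholder URLs are illustrative and not taken from the page's pipeline.

```python
from urllib.parse import urlparse


def source_profile(records: list[dict]) -> str:
    """Summarize the URL domains and publication-year range of the records."""
    # Collect distinct domains from records that have a URL.
    domains = sorted(
        {urlparse(r["url"]).netloc.lower() for r in records if r.get("url")}
    )
    # Collect years from records that have one.
    years = [r["year"] for r in records if r.get("year") is not None]

    parts = list(domains)
    if years:
        lo, hi = min(years), max(years)
        parts.append(str(lo) if lo == hi else f"{lo} to {hi}")
    return ", ".join(parts)


# Hypothetical records; the URL paths are placeholders, not real identifiers.
records = [
    {"url": "https://arxiv.org/abs/placeholder", "year": 2023},
    {"url": "https://doi.org/placeholder", "year": 2026},
    {"url": None, "year": 2024},
]
profile = source_profile(records)
print(profile)
```

Records without a URL or year simply contribute nothing to that part of the summary, matching the "when they are available" qualifier above.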