Benchmark Tracker

Cross-paper score extraction; automated discrepancy detection; updated continuously

3504 score entries  |  652 models tracked  |  1050 benchmarks  |  91 discrepancies detected

Discrepancies

Each row is a model-benchmark pair for which different papers report divergent scores (a spread of at least 3 percentage points, pp). Rows are ordered by spread, largest first. N is the number of reported scores.

Model Benchmark Δ pp Reported scores N
Codegen-Mono Cwe Detection Recall 100.0 100.0% / 0.0% / evidence 2
Codegen-Mono Cwe Detection F1-Score 99.0 99.0% / 0.0% / evidence 2
Codegen-Mono Cwe Detection Accuracy 99.0 99.0% / 0.0% / evidence 2
Codegen-Mono Cwe Detection Precision 98.1 98.1% / 0.0% / evidence 2
GPT-4o HumanEval 82.8 86.2% / 86.2% / 27.7% / 3.4% / evidence 4
GPT-4o SWE-bench 76.4 83.4% / 7.0% / evidence 2
DeepSeek-R1 MBPP 75.8 92.6% / 16.8% / evidence 2
GPT-4 Babilong 75.0 85.0% / 10.0% / evidence 2
Llama-2 MATH 73.4 88.3% / 28.9% / 19.2% / 14.9% / evidence 4
Qwen2.5 GSM8K 70.7 95.0% / 91.5% / 69.9% / 47.5% / 40.0% / 37.7% / 24.3% / evidence 7
DeepSeek-R1 MATH 67.3 97.3% / 79.8% / 72.2% / 30.0% / evidence 4
DeepSeek-R1 SWE-bench 67.3 72.1% / 44.8% / 4.8% / evidence 3
GPT-4o MATH 66.0 95.0% / 85.9% / 76.7% / 73.4% / 72.7% / 29.0% / evidence 6
GPT-4 GSM8K 63.9 95.0% / 94.9% / 92.7% / 84.2% / 56.2% / 40.0% / 31.1% / evidence 7
Qwen2.5 HumanEval 63.7 96.3% / 82.2% / 79.6% / 59.6% / 59.1% / 41.0% / 32.6% / evidence 7
Llama-3 HumanEval 63.3 81.7% / 18.4% / evidence 2
GPT-4 Hlce 61.8 76.9% / 15.1% / evidence 2
Qwen2.5 MMLU 61.1 86.1% / 74.9% / 70.0% / 25.0% / evidence 4
LLaVA-Video-7B Longvideobench 59.4 62.7% / 3.3% / 3.3% / evidence 3
DeepSeek-Coder MBPP 59.1 94.2% / 35.1% / evidence 2
LLaVA-Video-72B Longvideobench 58.4 61.9% / 3.5% / evidence 2
Llama-3 MATH 58.0 87.0% / 76.6% / 72.2% / 48.9% / 39.3% / 29.0% / evidence 6
Llama-3.1-8B MATH 57.6 82.9% / 30.0% / 25.3% / evidence 3
DeepSeek-R1 Codereval 57.3 59.2% / 1.9% / evidence 2
GPT-5 GSM8K 55.9 97.4% / 41.5% / evidence 2
Gemini-1.5-Flash HumanEval 55.6 73.0% / 17.4% / evidence 2
Qwen2.5 MATH 54.0 81.0% / 75.0% / 71.3% / 62.1% / 46.4% / 45.7% / 44.1% / 34.8% / 27.0% / evidence 9
GPT-4o GSM8K 54.0 95.9% / 93.7% / 91.4% / 41.9% / evidence 4
GPT-4o Loong 53.7 74.0% / 20.3% / evidence 2
Gemini-1.5-Pro HumanEval 52.1 75.0% / 48.0% / 22.9% / evidence 3
GPT-3.5 MATH 51.2 93.7% / 72.2% / 42.5% / evidence 3
Llama-3.1-8B Rouge-L 50.4 53.0% / 2.6% / evidence 2
Qwen2.5 HellaSwag 48.6 87.6% / 39.0% / evidence 2
DeepSeek-Coder HumanEval 48.6 97.6% / 79.3% / 59.6% / 49.0% / evidence 4
Qwen2.5 MBPP 46.6 85.2% / 69.2% / 38.6% / evidence 3
GPT-4 Leetcode 45.3 85.0% / 39.7% / evidence 2
Llama-3 Longbench 43.6 49.4% / 5.8% / evidence 2
PaLM BBH 43.2 49.2% / 6.0% / evidence 2
Llama-3 GSM8K 41.5 95.8% / 95.4% / 62.5% / 54.3% / evidence 4
LongChat-7B-v1.5-32K Ruler 40.0 100.0% / 60.0% / evidence 2
GPT-4 MATH 39.9 69.7% / 60.1% / 48.1% / 42.5% / 29.8% / evidence 5
Llama-2 GSM8K 37.8 66.6% / 53.3% / 52.1% / 40.0% / 35.0% / 28.8% / evidence 6
DeepSeek-R1 Rouge-L 36.0 60.0% / 24.0% / evidence 2
Gemini-1.5-Flash MMLU 35.4 43.6% / 8.2% / evidence 2
Llama-3.1-70B MMLU 35.2 85.2% / 74.0% / 50.0% / evidence 3
Qwen2 GSM8K 34.5 89.5% / 55.0% / evidence 2
Qwen2.5 Rouge-L 34.0 52.0% / 18.0% / evidence 2
Llama-3.1-8B MMLU 33.3 78.2% / 66.9% / 63.5% / 44.9% / evidence 4
Gemini-1.5-Pro GSM8K 33.2 91.7% / 85.0% / 58.5% / evidence 3
Claude-3.5 Codereval 33.1 85.9% / 52.8% / evidence 2
Llama-3.1-70B BBH 33.0 93.0% / 60.0% / evidence 2
Gemini-1.5-Pro MMLU 30.3 75.3% / 45.0% / evidence 2
GPT-4 MMLU 30.3 87.3% / 87.3% / 87.3% / 87.3% / 87.3% / 86.4% / 57.0% / evidence 7
Gemini-2.0 GSM8K 27.6 95.6% / 68.0% / evidence 2
Gemma-2-2B MMLU 26.5 65.4% / 38.9% / 38.9% / evidence 3
GPT-4o Codereval 25.4 84.6% / 59.2% / evidence 2
LLaVA-Video-7B Videomme 24.7 90.0% / 65.3% / evidence 2
Qwen2 MATH 24.4 67.4% / 54.8% / 43.0% / evidence 3
Llama-3.1-70B MATH 24.2 67.0% / 60.0% / 42.8% / evidence 3
Phi-3 MMLU-Pro 24.1 78.9% / 54.8% / evidence 2
Mistral-7B Rouge-L 24.0 57.0% / 33.0% / evidence 2
Llama-2 Svamp 21.7 52.4% / 45.2% / 30.7% / evidence 3
GPT-4o Longvideobench 19.6 66.7% / 47.1% / evidence 2
Llama-3 Svamp 17.9 72.2% / 54.3% / evidence 2
Gemini-2.0 MATH 17.5 91.5% / 74.1% / evidence 2
GPT-3.5 GSM8K 17.1 92.0% / 74.9% / evidence 2
Llama-3 MMLU 16.5 63.5% / 59.9% / 50.7% / 50.2% / 47.0% / evidence 5
Llama-2 Strategyqa 14.4 50.0% / 35.6% / evidence 2
Llama-3 MMLU-Pro 14.2 56.2% / 42.0% / evidence 2
Llama-2 Accuracy 13.3 83.3% / 70.0% / evidence 2
GLM-4 SWE-bench 11.9 47.6% / 35.7% / evidence 2
LLaMA-7B C4 10.7 17.8% / 7.1% / evidence 2
LLaDA-MoE-7B-A1B MATH 10.4 55.0% / 44.6% / evidence 2
Gemma-2-9B MMLU 9.6 75.0% / 75.0% / 65.4% / evidence 3
LLaMA-13B C4 9.6 16.2% / 6.6% / evidence 2
LLAVA-NEXT-34B-NH Logicvista 9.4 29.9% / 20.6% / evidence 2
Llama-2 MMLU 8.9 46.7% / 44.8% / 43.9% / 37.8% / evidence 4
GPT-4 HumanEval 8.8 87.3% / 87.2% / 80.0% / 78.5% / evidence 4
Qwen3 GSM8K 8.6 88.5% / 86.5% / 79.9% / evidence 3
DeepSeek-R1 MMLU 8.3 90.8% / 82.5% / evidence 2
DeepSeek-R1 HumanEval 7.3 90.7% / 83.4% / evidence 2
LLaDA-MoE-7B-A1B GSM8K 7.0 65.8% / 58.8% / evidence 2
GPT-4 SWE-bench 6.8 78.8% / 72.0% / evidence 2
Llama-3 Accuracy 6.6 79.4% / 72.8% / evidence 2
Qwen3 LiveCodeBench 6.6 54.6% / 48.0% / evidence 2
Gemini-1.5-Pro MATH 6.5 65.0% / 62.2% / 58.5% / evidence 3
Phi-3 MMLU 5.3 69.0% / 63.7% / evidence 2
Claude-3.5 SWE-bench 5.1 56.4% / 51.3% / evidence 2
Claude-3.5 HumanEval 4.4 89.0% / 84.6% / evidence 2
Qwen2.5 SWE-bench 3.5 43.5% / 40.0% / evidence 2
Gemma-2-9B MMLU-Pro 3.1 69.1% / 66.0% / evidence 2
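
The discrepancy rule behind the table above (same model, same benchmark, spread of at least 3 pp, ordered by spread) can be sketched as follows. This is a minimal illustration, not the tracker's actual implementation: the function name `find_discrepancies`, the tuple layout, and the `ExampleModel` rows are assumptions made for the example, and the sketch groups only by model and benchmark without checking which paper each score came from.

```python
from collections import defaultdict

# Sample (model, benchmark, score in %) entries. The first four values are
# copied from rows of the table above; the ExampleModel rows are hypothetical
# and exist only to show a pair that falls below the threshold.
ENTRIES = [
    ("Claude-3.5", "HumanEval", 89.0),
    ("Claude-3.5", "HumanEval", 84.6),
    ("GPT-4o", "SWE-bench", 83.4),
    ("GPT-4o", "SWE-bench", 7.0),
    ("ExampleModel", "ExampleBench", 51.5),  # hypothetical
    ("ExampleModel", "ExampleBench", 50.0),  # hypothetical
]

THRESHOLD_PP = 3.0  # minimum spread, in percentage points, to flag a pair


def find_discrepancies(entries, threshold_pp=THRESHOLD_PP):
    """Return (model, benchmark, spread, scores, n) rows, largest spread first."""
    grouped = defaultdict(list)
    for model, benchmark, score in entries:
        grouped[(model, benchmark)].append(score)

    rows = []
    for (model, benchmark), scores in grouped.items():
        if len(scores) < 2:
            continue  # a single report cannot disagree with itself
        # Round to one decimal to match the table and avoid float noise.
        spread = round(max(scores) - min(scores), 1)
        if spread >= threshold_pp:
            rows.append(
                (model, benchmark, spread, sorted(scores, reverse=True), len(scores))
            )

    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


if __name__ == "__main__":
    for model, benchmark, spread, scores, n in find_discrepancies(ENTRIES):
        reported = " / ".join(f"{s:.1f}%" for s in scores)
        print(model, benchmark, spread, reported, n)
```

With the sample rows, the GPT-4o SWE-bench pair (76.4 pp) sorts ahead of Claude-3.5 HumanEval (4.4 pp), and the hypothetical 1.5 pp pair is dropped.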

Score Database

Showing the top 15 benchmarks by coverage (of 1050 total). The number in parentheses after each benchmark name is its count of score entries.
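
Ranking benchmarks by coverage can be sketched as below. This assumes that "coverage" means the number of score entries recorded for a benchmark, which is consistent with the counts shown next to each benchmark name but is not stated by the tracker. The function name `top_benchmarks` and the sample rows are illustrative only.

```python
from collections import Counter

# Small sample of (benchmark, model, score in %) entries taken from the
# listings on this page; the real database holds far more rows.
SAMPLE = [
    ("GSM8K", "Kimi-K2", 97.9),
    ("GSM8K", "GPT-5", 97.4),
    ("GSM8K", "DeepSeek-V3", 97.1),
    ("HumanEval", "Seed-Coder-8B", 98.2),
    ("HumanEval", "o1", 96.2),
    ("HellaSwag", "GLM-4.5", 88.9),
]


def top_benchmarks(entries, k=15):
    """Return the k benchmarks with the most score entries as (name, count)."""
    counts = Counter(benchmark for benchmark, _model, _score in entries)
    return counts.most_common(k)


if __name__ == "__main__":
    for name, count in top_benchmarks(SAMPLE, k=2):
        print(f"{name} ({count})")
```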

Accuracy (17)

Model Score Source paper Year
Codegen-Mono 99.0% Case Study: Fine-tuning Small Language Models for Accurate and Private CWE Detection in Python Code / evidence 2025
codegen-mono 99.0% Case Study: Fine-tuning Small Language Models for Accurate and Private CWE Detection in Python Code / evidence 2025
Qwen3 94.0% Emissions and Performance Trade-off Between Small and Large Language Models / evidence 2025
FlanT5 88.0% Emissions and Performance Trade-off Between Small and Large Language Models / evidence 2025
GPT-4 86.1% The impact of fine tuning in LLaMA on hallucinations for named entity extraction in legal documentation / evidence 2025
Llama-2 83.3% The impact of fine tuning in LLaMA on hallucinations for named entity extraction in legal documentation / evidence 2025
DeepSeek-R1 80.0% Emissions and Performance Trade-off Between Small and Large Language Models / evidence 2025
7B 80.0% Can LLMs Solve longer Math Word Problems Better? / evidence 2024
13B 80.0% Can LLMs Solve longer Math Word Problems Better? / evidence 2024
70B 80.0% Can LLMs Solve longer Math Word Problems Better? / evidence 2024
Llama-3 79.4% The impact of fine tuning in LLaMA on hallucinations for named entity extraction in legal documentation / evidence 2025
GPT-3.5 79.4% The impact of fine tuning in LLaMA on hallucinations for named entity extraction in legal documentation / evidence 2025
Llama-3 72.8% Knowledge-Driven Agentic Scientific Corpus Distillation Framework for Biomedical Large Language Models Training / evidence 2025
Llama-2 70.0% Can LLMs Solve longer Math Word Problems Better? / evidence 2024
Mistral-7B-Instruct-v0.2 62.0% Emissions and Performance Trade-off Between Small and Large Language Models / evidence 2025
Mistral-7B 62.0% Emissions and Performance Trade-off Between Small and Large Language Models / evidence 2025
Qwen2.5 56.0% Emissions and Performance Trade-off Between Small and Large Language Models / evidence 2025

Babilong (20)

Model Score Source paper Year
GPT-4 85.0% BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack / evidence 2024
Llama-3.1-70B 85.0% BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack / evidence 2024
Qwen2.5 85.0% BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack / evidence 2024
Gemini-Pro-1.5 85.0% BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack / evidence 2024
Mistral-v0.2 85.0% BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack / evidence 2024
Mixtral 85.0% BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack / evidence 2024
Yi-34B-200k 80.0% BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack / evidence 2024
Yi-9B-200k 30.0% BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack / evidence 2024
Gemini-2.5-Pro 20.0% BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack / evidence 2024
DeepSeek-R1 18.0% BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack / evidence 2024
Qwen 16.0% BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack / evidence 2024
Llama-3 15.0% BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack / evidence 2024
Mistral 14.0% BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack / evidence 2024
o3 13.0% BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack / evidence 2024
Claude-3 12.0% BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack / evidence 2024
o1 11.0% BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack / evidence 2024
GPT-4 10.0% BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack / evidence 2024
LongChat 5.0% BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack / evidence 2024
LongAlpaca 5.0% BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack / evidence 2024
Llama-2 5.0% BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack / evidence 2024

GSM8K (99)

Model Score Source paper Year
Kimi-K2 97.9% Do Instruction-Tuned Models Always Perform Better Than Base Models? Evidence from Math and Domain-Shifted Benchmarks / evidence 2026
GPT-5 97.4% Is Mathematical Problem-Solving Expertise in Large Language Models Associated with Assessment Performance? / evidence 2026
DeepSeek-V3 97.1% Do Instruction-Tuned Models Always Perform Better Than Base Models? Evidence from Math and Domain-Shifted Benchmarks / evidence 2026
Claude-3.5-Sonnet 96.4% AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling / evidence 2024
GPT-4o 95.9% AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling / evidence 2024
Llama-3 95.8% Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs / evidence 2024
Gemini-2.0 95.6% Beyond Meta-Reasoning: Metacognitive Consolidation for Self-Improving LLM Reasoning / evidence 2026
Llama-3 95.4% Do Instruction-Tuned Models Always Perform Better Than Base Models? Evidence from Math and Domain-Shifted Benchmarks / evidence 2026
Gemini-2.5-Pro 95.2% GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts / evidence 2025
GPT-4 95.0% Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs / evidence 2024
Qwen2.5 95.0% Do LLMs Overthink Basic Math Reasoning? Benchmarking the Accuracy-Efficiency Tradeoff in Language Models / evidence 2025
GPT-4 94.9% Is Mathematical Problem-Solving Expertise in Large Language Models Associated with Assessment Performance? / evidence 2026
Llama-3.1-405B 93.9% AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling / evidence 2024
GPT-4o 93.7% Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs / evidence 2024
GPT-4 92.7% Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition / evidence 2024
GPT-3.5 92.0% Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs / evidence 2024
Gemini-1.5-Pro 91.7% Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs / evidence 2024
Qwen2.5 91.5% Qwen2.5 Technical Report / evidence 2024
GPT-4o 91.4% Beyond Meta-Reasoning: Metacognitive Consolidation for Self-Improving LLM Reasoning / evidence 2026
Claude-3-Opus 90.8% Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs / evidence 2024
Qwen2 89.5% Qwen2 Technical Report / evidence 2024
Qwen3 88.5% DiffCoT: Diffusion-styled Chain-of-Thought Reasoning in LLMs / evidence 2026
Qwen3 86.5% Do Instruction-Tuned Models Always Perform Better Than Base Models? Evidence from Math and Domain-Shifted Benchmarks / evidence 2026
Gemini-1.5-Pro 85.0% Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context / evidence 2024
GPT-4 84.2% Vendi-RAG: Adaptively Trading-Off Diversity And Quality Significantly Improves Retrieval Augmented Generation With LLMs / evidence 2025
Gemini-1.5-Flash 83.0% Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context / evidence 2024
Qwen3 79.9% Task-Specific Efficiency Analysis: When Small Language Models Outperform Large Language Models / evidence 2026
MathCoder 79.9% MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning / evidence 2023
Phi-4 78.9% Do LLMs Overthink Basic Math Reasoning? Benchmarking the Accuracy-Efficiency Tradeoff in Language Models / evidence 2025
Llama-3.1-8B 77.6% Which Quantization Should I Use? A Unified Evaluation of llama.cpp Quantization on Llama-3.1-8B-Instruct / evidence 2026
OpenChat-3.5 77.3% Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition / evidence 2024
Dream 77.0% Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Principles / evidence 2025
GPT-3.5 74.9% Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition / evidence 2024
Phi-3 74.5% SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models / evidence 2024
Phi-3 73.5% Task-Specific Efficiency Analysis: When Small Language Models Outperform Large Language Models / evidence 2026
ChatGLM3-6B 72.3% Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition / evidence 2024
Qwen2.5 69.9% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
LLaDA 69.8% Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Principles / evidence 2025
Gemini-2.0 68.0% CROP: Token-Efficient Reasoning in Large Language Models via Regularized Prompt Optimization / evidence 2026
Llama-2 66.6% Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations / evidence 2023
LLaDA-MoE-7B-A1B 65.8% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
MathCoder-L 64.2% MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning / evidence 2023
Gemini-3.1-Pro 64.0% CROP: Token-Efficient Reasoning in Large Language Models via Regularized Prompt Optimization / evidence 2026
Llama-3 62.5% DiffCoT: Diffusion-styled Chain-of-Thought Reasoning in LLMs / evidence 2026
LLaDA-8B 61.6% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
Qwen-14B 60.1% Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition / evidence 2024
LLaDA-MoE-7B-A1B 58.8% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
Gemini-1.5-Pro 58.5% Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context / evidence 2024
SmolLM-3B 56.7% Do Instruction-Tuned Models Always Perform Better Than Base Models? Evidence from Math and Domain-Shifted Benchmarks / evidence 2026
GPT-4 56.2% Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations / evidence 2023
BaseRL-7B 55.8% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
Gemma-2-2B 55.5% Task-Specific Efficiency Analysis: When Small Language Models Outperform Large Language Models / evidence 2026
BaseRL-32B 55.3% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
Qwen2 55.0% Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs / evidence 2024
Llama-3 54.3% Understanding Reasoning in Chain-of-Thought from the Hopfieldian View / evidence 2024
Phi-2-2.7B 53.4% SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models / evidence 2024
Llama-2 53.3% SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models / evidence 2024
Mistral-7B 53.2% Task-Specific Efficiency Analysis: When Small Language Models Outperform Large Language Models / evidence 2026
Llama-2 52.1% Understanding Reasoning in Chain-of-Thought from the Hopfieldian View / evidence 2024
Gemma-2-it-9B 51.0% Building Math Agents with Multi-Turn Iterative Preference Learning / evidence 2024
SwS-32B 50.0% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
Qwen2.5 47.5% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
Llama-1-RFT 46.5% MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning / evidence 2023
CodeGemma-1.1-it-7B 46.4% Building Math Agents with Multi-Turn Iterative Preference Learning / evidence 2024
Gemma-1.1-it-7B 46.1% Building Math Agents with Multi-Turn Iterative Preference Learning / evidence 2024
Dream-v0-Instruct-7B 43.3% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
LLaDA-1.5 42.8% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
Mistral-7B-v0.3 42.7% Building Math Agents with Multi-Turn Iterative Preference Learning / evidence 2024
Llama-4-17B-128E 42.3% GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts / evidence 2025
GPT-4o 41.9% GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts / evidence 2025
GPT-5 41.5% GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts / evidence 2025
Qwen2.5 40.0% TokenSkip: Controllable Chain-of-Thought Compression in LLMs / evidence 2025
GPT-4 40.0% MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation / evidence 2023
Llama-2 40.0% Making Large Language Models Better Reasoners with Alignment / evidence 2023
Qwen2.5 37.7% Task-Specific Efficiency Analysis: When Small Language Models Outperform Large Language Models / evidence 2026
Llama-4-17B-16E 36.8% GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts / evidence 2025
InternVL3.5-241B-A28B 36.8% GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts / evidence 2025
InternVL3.5-38B 36.0% GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts / evidence 2025
PRIME-RL-7B 35.7% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
Llama-2 35.0% Learning From Failure: Integrating Negative Examples when Fine-tuning Large Language Models as Agents / evidence 2024
SwS-7B 35.0% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
Open-Reasoner-32B 34.9% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
Oat-Zero-7B 31.4% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
GPT-4 31.1% GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts / evidence 2025
QVQ-Max-Latest 30.5% GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts / evidence 2025
SimpleRL-Base-7B 30.5% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
Llama-3.1-70B 30.1% GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts / evidence 2025
SimpleRL-Base-32B 29.8% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
InternVL3.5-30B-A3B 29.4% GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts / evidence 2025
Llama-2 28.8% TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models / evidence 2023
Open-Reasoner-7B 27.6% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
Qwen2.5 24.3% GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts / evidence 2025
DeepSeek-V2 20.0% MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation / evidence 2023
Claude3-Sonnet 20.0% MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation / evidence 2023
SwS-3B 19.9% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
BaseRL-3B 18.8% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
InternVL3.5-8B 16.9% GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts / evidence 2025
Vicuna-13B 11.3% Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition / evidence 2024
MiMo-v2-Flash 8.0% Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks / evidence 2026

HellaSwag (17)

Model Score Source paper Year
Llama-3.1-70B 94.0% "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization / evidence 2024
GLM-4.5 88.9% GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models / evidence 2025
Qwen2.5 87.6% Qwen2.5 Technical Report / evidence 2024
Llama-3.1-8B 72.5% Which Quantization Should I Use? A Unified Evaluation of llama.cpp Quantization on Llama-3.1-8B-Instruct / evidence 2026
Phi-3 59.0% Task-Specific Efficiency Analysis: When Small Language Models Outperform Large Language Models / evidence 2026
Gemma-2-2B 52.0% Task-Specific Efficiency Analysis: When Small Language Models Outperform Large Language Models / evidence 2026
Qwen3 52.0% Task-Specific Efficiency Analysis: When Small Language Models Outperform Large Language Models / evidence 2026
Qwen2.5 39.0% Task-Specific Efficiency Analysis: When Small Language Models Outperform Large Language Models / evidence 2026
UB-SMoE-OLMo-1B 35.4% UB-SMoE: Universally Balanced Sparse Mixture-of-Experts for Resource-adaptive Federated Fine-tuning of Foundation Models / evidence 2026
T0-3B 27.8% Prompt Consistency for Zero-Shot Task Generalization / evidence 2022
FLAME-MoE-115M-459M 27.7% FLAME-MoE: A Transparent End-to-End Research Platform for Mixture-of-Experts Language Models / evidence 2025
FLAME-MoE-98M-349M 26.3% FLAME-MoE: A Transparent End-to-End Research Platform for Mixture-of-Experts Language Models / evidence 2025
FLAME-MoE-38M-100M 25.9% FLAME-MoE: A Transparent End-to-End Research Platform for Mixture-of-Experts Language Models / evidence 2025
FLoRA-OLMo-1B 19.1% UB-SMoE: Universally Balanced Sparse Mixture-of-Experts for Resource-adaptive Federated Fine-tuning of Foundation Models / evidence 2026
FlexLoRA-OLMo-1B 13.6% UB-SMoE: Universally Balanced Sparse Mixture-of-Experts for Resource-adaptive Federated Fine-tuning of Foundation Models / evidence 2026
FLoRIST-OLMo-1B 12.9% UB-SMoE: Universally Balanced Sparse Mixture-of-Experts for Resource-adaptive Federated Fine-tuning of Foundation Models / evidence 2026
HetLoRA-OLMo-1B 11.0% UB-SMoE: Universally Balanced Sparse Mixture-of-Experts for Resource-adaptive Federated Fine-tuning of Foundation Models / evidence 2026

HumanEval (103)

Model Score Source paper Year
Code-Llama-7B 100.0% Code Llama: Open Foundation Models for Code / evidence 2023
Code-Llama-13B 100.0% Code Llama: Open Foundation Models for Code / evidence 2023
Code-Llama-34B 100.0% Code Llama: Open Foundation Models for Code / evidence 2023
Seed-Coder-8B 98.2% ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning / evidence 2026
Claude-3.7-Sonnet 97.8% SIMCOPILOT: Evaluating Large Language Models for Copilot-Style Code Generation / evidence 2025
DeepSeek-Coder 97.6% ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning / evidence 2026
Qwen2.5 96.3% ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning / evidence 2026
o1 96.2% HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation / evidence 2024
Qwen3 95.1% ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning / evidence 2026
GPT-5.1 95.1% ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning / evidence 2026
Llama-3.1-70B 95.0% "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization / evidence 2024
ReflexiCoder-8B 94.5% ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning / evidence 2026
WizardCoder-CodeLlama 92.1% Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation / evidence 2023
Phind-CodeLlama 91.7% Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation / evidence 2023
DeepSeek-R1 90.7% FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks / evidence 2025
LeDex-RL-13B 90.0% ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning / evidence 2026
Claude-3.5 89.0% FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks / evidence 2025
GPT-4 87.3% Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation / evidence 2023
GPT-4 87.2% NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts / evidence 2024
GPT-4o 86.2% FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks / evidence 2025
GPT-4o 86.2% FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks / evidence 2025
Starcoder-3B 85.7% Smaller = Weaker? Benchmarking Robustness of Quantized LLMs in Code Generation / evidence 2025
ChatGPT 85.2% Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation / evidence 2023
Claude-3.5 84.6% FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks / evidence 2025
DeepSeek-R1 83.4% FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks / evidence 2025
Qwen2.5 82.2% FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks / evidence 2025
Llama-3 81.7% NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts / evidence 2024
GLM-4 81.4% FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks / evidence 2025
WizardCoder-34B 81.2% WizardCoder: Empowering Code Large Language Models with Evol-Instruct / evidence 2023
Claude-3-Opus 80.5% NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts / evidence 2024
GPT-4 80.0% Reflexion: Language Agents with Verbal Reinforcement Learning / evidence 2023
Qwen2.5 79.6% FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks / evidence 2025
DeepSeek-Coder 79.3% NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts / evidence 2024
GLM-4 79.2% FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks / evidence 2025
DeepSeek-33B 78.6% Smaller = Weaker? Benchmarking Robustness of Quantized LLMs in Code Generation / evidence 2025
GPT-4 78.5% HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization / evidence 2024
WizardCoder-15B 78.1% WizardCoder: Empowering Code Large Language Models with Evol-Instruct / evidence 2023
LLaMA-3B 76.9% Smaller = Weaker? Benchmarking Robustness of Quantized LLMs in Code Generation / evidence 2025
Gemini-1.5-Pro 75.0% Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context / evidence 2024
Gemini-1.5-Flash 73.0% Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context / evidence 2024
Llama-3.1-8B 72.6% SIMCOPILOT: Evaluating Large Language Models for Copilot-Style Code Generation / evidence 2025
Code-Llama 67.0% Code Llama: Open Foundation Models for Code / evidence 2023
CodeLlama-7B 67.0% Code Llama: Open Foundation Models for Code / evidence 2023
CodeLlama-13B 67.0% Code Llama: Open Foundation Models for Code / evidence 2023
Code-Llama-Python-7B 67.0% Code Llama: Open Foundation Models for Code / evidence 2023
Code-Llama-Python 67.0% Code Llama: Open Foundation Models for Code / evidence 2023
GPT-3.5 65.2% NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts / evidence 2024
Qwen2 64.6% Qwen2 Technical Report / evidence 2024
CodeT5+-2B 63.0% HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization / evidence 2024
GPT-3.5 62.5% HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization / evidence 2024
WaveCoder-Ultra-6.7B 61.4% HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation / evidence 2024
CodeGeeX-13B-FT 61.2% CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X / evidence 2023
CodeGeeX-13B 60.9% CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X / evidence 2023
DeepSeek-Coder 59.6% HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation / evidence 2024
Qwen2.5 59.6% HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation / evidence 2024
Qwen2.5 59.1% Qwen2.5 Technical Report / evidence 2024
Yi-Coder-9B 57.9% HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation / evidence 2024
OpenCoder-8B 56.1% HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation / evidence 2024
Claude-3.5-Sonnet 55.3% HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks / evidence 2024
Phi1-1.3B 52.0% Assessing Small Language Models for Code Generation: An Empirical Study with Benchmarks / evidence 2025
Magicoder-S-DS-6.7B 50.9% HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation / evidence 2024
CodeRM-8B 50.0% Dynamic Scaling of Unit Tests for Code Reward Modeling / evidence 2025
DeepSeek-Coder 49.0% Assessing Small Language Models for Code Generation: An Empirical Study with Benchmarks / evidence 2025
OpenCodeInterpreter-1.3B 49.0% Assessing Small Language Models for Code Generation: An Empirical Study with Benchmarks / evidence 2025
Gemini-1.5-Pro 48.0% Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context / evidence 2024
Yi-Coder-1.5B 45.0% Assessing Small Language Models for Code Generation: An Empirical Study with Benchmarks / evidence 2025
PolyCoder-2.7B 45.0% Assessing Small Language Models for Code Generation: An Empirical Study with Benchmarks / evidence 2025
LLaDA-MoE-7B-A1B 44.6% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
LLaDA-MoE-7B-A1B 43.3% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
Qwen2.5 41.0% Assessing Small Language Models for Code Generation: An Empirical Study with Benchmarks / evidence 2025
CodeGemma-2.0B 41.0% Assessing Small Language Models for Code Generation: An Empirical Study with Benchmarks / evidence 2025
StarCoder 40.0% StarCoder: may the source be with you! / evidence 2023
LLaDA-8B 38.4% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
Claude-3.5-Sonnet 36.8% HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks / evidence 2024
LLaDA-1.5 35.3% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
Dream 34.1% Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Principles / evidence 2025
Dream-v0-Instruct-7B 33.7% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
Qwen2.5 32.6% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
LLaDA 31.7% Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Principles / evidence 2025
CodeGen-16B 31.7% ReCode: Robustness Evaluation of Code Generation Models / evidence 2022
CodeGen-16B-mono 30.5% ReCode: Robustness Evaluation of Code Generation Models / evidence 2022
GPT-4o 27.7% HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks / evidence 2024
CodeGen-6B-mono 26.2% ReCode: Robustness Evaluation of Code Generation Models / evidence 2022
CodeGen-2B-mono 23.2% ReCode: Robustness Evaluation of Code Generation Models / evidence 2022
Gemini-1.5-Pro 22.9% HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks / evidence 2024
Cycle-2.7B 21.9% CYCLE: Learning to Self-Refine the Code Generation / evidence 2024
Pixtral-124B 21.3% HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks / evidence 2024
CodeGen2-16B 20.8% HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization / evidence 2024
CodeGen-6B-multi 19.5% ReCode: Robustness Evaluation of Code Generation Models / evidence 2022
CodeGen-16B-multi 19.5% ReCode: Robustness Evaluation of Code Generation Models / evidence 2022
Llama-3 18.4% Dynamic Scaling of Unit Tests for Code Reward Modeling / evidence 2025
Gemini-1.5-Flash 17.4% HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks / evidence 2024
Cycle-1B 15.9% CYCLE: Learning to Self-Refine the Code Generation / evidence 2024
CodeGen2-3.7B 15.4% HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization / evidence 2024
GPT-J-6B 15.2% ReCode: Robustness Evaluation of Code Generation Models / evidence 2022
CodeGen-2B-multi 14.0% ReCode: Robustness Evaluation of Code Generation Models / evidence 2022
InCoder-6B 10.4% ReCode: Robustness Evaluation of Code Generation Models / evidence 2022
InCoder-1B 10.4% ReCode: Robustness Evaluation of Code Generation Models / evidence 2022
InCoder-1.3B 9.0% Assessing Small Language Models for Code Generation: An Empirical Study with Benchmarks / evidence 2025
GPT-4o 3.4% Dynamic Scaling of Unit Tests for Code Reward Modeling / evidence 2025
PolyCoder-0.4B 3.0% Assessing Small Language Models for Code Generation: An Empirical Study with Benchmarks / evidence 2025
LLaMA-1B 1.3% Smaller = Weaker? Benchmarking Robustness of Quantized LLMs in Code Generation / evidence 2025
LLaMA-8B 1.1% Smaller = Weaker? Benchmarking Robustness of Quantized LLMs in Code Generation / evidence 2025

Logicvista (33)

Model Score Source paper Year
otter-9B 31.8% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
otter9B 31.8% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
LLAVA-7B 29.9% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
LLaVA-7B 29.9% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
LLaVA-NeXT-34B-NH 29.9% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
LLAVA7B 29.9% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
LLAVA-NEXT-7B-vicuna 26.2% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
LLAVANEXT-7B-vicuna 26.2% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
LLaVA-NeXT-7B-vicuna 26.2% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
GPT-4 23.4% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
instructBLIP-flan-t5-xl 23.4% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
LLAVA-NEXT-13B-vicuna 22.4% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
LLAVANEXT-13B-vicuna 22.4% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
LLaVA-NeXT-13B-vicuna 22.4% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
LLAVA-NEXT-34B-NH 20.6% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
LLAVANEXT-34B-NH 20.6% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
LLAVA-13B 18.7% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
LLaVA-13B 18.7% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
LLAVA13B 18.7% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
BLIP-2 17.8% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
instructBLIP-flan-t5-xxl 17.8% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
BLIP2 17.8% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
LLAVA-NEXT-7B-Mistral 16.8% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
LLAVANEXT-7B-mistral 16.8% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
LLaVA-NeXT-7B-mistral 16.8% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
miniGPT-vicuna-13B 13.1% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
miniGPTvicuna13B 13.1% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
miniGPTvicuna-13B 13.1% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
miniGPT-vicuna-7B 10.3% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
miniGPTvicuna7B 10.3% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
miniGPTvicuna-7B 10.3% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
instructBLIP-vicuna-7B 4.7% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024
instructBLIP-vicuna-13B 3.7% LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts / evidence 2024

Longvideobench (18)

Model Score Source paper Year
Qwen2 90.0% MaxInfo: A Training-Free Key-Frame Selection Method Using Maximum Volume for Enhanced Video Understanding / evidence 2025
GPT-4o 66.7% Adaptive Keyframe Sampling for Long Video Understanding / evidence 2025
AdaptToken-7B 65.2% AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding / evidence 2026
AdaptToken-Lite-7B 65.1% AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding / evidence 2026
Gemini-1.5-Pro 64.0% Adaptive Keyframe Sampling for Long Video Understanding / evidence 2025
AdaptToken-Lite-8B 63.8% AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding / evidence 2026
AdaptToken-8B 63.7% AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding / evidence 2026
Qwen2VL 62.7% Adaptive Keyframe Sampling for Long Video Understanding / evidence 2025
LLaVA-Video-7B 62.7% Adaptive Keyframe Sampling for Long Video Understanding / evidence 2025
LLaVA-Video-72B 61.9% Adaptive Keyframe Sampling for Long Video Understanding / evidence 2025
Gemini-1.5-Flash 61.6% Adaptive Keyframe Sampling for Long Video Understanding / evidence 2025
GPT-4V 61.3% Adaptive Keyframe Sampling for Long Video Understanding / evidence 2025
LLaVA-OneVision-72B 56.5% T*: Re-thinking Temporal Search for Long-Form Video Understanding / evidence 2025
GPT-4o 47.1% T*: Re-thinking Temporal Search for Long-Form Video Understanding / evidence 2025
LLaVA-Video-72B 3.5% MaxInfo: A Training-Free Key-Frame Selection Method Using Maximum Volume for Enhanced Video Understanding / evidence 2026
LLaVA-Video-7B 3.3% MaxInfo: A Training-Free Key-Frame Selection Method Using Maximum Volume for Enhanced Video Understanding / evidence 2025
LLaVA-Video-7B 3.3% MaxInfo: A Training-Free Key-Frame Selection Method Using Maximum Volume for Enhanced Video Understanding / evidence 2026
Qwen-VL-2B 1.5% MaxInfo: A Training-Free Key-Frame Selection Method Using Maximum Volume for Enhanced Video Understanding / evidence 2025

MATH (116)

Model Score Source paper Year
Qwen-Max 98.6% MARIO Eval: Evaluate Your Math LLM with your Math LLM--A mathematical dataset evaluation toolkit / evidence 2024
DeepSeek-R1 97.3% Quantitative Analysis of Performance Drop in DeepSeek Model Quantization / evidence 2025
Kimi-k1.5 96.2% Kimi k1.5: Scaling Reinforcement Learning with LLMs / evidence 2025
QwQ-32B-Preview 95.0% Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions / evidence 2025
GPT-4o 95.0% Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions / evidence 2025
Kimi-k1.5-Short-CoT 94.6% Kimi k1.5: Scaling Reinforcement Learning with LLMs / evidence 2025
GPT-3.5 93.7% MARIO Eval: Evaluate Your Math LLM with your Math LLM--A mathematical dataset evaluation toolkit / evidence 2024
Gemini-2.0 91.5% Beyond Meta-Reasoning: Metacognitive Consolidation for Self-Improving LLM Reasoning / evidence 2026
Claude-3.5-Sonnet 89.8% AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling / evidence 2024
Mistral-7B 89.2% GuiLoMo: Allocating Expert Number and Rank for LoRA-MoE via Bilevel Optimization with GuidedSelection Vectors / evidence 2025
Llama-2 88.3% GuiLoMo: Allocating Expert Number and Rank for LoRA-MoE via Bilevel Optimization with GuidedSelection Vectors / evidence 2025
Llama-3 87.0% GuiLoMo: Allocating Expert Number and Rank for LoRA-MoE via Bilevel Optimization with GuidedSelection Vectors / evidence 2025
GPT-4o 85.9% AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling / evidence 2024
Llama-3.1-405B 84.9% AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling / evidence 2024
Gemma-2-it-9B 84.1% Building Math Agents with Multi-Turn Iterative Preference Learning / evidence 2024
Llama-3.1-8B 82.9% Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions / evidence 2025
Qwen2.5 81.0% LANPO: Bootstrapping Language and Numerical Feedback for Reinforcement Learning in LLMs / evidence 2025
DeepSeek-R1 79.8% LLMOrbit: A Circular Taxonomy of Large Language Models -From Scaling Walls to Agentic AI Systems / evidence 2026
Mistral-7B-v0.3 77.8% Building Math Agents with Multi-Turn Iterative Preference Learning / evidence 2024
Gemma-1.1-it-7B 77.5% Building Math Agents with Multi-Turn Iterative Preference Learning / evidence 2024
CodeGemma-1.1-it-7B 77.3% Building Math Agents with Multi-Turn Iterative Preference Learning / evidence 2024
GPT-4o 76.7% Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification / evidence 2025
Llama-3 76.6% Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs / evidence 2024
Qwen2.5 75.0% Do LLMs Overthink Basic Math Reasoning? Benchmarking the Accuracy-Efficiency Tradeoff in Language Models / evidence 2025
Qwen3 75.0% DiffCoT: Diffusion-styled Chain-of-Thought Reasoning in LLMs / evidence 2026
Phi-4 74.1% Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions / evidence 2025
Gemini-2.0 74.1% Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions / evidence 2025
GPT-4o 73.4% Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs / evidence 2024
GPT-4o 72.7% Beyond Meta-Reasoning: Metacognitive Consolidation for Self-Improving LLM Reasoning / evidence 2026
DeepSeek-V3 72.2% Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions / evidence 2025
DeepSeek-R1 72.2% Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions / evidence 2025
Llama-3 72.2% Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions / evidence 2025
GPT-3.5 72.2% Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions / evidence 2025
SmolLM-3B 72.0% Do Instruction-Tuned Models Always Perform Better Than Base Models? Evidence from Math and Domain-Shifted Benchmarks / evidence 2026
o3 71.3% Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions / evidence 2025
o1 71.3% Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions / evidence 2025
Qwen2.5 71.3% Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions / evidence 2025
GPT-4 69.7% Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification / evidence 2023
MINT-CoT-7B 69.6% MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning / evidence 2025
LLaVA-OV-1.5-RL 69.4% LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training / evidence 2025
Claude-3-Opus 67.7% Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs / evidence 2024
Qwen2 67.4% MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning / evidence 2025
Llama-3.1-70B 67.0% "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization / evidence 2024
Gemini-1.5-Pro 65.0% Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context / evidence 2024
GLM-4.5 64.0% GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models / evidence 2025
Gemini-1.5-Flash 63.0% Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context / evidence 2024
Gemini-1.5-Pro 62.2% Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context / evidence 2024
Qwen2.5 62.1% Qwen2.5 Technical Report / evidence 2024
LLaVA-OV-1.5 61.5% LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training / evidence 2025
GPT-4 60.1% Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs / evidence 2024
Llama-3.1-70B 60.0% Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification / evidence 2025
Gemini-1.5-Pro 58.5% Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs / evidence 2024
Skywork-RLHFlow 55.9% Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification / evidence 2025
BaseRL-7B 55.8% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
BaseRL-32B 55.3% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
LLaDA-MoE-7B-A1B 55.0% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
Qwen2 54.8% Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs / evidence 2024
LLaDA-8B 52.4% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
SwS-32B 50.0% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
Qwen-VL-Max 49.9% CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models / evidence 2024
Skywork 48.9% Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification / evidence 2025
Llama-3 48.9% Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification / evidence 2025
GPT-4 48.1% Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations / evidence 2023
Qwen2.5 46.4% MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning / evidence 2025
MathCoder 45.9% MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning / evidence 2023
Qwen2.5 45.7% MathMixup: Boosting LLM Mathematical Reasoning with Difficulty-Controllable Data Synthesis and Curriculum Learning / evidence 2026
LLaDA-MoE-7B-A1B 44.6% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
Qwen2.5 44.1% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
Qwen2 43.0% CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models / evidence 2024
Llama-3.1-70B 42.8% Enhancing Reasoning Capabilities of LLMs via Principled Synthetic Logic Corpus / evidence 2024
GPT-3.5 42.5% Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs / evidence 2024
GPT-4 42.5% Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models' Reasoning Performance / evidence 2023
Gemini 41.9% CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models / evidence 2024
Llama-3 39.3% DiffCoT: Diffusion-styled Chain-of-Thought Reasoning in LLMs / evidence 2026
Dream 38.7% Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Principles / evidence 2025
LLaDA-1.5 37.2% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
Dream-v0-Instruct-7B 37.0% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
PRIME-RL-7B 35.7% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
SwS-7B 35.0% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
Open-Reasoner-32B 34.9% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
Qwen2.5 34.8% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
Oat-Zero-7B 31.4% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
GPT-5 30.5% Is Mathematical Problem-Solving Expertise in Large Language Models Associated with Assessment Performance? / evidence 2026
SimpleRL-Base-7B 30.5% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
LLaDA 30.2% Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Principles / evidence 2025
DeepSeek-R1 30.0% Token-Hungry, Yet Precise: DeepSeek R1 Highlights the Need for Multi-Step Reasoning Over Speed in MATH / evidence 2025
Llama-3.1-8B 30.0% TokenSkip: Controllable Chain-of-Thought Compression in LLMs / evidence 2025
GPT-4 29.8% Is Mathematical Problem-Solving Expertise in Large Language Models Associated with Assessment Performance? / evidence 2026
SimpleRL-Base-32B 29.8% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
GPT-4o 29.0% CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models / evidence 2024
Llama-3 29.0% Do Instruction-Tuned Models Always Perform Better Than Base Models? Evidence from Math and Domain-Shifted Benchmarks / evidence 2026
Llama-2 28.9% SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models / evidence 2024
Open-Reasoner-7B 27.6% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
Qwen2.5 27.0% s1: Simple test-time scaling / evidence 2025
s1-32B 27.0% s1: Simple test-time scaling / evidence 2025
Phi-3 26.5% SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models / evidence 2024
Llama-3.1-8B 25.3% MathMixup: Boosting LLM Mathematical Reasoning with Difficulty-Controllable Data Synthesis and Curriculum Learning / evidence 2026
MathCoder-L 23.3% MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning / evidence 2023
InternLM2.5-7B 22.9% MathMixup: Boosting LLM Mathematical Reasoning with Difficulty-Controllable Data Synthesis and Curriculum Learning / evidence 2026
InternLM-VL 19.8% CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models / evidence 2024
Llama-2 19.2% Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations / evidence 2023
LLaVA-1.5-13B 18.9% AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting / evidence 2024
LLaVA-v1.5 18.1% CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models / evidence 2024
LLaVA-v1.6-mistral 16.8% CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models / evidence 2024
Phi-2-2.7B 16.1% SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models / evidence 2024
Llama-2 14.9% DONOD: Efficient and Generalizable Instruction Fine-Tuning for LLMs via Model-Intrinsic Dataset Pruning / evidence 2025
GPT-2-1.5B 8.3% Measuring Mathematical Problem Solving With the MATH Dataset / evidence 2021
GPT-3-175B 7.7% Measuring Mathematical Problem Solving With the MATH Dataset / evidence 2021
GPT-2-0.7B 6.9% Measuring Mathematical Problem Solving With the MATH Dataset / evidence 2021
GPT-3-13B-FineTuned 6.8% Measuring Mathematical Problem Solving With the MATH Dataset / evidence 2021
Llama-1-RFT 6.7% MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning / evidence 2023
GPT-2-0.3B 6.7% Measuring Mathematical Problem Solving With the MATH Dataset / evidence 2021
GPT-2-0.1B 5.2% Measuring Mathematical Problem Solving With the MATH Dataset / evidence 2021
GPT-3-13B 4.1% Measuring Mathematical Problem Solving With the MATH Dataset / evidence 2021
SwS-3B 2.2% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025
BaseRL-3B 0.0% SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence 2025

MBPP (60)

Model Score Source paper Year
Code-Llama-7B 100.0% Code Llama: Open Foundation Models for Code / evidence 2023
Code-Llama-13B 100.0% Code Llama: Open Foundation Models for Code / evidence 2023
Code-Llama-34B 100.0% Code Llama: Open Foundation Models for Code / evidence 2023
DeepSeek-Coder 94.2% ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning / evidence 2026
DeepSeek-R1 92.6% Quantitative Analysis of Performance Drop in DeepSeek Model Quantization / evidence 2025
Qwen2.5 85.2% ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning / evidence 2026
Qwen2.5-72B 84.7% Qwen2.5 Technical Report / evidence 2024
Qwen3 84.0% ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning / evidence 2026
GPT-5.1 84.0% ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning / evidence 2026
ReflexiCoder-8B 81.8% ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning / evidence 2026
Seed-Coder-8B 76.8% ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning / evidence 2026
WizardCoder-34B 75.4% WizardCoder: Empowering Code Large Language Models with Evol-Instruct / evidence 2023
StarCoder-3B 75.0% Smaller = Weaker? Benchmarking Robustness of Quantized LLMs in Code Generation / evidence 2025
WizardCoder-15B 72.3% WizardCoder: Empowering Code Large Language Models with Evol-Instruct / evidence 2023
Qwen2.5 69.2% Qwen2.5 Technical Report / evidence 2024
Code-Llama 65.0% Code Llama: Open Foundation Models for Code / evidence 2023
Code-L-Llama 65.0% Code Llama: Open Foundation Models for Code / evidence 2023
CodeLlama-7B 65.0% Code Llama: Open Foundation Models for Code / evidence 2023
CodeLlama-13B 65.0% Code Llama: Open Foundation Models for Code / evidence 2023
Code-Llama-Python-7B 65.0% Code Llama: Open Foundation Models for Code / evidence 2023
Code-Llama-Python 65.0% Code Llama: Open Foundation Models for Code / evidence 2023
DeepSeek-33B 62.5% Smaller = Weaker? Benchmarking Robustness of Quantized LLMs in Code Generation / evidence 2025
DeepSeek-1.3B 62.1% Smaller = Weaker? Benchmarking Robustness of Quantized LLMs in Code Generation / evidence 2025
CodeGeeX-13B 61.3% CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X / evidence 2023
Gemini-1.5-Pro 54.7% Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context / evidence 2024
Dream 54.2% Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Principles / evidence 2025
GPT-4o 50.0% Dynamic Scaling of Unit Tests for Code Reward Modeling / evidence 2025
LLaMA-8B 42.9% Smaller = Weaker? Benchmarking Robustness of Quantized LLMs in Code Generation / evidence 2025
LLaDA 40.8% Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Principles / evidence 2025
CodeGen-16B-mono 40.7% ReCode: Robustness Evaluation of Code Generation Models / evidence 2022
Qwen2.5 38.6% HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation / evidence 2024
LLaDA-MoE-7B-A1B 38.4% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
CodeGen-16B 36.1% ReCode: Robustness Evaluation of Code Generation Models / evidence 2022
CodeGen-6B-mono 36.1% ReCode: Robustness Evaluation of Code Generation Models / evidence 2022
DeepSeek-Coder 35.1% HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation / evidence 2024
Cycle-2.7B 34.7% CYCLE: Learning to Self-Refine the Code Generation / evidence 2024
Magicoder-S-DS-6.7B 33.3% HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation / evidence 2024
WaveCoder-Ultra-6.7B 26.3% HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation / evidence 2024
Cycle-1B 25.8% CYCLE: Learning to Self-Refine the Code Generation / evidence 2024
InCoder-6B 24.1% ReCode: Robustness Evaluation of Code Generation Models / evidence 2022
CodeGen-16B-multi 24.1% ReCode: Robustness Evaluation of Code Generation Models / evidence 2022
CodeGen-6B-multi 22.1% ReCode: Robustness Evaluation of Code Generation Models / evidence 2022
Yi-Coder-9B 21.1% HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation / evidence 2024
GPT-J-6B 19.9% ReCode: Robustness Evaluation of Code Generation Models / evidence 2022
CodeGen-2B-mono 19.1% ReCode: Robustness Evaluation of Code Generation Models / evidence 2022
CodeGen-2B-multi 19.1% ReCode: Robustness Evaluation of Code Generation Models / evidence 2022
GPT-4 16.8% DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation / evidence 2025
Llama-3 16.8% DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation / evidence 2025
Gemini-2.5-Pro 16.8% DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation / evidence 2025
Claude-3 16.8% DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation / evidence 2025
DeepSeek-R1 16.8% DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation / evidence 2025
Mistral 16.8% DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation / evidence 2025
Qwen 16.8% DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation / evidence 2025
o1 16.8% DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation / evidence 2025
o3 16.8% DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation / evidence 2025
InCoder-1B 12.8% ReCode: Robustness Evaluation of Code Generation Models / evidence 2022
OpenCoder-8B 10.5% HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation / evidence 2024
DeepSeek-6.7B 1.5% Smaller = Weaker? Benchmarking Robustness of Quantized LLMs in Code Generation / evidence 2025
LLaMA-3B 1.1% Smaller = Weaker? Benchmarking Robustness of Quantized LLMs in Code Generation / evidence 2025
LLaMA-1B 1.1% Smaller = Weaker? Benchmarking Robustness of Quantized LLMs in Code Generation / evidence 2025

MMLU (77)

Model Score Source paper Year
DeepSeek-R1 90.8% Quantitative Analysis of Performance Drop in DeepSeek Model Quantization / evidence 2025
o3 89.0% Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities / evidence 2023
Gemini-2.5-Pro 88.1% Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities / evidence 2023
GLM-4.5 87.8% GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models / evidence 2025
GPT-4 87.3% Vendi-RAG: Adaptively Trading-Off Diversity And Quality Significantly Improves Retrieval Augmented Generation With LLMs / evidence 2025
GPT-4 87.3% Adaptive Self-Prompting in Agentic LLM Frameworks for Code Fault Detection / evidence 2026
GPT-4 87.3% Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities / evidence 2023
GPT-4 87.3% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
GPT-4 87.3% Capabilities of GPT-4 on Medical Challenge Problems / evidence 2023
o1 86.9% Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities / evidence 2023
GPT-4 86.4% Data Engineering for Scaling Language Models to 128K Context / evidence 2024
Qwen2.5 86.1% Qwen2.5 Technical Report / evidence 2024
Llama-3.1-70B 85.2% Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities / evidence 2023
OLMoE-1B-7B-0125 84.3% Task-Conditioned Routing Signatures in Sparse Mixture-of-Experts Transformers / evidence 2026
Qwen2 84.2% Qwen2 Technical Report / evidence 2024
Claude-3 84.0% Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities / evidence 2023
Qwen 83.6% Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities / evidence 2023
DeepSeek-R1 82.5% Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities / evidence 2023
Mistral 80.7% Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities / evidence 2023
Foundation-Sec-8B-Reasoning 78.2% Llama-3.1-FoundationAI-SecurityLLM-Reasoning-8B Technical Report / evidence 2026
Llama-3.1-8B 78.2% Llama-3.1-FoundationAI-SecurityLLM-Reasoning-8B Technical Report / evidence 2026
MEDITRON-70B 76.0% MEDITRON-70B: Scaling Medical Pretraining for Large Language Models / evidence 2023
Gemini-1.5-Pro 75.3% Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context / evidence 2024
Flan-PaLM-540B 75.2% Scaling Instruction-Finetuned Language Models / evidence 2022
Flan-PaLM 75.2% Scaling Instruction-Finetuned Language Models / evidence 2022
Gemma-2-9B 75.0% Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark / evidence 2025
gemma-2-9b 75.0% Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark / evidence 2025
Qwen2.5 74.9% Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark / evidence 2025
GLM-4 74.3% SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization / evidence 2024
Llama-3.1-70B 74.0% "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization / evidence 2024
Dream 72.6% Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Principles / evidence 2025
InternVL-2.5 72.0% Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling / evidence 2024
LLaDA-MoE-7B-A1B 70.0% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
LLaDA-MoE-7B-A1B 70.0% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
Qwen2.5 70.0% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
Phi-3 69.0% Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone / evidence 2024
GPT-3.5 67.3% Data Engineering for Scaling Language Models to 128K Context / evidence 2024
Llama-3.1-8B 66.9% Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark / evidence 2025
LLaDA-8B 66.2% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
PaLM 65.8% Scaling Instruction-Finetuned Language Models / evidence 2022
gemma-2-2b 65.4% Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark / evidence 2025
gemma-2-9B 65.4% Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark / evidence 2025
Phi-3 63.7% Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark / evidence 2025
Llama-3.1-8B 63.5% Which Quantization Should I Use? A Unified Evaluation of llama.cpp Quantization on Llama-3.1-8B-Instruct / evidence 2026
Llama-3 63.5% SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization / evidence 2024
Dream-v0-Instruct-7B 63.1% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
LLaDA-1.5 63.1% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
gemma2-9b 63.0% Setting Standards in Turkish NLP: TR-MMLU for Large Language Model Evaluation / evidence 2024
gemma2:9b 63.0% Setting Standards in Turkish NLP: TR-MMLU for Large Language Model Evaluation / evidence 2024
LLaDA 62.1% Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Principles / evidence 2025
Phi-4 61.8% Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment / evidence 2026
Llama-3 59.9% Unifying Block-wise PTQ and Distillation-based QAT for Progressive Quantization toward 2-bit Instruction-Tuned LLMs / evidence 2025
YaRN-Mistral-7B 59.4% Data Engineering for Scaling Language Models to 128K Context / evidence 2024
GPT-4 57.0% Investigating Data Contamination in Modern Benchmarks for Large Language Models / evidence 2023
Mistral-7B-Instruct-v0.3 55.7% Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment / evidence 2026
ChatGPT 52.0% Investigating Data Contamination in Modern Benchmarks for Large Language Models / evidence 2023
Llama-3 50.7% Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment / evidence 2026
Llama-3 50.2% Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark / evidence 2025
LongLoRA-13B 50.1% Data Engineering for Scaling Language Models to 128K Context / evidence 2024
Llama-3.1-70B 50.0% Enhancing Reasoning Capabilities of LLMs via Principled Synthetic Logic Corpus / evidence 2024
GPT-4o 49.4% Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark / evidence 2025
Llama-3 47.0% Setting Standards in Turkish NLP: TR-MMLU for Large Language Model Evaluation / evidence 2024
Llama-2 46.7% Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs / evidence 2024
Vicuna-7B-V1.5 46.2% Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes / evidence 2024
Gemini-1.5-Pro 45.0% Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context / evidence 2024
Llama-3.1-8B 44.9% Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment / evidence 2026
Llama-2 44.8% Data Engineering for Scaling Language Models to 128K Context / evidence 2024
Llama-2 43.9% Rotated Robustness: A Training-Free Defense against Bit-Flip Attacks on Large Language Models / evidence 2026
Gemini-1.5-Flash 43.6% Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark / evidence 2025
LongChat-v1.5-7B 42.3% Data Engineering for Scaling Language Models to 128K Context / evidence 2024
Gemma-2-2B 38.9% Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark / evidence 2025
gemma-2-2B 38.9% Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark / evidence 2025
LongLoRA-7B 37.9% Data Engineering for Scaling Language Models to 128K Context / evidence 2024
Llama-2 37.8% Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes / evidence 2024
Qwen2.5 25.0% Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark / evidence 2025
Gemini-1.5-Flash 8.2% Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context / evidence 2024
Mamba-2-Hybrid 2.6% An Empirical Study of Mamba-based Language Models / evidence 2024

MMLU-Pro (47)

Model Score Source paper Year
Kimi-Linear 84.3% Kimi Linear: An Expressive, Efficient Attention Architecture / evidence 2025
Gemma-2-70B 80.2% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Phi-3 78.9% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Qwen2 78.9% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Claude-3-Opus 78.0% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
o3 78.0% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Yi-Large 76.3% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
o1 76.3% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
GPT-4o 72.6% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Llama-3.1-70B 70.0% "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization / evidence 2024
GPT-4 69.7% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Llama-3.1-70B 69.7% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Gemma-2-9B 69.1% Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark / evidence 2025
Claude-3-Sonnet 68.5% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Gemini-2.5-Pro 66.0% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Gemma-2-27B 66.0% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Gemma-2-9B 66.0% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Llama-2 64.8% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Mixtral 63.7% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Qwen 61.5% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Qwen-14B 61.4% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Qwen2.5 60.6% Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark / evidence 2025
Qwen-1.5-7B 59.9% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Mistral-7B-Instruct-v0.1 58.7% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Qwen-32B 58.7% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Qwen2.5 58.1% Qwen2.5 Technical Report / evidence 2024
Llama-3.1-8B 57.1% Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark / evidence 2025
Llama-3 56.2% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Phi-3 54.8% Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark / evidence 2025
Mistral-7B-Instruct-v0.2 53.5% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Gemma-7B 49.9% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Qwen-34B 45.2% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Gemini-2 43.0% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Qwen-1.5-72B 43.0% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Llama-3 42.0% Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark / evidence 2025
Qwen-72B 35.2% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
DeepSeek-V2 34.2% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Qwen-72-7B 34.2% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Gemma-2-2B 31.2% Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark / evidence 2025
Qwen-1.5-14B 28.7% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Mistral-7B 28.7% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Qwen-1.5-34B 25.5% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Gemma-2B 25.5% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Dream 24.1% Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Principles / evidence 2025
LLaDA 23.3% Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Principles / evidence 2025
Qwen-14.8B 22.4% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024
Mistral-7B-v0.1 20.0% MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark / evidence 2024

Pass@1 (17)

Model Score Source paper Year
GPT-3.5 86.8% Benchmarking Prompt Engineering Techniques for Secure Code Generation with GPT Models / evidence 2025
GPT-4o 86.7% Benchmarking Prompt Engineering Techniques for Secure Code Generation with GPT Models / evidence 2025
DeepSeek-R1 65.9% Emissions and Performance Trade-off Between Small and Large Language Models / evidence 2025
DeepSeek-Coder 49.4% Improving the Robustness of Large Language Models for Code Tasks via Fine-tuning with Perturbed Data / evidence 2026
ChatGPT 40.0% Structured Chain-of-Thought Prompting for Code Generation / evidence 2023
codellama-7b-hf-float16 37.9% Improving the Robustness of Large Language Models for Code Tasks via Fine-tuning with Perturbed Data / evidence 2026
Mistral-7B-Instruct-v0.2 36.0% Emissions and Performance Trade-off Between Small and Large Language Models / evidence 2025
Mistral-7B 36.0% Emissions and Performance Trade-off Between Small and Large Language Models / evidence 2025
Qwen3 29.8% Emissions and Performance Trade-off Between Small and Large Language Models / evidence 2025
Qwen2.5 29.8% Emissions and Performance Trade-off Between Small and Large Language Models / evidence 2025
codellama-7b 28.7% Improving the Robustness of Large Language Models for Code Tasks via Fine-tuning with Perturbed Data / evidence 2026
codellama-7b-hf-float16 28.6% Improving the Robustness of Large Language Models for Code Tasks via Fine-tuning with Perturbed Data / evidence 2026
starcoderbase-3b 20.2% Improving the Robustness of Large Language Models for Code Tasks via Fine-tuning with Perturbed Data / evidence 2026
starcoderbase-1b 15.3% Improving the Robustness of Large Language Models for Code Tasks via Fine-tuning with Perturbed Data / evidence 2026
codegen-2b 14.3% Improving the Robustness of Large Language Models for Code Tasks via Fine-tuning with Perturbed Data / evidence 2026
CodeT5+ 12.0% Emissions and Performance Trade-off Between Small and Large Language Models / evidence 2025
CodeT5-small 12.0% Emissions and Performance Trade-off Between Small and Large Language Models / evidence 2025

Rouge-L (22)

Model Score Source paper Year
Mistral 91.0% Making Knowledge Accessible: Divergent Readability-Accuracy Strategies of Mistral and QWen in Biomedical Text Simplification / evidence 2025
QWen 91.0% Making Knowledge Accessible: Divergent Readability-Accuracy Strategies of Mistral and QWen in Biomedical Text Simplification / evidence 2025
CodeT5-small 68.0% Emissions and Performance Trade-off Between Small and Large Language Models / evidence 2025
GPT-2-Medium 67.9% Differentially Private Fine-tuning of Language Models / evidence 2021
GPT-2-Large 67.8% Differentially Private Fine-tuning of Language Models / evidence 2021
DeepSeek-R1 60.0% LLMs in Code Vulnerability Analysis: A Proof of Concept / evidence 2026
Mistral-7B 57.0% LLMs in Code Vulnerability Analysis: A Proof of Concept / evidence 2026
Codestral-22B 55.0% LLMs in Code Vulnerability Analysis: A Proof of Concept / evidence 2026
Llama-3.1-8B 53.0% LLMs in Code Vulnerability Analysis: A Proof of Concept / evidence 2026
DeepSeek-V2 53.0% LLMs in Code Vulnerability Analysis: A Proof of Concept / evidence 2026
Qwen2.5 52.0% LLMs in Code Vulnerability Analysis: A Proof of Concept / evidence 2026
CodeLlama-7B 50.0% LLMs in Code Vulnerability Analysis: A Proof of Concept / evidence 2026
Gemma-7B 49.0% LLMs in Code Vulnerability Analysis: A Proof of Concept / evidence 2026
CodeGemma-7B 48.0% LLMs in Code Vulnerability Analysis: A Proof of Concept / evidence 2026
Mistral-7B-Instruct-v0.2 33.0% Emissions and Performance Trade-off Between Small and Large Language Models / evidence 2025
Mistral-7B 33.0% Emissions and Performance Trade-off Between Small and Large Language Models / evidence 2025
DeepSeek-R1 24.0% Emissions and Performance Trade-off Between Small and Large Language Models / evidence 2025
T5 20.4% Automated Literature Review Using NLP Techniques and LLM-Based Retrieval-Augmented Generation / evidence 2024
GPT-3.5 18.1% Automated Literature Review Using NLP Techniques and LLM-Based Retrieval-Augmented Generation / evidence 2024
Qwen2.5 18.0% Emissions and Performance Trade-off Between Small and Large Language Models / evidence 2025
Qwen3 18.0% Emissions and Performance Trade-off Between Small and Large Language Models / evidence 2025
Llama-3.1-8B 2.6% Parameter Efficient Fine Tuning Llama 3.1 for Answering Arabic Legal Questions: A Case Study on Jordanian Laws / evidence 2026

SWE-bench (28)

Model Score Source paper Year
GPT-4o 83.4% FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks / evidence 2025
GPT-4 78.8% SWE-ABS: Adversarial Benchmark Strengthening Exposes Inflated Success Rates on Test-based Benchmark / evidence 2026
Gemini-3-Pro 76.2% Kodezi Chronos: A Debugging-First Language Model for Repository-Scale Code Understanding / evidence 2025
Claude-4.5-Opus 74.4% Kodezi Chronos: A Debugging-First Language Model for Repository-Scale Code Understanding / evidence 2025
Claude-4.5-Sonnet 72.7% Kodezi Chronos: A Debugging-First Language Model for Repository-Scale Code Understanding / evidence 2025
Claude-4.1-Opus 72.5% Kodezi Chronos: A Debugging-First Language Model for Repository-Scale Code Understanding / evidence 2025
DeepSeek-R1 72.1% FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks / evidence 2025
GPT-4 72.0% SWE-bench Goes Live! / evidence 2025
Qwen3 69.6% Open-Source vs. Commercial Coding Assistants: A 2025 Comparison of DeepSeek R1, Qwen 2.5 and Claude 3.7 / evidence 2025
Claude-3.5 56.4% FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks / evidence 2025
Gemini-1.5-Pro 55.0% Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context / evidence 2024
Gemini-1.5-Flash 53.0% Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context / evidence 2024
Claude-3.5 51.3% SGAgent: Suggestion-Guided LLM-Based Multi-Agent Framework for Repository-Level Software Repair / evidence 2026
GLM-4 47.6% FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks / evidence 2025
DeepSeek-R1 44.8% FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks / evidence 2025
Qwen2.5 43.5% FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks / evidence 2025
Qwen2.5 40.0% FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks / evidence 2025
LLaDA-MoE-7B-A1B 37.2% LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence 2025
GLM-4 35.7% FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks / evidence 2025
Claude-3.7-Sonnet 30.3% SWE-bench Goes Live! / evidence 2025
SWE-Gym-32B 15.3% Training Software Engineering Agents and Verifiers with SWE-Gym / evidence 2024
SWE-Gym-14B 12.7% Training Software Engineering Agents and Verifiers with SWE-Gym / evidence 2024
DeepSeek-V3 11.3% SWE-bench Goes Live! / evidence 2025
SWE-Gym-7B 10.0% Training Software Engineering Agents and Verifiers with SWE-Gym / evidence 2024
GPT-4o 7.0% SWE-bench Goes Live! / evidence 2025
Gemini-2.0 5.1% Are They All Good? Evaluating the Quality of CoTs in LLM-based Code Generation / evidence 2025
DeepSeek-R1 4.8% Are They All Good? Evaluating the Quality of CoTs in LLM-based Code Generation / evidence 2025
o1 1.9% Are They All Good? Evaluating the Quality of CoTs in LLM-based Code Generation / evidence 2025

Videoeval-Pro (19)

Model Score Source paper Year
Gemini-1.5-Pro 47.2% VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation / evidence 2025
GPT-4 40.8% VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation / evidence 2025
Gemini-1.5-Flash 39.9% VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation / evidence 2025
GPT-4o 39.3% VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation / evidence 2025
Gemini-2.5-Flash 35.1% VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation / evidence 2025
Qwen2.5 33.9% VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation / evidence 2025
Qwen2 33.3% VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation / evidence 2025
VideoChat-Flash-7B 33.3% VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation / evidence 2025
InternVL2.5-8B 31.7% VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation / evidence 2025
InternVL3-8B 28.8% VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation / evidence 2025
LLaVA-Video-7B 28.5% VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation / evidence 2025
Vamba-10B 28.1% VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation / evidence 2025
LongVU-7B 25.9% VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation / evidence 2025
Video-XL-7B 22.3% VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation / evidence 2025
LongLLaVA-9B 21.7% VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation / evidence 2025
LongVA-7B 20.5% VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation / evidence 2025
Phi-4 19.2% VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation / evidence 2025
Mantis-Idefics2-8B 17.8% VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation / evidence 2025
Video-LLaVA-8B 13.2% VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation / evidence 2025