| Qwen-Max | 98.6% | MARIO Eval: Evaluate Your Math LLM with your Math LLM--A mathematical dataset evaluation toolkit / evidence | 2024 |
| DeepSeek-R1 | 97.3% | Quantitative Analysis of Performance Drop in DeepSeek Model Quantization / evidence | 2025 |
| Kimi-k1.5 | 96.2% | Kimi k1.5: Scaling Reinforcement Learning with LLMs / evidence | 2025 |
| QwQ-32B-Preview | 95.0% | Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions / evidence | 2025 |
| GPT-4o | 95.0% | Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions / evidence | 2025 |
| Kimi-k1.5-Short-CoT | 94.6% | Kimi k1.5: Scaling Reinforcement Learning with LLMs / evidence | 2025 |
| GPT-3.5 | 93.7% | MARIO Eval: Evaluate Your Math LLM with your Math LLM--A mathematical dataset evaluation toolkit / evidence | 2024 |
| Gemini-2.0 | 91.5% | Beyond Meta-Reasoning: Metacognitive Consolidation for Self-Improving LLM Reasoning / evidence | 2026 |
| Claude-3.5-Sonnet | 89.8% | AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling / evidence | 2024 |
| Mistral-7B | 89.2% | GuiLoMo: Allocating Expert Number and Rank for LoRA-MoE via Bilevel Optimization with GuidedSelection Vectors / evidence | 2025 |
| Llama-2 | 88.3% | GuiLoMo: Allocating Expert Number and Rank for LoRA-MoE via Bilevel Optimization with GuidedSelection Vectors / evidence | 2025 |
| Llama-3 | 87.0% | GuiLoMo: Allocating Expert Number and Rank for LoRA-MoE via Bilevel Optimization with GuidedSelection Vectors / evidence | 2025 |
| GPT-4o | 85.9% | AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling / evidence | 2024 |
| Llama-3.1-405B | 84.9% | AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling / evidence | 2024 |
| Gemma-2-it-9B | 84.1% | Building Math Agents with Multi-Turn Iterative Preference Learning / evidence | 2024 |
| Llama-3.1-8B | 82.9% | Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions / evidence | 2025 |
| Qwen2.5 | 81.0% | LANPO: Bootstrapping Language and Numerical Feedback for Reinforcement Learning in LLMs / evidence | 2025 |
| DeepSeek-R1 | 79.8% | LLMOrbit: A Circular Taxonomy of Large Language Models -From Scaling Walls to Agentic AI Systems / evidence | 2026 |
| Mistral-7B-v0.3 | 77.8% | Building Math Agents with Multi-Turn Iterative Preference Learning / evidence | 2024 |
| Gemma-1.1-it-7B | 77.5% | Building Math Agents with Multi-Turn Iterative Preference Learning / evidence | 2024 |
| CodeGemma-1.1-it-7B | 77.3% | Building Math Agents with Multi-Turn Iterative Preference Learning / evidence | 2024 |
| GPT-4o | 76.7% | Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification / evidence | 2025 |
| Llama-3 | 76.6% | Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs / evidence | 2024 |
| Qwen2.5 | 75.0% | Do LLMs Overthink Basic Math Reasoning? Benchmarking the Accuracy-Efficiency Tradeoff in Language Models / evidence | 2025 |
| Qwen3 | 75.0% | DiffCoT: Diffusion-styled Chain-of-Thought Reasoning in LLMs / evidence | 2026 |
| Phi-4 | 74.1% | Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions / evidence | 2025 |
| Gemini-2.0 | 74.1% | Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions / evidence | 2025 |
| GPT-4o | 73.4% | Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs / evidence | 2024 |
| GPT-4o | 72.7% | Beyond Meta-Reasoning: Metacognitive Consolidation for Self-Improving LLM Reasoning / evidence | 2026 |
| DeepSeek-V3 | 72.2% | Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions / evidence | 2025 |
| DeepSeek-R1 | 72.2% | Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions / evidence | 2025 |
| Llama-3 | 72.2% | Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions / evidence | 2025 |
| GPT-3.5 | 72.2% | Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions / evidence | 2025 |
| SmolLM-3B | 72.0% | Do Instruction-Tuned Models Always Perform Better Than Base Models? Evidence from Math and Domain-Shifted Benchmarks / evidence | 2026 |
| o3 | 71.3% | Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions / evidence | 2025 |
| o1 | 71.3% | Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions / evidence | 2025 |
| Qwen2.5 | 71.3% | Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions / evidence | 2025 |
| GPT-4 | 69.7% | Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification / evidence | 2023 |
| MINT-CoT-7B | 69.6% | MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning / evidence | 2025 |
| LLaVA-OV-1.5-RL | 69.4% | LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training / evidence | 2025 |
| Claude-3-Opus | 67.7% | Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs / evidence | 2024 |
| Qwen2 | 67.4% | MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning / evidence | 2025 |
| Llama-3.1-70B | 67.0% | "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization / evidence | 2024 |
| Gemini-1.5-Pro | 65.0% | Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context / evidence | 2024 |
| GLM-4.5 | 64.0% | GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models / evidence | 2025 |
| Gemini-1.5-Flash | 63.0% | Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context / evidence | 2024 |
| Gemini-1.5-Pro | 62.2% | Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context / evidence | 2024 |
| Qwen2.5 | 62.1% | Qwen2.5 Technical Report / evidence | 2024 |
| LLaVA-OV-1.5 | 61.5% | LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training / evidence | 2025 |
| GPT-4 | 60.1% | Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs / evidence | 2024 |
| Llama-3.1-70B | 60.0% | Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification / evidence | 2025 |
| Gemini-1.5-Pro | 58.5% | Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs / evidence | 2024 |
| Skywork-RLHFlow | 55.9% | Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification / evidence | 2025 |
| BaseRL-7B | 55.8% | SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence | 2025 |
| BaseRL-32B | 55.3% | SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence | 2025 |
| LLaDA-MoE-7B-A1B | 55.0% | LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence | 2025 |
| Qwen2 | 54.8% | Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs / evidence | 2024 |
| LLaDA-8B | 52.4% | LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence | 2025 |
| SwS-32B | 50.0% | SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence | 2025 |
| Qwen-VL-Max | 49.9% | CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models / evidence | 2024 |
| Skywork | 48.9% | Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification / evidence | 2025 |
| Llama-3 | 48.9% | Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification / evidence | 2025 |
| GPT-4 | 48.1% | Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations / evidence | 2023 |
| Qwen2.5 | 46.4% | MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning / evidence | 2025 |
| MathCoder | 45.9% | MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning / evidence | 2023 |
| Qwen2.5 | 45.7% | MathMixup: Boosting LLM Mathematical Reasoning with Difficulty-Controllable Data Synthesis and Curriculum Learning / evidence | 2026 |
| LLaDA-MoE-7B-A1B | 44.6% | LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence | 2025 |
| Qwen2.5 | 44.1% | LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence | 2025 |
| Qwen2 | 43.0% | CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models / evidence | 2024 |
| Llama-3.1-70B | 42.8% | Enhancing Reasoning Capabilities of LLMs via Principled Synthetic Logic Corpus / evidence | 2024 |
| GPT-3.5 | 42.5% | Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs / evidence | 2024 |
| GPT-4 | 42.5% | Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models' Reasoning Performance / evidence | 2023 |
| Gemini | 41.9% | CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models / evidence | 2024 |
| Llama-3 | 39.3% | DiffCoT: Diffusion-styled Chain-of-Thought Reasoning in LLMs / evidence | 2026 |
| Dream | 38.7% | Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Principles / evidence | 2025 |
| LLaDA-1.5 | 37.2% | LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence | 2025 |
| Dream-v0-Instruct-7B | 37.0% | LLaDA-MoE: A Sparse MoE Diffusion Language Model / evidence | 2025 |
| PRIME-RL-7B | 35.7% | SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence | 2025 |
| SwS-7B | 35.0% | SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence | 2025 |
| Open-Reasoner-32B | 34.9% | SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence | 2025 |
| Qwen2.5 | 34.8% | SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence | 2025 |
| Oat-Zero-7B | 31.4% | SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence | 2025 |
| GPT-5 | 30.5% | Is Mathematical Problem-Solving Expertise in Large Language Models Associated with Assessment Performance? / evidence | 2026 |
| SimpleRL-Base-7B | 30.5% | SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence | 2025 |
| LLaDA | 30.2% | Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Principles / evidence | 2025 |
| DeepSeek-R1 | 30.0% | Token-Hungry, Yet Precise: DeepSeek R1 Highlights the Need for Multi-Step Reasoning Over Speed in MATH / evidence | 2025 |
| Llama-3.1-8B | 30.0% | TokenSkip: Controllable Chain-of-Thought Compression in LLMs / evidence | 2025 |
| GPT-4 | 29.8% | Is Mathematical Problem-Solving Expertise in Large Language Models Associated with Assessment Performance? / evidence | 2026 |
| SimpleRL-Base-32B | 29.8% | SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence | 2025 |
| GPT-4o | 29.0% | CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models / evidence | 2024 |
| Llama-3 | 29.0% | Do Instruction-Tuned Models Always Perform Better Than Base Models? Evidence from Math and Domain-Shifted Benchmarks / evidence | 2026 |
| Llama-2 | 28.9% | SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models / evidence | 2024 |
| Open-Reasoner-7B | 27.6% | SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence | 2025 |
| Qwen2.5 | 27.0% | s1: Simple test-time scaling / evidence | 2025 |
| s1-32B | 27.0% | s1: Simple test-time scaling / evidence | 2025 |
| Phi-3 | 26.5% | SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models / evidence | 2024 |
| Llama-3.1-8B | 25.3% | MathMixup: Boosting LLM Mathematical Reasoning with Difficulty-Controllable Data Synthesis and Curriculum Learning / evidence | 2026 |
| MathCoder-L | 23.3% | MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning / evidence | 2023 |
| InternLM2.5-7B | 22.9% | MathMixup: Boosting LLM Mathematical Reasoning with Difficulty-Controllable Data Synthesis and Curriculum Learning / evidence | 2026 |
| InternLM-VL | 19.8% | CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models / evidence | 2024 |
| Llama-2 | 19.2% | Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations / evidence | 2023 |
| LLaVA-1.5-13B | 18.9% | AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting / evidence | 2024 |
| LLaVA-v1.5 | 18.1% | CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models / evidence | 2024 |
| LLaVA-v1.6-mistral | 16.8% | CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models / evidence | 2024 |
| Phi-2-2.7B | 16.1% | SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models / evidence | 2024 |
| Llama-2 | 14.9% | DONOD: Efficient and Generalizable Instruction Fine-Tuning for LLMs via Model-Intrinsic Dataset Pruning / evidence | 2025 |
| GPT-2-1.5B | 8.3% | Measuring Mathematical Problem Solving With the MATH Dataset / evidence | 2021 |
| GPT-3-175B | 7.7% | Measuring Mathematical Problem Solving With the MATH Dataset / evidence | 2021 |
| GPT-2-0.7B | 6.9% | Measuring Mathematical Problem Solving With the MATH Dataset / evidence | 2021 |
| GPT-3-13B-FineTuned | 6.8% | Measuring Mathematical Problem Solving With the MATH Dataset / evidence | 2021 |
| Llama-1-RFT | 6.7% | MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning / evidence | 2023 |
| GPT-2-0.3B | 6.7% | Measuring Mathematical Problem Solving With the MATH Dataset / evidence | 2021 |
| GPT-2-0.1B | 5.2% | Measuring Mathematical Problem Solving With the MATH Dataset / evidence | 2021 |
| GPT-3-13B | 4.1% | Measuring Mathematical Problem Solving With the MATH Dataset / evidence | 2021 |
| SwS-3B | 2.2% | SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence | 2025 |
| BaseRL-3B | 0.0% | SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning / evidence | 2025 |