Qwen2.5 on HumanEval

Evidence cluster: 7 reported scores
Source coverage: 7 distinct sources (broad)
Source profile: arxiv.org, 2024 to 2026
Reported range: 32.6% to 96.3%
Spread: 63.7 pp (HIGH)

Reported Scores

Model	Score	Source paper	Year
Qwen2.5	96.3%	ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning / arxiv.org	2026
Qwen2.5	82.2%	FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks / arxiv.org	2025
Qwen2.5	79.6%	FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks / arxiv.org	2025
Qwen2.5	59.6%	HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation / arxiv.org	2024
Qwen2.5	59.1%	Qwen2.5 Technical Report / arxiv.org	2024
Qwen2.5	41.0%	Assessing Small Language Models for Code Generation: An Empirical Study with Benchmarks / arxiv.org	2025
Qwen2.5	32.6%	LLaDA-MoE: A Sparse MoE Diffusion Language Model / arxiv.org	2025

Interpretation

This page groups score claims extracted from papers for the same model and benchmark label. A nonzero spread means the public literature reports different values for this cluster.
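
The range and spread shown at the top of this page can be reproduced from the Reported Scores table. The following is a minimal sketch, assuming the spread is simply the highest reported score minus the lowest, expressed in percentage points. The function name `spread_pp` is illustrative and not part of any published tooling, and the threshold behind the HIGH label is not stated on this page, so it is not modeled here.

```python
# Scores copied from the Reported Scores table, in percent.
scores = [96.3, 82.2, 79.6, 59.6, 59.1, 41.0, 32.6]


def spread_pp(values):
    """Return (lowest, highest, spread), with spread in percentage points.

    Assumption: spread = max - min, rounded to one decimal place to match
    the precision of the reported scores.
    """
    lowest, highest = min(values), max(values)
    return lowest, highest, round(highest - lowest, 1)


lowest, highest, spread = spread_pp(scores)
print(f"Reported range: {lowest}% to {highest}%")  # Reported range: 32.6% to 96.3%
print(f"Spread: {spread} pp")                      # Spread: 63.7 pp
```

A spread of 0.0 would mean every source in the cluster reports the same value.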

Differences are not automatically errors. They may come from prompt choices, dataset versions, evaluation protocol, scoring rule, preprocessing, fine-tuning, or reporting convention. Source papers remain authoritative for their own claims. See the quality guide for how to read evidence links, manifests, and automated assessment fields.

Source coverage is a conservative count of distinct public paper URLs or titles in the cluster. It measures coverage breadth, not correctness.
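
One way to picture such a count is sketched below. This is an assumption-laden illustration, not the page's actual procedure: it keys each record by its URL when one is present and otherwise by a whitespace- and case-normalized title, so repeated records collapse into one source. The record layout (`title`, `url`) and the example entries are hypothetical placeholders.

```python
def source_coverage(records):
    """Count distinct sources among extracted records.

    Assumption: a record is identified by its URL if available,
    otherwise by its title with case and extra whitespace normalized.
    """
    keys = set()
    for rec in records:
        url = (rec.get("url") or "").strip().lower()
        title = " ".join((rec.get("title") or "").split()).lower()
        key = url or title
        if key:  # skip records that carry neither a URL nor a title
            keys.add(key)
    return len(keys)


# Hypothetical records, for illustration only.
records = [
    {"title": "Paper A", "url": "https://example.org/a"},
    {"title": "Paper A", "url": "https://example.org/a"},  # same URL, counted once
    {"title": "Paper B", "url": None},
    {"title": "paper  b", "url": None},  # same title after normalization
]
print(source_coverage(records))  # 2
```

Because duplicates collapse, the count can be lower than the number of reported scores when one paper contributes several values.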

Source profile reports public URL domains and publication years when they are available in extracted records. It is included for auditability only.