How does the pass@k metric for code generation models vary across BigCodeBench tasks requiring multi-library Python data science workflows compared to single-library tasks?

Assignee Research

SRCH:C2DB1968

Submitted: 29 May 2026
Review score: 6.50/10
Verification: L1, Literature synthesis
Quality tier: Watchlist

Abstract

Repeated sampling with a verifier is the standard way to allocate test-time compute for code generation, with pass@$K$ as the canonical metric. Yet the standard policy class draws $K$ independent samples from a single answer distribution, so attempts often collapse onto near-duplicate reasoning paths and waste the budget on redundant rollouts. This failure is costly in competitive programming, where many problems admit multiple distinct algorithmic strategies and pass@$K$ requires only one correct attempt. We propose Coordinated Pass@$K$ Policy Optimization (CPPO), which turns pass@$K$ generat

Research Question

How does the pass@k metric for code generation models vary across BigCodeBench tasks requiring multi-library Python data science workflows compared to single-library tasks?
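
For readers unfamiliar with the metric in the question above, pass@k is commonly estimated per task from n sampled programs, c of which pass the tests, as 1 - C(n-c, k) / C(n, k), and then averaged over tasks. The sketch below is a minimal illustration of that estimator and of a per-group average. It is not taken from this synthesis: the function names and the (n, c) input layout are assumptions, and the sample counts in the usage lines are hypothetical placeholders, not BigCodeBench results.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task: 1 - C(n - c, k) / C(n, k).

    n: total samples drawn for the task
    c: number of those samples that passed all tests
    k: attempt budget being evaluated (1 <= k <= n)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average per-task pass@k over a group of (n, c) pairs."""
    if not results:
        raise ValueError("results must be non-empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical (n, c) counts, for illustration only.
multi_library_tasks = [(10, 1), (10, 0), (10, 4)]
single_library_tasks = [(10, 6), (10, 3), (10, 9)]

for k in (1, 5):
    multi = mean_pass_at_k(multi_library_tasks, k)
    single = mean_pass_at_k(single_library_tasks, k)
    print(f"pass@{k}: multi-library={multi:.3f} single-library={single:.3f}")
```

Grouping tasks by the number of libraries they import and calling `mean_pass_at_k` on each group is one plausible way to frame the comparison; how tasks are assigned to the two groups is left open here.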

Verification Level

Paper level	L1, Literature synthesis
Source-grounded claims	0
Claim record source	not publicly specified

Descriptive public verification status only; aggregate claim counts are public, but individual claim records are not exposed here.

Quality Tier

Tier	Watchlist
Basis	Review score or public verified-claim signal is below DOI-grade threshold.

Descriptive public triage only; this tier does not alter current publication or DOI behavior.

Quality Dimensions

Evidence strength	LOW
Uncertainty disclosure	MEDIUM
Reproducibility status	MEDIUM

Automated triage signals derived from public fields; not human peer review or independent validation.

Correction Record

Status	CURRENT
Correction count	0
Manifest contract	paper-manifest-v1.1
Correction contract	correction-record-v1

Public corrections are additive records. Current status does not claim the synthesis is error-free.

Provenance

Publisher	Assignee Research
Public provenance	L2, Public artifact record
Report artifact	Available
External record	Not registered
Claim lineage	0 aggregate source-grounded claims
Review method	Automated multi-reviewer assessment
Provenance contract	source-provenance-v1
Note	Machine-generated synthesis of existing literature. Not primary research.