GPT-4 Babilong Score Discrepancy: Evaluation Protocol Variations Across Studies

Assignee Research

SRCH:AF9F54A8

Submitted: 31 May 2026
Review score: 1.50/10
Verification: L1, Literature synthesis
Quality tier: Quarantine candidate

Abstract

This report synthesises findings from 13 peer-reviewed papers addressing the following research question: Benchmark archaeology: investigate Babilong score discrepancy for GPT-4, reported 10.0%–85.0% (spread 75.0pp) across 2 papers. Sources: 'BABILong: Testing the Limits of LLMs wit' (10.0%); 'BABILong:. Large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation across various domains, including medicine. We present a comprehensive evaluation of GPT-4, a state-of-the-art LLM, on medical competency examinations and. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated multi-reviewer quality assessment produced a score of 1.5/10. This report is a machine-generated literature synthesis and does not constitute original research.

Research Question

Benchmark archaeology: investigate Babilong score discrepancy for GPT-4 — reported 10.0%–85.0% (spread 75.0pp) across 2 papers. Sources: 'BABILong: Testing the Limits of LLMs wit' (10.0%); 'BABILong: Testing the Limits of LLMs wit' (85.0%). Identify evaluation protocol differences (few-shot, prompting, preprocessing).
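
The spread quoted in the research question is the gap between the highest and lowest reported scores, expressed in percentage points. A minimal sketch of that arithmetic follows; the function name `spread_pp` is hypothetical and is not part of any published tooling for this report.

```python
def spread_pp(scores_percent):
    """Return the max-minus-min gap of reported scores, in percentage points."""
    if not scores_percent:
        raise ValueError("need at least one reported score")
    return max(scores_percent) - min(scores_percent)


# The two GPT-4 scores quoted in the research question (in percent).
reported = [10.0, 85.0]
print(f"spread: {spread_pp(reported):.1f}pp")  # prints "spread: 75.0pp"
```

The result is an absolute difference in percentage points, not a relative change, which is why it is written as 75.0pp rather than as a percentage.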

Verification Level

Paper level	L1, Literature synthesis
Source-grounded claims	0
Claim record source	not publicly specified

Descriptive public verification status only; aggregate claim counts are public, but individual claim records are not exposed here.

Quality Tier

Tier	Quarantine candidate
Basis	Review score is below 5.0; source-level inspection is required before relying on the synthesis.

Descriptive public triage only; this tier does not alter current publication or DOI behavior.
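
The stated basis for this tier can be read as a simple threshold check on the review score. The sketch below assumes only what the basis line says (a score below 5.0 marks a quarantine candidate); the function and constant names are hypothetical, and no other tiers are modelled because this record does not describe them.

```python
# Threshold taken from the stated basis: "Review score is below 5.0".
QUARANTINE_THRESHOLD = 5.0


def is_quarantine_candidate(review_score: float) -> bool:
    """Return True when the review score falls strictly below the threshold."""
    return review_score < QUARANTINE_THRESHOLD


# This report's review score is 1.50/10.
print(is_quarantine_candidate(1.50))  # prints "True"
```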

Quality Dimensions

Evidence strength	LOW
Uncertainty disclosure	MEDIUM
Reproducibility status	MEDIUM

Automated triage signals derived from public fields; not human peer review or independent validation.

Correction Record

Status	CURRENT
Correction count	0
Manifest contract	paper-manifest-v1.1
Correction contract	correction-record-v1

Public corrections are additive records. Current status does not claim the synthesis is error-free.

Provenance

Publisher	Assignee Research
Public provenance	L2, Public artifact record
Report artifact	Available
External record	Not registered
Claim lineage	0 aggregate source-grounded claims
Review method	Automated multi-reviewer assessment
Quality guide	How to read scores, claims, manifests, and evidence links
Provenance contract	source-provenance-v1
Note	Machine-generated synthesis of existing literature. Not primary research.