SRCH:AF9F54A8
GPT-4 BABILong Score Discrepancy: Evaluation Protocol Variations Across Studies
Abstract
This report synthesises findings from 13 peer-reviewed papers addressing the following research question: Benchmark archaeology: investigate the BABILong score discrepancy for GPT-4, reported as 10.0%–85.0% (a spread of 75.0pp) across 2 papers. Sources: 'BABILong: Testing the Limits of LLMs wit' (10.0%); 'BABILong:. Large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation across various domains, including medicine. We present a comprehensive evaluation of GPT-4, a state-of-the-art LLM, on medical competency examinations and. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated multi-reviewer quality assessment produced a score of 1.5/10. This report is a machine-generated literature synthesis and does not constitute original research.
Research Question
Benchmark archaeology: investigate the BABILong score discrepancy for GPT-4, reported as 10.0%–85.0% (a spread of 75.0pp) across 2 papers. Sources: 'BABILong: Testing the Limits of LLMs wit' (10.0%); 'BABILong: Testing the Limits of LLMs wit' (85.0%). Identify evaluation protocol differences (few-shot, prompting, preprocessing).
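The spread quoted in the research question is the difference between the highest and lowest reported score, expressed in percentage points. A minimal sketch of that arithmetic follows; the function name `spread_pp` is illustrative and not part of any published tooling, and the two inputs are the scores quoted above.

```python
def spread_pp(scores):
    """Return the max-minus-min spread of percentage scores, in percentage points."""
    if not scores:
        raise ValueError("at least one score is required")
    return max(scores) - min(scores)


# The two GPT-4 BABILong scores quoted in the research question.
reported = [10.0, 85.0]
spread = spread_pp(reported)
print(f"spread: {spread:.1f}pp")  # prints "spread: 75.0pp"
```

The result is in percentage points (pp) because it subtracts one percentage from another; it is not a relative change.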
Verification Level
| Paper level | L1, Literature synthesis |
| Source-grounded claims | 0 |
| Claim record source | Not publicly specified |
Descriptive public verification status only; aggregate claim counts are public, but individual claim records are not exposed here.
Quality Tier
| Tier | Quarantine candidate |
| Basis | Review score is below 5.0; source-level inspection is required before relying on the synthesis. |
Descriptive public triage only; this tier does not alter current publication or DOI behavior.
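The stated basis for this tier is a single threshold comparison on the review score. The sketch below shows that rule under the assumption that the cutoff is a strict "below 5.0" test, as the Basis row says; the names `QUARANTINE_THRESHOLD` and `is_quarantine_candidate` are hypothetical, and the publisher's actual triage logic is not exposed here.

```python
# Assumed cutoff, taken from the stated basis "Review score is below 5.0".
QUARANTINE_THRESHOLD = 5.0


def is_quarantine_candidate(review_score: float) -> bool:
    """Return True when a review score (0-10 scale) falls strictly below the cutoff."""
    return review_score < QUARANTINE_THRESHOLD


# The abstract reports an automated review score of 1.5/10 for this synthesis.
print(is_quarantine_candidate(1.5))  # prints "True"
```

A score of exactly 5.0 would not be flagged under this reading, since the basis says "below" and not "at or below".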
Quality Dimensions
| Evidence strength | LOW |
| Uncertainty disclosure | MEDIUM |
| Reproducibility status | MEDIUM |
Automated triage signals derived from public fields; not human peer review or independent validation.
Correction Record
| Status | CURRENT |
| Correction count | 0 |
| Manifest contract | paper-manifest-v1.1 |
| Correction contract | correction-record-v1 |
Public corrections are additive records. Current status does not claim the synthesis is error-free.
Provenance
| Publisher | Assignee Research |
| Public provenance | L2, Public artifact record |
| Report artifact | Available |
| External record | Not registered |
| Claim lineage | 0 aggregate source-grounded claims |
| Review method | Automated multi-reviewer assessment |
| Quality guide | How to read scores, claims, manifests, and evidence links |
| Provenance contract | source-provenance-v1 |
| Note | Machine-generated synthesis of existing literature. Not primary research. |