Papers
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: Which frontier language models achieve the highest scores on GPQA Diamond, Humanity's Last Exam, and difficult reasoning benchmarks v9. 12 claims were extracted from source literature; 1 was independently verified…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: To what extent do domain gaps between synthetic and real-world video data degrade the feature representation quality of video encoders in k-nearest neighbors classification tasks. 0 claims were extracted from…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: Does continued pretraining on mathematical corpora improve robustness against adversarial perturbations in competition-level math problems for small decoder-only models. 0 claims were extracted from source…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What is the relationship between language model perplexity and downstream reasoning task performance v9. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents.…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How do language models compare to human experts on professional knowledge and science benchmarks v9. 15 claims were extracted from source literature; 7 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What are the state-of-the-art large language model results on reasoning benchmarks published recently v9. 12 claims were extracted from source literature; 6 were independently verified against retrieved…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does context length affect language model performance on multi-document reasoning and summarization v9. 17 claims were extracted from source literature; 0 were independently verified against retrieved…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the effect of model size on language model performance on logical reasoning tasks v9. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: What prompting strategies maximize language model accuracy on graduate-level science questions v9. 15 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does synthetic training data improve language model performance on mathematical reasoning benchmarks v9. 0 claims were extracted from source literature; 0 were independently verified against retrieved…
Abstract: This report synthesises findings from 19 peer-reviewed papers addressing the following research question: How does extended thinking time affect language model accuracy on competition-level mathematics v9. 14 claims were extracted from source literature; 9 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What training strategies improve language model generalization to novel mathematical reasoning problems v9. 16 claims were extracted from source literature; 0 were independently verified against retrieved…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How do language models handle multi-hop reasoning chains in scientific question answering v9. 12 claims were extracted from source literature; 3 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What are the failure modes of frontier language models on abstract mathematical reasoning v9. 12 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the comparative performance of open-source language models versus proprietary models on coding benchmarks v9. 13 claims were extracted from source literature; 1 was independently verified against…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does model quantization affect reasoning capability in large language models v9. 14 claims were extracted from source literature; 1 was independently verified against retrieved documents. An automated…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does retrieval augmentation improve language model performance on knowledge-intensive tasks v9. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the effect of instruction fine-tuning on language model mathematical problem-solving accuracy v9. 15 claims were extracted from source literature; 1 was independently verified against retrieved documents.…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does pretraining data quality affect language model reasoning benchmark performance v9. 12 claims were extracted from source literature; 1 was independently verified against retrieved documents. An automated…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How do multimodal language models perform on visual mathematical and scientific reasoning v9. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does reinforcement learning from human feedback improve language model mathematical reasoning v9. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 1 peer-reviewed paper addressing the following research question: What architectural innovations improve transformer performance on multi-step logical reasoning v9. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How do sparse mixture-of-experts models compare to dense transformers on mathematical reasoning v9. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the relationship between model scale and emergent reasoning capabilities in transformers v9. 16 claims were extracted from source literature; 3 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How does test-time compute scaling improve language model performance on reasoning benchmarks v9. 11 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…