Papers
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the impact of learning rate annealing on the robustness of open-source code models when evaluated on adversarial examples from the LiveCodeBench dataset, measured by pass@k scores and. 17 claims were…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What is the impact of different levels of synthetic data realism (e.g., motion capture fidelity, rendering quality) on the robustness of video encoder features for k-nearest neighbors classification,. 0 claims…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: Does the MathCoder2 pretraining approach improve robustness against adversarial perturbations in competition-level math problems for models under 3B parameters. 17 claims were extracted from source literature; 2…
Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: How does the performance of k-nearest neighbors classification using features from synthetic gesture videos compare to random forests when evaluated on real-world gesture recognition benchmarks like. 0 claims were…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does few-shot prompting with lightweight masked language models compare to large autoregressive models on low-resource clinical named entity recognition benchmarks. 13 claims were extracted from source…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How do different alignment techniques (e.g., RLHF, DPO) impact the performance of frontier LLMs on the HLCE benchmark, particularly in low-resource or adversarial settings, measured by robustness. 8 claims were…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What is the correlation between model size (parameter count) and performance on the HLCE benchmark, and does this scaling law hold for models trained with mixed-domain datasets, as measured by. 10 claims were…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does continued pretraining on model-translated mathematical code affect small decoder-only models' accuracy on the MATH benchmark compared to standard mathematical text pretraining. 18 claims were extracted…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: Comprehensive comparison of frontier large language models on mathematical reasoning, code generation, and scientific knowledge v9. 0 claims were extracted from source literature; 0 were independently verified…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the correlation between parameter count and pass@k scores for open-source code models across varying difficulty levels in the LiveCodeBench dataset. 16 claims were extracted from source literature; 0 were…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: Which frontier language models achieve the highest scores on GPQA Diamond, Humanity's Last Exam, and difficult reasoning benchmarks v9. 12 claims were extracted from source literature; 1 was independently verified…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: To what extent do domain gaps between synthetic and real-world video data degrade the feature representation quality of video encoders in k-nearest neighbors classification tasks. 0 claims were extracted from…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: Does continued pretraining on mathematical corpora improve robustness against adversarial perturbations in competition-level math problems for small decoder-only models. 0 claims were extracted from source…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What is the relationship between language model perplexity and downstream reasoning task performance v9. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents.…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How do language models compare to human experts on professional knowledge and science benchmarks v9. 15 claims were extracted from source literature; 7 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What are the state-of-the-art large language model results on reasoning benchmarks published recently v9. 12 claims were extracted from source literature; 6 were independently verified against retrieved…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does context length affect language model performance on multi-document reasoning and summarization v9. 17 claims were extracted from source literature; 0 were independently verified against retrieved…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the effect of model size on language model performance on logical reasoning tasks v9. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: What prompting strategies maximize language model accuracy on graduate-level science questions v9. 15 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does synthetic training data improve language model performance on mathematical reasoning benchmarks v9. 0 claims were extracted from source literature; 0 were independently verified against retrieved…
Abstract: This report synthesises findings from 19 peer-reviewed papers addressing the following research question: How does extended thinking time affect language model accuracy on competition-level mathematics v9. 14 claims were extracted from source literature; 9 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What training strategies improve language model generalization to novel mathematical reasoning problems v9. 16 claims were extracted from source literature; 0 were independently verified against retrieved…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How do language models handle multi-hop reasoning chains in scientific question answering v9. 12 claims were extracted from source literature; 3 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What are the failure modes of frontier language models on abstract mathematical reasoning v9. 12 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the comparative performance of open-source language models versus proprietary models on coding benchmarks v9. 13 claims were extracted from source literature; 1 was independently verified against…