Papers
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How do multimodal language models perform on visual mathematical and scientific reasoning v7. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What is the impact of $\nabla$-Reasoner's differentiable decoding loop on hallucination rates when evaluated on the TruthfulQA benchmark. 9 claims were extracted from source literature; 0 were independently…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What are the failure modes of frontier language models on abstract mathematical reasoning v7. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does pretraining data quality affect language model reasoning benchmark performance v7. 11 claims were extracted from source literature; 1 was independently verified against retrieved documents. An automated…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: What is the effect of instruction fine-tuning on language model mathematical problem-solving accuracy v7. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents.…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How do language models perform on formal theorem proving and mathematical verification tasks v7. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What architectural innovations improve transformer performance on multi-step logical reasoning v7. 20 claims were extracted from source literature; 1 was independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How does reinforcement learning from human feedback improve language model mathematical reasoning v7. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What are the scaling laws for chain-of-thought reasoning in large language models v7. 13 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the relationship between model scale and emergent reasoning capabilities in transformers v7. 9 claims were extracted from source literature; 4 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 1 peer-reviewed paper addressing the following research question: How does test-time compute scaling improve language model performance on reasoning benchmarks v7. 12 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: To what extent does the incorporation of symbolic rule supervision in neuro-symbolic frameworks reduce hallucination rates in chain-of-thought reasoning tasks compared to standard transformer-based. 0 claims were…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the impact of reward-free alignment methods like DPO versus reward-based RLHF on the robustness of LLMs against adversarial prompts in safety evaluation datasets. 10 claims were extracted from source…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How do neuro-symbolic verification methods compare to end-to-end neural provers in maintaining proof success rates on the MiniF2F benchmark when theorem statements are subjected to syntactic. 0 claims were…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: What is the impact of code-text pretraining on cross-lingual code generation accuracy for low-resource programming languages when evaluated on the HumanEval-X benchmark. 11 claims were extracted from source…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How do neuro-symbolic proof generation methods perform in terms of robustness against adversarial perturbations in theorem statements compared to end-to-end neural approaches on formal mathematics. 10 claims were…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: To what extent do alignment techniques (e.g., reinforcement learning from human feedback) improve model performance on HLE-Verified's high-difficulty questions compared to standard supervised. 10 claims were…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the impact of reverse operation data augmentation on the sample efficiency of language models when fine-tuned on limited MMLU STEM subsets. 11 claims were extracted from source literature; 0 were…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: Does training on reversed-logic math problems enhance out-of-distribution robustness on the MATH benchmark compared to standard synthetic data methods. 0 claims were extracted from source literature; 0 were…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: Comprehensive comparison of frontier large language models on mathematical reasoning, code generation, and scientific knowledge v6. 0 claims were extracted from source literature; 0 were independently verified…
Abstract: This report synthesises findings from 6 peer-reviewed papers addressing the following research question: What are the state-of-the-art large language model results on reasoning benchmarks published recently v6. 10 claims were extracted from source literature; 0 were independently verified against retrieved documents.…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What is the relationship between language model perplexity and downstream reasoning task performance v6. 10 claims were extracted from source literature; 1 was independently verified against retrieved documents.…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: Which frontier language models achieve highest scores on GPQA Diamond, Humanity's Last Exam, and difficult reasoning benchmarks v6. 13 claims were extracted from source literature; 0 were independently verified…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How do language models compare to human experts on professional knowledge and science benchmarks v6. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does extended thinking time affect language model accuracy on competition-level mathematics v6. 14 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…