Papers
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does pretraining on procedural data influence alignment metrics like toxicity and helpfulness in models evaluated on benchmarks like TruthfulQA and HELM? 7 claims were extracted from source literature; 1 was…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What are the failure modes of frontier language models on abstract mathematical reasoning v17. 16 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What techniques enable language models to solve competition-level software engineering problems v17. 16 claims were extracted from source literature; 2 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does pretraining data quality affect language model reasoning benchmark performance v17. 15 claims were extracted from source literature; 3 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the effect of instruction fine-tuning on language model mathematical problem-solving accuracy v17. 0 claims were extracted from source literature; 0 were independently verified against retrieved…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does reinforcement learning from human feedback improve language model mathematical reasoning v17. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 20 peer-reviewed papers addressing the following research question: How do language models perform on formal theorem proving and mathematical verification tasks v17. 15 claims were extracted from source literature; 9 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What are the scaling laws for chain-of-thought reasoning in large language models v17. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the relationship between model scale and emergent reasoning capabilities in transformers v17. 13 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does test-time compute scaling improve language model performance on reasoning benchmarks v17. 13 claims were extracted from source literature; 1 was independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How do sparse mixture-of-experts models compare to dense transformers on mathematical reasoning v17. 18 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What are the benchmark performance scores of EXAONE-3.5 on reasoning, mathematics, coding, and language understanding tasks? 0 claims were extracted from source literature; 0 were independently verified against…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: What are the benchmark performance scores of DeepSeek-7B on reasoning, mathematics, coding, and language understanding tasks? 0 claims were extracted from source literature; 0 were independently verified against…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What are the benchmark performance scores of DeepSeek-14B on reasoning, mathematics, coding, and language understanding tasks? 20 claims were extracted from source literature; 2 were independently verified against…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What are the benchmark performance scores of Claude-3-Haiku on reasoning, mathematics, coding, and language understanding tasks? 0 claims were extracted from source literature; 0 were independently verified against…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What are the benchmark performance scores of DeepSeek-VL on reasoning, mathematics, coding, and language understanding tasks? 12 claims were extracted from source literature; 2 were independently verified against…
Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: What are the state-of-the-art large language model results on reasoning benchmarks published recently v16. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents.…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How do language models compare to human experts on professional knowledge and science benchmarks v16. 14 claims were extracted from source literature; 1 was independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: What is the relationship between language model perplexity and downstream reasoning task performance v16. 8 claims were extracted from source literature; 1 was independently verified against retrieved documents.…
Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: How do multimodal language models perform on visual mathematical and scientific reasoning v16. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: What techniques enable language models to solve competition-level software engineering problems v16. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What are the scaling laws for chain-of-thought reasoning in large language models v16. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What architectural innovations improve transformer performance on multi-step logical reasoning v16. 11 claims were extracted from source literature; 9 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the relationship between model scale and emergent reasoning capabilities in transformers v16. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How do sparse mixture-of-experts models compare to dense transformers on mathematical reasoning v16. 11 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…