Papers
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the impact of incorporating evolutionary search strategies on the reasoning accuracy of LLMs when evaluated on competition-level software engineering datasets like CodeContests. 5 claims were extracted…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How do large language models compare to Grammar Guided Genetic Programming in solving code generation tasks involving complex, overlapping data structures on the HumanEval benchmark. 18 claims were extracted from…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the impact of synthetic problem quality on the inference efficiency and convergence speed of reinforcement learning for code generation tasks. 5 claims were extracted from source literature; 0 were…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does the reasoning performance of long chain-of-thought (Long CoT) LLMs scale with model size and compute budget, as measured by accuracy on benchmark datasets like GSM8K or MATH. 8 claims were extracted from…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the effect of instruction fine-tuning on language model mathematical problem-solving accuracy. 15 claims were extracted from source literature; 0 were independently verified against retrieved…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does pretraining data quality affect language model reasoning benchmark performance. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What techniques enable language models to solve competition-level software engineering problems. 14 claims were extracted from source literature; 4 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How do language models perform on formal theorem proving and mathematical verification tasks. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How does reinforcement learning from human feedback improve language model mathematical reasoning. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the relationship between model scale and emergent reasoning capabilities in transformers. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: What are the scaling laws for chain-of-thought reasoning in large language models. 15 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the Parallel Context Windows method impact accuracy on the Needle In A Haystack benchmark compared to sliding window approaches for context lengths exceeding 100k tokens. 0 claims were extracted from…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: To what extent does the Tree of Reviews framework improve robustness against noisy retrieval contexts compared to iterative Chain of Thought methods on the 2WikiMultiHopQA dataset. 9 claims were extracted from…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How does context length affect language model performance on multi-document reasoning and summarization. 19 claims were extracted from source literature; 4 were independently verified against retrieved documents.…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the Tree of Reviews framework compare to standard Chain of Thought baselines in terms of answer accuracy and retrieval precision on the HotpotQA and 2WikiMultiHopQA benchmarks. 10 claims were extracted…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: What is the relationship between language model perplexity and downstream reasoning task performance. 12 claims were extracted from source literature; 1 was independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the effect of model size on language model performance on logical reasoning tasks. 10 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How do language models compare to human experts on professional knowledge and science benchmarks. 13 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does synthetic training data improve language model performance on mathematical reasoning benchmarks. 9 claims were extracted from source literature; 1 was independently verified against retrieved documents.…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What are the limitations of current language model evaluation benchmarks for measuring reasoning. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What prompting strategies maximize language model accuracy on graduate-level science questions. 16 claims were extracted from source literature; 1 was independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How do language models handle multi-hop reasoning chains in scientific question answering. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What is the comparative performance of open-source language models versus proprietary models on coding benchmarks. 0 claims were extracted from source literature; 0 were independently verified against retrieved…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does model quantization affect reasoning capability in large language models. 20 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the impact of dynamic depth allocation in DS-MoE on zero-shot code generation performance in benchmarks like HumanEval or MBPP compared to fixed-depth transformers. 0 claims were extracted from source…