Papers
Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: How does quantization affect reasoning capabilities on the HumanEval benchmark for code generation tasks? 10 claims were extracted from source literature; 9 were independently verified against retrieved documents.…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How do different alignment techniques (e.g., instruction tuning, RLHF) affect the reasoning capabilities of VLMs on mixed-modality benchmarks such as MMBench and LLaVA-Bench? 13 claims were extracted from source…
Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: Does the combination of globally-normalised decoding and iterative refinement improve the factual consistency of generated responses on TruthfulQA, as evaluated by human annotations and automated. 9 claims were…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the trade-off between token generation throughput and performance degradation on code synthesis tasks when applying expert skipping strategies to large MoE architectures? 15 claims were extracted from…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: Does expert-level sparsity in Mixture-of-Experts models maintain robustness on multimodal evaluation suites such as ScienceQA or MMMU compared to full-parameter inference? 6 claims were extracted from source…
Abstract: This report synthesises findings from 2 peer-reviewed papers addressing the following research question: How does INT4 quantization affect the robustness of multimodal models on the VQA-v2 dataset under varying noise conditions compared to FP16 precision? 12 claims were extracted from source literature; 10 were…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How do multimodal models like FLIP, GIT, and BLIP compare in terms of accuracy and robustness on visual mathematical reasoning benchmarks such as GSM8K-V and MATH-V? 7 claims were extracted from source literature;…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the accuracy of vision language models on GSM8K-V degrade when mathematical diagrams contain synthetic noise or adversarial perturbations compared to clean images? 0 claims were extracted from source…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does the dynamic suppression of redundant reasoning steps in ARS compare to static pruning methods in terms of inference throughput on GSM8K and MATH benchmarks? 14 claims were extracted from source…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: Can ARS generalize to few-shot code generation tasks like HumanEval, and how does it affect pass@1 scores compared to baseline models without suppression? 14 claims were extracted from source literature; 0 were…
Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: Does multilingual pretraining in ERNIE-Code improve robustness against syntactic variations in low-resource programming languages compared to English-centric models on the HumanEval-X benchmark? 0 claims were…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: Comprehensive comparison of frontier large language models on mathematical reasoning code generation and scientific knowledge v7. 0 claims were extracted from source literature; 0 were independently verified…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How does the performance of Integrative Decoding compare to other self-consistency methods (e.g., Self-Consistency, Majority Voting) on open-ended generation tasks in the TruthfulQA benchmark when. 10 claims were…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: Does the effectiveness of Integrative Decoding's differentiable decoding loop scale with the number of sampling iterations when evaluated on multiple-choice and open-ended generation tasks in the. 16 claims were…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What are the state-of-the-art large language model results on reasoning benchmarks published recently v7. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents.…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What is the relationship between language model perplexity and downstream reasoning task performance v7. 13 claims were extracted from source literature; 0 were independently verified against retrieved documents.…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the effect of model size on language model performance on logical reasoning tasks v7. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How do language models compare to human experts on professional knowledge and science benchmarks v7. 15 claims were extracted from source literature; 1 was independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does synthetic training data improve language model performance on mathematical reasoning benchmarks v7. 20 claims were extracted from source literature; 0 were independently verified against retrieved…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What are the limitations of current language model evaluation benchmarks for measuring reasoning v7. 15 claims were extracted from source literature; 1 was independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 3 peer-reviewed papers addressing the following research question: How does extended thinking time affect language model accuracy on competition-level mathematics v7. 12 claims were extracted from source literature; 12 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What prompting strategies maximize language model accuracy on graduate-level science questions v7. 10 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How do language models handle multi-hop reasoning chains in scientific question answering v7. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How does model quantization affect reasoning capability in large language models v7. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What training strategies improve language model generalization to novel mathematical reasoning problems v7. 9 claims were extracted from source literature; 1 was independently verified against retrieved…