Papers
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How do language models handle multi-hop reasoning chains in scientific question answering. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How do multimodal language models perform on visual mathematical and scientific reasoning. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does retrieval augmentation improve language model performance on knowledge-intensive tasks. 13 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What are the failure modes of frontier language models on abstract mathematical reasoning. 10 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: To what extent does scaling the size of the concept graph in MathScale improve the model's accuracy on the MATH dataset compared to baselines without structured knowledge extraction. 0 claims were extracted from…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does Adaptive Graph-Guided Retrieval in Kodezi Chronos-1 compare to traditional retrieval-augmented generation (RAG) approaches in terms of debugging accuracy and throughput on multi-file. 13 claims were…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the effect of instruction fine-tuning on language model mathematical problem-solving accuracy. 19 claims were extracted from source literature; 0 were independently verified against retrieved…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does the psychometric-based evaluation method compare to traditional proof pass rate metrics in terms of accuracy and computational efficiency when applied to large-scale theorem proving. 9 claims were…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does pretraining data quality affect language model reasoning benchmark performance. 10 claims were extracted from source literature; 2 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What techniques enable language models to solve competition-level software engineering problems. 13 claims were extracted from source literature; 2 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What architectural innovations improve transformer performance on multi-step logical reasoning. 11 claims were extracted from source literature; 1 was independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How do language models perform on formal theorem proving and mathematical verification tasks. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How do sparse mixture-of-experts models compare to dense transformers on mathematical reasoning. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 5 peer-reviewed papers addressing the following research question: What are the scaling laws for chain-of-thought reasoning in large language models. 10 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does test-time compute scaling improve language model performance on reasoning benchmarks. 11 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does the integration of synthetic chart data augmentations in instruction-tuned datasets like MMC-Instruction affect LMM generalization performance across ChartQA and FigureQA benchmarks. 0 claims were…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What is the impact of FlashSpeech's zero-shot speaker adaptation on word error rate degradation when evaluated on out-of-domain emotional speech datasets like CREMA-D. 0 claims were extracted from source…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: Can the inference speedup demonstrated by FlashSpeech be replicated in large language models for code generation tasks without compromising HumanEval pass@1 scores. 10 claims were extracted from source…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How does the scaling of instruction-tuned datasets (e.g., MMC-Instruction) beyond 1M instances influence the generalization of LMMs across different chart types, as measured by accuracy on benchmarks. 14 claims…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does the top-1 accuracy of training-free k-NN classification using synthetic video features compare to fine-tuned baselines on the NVGesture dataset when evaluated across different large. 0 claims were…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the impact of increasing the scale of MMC-Instruction (e.g., 1M vs. 600k instances) on the robustness of LMMs to distributional shifts in chart types, as measured by accuracy on unseen. 10 claims were…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the impact of visual encoder resolution on the accuracy of multimodal models when interpreting complex diagrams in code generation tasks. 7 claims were extracted from source literature; 1 was…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How does chain-of-thought prompting affect the robustness of large multimodal models against adversarial perturbations in diagram-based reasoning benchmarks. 0 claims were extracted from source literature; 0 were…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: Can training-free reasoning compression methods like ARS maintain performance on code generation tasks evaluated by HumanEval while reducing token usage. 0 claims were extracted from source literature; 0 were…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does fine-tuning MMC-Instruction-trained LMMs on domain-specific chart datasets improve their performance on benchmarks like ChartQA, as compared to zero-shot capabilities. 12 claims were extracted from…