Papers
Abstract: This report synthesises findings from 1 peer-reviewed paper addressing the following research question: What is the correlation between model parameter scale and accuracy degradation on the Humanity's Last Exam subset for models exceeding 100B parameters. 6 claims were extracted from source literature; 5 were…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How do multimodal frontier models perform on reasoning benchmarks that require integrating visual diagrams with text-based scientific questions compared to text-only architectures. 7 claims were extracted from…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does integrating factual consistency metrics like FactCC into RAG pipelines impact hallucination rates on medical QA benchmarks compared to standard retrieval methods. 9 claims were extracted from source…
Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: How does Qwen3's performance on GPQA Diamond compare to other frontier models when evaluated under chain-of-thought prompting versus standard zero-shot settings. 6 claims were extracted from source literature; 6…
Abstract: This report synthesises findings from 6 peer-reviewed papers addressing the following research question: How does the answer accuracy on NaturalQuestions and TriviaQA correlate with retrieval latency when comparing iterative retrieval strategies like RGAR against single-shot standard RAG. 9 claims were extracted from…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: Do multimodal models pre-trained on Visual Genome exhibit improved robustness against adversarial visual perturbations in visual reasoning tasks compared to models trained on sparse image-text pairs. 6 claims…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: Does scaling PDDL-Instruct to larger models improve multi-step symbolic planning throughput while maintaining accuracy in complex PDDL domains. 11 claims were extracted from source literature; 8 were…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does confidence-calibrated fine-tuning impact pass@N accuracy on the MATH benchmark compared to standard supervised fine-tuning. 8 claims were extracted from source literature; 7 were independently verified…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How does the robustness of RAG systems vary with different retrieval methods (e.g., dense vs. sparse retrieval) when applied to long-tail scientific queries, evaluated through precision-recall curves. 9 claims…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does incorporating factual consistency metrics (e.g., FactCC) into retrieval-augmented generation improve answer accuracy on medical QA benchmarks like MedQA compared to standard RAG approaches. 8 claims were…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: Comprehensive comparison of frontier large language models on mathematical reasoning, code generation, and scientific knowledge v19. 10 claims were extracted from source literature; 9 were independently verified…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does the trade-off between retrieval latency and answer quality scale with different retrieval-augmentation strategies (e.g., RGAR vs. standard RAG) on large-scale question-answering benchmarks. 9 claims were…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What are the state-of-the-art large language model results on reasoning benchmarks published recently v19. 9 claims were extracted from source literature; 9 were independently verified against retrieved…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does synthetic training data improve language model performance on mathematical reasoning benchmarks v19. 7 claims were extracted from source literature; 7 were independently verified against retrieved…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: Which frontier language models achieve the highest scores on GPQA Diamond, Humanity's Last Exam, and difficult reasoning benchmarks v19. 7 claims were extracted from source literature; 7 were independently verified…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How do language models compare to human experts on professional knowledge and science benchmarks v19. 12 claims were extracted from source literature; 12 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does context length affect language model performance on multi-document reasoning and summarization v19. 10 claims were extracted from source literature; 10 were independently verified against retrieved…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: What is the relationship between language model perplexity and downstream reasoning task performance v19. 10 claims were extracted from source literature; 0 were independently verified against retrieved documents.…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What is the effect of model size on language model performance on logical reasoning tasks v19. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: What are the limitations of current language model evaluation benchmarks for measuring reasoning v19. 8 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How do language models handle multi-hop reasoning chains in scientific question answering v19. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What prompting strategies maximize language model accuracy on graduate-level science questions v19. 17 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does extended thinking time affect language model accuracy on competition-level mathematics v19. 8 claims were extracted from source literature; 7 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the comparative performance of open-source language models versus proprietary models on coding benchmarks v19. 0 claims were extracted from source literature; 0 were independently verified against…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What training strategies improve language model generalization to novel mathematical reasoning problems v19. 12 claims were extracted from source literature; 0 were independently verified against retrieved…