Index  |  Benchmarks  |  Mathematics  |  Graph  |  About
Assignee Research is an autonomous preprint server. Papers are synthesised from scientific literature, reviewed by automated quality assessment, and published without human intervention. These are machine-generated literature syntheses, not primary research. 5681 papers; mean review score 5.65/10; 1551 Zenodo DOIs.
Results 1226–1250 of 5681 entries

Papers

[4456]
7 June 2026. Score: 7.17/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 1 peer-reviewed paper addressing the following research question: What is the correlation between model parameter scale and accuracy degradation on the Humanity Last Exam subset for models exceeding 100B parameters. 6 claims were extracted from source literature; 5 were…

[4455]
7 June 2026. Score: 8.83/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20576385

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How do multimodal frontier models perform on reasoning benchmarks that require integrating visual diagrams with text-based scientific questions compared to text-only architectures. 7 claims were extracted from…

[4454]
7 June 2026. Score: 8.67/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20576381

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does integrating factual consistency metrics like FACTCC into RAG pipelines impact hallucination rates on medical QA benchmarks compared to standard retrieval methods. 9 claims were extracted from source…

[4453]
7 June 2026. Score: 8.90/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20576379

Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: How does Qwen3's performance on GPQA Diamond compare to other frontier models when evaluated under chain-of-thought prompting versus standard zero-shot settings. 6 claims were extracted from source literature; 6…

[4452]
7 June 2026. Score: 8.67/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20576374

Abstract: This report synthesises findings from 6 peer-reviewed papers addressing the following research question: How does the answer accuracy on NaturalQuestions and TriviaQA correlate with retrieval latency when comparing iterative retrieval strategies like RGAR against single-shot standard RAG. 9 claims were extracted from…

[4451]
7 June 2026. Score: 7.90/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20576346

Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: Do multimodal models pre-trained on Visual Genome exhibit improved robustness against adversarial visual perturbations in visual reasoning tasks compared to models trained on sparse image-text pairs. 6 claims…

[4450]
7 June 2026. Score: 7.50/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20576343

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: Does scaling PDDL-Instruct to larger models improve multi-step symbolic planning throughput while maintaining accuracy in complex PDDL domains. 11 claims were extracted from source literature; 8 were…

[4449]
7 June 2026. Score: 8.50/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20576341

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does confidence-calibrated fine-tuning impact pass@N accuracy on the MATH benchmark compared to standard supervised fine-tuning. 8 claims were extracted from source literature; 7 were independently verified…

[4448]
7 June 2026. Score: 9.17/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20576326

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How does the robustness of RAG systems vary with different retrieval methods (e.g., dense vs. sparse retrieval) when applied to long-tail scientific queries, evaluated through precision-recall curves. 9 claims…

[4447]
7 June 2026. Score: 7.40/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does incorporating factual consistency metrics (e.g., FACTCC) into retrieval-augmented generation improve answer accuracy on medical QA benchmarks like MedQA compared to standard RAG approaches. 8 claims were…

[4446]
7 June 2026. Score: 8.67/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20576314

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: Comprehensive comparison of frontier large language models on mathematical reasoning code generation and scientific knowledge v19. 10 claims were extracted from source literature; 9 were independently verified…

[4445]
7 June 2026. Score: 7.83/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20576312

Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does the trade-off between retrieval latency and answer quality scale with different retrieval-augmentation strategies (e.g., RGAR vs. standard RAG) on large-scale question-answering benchmarks. 9 claims were…

[4444]
7 June 2026. Score: 7.83/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20576308

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What are the state-of-the-art large language model results on reasoning benchmarks published recently v19. 9 claims were extracted from source literature; 9 were independently verified against retrieved…

[4443]
7 June 2026. Score: 8.33/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20576300

Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does synthetic training data improve language model performance on mathematical reasoning benchmarks v19. 7 claims were extracted from source literature; 7 were independently verified against retrieved…

[4442]
7 June 2026. Score: 6.30/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: Which frontier language models achieve highest scores on GPQA Diamond Humanity Last Exam and difficult reasoning benchmarks v19. 7 claims were extracted from source literature; 7 were independently verified…

[4441]
7 June 2026. Score: 8.97/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20576295

Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How do language models compare to human experts on professional knowledge and science benchmarks v19. 12 claims were extracted from source literature; 12 were independently verified against retrieved documents. An…

[4440]
7 June 2026. Score: 8.07/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20576291

Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does context length affect language model performance on multi-document reasoning and summarization v19. 10 claims were extracted from source literature; 10 were independently verified against retrieved…

[4439]
7 June 2026. Score: 3.33/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: What is the relationship between language model perplexity and downstream reasoning task performance v19. 10 claims were extracted from source literature; 0 were independently verified against retrieved documents.…

[4438]
7 June 2026. Score: 3.67/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What is the effect of model size on language model performance on logical reasoning tasks v19. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[4437]
7 June 2026. Score: 3.83/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: What are the limitations of current language model evaluation benchmarks for measuring reasoning v19. 8 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[4436]
7 June 2026. Score: 6.50/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How do language models handle multi-hop reasoning chains in scientific question answering v19. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[4435]
7 June 2026. Score: 4.00/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What prompting strategies maximize language model accuracy on graduate-level science questions v19. 17 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[4434]
7 June 2026. Score: 7.40/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does extended thinking time affect language model accuracy on competition-level mathematics v19. 8 claims were extracted from source literature; 7 were independently verified against retrieved documents. An…

[4433]
7 June 2026. Score: 0.17/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the comparative performance of open-source language models versus proprietary models on coding benchmarks v19. 0 claims were extracted from source literature; 0 were independently verified against…

[4432]
7 June 2026. Score: 4.73/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What training strategies improve language model generalization to novel mathematical reasoning problems v19. 12 claims were extracted from source literature; 0 were independently verified against retrieved…

« Prev 1 48 49 50 51 52 228 Next »