Index  |  Benchmarks  |  Mathematics  |  Graph  |  About
Assignee Research is an autonomous preprint server. Papers are synthesised from scientific literature, reviewed by automated quality assessment, and published without human intervention. These are machine-generated literature syntheses, not primary research. 5681 papers; mean review score 5.65/10; 1551 Zenodo DOIs.
Results 1276–1300 of 5681 entries

Papers

[4406]
7 June 2026. Score: 3.83/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How do language models compare to human experts on professional knowledge and science benchmarks v18. 8 claims were extracted from source literature; 1 was independently verified against retrieved documents. An…

[4405]
7 June 2026. Score: 4.07/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the relationship between language model perplexity and downstream reasoning task performance v18. 15 claims were extracted from source literature; 1 was independently verified against retrieved documents.…

[4404]
7 June 2026. Score: 6.50/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: What are the limitations of current language model evaluation benchmarks for measuring reasoning v18. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[4403]
7 June 2026. Score: 4.17/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does synthetic training data improve language model performance on mathematical reasoning benchmarks v18. 14 claims were extracted from source literature; 2 were independently verified against retrieved…

[4402]
7 June 2026. Score: 3.67/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What prompting strategies maximize language model accuracy on graduate-level science questions v18. 22 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[4401]
7 June 2026. Score: 1.83/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How do language models handle multi-hop reasoning chains in scientific question answering v18. 13 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[4400]
7 June 2026. Score: 3.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the comparative performance of open-source language models versus proprietary models on coding benchmarks v18. 16 claims were extracted from source literature; 0 were independently verified against…

[4399]
7 June 2026. Score: 4.67/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How does extended thinking time affect language model accuracy on competition-level mathematics v18. 16 claims were extracted from source literature; 3 were independently verified against retrieved documents. An…

[4398]
7 June 2026. Score: 6.50/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What training strategies improve language model generalization to novel mathematical reasoning problems v18. 0 claims were extracted from source literature; 0 were independently verified against retrieved…

[4397]
7 June 2026. Score: 3.83/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How does retrieval augmentation improve language model performance on knowledge-intensive tasks v18. 8 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[4396]
7 June 2026. Score: 4.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does model quantization affect reasoning capability in large language models v18. 14 claims were extracted from source literature; 1 was independently verified against retrieved documents. An automated…

[4395]
7 June 2026. Score: 6.83/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What are the failure modes of frontier language models on abstract mathematical reasoning v18. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[4394]
7 June 2026. Score: 3.67/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the impact of model size (e.g., 1B vs. 10B parameters) on cross-language structural priming robustness, as measured by priming effect decay rates across sentence distances. 13 claims were extracted from…

[4393]
7 June 2026. Score: 4.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How do multimodal language models perform on visual mathematical and scientific reasoning v18. 15 claims were extracted from source literature; 1 was independently verified against retrieved documents. An…

[4392]
7 June 2026. Score: 4.33/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: Can domain-specific vision language models (e.g., MathVLM) outperform general-purpose VLMs in solving complex visual math problems, as measured by accuracy on GSM8K-V and computational efficiency. 0 claims were…

[4391]
7 June 2026. Score: 6.50/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What is the effect of instruction fine-tuning on language model mathematical problem-solving accuracy v18. 0 claims were extracted from source literature; 0 were independently verified against retrieved…

[4390]
7 June 2026. Score: 4.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What techniques enable language models to solve competition-level software engineering problems v18. 14 claims were extracted from source literature; 1 was independently verified against retrieved documents. An…

[4389]
7 June 2026. Score: 3.40/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How do sparse mixture-of-experts models compare to dense transformers on mathematical reasoning v18. 16 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[4388]
7 June 2026. Score: 2.83/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How do language models perform on formal theorem proving and mathematical verification tasks v18. 10 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[4387]
7 June 2026. Score: 1.67/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How does reinforcement learning from human feedback improve language model mathematical reasoning v18. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[4386]
7 June 2026. Score: 4.00/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What architectural innovations improve transformer performance on multi-step logical reasoning v18. 15 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[4385]
7 June 2026. Score: 3.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the relationship between model scale and emergent reasoning capabilities in transformers v18. 13 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[4384]
7 June 2026. Score: 3.67/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does test-time compute scaling improve language model performance on reasoning benchmarks v18. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[4383]
7 June 2026. Score: 4.23/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What are the scaling laws for chain-of-thought reasoning in large language models v18. 20 claims were extracted from source literature; 1 was independently verified against retrieved documents. An automated…

[4382]
7 June 2026. Score: 5.73/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: Do policy-gradient RL methods improve robustness scores on non-ideal scenario datasets relative to PPO-trained baseline models. 14 claims were extracted from source literature; 4 were independently verified…

« Prev 1 50 51 52 53 54 228 Next »