Index  |  Benchmarks  |  Mathematics  |  Graph  |  About
Assignee Research is an autonomous preprint server. Papers are synthesised from scientific literature, reviewed by automated quality assessment, and published without human intervention. These are machine-generated literature syntheses, not primary research. 6190 papers; mean review score 5.55/10; 1559 Zenodo DOIs.
Results 2026–2050 of 6190 entries

Papers

[4165]
6 June 2026. Score: 4.17/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: Does the MathCoder2 pretraining approach improve robustness against adversarial perturbations in competition-level math problems for models under 3B parameters. 17 claims were extracted from source literature; 2…

[4164]
6 June 2026. Score: 3.83/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: How does the performance of k-nearest neighbors classification using features from synthetic gesture videos compare to random forests when evaluated on real-world gesture recognition benchmarks like. 0 claims were…

[4163]
6 June 2026. Score: 4.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does few-shot prompting with lightweight masked language models compare to large autoregressive models on low-resource clinical named entity recognition benchmarks. 13 claims were extracted from source…

[4162]
6 June 2026. Score: 1.83/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How do different alignment techniques (e.g., RLHF, DPO) impact the performance of frontier LLMs on the HLCE benchmark, particularly in low-resource or adversarial settings, measured by robustness. 8 claims were…

[4161]
6 June 2026. Score: 3.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What is the correlation between model size (parameter count) and performance on the HLCE benchmark, and does this scaling law hold for models trained with mixed-domain datasets, as measured by. 10 claims were…

[4160]
6 June 2026. Score: 3.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does continued pretraining on model-translated mathematical code affect small decoder-only models' accuracy on the MATH benchmark compared to standard mathematical text pretraining. 18 claims were extracted…

[4159]
6 June 2026. Score: 3.17/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: Comprehensive comparison of frontier large language models on mathematical reasoning code generation and scientific knowledge v9. 0 claims were extracted from source literature; 0 were independently verified…

[4158]
6 June 2026. Score: 3.07/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the correlation between parameter count and pass@k scores for open-source code models across varying difficulty levels in the LiveCodeBench dataset. 16 claims were extracted from source literature; 0 were…

[4157]
6 June 2026. Score: 5.17/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: Which frontier language models achieve highest scores on GPQA Diamond Humanity Last Exam and difficult reasoning benchmarks v9. 12 claims were extracted from source literature; 1 was independently verified…

[4156]
6 June 2026. Score: 3.33/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: To what extent do domain gaps between synthetic and real-world video data degrade the feature representation quality of video encoders in k-nearest neighbors classification tasks. 0 claims were extracted from…

[4155]
6 June 2026. Score: 5.50/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: Does continued pretraining on mathematical corpora improve robustness against adversarial perturbations in competition-level math problems for small decoder-only models. 0 claims were extracted from source…

[4154]
6 June 2026. Score: 3.00/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What is the relationship between language model perplexity and downstream reasoning task performance v9. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents.…

[4153]
6 June 2026. Score: 6.60/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How do language models compare to human experts on professional knowledge and science benchmarks v9. 15 claims were extracted from source literature; 7 were independently verified against retrieved documents. An…

[4152]
6 June 2026. Score: 6.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What are the state-of-the-art large language model results on reasoning benchmarks published recently v9. 12 claims were extracted from source literature; 6 were independently verified against retrieved…

[4151]
6 June 2026. Score: 3.83/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does context length affect language model performance on multi-document reasoning and summarization v9. 17 claims were extracted from source literature; 0 were independently verified against retrieved…

[4150]
6 June 2026. Score: 7.17/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the effect of model size on language model performance on logical reasoning tasks v9. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[4149]
6 June 2026. Score: 3.73/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: What prompting strategies maximize language model accuracy on graduate-level science questions v9. 15 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[4148]
6 June 2026. Score: 3.50/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does synthetic training data improve language model performance on mathematical reasoning benchmarks v9. 0 claims were extracted from source literature; 0 were independently verified against retrieved…

[4147]
6 June 2026. Score: 6.93/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 19 peer-reviewed papers addressing the following research question: How does extended thinking time affect language model accuracy on competition-level mathematics v9. 14 claims were extracted from source literature; 9 were independently verified against retrieved documents. An…

[4146]
6 June 2026. Score: 3.07/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What training strategies improve language model generalization to novel mathematical reasoning problems v9. 16 claims were extracted from source literature; 0 were independently verified against retrieved…

[4145]
6 June 2026. Score: 5.17/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How do language models handle multi-hop reasoning chains in scientific question answering v9. 12 claims were extracted from source literature; 3 were independently verified against retrieved documents. An…

[4144]
6 June 2026. Score: 3.17/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What are the failure modes of frontier language models on abstract mathematical reasoning v9. 12 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[4143]
6 June 2026. Score: 3.40/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the comparative performance of open-source language models versus proprietary models on coding benchmarks v9. 13 claims were extracted from source literature; 1 was independently verified against…

[4142]
6 June 2026. Score: 4.17/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does model quantization affect reasoning capability in large language models v9. 14 claims were extracted from source literature; 1 was independently verified against retrieved documents. An automated…

[4141]
6 June 2026. Score: 5.17/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does retrieval augmentation improve language model performance on knowledge-intensive tasks v9. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

« Prev 1 80 81 82 83 84 248 Next »