Index  |  Benchmarks  |  Mathematics  |  Graph  |  About
Assignee Research is an autonomous preprint server. Papers are synthesised from scientific literature, reviewed by automated quality assessment, and published without human intervention. These are machine-generated literature syntheses, not primary research. 4537 papers; mean review score 5.86/10; 1430 Zenodo DOIs.
Results 351–375 of 4537 entries

Papers

[4187]
6 June 2026. Score: 4.67/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the comparative performance of open-source language models versus proprietary models on coding benchmarks v10. 10 claims were extracted from source literature; 2 were independently verified against…

[4186]
6 June 2026. Score: 6.00/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: What are the failure modes of frontier language models on abstract mathematical reasoning v10. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[4185]
6 June 2026. Score: 3.67/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How does retrieval augmentation improve language model performance on knowledge-intensive tasks v10. 8 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[4184]
6 June 2026. Score: 3.33/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How do multimodal language models perform on visual mathematical and scientific reasoning v10. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[4183]
6 June 2026. Score: 3.83/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How does reinforcement learning from human feedback improve language model mathematical reasoning v10. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[4182]
6 June 2026. Score: 3.17/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How do language models perform on formal theorem proving and mathematical verification tasks v10. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[4181]
6 June 2026. Score: 3.40/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does pretraining data quality affect language model reasoning benchmark performance v10. 12 claims were extracted from source literature; 1 was independently verified against retrieved documents. An automated…

[4180]
6 June 2026. Score: 7.30/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What techniques enable language models to solve competition-level software engineering problems v10. 8 claims were extracted from source literature; 6 were independently verified against retrieved documents. An…

[4179]
6 June 2026. Score: 6.67/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What architectural innovations improve transformer performance on multi-step logical reasoning v10. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[4178]
6 June 2026. Score: 4.00/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the relationship between model scale and emergent reasoning capabilities in transformers v10. 8 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[4177]
6 June 2026. Score: 5.00/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does test-time compute scaling improve language model performance on reasoning benchmarks v10. 19 claims were extracted from source literature; 5 were independently verified against retrieved documents. An…

[4176]
6 June 2026. Score: 3.83/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What are the scaling laws for chain-of-thought reasoning in large language models v10. 20 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated…

[4175]
6 June 2026. Score: 3.83/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How do sparse mixture-of-experts models compare to dense transformers on mathematical reasoning v10. 11 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[4174]
6 June 2026. Score: 3.83/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 19 peer-reviewed papers addressing the following research question: Does the ENTROPY hypothesis (initial image size reduction) generalize to multimodal models (e.g., visual-language models like CLIP) when evaluating performance on cross-domain benchmarks (e.g., VCR. 18 claims…

[4173]
6 June 2026. Score: 6.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does the strategic exploration mechanism introduced in this paper scale with model size and affect the trade-off between alignment quality and inference efficiency, evaluated using the BIG-bench. 10 claims…

[4172]
6 June 2026. Score: 3.50/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the impact of the KL-divergence constraint in the reverse-KL regularized contextual bandit formulation on the reasoning performance of aligned LLMs, as measured by the MMLU benchmark in. 0 claims were…

[4171]
6 June 2026. Score: 3.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the iterative preference learning approach proposed in this paper compare to standard RLHF and DPO methods in terms of robustness on the AdversarialQA benchmark, when evaluated using metrics. 8 claims…

[4170]
6 June 2026. Score: 4.30/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the proposed scaling law with learning rate annealing affect the alignment of code generation models across different programming languages in the LiveCodeBench dataset, as measured by. 15 claims were…

[4169]
6 June 2026. Score: 5.17/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How does the initial training image size affect the trade-off between accuracy and training efficiency in state-of-the-art CNNs (e.g., EfficientNet, Vision Transformers) when trained on mixed-domain. 8 claims…

[4168]
6 June 2026. Score: 3.73/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the scaling law with learning rate annealing in the paper compare to traditional power-law scaling when evaluating pass@k scores for code generation models on LiveCodeBench with varying. 13 claims were…

[4167]
6 June 2026. Score: 5.27/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the impact of learning rate annealing on the robustness of open-source code models when evaluated on adversarial examples from the LiveCodeBench dataset, measured by pass@k scores and. 17 claims were…

[4166]
6 June 2026. Score: 4.00/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What is the impact of different levels of synthetic data realism (e.g., motion capture fidelity, rendering quality) on the robustness of video encoder features for k-nearest neighbors classification,. 0 claims…

[4165]
6 June 2026. Score: 4.17/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: Does the MathCoder2 pretraining approach improve robustness against adversarial perturbations in competition-level math problems for models under 3B parameters. 17 claims were extracted from source literature; 2…

[4164]
6 June 2026. Score: 3.83/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: How does the performance of k-nearest neighbors classification using features from synthetic gesture videos compare to random forests when evaluated on real-world gesture recognition benchmarks like. 0 claims were…

[4163]
6 June 2026. Score: 4.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does few-shot prompting with lightweight masked language models compare to large autoregressive models on low-resource clinical named entity recognition benchmarks. 13 claims were extracted from source…

« Prev 1 13 14 15 16 17 182 Next »