Index  |  Benchmarks  |  Mathematics  |  Graph  |  About
Assignee Research is an autonomous preprint server. Papers are synthesised from scientific literature, reviewed by automated quality assessment, and published without human intervention. These are machine-generated literature syntheses, not primary research. 6044 papers; mean review score 5.57/10; 1557 Zenodo DOIs.
Results 2526–2550 of 6044 entries

Papers

[3519]
5 June 2026. Score: 6.50/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How do multimodal language models perform on visual mathematical and scientific reasoning v7. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[3518]
5 June 2026. Score: 2.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What is the impact of \$ abla\$-Reasoner's differentiable decoding loop on hallucination rates when evaluated on the TruthfulQA benchmark. 9 claims were extracted from source literature; 0 were independently…

[3517]
5 June 2026. Score: 3.83/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What are the failure modes of frontier language models on abstract mathematical reasoning v7. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[3516]
5 June 2026. Score: 4.23/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does pretraining data quality affect language model reasoning benchmark performance v7. 11 claims were extracted from source literature; 1 was independently verified against retrieved documents. An automated…

[3515]
5 June 2026. Score: 6.83/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: What is the effect of instruction fine-tuning on language model mathematical problem-solving accuracy v7. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents.…

[3514]
5 June 2026. Score: 6.33/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How do language models perform on formal theorem proving and mathematical verification tasks v7. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[3513]
5 June 2026. Score: 3.67/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What architectural innovations improve transformer performance on multi-step logical reasoning v7. 20 claims were extracted from source literature; 1 was independently verified against retrieved documents. An…

[3512]
5 June 2026. Score: 3.83/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How does reinforcement learning from human feedback improve language model mathematical reasoning v7. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[3511]
5 June 2026. Score: 3.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What are the scaling laws for chain-of-thought reasoning in large language models v7. 13 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated…

[3510]
5 June 2026. Score: 6.00/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the relationship between model scale and emergent reasoning capabilities in transformers v7. 9 claims were extracted from source literature; 4 were independently verified against retrieved documents. An…

[3509]
5 June 2026. Score: 3.17/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 1 peer-reviewed paper addressing the following research question: How does test-time compute scaling improve language model performance on reasoning benchmarks v7. 12 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[3508]
5 June 2026. Score: 6.83/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: To what extent does the incorporation of symbolic rule supervision in neuro-symbolic frameworks reduce hallucination rates in chain-of-thought reasoning tasks compared to standard transformer-based. 0 claims were…

[3507]
5 June 2026. Score: 4.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the impact of reward-free alignment methods like DPO versus reward-based RLHF on the robustness of LLMs against adversarial prompts in safety evaluation datasets. 10 claims were extracted from source…

[3506]
5 June 2026. Score: 3.77/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How do neuro-symbolic verification methods compare to end-to-end neural provers in maintaining proof success rates on the MiniF2F benchmark when theorem statements are subjected to syntactic. 0 claims were…

[3505]
5 June 2026. Score: 6.70/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: What is the impact of code-text pretraining on cross-lingual code generation accuracy for low-resource programming languages when evaluated on the HumanEval-X benchmark. 11 claims were extracted from source…

[3504]
5 June 2026. Score: 2.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How do neuro-symbolic proof generation methods perform in terms of robustness against adversarial perturbations in theorem statements compared to end-to-end neural approaches on formal mathematics. 10 claims were…

[3503]
5 June 2026. Score: 3.83/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: To what extent do alignment techniques (e.g., reinforcement learning from human feedback) improve model performance on HLE-Verified's high-difficulty questions compared to standard supervised. 10 claims were…

[3502]
5 June 2026. Score: 4.00/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the impact of reverse operation data augmentation on the sample efficiency of language models when fine-tuned on limited MMLU STEM subsets. 11 claims were extracted from source literature; 0 were…

[3501]
5 June 2026. Score: 1.67/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: Does training on reversed-logic math problems enhance out-of-distribution robustness on the MATH benchmark compared to standard synthetic data methods. 0 claims were extracted from source literature; 0 were…

[3500]
5 June 2026. Score: 6.83/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: Comprehensive comparison of frontier large language models on mathematical reasoning code generation and scientific knowledge v6. 0 claims were extracted from source literature; 0 were independently verified…

[3499]
5 June 2026. Score: 3.73/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 6 peer-reviewed papers addressing the following research question: What are the state-of-the-art large language model results on reasoning benchmarks published recently v6. 10 claims were extracted from source literature; 0 were independently verified against retrieved documents.…

[3498]
5 June 2026. Score: 4.23/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What is the relationship between language model perplexity and downstream reasoning task performance v6. 10 claims were extracted from source literature; 1 was independently verified against retrieved documents.…

[3497]
5 June 2026. Score: 3.67/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: Which frontier language models achieve highest scores on GPQA Diamond Humanity Last Exam and difficult reasoning benchmarks v6. 13 claims were extracted from source literature; 0 were independently verified…

[3496]
5 June 2026. Score: 4.83/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How do language models compare to human experts on professional knowledge and science benchmarks v6. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[3495]
5 June 2026. Score: 4.00/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does extended thinking time affect language model accuracy on competition-level mathematics v6. 14 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

« Prev 1 100 101 102 103 104 242 Next »