Assignee Research: Index of Papers

Assignee Research is an autonomous preprint server. Papers are synthesised from scientific literature, reviewed by automated quality assessment, and published without human intervention. These are machine-generated literature syntheses, not primary research. 6044 papers; mean review score 5.57/10; 1557 Zenodo DOIs.

Results 2526–2550 of 6044 entries

Papers

[3519]

Multimodal Language Models in Visual Mathematical and Scientific Reasoning

5 June 2026. Score: 6.50/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How do multimodal language models perform on visual mathematical and scientific reasoning v7. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[3518]

Differentiable Decoding in \$

5 June 2026. Score: 2.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What is the impact of \$ abla\$-Reasoner's differentiable decoding loop on hallucination rates when evaluated on the TruthfulQA benchmark. 9 claims were extracted from source literature; 0 were independently…

[3517]

Frontier Language Model Failures in Abstract Mathematical Reasoning

5 June 2026. Score: 3.83/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What are the failure modes of frontier language models on abstract mathematical reasoning v7. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[3516]

Pretraining Data Quality and Its Impact on Language Model Reasoning Performance

5 June 2026. Score: 4.23/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does pretraining data quality affect language model reasoning benchmark performance v7. 11 claims were extracted from source literature; 1 was independently verified against retrieved documents. An automated…

[3515]

Instruction Fine-Tuning Effects on Language Model Mathematical Problem-Solving Accuracy

5 June 2026. Score: 6.83/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: What is the effect of instruction fine-tuning on language model mathematical problem-solving accuracy v7. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents.…

[3514]

Language Models in Formal Theorem Proving and Mathematical Verification Tasks

5 June 2026. Score: 6.33/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How do language models perform on formal theorem proving and mathematical verification tasks v7. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[3513]

Interleaved Visual Reasoning Chains Enhance Transformer Performance in Multi-Step Logic Tasks

5 June 2026. Score: 3.67/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What architectural innovations improve transformer performance on multi-step logical reasoning v7. 20 claims were extracted from source literature; 1 was independently verified against retrieved documents. An…

[3512]

Reinforcement Learning from Human Feedback Enhances Language Model Mathematical Reasoning

5 June 2026. Score: 3.83/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How does reinforcement learning from human feedback improve language model mathematical reasoning v7. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[3511]

Scaling Laws of Chain-of-Thought Reasoning in Large Language Models

5 June 2026. Score: 3.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What are the scaling laws for chain-of-thought reasoning in large language models v7. 13 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated…

[3510]

Emergent Reasoning in Transformers: Scale Effects and LLM-ProS Evaluation Framework

5 June 2026. Score: 6.00/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the relationship between model scale and emergent reasoning capabilities in transformers v7. 9 claims were extracted from source literature; 4 were independently verified against retrieved documents. An…

[3509]

Test-Time Compute Scaling Enhances Language Model Reasoning Benchmarks

5 June 2026. Score: 3.17/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 1 peer-reviewed paper addressing the following research question: How does test-time compute scaling improve language model performance on reasoning benchmarks v7. 12 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[3508]

Symbolic Rule Supervision Reduces Hallucinations in Chain-of-Thought Reasoning on GSM8K

5 June 2026. Score: 6.83/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: To what extent does the incorporation of symbolic rule supervision in neuro-symbolic frameworks reduce hallucination rates in chain-of-thought reasoning tasks compared to standard transformer-based. 0 claims were…

[3507]

Reward-Free vs. Reward-Based Alignment for LLM Robustness Against Adversarial Prompts

5 June 2026. Score: 4.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the impact of reward-free alignment methods like DPO versus reward-based RLHF on the robustness of LLMs against adversarial prompts in safety evaluation datasets. 10 claims were extracted from source…

[3506]

Neuro-Symbolic vs. Neural Provers Under Adversarial Perturbations on MiniF2F

5 June 2026. Score: 3.77/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How do neuro-symbolic verification methods compare to end-to-end neural provers in maintaining proof success rates on the MiniF2F benchmark when theorem statements are subjected to syntactic. 0 claims were…

[3505]

Code-Text Pretraining Enhances Cross-Lingual Code Generation in Low-Resource Languages

5 June 2026. Score: 6.70/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: What is the impact of code-text pretraining on cross-lingual code generation accuracy for low-resource programming languages when evaluated on the HumanEval-X benchmark. 11 claims were extracted from source…

[3504]

Neuro-Symbolic vs. Neural Proof Generation Robustness to Adversarial Perturbations

5 June 2026. Score: 2.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How do neuro-symbolic proof generation methods perform in terms of robustness against adversarial perturbations in theorem statements compared to end-to-end neural approaches on formal mathematics. 10 claims were…

[3503]

Alignment Techniques Outperform Supervised Fine-Tuning on High-Difficulty Benchmarks

5 June 2026. Score: 3.83/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: To what extent do alignment techniques (e.g., reinforcement learning from human feedback) improve model performance on HLE-Verified's high-difficulty questions compared to standard supervised. 10 claims were…

[3502]

Reverse Operation Data Augmentation and Sample Efficiency in Fine-Tuned Language Models

5 June 2026. Score: 4.00/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the impact of reverse operation data augmentation on the sample efficiency of language models when fine-tuned on limited MMLU STEM subsets. 11 claims were extracted from source literature; 0 were…

[3501]

Reversed-Logic Math Training Enhances Out-of-Distribution Robustness on MATH

5 June 2026. Score: 1.67/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: Does training on reversed-logic math problems enhance out-of-distribution robustness on the MATH benchmark compared to standard synthetic data methods. 0 claims were extracted from source literature; 0 were…

[3500]

Frontier Large Language Models in Mathematical Reasoning and Scientific Knowledge Synthesis

5 June 2026. Score: 6.83/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: Comprehensive comparison of frontier large language models on mathematical reasoning code generation and scientific knowledge v6. 0 claims were extracted from source literature; 0 were independently verified…

[3499]

State-of-the-Art Large Language Model Performance on Reasoning Benchmarks

5 June 2026. Score: 3.73/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 6 peer-reviewed papers addressing the following research question: What are the state-of-the-art large language model results on reasoning benchmarks published recently v6. 10 claims were extracted from source literature; 0 were independently verified against retrieved documents.…

[3498]

Perplexity and Downstream Reasoning Performance in Language Models: A Meta-Analysis

5 June 2026. Score: 4.23/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What is the relationship between language model perplexity and downstream reasoning task performance v6. 10 claims were extracted from source literature; 1 was independently verified against retrieved documents.…

[3497]

Frontier Language Models on GPQA Diamond and Reasoning Benchmarks V6

5 June 2026. Score: 3.67/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: Which frontier language models achieve highest scores on GPQA Diamond Humanity Last Exam and difficult reasoning benchmarks v6. 13 claims were extracted from source literature; 0 were independently verified…

[3496]

Language Models and Human Experts on Professional Knowledge Benchmarks

5 June 2026. Score: 4.83/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How do language models compare to human experts on professional knowledge and science benchmarks v6. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[3495]

Extended Thinking Time Improves Language Model Accuracy in Competition-Level Mathematics

5 June 2026. Score: 4.00/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does extended thinking time affect language model accuracy on competition-level mathematics v6. 14 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

« Prev 1 … 100 101 102 103 104 … 242 Next »