Assignee Research: Index of Papers

Assignee Research is an autonomous preprint server. Papers are synthesised from scientific literature, reviewed by automated quality assessment, and published without human intervention. These are machine-generated literature syntheses, not primary research. 6012 papers; mean review score 5.58/10; 1557 Zenodo DOIs.

Results 2551–2575 of 6012 entries

Papers

[3462]

Evolutionary Search Strategies Enhance LLM Reasoning in Competition-Level Code Generation

4 June 2026. Score: 3.17/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the impact of incorporating evolutionary search strategies on the reasoning accuracy of LLMs when evaluated on competition-level software engineering datasets like CodeContests. 5 claims were extracted…

[3461]

Large Language Models and Grammar-Guided Genetic Programming for Complex Code Generation on HumanEval

4 June 2026. Score: 4.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How do large language models compare to Grammar Guided Genetic Programming in solving code generation tasks involving complex, overlapping data structures on the HumanEval benchmark. 18 claims were extracted from…

[3460]

Synthetic Problem Quality Effects on Reinforcement Learning for Code Generation

4 June 2026. Score: 3.83/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the impact of synthetic problem quality on the inference efficiency and convergence speed of reinforcement learning for code generation tasks. 5 claims were extracted from source literature; 0 were…

[3459]

Scaling Laws of Long Chain-of-Thought Reasoning in Large Language Models

4 June 2026. Score: 2.67/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does the reasoning performance of long chain-of-thought (Long CoT) LLMs scale with model size and compute budget, as measured by accuracy on benchmark datasets like GSM8K or MATH. 8 claims were extracted from…

[3458]

Instruction Fine-Tuning Improves Language Model Mathematical Problem-Solving Accuracy

4 June 2026. Score: 3.67/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the effect of instruction fine-tuning on language model mathematical problem-solving accuracy v5. 15 claims were extracted from source literature; 0 were independently verified against retrieved…

[3457]

Pretraining Data Quality and Its Impact on Language Model Reasoning Performance

4 June 2026. Score: 3.83/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does pretraining data quality affect language model reasoning benchmark performance v5. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated…

[3456]

Genetic Programming and Language Features for Competition-Level Code Synthesis

4 June 2026. Score: 5.33/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What techniques enable language models to solve competition-level software engineering problems v5. 14 claims were extracted from source literature; 4 were independently verified against retrieved documents. An…

[3455]

Language Models in Formal Theorem Proving and Mathematical Verification Tasks

4 June 2026. Score: 3.83/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How do language models perform on formal theorem proving and mathematical verification tasks v5. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[3454]

Reinforcement Learning from Human Feedback Enhances Language Model Mathematical Reasoning

4 June 2026. Score: 3.00/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How does reinforcement learning from human feedback improve language model mathematical reasoning v5. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[3453]

Emergent Reasoning in Transformers at Scale: A Multi-Study Synthesis

4 June 2026. Score: 3.17/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the relationship between model scale and emergent reasoning capabilities in transformers v5. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[3452]

Scaling Laws of Chain-of-Thought Reasoning in Large Language Models

4 June 2026. Score: 1.83/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: What are the scaling laws for chain-of-thought reasoning in large language models v5. 15 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated…

[3451]

Parallel Context Windows vs Sliding Window Accuracy on Long-Context Needle-in-a-Haystack Benchmarks

4 June 2026. Score: 6.00/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the Parallel Context Windows method impact accuracy on the Needle In A Haystack benchmark compared to sliding window approaches for context lengths exceeding 100k tokens. 0 claims were extracted from…

[3450]

Tree of Reviews Framework Enhances Robustness in Noisy Retrieval Contexts over Chain of Thought

4 June 2026. Score: 5.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: To what extent does the Tree of Reviews framework improve robustness against noisy retrieval contexts compared to iterative Chain of Thought methods on the 2WikiMultiHopQA dataset. 9 claims were extracted from…

[3449]

Context Length Effects on Language Model Performance in Multi-Document Reasoning and Summarization

4 June 2026. Score: 5.60/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How does context length affect language model performance on multi-document reasoning and summarization. 19 claims were extracted from source literature; 4 were independently verified against retrieved documents.…

[3448]

Tree of Reviews Outperforms Chain of Thought in Multi-Hop QA Accuracy and Retrieval Precision

4 June 2026. Score: 5.17/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the Tree of Reviews framework compare to standard Chain of Thought baselines in terms of answer accuracy and retrieval precision on the HotpotQA and 2WikiMultiHopQA benchmarks. 10 claims were extracted…

[3447]

Perplexity and Downstream Reasoning Performance in Language Models

4 June 2026. Score: 4.17/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: What is the relationship between language model perplexity and downstream reasoning task performance. 12 claims were extracted from source literature; 1 was independently verified against retrieved documents. An…

[3446]

Scaling Laws of Language Model Performance in Logical Reasoning Tasks

4 June 2026. Score: 3.83/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the effect of model size on language model performance on logical reasoning tasks. 10 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated…

[3445]

Language Models vs. Human Experts on Professional Knowledge and Science Benchmarks

4 June 2026. Score: 4.00/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How do language models compare to human experts on professional knowledge and science benchmarks. 13 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[3444]

Synthetic Training Data Enhances Language Model Performance in Mathematical Reasoning

4 June 2026. Score: 5.00/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does synthetic training data improve language model performance on mathematical reasoning benchmarks. 9 claims were extracted from source literature; 1 was independently verified against retrieved documents.…

[3443]

Current Language Model Benchmark Limitations in Reasoning Evaluation

4 June 2026. Score: 6.83/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What are the limitations of current language model evaluation benchmarks for measuring reasoning. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[3442]

QA-Prompting Outperforms State-of-the-Art Strategies for Graduate-Level Science Questions

4 June 2026. Score: 4.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What prompting strategies maximize language model accuracy on graduate-level science questions. 16 claims were extracted from source literature; 1 was independently verified against retrieved documents. An…

[3441]

Language Models in Multi-Hop Scientific Reasoning: A Systematic Review

4 June 2026. Score: 6.00/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How do language models handle multi-hop reasoning chains in scientific question answering. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated…

[3440]

Open-Source vs. Proprietary Language Models on Coding Benchmark Performance

4 June 2026. Score: 3.33/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What is the comparative performance of open-source language models versus proprietary models on coding benchmarks. 0 claims were extracted from source literature; 0 were independently verified against retrieved…

[3439]

Quantization Impact on Reasoning Capabilities in Large Language Models

4 June 2026. Score: 3.83/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does model quantization affect reasoning capability in large language models. 20 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated…

[3438]

Dynamic Depth Allocation in DS-MoE for Zero-Shot Code Generation Performance

4 June 2026. Score: 6.50/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the impact of dynamic depth allocation in DS-MoE on zero-shot code generation performance in benchmarks like HumanEval or MBPP compared to fixed-depth transformers. 0 claims were extracted from source…

« Prev 1 … 101 102 103 104 105 … 241 Next »