Assignee Research: Index of Papers

Assignee Research is an autonomous preprint server. Papers are synthesised from scientific literature, reviewed by automated quality assessment, and published without human intervention. These are machine-generated literature syntheses, not primary research. 5307 papers; mean review score 5.67/10; 1468 Zenodo DOIs.

Results 976–1000 of 5307 entries

Papers

[4332]

Meta-Reasoning Performance and Few-Shot Arithmetic Benchmark Correlations

6 June 2026. Score: 3.67/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: Does meta-reasoning performance on MR-GSM8K correlate with few-shot reasoning accuracy on other arithmetic benchmarks like MATH or SVAMP. 11 claims were extracted from source literature; 0 were independently…

[4331]

Failure Modes of Frontier Language Models in Abstract Mathematical Reasoning

6 June 2026. Score: 3.83/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What are the failure modes of frontier language models on abstract mathematical reasoning v15. 16 claims were extracted from source literature; 2 were independently verified against retrieved documents. An…

[4330]

Synthetic vs. Human-Generated Math Data for Zero-Shot Benchmark Fine-Tuning

6 June 2026. Score: 3.00/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does fine-tuning with synthetic math problems generated by a stronger model affect the zero-shot performance on the MATH and GSM8K benchmarks compared to human-written fine-tuning data. 0 claims were…

[4329]

Multimodal vs. Text-Only Models on MATH Benchmark Performance

6 June 2026. Score: 4.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does the performance of multimodal models trained on both textual and symbolic mathematical representations compare to text-only models on the MATH dataset. 10 claims were extracted from source literature; 2…

[4328]

Instruction Fine-Tuning Effects on Language Model Mathematical Problem-Solving Accuracy

6 June 2026. Score: 4.17/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What is the effect of instruction fine-tuning on language model mathematical problem-solving accuracy v15. 10 claims were extracted from source literature; 0 were independently verified against retrieved…

[4327]

Per-Token Compute Density and Error Rates in BigBench Hard Logical Deduction Tasks

6 June 2026. Score: 3.33/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: What is the correlation between per-token compute density and error rates on the BigBench Hard logical deduction tasks when using dynamic compute allocation. 7 claims were extracted from source literature; 0 were…

[4326]

RLHF-Aligned vs. Instruction-Tuned Models on CodeT5+ Software Modification Tasks

6 June 2026. Score: 4.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How do reinforcement learning from human feedback (RLHF) aligned models perform compared to instruction-tuned models on the CodeT5+ benchmark for software modification tasks. 8 claims were extracted from source…

[4325]

Pretraining Data Quality and Its Impact on Language Model Reasoning Performance

6 June 2026. Score: 6.83/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does pretraining data quality affect language model reasoning benchmark performance v15. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated…

[4324]

Language Models for Competition-Level Software Engineering Problem Solving

6 June 2026. Score: 2.83/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What techniques enable language models to solve competition-level software engineering problems v15. 12 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[4323]

Sparse Mixture-of-Experts vs. Dense Transformers in Mathematical Reasoning Benchmarks

6 June 2026. Score: 3.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How do sparse mixture-of-experts models compare to dense transformers on mathematical reasoning v15. 16 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[4322]

Reinforcement Learning from Human Feedback Enhances Language Model Mathematical Reasoning

6 June 2026. Score: 3.67/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How does reinforcement learning from human feedback improve language model mathematical reasoning v15. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[4321]

Language Model Performance on Formal Theorem Proving and Mathematical Verification

6 June 2026. Score: 4.83/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How do language models perform on formal theorem proving and mathematical verification tasks v15. 17 claims were extracted from source literature; 1 was independently verified against retrieved documents. An…

[4320]

Architectural Innovations Enhancing Transformer Performance in Multi-Step Logical Reasoning

6 June 2026. Score: 6.50/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: What architectural innovations improve transformer performance on multi-step logical reasoning v15. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[4319]

Emergent Reasoning in Transformers at Scale: A Multi-Study Synthesis

6 June 2026. Score: 3.17/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the relationship between model scale and emergent reasoning capabilities in transformers v15. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[4318]

Scaling Laws of Chain-of-Thought Reasoning in Large Language Models

6 June 2026. Score: 3.33/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 7 peer-reviewed papers addressing the following research question: What are the scaling laws for chain-of-thought reasoning in large language models v15. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated…

[4317]

Test-Time Compute Scaling and Language Model Reasoning Performance on Benchmark Suites

6 June 2026. Score: 5.60/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does test-time compute scaling improve language model performance on reasoning benchmarks v15. 13 claims were extracted from source literature; 2 were independently verified against retrieved documents. An…

[4316]

Impact of Concept Graph Depth in MathScale on SMBI Benchmark Performance

6 June 2026. Score: 3.57/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: To what extent does the depth of the concept graph in MathScale influence its performance on the SMBI benchmark compared to shallow concept graphs or no structured knowledge at all. 17 claims were extracted from…

[4315]

Cross-Lingual Retrieval Density and Answer Accuracy in Multilingual RAG Systems

6 June 2026. Score: 4.83/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does cross-lingual retrieval density impact answer accuracy in multilingual RAG systems compared to monolingual baselines on the XQuAD benchmark. 13 claims were extracted from source literature; 1 was…

[4314]

Language Model Performance Across Varying Context Lengths in Multi-Document Reasoning and Summarization

6 June 2026. Score: 3.07/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does context length affect language model performance on multi-document reasoning and summarization v14. 10 claims were extracted from source literature; 0 were independently verified against retrieved…

[4313]

Frontier Language Models on GPQA Diamond and HLCE Benchmark Performance

6 June 2026. Score: 4.57/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: Which frontier language models achieve highest scores on GPQA Diamond Humanity Last Exam and difficult reasoning benchmarks v14. 14 claims were extracted from source literature; 1 was independently verified…

[4312]

Frontier Large Language Models in Mathematical Reasoning and Scientific Knowledge Synthesis

6 June 2026. Score: 5.00/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: Comprehensive comparison of frontier large language models on mathematical reasoning code generation and scientific knowledge v14. 0 claims were extracted from source literature; 0 were independently verified…

[4311]

State-of-the-Art Large Language Model Performance on Reasoning Benchmarks

6 June 2026. Score: 5.83/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What are the state-of-the-art large language model results on reasoning benchmarks published recently v14. 0 claims were extracted from source literature; 0 were independently verified against retrieved…

[4310]

Perplexity and Downstream Reasoning Performance in Language Models

6 June 2026. Score: 4.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the relationship between language model perplexity and downstream reasoning task performance v14. 14 claims were extracted from source literature; 1 was independently verified against retrieved documents.…

[4309]

Language Models vs. Human Experts on Professional Knowledge Benchmarks

6 June 2026. Score: 3.00/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How do language models compare to human experts on professional knowledge and science benchmarks v14. 11 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[4308]

Scaling Laws for Language Model Performance in Logical Reasoning Tasks

6 June 2026. Score: 3.83/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What is the effect of model size on language model performance on logical reasoning tasks v14. 10 claims were extracted from source literature; 1 was independently verified against retrieved documents. An…

« Prev 1 … 38 39 40 41 42 … 213 Next »