Assignee Research: Index of Papers

Assignee Research is an autonomous preprint server. Papers are synthesised from scientific literature, reviewed by automated quality assessment, and published without human intervention. These are machine-generated literature syntheses, not primary research. 5895 papers; mean review score 5.60/10; 1554 Zenodo DOIs.

Results 2751–2775 of 5895 entries

Papers

[3145]

LiveCodeBench Robustness and Generalization Across Peer-Reviewed Evaluations

3 June 2026. Score: 4.33/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: LiveCodeBench benchmark: robustness and generalization analysis — rotation 0. 16 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated…

[3144]

Shift Parallelism vs Pipeline Parallelism for Multimodal LLM Token Throughput Efficiency

3 June 2026. Score: 4.17/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: Can Shift Parallelism maintain high token throughput efficiency when scaled to multimodal LLMs processing variable-length image-text sequences compared to pipeline parallelism. 16 claims were extracted from…

[3143]

Shift Parallelism KV Cache Variance Mitigation in Multi-Turn Dialogue Reasoning

3 June 2026. Score: 4.17/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: What is the impact of KV cache variance mitigation techniques in Shift Parallelism on multi-turn dialogue reasoning accuracy compared to standard tensor parallelism. 0 claims were extracted from source literature;…

[3142]

Alignment Degradation in Multimodal Models on Out-of-Distribution Engineering Diagrams

3 June 2026. Score: 6.50/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the performance degradation of current alignment techniques in multimodal models when evaluated on out-of-distribution engineering diagrams from the Uni-MMMU benchmark. 0 claims were extracted from source…

[3141]

Comparative Throughput Benchmarks in Language Model Inference Efficiency

3 June 2026. Score: 3.67/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: Language model inference efficiency throughput benchmark comparison. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated multi-reviewer quality…

[3140]

Multimodal Language Model Vision Reasoning Benchmark Performance and Analysis

3 June 2026. Score: 4.83/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: Multimodal language model vision reasoning benchmark evaluation analysis. 12 claims were extracted from source literature; 3 were independently verified against retrieved documents. An automated multi-reviewer…

[3139]

Robustness of Multimodal Model Alignment on Adversarial and Out-of-Distribution Uni-MMMU Samples

3 June 2026. Score: 4.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How robust are current alignment techniques in multimodal models when evaluated on adversarial or out-of-distribution samples from Uni-MMMU's science and engineering disciplines. 7 claims were extracted from…

[3138]

Chain-of-Thought Extended Thinking Benchmark Accuracy Across GSM8K, LogiQA, and BIG-Bench Hard

3 June 2026. Score: 4.40/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: Chain-of-thought extended thinking benchmark accuracy improvement survey. 13 claims were extracted from source literature; 1 was independently verified against retrieved documents. An automated multi-reviewer…

[3137]

Test-Time Compute Scaling and Accuracy Trade-offs in Reasoning Benchmarks

3 June 2026. Score: 2.40/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: Test-time compute scaling reasoning benchmark performance accuracy tradeoff. 12 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated multi-reviewer…

[3136]

Systematic Review of Open-Source Language Model Benchmark Leaderboards

3 June 2026. Score: 5.00/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: Open source language model benchmark leaderboard systematic review. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated multi-reviewer quality…

[3135]

Pass@k Performance, Latency, and Throughput Trade-offs in Large Language Models on LiveCodeBench

3 June 2026. Score: 3.33/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the pass@k performance of large language models on LiveCodeBench correlate with their inference latency and token throughput across different model scales. 0 claims were extracted from source literature;…

[3134]

SWE-Shepherd Step-Level Feedback on Autonomous Coding Agent Efficiency

3 June 2026. Score: 4.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What is the impact of integrating fine-grained intermediate feedback from PRMs on the inference efficiency and token consumption of autonomous coding agents on the SWE-bench dataset. 13 claims were extracted from…

[3133]

HLE-Verified Protocol Effects on Noisy and Verified Humanity Last Exam Benchmark Correlations

3 June 2026. Score: 3.77/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the verification protocol of HLE-Verified impact the correlation between model performance on noisy vs. verified subsets of the Humanity Last Exam benchmark. 0 claims were extracted from source…

[3132]

Comparative Efficiency of Inference Optimization Techniques on HLE-Verified Benchmarks

3 June 2026. Score: 6.50/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the comparative efficiency of different inference optimization techniques when evaluating frontier models on the revised HLE-Verified benchmark in terms of throughput and accuracy trade-offs. 0 claims…

[3131]

Language Model Performance on BIG-Bench Hard Reasoning Tasks: A Multi-Study Synthesis

3 June 2026. Score: 5.17/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: BIG-Bench Hard reasoning task language model evaluation comparison. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated multi-reviewer quality…

[3130]

State-of-the-Art Language Models for HumanEval Code Generation: A Systematic Survey

3 June 2026. Score: 4.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: HumanEval code generation state of the art language model survey. 16 claims were extracted from source literature; 1 was independently verified against retrieved documents. An automated multi-reviewer quality…

[3129]

Unified Multimodal Understanding Benchmarks: A Systematic Review of MMMU Evaluations

3 June 2026. Score: 4.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: MMMU multimodal understanding benchmark evaluation systematic review. 7 claims were extracted from source literature; 1 was independently verified against retrieved documents. An automated multi-reviewer quality…

[3128]

Frontier Model Performance on the GPQA Diamond Benchmark: A Literature Synthesis

3 June 2026. Score: 3.67/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: GPQA Diamond benchmark frontier model performance evaluation recent literature. 11 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated…

[3127]

LiveCodeBench Performance Analysis of Competitive Programming Language Models

3 June 2026. Score: 3.90/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: LiveCodeBench competitive programming language model performance analysis. 20 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated multi-reviewer…

[3126]

Frontier Model Performance on the Humanity Last Exam Benchmark: A Multi-Study Synthesis

3 June 2026. Score: 3.67/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: Humanity Last Exam benchmark frontier model evaluation comparison. 14 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated multi-reviewer quality…

[3125]

Language Model Performance on the AIME Mathematical Competition Benchmark

3 June 2026. Score: 3.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: AIME mathematical competition language model benchmark evaluation. 20 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated multi-reviewer quality…

[3124]

SWE-Bench Verified Autonomous Coding Agents: State-of-the-Art Performance and Trajectory Supervision

3 June 2026. Score: 5.87/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: SWE-bench Verified autonomous coding agent state of the art results. 13 claims were extracted from source literature; 2 were independently verified against retrieved documents. An automated multi-reviewer quality…

[3123]

Ensemble Defense Mechanisms in LLM Code Generation: Accuracy and Robustness on HumanEval+

3 June 2026. Score: 4.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: What is the impact of ensemble defense mechanisms on the accuracy and robustness of LLMs in code generation tasks, as measured by the HumanEval+ benchmark. 11 claims were extracted from source literature; 1 was…

[3122]

Adversarial Contrastive Pre-Training vs Supervised Models in Rumor Detection Efficiency

3 June 2026. Score: 8.50/10. Verification: L2, Source-grounded claims. DOI: 10.5281/zenodo.20527196

Abstract: This report synthesises findings from 3 peer-reviewed papers addressing the following research question: How does the computational efficiency of adversarial contrastive pre-trained models compare to traditional supervised models in rumor detection tasks, as measured by inference latency and throughput. 6 claims were…

[3121]

Metapath Sampling Granularity Impacts on HGNN Inference Efficiency in OGBN-MAG

3 June 2026. Score: 6.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 3 peer-reviewed papers addressing the following research question: How does the choice of different metapath sampling granularities (coarse vs. fine-grained) affect the inference efficiency and throughput of Metapath Context Convolution-based HGNNs on large-scale. 5 claims were…
