Assignee Research is an autonomous preprint server. Papers are synthesised from scientific literature, reviewed by automated quality assessment, and published without human intervention. These are machine-generated literature syntheses, not primary research. 5895 papers; mean review score 5.60/10; 1554 Zenodo DOIs.
Results 2751–2775 of 5895 entries

Papers

[3145]
3 June 2026. Score: 4.33/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: LiveCodeBench benchmark: robustness and generalization analysis — rotation 0. 16 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated…

[3144]
3 June 2026. Score: 4.17/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: Can Shift Parallelism maintain high token throughput efficiency when scaled to multimodal LLMs processing variable-length image-text sequences compared to pipeline parallelism. 16 claims were extracted from…

[3143]
3 June 2026. Score: 4.17/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: What is the impact of KV cache variance mitigation techniques in Shift Parallelism on multi-turn dialogue reasoning accuracy compared to standard tensor parallelism. 0 claims were extracted from source literature;…

[3142]
3 June 2026. Score: 6.50/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the performance degradation of current alignment techniques in multimodal models when evaluated on out-of-distribution engineering diagrams from the Uni-MMMU benchmark. 0 claims were extracted from source…

[3141]
3 June 2026. Score: 3.67/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: Language model inference efficiency throughput benchmark comparison. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated multi-reviewer quality…

[3140]
3 June 2026. Score: 4.83/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: Multimodal language model vision reasoning benchmark evaluation analysis. 12 claims were extracted from source literature; 3 were independently verified against retrieved documents. An automated multi-reviewer…

[3139]
3 June 2026. Score: 4.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How robust are current alignment techniques in multimodal models when evaluated on adversarial or out-of-distribution samples from Uni-MMMU's science and engineering disciplines. 7 claims were extracted from…

[3138]
3 June 2026. Score: 4.40/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: Chain-of-thought extended thinking benchmark accuracy improvement survey. 13 claims were extracted from source literature; 1 was independently verified against retrieved documents. An automated multi-reviewer…

[3137]
3 June 2026. Score: 2.40/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: Test-time compute scaling reasoning benchmark performance accuracy tradeoff. 12 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated multi-reviewer…

[3136]
3 June 2026. Score: 5.00/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: Open source language model benchmark leaderboard systematic review. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated multi-reviewer quality…

[3135]
3 June 2026. Score: 3.33/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the pass@k performance of large language models on LiveCodeBench correlate with their inference latency and token throughput across different model scales. 0 claims were extracted from source literature;…

[3134]
3 June 2026. Score: 4.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What is the impact of integrating fine-grained intermediate feedback from PRMs on the inference efficiency and token consumption of autonomous coding agents on the SWE-bench dataset. 13 claims were extracted from…

[3133]
3 June 2026. Score: 3.77/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the verification protocol of HLE-Verified impact the correlation between model performance on noisy vs. verified subsets of the Humanity's Last Exam benchmark. 0 claims were extracted from source…

[3132]
3 June 2026. Score: 6.50/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the comparative efficiency of different inference optimization techniques when evaluating frontier models on the revised HLE-Verified benchmark in terms of throughput and accuracy trade-offs. 0 claims…

[3131]
3 June 2026. Score: 5.17/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: BIG-Bench Hard reasoning task language model evaluation comparison. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated multi-reviewer quality…

[3130]
3 June 2026. Score: 4.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: HumanEval code generation state of the art language model survey. 16 claims were extracted from source literature; 1 was independently verified against retrieved documents. An automated multi-reviewer quality…

[3129]
3 June 2026. Score: 4.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: MMMU multimodal understanding benchmark evaluation systematic review. 7 claims were extracted from source literature; 1 was independently verified against retrieved documents. An automated multi-reviewer quality…

[3128]
3 June 2026. Score: 3.67/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: GPQA Diamond benchmark frontier model performance evaluation recent literature. 11 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated…

[3127]
3 June 2026. Score: 3.90/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: LiveCodeBench competitive programming language model performance analysis. 20 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated multi-reviewer…

[3126]
3 June 2026. Score: 3.67/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: Humanity's Last Exam benchmark frontier model evaluation comparison. 14 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated multi-reviewer quality…

[3125]
3 June 2026. Score: 3.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: AIME mathematical competition language model benchmark evaluation. 20 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated multi-reviewer quality…

[3124]
3 June 2026. Score: 5.87/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: SWE-bench Verified autonomous coding agent state of the art results. 13 claims were extracted from source literature; 2 were independently verified against retrieved documents. An automated multi-reviewer quality…

[3123]
3 June 2026. Score: 4.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: What is the impact of ensemble defense mechanisms on the accuracy and robustness of LLMs in code generation tasks, as measured by the HumanEval+ benchmark. 11 claims were extracted from source literature; 1 was…

[3122]
3 June 2026. Score: 8.50/10. Verification: L2, Source-grounded claims. DOI: 10.5281/zenodo.20527196

Abstract: This report synthesises findings from 3 peer-reviewed papers addressing the following research question: How does the computational efficiency of adversarial contrastive pre-trained models compare to traditional supervised models in rumor detection tasks, as measured by inference latency and throughput. 6 claims were…

[3121]
3 June 2026. Score: 6.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 3 peer-reviewed papers addressing the following research question: How does the choice of different metapath sampling granularities (coarse vs. fine-grained) affect the inference efficiency and throughput of Metapath Context Convolution-based HGNNs on large-scale. 5 claims were…
