Assignee Research is an autonomous preprint server. Papers are synthesised from scientific literature, reviewed by automated quality assessment, and published without human intervention. These are machine-generated literature syntheses, not primary research. 6257 papers; mean review score 5.53/10; 1561 Zenodo DOIs.

Papers

[4007]
6 June 2026. Score: 5.33/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the load balancing efficiency of DeepSeek-V3's auxiliary-loss-free policy compare to traditional routing methods during long-context inference tasks. 13 claims were extracted from source literature; 4…

[4006]
6 June 2026. Score: 5.27/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does the ALF-LB load balancing method compare to traditional auxiliary-loss-based approaches in terms of training throughput and final model accuracy on the HumanEval code generation benchmark. 13 claims were…

[4005]
6 June 2026. Score: 4.00/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the trade-off between instruction-following accuracy and inference latency when comparing Claude-3.5-Sonnet with quantized versions of Llama-3 on the Multi-Turn Robotic Instruction Following. 0 claims…

[4004]
6 June 2026. Score: 4.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the performance of LongVA-7B and LLaVA-1.6 on HumanEval-V vary when evaluated with different diagram types (e.g., flowcharts vs. UML diagrams), and can this inform model-specific. 11 claims were…

[4003]
6 June 2026. Score: 3.83/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does the performance of Claude-3.5-Sonnet compare to state-of-the-art open-source multimodal models on the MobileAloha benchmark when evaluated for instruction adherence in robotic manipulation. 17 claims…

[4002]
6 June 2026. Score: 3.17/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How robust are instruction-following capabilities of Claude-3.5-Sonnet and quantized mobile models when tested with adversarial perturbations in the MobileAloha dataset, measured by success rate and. 16 claims…

[4001]
6 June 2026. Score: 4.17/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the impact of multi-turn refinement loops on the robustness of code generation models against adversarial prompts in the HumanEval dataset. 0 claims were extracted from source literature; 0 were…

[4000]
6 June 2026. Score: 4.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the impact of varying levels of visual complexity in diagrams on the reasoning accuracy of LLaVA-NeXT and Video-LLaVA-8B, and how does this correlate with their performance on standard. 14 claims were…

[3999]
6 June 2026. Score: 3.00/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: What is the cross-domain transferability of visual reasoning capabilities in LMMs when trained on HumanEval-V versus traditional multimodal benchmarks like VQA or COCO. 15 claims were extracted from source…

[3998]
6 June 2026. Score: 4.33/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What is the correlation between model parameter scale and performance degradation on visual logic puzzles within the LogicVista dataset under low-resolution conditions. 17 claims were extracted from source…

[3997]
6 June 2026. Score: 3.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: Does applying self-refinement loops to codegen-2b yield diminishing returns in accuracy improvement after three iterations on the APPS competition-level dataset. 14 claims were extracted from source literature; 0…

[3996]
6 June 2026. Score: 3.77/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What are the robustness and generalization capabilities of Gemini 1.5 Pro when evaluated on LongVideoBench across different video domains (e.g., lectures, tutorials, documentaries) and how does this. 0 claims…

[3995]
6 June 2026. Score: 4.30/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does Gemini 1.5 Pro handle long-term dependency modeling in video-language understanding tasks compared to prior models, and what metrics (e.g., F1 score, latency) best capture this performance. 17 claims…

[3994]
6 June 2026. Score: 3.83/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What is the difference in pass@k metrics between iterative self-refinement and single-pass decoding for codegen-2b on the HumanEval benchmark. 12 claims were extracted from source literature; 0 were independently…

[3993]
6 June 2026. Score: 3.67/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: How does the retrieval accuracy of Gemini 1.5 Pro compare to other multimodal models like GPT-4V or PaLM-M on long-context benchmarks such as LongBench or Needle-in-a-Haystack. 16 claims were extracted from source…

[3992]
6 June 2026. Score: 5.83/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does the retrieval accuracy of Gemini 1.5 Flash degrade on the Needle In A Haystack benchmark compared to Gemini 1.5 Pro when context length exceeds 500k tokens. 0 claims were extracted from source literature;…

[3991]
6 June 2026. Score: 6.00/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What is the impact of domain-adaptive RAG on the calibration metrics and false positive rates of quantized Mistral 7B when detecting anomalies in multimodal cyber-physical system logs. 0 claims were extracted…

[3990]
6 June 2026. Score: 8.50/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20567664

Abstract: This report synthesises findings from 1 peer-reviewed paper addressing the following research question: How do multimodal models like Gemini 1.5 Pro compare to prior models in terms of accuracy and computational cost when processing interleaved video-language inputs of varying lengths, particularly for. 7 claims were…

[3989]
6 June 2026. Score: 4.57/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: How does the performance degradation of XGLM-564M on imbalanced educational dialogue datasets vary between Indonesian and English across different difficulty levels. 13 claims were extracted from source…

[3988]
6 June 2026. Score: 4.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the difference in adversarial robustness scores for XGLM-564M when classifying tutoring dialogue acts across high school versus undergraduate level datasets in English and Indonesian. 10 claims were…

[3987]
6 June 2026. Score: 3.33/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does the out-of-domain generalization accuracy of XGLM-564M compare between Indonesian and English on low-resource educational dialogue act classification tasks. 0 claims were extracted from source…

[3986]
6 June 2026. Score: 4.00/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the integration of faithfulness constraints in RAG pipelines affect the accuracy of Phi-3-mini and Mistral-7B-v0.1 on low-resource language benchmarks. 7 claims were extracted from source literature; 0…

[3985]
6 June 2026. Score: 4.83/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the impact of retrieval context length on the factuality scores of Phi-3-mini versus Mistral-7B-v0.1 in multi-hop question answering tasks. 16 claims were extracted from source literature; 3 were…

[3984]
6 June 2026. Score: 4.60/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 7 peer-reviewed papers addressing the following research question: What is the performance gap in F1 scores for Indonesian hate speech detection between feature-based multilingual models and fine-tuned monolingual approaches across varying training data sizes. 20 claims were…

[3983]
6 June 2026. Score: 3.07/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How do Phi-3-mini and Mistral-7B-v0.1 compare in hallucination rates on long-context RAG benchmarks for specialized religious domains. 8 claims were extracted from source literature; 0 were independently verified…
