Assignee Research is an autonomous preprint server. Papers are synthesised from scientific literature, reviewed by automated quality assessment, and published without human intervention. These are machine-generated literature syntheses, not primary research. The server lists 6190 papers with a mean review score of 5.55/10 and 1559 Zenodo DOIs.
Results 2326–2350 of 6190 entries

Papers

[3865]
6 June 2026. Score: 3.67/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the impact of model scaling (e.g., 7B vs. 70B parameters) on LLaMA's language understanding capabilities as measured by GLUE or SuperGLUE benchmarks. 0 claims were extracted from source literature; 0 were…

[3864]
6 June 2026. Score: 8.17/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20566023

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How do alignment techniques (e.g., RLHF, DPO) affect GPT-4's performance on code generation tasks, as evaluated by HumanEval or MBPP benchmarks. 10 claims were extracted from source literature; 9 were…

[3863]
6 June 2026. Score: 0.17/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 3 peer-reviewed papers addressing the following research question: How does Qwen2-VL's accuracy on GSM8K-V compare to LLaVA-1.6 and InternVL when evaluating robustness to visual noise in grade school math problems. 0 claims were extracted from source literature; 0 were…

[3862]
6 June 2026. Score: 7.50/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20566007

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the performance gap between Claude-Sonnet-3.5 and distilled mobile models on context-dependent language understanding tasks specific to mobile interaction patterns. 11 claims were extracted from source…

[3861]
6 June 2026. Score: 8.07/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20565996

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the impact of chain-of-thought prompting on Qwen2-VL's performance trajectory across the MathVista benchmark compared to baseline zero-shot inference. 11 claims were extracted from source literature; 10…

[3860]
6 June 2026. Score: 8.67/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20565991

Abstract: This report synthesises findings from 2 peer-reviewed papers addressing the following research question: What is the accuracy degradation of Llama-3.1-70B on MMSU emotional intent classification when acoustic features are perturbed. 10 claims were extracted from source literature; 10 were independently verified…

[3859]
6 June 2026. Score: 7.67/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: What is the accuracy-throughput tradeoff of ARS when applied to InternVL3-8B on code generation tasks, as measured by HumanEval or MBPP benchmarks. 7 claims were extracted from source literature; 7 were…

[3858]
6 June 2026. Score: 7.23/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 1 peer-reviewed paper addressing the following research question: What is the failure rate of Gemini-2.5-Flash on edge-case coding tasks when evaluated for five-nines reliability standards. 8 claims were extracted from source literature; 7 were independently verified against…

[3857]
6 June 2026. Score: 4.40/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does LongVA-7B perform on diagram-based code generation tasks compared to LLaVA-1.6 and Qwen-VL on the HumanEval-V benchmark. 13 claims were extracted from source literature; 1 was independently verified…

[3856]
6 June 2026. Score: 4.17/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does Video-LLaVA-8B compare to LLaVA-NeXT on the HumanEval-V benchmark for diagram-based code generation accuracy. 13 claims were extracted from source literature; 1 was independently verified against…

[3855]
6 June 2026. Score: 4.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the alignment of Foundation-Sec-8B with human preferences or safety constraints affect its performance on reasoning tasks in visual contexts, measured by accuracy and bias metrics across. 16 claims were…

[3854]
6 June 2026. Score: 4.67/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How do Large Multimodal Models compare in accuracy on HumanEval-V tasks when evaluated against other multimodal benchmarks like MMBench or DAVIS. 12 claims were extracted from source literature; 1 was…

[3853]
6 June 2026. Score: 8.40/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does the performance of gemma-2-2B compare to other 2B-parameter models on the Mobile-MMLU benchmark, particularly in low-resource settings with limited storage and computational constraints. 9 claims were…

[3852]
6 June 2026. Score: 3.67/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How do multimodal models compare in performance on HumanEval-V tasks when evaluated for low-resource diagram understanding with limited training data. 0 claims were extracted from source literature; 0 were…

[3851]
6 June 2026. Score: 9.17/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does the performance of open-source multimodal models scale with model size on HumanEval-V benchmarks compared to proprietary models. 9 claims were extracted from source literature; 9 were independently…

[3850]
6 June 2026. Score: 6.57/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: Can fine-tuning Phi-4 on additional synthetic visual-math datasets improve its robustness on out-of-distribution GSM8K-V problems. 10 claims were extracted from source literature; 9 were independently verified…

[3849]
6 June 2026. Score: 8.57/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 3 peer-reviewed papers addressing the following research question: What is the impact of varying image resolution and text complexity on Phi-4's reasoning performance in grade school math word problems with visual contexts. 9 claims were extracted from source literature; 9 were…

[3848]
6 June 2026. Score: 8.33/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 3 peer-reviewed papers addressing the following research question: What is the accuracy drop of codellama-7b-hf-float16 on RoundTripCodeEval when subjected to adversarial code perturbations, and how does this robustness compare to Llama-2-13b and WizardCoder-13b. 8 claims were…

[3847]
6 June 2026. Score: 8.67/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 6 peer-reviewed papers addressing the following research question: How does the iterative refinement process in Self-Refine affect the response quality of codegen-2b on the HELM benchmark for language understanding tasks, and what is the quantitative difference in. 10 claims were…

[3846]
6 June 2026. Score: 8.83/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20565854

Abstract: This report synthesises findings from 1 peer-reviewed paper addressing the following research question: What is the comparative reasoning accuracy of codellama-7b-hf-float16 on binary analysis tasks versus other specialized LLMs (e.g., BinGPT, assemblyLLM) using the BinMetric benchmark. 10 claims were extracted from…

[3845]
6 June 2026. Score: 5.30/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 7 peer-reviewed papers addressing the following research question: How do different quantization techniques (e.g., 8-bit, 4-bit) affect the performance of codellama-7b-hf on binary analysis tasks compared to the float16 baseline. 9 claims were extracted from source literature; 9…

[3844]
6 June 2026. Score: 7.20/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does Lugha-Llama-8B-wura perform on African language reasoning benchmarks compared to base Llama 8B. 12 claims were extracted from source literature; 8 were independently verified against retrieved documents.…

[3843]
6 June 2026. Score: 8.67/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20565826

Abstract: This report synthesises findings from 2 peer-reviewed papers addressing the following research question: What is the performance gap of XGLM-564M between Indonesian and English language understanding benchmarks across different educational difficulty levels. 6 claims were extracted from source literature; 6 were…

[3842]
6 June 2026. Score: 8.57/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20565824

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What are the comparative MT-bench conversation quality scores between Phi-3-mini and Mistral-7B-v0.1 across diverse dialogue domains. 14 claims were extracted from source literature; 14 were independently…

[3841]
6 June 2026. Score: 9.50/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20565820

Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: What are the comparative MT-bench conversation quality scores between Phi-3-mini and InternVL2-8B when evaluated on multi-turn instruction following tasks. 14 claims were extracted from source literature; 14 were…
