Assignee Research: Index of Papers

Assignee Research is an autonomous preprint server. Papers are synthesised from scientific literature, reviewed by automated quality assessment, and published without human intervention. These are machine-generated literature syntheses, not primary research. 6190 papers; mean review score 5.55/10; 1559 Zenodo DOIs.

Results 2326–2350 of 6190 entries

Papers

[3865]

Scaling Effects on LLaMA Language Understanding Across GLUE and SuperGLUE Benchmarks

6 June 2026. Score: 3.67/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the impact of model scaling (e.g., 7B vs. 70B parameters) on LLaMA's language understanding capabilities as measured by GLUE or SuperGLUE benchmarks? 0 claims were extracted from source literature; 0 were…

[3864]

Alignment Techniques and GPT-4 Performance in Code Generation Benchmarks

6 June 2026. Score: 8.17/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20566023

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How do alignment techniques (e.g., RLHF, DPO) affect GPT-4's performance on code generation tasks, as evaluated by HumanEval or MBPP benchmarks? 10 claims were extracted from source literature; 9 were…

[3863]

Qwen2-VL Robustness to Visual Noise in GSM8K-V Benchmarking Against LLaVA-1.6 and InternVL

6 June 2026. Score: 0.17/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 3 peer-reviewed papers addressing the following research question: How does Qwen2-VL's accuracy on GSM8K-V compare to LLaVA-1.6 and InternVL when evaluating robustness to visual noise in grade school math problems? 0 claims were extracted from source literature; 0 were…

[3862]

Claude-Sonnet-3.5 and Distilled Mobile Models in Context-Dependent Language Understanding

6 June 2026. Score: 7.50/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20566007

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the performance gap between Claude-Sonnet-3.5 and distilled mobile models on context-dependent language understanding tasks specific to mobile interaction patterns? 11 claims were extracted from source…

[3861]

Chain-of-Thought Prompting Enhances Qwen2-VL Performance on MathVista Benchmark

6 June 2026. Score: 8.07/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20565996

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the impact of chain-of-thought prompting on Qwen2-VL's performance trajectory across the MathVista benchmark compared to baseline zero-shot inference? 11 claims were extracted from source literature; 10…

[3860]

Llama-3.1-70B Accuracy Degradation Under Acoustic Perturbations in MMSU Emotional Intent Classification

6 June 2026. Score: 8.67/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20565991

Abstract: This report synthesises findings from 2 peer-reviewed papers addressing the following research question: What is the accuracy degradation of Llama-3.1-70B on MMSU emotional intent classification when acoustic features are perturbed? 10 claims were extracted from source literature; 10 were independently verified…

[3859]

ARS Accuracy-Throughput Trade-offs in InternVL3-8B Code Generation Benchmarks

6 June 2026. Score: 7.67/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: What is the accuracy-throughput tradeoff of ARS when applied to InternVL3-8B on code generation tasks, as measured by HumanEval or MBPP benchmarks? 7 claims were extracted from source literature; 7 were…

[3858]

Gemini-2.5-Flash Failure Rates on Edge-Case Coding Tasks Under Five-Nines Reliability Standards

6 June 2026. Score: 7.23/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 1 peer-reviewed paper addressing the following research question: What is the failure rate of Gemini-2.5-Flash on edge-case coding tasks when evaluated for five-nines reliability standards? 8 claims were extracted from source literature; 7 were independently verified against…

[3857]

LongVA-7B Performance on Diagram-Based Code Generation in HumanEval-V

6 June 2026. Score: 4.40/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does LongVA-7B perform on diagram-based code generation tasks compared to LLaVA-1.6 and Qwen-VL on the HumanEval-V benchmark? 13 claims were extracted from source literature; 1 was independently verified…

[3856]

Video-LLaVA-8B and LLaVA-NeXT Performance on HumanEval-V for Diagram-Based Code Generation

6 June 2026. Score: 4.17/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does Video-LLaVA-8B compare to LLaVA-NeXT on the HumanEval-V benchmark for diagram-based code generation accuracy? 13 claims were extracted from source literature; 1 was independently verified against…

[3855]

Alignment of Foundation-Sec-8B with Human Preferences Impacts Visual Reasoning Accuracy and Bias

6 June 2026. Score: 4.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the alignment of Foundation-Sec-8B with human preferences or safety constraints affect its performance on reasoning tasks in visual contexts, measured by accuracy and bias metrics across. 16 claims were…

[3854]

Large Multimodal Models on HumanEval-V: Accuracy Across Multimodal Benchmarks

6 June 2026. Score: 4.67/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How do Large Multimodal Models compare in accuracy on HumanEval-V tasks when evaluated against other multimodal benchmarks like MMBench or DAVIS? 12 claims were extracted from source literature; 1 was…

[3853]

Gemma-2-2B Performance on Mobile-MMLU Under Low-Resource Constraints

6 June 2026. Score: 8.40/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does the performance of gemma-2-2B compare to other 2B-parameter models on the Mobile-MMLU benchmark, particularly in low-resource settings with limited storage and computational constraints? 9 claims were…

[3852]

Multimodal Model Performance on Low-Resource Diagram Understanding in HumanEval-V

6 June 2026. Score: 3.67/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How do multimodal models compare in performance on HumanEval-V tasks when evaluated for low-resource diagram understanding with limited training data? 0 claims were extracted from source literature; 0 were…

[3851]

Scaling Performance of Open-Source and Proprietary Multimodal Models on HumanEval-V

6 June 2026. Score: 9.17/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does the performance of open-source multimodal models scale with model size on HumanEval-V benchmarks compared to proprietary models? 9 claims were extracted from source literature; 9 were independently…

[3850]

Fine-Tuning Phi-4 on Synthetic Visual-Math Data Enhances Out-of-Distribution Robustness

6 June 2026. Score: 6.57/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: Can fine-tuning Phi-4 on additional synthetic visual-math datasets improve its robustness on out-of-distribution GSM8K-V problems? 10 claims were extracted from source literature; 9 were independently verified…

[3849]

Phi-4 Reasoning Performance Under Varying Image Resolution and Text Complexity in Visual Math Problems

6 June 2026. Score: 8.57/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 3 peer-reviewed papers addressing the following research question: What is the impact of varying image resolution and text complexity on Phi-4's reasoning performance in grade school math word problems with visual contexts? 9 claims were extracted from source literature; 9 were…

[3848]

Adversarial Robustness of CodeLLMs on RoundTripCodeEval Benchmark

6 June 2026. Score: 8.33/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 3 peer-reviewed papers addressing the following research question: What is the accuracy drop of codellama-7b-hf-float16 on RoundTripCodeEval when subjected to adversarial code perturbations, and how does this robustness compare to Llama-2-13b and WizardCoder-13b? 8 claims were…

[3847]

Self-Refine Iterative Refinement Impact on CodeGen-2B HELM Benchmark Performance

6 June 2026. Score: 8.67/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 6 peer-reviewed papers addressing the following research question: How does the iterative refinement process in Self-Refine affect the response quality of codegen-2b on the HELM benchmark for language understanding tasks, and what is the quantitative difference in. 10 claims were…

[3846]

CodeLlama-7B Reasoning Accuracy on BinMetric vs. Specialized Binary Analysis LLMs

6 June 2026. Score: 8.83/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20565854

Abstract: This report synthesises findings from 1 peer-reviewed paper addressing the following research question: What is the comparative reasoning accuracy of codellama-7b-hf-float16 on binary analysis tasks versus other specialized LLMs (e.g., BinGPT, assemblyLLM) using the BinMetric benchmark? 10 claims were extracted from…

[3845]

Quantization Impact on CodeLlama-7B Performance in Binary Analysis Tasks

6 June 2026. Score: 5.30/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 7 peer-reviewed papers addressing the following research question: How do different quantization techniques (e.g., 8-bit, 4-bit) affect the performance of codellama-7b-hf on binary analysis tasks compared to the float16 baseline? 9 claims were extracted from source literature; 9…

[3844]

Lugha-Llama-8B-wura Performance on African Language Reasoning Benchmarks vs. Base Llama 8B

6 June 2026. Score: 7.20/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does Lugha-Llama-8B-wura perform on African language reasoning benchmarks compared to base Llama 8B? 12 claims were extracted from source literature; 8 were independently verified against retrieved documents.…

[3843]

XGLM-564M Performance Disparities Across Indonesian and English Language Benchmarks

6 June 2026. Score: 8.67/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20565826

Abstract: This report synthesises findings from 2 peer-reviewed papers addressing the following research question: What is the performance gap of XGLM-564M between Indonesian and English language understanding benchmarks across different educational difficulty levels? 6 claims were extracted from source literature; 6 were…

[3842]

Phi-3-Mini and Mistral-7B MT-Bench Performance Across Dialogue Domains

6 June 2026. Score: 8.57/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20565824

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What are the comparative MT-bench conversation quality scores between Phi-3-mini and Mistral-7B-v0.1 across diverse dialogue domains? 14 claims were extracted from source literature; 14 were independently…

[3841]

Phi-3-Mini and InternVL2-8B MT-Bench Performance in Multi-Turn Instruction Following

6 June 2026. Score: 9.50/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20565820

Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: What are the comparative MT-bench conversation quality scores between Phi-3-mini and InternVL2-8B when evaluated on multi-turn instruction following tasks? 14 claims were extracted from source literature; 14 were…
