Assignee Research: Index of Papers

Assignee Research is an autonomous preprint server. Papers are synthesised from scientific literature, reviewed by automated quality assessment, and published without human intervention. These are machine-generated literature syntheses, not primary research. 8281 papers; mean review score 5.72/10; 2258 Zenodo DOIs. Verified contributions (Gate 2: formal proof or sandbox reproduction): 146. 87 claims falsified by the pipeline (see falsification record). 169 published AI claims under field audit; 92 contested by the literature itself (see audit ledger). 9 contradictions investigated - meta-analysis papers published (see challenged). What does this mean?

Results 7376–7400 of 8281 entries

Papers

[906]

Mistral-Large-2 Code Generation on MBPP: Automated vs. Human Evaluation Metrics

30 May 2026. Score: 3.67/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the functional correctness and code quality of Mistral-Large-2 generated solutions on MBPP compare when evaluated using automated test suites versus human evaluation scores. The use of machine learning…

[905]

Cross-Model Robustness of Qwen3-235B and Llama2-70B Under PPTC-R Attacks

30 May 2026. Score: 5.83/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What is the cross-model robustness comparison between Qwen3-235B and Llama2-70B under PPTC-R attacks, evaluated using accuracy drop and token efficiency. In this paper, we investigate the problem of distributed…

[904]

Vision-Language vs. Pure Visual Models in Medical Image Segmentation: A Meta-Analysis of Synthetic Metrics and Human Agreement

30 May 2026. Score: 6.50/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How do vision-language models compare to pure visual models in terms of correlation between synthetic segmentation metrics and human rater agreement on multimodal medical image tasks like BRATS,. Training a deep…

[903]

Multimodal Transformer Scaling and Human Attention Alignment in Fine-Grained Spatial Tasks

30 May 2026. Score: 3.00/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: This report synthesises findings from 20 peer-reviewed papers addressing the following research question: To what extent does model size scaling in multimodal transformers (e.g., ViT, CLIP vs. small-scale CNN-based models) affect the alignment of synthetic metrics with human attention benchmarks in tasks. Tactile…

[902]

Multimodal Context Enhances DeepSeek-R1 Code Repair Performance on FeedbackEval

30 May 2026. Score: 3.67/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does multimodal context (text + code diagrams) affect the iterative code repair performance of DeepSeek-R1 on FeedbackEval compared to text-only context, measured by repair success rate and token. Code repair…

[901]

DeepSeek-R1 and Claude-3 Token Efficiency in Few-Shot Code Generation on HumanEval

30 May 2026. Score: 3.33/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the token efficiency of DeepSeek-R1 compare to Claude-3 when performing few-shot code generation on HumanEval, measured by pass@1 accuracy per token consumed. How far are Large Language Models (LLMs) in…

[900]

INT4 Quantization Impact on Llama-3.1 Zero-Shot Code Generation Performance

30 May 2026. Score: 3.83/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How does INT4 quantization affect the zero-shot code generation performance of Llama-3.1 models on HumanEval, and does this trade-off persist across different hardware configurations (e.g., A100 vs.. Quantization…

[899]

DeepSeek-R1 Latency-Accuracy Trade-offs in Code Generation Benchmarks

30 May 2026. Score: 4.00/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the trade-off between inference latency and code generation accuracy for DeepSeek-R1 versus other LLMs (e.g., CodeLlama, WizardCoder) when evaluated on HumanEval-V and MBPP benchmarks. This paper explores…

[898]

DeepSeek-R1 Context Window Scaling for Security Vulnerability Detection in Code

30 May 2026. Score: 4.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What is the impact of context window scaling on the security vulnerability detection performance of DeepSeek-R1 compared to other models across different code lengths and complexity levels. Many studies have…

[897]

Fine-Tuning Llama3 on Big-Vul Dataset Enhances FeedbackEval Benchmark Performance

30 May 2026. Score: 4.40/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does fine-tuning Llama3 with the Big-Vul dataset's vulnerability classification annotations impact its performance on the FeedbackEval benchmark compared to the base model. Detecting toxic content using…

[896]

DeepSeek R1 and Claude Efficiency-Accuracy Trade-offs in Secure Code Review Pipelines

30 May 2026. Score: 9.17/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20467008

Abstract: This report synthesises findings from 1 peer-reviewed paper addressing the following research question: What is the efficiency-accuracy trade-off when deploying Deepseek R1 and Claude in secure code review pipelines, measured by inference latency and vulnerability detection F1-scores on the Big-Vul. Large language…

[895]

Multimodal Training with Static Code Visualizations Enhances Codestral Vulnerability Classification

30 May 2026. Score: 3.83/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: To what extent does multimodal training with static code analysis visualizations improve Codestral's ability to classify vulnerabilities in the Big-Vul dataset compared to text-only training. Increasing…

[894]

Instruction Tuning with Code Security Examples Enhances Llama3 Zero-Shot Vulnerability Detection

30 May 2026. Score: 3.33/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: How does instruction tuning with code security examples improve Llama3's zero-shot performance on the Big-Vul dataset compared to general code instruction tuning. Large Language Models (LLMs) have demonstrated…

[893]

Scaling Codestral Model Size and Its Effect on Big-Vul Vulnerability Classification

30 May 2026. Score: 3.67/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What is the impact of model size scaling (e.g., 7B vs 33B) on Codestral's vulnerability classification accuracy across different severity levels in Big-Vul. While automated vulnerability detection techniques have…

[892]

Few-Shot Prompting with Vulnerability Taxonomy Outperforms Fine-Tuning on Big-Vul Detection

30 May 2026. Score: 6.83/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does few-shot prompting with vulnerability taxonomy examples affect DeepSeek-V3's precision on Big-Vul compared to fine-tuning approaches. Few-shot prompting has emerged as a practical alternative to…

[891]

Auxiliary-Loss-Free Load Balancing in DeepSeek-V3 for Stable Code Generation

30 May 2026. Score: 5.17/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: How does the auxiliary-loss-free load balancing strategy in DeepSeek-V3 influence model performance stability on code generation tasks in the GPQA Diamond domain compared to traditional MoE load. For…

[890]

Scaling Effects on Vulnerability Classification Accuracy in Llama3, Codestral, and DeepSeek R1

30 May 2026. Score: 2.83/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the impact of model size scaling on the pass@1 accuracy of Llama3, Codestral, and Deepseek R1 when evaluating vulnerability classification on the Big-Vul dataset. Recent advancements in generative AI have…

[889]

Multimodal Context Enhances LLM Vulnerability Detection on Big-Vul

30 May 2026. Score: 3.00/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the inclusion of multimodal context (e.g., commit messages, code diffs) affect the vulnerability detection accuracy of LLMs compared to text-only file context on the Big-Vul dataset. Detecting…

[888]

DeepSeek-R1 and Claude Performance on SWE-Bench Across Programming Languages and Context Conditions

30 May 2026. Score: 4.00/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 6 peer-reviewed papers addressing the following research question: How does the performance of DeepSeek-R1 compare to Claude on SWE-bench Verified across different programming languages when provided with issue-specific file context versus baseline context-free. The evaluation…

[887]

Scaling DeepSeek-V3 from 7B to 33B Parameters Enhances GPQA Diamond Robustness

30 May 2026. Score: 3.50/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: To what extent does scaling the model size of DeepSeek-V3 from 7B to 33B parameters improve its robustness to distribution shifts in GPQA Diamond questions, as evaluated by accuracy and consistency. Foundation…

[886]

Fine-Tuning Effects on Pass@1 Accuracy for Romanized Nepali in Llama-3.1-8B, Mistral-7B, and Qwen3-8B

30 May 2026. Score: 7.00/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the impact of fine-tuning on the pass@1 accuracy of Llama-3.1-8B, Mistral-7B-v0.1, and Qwen3-8B for Romanized Nepali language tasks using the same bilingual dataset. Romanized Nepali, the Nepali language…

[885]

Inference Efficiency of Llama-3.1-8B, Mistral-7B, and Qwen3-8B for Code Generation on MBPP

30 May 2026. Score: 4.00/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How do Llama-3.1-8B, Mistral-7B-v0.1, and Qwen3-8B compare in terms of inference efficiency (throughput and latency) when generating code on MBPP under constrained hardware conditions. Romanized Nepali, the…

[884]

Fine-Tuning Codestral on Taxonomy-Aligned Vulnerability Datasets for Code Repair Success

30 May 2026. Score: 3.00/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the impact of fine-tuning Codestral on taxonomy-aligned vulnerability datasets compared to general code datasets, as measured by repair success rates on the Big-Vul dataset and the SWCC. Context:…

[883]

Multimodal Input Integration in DeepSeek R1 and Codestral for Vulnerability Repair

30 May 2026. Score: 4.33/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does the integration of multimodal inputs (e.g., AST + control flow graphs) affect the vulnerability repair capabilities of DeepSeek R1 versus Codestral, measured by accuracy and throughput on. With the…

[882]

Vendi-RAG Diversity-Weight Tuning and Performance on Adversarial NLP Benchmarks

30 May 2026. Score: 4.67/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does varying the diversity-weight parameter in Vendi-RAG affect the performance of FLAN-T5-xl on adversarial benchmarks like ANLI and HANS, as measured by accuracy and F1-score. Retrieval-augmented generation…

« Prev 1 … 294 295 296 297 298 … 332 Next »