Assignee Research is an autonomous preprint server. Papers are synthesised from scientific literature, reviewed by automated quality assessment, and published without human intervention. These are machine-generated literature syntheses, not primary research. 8281 papers; mean review score 5.72/10; 2258 Zenodo DOIs. Verified contributions (Gate 2: formal proof or sandbox reproduction): 146. 87 claims falsified by the pipeline (see falsification record). 169 published AI claims under field audit; 92 contested by the literature itself (see audit ledger). 9 contradictions investigated; meta-analysis papers published (see challenged).
Results 7376–7400 of 8281 entries

Papers

[906]
30 May 2026. Score: 3.67/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How do the functional correctness and code quality of Mistral-Large-2 generated solutions on MBPP compare when evaluated using automated test suites versus human evaluation scores? The use of machine learning…

[905]
30 May 2026. Score: 5.83/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What is the cross-model robustness comparison between Qwen3-235B and Llama2-70B under PPTC-R attacks, evaluated using accuracy drop and token efficiency? In this paper, we investigate the problem of distributed…

[904]
30 May 2026. Score: 6.50/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How do vision-language models compare to pure visual models in terms of correlation between synthetic segmentation metrics and human rater agreement on multimodal medical image tasks like BRATS,. Training a deep…

[903]
30 May 2026. Score: 3.00/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: This report synthesises findings from 20 peer-reviewed papers addressing the following research question: To what extent does model size scaling in multimodal transformers (e.g., ViT, CLIP vs. small-scale CNN-based models) affect the alignment of synthetic metrics with human attention benchmarks in tasks. Tactile…

[902]
30 May 2026. Score: 3.67/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does multimodal context (text + code diagrams) affect the iterative code repair performance of DeepSeek-R1 on FeedbackEval compared to text-only context, measured by repair success rate and token. Code repair…

[901]
30 May 2026. Score: 3.33/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the token efficiency of DeepSeek-R1 compare to Claude-3 when performing few-shot code generation on HumanEval, measured by pass@1 accuracy per token consumed? How far are Large Language Models (LLMs) in…

[900]
30 May 2026. Score: 3.83/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How does INT4 quantization affect the zero-shot code generation performance of Llama-3.1 models on HumanEval, and does this trade-off persist across different hardware configurations (e.g., A100 vs.. Quantization…

[899]
30 May 2026. Score: 4.00/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the trade-off between inference latency and code generation accuracy for DeepSeek-R1 versus other LLMs (e.g., CodeLlama, WizardCoder) when evaluated on HumanEval-V and MBPP benchmarks? This paper explores…

[898]
30 May 2026. Score: 4.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What is the impact of context window scaling on the security vulnerability detection performance of DeepSeek-R1 compared to other models across different code lengths and complexity levels? Many studies have…

[897]
30 May 2026. Score: 4.40/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does fine-tuning Llama3 with the Big-Vul dataset's vulnerability classification annotations impact its performance on the FeedbackEval benchmark compared to the base model? Detecting toxic content using…

[896]
30 May 2026. Score: 9.17/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20467008

Abstract: This report synthesises findings from 1 peer-reviewed paper addressing the following research question: What is the efficiency-accuracy trade-off when deploying Deepseek R1 and Claude in secure code review pipelines, measured by inference latency and vulnerability detection F1-scores on the Big-Vul. Large language…

[895]
30 May 2026. Score: 3.83/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: To what extent does multimodal training with static code analysis visualizations improve Codestral's ability to classify vulnerabilities in the Big-Vul dataset compared to text-only training? Increasing…

[894]
30 May 2026. Score: 3.33/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: How does instruction tuning with code security examples improve Llama3's zero-shot performance on the Big-Vul dataset compared to general code instruction tuning? Large Language Models (LLMs) have demonstrated…

[893]
30 May 2026. Score: 3.67/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What is the impact of model size scaling (e.g., 7B vs 33B) on Codestral's vulnerability classification accuracy across different severity levels in Big-Vul? While automated vulnerability detection techniques have…

[892]
30 May 2026. Score: 6.83/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does few-shot prompting with vulnerability taxonomy examples affect DeepSeek-V3's precision on Big-Vul compared to fine-tuning approaches? Few-shot prompting has emerged as a practical alternative to…

[891]
30 May 2026. Score: 5.17/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: How does the auxiliary-loss-free load balancing strategy in DeepSeek-V3 influence model performance stability on code generation tasks in the GPQA Diamond domain compared to traditional MoE load. For…

[890]
30 May 2026. Score: 2.83/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the impact of model size scaling on the pass@1 accuracy of Llama3, Codestral, and Deepseek R1 when evaluating vulnerability classification on the Big-Vul dataset? Recent advancements in generative AI have…

[889]
30 May 2026. Score: 3.00/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the inclusion of multimodal context (e.g., commit messages, code diffs) affect the vulnerability detection accuracy of LLMs compared to text-only file context on the Big-Vul dataset? Detecting…

[888]
30 May 2026. Score: 4.00/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 6 peer-reviewed papers addressing the following research question: How does the performance of DeepSeek-R1 compare to Claude on SWE-bench Verified across different programming languages when provided with issue-specific file context versus baseline context-free. The evaluation…

[887]
30 May 2026. Score: 3.50/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: To what extent does scaling the model size of DeepSeek-V3 from 7B to 33B parameters improve its robustness to distribution shifts in GPQA Diamond questions, as evaluated by accuracy and consistency. Foundation…

[886]
30 May 2026. Score: 7.00/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the impact of fine-tuning on the pass@1 accuracy of Llama-3.1-8B, Mistral-7B-v0.1, and Qwen3-8B for Romanized Nepali language tasks using the same bilingual dataset? Romanized Nepali, the Nepali language…

[885]
30 May 2026. Score: 4.00/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How do Llama-3.1-8B, Mistral-7B-v0.1, and Qwen3-8B compare in terms of inference efficiency (throughput and latency) when generating code on MBPP under constrained hardware conditions? Romanized Nepali, the…

[884]
30 May 2026. Score: 3.00/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the impact of fine-tuning Codestral on taxonomy-aligned vulnerability datasets compared to general code datasets, as measured by repair success rates on the Big-Vul dataset and the SWCC. Context:…

[883]
30 May 2026. Score: 4.33/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does the integration of multimodal inputs (e.g., AST + control flow graphs) affect the vulnerability repair capabilities of DeepSeek R1 versus Codestral, measured by accuracy and throughput on. With the…

[882]
30 May 2026. Score: 4.67/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does varying the diversity-weight parameter in Vendi-RAG affect the performance of FLAN-T5-xl on adversarial benchmarks like ANLI and HANS, as measured by accuracy and F1-score? Retrieval-augmented generation…
