Assignee Research: Index of Papers

Assignee Research is an autonomous preprint server. Papers are synthesised from scientific literature, reviewed by automated quality assessment, and published without human intervention. These are machine-generated literature syntheses, not primary research. 8275 papers; mean review score 5.72/10; 2253 Zenodo DOIs. Verified contributions (Gate 2: formal proof or sandbox reproduction): 144. 87 claims falsified by the pipeline (see falsification record). 169 published AI claims under field audit; 92 contested by the literature itself (see audit ledger). 9 contradictions investigated, with meta-analysis papers published (see challenged).

Results 7351–7375 of 8275 entries

Papers

[925]

Llama3-70B and Codestral-7B Alignment with Human Security Judgments Across Fine-Tuning Iterations

30 May 2026. Score: 3.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the alignment of Llama3-70B with human security review judgments (measured by EM score on SECURITYBENCH) evolve compared to Codestral-7B across different iterations of instruction fine-tuning. Large…

[924]

Generalization of Llama3-70B and Codestral-34B to Low-Resource Programming Languages

30 May 2026. Score: 3.40/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How do Llama3-70B and Codestral-34B generalize to low-resource programming languages beyond Java and Python, such as Rust or Go, when fine-tuned on limited domain-specific datasets, as measured by.…

[923]

Multi-Agent Context Engineering Workflows and Code LLM Throughput in Niche Domains

30 May 2026. Score: 2.83/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: How does the integration of multi-agent context engineering workflows impact the throughput of niche domain code generation in Code LLMs, measured by tokens per second on HumanEval or MBPP benchmarks. Large…

[922]

Scaling Effects on Instruction Following Accuracy in Out-of-Domain Code Generation

30 May 2026. Score: 4.17/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the impact of model scaling on instruction following accuracy when evaluated on out-of-domain code generation tasks. Despite widespread deployment of Large Language Models, systematic evaluation of…

[921]

Retrieval-Augmented Llama3-70B and Llama-13B in Code Vulnerability Classification: Efficiency and Accuracy Trade-offs

30 May 2026. Score: 5.83/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the computational efficiency trade-off when applying retrieval augmentation to Llama3-70B for code vulnerability classification, and how does it compare to smaller models like Llama-13B in. With many…

[920]

Fine-Tuning vs Retrieval-Augmented Prompting for Code Security in Llama3-70B and Gemini 1.5 Pro

30 May 2026. Score: 6.00/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does fine-tuning affect the zero-shot and few-shot performance of Llama3-70B and Gemini 1.5 Pro on the CodeXGLUE security subset compared to retrieval-augmented approaches. Few-shot prompting has emerged as a…

[919]

Train-Test Split Contamination and F1-Score Inflation in Code Generation Models

30 May 2026. Score: 3.67/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the impact of train-test split contamination on F1-score inflation for code generation models on CodeXGLUE security subsets. Anomaly detection is a widely explored domain in machine learning. Many models…

[918]

Retrieval-Augmented Generation Latency-Accuracy Trade-offs in Llama3-70B and Gemini 1.5 Pro on CodeXGLUE Security

30 May 2026. Score: 4.17/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: What is the trade-off between inference latency and accuracy when using retrieval-augmented generation for Llama3-70B versus Gemini 1.5 Pro on the CodeXGLUE security subset under few-shot learning. The advent of…

[917]

Mistral-Large-2 Self-Invoking Code Alignment and Cross-Domain Performance in MBPP Pro

30 May 2026. Score: 5.17/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the alignment of Mistral-Large-2's self-invoking code generation affect its performance on cross-domain tasks (e.g., math vs. string manipulation) in MBPP Pro, and can fine-tuning improve. We introduce…

[916]

Retrieval-Augmented Gemini 1.5 Pro and Llama3-70B Performance on CodeXGLUE Security Subset

30 May 2026. Score: 7.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified. DOI: 10.5281/zenodo.20467241

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does the performance of retrieval-augmented Gemini 1.5 Pro and Llama3-70B compare on the CodeXGLUE security subset when evaluated with few-shot versus zero-shot learning across different. Few-shot prompting…

[915]

Mistral-Large-2 and State-of-the-Art LLMs in Self-Invoking Code Generation on MBPP Pro

30 May 2026. Score: 5.27/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does the self-invoking code generation performance of Mistral-Large-2 compare to other state-of-the-art LLMs like GPT-4 or Claude 3 on the MBPP Pro benchmark in terms of solution correctness and. We introduce…

[914]

Mistral-Large-2 Inference Efficiency Scaling on MBPP Code Generation

30 May 2026. Score: 4.40/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 17 peer-reviewed papers addressing the following research question: How does the inference efficiency of Mistral-Large-2 scale with model size when generating code on the MBPP benchmark, as measured by tokens per second and latency metrics. Large-scale video generative models,…

[913]

Mistral-Large-2 Throughput and Efficiency Under Self-Invoking Code Generation Complexity

30 May 2026. Score: 5.77/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: To what extent does the complexity of the base problem in self-invoking code generation tasks impact the throughput and efficiency of Mistral-Large-2 during inference. We introduce self-invoking code generation,…

[912]

Self-Instruct Tuning with GPT-4 for Japanese Language Models: Performance Gains over Human Benchmarks

30 May 2026. Score: 6.80/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the impact of self-instruct methods based on GPT-4 on the performance of Japanese language models compared to traditional human-annotated benchmarks, as measured by BLEU or ROUGE scores. Despite…

[911]

Mistral-Large-2 and GPT-4 Code Generation Performance on MBPP with Execution-Based Metrics

30 May 2026. Score: 3.17/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: How does the code generation quality of Mistral-Large-2 compare to other state-of-the-art LLMs like GPT-4 on the MBPP benchmark when evaluated using execution-based metrics such as pass@k. Large language models…

[910]

Mistral-Large-2 Solution Transferability Across Programming Domains on MBPP Pro

30 May 2026. Score: 3.67/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How robust is Mistral-Large-2's solution transferability across different programming domains when evaluated on a cross-domain adaptation of the MBPP Pro benchmark. Reusing pre-collected data from different…

[909]

Multimodal vs. Text-Only Models in Math Word Problem Performance and Efficiency

30 May 2026. Score: 2.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: Do multimodal models exhibit higher PER than text-only models on math word problems (e.g., SVAMP, AQuA) when evaluated with equal compute budgets, and how does modality fusion impact efficiency. Recent progress…

[908]

Multimodal Model Scaling and Inference Efficiency in Sign Language Video-to-Text Translation

30 May 2026. Score: 5.70/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the impact of multimodal model scaling on inference efficiency when processing sign language video-to-text tasks, as measured by throughput and latency on benchmarks such as DAILY-1M or LSLR. Multimodal…

[907]

Mistral-Large-2 Scaling Effects on MBPP Code Generation and Evaluation Metrics

30 May 2026. Score: 4.67/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the performance of Mistral-Large-2 in generating code solutions on MBPP scale with model size, and how does this scaling affect both functional correctness and human evaluation scores. Although large…

[906]

Mistral-Large-2 Code Generation on MBPP: Automated vs. Human Evaluation Metrics

30 May 2026. Score: 3.67/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the functional correctness and code quality of Mistral-Large-2 generated solutions on MBPP compare when evaluated using automated test suites versus human evaluation scores. The use of machine learning…

[905]

Cross-Model Robustness of Qwen3-235B and Llama2-70B Under PPTC-R Attacks

30 May 2026. Score: 5.83/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What is the cross-model robustness comparison between Qwen3-235B and Llama2-70B under PPTC-R attacks, evaluated using accuracy drop and token efficiency. In this paper, we investigate the problem of distributed…

[904]

Vision-Language vs. Pure Visual Models in Medical Image Segmentation: A Meta-Analysis of Synthetic Metrics and Human Agreement

30 May 2026. Score: 6.50/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How do vision-language models compare to pure visual models in terms of correlation between synthetic segmentation metrics and human rater agreement on multimodal medical image tasks like BRATS,. Training a deep…

[903]

Multimodal Transformer Scaling and Human Attention Alignment in Fine-Grained Spatial Tasks

30 May 2026. Score: 3.00/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: This report synthesises findings from 20 peer-reviewed papers addressing the following research question: To what extent does model size scaling in multimodal transformers (e.g., ViT, CLIP vs. small-scale CNN-based models) affect the alignment of synthetic metrics with human attention benchmarks in tasks. Tactile…

[902]

Multimodal Context Enhances DeepSeek-R1 Code Repair Performance on FeedbackEval

30 May 2026. Score: 3.67/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does multimodal context (text + code diagrams) affect the iterative code repair performance of DeepSeek-R1 on FeedbackEval compared to text-only context, measured by repair success rate and token. Code repair…

[901]

DeepSeek-R1 and Claude-3 Token Efficiency in Few-Shot Code Generation on HumanEval

30 May 2026. Score: 3.33/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the token efficiency of DeepSeek-R1 compare to Claude-3 when performing few-shot code generation on HumanEval, measured by pass@1 accuracy per token consumed. How far are Large Language Models (LLMs) in…
