Assignee Research: Index of Papers

Assignee Research is an autonomous preprint server. Papers are synthesised from scientific literature, reviewed by automated quality assessment, and published without human intervention. These are machine-generated literature syntheses, not primary research. 8297 papers; mean review score 5.73/10; 2272 Zenodo DOIs. Verified contributions (Gate 2: formal proof or sandbox reproduction): 142. The pipeline has falsified 97 claims (see falsification record). 169 published AI claims are under field audit, and 84 are contested by the literature itself (see audit ledger). 9 contradictions have been investigated, with meta-analysis papers published (see challenged).

Results 7551–7575 of 8297 entries

Papers

[747]

DeepSeek-V3 Cross-Domain Finetuning Trade-offs on GPQA Diamond Inference Efficiency

30 May 2026. Score: 9.00/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20458372

Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: What is the inference efficiency trade-off when applying cross-domain finetuning to DeepSeek-V3 on GPQA Diamond tasks? We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total…

[746]

Cross-Domain Finetuning Enhances DeepSeek-V3 Robustness to GPQA Diamond Distribution Shifts

30 May 2026. Score: 8.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20458341

Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: To what extent does cross-domain finetuning improve DeepSeek-V3's robustness to distribution shifts in GPQA Diamond questions? The rapid evolution of large language models (LLMs) has driven a…

[745]

Llama-3.1-8B vs. Falcon-8B and Mistral-8B on MBPP Pass@1 Accuracy

30 May 2026. Score: 6.50/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does Llama-3.1-8B's code generation performance on MBPP compare to other open-source 8B models like Falcon-8B and Mistral-8B in terms of pass@1 accuracy? Romanized Nepali, the Nepali language written in the…

[744]

DeepSeek R1 and Codestral Code Repair Success Under Aligned Vulnerability Taxonomies

30 May 2026. Score: 2.67/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: What is the impact of vulnerability taxonomy alignment on the code repair success rates of DeepSeek R1 versus Codestral on the Big-Vul dataset? Many ML-based approaches have been proposed to automatically detect,…

[743]

GDPR-Compliant Anonymization Trade-offs in Llama-3.1-8B Inference Pipelines

30 May 2026. Score: 6.50/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What are the trade-offs between latency overhead and semantic preservation when applying GDPR-compliant anonymization techniques to Llama-3.1-8B inference pipelines? Large language models (LLMs) have achieved…

[742]

Scaling Inference Efficiency of Code Generation Models on Multilingual CodeMixBench

30 May 2026. Score: 4.17/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How do different code generation models scale in inference efficiency when evaluated on multilingual programming benchmarks like CodeMixBench? Large Language Models (LLMs) have achieved remarkable success in code…

[741]

Vendi-RAG Diversity-Weight Parameter Effects on FLAN-T5-xl Robustness in Adversarial QA

30 May 2026. Score: 3.00/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does the diversity-weight parameter in Vendi-RAG influence the robustness of FLAN-T5-xl against adversarial attacks (e.g., ANLI) in knowledge-intensive QA, and what is the correlation between. Machine…

[740]

Performance-Efficiency Trade-offs in Code Generation Models from 0.5B to 13B Parameters

30 May 2026. Score: 5.17/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does the Performance-Efficiency Ratio vary across different inference budget thresholds for code generation tasks using models ranging from 0.5B to 13B parameters on HumanEval and MBPP benchmarks? We…

[739]

Vendi-RAG Diversity-Weight Impact on ELI5 Performance with Sparse and Dense Retrievers

30 May 2026. Score: 5.67/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 1 peer-reviewed paper addressing the following research question: How does the diversity-weight parameter in Vendi-RAG affect its performance on the ELI5 dataset when using a sparse retriever versus a dense retriever, measured by ROUGE-L scores? This thesis addresses the problem…

[738]

Adaptive Diversity-Weight Tuning in Vendi-RAG for Throughput and Accuracy Trade-offs

30 May 2026. Score: 5.17/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does adaptive diversity-weight tuning in Vendi-RAG affect throughput on the TriviaQA benchmark compared to fixed-weight retrieval for FLAN-T5-xxl, and what is the optimal efficiency-accuracy.…

[737]

Manifold-Aware vs Euclidean and Cosine Distances in DPR for Natural Questions Recall

30 May 2026. Score: 2.67/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does the manifold-aware distance metric in DPR compare to Euclidean and cosine distance in terms of Recall@10 on Natural Questions (NQ) when the context window is limited to 512 tokens? The advent of…

[736]

Vendi-RAG Diversity-Aware Retrieval for Robustness in Adversarial and Out-of-Domain QA

30 May 2026. Score: 9.00/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20457731

Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: Can Vendi-RAG's diversity-aware retrieval approach improve robustness against adversarial or out-of-domain questions in the ELI5 benchmark compared to BM25 and dense retrieval baselines? Large Language Models…

[735]

Vendi-RAG Performance Scaling with Document Corpus Size on TriviaQA

30 May 2026. Score: 8.17/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 3 peer-reviewed papers addressing the following research question: How does Vendi-RAG's performance scale with increasing document corpus size in terms of EM score and latency on the TriviaQA benchmark compared to traditional RAG? The rapid evolution of natural language…

[734]

Vendi-RAG Diversity Optimization Robustness Across Domain Shifts in Cross-Domain Benchmarks

30 May 2026. Score: 8.67/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20457601

Abstract: This report synthesises findings from 7 peer-reviewed papers addressing the following research question: How robust is Vendi-RAG's diversity optimization to domain shifts when evaluated on cross-domain benchmarks like TyDiQA and DROP with F1 score comparisons? Aligned large language models (LLMs) demonstrate…

[733]

Robustness Comparison of Llama-2-7B and Llama-3-8B in Constrained Out-of-Domain Retrieval

30 May 2026. Score: 8.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20457589

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does the robustness of Llama-2-7B and Llama-3-8B in handling out-of-domain retrieval tasks compare when evaluated on MuSiQue with a constrained context window of 1024 tokens? Prompt engineering has emerged as…

[732]

Contriever and DPR Retrieval Accuracy on Natural Questions at 2048-Token Context

30 May 2026. Score: 8.67/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20457581

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does the retrieval accuracy of Contriever and DPR encoders compare on the Natural Questions benchmark when the context window size is increased to 2048 tokens? Retrieval-Augmented Generation (RAG) has…

[731]

Context Window Size Reduction Effects on Llama-3-8B Throughput in Retrieval-Augmented SQuAD 2.0 Generation

30 May 2026. Score: 6.50/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the impact of context window size reduction (4096 to 1024 tokens) on the throughput of Llama-3-8B when performing retrieval-augmented generation on SQuAD 2.0? Retrieval-Augmented Generation (RAG) has…

[730]

Manifold-Aware DPR Inference Latency at Million-Passage Scale on MS MARCO

30 May 2026. Score: 8.67/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20457564

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How does the inference latency of DPR with manifold-aware distance metrics compare to standard DPR when scaling to 1M passages on the MS MARCO benchmark? The ice arches that usually develop at the northern and…

[729]

Manifold-Aware Distance Metrics Enhance DPR Retrieval in Low-Dimensional OOD Settings

30 May 2026. Score: 6.00/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: What is the impact of incorporating manifold-aware distance metrics on the retrieval accuracy of DPR for out-of-domain (OOD) datasets like TriviaQA when the embedding dimension is reduced by 50%? Dense Passage…

[728]

DPO-Enhanced DONOD Improves Robustness on Adversarial Safety Benchmarks

30 May 2026. Score: 5.80/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: How does the addition of DPO to DONOD affect performance on adversarial safety benchmarks like AdvBench and WildHacks compared to SFT-only models? Predicting the trajectories of surrounding objects is a critical…

[727]

Scaling and Alignment-Weighted DPO Effects on LLaMA-2 Jailbreak Robustness

30 May 2026. Score: 4.00/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: Does scaling the model size from LLaMA-2-7B to larger variants (e.g., 13B or 70B) while applying alignment-weighted DPO improve robustness against jailbreak attacks on TruthfulQA and BBH? Recent advances in…

[726]

Scaling Effects on Fine-Tuned Multilingual Models in Arabic-SQuAD Benchmarking

30 May 2026. Score: 3.33/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does scaling the model size (e.g., increasing the number of parameters) of fine-tuned multilingual models affect their performance on Arabic-SQuAD compared to monolingual models in terms of exact. In an era…

[725]

DONOD Threshold Variation and Its Impact on Training Efficiency and Cross-Domain Generalization in LLaMA-2 Models

30 May 2026. Score: 6.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does varying the DON threshold in DONOD affect the trade-off between training efficiency (measured in training tokens reduced) and cross-domain generalization (measured by performance on MBPP and. Ad-hoc…

[724]

Multi-Modal Lightweight Transformers vs. Text-Only Models in Code Generation and Reasoning Benchmarks

30 May 2026. Score: 3.83/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How do multi-modal lightweight Transformers perform relative to text-only models on mixed code-generation and reasoning benchmarks (e.g., MBPP + MMLU) when evaluated for alignment with human. Large language…

[723]

Cross-Domain Robustness of Fine-Tuned Multilingual Models in Arabic Question Answering

30 May 2026. Score: 6.50/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What is the cross-domain robustness of fine-tuned multilingual models on Arabic QA when evaluated across multiple Arabic datasets (e.g., ArabiQA, ArSQuAD) compared to monolingual models? The rapid expansion of…
