Assignee Research: Index of Papers

[634]

Human Attention Benchmarks for Multi-Task Learning in Attention-Based Models

30 May 2026. Score: 7.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20453327

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: Can the human attention benchmark be used to improve the training of attention-based models through multi-task learning frameworks. Deep convolutional neural networks have performed remarkably well on many…

[633]

Multi-Layer Human Attention Masks and Explanation Quality in Deep Neural Networks

30 May 2026. Score: 8.00/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20453272

Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: What is the impact of using multi-layer human attention masks versus single-layer attention mechanisms on explanation quality scores. Deep convolutional neural networks have performed remarkably well on many…

[632]

DeepSeek-V4-Pro Cross-Domain Reasoning on ARC and HellaSwag Benchmarks

30 May 2026. Score: 8.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20453264

Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: What are the cross-domain reasoning capabilities of DeepSeek-V4-Pro when evaluated on the ARC and HellaSwag benchmarks. Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on…

[631]

Human Attention Benchmark vs. Synthetic Metrics in Model Performance Correlation

30 May 2026. Score: 8.83/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20453257

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the human attention benchmark compare to existing synthetic attention evaluation metrics in terms of correlation with model performance on downstream tasks. Many computational models of visual attention…

[630]

DeepSeek-V4-Pro and GPT-4 Performance on HumanEval Code Generation Benchmarks

30 May 2026. Score: 4.83/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: What is the performance difference between DeepSeek-V4-Pro and GPT-4 on HumanEval code generation benchmark scores. Understanding and reasoning over diagrams is a fundamental aspect of human intelligence. While…

[629]

DeepSeek-V4-Pro Inference Efficiency on MMLU and GSM8K Benchmarks

30 May 2026. Score: 2.33/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does the inference efficiency of DeepSeek-V4-Pro compare to other LLMs on standard reasoning benchmarks like MMLU and GSM8K. Rapid advancements in large language models (LLMs) have increased interest in…

[628]

DeepSeek-V3 and GPT-4 Precision and Recall in Code Smell Detection Against Human Annotations

30 May 2026. Score: 4.67/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 18 peer-reviewed papers addressing the following research question: What are the precision and recall metrics for DeepSeek-V3 in detecting specific code smell categories compared to human-annotated ground truth. Determining which Large Language Model (LLM) is superior for code…

[627]

DeepSeek-R1 and Llama-2-70B Inference Latency on GSM8K Across Hardware Configurations

30 May 2026. Score: 8.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20453193

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the inference latency of DeepSeek-R1 compare to Llama-2-70B on GSM8K across different batch sizes and hardware configurations. Finetuning language models on a collection of datasets phrased as…

[626]

DeepSeek-R1 Inference Latency on HumanEval-V Compared to Multimodal Baselines

30 May 2026. Score: 7.83/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20453166

Abstract: This report synthesises findings from 7 peer-reviewed papers addressing the following research question: What is the inference latency of DeepSeek-R1 on HumanEval-V benchmark tasks compared to baseline multimodal models. The rapid evolution of large language models (LLMs) has driven a transformative shift in…

[625]

DeepSeek-R1 and Claude Performance on SWE-Bench Verified With and Without File Context

30 May 2026. Score: 4.00/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: This report synthesises findings from 3 peer-reviewed papers addressing the following research question: What is the performance difference between DeepSeek-R1 and Claude models on SWE-bench Verified when evaluated with and without access to issue-specific file context. Code repair is a fundamental task in software…

[624]

Cross-Domain Finetuning Enhances DeepSeek-V3 Performance on GPQA Diamond

30 May 2026. Score: 8.67/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20453141

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: Does cross-domain finetuning improve DeepSeek-V3's performance on GPQA Diamond, and if so, by what percentage. The rapid evolution of large language models (LLMs) has driven a transformative shift in…

[623]

DeepSeek-V3 Pass@1 Accuracy on HumanEval: A Multi-Study Synthesis

30 May 2026. Score: 6.17/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 6 peer-reviewed papers addressing the following research question: What is the pass@1 accuracy of DeepSeek-V3 on the HumanEval benchmark for code generation tasks. As Large Language Models (LLMs) become increasingly integrated into secure software development workflows, a…

[622]

Llama-3.1-8B MBPP Performance Across Python and JavaScript Fine-Tuning Domains

30 May 2026. Score: 8.00/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20453107

Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: Does Llama-3.1-8B exhibit consistent MBPP performance across different programming language domains (e.g., Python vs. JavaScript) when fine-tuned on domain-specific code datasets. Large Language Models (LLMs)…

[621]

DeepSeek-V3 File Retrieval Accuracy and Issue Resolution Success on SWE-Bench Verified

30 May 2026. Score: 7.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does the file retrieval accuracy of DeepSeek-V3 correlate with its final issue resolution success rate on SWE-bench Verified. As Large Language Models (LLMs) become increasingly integrated into secure…

[620]

DeepSeek-V3 Inference Latency on SWE-Bench Verified vs. Baseline Models

30 May 2026. Score: 7.67/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 7 peer-reviewed papers addressing the following research question: What is the inference latency in tokens per second of DeepSeek-V3 when processing SWE-bench Verified issues compared to baseline models. The rapid evolution of large language models (LLMs) has driven a…

[619]

Llama-3.1-8B Performance on MBPP Against Open-Source 8B-Parameter Models

30 May 2026. Score: 7.90/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20453080

Abstract: This report synthesises findings from 2 peer-reviewed papers addressing the following research question: How does Llama-3.1-8B's performance on MBPP compare to other open-source 8B-parameter models like Falcon-8B or Mistral-8B in terms of pass@1 accuracy. Large Language Models (LLMs) have achieved remarkable success…

[618]

Llama-3.1-8B Performance on LiveCodeBench Under Low-Resource Constraints

30 May 2026. Score: 2.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does Llama-3.1-8B's performance on LiveCodeBench compare to smaller or similarly sized language models when evaluated under low-resource conditions or limited inference budgets. Large Language Models achieve…

[617]

PDF Preprocessing Trade-offs in GDPR-Compliant Llama-3.1-8B LiveCodeBench Performance

30 May 2026. Score: 8.33/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20453066

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How do different PDF preprocessing techniques (e.g., anonymization, content extraction methods) affect LiveCodeBench performance for Llama-3.1-8B in GDPR-compliant pipelines, and what trade-offs. Blockchains or…

[616]

Vendi-RAG Diversity-Weight Tuning for Latency-Accuracy Trade-offs in Multi-Hop QA

29 May 2026. Score: 4.83/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) for domain-specific question-answering (QA) tasks by leveraging external knowledge sources. However, traditional RAG systems primarily focus on relevance-based retrieval and often struggle with redundancy, especially when reasoning requires…

[615]

Vendi-RAG Retrieval Diversity Trade-offs in Multi-Hop QA Latency and Accuracy

29 May 2026. Score: 5.00/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) for domain-specific question-answering (QA) tasks by leveraging external knowledge sources. However, traditional RAG systems primarily focus on relevance-based retrieval and often struggle with redundancy, especially when reasoning requires…

[614]

Vendi-RAG vs. Dense Retrieval Baselines on Multi-Hop QA Benchmarks

29 May 2026. Score: 4.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) for domain-specific question-answering (QA) tasks by leveraging external knowledge sources. However, traditional RAG systems primarily focus on relevance-based retrieval and often struggle with redundancy, especially when reasoning requires…

[613]

Diversity-Weight Tuning in Vendi-RAG: Latency and EM Performance on HotpotQA

29 May 2026. Score: 3.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) for domain-specific question-answering (QA) tasks by leveraging external knowledge sources. However, traditional RAG systems primarily focus on relevance-based retrieval and often struggle with redundancy, especially when reasoning requires…

[612]

Vendi-RAG Performance on Multi-Hop QA with Top-k and Iterative Diversity Methods

29 May 2026. Score: 6.27/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) for domain-specific question-answering (QA) tasks by leveraging external knowledge sources. However, traditional RAG systems primarily focus on relevance-based retrieval and often struggle with redundancy, especially when reasoning requires…

[611]

Contriever and DPR Retrieval Accuracy on SQuAD 2.0 with 1024-Token Windows

29 May 2026. Score: 3.50/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: Retrieval-Augmented Generation (RAG) is a prevalent approach to infuse a private knowledge base of documents with Large Language Models (LLM) to build Generative Q&A (Question-Answering) systems. However, RAG accuracy becomes increasingly challenging as the corpus of documents scales up, with Retrievers playing an…

[610]

Impact of Passage Count on Retrieval Accuracy in Retrieval-Augmented Generation Systems

29 May 2026. Score: 3.67/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: Retrieval-Augmented Generation (RAG) is a prevalent approach to infuse a private knowledge base of documents with Large Language Models (LLM) to build Generative Q&A (Question-Answering) systems. However, RAG accuracy becomes increasingly challenging as the corpus of documents scales up, with Retrievers playing an…