Assignee Research: Index of Papers

[336]

How do simple negative sampling techniques compare to advanced data augmentation methods for improving out-of-

28 May 2026. Score: 3.33/10. Verification: L2, Source-grounded claims.

Abstract: To produce a domain-agnostic question answering model for the Machine Reading Question Answering (MRQA) 2019 Shared Task, we investigate the relative benefits of large pre-trained language models, various data sampling strategies, as well as query and context paraphrases generated by back-translation. We find a…

[335]

What is the impact of back-translation paraphrasing techniques on QA model generalization across different mod

28 May 2026. Score: 5.33/10. Verification: L2, Source-grounded claims.

Abstract: To produce a domain-agnostic question answering model for the Machine Reading Question Answering (MRQA) 2019 Shared Task, we investigate the relative benefits of large pre-trained language models, various data sampling strategies, as well as query and context paraphrases generated by back-translation. We find a…

[334]

Can ASE framework maintain consistent accuracy scores while scaling inference budget across diverse legal doma

28 May 2026. Score: 5.67/10. Verification: L1, Literature synthesis.

Abstract: Large language models (LLMs) have shown remarkable skills across various activities, including text generation and code synthesis. Their widespread applicability, however, raises substantial concerns about security, privacy, and possibly misuse. Of recent legislative efforts, the most notable is the proposed EU AI…

[333]

How does negative sampling performance vary across different LLM architectures (7B vs 70B) when evaluated on o

28 May 2026. Score: 5.50/10. Verification: L2, Source-grounded claims.

Abstract: The complexity of multimedia applications in terms of intensity of computation and heterogeneity of treated data led the designers to embark them on multiprocessor systems on chip. The complexity of these systems on one hand and the expectations of the consumers on the other hand complicate the designers' job to…

[332]

How does the token-efficiency trade-off (accuracy per inference cost) vary between DeepSeek-R1 and o1-preview

28 May 2026. Score: 6.33/10. Verification: L2, Source-grounded claims.

Abstract: Despite increasing discussions on open-source Artificial Intelligence (AI), existing research lacks a discussion on the transparency and accessibility of state-of-the-art (SoTA) Large Language Models (LLMs). The Open Source Initiative (OSI) has recently released its first formal definition of open-source software.…

[331]

What is the throughput (queries per second) trade-off for dense retrievers (e.g., Contriever) on MuSiQue 2-hop

28 May 2026. Score: 6.00/10. Verification: L2, Source-grounded claims.

Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, respectively, which is considerably better than the previous…

[330]

How does retrieval-augmented code generation latency compare to end-to-end generation on HumanEval benchmark u

28 May 2026. Score: 5.00/10. Verification: L2, Source-grounded claims.

Abstract: The rapid evolution of large language models (LLMs) has driven a transformative shift in artificial intelligence (AI), reshaping both research paradigms and practical applications. Distinguished from their predecessors by unprecedented scale and advanced capabilities, LLMs necessitate new frameworks for…

[329]

How does the cross-lingual performance of DeepSeek-R1 and o1-preview vary across different legal sub-domains w

28 May 2026. Score: 9.00/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20435959

Abstract: The rapid evolution of large language models (LLMs) has driven a transformative shift in artificial intelligence (AI), reshaping both research paradigms and practical applications. Distinguished from their predecessors by unprecedented scale and advanced capabilities, LLMs necessitate new frameworks for…

[328]

What is the impact of evidence gap identification mechanisms in FAIR-RAG on downstream task performance measur

28 May 2026. Score: 6.50/10. Verification: L2, Source-grounded claims.

Abstract: The advent of Large Language Models (LLMs) has revolutionized Natural Language Processing, yet their application in high-stakes, specialized domains like religious question answering is hindered by challenges like hallucination and unfaithfulness to authoritative sources. This issue is particularly critical for the…

[327]

To what extent does fine-tuning on BEIR-NL improve downstream task performance in Dutch legal and news domains

28 May 2026. Score: 6.50/10. Verification: L1, Literature synthesis.

Abstract: Zero-shot evaluation of information retrieval (IR) models is often performed using BEIR, a large and heterogeneous benchmark composed of multiple datasets, covering different retrieval tasks across various domains. Although BEIR has become a standard benchmark for the zero-shot setup, its exclusively English content…

[326]

How does FAIR-RAG's faithfulness mechanism affect cross-domain generalization performance when evaluated on sp

28 May 2026. Score: 4.83/10. Verification: L2, Source-grounded claims.

Abstract: The advent of Large Language Models (LLMs) has revolutionized Natural Language Processing, yet their application in high-stakes, specialized domains like religious question answering is hindered by challenges like hallucination and unfaithfulness to authoritative sources. This issue is particularly critical for the…

[325]

How does FAIR-RAG's iterative refinement process scale in terms of inference latency and token-level processin

28 May 2026. Score: 1.83/10. Verification: L1, Literature synthesis.

Abstract: While Retrieval-Augmented Generation (RAG) mitigates hallucination and knowledge staleness in Large Language Models (LLMs), existing frameworks often falter on complex, multi-hop queries that require synthesizing information from disparate sources. Current advanced RAG methods, employing iterative or adaptive…

[324]

Can Vendi-RAG's iterative diversity-quality optimization maintain consistent performance gains when applied to

28 May 2026. Score: 7.50/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20435915

Abstract: The rapid evolution of large language models (LLMs) has driven a transformative shift in artificial intelligence (AI), reshaping both research paradigms and practical applications. Distinguished from their predecessors by unprecedented scale and advanced capabilities, LLMs necessitate new frameworks for…

[323]

What is the impact of varying the diversity-weight parameter in Vendi-RAG on retrieval throughput (queries/sec

28 May 2026. Score: 8.17/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20435906

Abstract: The rapid evolution of large language models (LLMs) has driven a transformative shift in artificial intelligence (AI), reshaping both research paradigms and practical applications. Distinguished from their predecessors by unprecedented scale and advanced capabilities, LLMs necessitate new frameworks for…

[322]

How does Vendi-RAG's iterative diversity-accuracy optimization compare to static retrieval methods like Contri

28 May 2026. Score: 7.50/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20435902

Abstract: The rapid evolution of large language models (LLMs) has driven a transformative shift in artificial intelligence (AI), reshaping both research paradigms and practical applications. Distinguished from their predecessors by unprecedented scale and advanced capabilities, LLMs necessitate new frameworks for…

[321]

How does dynamic iterative retrieval with varying passage counts per hop affect the efficiency-accuracy trade-

28 May 2026. Score: 4.00/10. Verification: L2, Source-grounded claims.

Abstract: Multi-Hop Question Answering (MHQA) tasks permeate real-world applications, posing challenges in orchestrating multi-step reasoning across diverse knowledge domains. While existing approaches have been improved with iterative retrieval, they still struggle to identify and organize dynamic knowledge. To address this,…

[320]

How does the performance of instruction-tuned retrievers on multi-hop queries from MuSiQue compare to single-c

28 May 2026. Score: 7.33/10. Verification: L2, Source-grounded claims.

Abstract: Instruction-tuned language models (LM) are able to respond to imperative commands, providing a more natural user interface compared to their base counterparts. In this work, we present Promptriever, the first retrieval model able to be prompted like an LM. To train Promptriever, we curate and release a new…

[319]

How does RAG performance vary across different external knowledge bases when evaluated on the HotPotQA benchma

28 May 2026. Score: 3.00/10. Verification: L2, Source-grounded claims.

Abstract: Recently the retrieval-augmented generation (RAG) has been successfully applied in code generation. However, existing pipelines for retrieval-augmented code generation (RACG) employ static knowledge bases with a single source, limiting the adaptation capabilities of Large Language Models (LLMs) to domains they have…

[318]

How do different embedding models (SPECTER, ConRetri(Saltz)) influence RAG performance on the Natural Question

28 May 2026. Score: 8.67/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20435839

Abstract: Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about our visual world. However, models used to tackle the rich content in…

[317]

How does Gemini 1.5 Flash compare to Gemini 1.5 Pro on retrieval accuracy when scaling context from 1M to 2M t

28 May 2026. Score: 9.00/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20435799

Abstract: In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family…

[316]

How does the inference latency of Llama-2-7B and Llama-2-70B models scale when processing 128K-token contexts

28 May 2026. Score: 7.83/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20435771

Abstract: The rapid evolution of large language models (LLMs) has driven a transformative shift in artificial intelligence (AI), reshaping both research paradigms and practical applications. Distinguished from their predecessors by unprecedented scale and advanced capabilities, LLMs necessitate new frameworks for…

[315]

How does the code generation accuracy of 13B and 34B parameter-efficient fine-tuned models compare on the Huma

28 May 2026. Score: 6.67/10. Verification: L2, Source-grounded claims.

Abstract: The rapid evolution of large language models (LLMs) has driven a transformative shift in artificial intelligence (AI), reshaping both research paradigms and practical applications. Distinguished from their predecessors by unprecedented scale and advanced capabilities, LLMs necessitate new frameworks for…

[314]

What is the impact of context length on the performance of Mixtral 8x7B versus single-check 7B models on the M

28 May 2026. Score: 7.83/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20435765

Abstract: Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate…

[313]

What is the relative contribution of retrieval versus generation components to overall task performance when a

28 May 2026. Score: 1.67/10. Verification: L2, Source-grounded claims.

Abstract: Recently the retrieval-augmented generation (RAG) has been successfully applied in code generation. However, existing pipelines for retrieval-augmented code generation (RACG) employ static knowledge bases with a single source, limiting the adaptation capabilities of Large Language Models (LLMs) to domains they have…

[312]

How do different prompting strategies affect the calibration of uncertainty estimates in retrieval-augmented l

28 May 2026. Score: 1.00/10. Verification: L2, Source-grounded claims.

Abstract: Recently the retrieval-augmented generation (RAG) has been successfully applied in code generation. However, existing pipelines for retrieval-augmented code generation (RACG) employ static knowledge bases with a single source, limiting the adaptation capabilities of Large Language Models (LLMs) to domains they have…