Assignee Research: Index of Papers

[141]

What is the impact of retriever robustness (measured via BEIR) on LLM reasoning accuracy in multi-hop QA tasks

28 May 2026. Score: 6.23/10. Verification: L2, Source-grounded claims.

Abstract: Zero-shot evaluation of information retrieval (IR) models is often performed using BEIR, a large and heterogeneous benchmark composed of multiple datasets, covering different retrieval tasks across various domains. Although BEIR has become a standard benchmark for the zero-shot setup, its exclusively English content…

[140]

To what extent does the choice of dense versus sparse retrieval method affect the correlation between BEIR rob

28 May 2026. Score: 6.33/10. Verification: L2, Source-grounded claims.

Abstract: Multi-hop question answering is a knowledge-intensive complex problem. Large Language Models (LLMs) use their Chain of Thoughts (CoT) capability to reason complex problems step by step, and retrieval-augmentation can effectively alleviate factual errors caused by outdated and unknown knowledge in LLMs. Recent works…

[139]

How does the alignment between retriever robustness scores on BEIR and downstream LLM reasoning accuracy in mu

28 May 2026. Score: 4.00/10. Verification: L2, Source-grounded claims.

Abstract: Scaling laws provide important insights that can guide the design of large language models (LLMs). Existing work has primarily focused on studying scaling laws for pretraining (upstream) loss. However, in transfer learning settings, in which LLMs are pretrained on an unsupervised dataset and then finetuned on a…

[138]

How does the inference throughput (queries per second) of Vendi-RAG compare to standard dense retriever RAG ba

28 May 2026. Score: 2.17/10. Verification: L1, Literature synthesis.

Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) for domain-specific question-answering (QA) tasks by leveraging external knowledge sources. However, traditional RAG systems primarily focus on relevance-based retrieval and often struggle with redundancy, especially when reasoning requires…

[137]

What is the trade-off between retrieval diversity and answer accuracy when applying Vendi-RAG to a dense retri

28 May 2026. Score: 7.33/10. Verification: L2, Source-grounded claims.

Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) for domain-specific question-answering (QA) tasks by leveraging external knowledge sources. However, traditional RAG systems primarily focus on relevance-based retrieval and often struggle with redundancy, especially when reasoning requires…

[136]

Can Promptriever's prompting capability be extended to improve robustness against adversarial query perturbati

28 May 2026. Score: 7.50/10. Verification: L2, Source-grounded claims.

Abstract: This article surveys and organizes research works in a new paradigm in natural language processing, which we dub “prompt-based learning.” Unlike traditional supervised learning, which trains a model to take in an input x and predict an output y as P(y|x), prompt-based learning is based on language models that…

[135]

What is the impact of instance-level instruction diversity (from MS MARCO) on Promptriever's zero-shot general

28 May 2026. Score: 8.83/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20428080

Abstract: The rapid evolution of large language models (LLMs) has driven a transformative shift in artificial intelligence (AI), reshaping both research paradigms and practical applications. Distinguished from their predecessors by unprecedented scale and advanced capabilities, LLMs necessitate new frameworks for…

[134]

How do different retrieval evaluation strategies (e.g., recall-based vs relevance-based) affect the downstream

28 May 2026. Score: 7.83/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20427995

Abstract: Large Language Models (LLMs) showcase impressive capabilities but encounter challenges like hallucination, outdated knowledge, and non-transparent, untraceable reasoning processes. Retrieval-Augmented Generation (RAG) has emerged as a promising solution by incorporating knowledge from external databases. This…

[133]

How does instruction-tuned retrieval performance on multi-hop queries from MuSiQue compare to single-context a

28 May 2026. Score: 6.23/10. Verification: L2, Source-grounded claims.

Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge to answer questions more accurately. However, research on evaluating RAG systems, particularly the retriever component, remains limited, as most existing work focuses on single-context retrieval rather than multi-hop…

[132]

What is the impact of varying the number of retrieved passages per hop on the accuracy of multi-hop reasoning

28 May 2026. Score: 6.17/10. Verification: L2, Source-grounded claims.

Abstract: Multi-hop question answering is a knowledge-intensive complex problem. Large Language Models (LLMs) use their Chain of Thoughts (CoT) capability to reason complex problems step by step, and retrieval-augmentation can effectively alleviate factual errors caused by outdated and unknown knowledge in LLMs. Recent works…

[131]

What is the impact of synthetic data generation scaling (e.g., number of training examples) on retriever MRR@1

28 May 2026. Score: 6.73/10. Verification: L2, Source-grounded claims.

Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge to answer questions more accurately. However, research on evaluating RAG systems, particularly the retriever component, remains limited, as most existing work focuses on single-context retrieval rather than multi-hop…

[130]

How does the domain shift between synthetic training data and target benchmarks (e.g., HotPotQA vs MuSiQue) af

28 May 2026. Score: 7.50/10. Verification: L2, Source-grounded claims.

Abstract: Large Language Models (LLMs) showcase impressive capabilities but encounter challenges like hallucination, outdated knowledge, and non-transparent, untraceable reasoning processes. Retrieval-Augmented Generation (RAG) has emerged as a promising solution by incorporating knowledge from external databases. This…

[129]

What is the robustness degradation of 7B and 70B LLMs on HotPotQA when context window is extended from 32K to

28 May 2026. Score: 7.33/10. Verification: L2, Source-grounded claims.

Abstract: In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family…

[128]

How does the computational throughput (tokens per second) and inference latency of 7B vs 70B LLMs scale when p

28 May 2026. Score: 7.83/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20427802

Abstract: The rapid evolution of large language models (LLMs) has driven a transformative shift in artificial intelligence (AI), reshaping both research paradigms and practical applications. Distinguished from their predecessors by unprecedented scale and advanced capabilities, LLMs necessitate new frameworks for…

[127]

How does the number of reasoning hops in multi-hop QA benchmarks (2-hop vs 3-hop in HotPotQA) affect the relat

28 May 2026. Score: 7.50/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20427753

Abstract: Large Language Models (LLMs) showcase impressive capabilities but encounter challenges like hallucination, outdated knowledge, and non-transparent, untraceable reasoning processes. Retrieval-Augmented Generation (RAG) has emerged as a promising solution by incorporating knowledge from external databases. This…

[126]

Cross-benchmark generalization of PRISM framework's robustness to irrelevant context: how do Llama-3, Mistral,

28 May 2026. Score: 6.67/10. Verification: L2, Source-grounded claims.

Abstract: Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks, since the release of ChatGPT in November 2022. LLMs' ability of general-purpose language understanding and generation is acquired by training billions of model's parameters on massive…

[125]

To what extent does the marginal accuracy improvement of extending context windows from 32K to 128K tokens in

28 May 2026. Score: 8.33/10. Verification: L2, Source-grounded claims.

Abstract: Retrieval-Augmented Generation (RAG) has quickly grown into a pivotal paradigm in the development of Large Language Models (LLMs). Although existing research mainly emphasizes accuracy and efficiency, the trustworthiness of RAG systems remains insufficiently explored. RAG can improve LLM reliability by grounding…

[124]

What is the impact of iterative retrieval agent depth (number of retrieval rounds) on final answer F1 score an

28 May 2026. Score: 4.67/10. Verification: L1, Literature synthesis.

Abstract: In this paper, we present Hierarchical Graph Network (HGN) for multi-hop question answering. To aggregate clues from scattered texts across multiple paragraphs, a hierarchical graph is created by constructing nodes on different levels of granularity (questions, paragraphs, sentences, entities), the representations of…

[123]

How does PRISM framework's retrieval efficiency (latency and throughput) compare to end-to-end multi-hop QA pi

28 May 2026. Score: 6.17/10. Verification: L2, Source-grounded claims.

Abstract: Large Language Models (LLMs) have demonstrated significant performance improvements across various cognitive tasks. An emerging application is using LLMs to enhance retrieval-augmented generation (RAG) capabilities. These systems require LLMs to understand user queries, retrieve relevant information, and synthesize…

[122]

Does iterative retrieval with visual frame reranking mitigate video content drift degradation more effectively

28 May 2026. Score: 3.83/10. Verification: L1, Literature synthesis.

Abstract: Recent developments in modeling language and vision have been successfully applied to image question answering. It is both crucial and natural to extend this research direction to the video domain for video question answering (VideoQA). Compared to the image domain where large scale and fully annotated benchmark…

[121]

How does the performance of VideoRAG compare to temporal video question answering models on long-form video un

28 May 2026. Score: 8.83/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20427115

Abstract: We present HERO, a novel framework for large-scale video+language omnirepresentation learning. HERO encodes multimodal inputs in a hierarchical structure, where local context of a video frame is captured by a Cross-modal Transformer via multimodal fusion, and global video context is captured by a Temporal…

[120]

What is the inference latency overhead of VideoRAG's retrieval-augmented approach compared to vanilla video un

28 May 2026. Score: 6.00/10. Verification: L1, Literature synthesis.

Abstract: Large Language Models (LLMs) suffer from hallucinations and outdated knowledge due to their reliance on static training data. Retrieval-Augmented Generation (RAG) mitigates these issues by integrating external dynamic information for improved factual grounding. With advances in multimodal learning, Multimodal RAG…

[119]

What is the computational cost (in FLOPs or latency) versus F1 score trade-off when scaling context windows fr

28 May 2026. Score: 7.33/10. Verification: L1, Literature synthesis.

Abstract: Large Language Models (LLMs) showcase impressive capabilities but encounter challenges like hallucination, outdated knowledge, and non-transparent, untraceable reasoning processes. Retrieval-Augmented Generation (RAG) has emerged as a promising solution by incorporating knowledge from external databases. This…

[118]

To what extent does the choice of LLM-as-a-judge (e.g., GPT-4 vs. Llama-3-70B) affect the relative ranking of

28 May 2026. Score: 8.50/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20426978

Abstract: Most Reading Comprehension methods limit themselves to queries which can be answered using a single sentence, paragraph, or document. Enabling models to combine disjoint pieces of textual evidence would extend the scope of machine comprehension methods, but currently no resources exist to train and test this…

[117]

How does the F1 score of LLM-as-a-judge evaluation compare to exact match for multi-hop HotPotQA when using it

28 May 2026. Score: 3.83/10. Verification: L2, Source-grounded claims.

Abstract: Extractive reading comprehension question answering (QA) datasets are typically evaluated using Exact Match (EM) and F1-score, but these metrics often fail to fully capture model performance. With the success of large language models (LLMs), they have been employed in various tasks, including serving as judges…