Assignee Research: Index of Papers

[152]

What is the relative accuracy drop of decomposed vs. non-decomposed multi-hop RAG systems under adversarial qu

28 May 2026. Score: 6.33/10. Verification: L2, Source-grounded claims.

Abstract: Large Language Models (LLMs) showcase impressive capabilities but encounter challenges like hallucination, outdated knowledge, and non-transparent, untraceable reasoning processes. Retrieval-Augmented Generation (RAG) has emerged as a promising solution by incorporating knowledge from external databases. This…

[151]

How does the accuracy of LLM-based multi-hop RAG systems degrade under adversarial query perturbations (e.g.,

28 May 2026. Score: 5.33/10. Verification: L2, Source-grounded claims.

Abstract: Multi-hop question answering is a knowledge-intensive complex problem. Large Language Models (LLMs) use their Chain of Thoughts (CoT) capability to reason complex problems step by step, and retrieval-augmentation can effectively alleviate factual errors caused by outdated and unknown knowledge in LLMs. Recent works…

[150]

Does scaling the LLM size (e.g., 7B vs. 70B parameters) mitigate the accuracy loss from adversarial perturbati

28 May 2026. Score: 6.83/10. Verification: L2, Source-grounded claims.

Abstract: This comprehensive review delves into the pivotal role of prompt engineering in unleashing the capabilities of Large Language Models (LLMs). The development of Artificial Intelligence (AI), from its inception in the 1950s to the emergence of advanced neural networks and deep learning architectures, has made a…

[149]

How does the cross-domain robustness of LLM-based retriever evaluation strategies (e.g., using an LLM as a jud

28 May 2026. Score: 7.00/10. Verification: L2, Source-grounded claims.

Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, respectively, which is considerably better than the previous…

[148]

To what extent does increasing the number of hops (2-hop vs 3-hop) in multi-hop queries on MuSiQue increase re

28 May 2026. Score: 6.33/10. Verification: L2, Source-grounded claims.

Abstract: Unlike previous studies on the Metaverse based on Second Life, the current Metaverse is based on the social value of Generation Z that online and offline selves are not different. With the technological development of deep learning-based high-precision recognition models and natural generation models, Metaverse is…

[147]

How does the accuracy of multi-hop RAG reasoning on HotPotQA and MuSiQue degrade under adversarial context per

28 May 2026. Score: 8.17/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20428926

Abstract: Large Language Models (LLMs) showcase impressive capabilities but encounter challenges like hallucination, outdated knowledge, and non-transparent, untraceable reasoning processes. Retrieval-Augmented Generation (RAG) has emerged as a promising solution by incorporating knowledge from external databases. This…

[146]

Can a lightweight, severity-aware adversarial detection filter (e.g., based on embedding cosine distance) impr

28 May 2026. Score: 8.67/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20428876

Abstract: State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [1] and Fast R-CNN [2] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal…

[145]

What is the trade-off between inference throughput and multi-hop reasoning accuracy in LLM-based RAG systems w

28 May 2026. Score: 8.00/10. Verification: L2, Source-grounded claims.

Abstract: Retrieval-augmented generation has raised extensive attention as it is promising to address the limitations of large language models including outdated knowledge and hallucinations. However, retrievers struggle to capture relevance, especially for queries with complex information needs. Recent work has proposed to…

[144]

To what extent do adversarial perturbations of varying severity reduce answer accuracy and F1 score in multi-h

28 May 2026. Score: 6.50/10. Verification: L2, Source-grounded claims.

Abstract: The rapid evolution of large language models (LLMs) has driven a transformative shift in artificial intelligence (AI), reshaping both research paradigms and practical applications. Distinguished from their predecessors by unprecedented scale and advanced capabilities, LLMs necessitate new frameworks for…

[143]

To what extent does the choice of retriever (BM25 vs. dense passage retriever vs. LLM-based re-ranker) impact

28 May 2026. Score: 8.17/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20428796

Abstract: Large Language Models (LLMs) showcase impressive capabilities but encounter challenges like hallucination, outdated knowledge, and non-transparent, untraceable reasoning processes. Retrieval-Augmented Generation (RAG) has emerged as a promising solution by incorporating knowledge from external databases. This…

[142]

How does adversarial passage perturbation severity (measured via semantic similarity degradation) affect the t

28 May 2026. Score: 6.23/10. Verification: L2, Source-grounded claims.

Abstract: While Retrieval-Augmented Generation (RAG) mitigates hallucination and knowledge staleness in Large Language Models (LLMs), existing frameworks often falter on complex, multi-hop queries that require synthesizing information from disparate sources. Current advanced RAG methods, employing iterative or adaptive…

[141]

What is the impact of retriever robustness (measured via BEIR) on LLM reasoning accuracy in multi-hop QA tasks

28 May 2026. Score: 6.23/10. Verification: L2, Source-grounded claims.

Abstract: Zero-shot evaluation of information retrieval (IR) models is often performed using BEIR, a large and heterogeneous benchmark composed of multiple datasets, covering different retrieval tasks across various domains. Although BEIR has become a standard benchmark for the zero-shot setup, its exclusively English content…

[140]

To what extent does the choice of dense versus sparse retrieval method affect the correlation between BEIR rob

28 May 2026. Score: 6.33/10. Verification: L2, Source-grounded claims.

Abstract: Multi-hop question answering is a knowledge-intensive complex problem. Large Language Models (LLMs) use their Chain of Thoughts (CoT) capability to reason complex problems step by step, and retrieval-augmentation can effectively alleviate factual errors caused by outdated and unknown knowledge in LLMs. Recent works…

[139]

How does the alignment between retriever robustness scores on BEIR and downstream LLM reasoning accuracy in mu

28 May 2026. Score: 4.00/10. Verification: L2, Source-grounded claims.

Abstract: Scaling laws provide important insights that can guide the design of large language models (LLMs). Existing work has primarily focused on studying scaling laws for pretraining (upstream) loss. However, in transfer learning settings, in which LLMs are pretrained on an unsupervised dataset and then finetuned on a…

[138]

How does the inference throughput (queries per second) of Vendi-RAG compare to standard dense retriever RAG ba

28 May 2026. Score: 2.17/10. Verification: L1, Literature synthesis.

Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) for domain-specific question-answering (QA) tasks by leveraging external knowledge sources. However, traditional RAG systems primarily focus on relevance-based retrieval and often struggle with redundancy, especially when reasoning requires…

[137]

What is the trade-off between retrieval diversity and answer accuracy when applying Vendi-RAG to a dense retri

28 May 2026. Score: 7.33/10. Verification: L2, Source-grounded claims.

Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) for domain-specific question-answering (QA) tasks by leveraging external knowledge sources. However, traditional RAG systems primarily focus on relevance-based retrieval and often struggle with redundancy, especially when reasoning requires…

[136]

Can Promptriever's prompting capability be extended to improve robustness against adversarial query perturbati

28 May 2026. Score: 7.50/10. Verification: L2, Source-grounded claims.

Abstract: This article surveys and organizes research works in a new paradigm in natural language processing, which we dub “prompt-based learning.” Unlike traditional supervised learning, which trains a model to take in an input x and predict an output y as P(y|x), prompt-based learning is based on language models that…

[135]

What is the impact of instance-level instruction diversity (from MS MARCO) on Promptriever's zero-shot general

28 May 2026. Score: 8.83/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20428080

Abstract: The rapid evolution of large language models (LLMs) has driven a transformative shift in artificial intelligence (AI), reshaping both research paradigms and practical applications. Distinguished from their predecessors by unprecedented scale and advanced capabilities, LLMs necessitate new frameworks for…

[134]

How do different retrieval evaluation strategies (e.g., recall-based vs relevance-based) affect the downstream

28 May 2026. Score: 7.83/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20427995

Abstract: Large Language Models (LLMs) showcase impressive capabilities but encounter challenges like hallucination, outdated knowledge, and non-transparent, untraceable reasoning processes. Retrieval-Augmented Generation (RAG) has emerged as a promising solution by incorporating knowledge from external databases. This…

[133]

How does instruction-tuned retrieval performance on multi-hop queries from MuSiQue compare to single-context a

28 May 2026. Score: 6.23/10. Verification: L2, Source-grounded claims.

Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge to answer questions more accurately. However, research on evaluating RAG systems, particularly the retriever component, remains limited, as most existing work focuses on single-context retrieval rather than multi-hop…

[132]

What is the impact of varying the number of retrieved passages per hop on the accuracy of multi-hop reasoning

28 May 2026. Score: 6.17/10. Verification: L2, Source-grounded claims.

Abstract: Multi-hop question answering is a knowledge-intensive complex problem. Large Language Models (LLMs) use their Chain of Thoughts (CoT) capability to reason complex problems step by step, and retrieval-augmentation can effectively alleviate factual errors caused by outdated and unknown knowledge in LLMs. Recent works…

[131]

What is the impact of synthetic data generation scaling (e.g., number of training examples) on retriever MRR@1

28 May 2026. Score: 6.73/10. Verification: L2, Source-grounded claims.

Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge to answer questions more accurately. However, research on evaluating RAG systems, particularly the retriever component, remains limited, as most existing work focuses on single-context retrieval rather than multi-hop…

[130]

How does the domain shift between synthetic training data and target benchmarks (e.g., HotPotQA vs MuSiQue) af

28 May 2026. Score: 7.50/10. Verification: L2, Source-grounded claims.

Abstract: Large Language Models (LLMs) showcase impressive capabilities but encounter challenges like hallucination, outdated knowledge, and non-transparent, untraceable reasoning processes. Retrieval-Augmented Generation (RAG) has emerged as a promising solution by incorporating knowledge from external databases. This…

[129]

What is the robustness degradation of 7B and 70B LLMs on HotPotQA when context window is extended from 32K to

28 May 2026. Score: 7.33/10. Verification: L2, Source-grounded claims.

Abstract: In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family…

[128]

How does the computational throughput (tokens per second) and inference latency of 7B vs 70B LLMs scale when p

28 May 2026. Score: 7.83/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20427802

Abstract: The rapid evolution of large language models (LLMs) has driven a transformative shift in artificial intelligence (AI), reshaping both research paradigms and practical applications. Distinguished from their predecessors by unprecedented scale and advanced capabilities, LLMs necessitate new frameworks for…