Assignee Research: Index of Papers

[15]

An Exploration of Data Augmentation and Sampling Techniques for Domain-Agnostic

27 May 2026. Score: 7.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: To produce a domain-agnostic question answering model for the Machine Reading Question Answering (MRQA) 2019 Shared Task, we investigate the relative benefits of large pre-trained language models, various data sampling strategies, as well as query and context paraphrases generated by back-translation. We find a…

[14]

Evaluating Test-Time Scaling LLMs for Legal Reasoning: OpenAI o1, DeepSeek-R1, a

27 May 2026. Score: 7.83/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20409932

Abstract: Recent advances in test-time scaling of large language models (LLMs), exemplified by DeepSeek-R1 and OpenAI's o1, show that extending the chain of thought during inference can significantly improve general reasoning performance. However, the impact of this paradigm on legal reasoning remains insufficiently explored.…

[13]

S*: Test Time Scaling for Code Generation

27 May 2026. Score: 7.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20409874

Abstract: Increasing test-time compute for LLMs shows promise across domains but remains underexplored in code generation, despite extensive study in math. In this paper, we propose S*, the first hybrid test-time scaling framework that substantially improves the coverage and selection accuracy of generated code. S* extends the…

[12]

Evaluating Multi-Hop Reasoning in RAG Systems: A Comparison of LLM-Based Retriev

27 May 2026. Score: 7.83/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20409804

Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge to answer questions more accurately. However, research on evaluating RAG systems, particularly the retriever component, remains limited, as most existing work focuses on single-context retrieval rather than multi-hop…

[11]

Evaluating Multi-Hop Reasoning in RAG Systems: A Comparison of LLM-Based Retriev

27 May 2026. Score: 7.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20409686

Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge to answer questions more accurately. However, research on evaluating RAG systems, particularly the retriever component, remains limited, as most existing work focuses on single-context retrieval rather than multi-hop…

[10]

Evaluating Multi-Hop Reasoning in RAG Systems: A Comparison of LLM-Based Retriev

27 May 2026. Score: 7.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20409196

Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge to answer questions more accurately. However, research on evaluating RAG systems, particularly the retriever component, remains limited, as most existing work focuses on single-context retrieval rather than multi-hop…

[9]

LLM-as-a-Judge: Reassessing the Performance of LLMs in Extractive QA

27 May 2026. Score: 7.83/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20408526

Abstract: Extractive reading comprehension question answering (QA) datasets are typically evaluated using Exact Match (EM) and F1-score, but these metrics often fail to fully capture model performance. With the success of large language models (LLMs), they have been employed in various tasks, including serving as judges…

[8]

Evaluating Multi-Hop Reasoning in RAG Systems: A Comparison of LLM-Based Retriev

27 May 2026. Score: 7.83/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20408396

Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge to answer questions more accurately. However, research on evaluating RAG systems, particularly the retriever component, remains limited, as most existing work focuses on single-context retrieval rather than multi-hop…

[7]

Learning Sparse Mixture of Experts for Visual Question Answering

27 May 2026. Score: 7.67/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: There has been rapid progress in the task of Visual Question Answering with improved model architectures. Unfortunately, these models are usually computationally intensive due to their sheer size, which poses a serious challenge for deployment. We aim to tackle this issue for the specific task of Visual Question…

[6]

Cofca: A Step-Wise Counterfactual Multi-hop QA benchmark

27 May 2026. Score: 7.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20408050

Abstract: While Large Language Models (LLMs) excel in question-answering (QA) tasks, their real reasoning abilities on multiple evidence retrieval and integration on Multi-hop QA tasks remain less explored. Firstly, LLMs sometimes generate answers that rely on internal memory rather than retrieving evidence and reasoning in…

[5]

AnyExperts: On-Demand Expert Allocation for Multimodal Language Models with Mixt

27 May 2026. Score: 7.83/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20407901

Abstract: Multimodal Mixture-of-Experts (MoE) models offer a promising path toward scalable and efficient large vision-language systems. However, existing approaches rely on rigid routing strategies (typically activating a fixed number of experts per token) ignoring the inherent heterogeneity in semantic importance across…

[4]

Adapting Foundation Vision-Language Models to Medical Diagnosis via Query-Driven

27 May 2026. Score: 7.23/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: Vision-language foundation models achieve promising performance in natural image classification, yet their direct application to medical imaging is limited by severe domain shifts, resolution mismatches, and the multi-label nature of clinical diagnosis. Training dedicated medical foundation models from scratch,…

[3]

SMoES: Soft Modality-Guided Expert Specialization in MoE-VLMs

27 May 2026. Score: 7.40/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: Mixture-of-Experts (MoE) has become a prevalent backbone for large vision-language models (VLMs), yet how modality-specific signals should guide expert routing remains under-explored. Existing routing strategies are either hand-crafted or modality-agnostic, relying on idealized priors that ignore the layer-dependent…

[2]

SMoES: Soft Modality-Guided Expert Specialization in MoE-VLMs

27 May 2026. Score: 7.83/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20406928

Abstract: Mixture-of-Experts (MoE) has become a prevalent backbone for large vision-language models (VLMs), yet how modality-specific signals should guide expert routing remains under-explored. Existing routing strategies are either hand-crafted or modality-agnostic, relying on idealized priors that ignore the layer-dependent…

[1]

SMoES: Soft Modality-Guided Expert Specialization in MoE-VLMs

27 May 2026. Score: 7.83/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20406733

Abstract: Mixture-of-Experts (MoE) has become a prevalent backbone for large vision-language models (VLMs), yet how modality-specific signals should guide expert routing remains under-explored. Existing routing strategies are either hand-crafted or modality-agnostic, relying on idealized priors that ignore the layer-dependent…