Assignee Research: Index of Papers

[118]

To what extent does the choice of LLM-as-a-judge (e.g., GPT-4 vs. Llama-3-70B) affect the relative ranking of

28 May 2026. Score: 8.50/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20426978

Abstract: Most Reading Comprehension methods limit themselves to queries which can be answered using a single sentence, paragraph, or document. Enabling models to combine disjoint pieces of textual evidence would extend the scope of machine comprehension methods, but currently no resources exist to train and test this…

[117]

How does the F1 score of LLM-as-a-judge evaluation compare to exact match for multi-hop HotPotQA when using it

28 May 2026. Score: 3.83/10. Verification: L2, Source-grounded claims.

Abstract: Extractive reading comprehension question answering (QA) datasets are typically evaluated using Exact Match (EM) and F1-score, but these metrics often fail to fully capture model performance. With the success of large language models (LLMs), they have been employed in various tasks, including serving as judges…

[116]

What is the generalization capability of SMoES on out-of-domain VQA benchmarks like VQA-CP v2 and A-OKVQA, and

28 May 2026. Score: 0.50/10. Verification: L2, Source-grounded claims.

Abstract: Mixture-of-Experts architectures have become the standard for scaling large language models due to their superior parameter efficiency. To accommodate the growing number of experts in practice, modern inference systems commonly adopt expert parallelism to distribute experts across devices. However, the absence of…

[115]

Can expert specialization patterns in SMoES be transferred across different vision-language tasks (e.g., capti

28 May 2026. Score: 3.00/10. Verification: L2, Source-grounded claims.

Abstract: A pivotal advancement in the progress of large language models (LLMs) is the emergence of the Mixture-of-Experts (MoE) LLMs. Compared to traditional LLMs, MoE LLMs can achieve higher performance with fewer parameters, but it is still hard to deploy them due to their immense parameter sizes. Different from previous…

[114]

How does SMoES dynamic routing compare to fixed routing baselines in terms of inference efficiency (latency an

28 May 2026. Score: 1.50/10. Verification: L2, Source-grounded claims.

Abstract: Sparse Mixture-of-Experts (MoE) models can outperform dense large language models at similar computation by activating only a small set of experts per token. However, stacking many expert modules introduces substantial parameter memory, which makes MoE models difficult to deploy in memory-constrained environments…

[113]

What is the trade-off between expert count (k) and throughput (tokens/sec) on edge CPU devices for sparse MoE

28 May 2026. Score: 5.17/10. Verification: L2, Source-grounded claims.

Abstract: Rapid advancements in large language models (LLMs) have increased interest in deploying them on mobile devices for on-device AI applications. Mobile users interact differently with LLMs compared to desktop users, creating unique expectations and data biases. Current benchmark datasets primarily target server and…

[112]

For sparse MoE vision-language models, how does the optimal number of active experts (k) change when evaluated

28 May 2026. Score: 4.50/10. Verification: L2, Source-grounded claims.

Abstract: Sparse Mixture-of-Experts (MoE) architectures enable efficient scaling of large language models through conditional computation, yet the routing mechanisms responsible for expert selection remain poorly understood. In this work, we introduce routing signatures, a vector representation summarizing expert activation…

[111]

How does varying the number of active experts (k) in sparse MoE vision-language models affect VQA accuracy and

28 May 2026. Score: 8.33/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20426299

Abstract: The rising popularity of explainable artificial intelligence (XAI) to understand high-performing black boxes raised the question of how to evaluate explanations of machine learning (ML) models. While interpretability and explainability are often presented as a subjectively validated binary property, we consider it a…

[110]

How does the MambaFormer hybrid MoE architecture's efficiency (FLOPs per token and throughput) scale with mode

28 May 2026. Score: 8.67/10. Verification: L2, Source-grounded claims.

Abstract: The rapid evolution of large language models (LLMs) has driven a transformative shift in artificial intelligence (AI), reshaping both research paradigms and practical applications. Distinguished from their predecessors by unprecedented scale and advanced capabilities, LLMs necessitate new frameworks for…

[109]

To what extent does the accuracy of multi-step retrieval pipelines for multi-hop QA degrade under noisy or adv

28 May 2026. Score: 9.17/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20426236

Abstract: This paper proposes a framework for quantitatively evaluating interactive LLMs such as ChatGPT using publicly available data sets. We carry out an extensive technical evaluation of ChatGPT using 23 data sets covering 8 different common NLP application tasks. We evaluate the multitask, multilingual and multi-modal…

[108]

How does the MambaFormer hybrid MoE architecture's efficiency (FLOPs per token and throughput) scale with mode

28 May 2026. Score: 7.17/10. Verification: L2, Source-grounded claims.

Abstract: Multilayer neural networks trained with the back-propagation algorithm constitute the best example of a successful gradient based learning technique. Given an appropriate network architecture, gradient-based learning algorithms can be used to synthesize a complex decision surface that can classify high-dimensional…

[107]

Does GPT-4's multi-hop reasoning accuracy on HotpotQA degrade monotonically with increasing retrieval steps (2

28 May 2026. Score: 8.17/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20424631

Abstract: Few-shot prompting is a surprisingly powerful way to use Large Language Models (LLMs) to solve various tasks. However, this approach struggles as the task complexity increases or when the individual reasoning steps of the task themselves are hard to learn, especially when embedded in more complex tasks. To address…

[106]

What is the impact of token-level guided routing on inference latency and cross-modal reasoning accuracy in Mo

28 May 2026. Score: 7.50/10. Verification: L2, Source-grounded claims.

Abstract: In the past years, multimodal large language models (MLLMs) have demonstrated remarkable performance in tasks such as visual question answering and visual understanding and reasoning. However, the extensive model size and high training and inference costs have hindered the widespread application of MLLMs in…

[105]

What is the accuracy drop on the HotpotQA multi-hop dataset when using a 128K-context Llama-3 model without re

28 May 2026. Score: 7.67/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20424168

Abstract: Prompting-based large language models (LLMs) are surprisingly powerful at generating natural language reasoning steps or Chains-of-Thoughts (CoT) for multi-step question answering (QA). They struggle, however, when the necessary knowledge is either unavailable to the LLM or not up-to-date within its parameters. While…

[104]

How does the Tree of Reviews framework compare to standard chain-based retrieval on the MuSiQue multi-hop QA b

28 May 2026. Score: 6.67/10. Verification: L2, Source-grounded claims.

Abstract: Multi-hop question answering is a knowledge-intensive complex problem. Large Language Models (LLMs) use their Chain of Thoughts (CoT) capability to reason complex problems step by step, and retrieval-augmentation can effectively alleviate factual errors caused by outdated and unknown knowledge in LLMs. Recent works…

[103]

Does the Tree of Reviews iterative retrieval method improve robustness to irrelevant context in multi-hop QA c

28 May 2026. Score: 8.17/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20424153

Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) for domain-specific question-answering (QA) tasks by leveraging external knowledge sources. However, traditional RAG systems primarily focus on relevance-based retrieval and often struggle with redundancy, especially when reasoning requires…

[102]

What is the impact of fine-tuning on negative interaction trajectories versus positive-only trajectories for L

28 May 2026. Score: 8.17/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20423642

Abstract: Large language models (LLMs) have been increasingly used to interact with external environments (e.g., games, compilers, APIs) as goal-driven agents. However, it remains challenging for these language agents to quickly and efficiently learn from trial-and-error as traditional reinforcement learning methods require…

[101]

How does the inference efficiency (tokens/sec and memory usage) of a 70B-parameter LLM agent compare when usin

28 May 2026. Score: 1.17/10. Verification: L2, Source-grounded claims.

Abstract: Retrieval-Augmented Generation (RAG) methods enhance LLM performance by efficiently filtering relevant context for LLMs, reducing hallucinations and inference cost. However, most existing RAG methods focus on single-step retrieval, which is often insufficient for answering complex questions that require multi-step…

[100]

Can AnyExperts' dynamic expert allocation maintain consistent accuracy improvements over dense baselines when

28 May 2026. Score: 7.50/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20423344

Abstract: Despite their remarkable achievement, gigantic transformers encounter significant drawbacks, including exorbitant computational and memory footprints during training, as well as severe collapse evidenced by a high degree of parameter redundancy. Sparsely-activated Mixture-of-Experts (SMoEs) have shown promise to…

[99]

Can AnyExperts' dynamic expert allocation maintain consistent accuracy improvements over dense baselines when

28 May 2026. Score: 3.00/10. Verification: L2, Source-grounded claims.

Abstract: Sparse Mixture-of-Experts (MoE) architectures enable efficient scaling of large language models through conditional computation, yet the routing mechanisms responsible for expert selection remain poorly understood. In this work, we introduce routing signatures, a vector representation summarizing expert activation…

[98]

What is the impact of expert capacity imbalance on AnyExperts' performance degradation when evaluated on domai

28 May 2026. Score: 8.17/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20421444

Abstract: Vision-language foundation models achieve promising performance in natural image classification, yet their direct application to medical imaging is limited by severe domain shifts, resolution mismatches, and the multi-label nature of clinical diagnosis. Training dedicated medical foundation models from scratch,…

[97]

How does AnyExperts' on-demand routing strategy compare to fixed routing baselines in terms of inference laten

28 May 2026. Score: 8.50/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20421249

Abstract: Sparse Mixture-of-Experts (MoE) architectures enable efficient scaling of large language models through conditional computation, yet the routing mechanisms responsible for expert selection remain poorly understood. In this work, we introduce routing signatures, a vector representation summarizing expert activation…

[96]

To what extent does NOVA's anomaly localization accuracy degrade when tested on out-of-distribution brain MRI

28 May 2026. Score: 8.00/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20421241

Abstract: In many real-world applications, deployed models encounter inputs that differ from the data seen during training. Out-of-distribution detection identifies whether an input stems from an unseen distribution, while open-world recognition flags such inputs to ensure the system remains robust as ever-emerging, previously…

[95]

What is the computational overhead of implementing expert bridging versus full fine-tuning in terms of inferen

28 May 2026. Score: 7.67/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20420842

Abstract: Fine-tuning Large Language Models (LLMs) is a common practice to adapt pre-trained models for specific applications. While methods like LoRA have effectively addressed GPU memory constraints during fine-tuning, their performance often falls short, especially in multi-task scenarios. In contrast, Mixture-of-Expert…

[94]

How does the performance of NOVA's open-world recognition capability compare to existing OOD detection methods

28 May 2026. Score: 7.83/10. Verification: L2, Source-grounded claims.

Abstract: Mahalanobis distance (MD) is a simple and popular post-processing method for detecting out-of-distribution (OOD) inputs in neural networks. We analyze its failure modes for near-OOD detection and propose a simple fix called relative Mahalanobis distance (RMD) which improves performance and is more robust to…