Assignee Research: Index of Papers

[388]

How do different routing strategies (top-k vs. noisy top-k) in SparseMoE vision-language models influence expe

29 May 2026. Score: 5.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: Vision-Language Models (VLMs) have gained widespread adoption in Open-Vocabulary (OV) object detection and segmentation tasks. Although they have shown promise on OV-related tasks, their effectiveness in conventional vision tasks has thus far been unevaluated. In this work, we present the systematic review of VLM-based…

[387]

Does the self-invoking code generation task in HumanEval Pro reveal systematic failure modes in MambaFormer's

29 May 2026. Score: 7.00/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: As Large Language Models (LLMs) become increasingly integrated into secure software development workflows, a critical question remains unanswered: can these models not only detect insecure code but also reliably classify vulnerabilities according to standardized taxonomies? In this work, we conduct a systematic…

[386]

How does varying the number of active experts in a Mixture-of-Experts Transformer affect pass@k accuracy on Hu

29 May 2026. Score: 6.30/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: Mixture-of-experts (MoE) models that employ sparse activation have demonstrated effectiveness in significantly increasing the number of parameters while maintaining low computational requirements per token. However, recent studies have established that MoE models are inherently parameter-inefficient as the…

[385]

What is the relationship between the number of active experts per token in SoftMoE and spatial reasoning perfo

29 May 2026. Score: 8.00/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: Physics-Informed Neural Networks (PINN) are neural networks (NNs) that encode model equations, like Partial Differential Equations (PDE), as a component of the neural network itself. PINNs are nowadays used to solve PDEs, fractional equations, integral-differential equations, and stochastic PDEs. This novel…

[384]

Can Vendi-RAG maintain answer accuracy on the HotpotQA benchmark when the diversity weight hyperparameter is s

29 May 2026. Score: 7.67/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks, since the release of ChatGPT in November 2022. LLMs' ability of general-purpose language understanding and generation is acquired by training billions of model's parameters on massive…

[383]

How does Vendi-RAG's iterative retrieval process compare to fixed-top-k retrieval on the 2WikiMultihop benchma

29 May 2026. Score: 5.33/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: In the last few years, the deep learning (DL) computing paradigm has been deemed the Gold Standard in the machine learning (ML) community. Moreover, it has gradually become the most widely used computational approach in the field of ML, thus achieving outstanding results on several complex cognitive tasks, matching…

[382]

How does the accuracy of Tree of Reviews on MuSiQue at 128K context degrade when the number of distractor pass

29 May 2026. Score: 8.00/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20437593

Abstract: Long-context capability is considered one of the most important abilities of LLMs, as a truly long context-capable LLM shall enable its users to effortlessly process many originally exhausting tasks (e.g., digesting a long-form document to find answers vs. directly asking an LLM about it). However, existing…

[381]

How does the Tree of Reviews framework's F1 score on the MuSiQue benchmark vary when evaluated with Llama-3-8B

29 May 2026. Score: 8.83/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20437582

Abstract: Large Language Models (LLMs) showcase impressive capabilities but encounter challenges like hallucination, outdated knowledge, and non-transparent, untraceable reasoning processes. Retrieval-Augmented Generation (RAG) has emerged as a promising solution by incorporating knowledge from external databases. This…

[380]

What is the token efficiency ratio (total input tokens processed per correct answer) of the Tree of Reviews me

29 May 2026. Score: 8.17/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20437574

Abstract: The rapid evolution of large language models (LLMs) has driven a transformative shift in artificial intelligence (AI), reshaping both research paradigms and practical applications. Distinguished from their predecessors by unprecedented scale and advanced capabilities, LLMs necessitate new frameworks for…

[379]

Does LongNav-R1's multi-turn RL approach improve generalization to unseen environments on the RxR-CE benchmark

29 May 2026. Score: 7.33/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: Unlike previous studies on the Metaverse based on Second Life, the current Metaverse is based on the social value of Generation Z that online and offline selves are not different. With the technological development of deep learning-based high-precision recognition models and natural generation models, Metaverse is…

[378]

What is the impact of scaling the VLA model size (e.g., from 7B to 13B parameters) on the average reward and t

29 May 2026. Score: 7.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20437422

Abstract: Recent advances in the areas of multimodal machine learning and artificial intelligence (AI) have led to the development of challenging tasks at the intersection of Computer Vision, Natural Language Processing, and Embodied AI. Whereas many approaches and previous survey pursuits have characterised one or two of…

[377]

How does the multi-turn RL framework in LongNav-R1 compare to fixed-length memory baselines on the VLN-CE benc

29 May 2026. Score: 7.33/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent…

[376]

Does Vendi-RAG's diversity optimization maintain its latency and accuracy benefits when evaluated on the MMLU

29 May 2026. Score: 7.17/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: Neural sequence-to-sequence models have provided a viable new approach for abstractive text summarization (meaning they are not restricted to simply selecting and rearranging passages from the original text). However, these models have two shortcomings: they are liable to reproduce factual details inaccurately, and…

[375]

What is the trade-off between answer accuracy (F1/EM) and inference time when applying Vendi-RAG's iterative d

29 May 2026. Score: 8.00/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20437293

Abstract: Large language model (LLM) systems, such as ChatGPT 1 or Gemini 2, can show impressive reasoning and question-answering capabilities but often ‘hallucinate’ false outputs and unsubstantiated answers 3,4. Answering unreliably or without the necessary information prevents adoption in diverse fields, with…

[374]

What is the impact of ReKV's streaming window size on VideoQA performance when evaluated on the VideoQA benchm

29 May 2026. Score: 3.33/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: Long streaming video QA remains challenging due to growing visual tokens and limited reasoning length of large language models (LLMs). KV-caching stores the Key-Value (KV) of the historical tokens via LLM prefill and enables more efficient streaming QA. However, existing methods cache every one or two frames, causing…

[373]

Does Reflexion's verbal reinforcement learning improve success rate on the ALFRED benchmark compared to behavi

29 May 2026. Score: 9.00/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20437230

Abstract: Recent advances in robot learning have shown promise in enabling robots to perform a variety of manipulation tasks and generalize to novel scenarios. One of the key contributing factors to this progress is the scale of robot data used to train the models. To obtain large-scale datasets, prior approaches have relied on…

[372]

How does the choice of hardware configuration (e.g., edge GPU vs. cloud TPU) impact the accuracy-throughput tr

29 May 2026. Score: 8.33/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20437206

Abstract: The rapid evolution of large language models (LLMs) has driven a transformative shift in artificial intelligence (AI), reshaping both research paradigms and practical applications. Distinguished from their predecessors by unprecedented scale and advanced capabilities, LLMs necessitate new frameworks for…

[371]

How does the ReKV method compare to baseline streaming methods in termseduce on end-to-end VideoQA benchmarks

29 May 2026. Score: 6.67/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: Unlike previous studies on the Metaverse based on Second Life, the current Metaverse is based on the social value of Generation Z that online and offline selves are not different. With the technological development of deep learning-based high-precision recognition models and natural generation models, Metaverse is…

[370]

What is the impact of momentum contrastive learning on retrieval accuracy across different domain-shifted capt

29 May 2026. Score: 7.33/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: In the last few years, the deep learning (DL) computing paradigm has been deemed the Gold Standard in the machine learning (ML) community. Moreover, it has gradually become the most widely used computational approach in the field of ML, thus achieving outstanding results on several complex cognitive tasks, matching…

[369]

To what extent do different routing mechanisms in sparse MoE models influence inference latency and code gener

29 May 2026. Score: 7.00/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: Most R novices will start with Appendix A [A sample session], page 80. This should give some familiarity with the style of R sessions and more importantly some instant feedback on what actually happens. Many users will come to R mainly for its graphical facilities.

[368]

How does MixLoRA-based MoE fine-tuning compare to full fine-tuning in terms of inference latency and memory us

29 May 2026. Score: 9.00/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20437147

Abstract: The rapid evolution of large language models (LLMs) has driven a transformative shift in artificial intelligence (AI), reshaping both research paradigms and practical applications. Distinguished from their predecessors by unprecedented scale and advanced capabilities, LLMs necessitate new frameworks for…

[367]

What is the impact of dynamic routing strategies on the energy efficiency of multimodal model inference on low

29 May 2026. Score: 2.67/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: In the last few years, the deep learning (DL) computing paradigm has been deemed the Gold Standard in the machine learning (ML) community. Moreover, it has gradually become the most widely used computational approach in the field of ML, thus achieving outstanding results on several complex cognitive tasks, matching…

[366]

How does the layer-wise score aggregation method generalize across different domains when evaluated on out-of-

29 May 2026. Score: 8.17/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M. Meyer, Steffen Eger. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019.

[365]

What is the computational overhead of layer-wise score aggregation method compared to last-layer-only baseline

29 May 2026. Score: 7.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across…

[364]

What is the trade-off between accuracy and tokens-per-second on the GSM8K benchmark for Qwen3 under dynamic ex

29 May 2026. Score: 0.83/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: Mixture-of-Experts (MoE) has become a practical architecture for scaling LLM capacity while keeping per-token compute modest, but deploying MoE models on a single, memory-limited GPU remains difficult because expert weights dominate the HBM footprint. Existing expert offloading and prefetching systems reduce the…