Assignee Research: Index of Papers

[177]

How does the effectiveness of negative sampling for unanswerable questions in the MRQA dataset compare to SQuA

28 May 2026. Score: 7.67/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20432015

Abstract: In the last few years, the deep learning (DL) computing paradigm has been deemed the Gold Standard in the machine learning (ML) community. Moreover, it has gradually become the most widely used computational approach in the field of ML, thus achieving outstanding results on several complex cognitive tasks, matching…

[176]

How does back-translation paraphrasing affect the robustness of LLM question answering performance across diff

28 May 2026. Score: 7.67/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20431979

Abstract: NLP practitioners often want to take existing trained models and apply them to data from new domains. While fine-tuning or few-shot learning can be used to adapt a base model, there is no single recipe for making these techniques work; moreover, one may not have access to the original model weights if it is deployed…

[175]

How do domain-agnostic question answering models trained on mixed-domain datasets (SQuAD 2.0, NewsQA, and Triv

28 May 2026. Score: 8.00/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20431969

Abstract: Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on…

[174]

What is the quantifiable difference in inference latency and accuracy degradation when applying suboptimal dat

28 May 2026. Score: 3.17/10. Verification: L1, Literature synthesis.

Abstract: Introduction * Information and Likelihood Theory: A Basis for Model Selection and Inference * Basic Use of the Information-Theoretic Approach * Formal Inference From More Than One Model: Multi-Model Inference (MMI) * Monte Carlo Insights and Extended Examples * Statistical Theory and Numerical Results * Summary

[173]

How does negative sampling affect inference efficiency and accuracy tradeoffs across different model scales in

28 May 2026. Score: 7.17/10. Verification: L2, Source-grounded claims.

Abstract: This paper presents a focused investigation into real-time segmentation in unstructured environments, a crucial aspect for enabling autonomous navigation in off-road robots. To address this challenge, an improved variant of the DDRNet23-slim model is proposed, which includes a lightweight network architecture and…

[172]

What is the comparative evaluation of negative sampling versus domain-specific fine-tuning on MRQA 2019 benchm

28 May 2026. Score: 3.33/10. Verification: L2, Source-grounded claims.

Abstract: To produce a domain-agnostic question answering model for the Machine Reading Question Answering (MRQA) 2019 Shared Task, we investigate the relative benefits of large pre-trained language models, various data sampling strategies, as well as query and context paraphrases generated by back-translation. We find a…

[171]

How does negative sampling performance scale across different LLM architectures (7B vs 70B) when evaluated on

28 May 2026. Score: 1.67/10. Verification: L2, Source-grounded claims.

Abstract: To produce a domain-agnostic question answering model for the Machine Reading Question Answering (MRQA) 2019 Shared Task, we investigate the relative benefits of large pre-trained language models, various data sampling strategies, as well as query and context paraphrases generated by back-translation. We find a…

[170]

How does the inference throughput-accuracy trade-off differ between o1-preview and DeepSeek-R1 under constrain

28 May 2026. Score: 2.17/10. Verification: L1, Literature synthesis.

Abstract: Recent advances in test-time scaling of large language models (LLMs), exemplified by DeepSeek-R1 and OpenAI's o1, show that extending the chain of thought during inference can significantly improve general reasoning performance. However, the impact of this paradigm on legal reasoning remains insufficiently explored.…

[169]

How does the adversarial robustness of o1-preview and DeepSeek-R1 to synonym substitution perturbations scale

28 May 2026. Score: 3.67/10. Verification: L2, Source-grounded claims.

Abstract: Large Language Models (LLMs) exhibit impressive capabilities, but remain susceptible to a growing spectrum of safety risks, including jailbreaks, toxic content, hallucinations, and bias. Existing defenses often address only a single threat type or resort to rigid outright rejection, sacrificing user experience and…

[168]

What is the relationship between model size (e.g., 7B vs 70B parameters) and the transferability of token-leve

28 May 2026. Score: 2.33/10. Verification: L2, Source-grounded claims.

Abstract: Graph Neural Networks (GNNs), specifically designed to process the graph data, have achieved remarkable success in various applications. Link stealing attacks on graph data pose a significant privacy threat, as attackers aim to extract sensitive relationships between nodes (entities), potentially leading to academic…

[167]

How does the accuracy of DeepSeek-R1 and o1-preview scale with chain-of-thought length (number of reasoning to

28 May 2026. Score: 7.50/10. Verification: L2, Source-grounded claims.

Abstract: Despite increasing discussions on open-source Artificial Intelligence (AI), existing research lacks a discussion on the transparency and accessibility of state-of-the-art (SoTA) Large Language Models (LLMs). The Open Source Initiative (OSI) has recently released its first formal definition of open-source software.…

[166]

What is the robustness of test-time scaling gains for o1-preview and DeepSeek-R1 under adversarial legal input

28 May 2026. Score: 4.33/10. Verification: L2, Source-grounded claims.

Abstract: In an era dominated by Large Language Models (LLMs), understanding their capabilities and limitations, especially in high-stakes fields like law, is crucial. While LLMs such as Meta's LLaMA, OpenAI's ChatGPT, Google's Gemini, DeepSeek, and other emerging models are increasingly integrated into legal workflows, their…

[165]

How do the test-time compute scaling curves (accuracy vs. inference FLOPs) for DeepSeek-R1 and o1-preview diff

28 May 2026. Score: 7.57/10. Verification: L2, Source-grounded claims.

Abstract: Recent advances in test-time scaling of large language models (LLMs), exemplified by DeepSeek-R1 and OpenAI's o1, show that extending the chain of thought during inference can significantly improve general reasoning performance. However, the impact of this paradigm on legal reasoning remains insufficiently explored.…

[164]

Does the coordinated pass@k policy optimization proposed in Cast a Wider Net improve diversity of generated co

28 May 2026. Score: 7.00/10. Verification: L2, Source-grounded claims.

Abstract: Multilayer neural networks trained with the back-propagation algorithm constitute the best example of a successful gradient based learning technique. Given an appropriate network architecture, gradient-based learning algorithms can be used to synthesize a complex decision surface that can classify high-dimensional…

[163]

To what extent does the Cast a Wider Net approach reduce redundant sampling overhead (measured by inference co

28 May 2026. Score: 7.67/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20431010

Abstract: Self-determination theory (SDT) maintains that an understanding of human motivation requires a consideration of innate psychological needs for competence, autonomy, and relatedness. We discuss the SDT concept of needs as it relates to previous need theories, emphasizing that needs specify the necessary…

[162]

What is the impact of S* hybrid test-time scaling versus chain-of-thought parallel scaling on the robustness o

28 May 2026. Score: 8.17/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20430985

Abstract: The rapid evolution of large language models (LLMs) has driven a transformative shift in artificial intelligence (AI), reshaping both research paradigms and practical applications. Distinguished from their predecessors by unprecedented scale and advanced capabilities, LLMs necessitate new frameworks for…

[161]

How does the S* hybrid test-time scaling framework affect the inference efficiency (measured in average latenc

28 May 2026. Score: 6.33/10. Verification: L2, Source-grounded claims.

Abstract: We introduce MiniMax-01 series, including MiniMax-Text-01 and MiniMax-VL-01, which are comparable to top-tier models while offering superior capabilities in processing longer contexts. The core lies in lightning attention and its efficient scaling. To maximize computational capacity, we integrate it with Mixture of…

[160]

How does the adaptive distinguishing selection mechanism in Cast a Wider Net affect pass@k scores and coverage

28 May 2026. Score: 7.83/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20430952

Abstract: In the last few years, the deep learning (DL) computing paradigm has been deemed the Gold Standard in the machine learning (ML) community. Moreover, it has gradually become the most widely used computational approach in the field of ML, thus achieving outstanding results on several complex cognitive tasks, matching…

[159]

Does the adversarial robustness gap between DeepSeek-R1 and o1-preview on legal reasoning tasks generalize to

28 May 2026. Score: 1.67/10. Verification: L1, Literature synthesis.

Abstract: In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and…

[158]

How does the S* hybrid test-time scaling framework compare to standard parallel scaling approaches in terms of

28 May 2026. Score: 7.33/10. Verification: L2, Source-grounded claims.

Abstract: We introduce self-invoking code generation, a new task designed to evaluate the progressive reasoning and problem-solving capabilities of LLMs. In this task, models are presented with a base problem and a related, more complex problem. They must solve the base problem and then utilize its solution to address the more…

[157]

How does the performance of DeepSeek-R1 compare to o1-preview on the APPS benchmark when evaluated under negat

28 May 2026. Score: 5.00/10. Verification: L2, Source-grounded claims.

Abstract: Recently, there is a high demand for deploying DeepSeek-R1 and V3 locally, possibly because the official service often suffers from being busy and some organizations have data privacy concerns. While single-machine deployment offers infrastructure simplicity, the models' 671B FP8 parameter configuration exceeds the…

[156]

To what extent does the S* selection mechanism improve the accuracy and throughput of code generation on Codef

28 May 2026. Score: 6.33/10. Verification: L2, Source-grounded claims.

Abstract: We report the observation of gravitational waves from two binary black hole coalescences during the fourth observing run of the LIGO–Virgo–KAGRA detector network, GW241011 and GW241110. The sources of these two signals are characterized by rapid and precisely measured primary spins, non-negligible spin–orbit…

[155]

To what extent does token pruning in SPLADE models degrade retrieval accuracy vs. improve latency on multi-hop

28 May 2026. Score: 6.17/10. Verification: L2, Source-grounded claims.

Abstract: Latency and efficiency issues are often overlooked when evaluating IR models based on Pretrained Language Models (PLMs) because of the multiple hardware and software testing scenarios involved. Nevertheless, efficiency is an important part of such systems and should not be overlooked. In this paper, we focus on improving the…

[154]

What is the trade-off between inference efficiency and robustness to adversarial query perturbations for spars

28 May 2026. Score: 7.67/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20430118

Abstract: Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations. It aims to provide effective, reproducible, and easy-to-use first-stage retrieval in a multi-stage ranking architecture. Our toolkit is self-contained as a standard Python package and comes with…

[153]

How does the inference throughput (queries per second) of SPLADE-v3 compare to ColBERT-v2 under controlled spa

28 May 2026. Score: 8.00/10. Verification: L2, Source-grounded claims.

Abstract: Late interaction neural IR models like ColBERT offer a competitive effectiveness-efficiency trade-off across many benchmarks. However, they require a huge memory space to store the contextual representation for all the document tokens. Some works have proposed using either heuristics or statistical-based techniques…