Assignee Research: Index of Papers

[587]

How does the anomaly detection F1-score of DeepSeek R1 compare to Mistral 7B on time-series datasets with distribution shifts?

29 May 2026. Score: 2.33/10. Verification: L1, Literature synthesis.

Abstract: Anomaly detection presents a unique challenge in machine learning, due to the scarcity of labeled anomaly data. Recent work attempts to mitigate such problems by augmenting training of deep anomaly detection models with additional labeled anomaly samples. However, the labeled data often does not align with the target…

[586]

How does the inference latency of quantized LLaVA-1.5 models vary across different image resolutions in multimodal benchmarks?

29 May 2026. Score: 2.67/10. Verification: L2, Source-grounded claims.

Abstract: Visual encoding constitutes the basis of large multimodal models (LMMs) in understanding the visual world. Conventional LMMs process images in fixed sizes and limited resolutions, while recent explorations in this direction are limited in adaptivity, efficiency, and even correctness. In this work, we first take…

[585]

What is the correlation between Llama3's cross-domain anomaly detection accuracy and the percentage of…

29 May 2026. Score: 5.00/10. Verification: L2, Source-grounded claims.

Abstract: Time series anomaly detection is important in modern large-scale systems and is applied in a variety of domains to analyze and monitor the operation of diverse systems. Unsupervised approaches have received widespread interest, as they do not require anomaly labels during training, thus avoiding potentially high…

[584]

What is the degradation in GQA benchmark scores for LLaVA-1.5 when applying activation-aware weight…

29 May 2026. Score: 7.17/10. Verification: L2, Source-grounded claims.

Abstract: Large language models (LLMs) demand substantial computational and memory resources, creating deployment challenges. Quantization-aware training (QAT) addresses these challenges by reducing model precision while maintaining performance. However, the scaling behavior of QAT, especially at 4-bit precision (W4A4), is not…

[583]

How does Llama3's zero-shot performance in energy market anomaly detection compare to fine-tuned smaller…

29 May 2026. Score: 6.67/10. Verification: L1, Literature synthesis.

Abstract: Large Language Models achieve remarkable performance but incur substantial computational costs unsuitable for resource-constrained deployments. This paper presents the first comprehensive task-specific efficiency analysis comparing 16 language models across five diverse NLP tasks. We introduce the…

[582]

How does varying the hot neuron activation threshold in PowerInfer affect Pass@1 scores on the HumanEval bench…

29 May 2026. Score: 4.50/10. Verification: L2, Source-grounded claims.

Abstract: Understanding and reasoning over diagrams is a fundamental aspect of human intelligence. While Large Multimodal Models (LMMs) have demonstrated impressive capabilities across various tasks, existing benchmarks lack comprehensive evaluation of their diagram interpretation and reasoning abilities, particularly in…

[581]

What is the average drop in forecasting accuracy for Llama3 compared to domain-specific models when evaluated…

29 May 2026. Score: 6.50/10. Verification: L1, Literature synthesis.

Abstract: Time series forecasting remains a challenging task, particularly in the context of complex multiscale temporal patterns. This study presents LLM-Mixer, a framework that improves forecasting accuracy through the combination of multiscale time-series decomposition with pre-trained LLMs (Large Language Models).…

[580]

How does the inference efficiency (throughput, latency) of pruned BERT models compare to quantization-optimize…

29 May 2026. Score: 4.17/10. Verification: L2, Source-grounded claims.

Abstract: In this paper, we present a comparative analysis of benign and malicious Android applications, based on static features. In particular, we focus our attention on the permissions requested by an application. We consider both binary classification of malware versus benign, as well as the multiclass problem, where we…

[579]

How does the accuracy and inference latency of lightweight BERT models compare to distilled versions of larger…

29 May 2026. Score: 5.00/10. Verification: L1, Literature synthesis.

Abstract: The deployment of transformer-based models on resource-constrained edge devices represents a critical challenge in enabling real-time artificial intelligence applications. This comprehensive survey examines lightweight transformer architectures specifically designed for edge deployment, analyzing recent advances in…

[578]

What is the impact of cross-domain fine-tuning on the pass@1 accuracy of LLaMA-70B for Python function synthes…

29 May 2026. Score: 4.17/10. Verification: L2, Source-grounded claims.

Abstract: Large Language Models (LLMs) have demonstrated significant capabilities in understanding and analyzing code for security vulnerabilities, such as Common Weakness Enumerations (CWEs). However, their reliance on cloud infrastructure and substantial computational requirements pose challenges for analyzing sensitive or…

[577]

How does dynamic hot neuron threshold adjustment in PowerInfer influence the alignment of LLaMA-70B outputs wi…

29 May 2026. Score: 8.50/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20450839

Abstract: Emerging research in Pluralistic Artificial Intelligence (AI) alignment seeks to address how intelligent systems can be designed and deployed in accordance with diverse human needs and values. We contribute to this pursuit with a dynamic approach for aligning AI with diverse and shifting user preferences through…

[576]

How does the performance of federated learning models (like FEDetect) compare to centralized deep neural netwo…

29 May 2026. Score: 4.17/10. Verification: L2, Source-grounded claims.

Abstract: This work investigates the possibilities enabled by federated learning concerning IoT malware detection and studies security issues inherent to this new learning paradigm. In this context, a framework that uses federated learning to detect malware affecting IoT devices is presented. N-BaIoT, a dataset modeling…

[575]

To what extent does the use of differential privacy in federated learning-based malware detection (as seen in…

29 May 2026. Score: 4.17/10. Verification: L2, Source-grounded claims.

Abstract: This work investigates the possibilities enabled by federated learning concerning IoT malware detection and studies security issues inherent to this new learning paradigm. In this context, a framework that uses federated learning to detect malware affecting IoT devices is presented. N-BaIoT, a dataset modeling…

[574]

How does the TAE token misalignment threshold impact the factual consistency score of Vicuna-13B versus Baichu…

29 May 2026. Score: 8.00/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20450705

Abstract: Abstractive summarization models have demonstrated impressive progress in producing fluent, concise, and human-like summaries. Nevertheless, they also face long-term difficulties, including factual inconsistencies, hallucinations, and misunderstandings of idiomatic expressions, which commonly lead to distortions of…

[573]

What is the throughput difference in code generation tasks between Vicuna-13B and Baichuan 2 when varying TAE…

29 May 2026. Score: 4.00/10. Verification: L2, Source-grounded claims.

Abstract: Understanding and reasoning over diagrams is a fundamental aspect of human intelligence. While Large Multimodal Models (LMMs) have demonstrated impressive capabilities across various tasks, existing benchmarks lack comprehensive evaluation of their diagram interpretation and reasoning abilities, particularly in…

[572]

Do Baichuan 2 and Vicuna-13B exhibit different sensitivity curves in alignment scores across multimodal benchm…

29 May 2026. Score: 7.17/10. Verification: L1, Literature synthesis.

Abstract: Building on the foundations of language modeling in natural language processing, Next Token Prediction (NTP) has evolved into a versatile training objective for machine learning tasks across various modalities, achieving considerable success. As Large Language Models (LLMs) have advanced to unify understanding and…

[571]

What is the difference in execution accuracy between Llama3-70B and Codestral-7B on the MBPP benchmark after d…

29 May 2026. Score: 5.33/10. Verification: L2, Source-grounded claims.

Abstract: Large Language Models (LLMs) have demonstrated significant capabilities in understanding and analyzing code for security vulnerabilities, such as Common Weakness Enumerations (CWEs). However, their reliance on cloud infrastructure and substantial computational requirements pose challenges for analyzing sensitive or…

[570]

How does instruction tuning on secure coding guidelines affect the codeBLEU scores of Llama3-70B compared to C…

29 May 2026. Score: 4.17/10. Verification: L2, Source-grounded claims.

Abstract: Modern language models (LMs) have gained widespread acceptance in everyday and professional contexts, particularly in programming. An essential procedure enabling this adoption is instruction tuning, which substantially enhances LMs' practical utility by training them to follow user instructions and human…

[569]

To what extent does instruction length in BigCodeBench correlate with syntax error rates in generated code for…

29 May 2026. Score: 4.50/10. Verification: L2, Source-grounded claims.

Abstract: Chain-of-thought (CoT) has emerged as a groundbreaking tool in NLP, notably for its efficacy in complex reasoning tasks, such as mathematical proofs. However, its application in code generation faces a distinct challenge, i.e., although the code generated with CoT reasoning is logically correct, it faces the problem…

[568]

How do specialized code models like Code Llama perform relative to general foundation models on BigCodeBench t…

29 May 2026. Score: 7.50/10. Verification: L1, Literature synthesis.

Abstract: We release Code Llama, a family of large language models for code based on Llama 2 providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks. We provide multiple flavors to cover a wide range of…

[567]

How does the pass@k metric for code generation models vary across BigCodeBench tasks requiring multi-library P…

29 May 2026. Score: 6.50/10. Verification: L1, Literature synthesis.

Abstract: Repeated sampling with a verifier is the standard way to allocate test-time compute for code generation, with pass@$K$ as the canonical metric. Yet the standard policy class draws $K$ independent samples from a single answer distribution, so attempts often collapse onto near-duplicate reasoning paths and waste…

[566]

How does the memory bandwidth utilization of Qwen3-MoE architecture scale relative to dense Qwen3 models durin…

29 May 2026. Score: 3.83/10. Verification: L1, Literature synthesis.

Abstract: In this work, we present Qwen3, the latest version of the Qwen model family. Qwen3 comprises a series of large language models (LLMs) designed to advance performance, efficiency, and multilingual capabilities. The Qwen3 series includes models of both dense and Mixture-of-Expert (MoE) architectures, with parameter…

[565]

How does the zero-shot instruction following capability of Code Llama - Instruct compare to the base Code Llam…

29 May 2026. Score: 6.83/10. Verification: L1, Literature synthesis.

Abstract: We release Code Llama, a family of large language models for code based on Llama 2 providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks. We provide multiple flavors to cover a wide range of…

[564]

What is the impact of large input context support in Code Llama on code completion accuracy for multi-file pro…

29 May 2026. Score: 3.17/10. Verification: L1, Literature synthesis.

Abstract: Large Language Models (LLMs) have shown promise in automating code generation and software engineering tasks, yet they often struggle with complex, multi-file projects due to context limitations and knowledge gaps. We propose a novel context engineering workflow that combines multiple AI components: an Intent…

[563]

How does the accuracy-throughput trade-off of Llama3-70B and Codestral-34B compare when deployed on heterogene…

29 May 2026. Score: 4.17/10. Verification: L2, Source-grounded claims.

Abstract: This paper proposes a neural architecture search (NAS) method for split computing. Split computing is an emerging machine-learning inference technique that addresses the privacy and latency challenges of deploying deep learning in IoT systems. In split computing, neural network models are separated and cooperatively…