Assignee Research: Index of Papers

[584]

What is the degradation in GQA benchmark scores for LLaVA-1.5 when applying activation-aware weight

29 May 2026. Score: 7.17/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: Large language models (LLMs) demand substantial computational and memory resources, creating deployment challenges. Quantization-aware training (QAT) addresses these challenges by reducing model precision while maintaining performance. However, the scaling behavior of QAT, especially at 4-bit precision (W4A4), is not…

[583]

How does Llama3's zero-shot performance in energy market anomaly detection compare to fine-tuned smaller

29 May 2026. Score: 6.67/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: Large Language Models achieve remarkable performance but incur substantial computational costs unsuitable for resource-constrained deployments. This paper presents the first comprehensive task-specific efficiency analysis comparing 16 language models across five diverse NLP tasks. We introduce the…

[582]

How does varying the hot neuron activation threshold in PowerInfer affect Pass@1 scores on the HumanEval bench

29 May 2026. Score: 4.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: Understanding and reasoning over diagrams is a fundamental aspect of human intelligence. While Large Multimodal Models (LMMs) have demonstrated impressive capabilities across various tasks, existing benchmarks lack comprehensive evaluation of their diagram interpretation and reasoning abilities, particularly in…

[581]

What is the average drop in forecasting accuracy for Llama3 compared to domain-specific models when evaluated

29 May 2026. Score: 6.50/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: Time series forecasting remains a challenging task, particularly in the context of complex multiscale temporal patterns. This study presents LLM-Mixer, a framework that improves forecasting accuracy through the combination of multiscale time-series decomposition with pre-trained LLMs (Large Language Models).…

[580]

How does the inference efficiency (throughput, latency) of pruned BERT models compare to quantization-optimize

29 May 2026. Score: 4.17/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: In this paper, we present a comparative analysis of benign and malicious Android applications, based on static features. In particular, we focus our attention on the permissions requested by an application. We consider both binary classification of malware versus benign, as well as the multiclass problem, where we…

[579]

How does the accuracy and inference latency of lightweight BERT models compare to distilled versions of larger

29 May 2026. Score: 5.00/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: The deployment of transformer-based models on resource-constrained edge devices represents a critical challenge in enabling real-time artificial intelligence applications. This comprehensive survey examines lightweight transformer architectures specifically designed for edge deployment, analyzing recent advances in…

[578]

What is the impact of cross-domain fine-tuning on the pass@1 accuracy of LLaMA-70B for Python function synthes

29 May 2026. Score: 4.17/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: Large Language Models (LLMs) have demonstrated significant capabilities in understanding and analyzing code for security vulnerabilities, such as Common Weakness Enumerations (CWEs). However, their reliance on cloud infrastructure and substantial computational requirements pose challenges for analyzing sensitive or…

[577]

How does dynamic hot neuron threshold adjustment in PowerInfer influence the alignment of LLaMA-70B outputs wi

29 May 2026. Score: 8.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified. DOI: 10.5281/zenodo.20450839

Abstract: Emerging research in Pluralistic Artificial Intelligence (AI) alignment seeks to address how intelligent systems can be designed and deployed in accordance with diverse human needs and values. We contribute to this pursuit with a dynamic approach for aligning AI with diverse and shifting user preferences through…

[576]

How does the performance of federated learning models (like FEDetect) compare to centralized deep neural netwo

29 May 2026. Score: 4.17/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This work investigates the possibilities enabled by federated learning concerning IoT malware detection and studies security issues inherent to this new learning paradigm. In this context, a framework that uses federated learning to detect malware affecting IoT devices is presented. N-BaIoT, a dataset modeling…

[575]

To what extent does the use of differential privacy in federated learning-based malware detection (as seen in

29 May 2026. Score: 4.17/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This work investigates the possibilities enabled by federated learning concerning IoT malware detection and studies security issues inherent to this new learning paradigm. In this context, a framework that uses federated learning to detect malware affecting IoT devices is presented. N-BaIoT, a dataset modeling…

[574]

How does the TAE token misalignment threshold impact the factual consistency score of Vicuna-13B versus Baichu

29 May 2026. Score: 8.00/10. Verification: L2, Source-grounded claims. Gate status: Unverified. DOI: 10.5281/zenodo.20450705

Abstract: Abstractive summarization models have demonstrated impressive progress in producing fluent, concise, and human-like summaries. Nevertheless, they also face long-term difficulties, including factual inconsistencies, hallucinations, and misunderstandings of idiomatic expressions, which commonly lead to distortions of…

[573]

What is the throughput difference in code generation tasks between Vicuna-13B and Baichuan 2 when varying TAE

29 May 2026. Score: 4.00/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: Understanding and reasoning over diagrams is a fundamental aspect of human intelligence. While Large Multimodal Models (LMMs) have demonstrated impressive capabilities across various tasks, existing benchmarks lack comprehensive evaluation of their diagram interpretation and reasoning abilities, particularly in…

[572]

Do Baichuan 2 and Vicuna-13B exhibit different sensitivity curves in alignment scores across multimodal benchm

29 May 2026. Score: 7.17/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: Building on the foundations of language modeling in natural language processing, Next Token Prediction (NTP) has evolved into a versatile training objective for machine learning tasks across various modalities, achieving considerable success. As Large Language Models (LLMs) have advanced to unify understanding and…

[571]

What is the difference in execution accuracy between Llama3-70B and Codestral-7B on the MBPP benchmark after d

29 May 2026. Score: 5.33/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: Large Language Models (LLMs) have demonstrated significant capabilities in understanding and analyzing code for security vulnerabilities, such as Common Weakness Enumerations (CWEs). However, their reliance on cloud infrastructure and substantial computational requirements pose challenges for analyzing sensitive or…

[570]

How does instruction tuning on secure coding guidelines affect the codeBLEU scores of Llama3-70B compared to C

29 May 2026. Score: 4.17/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: Modern language models (LMs) have gained widespread acceptance in everyday and professional contexts, particularly in programming. An essential procedure enabling this adoption is instruction tuning, which substantially enhances LMs' practical utility by training them to follow user instructions and human…

[569]

To what extent does instruction length in BigCodeBench correlate with syntax error rates in generated code for

29 May 2026. Score: 4.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: Chain-of-thought (CoT) has emerged as a groundbreaking tool in NLP, notably for its efficacy in complex reasoning tasks, such as mathematical proofs. However, its application in code generation faces a distinct challenge, i.e., although the code generated with CoT reasoning is logically correct, it faces the problem…

[568]

How do specialized code models like Code Llama perform relative to general foundation models on BigCodeBench t

29 May 2026. Score: 7.50/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: We release Code Llama, a family of large language models for code based on Llama 2 providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks. We provide multiple flavors to cover a wide range of…

[567]

How does the pass@k metric for code generation models vary across BigCodeBench tasks requiring multi-library P

29 May 2026. Score: 6.50/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: Repeated sampling with a verifier is the standard way to allocate test-time compute for code generation, with pass@$K$ as the canonical metric. Yet the standard policy class draws $K$ independent samples from a single answer distribution, so attempts often collapse onto near-duplicate reasoning paths and waste…

[566]

How does the memory bandwidth utilization of Qwen3-MoE architecture scale relative to dense Qwen3 models durin

29 May 2026. Score: 3.83/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: In this work, we present Qwen3, the latest version of the Qwen model family. Qwen3 comprises a series of large language models (LLMs) designed to advance performance, efficiency, and multilingual capabilities. The Qwen3 series includes models of both dense and Mixture-of-Expert (MoE) architectures, with parameter…

[565]

How does the zero-shot instruction following capability of Code Llama - Instruct compare to the base Code Llam

29 May 2026. Score: 6.83/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: We release Code Llama, a family of large language models for code based on Llama 2 providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks. We provide multiple flavors to cover a wide range of…

[564]

What is the impact of large input context support in Code Llama on code completion accuracy for multi-file pro

29 May 2026. Score: 3.17/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: Large Language Models (LLMs) have shown promise in automating code generation and software engineering tasks, yet they often struggle with complex, multi-file projects due to context limitations and knowledge gaps. We propose a novel context engineering workflow that combines multiple AI components: an Intent…

[563]

How does the accuracy-throughput trade-off of Llama3-70B and Codestral-34B compare when deployed on heterogene

29 May 2026. Score: 4.17/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This paper proposes a neural architecture search (NAS) method for split computing. Split computing is an emerging machine-learning inference technique that addresses the privacy and latency challenges of deploying deep learning in IoT systems. In split computing, neural network models are separated and cooperatively…

[562]

How does the precision-recall tradeoff in Gemini 1.5 Pro with an 8M context window compare to Llama3-70B with

29 May 2026. Score: 3.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: Considerable delays often exist between the discovery of a vulnerability and the issue of a patch. One way to mitigate this window of vulnerability is to use a configuration workaround, which prevents the vulnerable code from being executed at the cost of some lost functionality – but only if one is available. Since…

[561]

When fine-tuned on domain-specific security corpora, how do Llama3 and Code Llama 7B compare in few-shot (5-15

29 May 2026. Score: 5.57/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: We propose a meta-learning framework for detecting anomalies in human language across diverse domains with limited labeled data. Anomalies in language, ranging from spam and fake news to hate speech, pose a major challenge due to their sparsity and variability. We treat anomaly detection as a few-shot binary…

[560]

What is the impact of incorporating multimodal context (e.g., UML diagrams or execution traces) on the CWE cla

29 May 2026. Score: 6.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: Large Language Models (LLMs) have demonstrated significant capabilities in understanding and analyzing code for security vulnerabilities, such as Common Weakness Enumerations (CWEs). However, their reliance on cloud infrastructure and substantial computational requirements pose challenges for analyzing sensitive or…