Assignee Research: Index of Papers

[496]

To what extent does MMICL improve cross-domain generalization in zero-shot retrieval tasks when evaluated on o

29 May 2026. Score: 7.63/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20440982

Abstract: In the last few years, the deep learning (DL) computing paradigm has been deemed the Gold Standard in the machine learning (ML) community. Moreover, it has gradually become the most widely used computational approach in the field of ML, thus achieving outstanding results on several complex cognitive tasks, matching…

[495]

What is the inference latency and throughput trade-off for LLaMA models of varying sizes (7B to 70B) when eval

29 May 2026. Score: 6.50/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: This paper introduces PowerInfer, a high-speed Large Language Model (LLM) inference engine on a personal computer (PC) equipped with a single consumer-grade GPU. The key principle underlying the design of PowerInfer is exploiting the high locality inherent in LLM inference, characterized by a power-law distribution…

[494]

Does cross-domain fine-tuning (e.g., pre-training on general code vs. security-focused code) combined with tar

29 May 2026. Score: 7.40/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: Large Language Models (LLMs) have significantly impacted numerous domains, including Software Engineering (SE). Many recent publications have explored LLMs applied to various SE tasks. Nevertheless, a comprehensive understanding of the application, effects, and possible limitations of LLMs on SE is still in its early…

[493]

How does the Baichuan 2 model's performance in low-resource inference settings compare to Meta AI's LLaMA-3 on

29 May 2026. Score: 7.30/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: Intervening the internal activations of large language models (LLMs) provides an effective inference-time alignment approach to mitigate undesirable behaviors, such as generating erroneous or harmful content, thereby ensuring safe and reliable applications of LLMs.However, previous methods neglect the misalignment…

[492]

What is the impact of model size scaling (7B vs 34B vs 70B) on Code Llama's performance in code infilling task

29 May 2026. Score: 7.17/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: We release Code Llama, a family of large language models for code based on Llama 2 providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks. We provide multiple flavors to cover a wide range of…

[491]

How does the inclusion of domain-specific text preprocessing (e.g., code-specific transformations) impact the

29 May 2026. Score: 8.83/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20440910

Abstract: Abstract The rapid evolution of large language models (LLMs) has driven a transformative shift in artificial intelligence (AI), reshaping both research paradigms and practical applications. Distinguished from their predecessors by unprecedented scale and advanced capabilities, LLMs necessitate new frameworks for…

[490]

How does the performance of Flamingo compare to PaLI and BLIVA in zero-shot cross-modal retrieval tasks, parti

29 May 2026. Score: 8.83/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20440902

Abstract: Since the resurgence of deep learning, vision-language models (VLMs) enhanced by large language models (LLMs) have grown exponentially in popularity. However, while LLMs can utilize extensive background knowledge and task information with in-context learning, most VLMs still struggle with understanding complex…

[489]

What is the quantitative trade-off between DPO alignment and tokens-per-second inference speed on code generat

29 May 2026. Score: 8.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20440893

Abstract: Large Language Models (LLMs) have garnered remarkable advancements across diverse code-related tasks, known as Code LLMs, particularly in code generation that generates source code with LLM from natural language descriptions. This burgeoning field has captured significant interest from both academic researchers and…

[488]

How does Baichuan 2 perform in low-resource inference settings compared to Meta AI's LLaMA-3 in terms of laten

29 May 2026. Score: 8.67/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20440847

Abstract: Large language models (LLMs), exemplified by ChatGPT, have gained considerable attention for their excellent natural language processing capabilities. Nonetheless, these LLMs present many challenges, particularly in the realm of trustworthiness. Therefore, ensuring the trustworthiness of LLMs emerges as an important…

[487]

What is the performance gap in code generation tasks (HumanEval) between multimodal models and text-only LLMs

29 May 2026. Score: 7.73/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20440838

Abstract: We release Code Llama, a family of large language models for code based on Llama 2 providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks. We provide multiple flavors to cover a wide range of…

[486]

How do alignment techniques such as RLHF and DPO affect the performance of LLMs on LLM-as-a-Judge benchmarks l

29 May 2026. Score: 9.17/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20440834

Abstract: Abstract The rapid evolution of large language models (LLMs) has driven a transformative shift in artificial intelligence (AI), reshaping both research paradigms and practical applications. Distinguished from their predecessors by unprecedented scale and advanced capabilities, LLMs necessitate new frameworks for…

[485]

How does fine-tuning Llama3, Codestral, and Deepseek R1 on the full Big-Vul dataset impact their vulnerability

29 May 2026. Score: 8.33/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20440826

Abstract: With the advent of the modern pre-trained Transformers, the text preprocessing has started to be neglected and not specificly addressed in recent NLP literature. However, both from a linguistic and from a computer science point of view, we believe that even when using modern Transformers, text preprocessing can…

[484]

How do alignment techniques (e.g., RLHF, DPO) affect the trade-off between MATH accuracy and inference efficie

29 May 2026. Score: 9.00/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20440795

Abstract: In this report, we introduce Qwen2.5, a comprehensive series of large language models (LLMs) designed to meet diverse needs. Compared to previous iterations, Qwen 2.5 has been significantly improved during both the pre-training and post-training stages. In terms of pre-training, we have scaled the high-quality…

[483]

What is the trade-off between inference latency and quality-of-service (e.g., response accuracy) when deployin

29 May 2026. Score: 8.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20440782

Abstract: The rapid scaling of large language models (LLMs) has unveiled critical limitations in current hardware architectures, including constraints in memory capacity, computational efficiency, and interconnection bandwidth. DeepSeek-V3, trained on 2,048 NVIDIA H800 GPUs, demonstrates how hardware-aware model co-design can…

[482]

What is the impact of input modality interleaving (e.g., images and text) on GPT-4o's reasoning accuracy in AR

29 May 2026. Score: 8.33/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20440780

Abstract: In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family…

[481]

How does the performance of Llama, Mistral, Qwen, and DeepSeek compare on code generation benchmarks like Huma

29 May 2026. Score: 8.83/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20440774

Abstract: Large Language Models (LLMs) have garnered remarkable advancements across diverse code-related tasks, known as Code LLMs, particularly in code generation that generates source code with LLM from natural language descriptions. This burgeoning field has captured significant interest from both academic researchers and…

[480]

What is the impact of model size (e.g., 7B vs. 70B parameters) on HumanEval and GSM8K performance across LLaMA

29 May 2026. Score: 8.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20440747

Abstract: Abstract The rapid evolution of large language models (LLMs) has driven a transformative shift in artificial intelligence (AI), reshaping both research paradigms and practical applications. Distinguished from their predecessors by unprecedented scale and advanced capabilities, LLMs necessitate new frameworks for…

[479]

How does GPT-4o's performance on ARC-Challenge compare to other state-of-the-art multimodal models like Flamin

29 May 2026. Score: 7.20/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: Getting a good feature representation of data is paramount for Human Activity Recognition (HAR) using wearable sensors. An increasing number of feature learning approaches-in particular deep-learning based-have been proposed to extract an effective feature representation by analyzing large amounts of data. However,…

[478]

How does Flamingo's zero-shot or few-shot generalization ability scale with increasing model size or different

29 May 2026. Score: 9.33/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20440609

Abstract: Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence. However, building general-purpose vision-language models is challenging due to the rich input distributions and task diversity resulting from the additional visual input. Although…

[477]

How does Flamingo's cross-domain adaptability scale with increasing model size, and does it outperform GPT-4o

29 May 2026. Score: 7.97/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20440605

Abstract: Recently, ChatGPT, along with DALL-E-2 and Codex,has been gaining significant attention from society. As a result, many individuals have become interested in related resources and are seeking to uncover the background and secrets behind its impressive performance. In fact, ChatGPT and other Generative AI (GAI)…

[476]

How does the performance of Flamingo compare to other state-of-the-art multimodal models like PaLI or BLIVA on

29 May 2026. Score: 9.00/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20440526

Abstract: Since the resurgence of deep learning, vision-language models (VLMs) enhanced by large language models (LLMs) have grown exponentially in popularity. However, while LLMs can utilize extensive background knowledge and task information with in-context learning, most VLMs still struggle with understanding complex…

[475]

How does Flamingo's multimodal few-shot learning performance compare to GPT-4o on vision-language benchmarks l

29 May 2026. Score: 8.67/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20440508

Abstract: Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about our visual world. However, models used to tackle the rich content in…

[474]

How does GPT-4o's performance on HumanEval compare to other state-of-the-art LLMs like Claude 3 Opus or Llama

29 May 2026. Score: 8.83/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20440459

Abstract: Large Language Models (LLMs) have garnered remarkable advancements across diverse code-related tasks, known as Code LLMs, particularly in code generation that generates source code with LLM from natural language descriptions. This burgeoning field has captured significant interest from both academic researchers and…

[473]

DeepSeek Qwen open source model benchmark evaluation MMLU HumanEval MATH scores comparison 2024

29 May 2026. Score: 8.33/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20440428

Abstract: Abstract The rapid evolution of large language models (LLMs) has driven a transformative shift in artificial intelligence (AI), reshaping both research paradigms and practical applications. Distinguished from their predecessors by unprecedented scale and advanced capabilities, LLMs necessitate new frameworks for…

[472]

Instruction-following model evaluation benchmark MMLU HumanEval GSM8K scores table results

29 May 2026. Score: 8.17/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20440424

Abstract: Abstract The rapid evolution of large language models (LLMs) has driven a transformative shift in artificial intelligence (AI), reshaping both research paradigms and practical applications. Distinguished from their predecessors by unprecedented scale and advanced capabilities, LLMs necessitate new frameworks for…