Assignee Research: Index of Papers

[471]

Foundation model evaluation study MMLU HellaSwag ARC WinoGrande TruthfulQA scores

29 May 2026. Score: 7.83/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20440412

Abstract: Recent advancements in Natural Language Processing (NLP) technologies have been driven at an unprecedented pace by the development of Large Language Models (LLMs). However, challenges remain, such as generating responses that are misaligned with the intent of the question or producing incorrect answers. This paper…

[470]

Language model capability evaluation benchmark suite results comparison 2024 frontier models

29 May 2026. Score: 8.67/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20440390

Abstract: Abstract The rapid evolution of large language models (LLMs) has driven a transformative shift in artificial intelligence (AI), reshaping both research paradigms and practical applications. Distinguished from their predecessors by unprecedented scale and advanced capabilities, LLMs necessitate new frameworks for…

[469]

LLM evaluation framework comprehensive benchmark results coding reasoning math multiple models

29 May 2026. Score: 8.07/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20440382

Abstract: Abstract The rapid evolution of large language models (LLMs) has driven a transformative shift in artificial intelligence (AI), reshaping both research paradigms and practical applications. Distinguished from their predecessors by unprecedented scale and advanced capabilities, LLMs necessitate new frameworks for…

[468]

MATH dataset benchmark evaluation language model performance scores comparison 2024

29 May 2026. Score: 6.90/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: As Large Language Models (LLMs) become increasingly integrated into secure software development workflows, a critical question remains unanswered: can these models not only detect insecure code but also reliably classify vulnerabilities according to standardized taxonomies? In this work, we conduct a systematic…

[467]

GSM8K mathematical reasoning benchmark evaluation multiple LLM scores comparison 2024

29 May 2026. Score: 8.67/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20440346

Abstract: Abstract The rapid evolution of large language models (LLMs) has driven a transformative shift in artificial intelligence (AI), reshaping both research paradigms and practical applications. Distinguished from their predecessors by unprecedented scale and advanced capabilities, LLMs necessitate new frameworks for…

[466]

HumanEval benchmark pass@1 scores comparison GPT-4o Claude Gemini LLaMA DeepSeek 2024

29 May 2026. Score: 9.17/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20440344

Abstract: Large Language Models (LLMs) applied to code-related applications have emerged as a prominent field, attracting significant interest from both academia and industry. However, as new and improved LLMs are developed, existing evaluation benchmarks (e.g., HumanEval, MBPP) are no longer sufficient for assessing their…

[465]

SWE-bench software engineering benchmark evaluation language model scores GPT-4 Claude 2024

29 May 2026. Score: 8.67/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20440332

Abstract: Abstract The rapid evolution of large language models (LLMs) has driven a transformative shift in artificial intelligence (AI), reshaping both research paradigms and practical applications. Distinguished from their predecessors by unprecedented scale and advanced capabilities, LLMs necessitate new frameworks for…

[464]

MMLU benchmark results multiple language models comparison GPT-4 Claude Gemini scores accuracy 2024

29 May 2026. Score: 9.33/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20440308

Abstract: Abstract Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a…

[463]

Gemini evaluation benchmark results MMLU HumanEval GSM8K MATH performance scores Google

29 May 2026. Score: 9.17/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20440306

Abstract: We introduce ChatGLM, an evolving family of large language models that we have been developing over time. This report primarily focuses on the GLM-4 language series, which includes GLM-4, GLM-4-Air, and GLM-4-9B. They represent our most capable models that are trained with all the insights and lessons gained from the…

[462]

Mistral evaluation benchmark results MMLU HumanEval GSM8K performance scores comparison

29 May 2026. Score: 8.83/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20440292

Abstract: Recent advancements in Natural Language Processing (NLP) technologies have been driven at an unprecedented pace by the development of Large Language Models (LLMs). However, challenges remain, such as generating responses that are misaligned with the intent of the question or producing incorrect answers. This paper…

[461]

LLaMA-3 evaluation benchmark results MMLU HumanEval GSM8K coding performance Meta AI

29 May 2026. Score: 9.00/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20440270

Abstract: Large language models (LLMs) have demonstrated remarkable performance on a variety of natural language tasks based on just a few examples of natural language instructions, reducing the need for extensive feature engineering. However, most powerful LLMs are closed-source or limited in their capability for languages…

[460]

Claude-3 evaluation benchmark MMLU HumanEval GSM8K MATH coding performance scores Anthropic

29 May 2026. Score: 9.17/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20440250

Abstract: We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69\% on MMLU and 8.38 on MT-bench), despite being…

[459]

State of large language models benchmark evaluation GPT-4 Claude Gemini performance comparison 2024 2025

29 May 2026. Score: 9.00/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20440239

Abstract: Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks, since the release of ChatGPT in November 2022. LLMs' ability of general-purpose language understanding and generation is acquired by training billions of model's parameters on massive…

[458]

GPT-4 technical report benchmark evaluation MMLU HumanEval GSM8K HellaSwag scores performance

29 May 2026. Score: 9.33/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20440237

Abstract: Abstract The rapid evolution of large language models (LLMs) has driven a transformative shift in artificial intelligence (AI), reshaping both research paradigms and practical applications. Distinguished from their predecessors by unprecedented scale and advanced capabilities, LLMs necessitate new frameworks for…

[457]

Open LLM leaderboard evaluation results Llama Mistral Qwen DeepSeek benchmark scores comparison

29 May 2026. Score: 8.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20440227

Abstract: Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks, since the release of ChatGPT in November 2022. LLMs' ability of general-purpose language understanding and generation is acquired by training billions of model's parameters on massive…

[456]

Holistic evaluation LLM benchmark results MMLU HumanEval GSM8K MATH SWE-bench accuracy scores table 2024

29 May 2026. Score: 9.00/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20440218

Abstract: Abstract The rapid evolution of large language models (LLMs) has driven a transformative shift in artificial intelligence (AI), reshaping both research paradigms and practical applications. Distinguished from their predecessors by unprecedented scale and advanced capabilities, LLMs necessitate new frameworks for…

[455]

Comprehensive benchmark evaluation comparing GPT-4, Claude, Gemini, LLaMA on MMLU HumanEval GSM8K MATH scores

29 May 2026. Score: 9.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20440206

Abstract: Abstract The rapid evolution of large language models (LLMs) has driven a transformative shift in artificial intelligence (AI), reshaping both research paradigms and practical applications. Distinguished from their predecessors by unprecedented scale and advanced capabilities, LLMs necessitate new frameworks for…

[454]

Large language model leaderboard benchmark comparison scores GPT-4o Claude-3 Gemini-1.5 LLaMA-3 performance 20

29 May 2026. Score: 1.00/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: In this paper, we explore the capabilities of state-of-the-art large language models (LLMs) such as GPT-4, GPT-4o, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, Llama 3, and Llama 3.1 in solving some selected undergraduate-level transportation engineering problems. We introduce TransportBench, a benchmark dataset…

[453]

How does the retrieval efficiency of Llama-3-8B-128K vary across context lengths 32K, 64K, and 128K on the MuS

29 May 2026. Score: 8.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20439595

Abstract: Processing long contexts presents a significant challenge for large language models (LLMs). While recent advancements allow LLMs to handle much longer contexts than before (e.g., 32K or 128K tokens), it is computationally expensive and can still be insufficient for many applications. Retrieval-Augmented Generation…

[452]

Does incorporating multi-turn reinforcement learning during training improve the nDTW score of vision-language

29 May 2026. Score: 8.83/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20439563

Abstract: The speaker-follower models have proven to be effective in vision-and-language navigation, where a speaker model is used to synthesize new instructions to augment the training data for a follower navigation model. However, in many of the previous methods, the generated instructions are not directly trained to…

[451]

How does the multi-turn RL approach in LongNav-R1 compare to single-turn RL baselines on the RxR-CE benchmark

29 May 2026. Score: 8.23/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20439559

Abstract: Recent advances in the areas of multimodal machine learning and artificial intelligence (AI) have led to the development of challenging tasks at the intersection of Computer Vision, Natural Language Processing, and Embodied AI. Whereas many approaches and previous survey pursuits have characterised one or two of…

[450]

Does increasing VLA parameter count from 7B to 13B improve long-horizon task completion rate and average rewar

29 May 2026. Score: 8.67/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20439548

Abstract: Incremental decision making in real-world environments is one of the most challenging tasks in embodied artificial intelligence. One particularly demanding scenario is Vision and Language Navigation (VLN) which requires visual and natural language understanding as well as spatial and temporal reasoning capabilities.…

[449]

What is the impact of VLA model scale (7B vs 13B) on object grounding accuracy and path completion rate in Lon

29 May 2026. Score: 9.00/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20439545

Abstract: Embodied AI is widely recognized as a cornerstone of artificial general intelligence (AGI) because it involves controlling embodied agents to perform tasks in the physical world. Building on the success of large language models (LLMs) and vision-language models (VLMs), a new category of multimodal models-referred to…

[448]

How does scaling VLA model size from 7B to 13B affect success rate and SPL on the R2R-CE benchmark when using

29 May 2026. Score: 8.83/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20439543

Abstract: Embodied Artificial Intelligence (Embodied AI) is crucial for achieving Artificial General Intelligence (AGI) and serves as a foundation for various applications (e.g., intelligent mechatronics systems, smart manufacturing) that bridge cyberspace and the physical world. Recently, the emergence of Multi-modal Large…

[447]

What are the computational efficiency tradeoffs of sparse attention mechanisms in large-scale language models

29 May 2026. Score: 8.67/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20439541

Abstract: Many real-world applications require the prediction of long sequence time-series, such as electricity consumption planning. Long sequence time-series forecasting (LSTF) demands a high prediction capacity of the model, which is the ability to capture precise long-range dependency coupling between output and input…