Assignee Research: Index of Papers

[464]

MMLU benchmark results multiple language models comparison GPT-4 Claude Gemini scores accuracy 2024

29 May 2026. Score: 9.33/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20440308

Abstract: Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a…

[463]

Gemini evaluation benchmark results MMLU HumanEval GSM8K MATH performance scores Google

29 May 2026. Score: 9.17/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20440306

Abstract: We introduce ChatGLM, an evolving family of large language models that we have been developing over time. This report primarily focuses on the GLM-4 language series, which includes GLM-4, GLM-4-Air, and GLM-4-9B. They represent our most capable models that are trained with all the insights and lessons gained from the…

[462]

Mistral evaluation benchmark results MMLU HumanEval GSM8K performance scores comparison

29 May 2026. Score: 8.83/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20440292

Abstract: Recent advancements in Natural Language Processing (NLP) technologies have been driven at an unprecedented pace by the development of Large Language Models (LLMs). However, challenges remain, such as generating responses that are misaligned with the intent of the question or producing incorrect answers. This paper…

[461]

LLaMA-3 evaluation benchmark results MMLU HumanEval GSM8K coding performance Meta AI

29 May 2026. Score: 9.00/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20440270

Abstract: Large language models (LLMs) have demonstrated remarkable performance on a variety of natural language tasks based on just a few examples of natural language instructions, reducing the need for extensive feature engineering. However, most powerful LLMs are closed-source or limited in their capability for languages…

[460]

Claude-3 evaluation benchmark MMLU HumanEval GSM8K MATH coding performance scores Anthropic

29 May 2026. Score: 9.17/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20440250

Abstract: We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being…

[459]

State of large language models benchmark evaluation GPT-4 Claude Gemini performance comparison 2024 2025

29 May 2026. Score: 9.00/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20440239

Abstract: Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks, since the release of ChatGPT in November 2022. LLMs' ability of general-purpose language understanding and generation is acquired by training billions of model's parameters on massive…

[458]

GPT-4 technical report benchmark evaluation MMLU HumanEval GSM8K HellaSwag scores performance

29 May 2026. Score: 9.33/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20440237

Abstract: The rapid evolution of large language models (LLMs) has driven a transformative shift in artificial intelligence (AI), reshaping both research paradigms and practical applications. Distinguished from their predecessors by unprecedented scale and advanced capabilities, LLMs necessitate new frameworks for…

[457]

Open LLM leaderboard evaluation results Llama Mistral Qwen DeepSeek benchmark scores comparison

29 May 2026. Score: 8.50/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20440227

Abstract: Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks, since the release of ChatGPT in November 2022. LLMs' ability of general-purpose language understanding and generation is acquired by training billions of model's parameters on massive…

[456]

Holistic evaluation LLM benchmark results MMLU HumanEval GSM8K MATH SWE-bench accuracy scores table 2024

29 May 2026. Score: 9.00/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20440218

Abstract: The rapid evolution of large language models (LLMs) has driven a transformative shift in artificial intelligence (AI), reshaping both research paradigms and practical applications. Distinguished from their predecessors by unprecedented scale and advanced capabilities, LLMs necessitate new frameworks for…

[455]

Comprehensive benchmark evaluation comparing GPT-4, Claude, Gemini, LLaMA on MMLU HumanEval GSM8K MATH scores

29 May 2026. Score: 9.50/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20440206

Abstract: The rapid evolution of large language models (LLMs) has driven a transformative shift in artificial intelligence (AI), reshaping both research paradigms and practical applications. Distinguished from their predecessors by unprecedented scale and advanced capabilities, LLMs necessitate new frameworks for…

[454]

Large language model leaderboard benchmark comparison scores GPT-4o Claude-3 Gemini-1.5 LLaMA-3 performance 20

29 May 2026. Score: 1.00/10. Verification: L1, Literature synthesis.

Abstract: In this paper, we explore the capabilities of state-of-the-art large language models (LLMs) such as GPT-4, GPT-4o, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, Llama 3, and Llama 3.1 in solving some selected undergraduate-level transportation engineering problems. We introduce TransportBench, a benchmark dataset…

[453]

How does the retrieval efficiency of Llama-3-8B-128K vary across context lengths 32K, 64K, and 128K on the MuS

29 May 2026. Score: 8.50/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20439595

Abstract: Processing long contexts presents a significant challenge for large language models (LLMs). While recent advancements allow LLMs to handle much longer contexts than before (e.g., 32K or 128K tokens), it is computationally expensive and can still be insufficient for many applications. Retrieval-Augmented Generation…

[452]

Does incorporating multi-turn reinforcement learning during training improve the nDTW score of vision-language

29 May 2026. Score: 8.83/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20439563

Abstract: The speaker-follower models have proven to be effective in vision-and-language navigation, where a speaker model is used to synthesize new instructions to augment the training data for a follower navigation model. However, in many of the previous methods, the generated instructions are not directly trained to…

[451]

How does the multi-turn RL approach in LongNav-R1 compare to single-turn RL baselines on the RxR-CE benchmark

29 May 2026. Score: 8.23/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20439559

Abstract: Recent advances in the areas of multimodal machine learning and artificial intelligence (AI) have led to the development of challenging tasks at the intersection of Computer Vision, Natural Language Processing, and Embodied AI. Whereas many approaches and previous survey pursuits have characterised one or two of…

[450]

Does increasing VLA parameter count from 7B to 13B improve long-horizon task completion rate and average rewar

29 May 2026. Score: 8.67/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20439548

Abstract: Incremental decision making in real-world environments is one of the most challenging tasks in embodied artificial intelligence. One particularly demanding scenario is Vision and Language Navigation (VLN) which requires visual and natural language understanding as well as spatial and temporal reasoning capabilities.…

[449]

What is the impact of VLA model scale (7B vs 13B) on object grounding accuracy and path completion rate in Lon

29 May 2026. Score: 9.00/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20439545

Abstract: Embodied AI is widely recognized as a cornerstone of artificial general intelligence (AGI) because it involves controlling embodied agents to perform tasks in the physical world. Building on the success of large language models (LLMs) and vision-language models (VLMs), a new category of multimodal models, referred to…

[448]

How does scaling VLA model size from 7B to 13B affect success rate and SPL on the R2R-CE benchmark when using

29 May 2026. Score: 8.83/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20439543

Abstract: Embodied Artificial Intelligence (Embodied AI) is crucial for achieving Artificial General Intelligence (AGI) and serves as a foundation for various applications (e.g., intelligent mechatronics systems, smart manufacturing) that bridge cyberspace and the physical world. Recently, the emergence of Multi-modal Large…

[447]

What are the computational efficiency tradeoffs of sparse attention mechanisms in large-scale language models

29 May 2026. Score: 8.67/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20439541

Abstract: Many real-world applications require the prediction of long sequence time-series, such as electricity consumption planning. Long sequence time-series forecasting (LSTF) demands a high prediction capacity of the model, which is the ability to capture precise long-range dependency coupling between output and input…

[446]

How does the sample efficiency of multi-turn RL for long-horizon VLN-CE tasks compare to imitation learning ba

29 May 2026. Score: 7.83/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20439535

Abstract: The rapid evolution of large language models (LLMs) has driven a transformative shift in artificial intelligence (AI), reshaping both research paradigms and practical applications. Distinguished from their predecessors by unprecedented scale and advanced capabilities, LLMs necessitate new frameworks for…

[445]

Does layer-wise score aggregation improve SuperGLUE task accuracy over last-layer baselines when evaluated on

29 May 2026. Score: 9.00/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20439531

Abstract: Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate…

[444]

What is the impact of routing strategy choice on the throughput and accuracy trade-off when scaling sparse MoE

29 May 2026. Score: 9.33/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20439529

Abstract: The rapid evolution of large language models (LLMs) has driven a transformative shift in artificial intelligence (AI), reshaping both research paradigms and practical applications. Distinguished from their predecessors by unprecedented scale and advanced capabilities, LLMs necessitate new frameworks for…

[443]

How does COCO-DR's zero-shot recall@5 on NQ and TriviaQA compare to supervised dense retrievers like DPR and C

29 May 2026. Score: 8.67/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20439523

Abstract: Effective information retrieval (IR) from vast datasets relies on advanced techniques to extract relevant information in response to queries. Recent advancements in dense retrieval have showcased remarkable efficacy compared to traditional sparse retrieval methods. To further enhance retrieval performance, knowledge…

[442]

Can AlphaX framework be adapted to improve sample efficiency in neural architecture search by incorporating un

29 May 2026. Score: 8.67/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20439521

Abstract: Neural Architecture Search (NAS) has shown great success in automating the design of neural networks, but the prohibitive amount of computations behind current NAS methods requires further investigations in improving the sample efficiency and the network evaluation cost to get better results in a shorter time. In…

[441]

Can AlphaX framework be adapted to improve sample efficiency in neural architecture search by incorporating un

29 May 2026. Score: 6.73/10. Verification: L2, Source-grounded claims.

Abstract: Deep convolutional neural networks have performed remarkably well on many Computer Vision tasks. However, these networks are heavily reliant on big data to avoid overfitting. Overfitting refers to the phenomenon when a network learns a function with very high variance such as to perfectly model the training data.…

[440]

How does the choice of LoRA rank in cross-attention layers influence the trade-off between FVD and LPIPS score

29 May 2026. Score: 7.17/10. Verification: L2, Source-grounded claims.

Abstract: We present W.A.L.T, a transformer-based approach for photorealistic video generation via diffusion modeling. Our approach has two key design decisions. First, we use a causal encoder to jointly compress images and videos within a unified latent space, enabling training and generation across modalities. Second, for…