Papers
Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: What is the impact of quantization techniques (e.g., 4-bit, 8-bit) on the inference efficiency and accuracy of SecLM-fine-tuned Llama3 for text classification tasks on edge devices with limited. Abstract The rapid…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does the inference latency of Mistral-Large-2 scale with input sequence length on MBPP code completion tasks. We release Code Llama, a family of large language models for code based on Llama 2 providing…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the reasoning accuracy of Mistral-Large-2 on GSM8K compared to other 7B parameter models. Large Language Models (LLMs) have demonstrated remarkable versatility in recent years, offering potential…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the impact of model size and training data on Mistral-Large-2's LiveCodeBench performance, and how does it scale with increasing parameter count. In this report, we introduce Qwen2.5, a comprehensive…
Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: How does Mistral-Large-2's inference latency scale across different sequence lengths on ARC-Challenge questions. We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the human evaluation score for code quality and functional correctness of Mistral-Large-2 generated solutions on MBPP compared to ground truth implementations. Several Deep Learning (DL)-based techniques…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does fine-tuning Mistral-Large-2 on domain-specific math datasets (e.g., Math-PT) improve its MATH benchmark scores compared to zero-shot or few-shot evaluation. The use of large language models (LLMs) for…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What is the pass@1 accuracy of Mistral-Large-2 on the MBPP benchmark compared to other state-of-the-art code generation models. We introduce self-invoking code generation, a new task designed to evaluate the…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the performance of Qwen3-235B degrade under PPTC-R adversarial user instructions compared to standard instructions. The growing dependence on Large Language Models (LLMs) for finishing user instructions…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the inference efficiency (tokens/sec or latency) of Mistral-Large-2 when solving MATH problems compared to smaller specialized math-focused models. Large language models (LLMs) have been explored in a…
Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: How does the inference efficiency of Mistral-Large-2 on GSM8K benchmark change with different context window sizes. In this report, we introduce the Gemini 1.5 family of models, representing the next generation of…
Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: How does Mistral-Large-2's performance on MATH vary across different languages when evaluated on multilingual math benchmarks like Math-PT. Large Language Models (LLMs) have demonstrated remarkable versatility in…
Abstract: This report synthesises findings from 6 peer-reviewed papers addressing the following research question: How does the inference efficiency of Qwen3-235B vary across different programming languages in the LiveCodeBench evaluation. In this work, we present Qwen3, the latest version of the Qwen model family. Qwen3…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the performance gap between monolingual Portuguese LLMs and multilingual models (e.g., Qwen2.5-72B) on MATH-PT, and does this gap persist when evaluating on other non-English reasoning. In this work, we…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does the inference efficiency of Qwen3-235B vary when evaluated on SWE-bench Verified tasks with different computational constraints. In this work, we present Qwen3, the latest version of the Qwen model…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the impact of training data contamination on Qwen3-235B's performance across different model sizes on SWE-bench Verified. Abstract The rapid evolution of large language models (LLMs) has driven a…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: Can explanation methods that perform well on traditional accuracy metrics maintain similar performance on the human attention explanation quality metric. Multilayer neural networks trained with the…
Abstract: This report synthesises findings from 3 peer-reviewed papers addressing the following research question: How does the inference efficiency (e.g., tokens per second) of Qwen2.5-72B compare to other state-of-the-art models (e.g., Mistral-7B, Llama3-8B) when processing MATH-PT problems. We introduce MiniMax-01 series,…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How do different saliency explanation methods compare in terms of human interpretability when evaluated on the proposed human attention benchmark across vision and language domains. Multilayer neural networks…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the correlation between computational efficiency (FLOPs, inference time) and explanation quality scores on the human attention benchmark. In this paper we report the set-up and results of the Multimodal…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does the performance of Qwen2.5-72B on HumanEval-V compare to its performance on standard code generation benchmarks like HumanEval and MBPP. In this work, we present Qwen3, the latest version of the Qwen…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: Can the human attention benchmark be used to improve the training of attention-based models through multi-task learning frameworks. Deep convolutional neural networks have performed remarkably well on many…
Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: What is the impact of using multi-layer human attention masks versus single-layer attention mechanisms on explanation quality scores. Deep convolutional neural networks have performed remarkably well on many…
Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: What are the cross-domain reasoning capabilities of DeepSeek-V4-Pro when evaluated on the ARC and HellaSwag benchmarks. Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the human attention benchmark compare to existing synthetic attention evaluation metrics in terms of correlation with model performance on downstream tasks. Many computational models of visual attention…