Papers
Abstract: This report synthesises findings from 2 peer-reviewed papers addressing the following research question: What is the inference efficiency difference between Codestral-7B and Llama3-70B when fine-tuned on C/C++ security vulnerability detection tasks? Software vulnerabilities pose significant risks to the security and…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: What is the effect of multi-agent context engineering workflows on the reasoning accuracy of LLMs in niche domain code generation tasks measured by ReCode? Large Language Models (LLMs) have garnered remarkable…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How robust are fine-tuned Codestral-7B and Llama3-70B models when evaluated on cross-domain code generation tasks in low-resource languages? Pre-trained models for Natural Languages (NL) like BERT and GPT have…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the cross-domain generalization accuracy of fine-tuned Codestral-7B compare to Llama3-70B on unseen programming languages beyond Python for security vulnerability classification? Finetuning language…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does semantic retrieval augmentation impact pass@k scores for LLMs on multi-file code generation benchmarks compared to standard context window extension? Large Language Models (LLMs) showcase impressive…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: What is the impact of different inference optimization techniques on the latency and accuracy trade-off between Llama3-70B and Codestral-34B for SIMCOPILOT's infill tasks? Deep ensemble learning has been shown to…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: Can fine-tuning Llama3-70B with retrieval augmentation on a synthetic multi-file vulnerability dataset improve its classification performance on the CodeXGLUE security subset, and how does this. A detailed study…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does the retrieval-augmented performance of Llama3-70B on the CodeXGLUE security subset compare to other state-of-the-art LLMs like Claude 3 Opus when evaluated on precision, recall, and F1-score? Anomaly…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the accuracy difference between retrieval-augmented Gemini 1.5 Pro and Llama3-70B on the CodeXGLUE security subset when evaluated with few-shot learning versus zero-shot learning? The rapid expansion of…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the impact of quantization on the throughput-accuracy trade-off for fine-tuned SecLM models deployed on resource-constrained hardware? As the rapid scaling of large language models (LLMs) poses…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the inference latency of SecLM variants scale with model size when processing multimodal inputs on edge devices compared to cloud GPUs? With the breakthroughs in deep learning, the recent years have…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the human evaluation accuracy score for code correctness of Mistral-Large-2 generated solutions on the MBPP benchmark compared to reference implementations? We introduce self-invoking code generation, a…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does the code generation quality of Mistral-Large-2 on MBPP benchmark compare to ground truth implementations when evaluated by human reviewers on functional correctness and code quality metrics? The creation…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: Does the improved post-training strategy in Qwen2.5 yield higher alignment scores on instruction-following benchmarks compared to models trained with equivalent data but earlier alignment techniques? Despite…
Abstract: This report synthesises findings from 2 peer-reviewed papers addressing the following research question: How does Mistral-Large-2 perform on the original MBPP benchmark compared to its performance on the self-invoking MBPP Pro variant? We introduce self-invoking code generation, a new task designed to evaluate the…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: Do smaller specialized math models achieve higher accuracy-per-token than large general models like Mistral-Large-2 when evaluated under constrained compute budgets on competitive math datasets? Large Language…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What is the correlation between base problem accuracy and complex problem success rates for code generation models on the HumanEval Pro benchmark? We introduce self-invoking code generation, a new task designed…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How do Mistral-Large-2 generated code solutions on MBPP compare to ground truth implementations in terms of functional correctness and code quality as measured by human evaluation scores? In recent years,…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What is the inference efficiency degradation of Qwen3-235B under PPTC-R's sentence-level attacks compared to baseline performance metrics? This chapter introduces the concept of adversarial attacks on image…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the impact of model inference efficiency on the correlation between human attention prediction accuracy and downstream task performance in large-scale vision models? Object detection is one of the most…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: To what extent does the thinking mode in Qwen3 improve performance on multi-step reasoning tasks in SWE-bench Verified compared to non-thinking mode, and how does this trade-off affect inference. Small language…
Abstract: This report synthesises findings from 7 peer-reviewed papers addressing the following research question: How do monolingual Portuguese LLMs compare to multilingual models like Qwen2.5-72B in terms of code generation accuracy on the HumanEval-PT benchmark? In this work, we present Qwen3, the latest version of the Qwen…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does the inference efficiency of Qwen3-235B compare to other dense and MoE-based LLMs of similar scale on SWE-bench Verified tasks under constrained memory budgets? Long-term memory is a cornerstone of human…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does inference efficiency (latency and throughput) vary across Qwen3-235B model sizes when processing SWE-bench Verified tasks, and does training data contamination exacerbate or mitigate. The issue-resolving…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How do multi-layer attention masks improve robustness in multimodal models compared to single-layer attention when evaluated on cross-domain benchmarks like VQA or MM-ReAct? People with hearing impairments are…