Papers
Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: What is the impact of training data language distribution on the zero-shot vulnerability classification performance of DeepSeek-V3 across non-C/C++ programming languages. The rapid evolution of large…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: To what extent does the choice of embedding model for semantic similarity metrics impact the reasoning accuracy of large language models on few-shot logical deduction tasks. The rapid evolution of large…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: Does optimizing inference efficiency through dynamic few-shot example selection based on semantic similarity degrade multimodal model performance on cross-domain visual question answering benchmarks. The…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does semantic similarity-based few-shot example retrieval compare to random selection in reducing false positive rates for code vulnerability detection models on the Big-Vul benchmark. This survey paper…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: To what extent does removing time constraints improve the accuracy of DeepSeek R1 on the Big-Vul dataset compared to Codestral, and is this performance gain consistent across different vulnerability. Since the…
Abstract: This report synthesises findings from 3 peer-reviewed papers addressing the following research question: What is the correlation between tokenization efficiency and inference latency for Romanized Nepali tasks across Llama-3.1, Mistral, and Qwen architectures. Romanized Nepali, the Nepali language written in the…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does instruction tuning data quality versus quantity affect pass@1 accuracy for low-resource Romanized scripts in 7B-8B parameter LLMs. Rapid developments in large language models (LLMs) have created new…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the impact of different fine-tuning strategies (e.g., multi-task learning vs. sequential fine-tuning) on the robustness of Codestral in detecting vulnerabilities in low-resource programming. The…
Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: How does scaling DeepSeek-V3 from 7B to 33B parameters impact robustness accuracy on GPQA Diamond under synthetic distribution shifts. The rapid evolution of large language models (LLMs) has driven a…
Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: Does increasing the parameter scale of DeepSeek-V3 improve cross-domain generalization metrics on synthetic distribution shift benchmarks compared to smaller variants. The rapid evolution of large…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: To what extent does taxonomy-aligned vulnerability fine-tuning improve zero-shot generalization on out-of-distribution code repair benchmarks like QuixBugs versus general code corpora. As Large Language Models…
Abstract: This report synthesises findings from 5 peer-reviewed papers addressing the following research question: What is the correlation between Code Property Graph representation fidelity and the classification accuracy of GCN-based false positive predictors across diverse SAST tools. Software vulnerabilities pose…
Abstract: This report synthesises findings from 3 peer-reviewed papers addressing the following research question: Does Vendi-RAG's diversity optimization improve FLAN-T5-xl accuracy on the HANS syntactic distractor subset compared to standard BM25 retrieval. Deep learning (DL) is revolutionizing evidence-based…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does the inference latency of Vendi-RAG scale with context window size on the NaturalQuestions benchmark relative to dense retrieval baselines. A major obstacle to the widespread adoption of neural retrieval…
Abstract: This report synthesises findings from 7 peer-reviewed papers addressing the following research question: How do multimodal models like DeepSeek R1 generalize to out-of-domain code repair tasks compared to Codestral when evaluated on cross-language benchmarks like VulDeePecker and Devign. Large language models (LLMs)…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: Can energy-to-token efficiency be optimized without degrading robustness scores on adversarial datasets like HANS when tuning diversity parameters in retrieval-augmented generation. Deep learning (DL) is…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: To what extent does the data-centric innovation approach improve the throughput of DeepSeek R1 compared to Codestral when repairing vulnerabilities in large codebases with varying code lengths. As Large Language…
Abstract: This report synthesises findings from 5 peer-reviewed papers addressing the following research question: What is the impact of varying Vendi-RAG diversity weights on the trade-off between answer accuracy and energy consumption for FLAN-T5-xl across natural language inference benchmarks. Large Language Models (LLMs)…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does the energy-per-token metric correlate with latency and throughput variations in FLAN-T5-xl when applying diversity-weighted RAG on the ANLI and HANS datasets. This article presents a comprehensive and…
Abstract: This report synthesises findings from 7 peer-reviewed papers addressing the following research question: How does diversity-weighted retrieval in RAG pipelines affect FLAN-T5-xl robustness against syntactic perturbations on the HANS benchmark compared to standard dense retrieval. The rapid advancement of Large…
Abstract: This report synthesises findings from 7 peer-reviewed papers addressing the following research question: What is the impact of varying the diversity-weight parameter in Vendi-RAG on the zero-shot accuracy of FLAN-T5-xl across the three rounds of the ANLI adversarial inference dataset. Deep learning (DL) is…
Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: How transferable is LogicScore's evaluation framework when applied to multimodal RAG systems that incorporate both textual and visual information. In this paper we report the set-up and results of the Multimodal…
Abstract: This report synthesises findings from 7 peer-reviewed papers addressing the following research question: What impact does the integration of LogicScore have on the computational efficiency of RAG systems during inference, particularly in low-resource settings. Large Language Models (LLMs) showcase impressive…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does hybrid retrieval combining BM25 and dense vectors impact code generation accuracy and inference latency on the HumanEval benchmark compared to single-retriever approaches. The rapid evolution of…
Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: What is the throughput degradation of multi-vector retrieval architectures in RAG pipelines when scaling knowledge bases for complex reasoning tasks on GSM8K. The rapid evolution of large language models…