Papers
Abstract: This report synthesises findings from 1 peer-reviewed paper addressing the following research question: How do cross-model robustness metrics vary for Qwen3-235B versus Llama2-70B when subjected to adversarial attacks on code generation tasks? The emergence of Transformer-based Large Language Models (LLMs) has…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: To what extent do distributionally robust optimization techniques improve the alignment of Dice score and Hausdorff distance metrics with human evaluation in vision-language segmentation models? A joint…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the scaling of multimodal context (varying the ratio of text to diagram information) affect the robustness of DeepSeek-R1's iterative code repair performance across different programming. Code repair is…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does the correlation between synthetic segmentation metrics and human rater agreement differ when replacing pure visual encoders with vision-language models in multimodal medical image benchmarks? Determining…
Abstract: This report synthesises findings from 3 peer-reviewed papers addressing the following research question: What is the throughput difference between FP8 and INT4 quantized Llama-3.1-70B on HumanEval when deployed on A100 vs. H100 GPUs, and is the accuracy degradation consistent across both hardware. Large language…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does INT4 quantization impact the zero-shot code generation performance of Llama-3.1-70B compared to smaller variants (e.g., 8B) on HumanEval, and does the trade-off scale with model size? Recent progress in…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How robust are the code adaptation capabilities of DeepSeek-R1, CodeLlama, and WizardCoder when evaluated on out-of-distribution MLOps tasks, and how do their performance metrics (e.g., pass@k,. This paper…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does the inference latency of DeepSeek-R1 compare to CodeLlama and WizardCoder when performing few-shot code generation on HumanEval-V, and what is the accuracy trade-off at different latency. Large language…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does the integration of dynamic code execution traces with static analysis visualizations in LLaVul impact its vulnerability classification accuracy on the Big-Vul dataset compared to static-only. Increasing…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does instruction tuning with security-specific code examples affect Llama3's zero-shot vulnerability detection accuracy on Big-Vul compared to general code instruction tuning? One of the most impressive…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What is the correlation between model size and inference latency for Codestral when performing severity-level classification on C and C++ code in Big-Vul? Context: Traditional software security analysis methods…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: Does increasing Codestral's parameter count improve robustness against obfuscated code variants in vulnerability detection benchmarks compared to smaller variants? As large language models (LLMs) are increasingly…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does scaling Codestral from 7B to 33B parameters affect false positive rates in vulnerability detection across the Big-Vul dataset? Software vulnerabilities can cause numerous problems, including crashes,…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the performance gap between retrieval-augmented prompting and fine-tuning scale when evaluating DeepSeek-V3 on cross-language vulnerability datasets beyond the C/C++ focus of Big-Vul? As Large Language…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: To what extent does the semantic similarity metric used for retrieving few-shot examples impact the false positive rate of DeepSeek-V3 on the Big-Vul benchmark compared to random example selection? Deep…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: Does the Loss-Free Balancing strategy in DeepSeek-V3 maintain consistent performance stability across different programming languages in the GPQA Diamond domain when evaluated using the MBPP benchmark? We present…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: What is the trade-off between model size and inference latency for Llama3, Codestral, and DeepSeek-R1 when classifying software vulnerabilities in the Big-Vul dataset? This study investigates the performance of…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How do Llama3, Codestral, and DeepSeek-R1 compare in cross-language generalization for vulnerability detection when fine-tuned on a subset of Big-Vul and evaluated on unseen programming languages? Large Language…
Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: What is the correlation between model scaling and consistency metrics for DeepSeek-V3 when evaluated on out-of-distribution reasoning benchmarks? Recently, there has been high demand for deploying DeepSeek-R1 and V3…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the impact of varying dataset sizes (e.g., 1K, 5K, 10K samples) on the pass@1 accuracy of fine-tuned Llama-3.1-8B, Mistral-7B-v0.1, and Qwen3-8B for Romanized Nepali tasks, and how does this. Romanized…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does increasing parameter count from 7B to 33B in DeepSeek-V3 affect accuracy variance on GPQA Diamond under synthetic distribution shifts? In electronic trading markets, limit order books (LOBs) provide…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the performance of fine-tuned Llama-3.1-8B, Mistral-7B-v0.1, and Qwen3-8B on Romanized Nepali generalize to other low-resource language variants (e.g., Romanized Hindi or Marathi) when. Romanized Nepali,…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does fine-tuning Codestral on taxonomy-aligned vulnerability datasets affect zero-shot repair success rates on Big-Vul compared to fine-tuning on general code corpora? Within the realm of software…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the impact of dataset alignment on the false positive rate of Codestral when evaluating vulnerability severity predictions on the SWCC benchmark? Static Application Security Testing (SAST) tools play a…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: Does optimizing the diversity-weight in Vendi-RAG improve FLAN-T5-xl robustness against syntactic distractors in HANS compared to standard relevance-based RAG baselines? Retrieval-augmented generation (RAG)…