Papers
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does multimodal context (text + code diagrams) affect the iterative code repair performance of DeepSeek-R1 on FeedbackEval compared to text-only context, measured by repair success rate and token. Code repair…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the token efficiency of DeepSeek-R1 compare to Claude-3 when performing few-shot code generation on HumanEval, measured by pass@1 accuracy per token consumed. How far are Large Language Models (LLMs) in…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How does INT4 quantization affect the zero-shot code generation performance of Llama-3.1 models on HumanEval, and does this trade-off persist across different hardware configurations (e.g., A100 vs.. Quantization…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the trade-off between inference latency and code generation accuracy for DeepSeek-R1 versus other LLMs (e.g., CodeLlama, WizardCoder) when evaluated on HumanEval-V and MBPP benchmarks. This paper explores…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What is the impact of context window scaling on the security vulnerability detection performance of DeepSeek-R1 compared to other models across different code lengths and complexity levels. Many studies have…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does fine-tuning Llama3 with the Big-Vul dataset's vulnerability classification annotations impact its performance on the FeedbackEval benchmark compared to the base model. Detecting toxic content using…
Abstract: This report synthesises findings from 1 peer-reviewed paper addressing the following research question: What is the efficiency-accuracy trade-off when deploying DeepSeek-R1 and Claude in secure code review pipelines, measured by inference latency and vulnerability detection F1-scores on the Big-Vul. Large language…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: To what extent does multimodal training with static code analysis visualizations improve Codestral's ability to classify vulnerabilities in the Big-Vul dataset compared to text-only training. Increasing…
Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: How does instruction tuning with code security examples improve Llama3's zero-shot performance on the Big-Vul dataset compared to general code instruction tuning. Large Language Models (LLMs) have demonstrated…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What is the impact of model size scaling (e.g., 7B vs 33B) on Codestral's vulnerability classification accuracy across different severity levels in Big-Vul. While automated vulnerability detection techniques have…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does few-shot prompting with vulnerability taxonomy examples affect DeepSeek-V3's precision on Big-Vul compared to fine-tuning approaches. Few-shot prompting has emerged as a practical alternative to…
Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: How does the auxiliary-loss-free load balancing strategy in DeepSeek-V3 influence model performance stability on code generation tasks in the GPQA Diamond domain compared to traditional MoE load. For…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the impact of model size scaling on the pass@1 accuracy of Llama3, Codestral, and DeepSeek-R1 when evaluating vulnerability classification on the Big-Vul dataset. Recent advancements in generative AI have…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the inclusion of multimodal context (e.g., commit messages, code diffs) affect the vulnerability detection accuracy of LLMs compared to text-only file context on the Big-Vul dataset. Detecting…
Abstract: This report synthesises findings from 6 peer-reviewed papers addressing the following research question: How does the performance of DeepSeek-R1 compare to Claude on SWE-bench Verified across different programming languages when provided with issue-specific file context versus baseline context-free. The evaluation…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: To what extent does scaling the model size of DeepSeek-V3 from 7B to 33B parameters improve its robustness to distribution shifts in GPQA Diamond questions, as evaluated by accuracy and consistency. Foundation…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the impact of fine-tuning on the pass@1 accuracy of Llama-3.1-8B, Mistral-7B-v0.1, and Qwen3-8B for Romanized Nepali language tasks using the same bilingual dataset. Romanized Nepali, the Nepali language…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How do Llama-3.1-8B, Mistral-7B-v0.1, and Qwen3-8B compare in terms of inference efficiency (throughput and latency) when generating code on MBPP under constrained hardware conditions. Romanized Nepali, the…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the impact of fine-tuning Codestral on taxonomy-aligned vulnerability datasets compared to general code datasets, as measured by repair success rates on the Big-Vul dataset and the SWCC. Context:…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does the integration of multimodal inputs (e.g., AST + control flow graphs) affect the vulnerability repair capabilities of DeepSeek-R1 versus Codestral, measured by accuracy and throughput on. With the…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does varying the diversity-weight parameter in Vendi-RAG affect the performance of FLAN-T5-xl on adversarial benchmarks like ANLI and HANS, as measured by accuracy and F1-score. Retrieval-augmented generation…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How do multimodal models (e.g., visual+code) compare to text-only LLMs in solving self-invoking code generation tasks on HumanEval Pro and MBPP Pro, measured by both accuracy and inference latency at. We…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does the Performance-Efficiency Ratio scale with model size (0.5B to 13B parameters) when tested on the original vs. progressively harder versions of HumanEval and MBPP benchmarks under the same. We introduce…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does the performance of Vendi-RAG with adaptive diversity-weight tuning vary across different domains (e.g., code generation with HumanEval vs. multimodal reasoning with MMQA) when measured by. Understanding…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does the diversity-weight parameter in Vendi-RAG influence the model's performance on the ELI5 dataset when evaluated using human judgments for factuality and coherence, compared to automated. While humans…