Papers
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How do Llama3 and DeepSeek R1 compare in code vulnerability classification accuracy when evaluated on the Big-Vul dataset with standardized CWE taxonomies? 12 claims were extracted from source literature; 5 were…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: What is the impact of synthetic data augmentation on the inference efficiency and false positive rates of DeepSeek Coder in vulnerability detection benchmarks? 11 claims were extracted from source literature; 5…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: Does synthetic data augmentation improve the robustness of Code Llama and DeepSeek Coder against obfuscated code patterns compared to models trained solely on Big-Vul? 0 claims were extracted from source…
Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: How do inference latency and throughput metrics differ between Llama3.1 and Mistral 7B when processing complex genomic sequence classifications under adversarial noise? 10 claims were extracted from source…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does synthetic code vulnerability augmentation affect the cross-dataset generalization accuracy of Code Llama compared to training on curated Big-Vul subsets? 10 claims were extracted from source literature;…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the impact of scientific domain-specific pre-training on the safety alignment scores of LLMs when evaluated on multimodal molecular representation tasks? 0 claims were extracted from source literature; 0…
Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: What is the correlation between inference latency and vulnerability classification accuracy for open-weight LLMs processing obfuscated C/C++ code? 7 claims were extracted from source literature; 6 were…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: To what extent does chain-of-thought prompting improve the robustness of Codestral against syntax-preserving semantic obfuscation in vulnerability detection tasks? 6 claims were extracted from source literature;…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the robustness of Llama3.1 compare to Mistral 7B in detecting code vulnerabilities when subjected to adversarial syntax perturbations? 4 claims were extracted from source literature; 4 were independently…
Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: How does adversarial code obfuscation affect the vulnerability detection F1-score of Llama3 versus DeepSeek R1 on the Big-Vul dataset? 9 claims were extracted from source literature; 9 were independently verified…
Abstract: This report synthesises findings from 3 peer-reviewed papers addressing the following research question: What is the robustness comparison between Llama3.1 and Mistral 7B with and without RAG integration when evaluated on adversarial or noisy cyber-physical system battery management datasets, measured. 8 claims were…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does the cross-domain adaptability of Llama3.1 versus Mistral 7B with RAG integration perform when fine-tuned on battery management datasets and then evaluated on other energy system anomaly. 8 claims were…
Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: To what extent does chain-of-thought prompting improve the classification robustness of open-weight LLMs against adversarial code obfuscation techniques in static analysis benchmarks? 12 claims were extracted from…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: What is the correlation between context window size and false positive rates when evaluating DeepSeek R1 and Llama3 on long-sequence vulnerable code patterns in the Big-Vul dataset? 10 claims were extracted from…
Abstract: This report synthesises findings from 2 peer-reviewed papers addressing the following research question: How does the performance gap between Llama3 and Codestral in vulnerability classification (F1-score) vary when evaluated on Big-Vul samples with different programming languages (e.g., C vs. Java)? 10 claims were…
Abstract: This report synthesises findings from 1 peer-reviewed paper addressing the following research question: How does the retrieval-augmented generation (RAG) integration affect the inference latency and memory efficiency of Llama3.1 compared to Mistral 7B on cyber-physical system battery management. 8 claims were…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: What is the impact of model size (e.g., 7B vs 70B) on the robustness of Llama3 and Codestral in classifying vulnerabilities in Big-Vul, measured by F1-score degradation under increasing levels of. 8 claims were…
Abstract: This report synthesises findings from 5 peer-reviewed papers addressing the following research question: How does the performance of DeepSeek R1 on vulnerability detection tasks degrade when fine-tuned on code with varying cyclomatic complexity levels, as evaluated by F1-score and false negative rate on. 8 claims…
Abstract: This report synthesises findings from 1 peer-reviewed paper addressing the following research question: How does the F1-score of Llama3 and Codestral change when classifying vulnerabilities in Big-Vul samples with different levels of semantic-aware obfuscation compared to syntactic-only obfuscation? 7 claims were…
Abstract: This report synthesises findings from 3 peer-reviewed papers addressing the following research question: What is the computational efficiency (inference time and memory usage) of DeepSeek R1 when detecting vulnerabilities in high-cyclomatic-complexity code versus low-complexity code, as measured on the. 7 claims were…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does curriculum learning affect the inference efficiency of large multimodal models when evaluated on the MedQA benchmark compared to random data ordering? 10 claims were extracted from source literature; 10…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does adversarial training against data poisoning impact the out-of-domain generalization of CLIP-based models on non-standard benchmarks like ImageNetV2 or ImageNet-Sketch? 4 claims were extracted from source…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How do different alignment strategies in multimodal models influence reasoning performance when evaluated on the BraTS benchmark with varying levels of image-text sparsity? 11 claims were extracted from source…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: What is the impact of cross-domain pre-training on the segmentation accuracy of multimodal models when evaluated on the BraTS benchmark versus other medical imaging datasets? 11 claims were extracted from source…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does curriculum-based multi-task learning affect the inference throughput of large multimodal models on sparse medical image-text pairs compared to traditional single-task learning methods? 0 claims were…