Papers
Abstract: This report synthesises findings from 2 peer-reviewed papers addressing the following research question: What is the throughput and latency trade-off between Codestral-7B and Codestral-70B when classifying vulnerabilities in Big-Vul under varying levels of parallelized inference and model quantization? 5 claims were…
Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: How does the F1-score degradation under synthetic obfuscation compare between Llama3-7B and Llama3-70B when fine-tuned on domain-specific vulnerability classification tasks (e.g., using SARD or OWASP. 9 claims…
Abstract: This report synthesises findings from 7 peer-reviewed papers addressing the following research question: What is the correlation between cyclomatic complexity levels in training data and the false negative rate of Deepseek R1 on the Big-Vul vulnerability detection benchmark? 8 claims were extracted from source…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does the vulnerability detection F1-score of Deepseek R1 vary when fine-tuned on code subsets stratified by cyclomatic complexity using the Big-Vul dataset? 9 claims were extracted from source literature; 9…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does the vulnerability detection accuracy of Llama3 and Codestral degrade under adversarial code obfuscation techniques compared to standard Big-Vul samples? 12 claims were extracted from source literature; 7…
Abstract: This report synthesises findings from 7 peer-reviewed papers addressing the following research question: What is the correlation between code structural complexity and false positive rates in Deepseek R1's vulnerability detection performance on the Big-Vul benchmark? 11 claims were extracted from source literature; 4…
Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: How does the inference latency of Deepseek R1 scale with increasing cyclomatic complexity when evaluating code vulnerability datasets like Big-Vul? 8 claims were extracted from source literature; 7 were…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the memory footprint of Deepseek R1 during vulnerability analysis compare between high-complexity and low-complexity code samples in standardized evaluations? 8 claims were extracted from source…
Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: How do different alignment strategies in multimodal models impact inference throughput in low-resource settings when evaluated on BRATS with simulated versus real MR scans? 7 claims were extracted from source…
Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: What is the comparative robustness of multimodal reasoning in language models with different alignment strategies when applied to cross-domain medical imaging tasks, as measured by segmentation. 7 claims were…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How do different multi-scale feature fusion strategies in 3D CNNs affect the robustness of brain lesion segmentation models across heterogeneous medical imaging datasets beyond BRATS? 9 claims were extracted from…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the impact of varying patch sizes and dense training schemes on the segmentation accuracy and computational efficiency of the 11-layer 3D CNN when evaluated on BRATS and other volumetric. 11 claims were…
Abstract: This report synthesises findings from 5 peer-reviewed papers addressing the following research question: What is the impact of curriculum-based multi-task learning on the accuracy of large multimodal models in cross-domain medical image-text pair tasks, as measured by the RadNet benchmark? 6 claims were extracted…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does curriculum-based multi-task learning affect the alignment between image and text embeddings in sparse medical datasets compared to single-task learning, as evaluated using the CLIP score on. 11 claims…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the performance of the proposed 3D CNN with fully connected CRF for brain lesion segmentation compare to transformer-based architectures on the BRATS benchmark in terms of accuracy and. 10 claims were…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does the inference throughput of curriculum-based multi-task learning compare to single-task learning on sparse medical image-text pairs when evaluated using the CHEST-i7 benchmark for multimodal. 10 claims…
Abstract: This report synthesises findings from 6 peer-reviewed papers addressing the following research question: What is the correlation between batch size during adversarial training and the robustness of Codestral against syntax-perturbed MBPP benchmarks? 10 claims were extracted from source literature; 10 were…
Abstract: This report synthesises findings from 3 peer-reviewed papers addressing the following research question: How does MFOUR's Synaptic Routing affect Codestral's robustness to adversarial inputs in the AdvBench benchmark when scaling from 8K to 32K context lengths? 11 claims were extracted from source literature; 5 were…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What is the impact of continual learning strategies on the retention of code generation capabilities in large language models as measured by performance degradation on MultiPL-E after sequential task. 9 claims…
Abstract: This report synthesises findings from 7 peer-reviewed papers addressing the following research question: How does the integration of synaptic routing mechanisms affect the pass@1 scores of code generation models like Codestral on the HumanEval benchmark when subjected to adversarial syntax perturbations? 11 claims…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How robust are Tulu 3 models to adversarial prompts compared to Deepseek R1 on the BBH benchmark for alignment and safety evaluation? 13 claims were extracted from source literature; 11 were independently…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does the inference throughput of Mamba-based selective state space models compare to FlashAttention-optimized Transformers on the HumanEval+ code generation benchmark for sequences exceeding 32k. 13 claims…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the inference efficiency trade-off between Tulu 3 and Deepseek R1 when running on low-resource devices for code generation tasks measured in tokens per second? 10 claims were extracted from source…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does fine-tuning on security-specific datasets impact the cross-domain robustness of Llama3 and Deepseek R1 in vulnerability classification tasks? 12 claims were extracted from source literature; 9 were…
Abstract: This report synthesises findings from 6 peer-reviewed papers addressing the following research question: What is the impact of code complexity metrics (e.g., cyclomatic complexity, Halstead volume) on the inference latency and throughput of state-of-the-art code LLMs when processing obfuscated versus. 11 claims were…