Papers
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does domain adaptation via cross-task fine-tuning affect the robustness of SLMs in detecting CWEs in Python code under adversarial perturbations compared to a baseline of pre-trained LLMs? A joint measurement…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the trade-off between inference throughput and pass@1 accuracy for SLMs vs. LLMs in CWE detection tasks on private Python codebases when deployed on-device vs. in cloud environments? Large Language Models…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does the alignment of LLaMA-70B with human preferences via PowerInfer's dynamic threshold adjustment scale with model size, as measured by accuracy on MBPP and the degree of preference divergence? Aligning…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the robustness of MORL-based preference alignment in PowerInfer when evaluated across diverse programming languages beyond Python (e.g., JavaScript, Java) using the HumanEval benchmark? Fine-grained…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How does the dynamic hot neuron threshold adjustment in PowerInfer compare to fixed threshold methods in terms of inference latency and memory efficiency when applied to LLaMA-70B on MBPP Python? Large Language…
Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: How does the detection accuracy of federated learning models compare to centralized deep neural networks when evaluated on the AndroZoo benchmark with varying levels of code obfuscation and… This work investigates…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How robust are federated learning-based malware detection models to adversarial attacks targeting the aggregation process, measured by the degradation in F1-score when subjected to gradient poisoning? This work…
Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: What is the throughput scalability of federated learning frameworks like FEDetect when increasing the number of client devices in a distributed IoT malware detection setting? This work investigates the…
Abstract: This report synthesises findings from 6 peer-reviewed papers addressing the following research question: To what extent does domain adaptation via federated transfer learning improve model generalization in malware detection when trained on N-BaIoT and evaluated on unseen IoT device types, measured by… This work…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does the integration of differential privacy in federated learning-based malware detection models affect the trade-off between model accuracy and communication efficiency, measured by F1-score? This work…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does the NIASM hybrid approach perform on cross-lingual factual consistency (F1 score) compared to monolingual fine-tuning in multilingual models like Bloom and Llama-2 on the XSUM and CNN/DM… In an era…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: To what extent does the NIASM framework improve inference efficiency (tokens/sec) compared to baseline models like Vicuna-13B and Baichuan-2 when deployed on low-resource hardware for long-form… Customized…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does varying the TAE token misalignment threshold affect the hallucination rates of Vicuna-13B and Baichuan-2 across different domains in the FactCC and HalluEval benchmarks? Large language models (LLMs) have…
Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: How does the alignment score sensitivity of Baichuan 2 and Vicuna-13B compare when evaluated on multimodal benchmarks with varying degrees of token misalignment under constrained inference budgets? Multimodal LLMs…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: Does model size scaling (e.g., 7B vs. 13B vs. 30B parameters) correlate with syntax error reduction in CoT-generated code for structured data tasks on BigCodeBench? Large language models (LLMs) have demonstrated…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does the performance of Code Llama and Code Llama - Python models scale with increasing model size (7B to 70B parameters) on BigCodeBench tasks measuring cross-library function composition… We release Code…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What is the cross-domain generalization accuracy of fine-tuned Codestral-7B versus Llama3-70B on unseen programming languages beyond Python for security vulnerability classification? Many ML-based approaches have…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the impact of combining semantic literature retrieval (Elicit) with code-focused context engineering on the accuracy of generated code for niche domains in multi-file projects, measured by… Large Language…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the effect of model partitioning strategies in split computing on the throughput of Llama3-70B versus Codestral-34B for code generation tasks on HumanEval-hard? We introduce SIMCOPILOT, a benchmark that…
Abstract: This report synthesises findings from 3 peer-reviewed papers addressing the following research question: How does the performance of Gemini 1.5 Pro with an 8M context window compare to Llama3-70B with retrieval augmentation in classifying vulnerabilities on the CodeXGLUE security subset when the input… Large Language…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does the robustness of vulnerability classification models like Gemini 1.5 Pro and Llama3-70B with retrieval augmentation vary when presented with adversarial or noisy inputs in the CodeXGLUE… We release Code…
Abstract: This report synthesises findings from 5 peer-reviewed papers addressing the following research question: What is the inference efficiency difference between Gemini 1.5 Pro and Llama3-70B with retrieval augmentation when processing large-scale security vulnerability classification tasks on the CodeXGLUE… Large…
Abstract: This report synthesises findings from 7 peer-reviewed papers addressing the following research question: How do small language models (SLMs) fine-tuned with multimodal context compare to larger multimodal LLMs in terms of CWE detection accuracy and alignment metrics on the extended Big-Vul dataset? In this paper, we…
Abstract: This report synthesises findings from 1 peer-reviewed paper addressing the following research question: What is the trade-off between model size and inference throughput for SecLM variants fine-tuned with multimodal inputs, as measured by latency comparisons on edge devices versus cloud infrastructure? Probably no…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the maximum context length that Mistral-Large-2 can handle? We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms…