Papers
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does the computational efficiency of retrieval-augmented generation (RAG) compare to parametric-only models in large-scale code generation tasks evaluated using the MBPP benchmark. This research presents and…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the choice of metapath length in deep heterogeneous graph networks affect the inference efficiency and memory usage in large-scale molecular property prediction tasks compared to standard. Heterogeneous…
Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: How does the inference efficiency of GNN-based code generation models compare to traditional LLM-based approaches when evaluated on the BIGCode dataset using metrics like latency and tokens per second. Tokens are…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the impact of integrating repository context (e.g., imports, parent classes) on the accuracy of code completion tasks when using multimodal GNN-based models evaluated on the BIGCode benchmark.…
Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: How do multimodal models combining HGNNs with metapath context convolution and vision-language models perform on adversarial robustness benchmarks for code generation compared to unimodal HGNN. Generative…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: What is the efficiency trade-off between using reciprocal normalization versus standard batch normalization in code generation models when evaluated on inference latency and throughput for tasks. Current search…
Abstract: This report synthesises findings from 1 peer-reviewed paper addressing the following research question: How does the performance of DeepSeek R1 on MultiMedQA vary when fine-tuned on datasets with controlled levels of training set contamination across Bloom's Taxonomy levels. Public health reasoning requires…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: What is the efficiency trade-off in terms of inference time and memory usage between standard message-passing HGNNs and HGNNs with metapath context convolution on large-scale graph-structured code. Since…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the impact of different alignment techniques on the robustness of Llama3 and Codestral in maintaining F1-score stability under high data contamination rates in code vulnerability detection. The…
Abstract: This report synthesises findings from 7 peer-reviewed papers addressing the following research question: How does the choice of stratified versus random sampling affect the trade-off between F1-score variance and computational efficiency in Llama3 and Codestral when detecting code vulnerabilities with. Data…
Abstract: This report synthesises findings from 7 peer-reviewed papers addressing the following research question: Do domain-finetuned multilingual M2QA models demonstrate improved reasoning accuracy on out-of-distribution adversarial examples compared to zero-shot baselines. Finetuning language models on a collection of…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does fine-tuning multilingual M2QA models on domain-specific corpora affect their adversarial robustness scores compared to zero-shot cross-domain transfer. In response to rising concerns surrounding the…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the integration of channel-wise feature misalignment correction in multimodal models affect the accuracy and inference latency when evaluated on the MM-ReAct benchmark for scientific. In this paper we…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the impact of domain adaptation on the inference latency and throughput of multilingual question answering models under adversarial perturbations. Natural language processing (NLP) has significantly…
Abstract: This report synthesises findings from 2 peer-reviewed papers addressing the following research question: What is the impact of hybrid retrieval methods (dense + sparse) on the factual consistency of RAG systems when evaluated on the Telco-DPR benchmark's table-heavy subcorpus compared to text-heavy. Advancements in…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does the integration of MA-DPR versus lexical methods impact the reasoning accuracy and latency trade-offs in RAG systems when evaluated on complex multi-hop question-answering benchmarks like. Large Language…
Abstract: This report synthesises findings from 3 peer-reviewed papers addressing the following research question: How does the performance of MA-DPR-based RAG systems degrade under adversarial attacks compared to lexical retrieval methods when evaluated on the AdversarialQA benchmark for robustness. Transformer-based…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the throughput efficiency comparison between MA-DPR and traditional BM25 retrieval methods in RAG systems when scaling to large-scale code generation tasks using the HumanEval benchmark. Large Language…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: Does adversarial contrastive learning with few-shot prompting improve robustness to adversarial examples in code generation tasks evaluated on HumanEval, measured by pass@1 and pass@k metrics. Large Language…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the zero-shot question generation re-ranking method compare to retrieval-augmented generation (RAG) models in terms of downstream QA accuracy on the TriviaQA benchmark. Large Language Models (LLMs)…
Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: Does the zero-shot question generation approach generalize to multimodal retrieval tasks, and if so, how does it perform compared to CLIP-based retrieval on the LAION-5B dataset. A big convergence of language,…
Abstract: This report synthesises findings from 7 peer-reviewed papers addressing the following research question: What is the throughput trade-off between MA-DPR and quantized Euclidean DPR models when evaluated on the BEIR benchmark using edge AI accelerators. Encoder-only transformer models such as BERT offer a great…
Abstract: This report synthesises findings from 7 peer-reviewed papers addressing the following research question: What is the impact of varying the number of negative samples in adversarial contrastive learning on inference throughput for cross-lingual rumor detection in TyDi QA subsets. Infinite numbers of real-world…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: Does the alignment stability of large multilingual models under adversarial prompting in technical domains scale differently than in general conversational benchmarks when measured by refusal rate. As Large…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does the ranking consistency of multilingual LLMs on technical code generation benchmarks like HumanEval-Multi compare to their performance on general knowledge benchmarks as model scale increases. …