Papers
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the performance of fine-tuned Llama-3.1-8B, Mistral-7B-v0.1, and Qwen3-8B on Romanized Nepali generalize to other low-resource language variants (e.g., Romanized Hindi or Marathi) when. Romanized Nepali,…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does fine-tuning Codestral on taxonomy-aligned vulnerability datasets affect zero-shot repair success rates on Big-Vul compared to fine-tuning on general code corpora. Within the realm of software…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the impact of dataset alignment on the false positive rate of Codestral when evaluating vulnerability severity predictions on the SWCC benchmark. Static Application Security Testing (SAST) tools play a…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: Does optimizing the diversity-weight in Vendi-RAG improve FLAN-T5-xl robustness against syntactic distractors in HANS compared to standard relevance-based RAG baselines. Retrieval-augmented generation (RAG)…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the integration of multimodal inputs like AST and control flow graphs affect the vulnerability repair capabilities of DeepSeek R1 compared to Codestral, when evaluated on the Big-Vul dataset. The…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: What is the effect of varying Vendi-RAG's diversity-weight on FLAN-T5-xl inference latency and token throughput when evaluated on ANLI and HANS datasets. LLM inference is still evaluated mainly as a model or…
Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: How does the pass@k metric degrade for code generation models between 1B and 10B parameters when transitioning from standard HumanEval to self-invoking HumanEval Pro tasks under fixed token budgets. We introduce…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does the diversity-weight parameter in Vendi-RAG impact FLAN-T5-xl accuracy and F1-score on the ANLI and HANS adversarial benchmarks. State-of-the-art few-shot learning (FSL) methods leverage prompt-based…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does varying the diversity-weight parameter in Vendi-RAG affect the trade-off between factuality and coherence scores on the ELI5 dataset compared to standard RAG baselines. Current evaluation methods for…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What is the effect of combining BM25 and dense retrievers on the inference latency and throughput of RAG pipelines in production environments. Retrieval-Augmented Generation (RAG) enhances Large Language Models…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does the performance of manifold-aware distance metrics compare to traditional distance metrics (cosine, Euclidean) in dense passage retrieval when evaluated on long-context benchmarks like. Dense Passage…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: Do manifold-aware dense retrieval models demonstrate improved robustness and stability in Recall@1000 scores under out-of-distribution query shifts in biomedical or legal domain QA benchmarks. Dense Passage…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How do manifold-aware distance metrics perform in cross-domain and cross-lingual retrieval tasks (e.g., FEVER, MLQA) compared to multilingual models like mDPR or LaBSE, particularly when evaluated on. Dense…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does replacing Euclidean distance with manifold-aware metrics in dense retrieval affect Recall@1000 performance on multi-hop reasoning datasets like HotpotQA compared to standard DPR baselines. Dense Passage…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the computational overhead and throughput impact of manifold-aware distance metrics (MA-DPR) compared to standard distance metrics in large-scale retrieval systems, when scaled to billions of. Dense…
Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: How does Vendi-RAG's diversity-quality trade-off impact pass@k metrics on the HumanEval and MBPP code generation benchmarks compared to dense retrieval baselines. Retrieval-augmented generation (RAG) enhances…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the computational overhead of Vendi-RAG's iterative joint optimization process compared to traditional RAG, measured in terms of latency and throughput on the MS MARCO passage ranking. Retrieval-augmented…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: What is the impact of semantics-guided adversarial perturbations on the code generation success rates of multimodal models when evaluated on cross-domain programming tasks. Adversarial examples reveal the blind…
Abstract: This report synthesises findings from 17 peer-reviewed papers addressing the following research question: What is the impact of adversarial training on the calibration of probabilistic occupancy grid predictions in urban autonomous driving models evaluated on the Waymo Open Dataset. Being able to generate realistic…
Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: What is the impact of token scheduling strategies on the inference throughput and alignment scores of sparse multimodal models versus dense architectures on OK-VQA. Recent advancements in Multimodal Large Language…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does the inference latency of Tree of Reviews compare to chain-based retrieval methods on MuSiQue when scaling retrieval hops from 2 to 4 on Llama-3-8B. Compared to black-box neural networks, logic rules…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the performance of Llama-3-8B-128K compare to other 8B-parameter models like Mistral-8B or Qwen-8B in multi-hop retrieval accuracy on HotpotQA and MuSiQue benchmarks when using chain-based. Prompt…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: What is the impact of varying the number of hops on the trade-off between retrieval accuracy and latency in Tree of Reviews versus chain-based retrieval for Llama-3-8B-128K on SQuAD and HotpotQA.…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the integration of graph neural networks in multimodal fusion architectures impact zero-shot reasoning accuracy on long-horizon navigation benchmarks compared to attention-based models. Multimodal…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does the robustness of LongNav-R1 to instruction ambiguity on the RxR-CE benchmark compare to standard single-turn VLA policies in terms of trajectory deviation metrics. This paper develops LongNav-R1, an…