Papers
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: Do manifold-aware embeddings derived from Wikipedia-based semantic relatedness metrics improve cross-lingual dense retrieval performance on XQuAD compared to standard cosine similarity, as measured. Dense Passage…
Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: How does few-shot prompting variation affect SWE-bench pass@k scores in GPT-4o compared to closed-source models like Claude 3. Prompt engineering reduces reasoning mistakes in Large Language Models (LLMs).…
Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: Benchmark archaeology: investigate SWE-bench score discrepancy for GPT-4o, reported 7.0%–83.4% (spread 76.4pp) across 2 papers. Sources: 'SWE-bench Goes Live!' (7.0%); 'FeedbackEval: A Benchmark for. The…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How do multimodal RAG architectures (incorporating text and image retrieval) compare to text-only RAG systems in terms of Recall@1000 and reasoning accuracy on cross-domain benchmarks like JURIS-AQA.…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What is the impact of domain-specific fine-tuning (e.g., legal domain) on the robustness of RAG models against adversarial attacks compared to general-domain fine-tuning, as measured by Recall@1000. Retrieval…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: Do manifold-aware distance functions improve cross-domain robustness in code generation models when evaluated on perturbed benchmark suites like HumanEval compared to traditional metric baselines. Code generation…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the impact of manifold regularization on zero-shot cross-lingual retrieval accuracy for low-resource languages within the BEIR evaluation suite. Zero-shot evaluation of information retrieval (IR) models…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the effect of scaling multilingual models with manifold-aware distance metrics (e.g., MA-DPR) on cross-lingual retrieval performance across different language families in the MLQA benchmark,. Dense…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How do manifold-aware distance metrics (e.g., MA-DPR) improve the robustness of multilingual models like LaBSE against adversarial cross-lingual retrieval attacks on MLQA, as evaluated by accuracy. While…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does the performance of Llama3, Codestral, and Deepseek R1 on vulnerability classification in Big-Vul compare to specialized vulnerability detection models like GitHub CodeQL in terms of. Modern software…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does the inference latency of manifold-aware dense retrieval models compare to standard DPR baselines when evaluated on the HotpotQA benchmark. Dense Passage Retrieval (DPR) typically relies on Euclidean or…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: To what extent does varying the level of semantic overlap in retrieved documents affect the hallucination rates of large language models in retrieval-augmented generation settings. Retrieval-augmented generation…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does document redundancy in retrieval corpora impact the answer accuracy and latency of joint optimization RAG frameworks on the Natural Questions benchmark. Retrieval-Augmented Generation (RAG) systems…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What are the trade-offs between retrieval efficiency and generation quality when applying diversity-aware re-ranking strategies in RAG systems evaluated on open-domain QA tasks. Retrieval-augmented generation…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the effect of adversarial perturbations on the calibration error of transformer-based trajectory forecasters evaluated on the Argoverse 2 Sensor Dataset. Predicting the trajectories of surrounding objects…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does the multi-granularity capability of M3-Embedding affect retrieval latency and throughput scalability on the HotpotQA benchmark compared to single-granularity dense retrievers. Visual localization is of…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the fact-chaining accuracy of Llama-3-8B-128K compare to Qwen-8B and Mistral-8B on the BABILong benchmark when context length increases from 32K to 128K. In recent years, the input context sizes of large…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How do Llama-3-8B-128K, Qwen-8B, and Mistral-8B differ in robustness to irrelevant context noise within the BABILong dataset as the total sequence length scales to 128K. We study the continual pretraining recipe…
Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: How does the robustness of Tree of Reviews retrieval compare to chain-based retrieval for Llama-3-8B-128K when evaluated on adversarial or noisy versions of SQuAD using different embedding models. Dense retrieval…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the impact of varying embedding dimensionality (e.g., 384, 768, 1024) on retrieval-augmented generation (RAG) performance for Llama-3-8B-128K on SQuAD when using Tree of Reviews versus.…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the impact of graph-augmented attention mechanisms on inference latency and throughput for large-scale multimodal information extraction tasks relative to standard Vision-Language Models. While multimodal…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: Do structural graph priors improve the robustness of zero-shot multimodal reasoning against adversarial text perturbations in evaluation suites like MM-Vet compared to dependency-free architectures. We propose…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How does integrating structural graph priors into multimodal transformers affect zero-shot extraction accuracy on noisy image-text benchmarks like NoisyVisDial compared to pure attention baselines. Deep neural…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How do multimodal grounding models perform in disambiguating long-horizon navigation instructions in the Matterport3D benchmark when compared to LongNav-R1's interactive learning framework, measured. This paper…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the inference latency of deep residual architectures compare to transformer-based models in zero-shot image classification on ImageNet. The remarkable success of Vision Transformers in Artificial Neural…