Papers
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does the performance of cross-lingual question answering models trained on fewer than 10 languages compare to models trained on 50+ languages when evaluated on the TyDiQA benchmark using. This paper presents…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does the alignment performance of LaBSE on the MLQA benchmark change when evaluated with MA-DPR versus cosine similarity under different inference efficiency constraints (e.g., latency, FLOPs)? Dense Passage…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What is the impact of model size scaling on the robustness of multilingual models against adversarial cross-lingual perturbations in the MLQA benchmark when measured with MA-DPR and cosine similarity.…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does adversarial cross-lingual perturbation affect the performance of multilingual models like LaBSE on the XQuAD benchmark when evaluated using MA-DPR versus cosine similarity? Information retrieval across…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What is the computational overhead and throughput trade-off of manifold-aware distance metrics in DPR compared to standard baselines when evaluated on the BEIR benchmark suite? Dense Passage Retrieval (DPR)…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What is the effect of combining manifold-aware distance metrics with sparse retrieval methods on exact match accuracy and retrieval latency in low-resource settings using the NQ benchmark? Dense Passage Retrieval…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: To what extent do synthetic question-answer pairs generated for specialized domains improve the zero-shot generalization of retrieval models compared to fine-tuning on standard benchmarks? Recent advancements in…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: Does Vendi-RAG's adaptive approach improve robustness against adversarial or out-of-distribution queries in specialized domains such as legal or financial QA, as evaluated using metrics like BLEU or. In the…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the trade-off between retrieval latency and answer accuracy in Vendi-RAG when evaluated on the TriviaQA benchmark with different model sizes? Accurate and contextually faithful responses are critical when…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does varying the diversity-aware retrieval threshold in Vendi-RAG impact downstream code generation performance on HumanEval compared to standard RAG? Current search techniques are limited to standard RAG…
Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: Does the adaptive trade-off mechanism in Vendi-RAG improve robustness against noisy retrieval contexts in code synthesis benchmarks like MBPP compared to relevance-only baselines? Retrieval-augmented generation…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does Vendi-RAG's iterative diversity optimization affect pass@k scores on HumanEval compared to standard RAG when evaluated on Llama2-70B versus Mistral-7B? Retrieval-augmented generation (RAG) enhances large…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How does the performance of Llama-3-8B-128K, Qwen-8B, and Mistral-8B vary on long-context tasks across different domains (e.g., legal, scientific, literary) when evaluated with a domain-specific. We study the…
Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: What is the robustness gain (measured by adversarial accuracy) of semantics-guided adversarial training over standard training when scaling to larger transformer models like Llama-2 in code. Predicting the…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does semantics-guided adversarial training compare to standard adversarial training in terms of inference latency and memory usage when applied to transformer-based language models on the GLUE. Predicting the…
Abstract: This report synthesises findings from 5 peer-reviewed papers addressing the following research question: How does the performance of Blended RAG scale with increasing dataset sizes on multi-domain benchmarks like MMLU or HELM, compared to baseline RAG methods, when evaluated using exact match accuracy.…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: Does the performance gap between gist-based and verbatim memory compression in long-video QA tasks persist when evaluated on out-of-domain temporal reasoning datasets? While multimodal large language models have…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What is the trade-off between inference latency and reasoning accuracy when applying graph-augmented attention with different memory distillation ratios in multimodal video agents? While multimodal large language…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the impact of hybrid embeddings (combining Sentence-T5 and MPNet) on the robustness of Tree of Reviews against adversarial noise in multi-hop QA benchmarks like HotpotQA and TriviaQA? Symmetries are…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the integration of structural graph priors affect the scaling laws of multimodal models compared to pure attention architectures on vision-language benchmarks? Multimodal Transformers serve as the…
Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: How does the Tree of Reviews retrieval framework compare to chain-based retrieval in terms of latency and throughput when scaling to SQuAD variants with 100K+ documents using Llama-3-8B-128K? Multi-hop question…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does the inference efficiency of graph-based multimodal models compare to dependency-free models under adversarial perturbations when evaluated on MM-Vet? Real-time traffic prediction models play a pivotal…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: Can LongNav-R1's multi-turn RL approach be extended to multimodal models like Flamingo, and how does it compare in terms of navigation success rate and trajectory smoothness on the Habitat-3D. This paper develops…
Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: How does the multi-turn RL framework in LongNav-R1 compare to single-turn approaches in terms of accuracy on the RxR-CE benchmark when evaluated with Success Weighted by Path Length (SPL) and goal. This paper…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: Can the horizon-adaptive multi-turn RL approach in LongNav-R1 be extended to improve robustness in cross-domain navigation tasks, as measured by performance on the R2R-UNSEEN benchmark compared to. This paper…