Papers
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the impact of semantics-guided adversarial training on the generalization gap between in-domain and out-of-domain trajectory prediction tasks? Predicting the trajectories of surrounding objects is a…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How do adversarially trained trajectory prediction models compare in inference latency and accuracy trade-offs when evaluated on standard autonomous driving planning benchmarks? We introduce a motion forecasting…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does the robustness of alignment-weighted DPO scale across LLaMA-2 variants (7B, 13B, 70B) on adversarial TruthfulQA prompts compared to standard DPO alignment? Adversarial robustness of deep learning models…
Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: What is the inference latency impact of applying alignment-weighted DPO on code generation tasks using HumanEval and MBPP benchmarks? We introduce self-invoking code generation, a new task designed to evaluate the…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: Does the inference efficiency of sparse multimodal models with varying numbers of experts improve with higher alignment scores on VQAv2 and OK-VQA, and how does this trade-off compare to dense models? Sparse…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does the alignment score (e.g., via RLHF or DPO) of sparse multimodal models with varying numbers of experts correlate with their performance on the OK-VQA benchmark compared to dense models? Background:…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What is the trade-off between retrieval latency and answer accuracy when scaling the number of hops in Tree of Reviews vs. chain-based retrieval for Llama-3-8B-128K on the HotPotQA and MuSiQue.…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What is the impact of varying the number of retrieval hops (e.g., 2-hop vs. 3-hop) on the F1 score stability of the Tree of Reviews framework compared to chain-based retrieval in Llama-3-8B-128K when. Multi-hop…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does the cross-validation performance of LongNav-R1 vary across different multimodal input modalities when processing long-horizon navigation tasks? Robot vision has greatly benefited from advancements in…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does the inference latency of LongNav-R1 compare to single-turn VLA policies when evaluated on the RxR-CE navigation benchmark using standard desktop GPUs? This paper develops LongNav-R1, an end-to-end…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does the Tree of Reviews retrieval framework compare to other tree-based retrieval methods in terms of accuracy and computational overhead when applied to Llama-3-8B models on the MultiHopQA. Multi-hop…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the impact of varying retrieval-augmentation contexts (e.g., different music metadata sources, retrieval depths) on Llama-3-8B-128K's response accuracy for fact-based versus interpretive. Recent work on…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: Can retrieval-augmented generation (RAG) improve the consistency of Llama-3-8B-128K's responses in multi-track comparative music QA when evaluated using a novel semantic consistency metric across. The advent of…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does Oracle-RLAIF's sample efficiency compare to traditional supervised fine-tuning when evaluated on the RxR-CE benchmark's nDTW score across different training compute budgets? Recent advances in large…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does the performance of Llama-3-8B-128K compare to other open-source LLMs (e.g., Falcon-40B, Mistral-7B) on Jamendo-MT-QA when evaluated using both human annotations and automated metrics like. Recently,…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does Oracle-RLAIF maintain cross-lingual generalization capabilities on RxR-CE when scaling from English-only pretraining to multilingual human preference data? To democratize large language models (LLMs) to…
Abstract: This report synthesises findings from 2 peer-reviewed papers addressing the following research question: What is the computational efficiency (inference latency, FLOPs, or energy consumption) of VELMA compared to Flamingo and PaLI when deployed on standard vision-language benchmarks like VQA-v2 or. We explore…
Abstract: This report synthesises findings from 7 peer-reviewed papers addressing the following research question: How does the multi-turn reinforcement learning approach in LongNav-R1 compare to other state-of-the-art RL-based navigation models in terms of sample efficiency and convergence speed on the R2R. We introduce…
Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: What is the impact of instruction complexity on the path completion rate of Embodied-R1 compared to 7B and 13B VLAs when evaluated on the ALFRED benchmark for embodied task completion? The rapid evolution…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: Does the performance gap between 7B and 13B VLAs in object grounding persist when evaluated on cross-domain vision-language benchmarks such as LVIS or COCO-Text? We introduce InternVL 2.5, an advanced multimodal…
Abstract: This report synthesises findings from 3 peer-reviewed papers addressing the following research question: How does the 3B VLM in Embodied-R1 compare to 7B and 13B VLAs in terms of inference efficiency and memory footprint when evaluated on LongNav-R1 with R2R-CE instructions of varying complexity? The field of fluid…
Abstract: This report synthesises findings from 1 peer-reviewed paper addressing the following research question: Can 13B VLA models achieve better zero-shot cross-dataset generalization than 7B models on the R2R-CE benchmark when augmented with external multimodal pretraining data? The proliferation of Large Language Models…
Abstract: This report synthesises findings from 7 peer-reviewed papers addressing the following research question: How does the performance of 13B VLA models compare to 7B models on the R2R-CE benchmark when evaluated with multi-stage navigation tasks under noisy or adversarial linguistic inputs? Recently, Multimodal Large…
Abstract: This report synthesises findings from 3 peer-reviewed papers addressing the following research question: What is the correlation between instruction complexity in LongNav-R1 and the grounding accuracy of 7B vs. 13B VLA models, as measured by entity detection F1 scores on R2R-CE validation splits? Multimodal datasets…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What are the effects of alignment techniques (e.g., RLHF, constitutional AI) on the robustness of sparse MoE models in self-invoking code generation tasks, measured by accuracy on adversarial. Large Language…