Papers
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the computational efficiency trade-off between the confusion-based interactive method and the more complex policy-gradient-based approach in LongNav-R1 when evaluated on the House3D benchmark? This paper…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does the generalization performance of LongNav-R1's multi-turn RL framework compare to single-turn VLA policies when transferred to unseen environments in the R2R benchmark, measured by success… This paper…
Abstract: This report synthesises findings from 17 peer-reviewed papers addressing the following research question: Does the multi-turn reasoning architecture of LongNav-R1 improve success rate metrics on the RxR-CE benchmark compared to standard single-turn VLA approaches? Vision-and-Language Models (VLMs) have shown…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does the multi-turn RL framework of LongNav-R1 compare to single-turn VLA policies in terms of SPL (Success weighted by Path Length) and nDTW (normalized Dynamic Time Warping) on the R2R… This paper develops…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the reduction in GPU memory consumption achieved by LongNav-R1 versus baseline single-turn VLA models during long-horizon task execution on RxR-CE? This paper develops LongNav-R1, an end-to-end multi-turn…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the throughput impact of using DPO versus RLHF for alignment when evaluating LLMs on the HEIGER benchmark for adversarial code generation tasks? Direct Preference Optimization (DPO) has emerged as a…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does Direct Preference Optimization (DPO) compare to RLHF in terms of sample efficiency and convergence speed when fine-tuning LLMs on the SQuTR benchmark with noisy inputs? Aligning language models with…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How does the quantized influence measure compare to standard attention-based retrieval in improving code generation accuracy on multi-file dependency benchmarks? This study presents an innovative enhancement to…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the effect of different retrieval strategies (e.g., dense vs. sparse retrieval) on the end-to-end throughput and accuracy of Llama-3-8B in RAG-augmented question answering on the MusicQA…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: Does fine-tuning Llama-3-8B with LongRAG objectives improve generalization scores on cross-domain long-context QA tasks relative to domain-specific fine-tuning alone? Large Language Models (LLMs) have been widely…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does the integration of hybrid retrieval methods (combining dense and sparse) in RAG systems impact inference latency and accuracy trade-offs on multi-track music QA benchmarks compared to…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: To what extent do different retrieval augmentation strategies (e.g., multi-stage RAG, re-ranking) improve the robustness of Llama-3-8B on adversarial or ambiguous multi-track music QA benchmarks? Recent…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: To what extent does Oracle-RLAIF improve the alignment and error correction capabilities of large language models under adversarial input perturbations compared to traditional RLHF methods on code… Reinforcement…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What is the impact of varying levels of domain-shift noise on the inference efficiency and accuracy trade-offs of deep learning models evaluated on multimodal reasoning benchmarks? Visual Question Answering (VQA)…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does the robustness of CNN architectures to synthetic acoustic noise perturbations compare between standard supervised training and reinforcement learning from human feedback (RLHF) on… The success of…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the impact of reward-weighted alignment versus direct preference optimization on inference latency and throughput for multilingual code generation models? The automatic generation of counter-speech (CS)…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: Does the Oracle-RLAIF training method improve inference latency compared to SFT on the MSVD benchmark, and how does this scaling behavior differ for models with 1B, 7B, and 13B parameters? Recent advances in…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does the robustness of RLAIF-trained multimodal models compare to SFT baselines on out-of-domain video captioning benchmarks like MSR-VTT versus in-domain MSVD? It is encouraged to see that progress has been…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What is the impact of varying the quality of AI feedback (e.g., synthetic vs. human-annotated rewards) on the CIDEr score improvement of Oracle-RLAIF on the MSVD benchmark for models with 7B, 13B,… Recent…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the CIDEr score improvement of Oracle-RLAIF over SFT compare to other reinforcement learning methods (e.g., PPO, DQN) on the MSVD benchmark across different model sizes? In post-training for reasoning…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does the fine-tuning process for Qwen2.5 affect its performance on code generation benchmarks like HumanEval and MBPP compared to models trained on smaller pre-training datasets? We introduce self-invoking…
Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: What is the trade-off between throughput and code generation accuracy when comparing Mistral-7B and Llama-3-8B-128K in multi-threaded environments using the HumanEval benchmark? As machine learning models are…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the sliding window attention mechanism in Mistral-7B affect its performance on long-context reasoning benchmarks compared to Llama-3-8B-128K under memory-constrained inference conditions? We introduce…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does the scaling of quantized InternLM models (7B vs. 13B) influence performance stability in the presence of adversarial multimodal inputs compared to full-precision baselines on the LLaVA… We introduce…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: Can the tryout controller mechanism in RxR-trained agents generalize to other language-grounded navigation benchmarks, such as Room-Across-Room (RxR), with measurable improvements in success rate and… We…