Papers
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: To what extent does the horizon-adaptive mechanism in LongNav-R1 improve success rates on out-of-distribution navigation instructions compared to standard fine-tuned VLA models? Language models (LMs) possess a…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does the horizon-adaptive multi-turn RL approach in LongNav-R1 compare to other RL-based navigation frameworks like PointGoalRL in terms of sample efficiency and convergence speed when trained on. This paper…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does the inference latency of LongNav-R1's multi-turn RL policy compare to single-turn VLA baselines on the RxR-CE benchmark when measured in tokens per second? This paper develops LongNav-R1, an end-to-end…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: Can a multi-stage validation framework with progressively complex unit tests (e.g., HumanEval, MBXP) improve the accuracy of reward signals while maintaining training stability in code generation. Current large…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: To what extent does combining implicit and explicit reward signals from unit tests improve the robustness of LLM-generated code across different programming languages on the MultiPL-E benchmark? Current large…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the use of dynamic reward scaling in unit test-based reward modeling affect the trade-off between alignment quality and inference efficiency in code generation tasks on the SQuTR benchmark? Current large…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the impact of dataset size on the robustness of DPO versus RLHF alignment methods when evaluated on multimodal reasoning benchmarks with corrupted image-text pairs? This paper studies the alignment…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: Does difficulty-based preference data selection improve inference efficiency and alignment quality on long-context reasoning benchmarks compared to standard RLHF pipelines? Aligning large language models (LLMs)…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does the inclusion of rationales in preference data influence the robustness of DPO-trained models to adversarial prompts, measured by accuracy on the AdversarialQA benchmark across different.…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does the integration of human-annotated rationales in preference data impact the alignment performance of DPO on the MMLU benchmark compared to standard RLHF, measured by accuracy across. Aligning language…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: What is the impact of sliding window attention on the inference efficiency of GitHub Copilot in generating large-scale code snippets, and how does this trade-off between speed and accuracy compare to. Synthetic…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: Does fine-tuning on Adversarial GLUE datasets improve the stability of gradient-based attribution methods compared to attention-based methods under perturbed inputs? Adversarial perturbations are noise-like…
Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: How does the correlation between Integrated Decision Gradients and Attention Rollout attribution consistency vary across different adversarial attack types in the Adversarial GLUE benchmark? Deep neural networks…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How do different adversarial attack strategies on graph structure affect the inference latency of GNN-based NIDS models when evaluated using the UNSW-NB15 dataset compared to models trained on the. Deep neural…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does the robustness of Integrated Decision Gradients compare to Attention Rollout in maintaining feature attribution consistency under adversarial text perturbations across standard NLP benchmark. Large-scale…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does head-tail-aware KL divergence scaling affect alignment metrics in large language models compared to standard KL divergence during distillation? Standard Knowledge Distillation (KD) compresses Large…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the impact of gradient masking techniques on the robustness of GNN-based NIDS models against structural adversarial attacks as measured by the AUC-ROC score on the KDD Cup 99 dataset? We identify…
Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: How do multimodal models perform on spatio-temporal graph datasets with synthetic noise compared to unimodal graph neural networks in terms of inference throughput, as evaluated on benchmarks like. In order to…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: To what extent do adversarial perturbations in input text degrade the consistency of feature attribution maps generated by Integrated Gradients compared to Attention Rollout? Attribution algorithms are frequently…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the impact of varying levels of synthetic noise on the reasoning capabilities of transformer-based language models when fine-tuned on spatio-temporal graph datasets, as measured by accuracy. Dynamic Graph…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the self-mutual learning approach compare to teacher-only baselines in inference efficiency on spatio-temporal graph datasets when evaluated using standard graph neural network benchmarks? Knowledge…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the integration of multimodal inputs (e.g., diagrams or UML representations) in self-invoking code generation tasks affect the accuracy (pass@1) and latency of LLMs compared to text-only. We introduce…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does domain-specific fine-tuning (e.g., Python vs. JavaScript) affect GPT-4o's code generation robustness as measured by HumanEval+ test suite accuracy? Large Language Models (LLMs) have demonstrated…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the correlation between model size (1B–175B parameters) and HumanEval score stability across different evaluation protocols (e.g., deterministic vs. probabilistic sampling)? Large language models (LLMs)…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the effect of multi-task fine-tuning (e.g., combining HumanEval Pro with MBPP Pro) on model robustness (measured by pass@k) in self-invoking code generation tasks across different problem. We introduce…