Papers
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does the finetuning dataset size impact the performance of Qwen2.5 on the HumanEval Pro and MBPP Pro benchmarks compared to models with smaller pretraining datasets? We introduce self-invoking code…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: What is the impact of sliding window attention on code generation accuracy compared to full attention mechanisms in long-sequence programming benchmarks? GitHub Copilot, an extension for the Visual Studio Code…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does the performance of RxR-trained agents compare to those trained on Room-to-Room (R2R) when evaluated on the ALFRED benchmark for long-horizon language-grounded navigation tasks? Existing Vision-Language…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: What is the trade-off between inference efficiency (latency/throughput) and reasoning accuracy when applying mixed-precision quantization to multimodal models like InternLM on benchmarks such as MMMU?…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does sliding window attention affect inference latency and memory usage when processing context lengths exceeding 32K tokens in LLM reasoning tasks? The quadratic compute and memory costs of global…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: Do multilingual VLN agents trained on RxR demonstrate improved cross-lingual transfer learning capabilities when evaluated on the Room-to-Room (R2R) dataset for English and non-English instructions?…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does mixed-precision quantization (e.g., 4-bit vs. 8-bit) affect the performance of quantized InternLM models on multimodal reasoning benchmarks like MMBench and ITP compared to the LLaVA. Reducing the…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does quantization-aware training (QAT) impact the reasoning capabilities of large language models (LLMs) on mathematical benchmarks compared to post-training quantization (PTQ) when evaluated on. Post-training…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: Does increasing the number of experts in sparse MoE models improve inference efficiency (throughput) while maintaining pass@1 accuracy on self-invoking code generation tasks as benchmarked on. Among parallel…
Abstract: This report synthesises findings from 7 peer-reviewed papers addressing the following research question: How does the performance gap between sparse MoE models and dense transformers on self-invoking code generation tasks vary when evaluated on MBPP Pro compared to HumanEval Pro? We introduce self-invoking code…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does the path efficiency of RxR-trained agents with the tryout controller scale with increasing complexity of unseen environments (e.g., larger maps, more obstacles) compared to R2R-trained agents? Eccentric…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the impact of RLHF alignment on the pass@1 accuracy of multimodal models (e.g., text-to-code) compared to text-only models in solving self-invoking code generation tasks on HumanEval Pro? We introduce…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does the path efficiency of RxR-trained agents with the tryout controller compare to agents trained with other navigation benchmarks (e.g., ALFRED, Room-Across-Room) when evaluated on unseen. We introduce…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the impact of varying the size of the language model backbone on the path efficiency and communication success rate of RxR-trained agents in the R2R benchmark? Large language models (LLMs) have achieved…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does Wan2.1 I2V-14B with LoRA adaptation perform on out-of-domain cinematic scenes (e.g., sci-fi) compared to its performance on historical scenes, as evaluated by CLIP-based metrics like FID? We present a…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the relative performance of RxR-trained agents versus R2R-trained agents on the ALFRED benchmark for task and language grounding in realistic indoor environments? We introduce Room-Across-Room (RxR), a…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the impact of varying LoRA rank dimensions (e.g., 4, 8, 16) on the temporal consistency scores of Wan2.1 I2V-14B as measured by the FVD (Frechet Video Distance) benchmark? We present a practical pipeline…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does the communication efficiency of MADRL agents scale with the number of agents when evaluated on the SCAN benchmark for natural language grounding tasks? Communication is an effective mechanism for…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: Does the tryout controller mechanism in RxR-trained agents improve robustness to ambiguous natural language instructions when evaluated on the Room-to-Room (R2R) benchmark with a focus on instruction. This report…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does adjusting the LoRA rank in Wan2.1 I2V-14B impact the FVD (Frechet Video Distance) and KID (Kernel Inception Distance) scores on benchmarks like UCF-101 or Kinetics-400 compared to full. We present a…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the trade-off between inference latency and temporal consistency (measured by TSSIM or LPRO) when applying LoRA to video diffusion models like Make-A-Video or AnimateDiff across different. We present a…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: Do parameter-efficient fine-tuning methods like LoRA in text-to-video generation models achieve comparable temporal stability metrics (e.g., FVD-128, FID-128) to full fine-tuning when evaluated on. We present a…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How do different adversarial training methods affect the trade-off between generation quality and sampling efficiency in large-scale diffusion model deployment? Predicting the trajectories of surrounding objects…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the impact of reducing adapter rank in LoRA on the MotionScore benchmark for video generation tasks, particularly when evaluated on cross-domain generalization (e.g., historical vs. sci-fi. We present a…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does the Directional Preference Alignment (DPA) framework compare to traditional RLHF in terms of recommendation diversity metrics (e.g., coverage, novelty) on sequential recommendation. Recent studies have…