Papers
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How do multi-modal lightweight Transformers perform relative to text-only models on mixed code-generation and reasoning benchmarks (e.g., MBPP + MMLU) when evaluated for alignment with human. Large language…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What is the cross-domain robustness of fine-tuned multilingual models on Arabic QA when evaluated across multiple Arabic datasets (e.g., ArabiQA, ArSQuAD) compared to monolingual models. The rapid expansion of…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does the alignment score (e.g., via RLHF or DPO) of a dense multimodal model compare to a sparse model with varying numbers of experts on the VQAv2 benchmark, and does this correlation hold for. Reinforcement…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What is the impact of quantization-aware training on the reasoning capabilities of pruned Transformers compared to full-precision models when measured by MBPP pass@k scores under latency constraints. Large…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the impact of Tree of Reviews vs. chain-based retrieval on the inference latency of Llama-3-8B-128K when processing multi-hop questions with varying context lengths on the MuSiQue benchmark.…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How do MoE-based language models trained with dynamic expert routing perform on cross-domain generalization tasks (measured by GLUE benchmark accuracy) compared to fixed-capacity MoE models and dense. Recent…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does the Tree of Reviews framework compare to the chain-based retrieval method in terms of F1 score stability when scaling Llama-3-8B-128K's context length from 4K to 128K on the MuSiQue benchmark. Multi-hop…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does the computational overhead of the follower-aware speaker model (FOAM) compare to single-turn policy gradient methods in terms of inference time and memory usage during deployment on the. This paper…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the Tree of Reviews retrieval framework compare to chain-based retrieval in terms of computational efficiency and latency when applied to Llama-3-8B models on the MuSiQue benchmark at 128K. Multi-hop…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How robust is the retrieval-augmented generation of Llama-3-8B-128K across different music-related question types (fact-based, interpretive, comparative) on MuSiQue when evaluated using. Recent work on music…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does the use of reinforcement learning from human feedback (RLHF) during multi-turn training affect the nDTW score of vision-language navigation models on the RxR-CE benchmark compared to. Recent advances in…
Abstract: This report synthesises findings from 7 peer-reviewed papers addressing the following research question: How does the sample efficiency of the LongNav-R1 multi-turn RL method compare to single-turn approaches in terms of environment steps required to converge on the RxR-CE validation unseen split. This paper develops…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the impact of multi-turn reinforcement learning training on the Success Rate (SR) and Goal Progress (GP) metrics of LongNav-R1 compared to imitation learning baselines on the R2R dataset. This paper…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the performance of VELMA compare to other multimodal LLMs (e.g., Flamingo, PaLI) on the Obstructed-R2R benchmark in terms of success rate and path length efficiency. Large Vision-Language Models (LVLMs)…
Abstract: This report synthesises findings from 3 peer-reviewed papers addressing the following research question: How does the performance of 7B and 13B VLA models compare in terms of object grounding accuracy and path completion rate in LongNav-R1 when evaluated on R2R-CE with instructions of varying. Generalization in…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: Does increasing the VLA parameter count from 7B to 13B improve long-horizon task completion rate and average reward on R2R-CE when evaluated with zero-shot cross-dataset generalization. Existing Vision-Language…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How do different alignment techniques (e.g., RLHF, DPO) affect the inference efficiency (tokens/sec) and output quality (measured by AlignBench scores) of LLMs on long-horizon reasoning tasks. Large language…
Abstract: This report synthesises findings from 2 peer-reviewed papers addressing the following research question: Does the inference efficiency (latency/throughput) of 7B and 13B VLA models scale linearly with instruction complexity in LongNav-R1 on R2R-CE, and how does this correlate with their grounding and. The ability to…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does incorporating uncertainty quantification through Bayesian neural networks with Monte Carlo sampling impact AlphaX's architectural search efficiency in code generation tasks, as measured by. Over the past…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the effect of multimodal input (e.g., code + natural language prompts) on the accuracy of sparse MoE models for code generation tasks compared to text-only inputs, measured using HumanEval. We introduce…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does the choice of routing algorithm (e.g., expert dropout, top-k) in sparse MoE models impact the trade-off between code generation accuracy (measured by HumanEval pass@1) and throughput. Foundation models,…
Abstract: This report synthesises findings from 1 peer-reviewed paper addressing the following research question: How does varying the LoRA rank in cross-attention layers of Wan2.1 I2V-14B affect the FVD and LPIPS scores compared to full fine-tuning. Human video generation remains challenging due to the difficulty of jointly…
Abstract: This report synthesises findings from 5 peer-reviewed papers addressing the following research question: How does the causal encoder design in W.A.L.T influence the trade-off between FVD scores and inference throughput in photorealistic video generation. We present W.A.L.T, a transformer-based approach for…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the impact of quantizing DeepCoNN-style architectures on inference throughput and recommendation accuracy in low-latency e-commerce serving environments. With the breakthroughs in deep learning, the…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: Does joint modeling of user reviews improve alignment metrics in LLM-based recommendation agents compared to instruction-tuned models without review context. In the last few years, the deep learning (DL)…