Papers
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How do quantized MobileVLM models (1.4B and 2.7B) compare to full-precision 3B-13B VLMs on MME and MM1K benchmarks in terms of reasoning accuracy and inference latency? 8 claims were extracted from source…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: What is the performance gap between MobileVLM and state-of-the-art VLMs on the MM1K benchmark when evaluated under low-resource settings (e.g., 5-shot learning) for robotic manipulation tasks? 8 claims were…
Abstract: This report synthesises findings from 5 peer-reviewed papers addressing the following research question: How does Adaptive Reasoning Suppression compare to Speculative Decoding in terms of GSM8K accuracy and throughput for Llama-3-8B models? 10 claims were extracted from source literature; 0 were independently…
Abstract: This report synthesises findings from 6 peer-reviewed papers addressing the following research question: What is the impact of different Top-K sampling strategies in ESP's speculative token tree construction on the code completion accuracy of DeepSeek-V3 across the HumanEval and MBPP benchmarks? 12 claims were…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does the performance of DeepSeek-V3's multi-token prediction objective compare to standard next-token prediction on code completion accuracy in low-resource programming languages using the. 16 claims were…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: Does training on rationale-augmented preference data improve the robustness of DPO-aligned models against adversarial prompts on the AlpacaEval 2.0 benchmark compared to standard PPO alignment? 6 claims were…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: To what extent does instruction tuning impact the ability of 7B-8B parameter LLMs to identify low-density regions in tabular data compared to their base pre-trained counterparts using F1-score metrics? 8 claims…
Abstract: This report synthesises findings from 3 peer-reviewed papers addressing the following research question: How robust are Mistral 7B and Llama 3.1 8B to distribution shifts in tabular anomaly detection tasks when measured by the area under the precision-recall curve across different noise levels? 8 claims were…
Abstract: This report synthesises findings from 6 peer-reviewed papers addressing the following research question: How does the inclusion of explicit rationales in preference datasets impact the win rate scaling of DPO compared to PPO on AlpacaEval 2.0 for 7B versus 70B parameter models? 6 claims were extracted from source…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does the accuracy of MobileVLM's 1.4B and 2.7B models on the MME and MM1K benchmarks compare to quantized versions of larger 3B to 13B VLMs? 12 claims were extracted from source literature; 12 were…
Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: How does the zero-shot anomaly detection precision-recall performance of Llama 3.1 8B compare to Mistral 7B when evaluated on synthetic tabular datasets with varying degrees of feature correlation? 0 claims were…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the correlation between the complexity of neuro-symbolic logical constraints and verification accuracy degradation under adversarial perturbations in formal proof datasets? 10 claims were extracted from…
Abstract: This report synthesises findings from 5 peer-reviewed papers addressing the following research question: How does the throughput latency of MobileVLM's efficient projector architecture scale when deployed on heterogeneous mobile hardware compared to standard transformer-based projectors? 9 claims were extracted from…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does quantization-aware training influence multimodal benchmark performance on ScienceQA compared to post-training quantization? 15 claims were extracted from source literature; 9 were independently verified…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: Can reinforcement learning from human feedback (RLHF) improve Bayesian Network-based condition monitoring systems' performance in dynamic environments as measured by real-time risk assessment accuracy? 8 claims…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the impact of multimodal fusion techniques on the accuracy of failure detection in WECSs when evaluated against SCADA system benchmarks? 9 claims were extracted from source literature; 8 were…
Abstract: This report synthesises findings from 6 peer-reviewed papers addressing the following research question: How does the performance of GRACE compare to other quantization-aware training methods on the MMBench and COCO-Text benchmarks in terms of multimodal alignment accuracy and inference latency? 5 claims were…
Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: How does GRACE's quantization-aware training scale with model size, and how does it affect performance on the MME and MM1K benchmarks when applied to VLMs with 3B to 13B parameters? 8 claims were extracted from…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does dynamic quantization affect multimodal alignment performance on the VQA v2 dataset compared to static quantization methods? 14 claims were extracted from source literature; 12 were independently verified…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the activation sparsity of DeepSeek-V3's 37B active parameters correlate with accuracy degradation on multi-step reasoning tasks in the MMLU and BBH datasets relative to dense model. 10 claims were…
Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: How does dynamic quantization of attention layers impact pass@1 scores on the HumanEval benchmark for code generation models? 9 claims were extracted from source literature; 9 were independently verified against…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: What is the comparative performance of DeepSeek-V3's multi-token prediction training objective on code generation benchmarks like HumanEval and MBPP versus standard next-token prediction baselines? 15 claims were…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What is the comparative accuracy degradation of vision-language models versus standalone CNN architectures on document recognition tasks under structured adversarial attacks? 9 claims were extracted from source…
Abstract: This report synthesises findings from 5 peer-reviewed papers addressing the following research question: Does on-the-job learning improve robustness against unseen conversational scenarios in dialogue systems as measured by ConvEval failure rates? 10 claims were extracted from source literature; 9 were independently…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does DeepSeek-V3's auxiliary-loss-free load balancing strategy impact token throughput and latency on long-context reasoning benchmarks compared to traditional MoE routing mechanisms? 10 claims were extracted…