Papers
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does the zero-shot accuracy of LLaVA compare to T5-11B on math word problems when both models are provided with the same image captions? 0 claims were extracted from source literature; 0 were independently…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does the mathematical reasoning performance of Gemma-2-7B compare to Mistral-7B and Llama-2-7B on BIG-Bench subsets when controlling for instruction finetuning scale? 8 claims were extracted from source…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does the zero-shot vs few-shot performance gap between Gemma-2-7B and larger parameter models vary across different BIG-Bench mathematical problem domains (e.g., algebra, calculus, logic)? 10 claims were…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the impact of varying image resolution inputs on the code generation accuracy of multimodal transformers on the HumanEval-V benchmark? 12 claims were extracted from source literature; 11 were…
Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: What is the comparative robustness of Gemini 1.5 Flash versus Pro against adversarial perturbations in complex diagram interpretation tasks within HumanEval-V? 12 claims were extracted from source literature; 8…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does the retrieval accuracy of fine-grained visual details degrade in Gemini 1.5 models as the number of interleaved image-text tokens exceeds 500k? 11 claims were extracted from source literature; 11 were…
Abstract: This report synthesises findings from 5 peer-reviewed papers addressing the following research question: What is the correlation between the semantic complexity of natural language instructions and the error propagation rate in multi-step GUI automation tasks? 9 claims were extracted from source literature; 6 were…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How do vision-language models compare to text-only LLMs in accuracy on HumanEval-V when evaluated with chain-of-thought prompting? 9 claims were extracted from source literature; 9 were independently verified…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: To what extent does training on diverse application interfaces improve the zero-shot generalization of GUI agents to unseen software environments? 10 claims were extracted from source literature; 10 were…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the task success rate of compositional GUI agents degrade as the number of sequential steps increases in complex post-production workflows? 4 claims were extracted from source literature; 4 were…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the impact of test-time compute scaling strategies on the robustness of InternVL 2.5 against adversarial perturbations in the ChartQA dataset? 9 claims were extracted from source literature; 9 were…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How robust is the updated Gemini 1.5 Pro to out-of-distribution shifts in code generation benchmarks compared to its February release counterpart? 12 claims were extracted from source literature; 8 were…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does scaling model parameters from 7B to 32B affect zero-shot performance on the CLUE benchmark compared to few-shot settings? 13 claims were extracted from source literature; 6 were independently verified…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: What is the correlation between pretraining data volume and robustness to adversarial perturbations in Chinese NLU tasks within the CLUE suite? 9 claims were extracted from source literature; 6 were independently…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does FLoRIST-OLMo-1B's performance on the MMBench benchmark compare to larger multimodal models when evaluating diagrams with varying levels of occlusion or noise? 7 claims were extracted from source…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How robust are the reasoning capabilities of Gemini 1.5 Pro on long-context mathematical problem-solving tasks compared to specialized models like GPT-4 when evaluated on the MathQA benchmark? 15 claims were…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the exact match accuracy of Mistral-7B-Instruct-v0.2 compare to Llama-2-7B and Gemma-7B on university-level calculus problems in the MathOdyssey dataset? 8 claims were extracted from source literature; 8…
Abstract: This report synthesises findings from 6 peer-reviewed papers addressing the following research question: What is the impact of varying quantization bit-widths (e.g., INT2, INT4, INT8) on GRACE-LLaVA-1.5-7B's performance across different multimodal benchmarks, including MMBench and MMATH? 10 claims were extracted from…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: Does GRACE's confidence-based distillation approach improve robustness to adversarial multimodal inputs compared to standard quantization-aware training methods for VLMs? 6 claims were extracted from source…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does the performance of INT4-quantized GRACE-LLaVA-1.5-7B compare to other state-of-the-art quantized multimodal models on MultiModal-Multilingual-HumanEval in terms of accuracy and latency? 9 claims were…
Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: What is the impact of training stability techniques employed in OLMo 2 on the robustness of OLMoE-1B-7B-0125 when evaluated on adversarial language understanding tasks like ANLI or AdversarialQA? 11 claims were…
Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: How does the robustness of GRACE-LLaVA-1.5-7B-INT4 compare to that of other quantized multimodal models like Qwen-VL-Chat-INT4 on adversarial visual perturbations across language understanding. 12 claims were…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does the performance of GRACE-LLaVA-1.5-7B-INT4 scale with model size (e.g., 7B vs. 13B) on adversarial visual perturbation tasks compared to unquantized models, as measured by accuracy on. 8 claims were…
Abstract: This report synthesises findings from 2 peer-reviewed papers addressing the following research question: What is the impact of OLMo 2's modified architecture and training stability techniques on the throughput and latency of inference for the OLMoE-1B-7B-0125-Instruction model across different hardware? 14 claims were…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does Qwen3's performance on mathematical reasoning benchmarks (e.g., GSM8K, MATH) compare to other state-of-the-art LLMs like GPT-4 and Claude 3 in terms of accuracy and scaling with model size? 13 claims…