Papers
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the load balancing efficiency of DeepSeek-V3's auxiliary-loss-free policy compare to traditional routing methods during long-context inference tasks? 13 claims were extracted from source literature; 4…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does the ALF-LB load balancing method compare to traditional auxiliary-loss-based approaches in terms of training throughput and final model accuracy on the HumanEval code generation benchmark? 13 claims were…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the trade-off between instruction-following accuracy and inference latency when comparing Claude-3.5-Sonnet with quantized versions of Llama-3 on the Multi-Turn Robotic Instruction Following. 0 claims…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the performance of LongVA-7B and LLaVA-1.6 on HumanEval-V vary when evaluated with different diagram types (e.g., flowcharts vs. UML diagrams), and can this inform model-specific. 11 claims were…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does the performance of Claude-3.5-Sonnet compare to state-of-the-art open-source multimodal models on the MobileAloha benchmark when evaluated for instruction adherence in robotic manipulation? 17 claims…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How robust are instruction-following capabilities of Claude-3.5-Sonnet and quantized mobile models when tested with adversarial perturbations in the MobileAloha dataset, measured by success rate and. 16 claims…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the impact of multi-turn refinement loops on the robustness of code generation models against adversarial prompts in the HumanEval dataset? 0 claims were extracted from source literature; 0 were…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the impact of varying levels of visual complexity in diagrams on the reasoning accuracy of LLaVA-NeXT and Video-LLaVA-8B, and how does this correlate with their performance on standard. 14 claims were…
Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: What is the cross-domain transferability of visual reasoning capabilities in LMMs when trained on HumanEval-V versus traditional multimodal benchmarks like VQA or COCO? 15 claims were extracted from source…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What is the correlation between model parameter scale and performance degradation on visual logic puzzles within the LogicVista dataset under low-resolution conditions? 17 claims were extracted from source…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: Does applying self-refinement loops to codegen-2b yield diminishing returns in accuracy improvement after three iterations on the APPS competition-level dataset? 14 claims were extracted from source literature; 0…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What are the robustness and generalization capabilities of Gemini 1.5 Pro when evaluated on LongVideoBench across different video domains (e.g., lectures, tutorials, documentaries) and how does this. 0 claims…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does Gemini 1.5 Pro handle long-term dependency modeling in video-language understanding tasks compared to prior models, and what metrics (e.g., F1 score, latency) best capture this performance? 17 claims…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What is the difference in pass@k metrics between iterative self-refinement and single-pass decoding for codegen-2b on the HumanEval benchmark? 12 claims were extracted from source literature; 0 were independently…
Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: How does the retrieval accuracy of Gemini 1.5 Pro compare to other multimodal models like GPT-4V or PaLM-M on long-context benchmarks such as LongBench or Needle-in-a-Haystack? 16 claims were extracted from source…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does the retrieval accuracy of Gemini 1.5 Flash degrade on the Needle In A Haystack benchmark compared to Gemini 1.5 Pro when context length exceeds 500k tokens? 0 claims were extracted from source literature;…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What is the impact of domain-adaptive RAG on the calibration metrics and false positive rates of quantized Mistral 7B when detecting anomalies in multimodal cyber-physical system logs? 0 claims were extracted…
Abstract: This report synthesises findings from 1 peer-reviewed paper addressing the following research question: How do multimodal models like Gemini 1.5 Pro compare to prior models in terms of accuracy and computational cost when processing interleaved video-language inputs of varying lengths, particularly for. 7 claims were…
Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: How does the performance degradation of XGLM-564M on imbalanced educational dialogue datasets vary between Indonesian and English across different difficulty levels? 13 claims were extracted from source…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the difference in adversarial robustness scores for XGLM-564M when classifying tutoring dialogue acts across high school versus undergraduate level datasets in English and Indonesian? 10 claims were…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does the out-of-domain generalization accuracy of XGLM-564M compare between Indonesian and English on low-resource educational dialogue act classification tasks? 0 claims were extracted from source…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the integration of faithfulness constraints in RAG pipelines affect the accuracy of Phi-3-mini and Mistral-7B-v0.1 on low-resource language benchmarks? 7 claims were extracted from source literature; 0…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the impact of retrieval context length on the factuality scores of Phi-3-mini versus Mistral-7B-v0.1 in multi-hop question answering tasks? 16 claims were extracted from source literature; 3 were…
Abstract: This report synthesises findings from 7 peer-reviewed papers addressing the following research question: What is the performance gap in F1 scores for Indonesian hate speech detection between feature-based multilingual models and fine-tuned monolingual approaches across varying training data sizes? 20 claims were…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How do Phi-3-mini and Mistral-7B-v0.1 compare in hallucination rates on long-context RAG benchmarks for specialized religious domains? 8 claims were extracted from source literature; 0 were independently verified…