Papers
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: LiveCodeBench benchmark: robustness and generalization analysis — rotation 0. 16 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: Can Shift Parallelism maintain high token throughput efficiency when scaled to multimodal LLMs processing variable-length image-text sequences compared to pipeline parallelism? 16 claims were extracted from…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: What is the impact of KV cache variance mitigation techniques in Shift Parallelism on multi-turn dialogue reasoning accuracy compared to standard tensor parallelism? 0 claims were extracted from source literature;…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the performance degradation of current alignment techniques in multimodal models when evaluated on out-of-distribution engineering diagrams from the Uni-MMMU benchmark? 0 claims were extracted from source…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: Language model inference efficiency throughput benchmark comparison. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated multi-reviewer quality…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: Multimodal language model vision reasoning benchmark evaluation analysis. 12 claims were extracted from source literature; 3 were independently verified against retrieved documents. An automated multi-reviewer…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How robust are current alignment techniques in multimodal models when evaluated on adversarial or out-of-distribution samples from Uni-MMMU's science and engineering disciplines? 7 claims were extracted from…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: Chain-of-thought extended thinking benchmark accuracy improvement survey. 13 claims were extracted from source literature; 1 was independently verified against retrieved documents. An automated multi-reviewer…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: Test-time compute scaling reasoning benchmark performance accuracy tradeoff. 12 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated multi-reviewer…
Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: Open source language model benchmark leaderboard systematic review. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated multi-reviewer quality…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the pass@k performance of large language models on LiveCodeBench correlate with their inference latency and token throughput across different model scales? 0 claims were extracted from source literature;…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What is the impact of integrating fine-grained intermediate feedback from PRMs on the inference efficiency and token consumption of autonomous coding agents on the SWE-bench dataset? 13 claims were extracted from…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the verification protocol of HLE-Verified impact the correlation between model performance on noisy vs. verified subsets of the Humanity's Last Exam benchmark? 0 claims were extracted from source…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the comparative efficiency of different inference optimization techniques when evaluating frontier models on the revised HLE-Verified benchmark in terms of throughput and accuracy trade-offs? 0 claims…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: BIG-Bench Hard reasoning task language model evaluation comparison. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated multi-reviewer quality…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: HumanEval code generation state of the art language model survey. 16 claims were extracted from source literature; 1 was independently verified against retrieved documents. An automated multi-reviewer quality…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: MMMU multimodal understanding benchmark evaluation systematic review. 7 claims were extracted from source literature; 1 was independently verified against retrieved documents. An automated multi-reviewer quality…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: GPQA Diamond benchmark frontier model performance evaluation recent literature. 11 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: LiveCodeBench competitive programming language model performance analysis. 20 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated multi-reviewer…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: Humanity's Last Exam benchmark frontier model evaluation comparison. 14 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated multi-reviewer quality…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: AIME mathematical competition language model benchmark evaluation. 20 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated multi-reviewer quality…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: SWE-bench Verified autonomous coding agent state of the art results. 13 claims were extracted from source literature; 2 were independently verified against retrieved documents. An automated multi-reviewer quality…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: What is the impact of ensemble defense mechanisms on the accuracy and robustness of LLMs in code generation tasks, as measured by the HumanEval+ benchmark? 11 claims were extracted from source literature; 1 was…
Abstract: This report synthesises findings from 3 peer-reviewed papers addressing the following research question: How does the computational efficiency of adversarial contrastive pre-trained models compare to traditional supervised models in rumor detection tasks, as measured by inference latency and throughput? 6 claims were…
Abstract: This report synthesises findings from 3 peer-reviewed papers addressing the following research question: How does the choice of different metapath sampling granularities (coarse vs. fine-grained) affect the inference efficiency and throughput of Metapath Context Convolution-based HGNNs on large-scale. 5 claims were…