Papers
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: Does joint training on English and Indonesian datasets improve robustness against adversarial perturbations in PAWS-X compared to single-language fine-tuning for mid-sized multilingual transformers? 0 claims were…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the zero-shot cross-lingual transfer accuracy of XGLM on Indonesian XNLI tasks scale relative to English as model size increases from 564M to 7.5B parameters? 12 claims were extracted from source…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What is the performance variance of Phi-3-mini versus Mistral-7B-v0.1 on GSM-Symbolic generated instances across non-English languages compared to the original MGSM dataset? 0 claims were extracted from source…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How do Qwen3 and Qwen2-1.5B differ in robustness against adversarial docstring perturbations across diverse programming languages in the HumanEval-X dataset? 10 claims were extracted from source literature; 0…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What is the comparative safety alignment performance of Qwen2.5 models versus prior versions on adversarial benchmarks like RedBench or WildQA, measured by safety score variance across different. 15 claims were…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the impact of dynamic attention head selection on multi-turn dialogue coherence scores compared to static multi-head attention in 7B parameter models? 12 claims were extracted from source literature; 0…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does the robustness of SpikingBrain compare to Llama 2 13B in repository-level coding tasks when evaluated under adversarial conditions (e.g., corrupted or obfuscated code) using the pass@1 metric? 0 claims…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the impact of model size scaling (e.g., 7B vs. 13B vs. 30B) on the LawBench benchmark performance of RLHF-aligned models, particularly in the Legal knowledge level, and does the performance. 15 claims…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: What is the impact of interleaving long video sequences with code documentation on Gemini 1.5 Flash's reasoning performance in multimodal software engineering benchmarks? 0 claims were extracted from source…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the retrieval accuracy of Gemini 1.5 Pro degrade on diagram-dependent coding tasks when context length exceeds 500k tokens compared to the 100k baseline? 0 claims were extracted from source literature; 0…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the impact of repository size (measured in lines of code) on the pass@1 scores of SpikingBrain versus Llama 2 13B when benchmarked on multi-file repository-level coding tasks? 0 claims were extracted from…
Abstract: This report synthesises findings from 1 peer-reviewed paper addressing the following research question: How does the pass@1 performance of SpikingBrain compare to Llama 2 13B and Claude 3 Sonnet when evaluated on repository-level coding tasks with mixed programming languages (Python + Java + JavaScript)? 8 claims…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does the perplexity of Sliding Window Attention-adapted Mistral 7B compare to full attention baselines on the LongCodeEval benchmark for contexts exceeding 16k tokens? 10 claims were extracted from source…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the accuracy-throughput trade-off of Kimi Delta Attention (KDA) versus full attention on the GEMM benchmark when processing sequences longer than 8k tokens? 9 claims were extracted from source literature;…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: Does the training-inference mismatch in sliding window attention cause significant accuracy degradation on the Needle In A Haystack test for code repositories larger than 32k tokens? 12 claims were extracted from…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the inference latency of multimodal dental X-ray models compare to single-modality CNNs when deployed on edge devices with quantized weights? 10 claims were extracted from source literature; 1 was…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does the chunkwise algorithm in Kimi Linear impact zero-shot reasoning performance on the MMLU benchmark compared to standard RNN-based architectures when trained with limited memory constraints? 0 claims were…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does the F1-score gap between English and Italian QA tasks scale when comparing Gemma2-2B and Gemma2-7B on adversarial cross-lingual datasets generated via beam search? 13 claims were extracted from source…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the correlation between pretraining loss reduction and downstream HHH benchmark accuracy for AdaptToken models across the 1B to 10B parameter range? 16 claims were extracted from source literature; 1 was…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How does the scaling of alignment performance on the HHH dataset vary between 3B and 8B parameter models when fine-tuned with different data mixture ratios? 0 claims were extracted from source literature; 0 were…
Abstract: This report synthesises findings from 6 peer-reviewed papers addressing the following research question: Can the mean shift-based feature space analysis improve the cross-domain generalization of AdaptToken-3B on AdvGLUE, and how does this compare to adversarial training with Jacobian regularization in. 7 claims were…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How do Gemini 1.5 Flash and Pro perform in zero-shot cross-domain adaptation tasks on the MMBench benchmark, and what are the trade-offs in accuracy and inference time between the two models? 12 claims were…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does Qwen2.5-7B perform relative to Llama-2-7B and Mistral-7B on code generation tasks in HumanEval and MBPP after normalizing for supervised fine-tuning dataset size? 12 claims were extracted from source…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does the mean shift clustering technique impact the adversarial robustness of AdaptToken-8B vs. AdaptToken-3B when fine-tuned on AdvGLUE tasks, as measured by accuracy under targeted FGSM attacks? 11 claims…
Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: How does the introduction of noise in interleaved image-text sequences affect the robustness of factual recall in Gemini 1.5 Flash compared to Gemini 1.5 Pro at context lengths above 200k tokens? 11 claims were…