Papers
Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: How do instruction-tuned Llama3 and DeepSeek R1 models compare in robustness scores when evaluated against taxonomy-specific adversarial perturbations in code security benchmarks? 7 claims were extracted from…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What are the differences in inference efficiency and latency throughput between Llama3 and DeepSeek R1 when processing adversarially perturbed code generation inputs? 11 claims were extracted from source…
Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: To what extent does alignment tuning in Llama3 and DeepSeek R1 mitigate helpfulness degradation across diverse adversarial taxonomies in automated code repair tasks? 8 claims were extracted from source literature;…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the pass@1 performance of Codestral compare to Llama3 on the Multilingual HumanEval dataset across non-Python programming languages? 8 claims were extracted from source literature; 7 were independently…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How do inference latency and token throughput differ between Codestral and Llama3 when generating solutions for LiveCodeBench's multi-step programming problems? 8 claims were extracted from source literature; 8…
Abstract: This report synthesises findings from 5 peer-reviewed papers addressing the following research question: How does the integration of geoparsing modules affect the end-to-end inference latency and token throughput of LLMs on qualitative spatial reasoning benchmarks compared to baseline models? 9 claims were extracted…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the relationship between adversarial code complexity (measured by cyclomatic complexity) and the inference latency of DeepSeek R1 when generating solutions for HumanEval, and can efficiency. 11 claims…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does the cyclomatic complexity of adversarial test cases impact the robustness of DeepSeek R1's generated code when evaluated using the MBXP benchmark, and can this be quantified by comparing. 6 claims were…
Abstract: This report synthesises findings from 2 peer-reviewed papers addressing the following research question: How does the pass@1 performance of Codestral compare to Llama3 on LiveCodeBench's time-split evaluation to measure contamination effects in code generation? 11 claims were extracted from source literature; 10 were…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does the cross-domain transferability of DeepSeek R1's code generation performance compare to other LLMs when evaluated on the DS-1000 benchmark across programming languages with varying. 12 claims were…
Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: What is the impact of attention mechanisms in Enformer on long-range dependency modeling compared to traditional sequence models, evaluated using synthetic benchmarks with controlled interaction. 11 claims were…
Abstract: This report synthesises findings from 3 peer-reviewed papers addressing the following research question: How does the generalization performance of Enformer-derived variant effect predictions compare to Clustal Omega-based methods across diverse sequence families, measured by cross-family prediction. 11 claims were…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the inference throughput of DeepSeek-V3's Multi-head Latent Attention (MLA) at varying context window sizes when processing adversarial code samples, measured in tokens per second on A100 GPUs? 9 claims…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the multi-token prediction objective in DeepSeek-V3 improve adversarial code generation accuracy compared to single-token objectives, measured by HumanEval pass@1 on code completion tasks? 10 claims were…
Abstract: This report synthesises findings from 7 peer-reviewed papers addressing the following research question: How does fine-tuning metagenomic language models on variant effect prediction tasks affect their ability to generalize to unseen protein families, as measured by cross-domain performance on. 8 claims were…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How do universal biomedical pretrained models compare to domain-specific models in terms of zero-shot segmentation accuracy across diverse MRI modalities? 0 claims were extracted from source literature; 0 were…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does adversarial fine-tuning affect the cross-language vulnerability detection F1 scores of Llama3 compared to Codestral on C++ and Python codebases? 7 claims were extracted from source literature; 7 were…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What is the effect of multi-task learning strategies on the memory efficiency and convergence speed of foundational models trained on sparse biomedical imaging datasets? 9 claims were extracted from source…
Abstract: This report synthesises findings from 1 peer-reviewed paper addressing the following research question: How do Llama3 and Codestral compare in zero-shot cross-lingual code vulnerability identification accuracy when evaluated on mixed C++ and Python datasets? 12 claims were extracted from source literature; 12 were…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How do inference efficiency and latency trade-offs correlate with adversarial robustness scores for DeepSeek R1 when processing perturbed code inputs compared to Codestral? 3 claims were extracted from source…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: Do instruction-tuned variants of Llama3 and DeepSeek R1 exhibit different degradation patterns in helpfulness scores when subjected to taxonomy-specific adversarial perturbations in code security. 9 claims were…
Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: How does the accuracy degradation of DeepSeek R1 compare to Llama3 and Codestral under adversarial code perturbations across diverse programming languages in the Big-Vul dataset? 4 claims were extracted from…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the pass@1 performance of Codestral compare to Llama3 on the HumanEval dataset when evaluated under few-shot prompting conditions? 10 claims were extracted from source literature; 9 were independently…
Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: How do Codestral and Llama3 differ in inference latency and token generation throughput while achieving comparable pass@1 accuracy on code generation benchmarks? 7 claims were extracted from source literature; 7…
Abstract: This report synthesises findings from 7 peer-reviewed papers addressing the following research question: What is the difference in pass@10 scores between Codestral and Llama3 on the MBPP dataset across varying model parameter scales? 9 claims were extracted from source literature; 9 were independently verified…