Papers
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the difference in inference throughput and token generation latency between sparse MoE and dense architectures when evaluated on code generation tasks like HumanEval? 12 claims were extracted from source…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does test-time compute scaling compare to other inference efficiency techniques (e.g., distillation, quantization) in improving reasoning performance on medical question-answering benchmarks like. 12 claims…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does task-conditioned expert routing in MoE models impact accuracy on GSM8K and MATH benchmarks compared to dense transformers of equivalent parameter count? 9 claims were extracted from source literature; 1…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the effect of instruction fine-tuning on language model mathematical problem-solving accuracy? 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does pretraining data quality affect language model reasoning benchmark performance? 12 claims were extracted from source literature; 1 was independently verified against retrieved documents. An automated…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How do language models perform on formal theorem proving and mathematical verification tasks? 10 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What techniques enable language models to solve competition-level software engineering problems? 13 claims were extracted from source literature; 4 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does reinforcement learning from human feedback improve language model mathematical reasoning? 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the relationship between model scale and emergent reasoning capabilities in transformers? 17 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 1 peer-reviewed paper addressing the following research question: What architectural innovations improve transformer performance on multi-step logical reasoning? 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What are the scaling laws for chain-of-thought reasoning in large language models? 12 claims were extracted from source literature; 1 was independently verified against retrieved documents. An automated…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How do sparse mixture-of-experts models compare to dense transformers on mathematical reasoning? 20 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does test-time compute scaling improve language model performance on reasoning benchmarks? 17 claims were extracted from source literature; 7 were independently verified against retrieved documents. An…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the false positive rate of Codestral-7B compare to Codestral-70B when detecting Solidity smart contract vulnerabilities under high-concurrency inference loads? 8 claims were extracted from source…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How do multimodal Llama3 variants compare to text-only variants in detecting OWASP Top 10 vulnerabilities when evaluating response safety metrics like accuracy and precision under adversarial. 0 claims were…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What is the correlation between model parameter scale and few-shot learning capability for detecting novel Common Weakness Enumerations in proprietary codebases without fine-tuning? 0 claims were extracted from…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: To what extent does instruction fine-tuning on synthetic obfuscation datasets improve the robustness of Llama3-70B against adversarial code perturbations compared to base models? 10 claims were extracted from…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the comparative false positive rate of DeepSeek R1 versus CodeLlama on buffer overflow vulnerabilities within the Big-Vul benchmark under varying context window sizes? 0 claims were extracted from source…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How does the inference latency scaling exponent of DeepSeek R1 change when processing nested control flow structures compared to linear code sequences in automated security auditing tasks? 18 claims were…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How does the token generation rate of DeepSeek R1 correlate with cyclomatic complexity metrics when performing vulnerability detection on the Big-Vul dataset? 8 claims were extracted from source literature; 2…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the vulnerability detection performance of DeepSeek R1 on the Big-Vul dataset vary across different levels of cyclomatic complexity compared to Llama3 and Codestral? 0 claims were extracted from source…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the correlation between token-level perplexity changes induced by code obfuscation and the drop in vulnerability detection accuracy for Llama3 models on the SARD benchmark? 0 claims were extracted from…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the correlation between code cyclomatic complexity and the false positive rates of DeepSeek R1, Llama3, and Codestral in automated vulnerability scanning tasks? 0 claims were extracted from source…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does fine-tuning Llama3 and Codestral on obfuscated code samples from the Big-Vul dataset affect their F1 scores compared to unobfuscated baselines? 12 claims were extracted from source literature; 1 was…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the impact of mixed obfuscation techniques (e.g., combining variable renaming, control flow flattening, and dead code insertion) on the detection accuracy of Llama3 versus Codestral when. 9 claims were…