Papers
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the alignment of Llama3-70B with human security review judgments (measured by EM score on SECURITYBENCH) evolve compared to Codestral-7B across different iterations of instruction fine-tuning? Large…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How do Llama3-70B and Codestral-34B generalize to low-resource programming languages beyond Java and Python, such as Rust or Go, when fine-tuned on limited domain-specific datasets, as measured by.…
Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: How does the integration of multi-agent context engineering workflows impact the throughput of niche domain code generation in Code LLMs, measured by tokens per second on HumanEval or MBPP benchmarks? Large…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the impact of model scaling on instruction following accuracy when evaluated on out-of-domain code generation tasks? Despite widespread deployment of Large Language Models, systematic evaluation of…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the computational efficiency trade-off when applying retrieval augmentation to Llama3-70B for code vulnerability classification, and how does it compare to smaller models like Llama-13B in. With many…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does fine-tuning affect the zero-shot and few-shot performance of Llama3-70B and Gemini 1.5 Pro on the CodeXGLUE security subset compared to retrieval-augmented approaches? Few-shot prompting has emerged as a…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the impact of train-test split contamination on F1-score inflation for code generation models on CodeXGLUE security subsets? Anomaly detection is a widely explored domain in machine learning. Many models…
Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: What is the trade-off between inference latency and accuracy when using retrieval-augmented generation for Llama3-70B versus Gemini 1.5 Pro on the CodeXGLUE security subset under few-shot learning? The advent of…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the alignment of Mistral-Large-2's self-invoking code generation affect its performance on cross-domain tasks (e.g., math vs. string manipulation) in MBPP Pro, and can fine-tuning improve. We introduce…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does the performance of retrieval-augmented Gemini 1.5 Pro and Llama3-70B compare on the CodeXGLUE security subset when evaluated with few-shot versus zero-shot learning across different. Few-shot prompting…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does the self-invoking code generation performance of Mistral-Large-2 compare to other state-of-the-art LLMs like GPT-4 or Claude 3 on the MBPP Pro benchmark in terms of solution correctness and. We introduce…
Abstract: This report synthesises findings from 17 peer-reviewed papers addressing the following research question: How does the inference efficiency of Mistral-Large-2 scale with model size when generating code on the MBPP benchmark, as measured by tokens per second and latency metrics? Large-scale video generative models,…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: To what extent does the complexity of the base problem in self-invoking code generation tasks impact the throughput and efficiency of Mistral-Large-2 during inference? We introduce self-invoking code generation,…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the impact of self-instruct methods based on GPT-4 on the performance of Japanese language models compared to traditional human-annotated benchmarks, as measured by BLEU or ROUGE scores? Despite…
Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: How does the code generation quality of Mistral-Large-2 compare to other state-of-the-art LLMs like GPT-4 on the MBPP benchmark when evaluated using execution-based metrics such as pass@k? Large language models…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How robust is Mistral-Large-2's solution transferability across different programming domains when evaluated on a cross-domain adaptation of the MBPP Pro benchmark? Reusing pre-collected data from different…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: Do multimodal models exhibit higher PER than text-only models on math word problems (e.g., SVAMP, AQuA) when evaluated with equal compute budgets, and how does modality fusion impact efficiency? Recent progress…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the impact of multimodal model scaling on inference efficiency when processing sign language video-to-text tasks, as measured by throughput and latency on benchmarks such as DAILY-1M or LSLR? Multimodal…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the performance of Mistral-Large-2 in generating code solutions on MBPP scale with model size, and how does this scaling affect both functional correctness and human evaluation scores? Although large…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the functional correctness and code quality of Mistral-Large-2 generated solutions on MBPP compare when evaluated using automated test suites versus human evaluation scores? The use of machine learning…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What is the cross-model robustness comparison between Qwen3-235B and Llama2-70B under PPTC-R attacks, evaluated using accuracy drop and token efficiency? In this paper, we investigate the problem of distributed…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How do vision-language models compare to pure visual models in terms of correlation between synthetic segmentation metrics and human rater agreement on multimodal medical image tasks like BRATS,. Training a deep…
Abstract: This report synthesises findings from 20 peer-reviewed papers addressing the following research question: To what extent does model size scaling in multimodal transformers (e.g., ViT, CLIP vs. small-scale CNN-based models) affect the alignment of synthetic metrics with human attention benchmarks in tasks. Tactile…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does multimodal context (text + code diagrams) affect the iterative code repair performance of DeepSeek-R1 on FeedbackEval compared to text-only context, measured by repair success rate and token. Code repair…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the token efficiency of DeepSeek-R1 compare to Claude-3 when performing few-shot code generation on HumanEval, measured by pass@1 accuracy per token consumed? How far are Large Language Models (LLMs) in…