Papers
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: Does the adoption of directional preference alignment improve robustness against diverse user preference shifts in code generation benchmarks without degrading model efficiency. Methods for detecting nucleotide…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: Does activation-aware quantization preserve visual grounding capabilities better than standard post-training quantization on the RefCOCO+ benchmark. In the past year, MultiModal Large Language Models (MM-LLMs)…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the trade-off between inference throughput in tokens per second and functional correctness when applying multi-objective alignment frameworks to large language models on coding tasks. The rapid…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does directional preference alignment with multi-objective rewards impact code generation accuracy on the DS-1000 benchmark compared to standard scalar-reward RLHF methods. The rapid evolution of…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does increasing context window size affect pass@1 accuracy on BigCodeBench for Code Llama variants during cross-library API generation tasks. Large Language Models (LLMs) have garnered remarkable advancements…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: Does optimizing context window size improve inference efficiency and maintain accuracy for Python code generation in data-constrained pretraining scenarios. We release Code Llama, a family of large language…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: What is the trade-off between inference latency and vulnerability detection accuracy when deploying fine-tuned 7B code models versus 70B models for on-premise security analysis. Edge computing environments face…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does training data heterogeneity across C, C++, and Python affect the F1 score of 7B-parameter code models compared to 70B-parameter models in CWE vulnerability detection. Deep learning (DL) is one…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does the inference efficiency (measured in tokens/sec or latency) of Llama3-70B and Codestral-7B change across fine-tuning iterations, and does this correlate with their alignment scores on. Large language…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does fine-tuning CodeT5 on syntactically perturbed code datasets impact Pass@K performance in cross-language migration tasks compared to standard fine-tuning. Large Language Models (LLMs) have garnered…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the impact of fine-tuning Llama3-70B on mixed-code datasets (e.g., Rust/Python or Go/Java) on its cross-domain generalization, as measured by completion accuracy and perplexity in. QUANTUM ESPRESSO is an…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What is the impact of input sequence length on the efficiency-accuracy trade-off in retrieval-augmented Llama3-70B compared to Llama-13B for long-context code tasks like vulnerability detection. The escalating…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does semantic retrieval augmentation via Elicit-like systems affect pass@1 scores on HumanEval for niche domain code generation compared to standard context window extension. As far back as the industrial…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the train-test split contamination rate affect the F1-score stability in code generation models evaluated on CodeXGLUE security subsets. The development of large language models (LLMs) such as ChatGPT…
Abstract: This report synthesises findings from 2 peer-reviewed papers addressing the following research question: How does the robustness of retrieval-augmented generation compare between Llama3-70B and Gemini 1.5 Pro on the CodeXGLUE security subset when evaluated using the EM (Exact Match) metric under. Large Language…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the impact of annotation bias in GPT-4 generated visual instructions on the hallucination rates of vision-language models evaluated on standard VQA datasets. Despite vision-language models' (VLMs)…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: To what extent does chain-of-thought prompting improve the robustness of Mistral-Large-2 versus GPT-4 on edge-case scenarios within the MBPP benchmark. Large language models (LLMs) have demonstrated remarkable…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: To what extent does code-based self-verification improve robustness against adversarial perturbations in math word problems compared to standard multimodal fusion approaches. Recent progress in large language…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: What is the relative inference latency and throughput trade-off between Mistral-Large-2 and GPT-4 when executing complex coding tasks on the MBPP dataset. The advent of Large Language Models (LLMs) has raised…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does the inference latency and token consumption of code-interpreter augmented LLMs compare to chain-of-thought prompting on the AQuA and SVAMP benchmarks under fixed compute constraints. Chain-of-Thought…
Abstract: This report synthesises findings from 7 peer-reviewed papers addressing the following research question: How do different multimodal alignment strategies affect cross-lingual retrieval performance and robustness when adapting English pre-trained models to the MSVD-Indonesian benchmark. Multimodal learning on video…
Abstract: This report synthesises findings from 5 peer-reviewed papers addressing the following research question: To what extent does the ECCO benchmark's natural language evaluation paradigm correlate with hardware-independent runtime metrics across different code-generating LLMs. Edge-cloud collaborative computing (ECCC)…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does the efficiency-performance trade-off of Mistral-Large-2 on the MBPP benchmark compare to smaller variants when optimizing for both execution time and functional correctness. Program synthesis has been…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does the functional correctness of Mistral-Large-2 generated solutions on MBPP scale with model size, as measured by pass@k scores compared to smaller variants like Mistral-7B. Large Language Models (LLMs)…
Abstract: This report synthesises findings from 5 peer-reviewed papers addressing the following research question: What is the robustness of automated test suite evaluations for code generated by Mistral-Large-2 on MBPP when benchmarked against human evaluations using Cohen's kappa for inter-rater agreement. Large Language…