Papers
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: Can the human attention benchmark be used to improve the training of attention-based models through multi-task learning frameworks? Deep convolutional neural networks have performed remarkably well on many…
Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: What is the impact of using multi-layer human attention masks versus single-layer attention mechanisms on explanation quality scores? Deep convolutional neural networks have performed remarkably well on many…
Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: What are the cross-domain reasoning capabilities of DeepSeek-V4-Pro when evaluated on the ARC and HellaSwag benchmarks? Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the human attention benchmark compare to existing synthetic attention evaluation metrics in terms of correlation with model performance on downstream tasks? Many computational models of visual attention…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: What is the performance difference between DeepSeek-V4-Pro and GPT-4 on HumanEval code generation benchmark scores? Understanding and reasoning over diagrams is a fundamental aspect of human intelligence. While…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does the inference efficiency of DeepSeek-V4-Pro compare to other LLMs on standard reasoning benchmarks like MMLU and GSM8K? Rapid advancements in large language models (LLMs) have increased interest in…
Abstract: This report synthesises findings from 18 peer-reviewed papers addressing the following research question: What are the precision and recall metrics for DeepSeek-V3 in detecting specific code smell categories compared to human-annotated ground truth? Determining which Large Language Model (LLM) is superior for code…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the inference latency of DeepSeek-R1 compare to Llama-2-70B on GSM8K across different batch sizes and hardware configurations? Finetuning language models on a collection of datasets phrased as…
Abstract: This report synthesises findings from 7 peer-reviewed papers addressing the following research question: What is the inference latency of DeepSeek-R1 on HumanEval-V benchmark tasks compared to baseline multimodal models? The rapid evolution of large language models (LLMs) has driven a transformative shift in…
Abstract: This report synthesises findings from 3 peer-reviewed papers addressing the following research question: What is the performance difference between DeepSeek-R1 and Claude models on SWE-bench Verified when evaluated with and without access to issue-specific file context? Code repair is a fundamental task in software…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: Does cross-domain finetuning improve DeepSeek-V3's performance on GPQA Diamond, and if so, by what percentage? The rapid evolution of large language models (LLMs) has driven a transformative shift in…
Abstract: This report synthesises findings from 6 peer-reviewed papers addressing the following research question: What is the pass@1 accuracy of DeepSeek-V3 on the HumanEval benchmark for code generation tasks? As Large Language Models (LLMs) become increasingly integrated into secure software development workflows, a…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: Does Llama-3.1-8B exhibit consistent MBPP performance across different programming language domains (e.g., Python vs. JavaScript) when fine-tuned on domain-specific code datasets? Large Language Models (LLMs)…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does the file retrieval accuracy of DeepSeek-V3 correlate with its final issue resolution success rate on SWE-bench Verified? As Large Language Models (LLMs) become increasingly integrated into secure…
Abstract: This report synthesises findings from 7 peer-reviewed papers addressing the following research question: What is the inference latency in tokens per second of DeepSeek-V3 when processing SWE-bench Verified issues compared to baseline models? The rapid evolution of large language models (LLMs) has driven a…
Abstract: This report synthesises findings from 2 peer-reviewed papers addressing the following research question: How does Llama-3.1-8B's performance on MBPP compare to other open-source 8B-parameter models like Falcon-8B or Mistral-8B in terms of pass@1 accuracy? Large Language Models (LLMs) have achieved remarkable success…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does Llama-3.1-8B's performance on LiveCodeBench compare to smaller or similarly sized language models when evaluated under low-resource conditions or limited inference budgets? Large Language Models achieve…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How do different PDF preprocessing techniques (e.g., anonymization, content extraction methods) affect LiveCodeBench performance for Llama-3.1-8B in GDPR-compliant pipelines, and what trade-offs. Blockchains or…
Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) for domain-specific question-answering (QA) tasks by leveraging external knowledge sources. However, traditional RAG systems primarily focus on relevance-based retrieval and often struggle with redundancy, especially when reasoning requires…
Abstract: Retrieval-Augmented Generation (RAG) is a prevalent approach to infuse a private knowledge base of documents with Large Language Models (LLM) to build Generative Q&A (Question-Answering) systems. However, RAG accuracy becomes increasingly challenging as the corpus of documents scales up, with Retrievers playing an…