Papers
Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: How does 4-bit versus 8-bit quantization affect the HumanEval pass@1 scores of code generation models. The rapid evolution of large language models (LLMs) has driven a transformative shift in artificial…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How do memory-efficient multimodal architectures perform relative to LLaVA-NeXT on long-context video understanding tasks within the Video-MME benchmark. We introduce phi-3-mini, a 3.8 billion parameter language…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does the inference latency of Gemini 1.5 Flash compare to LLaVA-NeXT on the Video-MME benchmark when constrained to 24GB VRAM. In this work, we present a novel method to tackle the token generation challenge…
Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: How does the score on the MT-bench change for Phi-3-mini versus Llama 3 70B when evaluated on code generation tasks involving long-context reasoning spanning 100K tokens. We introduce phi-3-mini, a 3.8 billion…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does the performance of Deepseek R1 and Codestral compare on Qiskit-based quantum code generation tasks when evaluated using the Qiskit HumanEval benchmark with varying levels of quantum circuit. As Large…
Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: What is the impact of task-specific fine-tuning on the throughput and accuracy of small language models compared to large models in code generation benchmarks such as HumanEval. Large Language Models (LLMs) have…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How robust is GADT3 to adversarial attacks on graph structure and node features compared to traditional supervised GAD methods, measured using the AUC-ROC score on perturbed datasets. Real-time traffic prediction…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does the memory efficiency of LLaVA-UHD scale with image resolution (e.g., 1024x1024 to 8192x8192) compared to dense inference in Visual-LLM benchmarks like LVIS. Visual encoding constitutes the basis of…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does neuron activation sparsity correlate with reasoning task accuracy degradation when models are pruned to cold neurons only in PowerInfer's inference pipeline. Activation sparsity offers a compelling route…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the impact of varying the number of homophily-guided self-supervision steps in GADT3 on its inference efficiency and detection accuracy across different graph domains. Graph Anomaly Detection (GAD) has…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does PowerInfer's neuron activation sparsity optimization affect inference latency when scaling from LLaMA-33B to LLaMA-70B across different consumer GPU memory configurations. This paper introduces…
Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: How does the accuracy of GADT3 compare to other state-of-the-art cross-domain graph anomaly detection models on standard graph benchmarks like Reddit and Twitter datasets. Anomaly detection is defined as…
Abstract: This report synthesises findings from 3 peer-reviewed papers addressing the following research question: What is the accuracy trade-off between dense and quantized LLaVA-UHD models on the PopVQA benchmark when processing images with varying aspect ratios (e.g., 16:9 vs. 9:16). We investigate the behaviour of quantum…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How does the inference latency of quantized LLaVA-UHD compare to LLaVA-1.5 when processing ultra-high-resolution images (e.g., 4K) across multimodal benchmarks like MMBench or SEED-Bench. The advent of real-time…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does the scaling behavior of quantization-aware training vary across different LLaVA model versions on multimodal reasoning benchmarks. Large Language Models (LLMs) have drawn a lot of attention due to their…
Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: How does activation-aware weight quantization affect LLaVA-1.5 performance on the GQA benchmark compared to standard post-training quantization methods. We present LLaVA-OneVision-1.5, a novel family of Large…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How does the PowerInfer hot neuron activation threshold parameter impact inference latency and accuracy trade-offs for LLaMA-33B and LLaMA-70B on the HumanEval code generation benchmark. Deploying local AI…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does the forecasting accuracy of Llama3 compare to domain-specific models like Prophet or ARIMA when evaluated on high-frequency renewable energy time-series data (e.g., minute-level solar power. This study…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does the pass@1 accuracy of fine-tuned LLaMA-70B on MBPP Python function synthesis compare to CodeGen/CodeLlama when evaluated under the same dynamic hot neuron threshold settings in PowerInfer. We benchmark…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does domain adaptation via cross-task fine-tuning affect the robustness of SLMs in detecting CWEs in Python code under adversarial perturbations compared to a baseline of pre-trained LLMs. A joint measurement…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the trade-off between inference throughput and pass@1 accuracy for SLMs vs. LLMs in CWE detection tasks on private Python codebases when deployed on-device vs. in cloud environments. Large Language Models…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does the alignment of LLaMA-70B with human preferences via PowerInfer's dynamic threshold adjustment scale with model size, as measured by accuracy on MBPP and the degree of preference divergence. Aligning…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the robustness of MORL-based preference alignment in PowerInfer when evaluated across diverse programming languages beyond Python (e.g., JavaScript, Java) using the HumanEval benchmark. Fine-grained…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How does the dynamic hot neuron threshold adjustment in PowerInfer compare to fixed threshold methods in terms of inference latency and memory efficiency when applied to LLaMA-70B on MBPP Python. Large Language…
Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: How does the detection accuracy of federated learning models compare to centralized deep neural networks when evaluated on the AndroZoo benchmark with varying levels of code obfuscation and. This work investigates…