Papers
Abstract: This report synthesises findings from 7 peer-reviewed papers addressing the following research question: What is the impact of model size on the coverage-efficiency trade-off of conformal prediction sets for out-of-distribution detection in healthcare language tasks? 10 claims were extracted from source literature;…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the impact of multi-turn RL training on the sample efficiency and convergence speed of VLA agents performing long-horizon tasks in ALFRED? 0 claims were extracted from source literature; 0 were…
Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: How does the multi-turn conversation paradigm in LongNav-R1 compare to chain-of-thought prompting in terms of success rate and path efficiency on the ALFRED benchmark under partial observability? 6 claims were…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the impact of geodesic distance-based retrieval on inference latency and throughput compared to cosine similarity in large-scale language model applications? 5 claims were extracted from source…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: Does horizon-adaptive multi-turn RL improve the robustness of VLA models to environmental perturbations and instruction ambiguity in the ALFRED benchmark relative to supervised single-turn approaches? 7 claims…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: Does the type-aware entity representation in NER Retriever improve cross-domain generalization for rare entities on the FEVER benchmark compared to standard DPR baselines? 0 claims were extracted from source…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does horizon-adaptive multi-turn reinforcement learning affect the task success rate of Vision-Language-Action models on the ALFRED dataset compared to single-turn baselines? 9 claims were extracted from…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the integration of thinking and non-thinking modes in Qwen3 affect its performance on HumanEval Pro and MBPP Pro benchmarks, as measured by pass@k accuracy and latency trade-offs compared to. 0 claims…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does replacing cosine similarity with geodesic distance metrics affect the robustness scores of dense retrievers on the Adversarial NLI benchmark under domain shift? 10 claims were extracted from source…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What is the comparative robustness of Llama-2 models with and without multimodal pre-training when evaluated on non-adversarial versus adversarial inputs in the MBPP Pro benchmark, measured by. 14 claims were…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does the integration of entity-aware attention mechanisms in RAG models impact the retrieval precision for rare entities on the BEIR benchmark compared to standard DPR baselines? 11 claims were extracted from…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: To what extent does dynamic entity representation in NER Retriever affect the inference latency of RAG models on the MS MARCO benchmark while maintaining retrieval effectiveness? 9 claims were extracted from…
Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: How does the efficacy of self-repair in Llama-2 models scale with instruction-tuning data size, measured by HumanEval pass@1 accuracy and token efficiency in code generation tasks? 0 claims were extracted from…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does the inference latency of self-repair in Llama-2 models vary with task complexity (e.g., single-function vs. multi-file code generation), and what trade-offs exist between accuracy and. 5 claims were…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does the diversity of instruction-tuning data affect the cross-domain zero-shot code generation capability of Llama-2 models, as measured by pass@1 accuracy on HumanEval across Python,. 0 claims were extracted…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does the integration of structured diagram representations (e.g., graph embeddings) with code generation tasks in multimodal models improve pass@k metrics compared to raw image-based reasoning on. 10 claims…
Abstract: This report synthesises findings from 7 peer-reviewed papers addressing the following research question: What is the impact of incorporating human-labeled visual instruction tasks on the multimodal reasoning performance of Flan-VLMs, as evaluated by VQA accuracy on OK-VQA and GQA benchmarks? 6 claims were extracted…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the impact of varying the percentage of known normal nodes on the convergence speed and inference efficiency of generative semi-supervised graph anomaly detection models? 17 claims were extracted from…
Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: How do generative semi-supervised graph anomaly detection methods perform in cross-domain transfer scenarios compared to unsupervised baselines when evaluated on multi-view graph benchmarks? 7 claims were…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does integrating item2vec-style sequential embeddings with large language model text encoders impact zero-shot recommendation accuracy on cross-domain datasets compared to pure ID-based. 9 claims were…
Abstract: This report synthesises findings from 7 peer-reviewed papers addressing the following research question: How robust are Metapath Context Convolution-based HGNNs to noisy or adversarial metapaths in heterogeneous graphs, as evaluated by link prediction F1 scores on corrupted versions of citation datasets? 5 claims…
Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: What is the impact of different metapath sampling granularities (e.g., coarse vs. fine-grained) on the performance of Metapath Context Convolution-based HGNNs in multi-task learning benchmarks like. 6 claims were…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: What is the trade-off between inference throughput (in samples/second) and recommendation precision (e.g., Recall@K) when scaling XSimGCL's contrastive loss weighting to billion-parameter multimodal. 9 claims were…
Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the effect of semantic text augmentation strategies versus structural graph perturbations on the robustness of contrastive recommendation models under data sparsity conditions? 0 claims were extracted…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does the integration of Metapath Context Convolution with transformers compare to traditional HGNNs in terms of node classification accuracy and inference latency on citation graphs like ACM or. 11 claims…