Papers
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the inference latency and throughput of geodesic distance-based dense retrievers compare to Euclidean-based models when evaluated across the 18 heterogeneous datasets in BEIR? 0 claims were extracted…
Abstract: This report synthesises findings from 6 peer-reviewed papers addressing the following research question: How does the performance of contrastive learning models in hyperbolic space for zero-shot cross-lingual retrieval vary with different language pairs in XOR-TyDi QA, measured by recall@k and NDCG? 8 claims were…
Abstract: This report synthesises findings from 7 peer-reviewed papers addressing the following research question: Does replacing Euclidean distance with geodesic distance in dense retriever training improve zero-shot retrieval accuracy on the BEIR benchmark under domain shift conditions? 12 claims were extracted from source…
Abstract: This report synthesises findings from 2 peer-reviewed papers addressing the following research question: How do hyperbolic and Euclidean contrastive learning models scale with increasing model size and training data size in zero-shot cross-lingual retrieval for XOR-TyDi QA, measured by recall@k and NDCG? 5 claims…
Abstract: This report synthesises findings from 7 peer-reviewed papers addressing the following research question: What is the impact of different contrastive loss functions (e.g., InfoNCE, SupCon) on the performance of hyperbolic vs. Euclidean embeddings for cross-lingual retrieval in XOR-TyDi QA, evaluated with. 8 claims…
Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: Does the adoption of geodesic distance over cosine similarity improve the robustness of dense retrievers against adversarial query perturbations in out-of-distribution settings on the BEIR benchmark? 5 claims were…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: What is the comparative robustness of manifold-based semantic scoring versus cosine similarity in cross-lingual open QA benchmarks when evaluated on low-resource languages? 15 claims were extracted from source…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does conformal prediction for distribution shift estimation scale with model size in large language models trained on medical question-answering datasets? 6 claims were extracted from source literature; 6…
Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: What is the correlation between context window length and pass@1 accuracy on code generation tasks for Gemini 1.5 models when multimodal inputs include executable video demonstrations? 11 claims were extracted…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What is the difference in classification accuracy and robustness between multimodal models trained on dimensional facial affect representations versus raw visual features for deception detection on. 12 claims…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How robust are manifold-aware distance metrics in cross-domain dense retrieval tasks, as measured by performance on the MTEB (Massive Text Embedding Benchmark) across different domains such as news,. 5 claims…
Abstract: This report synthesises findings from 5 peer-reviewed papers addressing the following research question: Does the multi-turn conversation paradigm in LongNav-R1 improve robustness to partial observability in long-horizon tasks relative to chain-of-thought prompting on ALFRED? 8 claims were extracted from source…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the impact of horizon-adaptive multi-turn RL on the success rate of VLA models compared to single-turn baselines in the ALFRED dataset? 6 claims were extracted from source literature; 6 were independently…
Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: How does replacing cosine similarity with geodesic distance metrics impact the robustness of dense retrievers on the Adversarial NLI benchmark under domain shift conditions? 9 claims were extracted from source…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does the alignment of multimodal Llama-2 models affect their performance on self-invoking code generation tasks in HumanEval Pro and MBPP Pro, as measured by the trade-off between inference. 13 claims were…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How do multimodal Llama-2 extensions perform on HumanEval Pro and MBPP Pro compared to text-only models when evaluated on solution correctness and problem-solving latency in self-invoking code. 10 claims were…
Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: Does the type-aware entity representation in NER Retriever improve retrieval throughput compared to standard DPR baselines on the BEIR benchmark while maintaining accuracy for rare entities? 9 claims were…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the impact of multimodal pre-training on the robustness of Llama-2 models in cross-domain code generation tasks, as measured by accuracy degradation when evaluated on HumanEval Pro and MBPP. 12 claims…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: To what extent does model size (e.g., 7B vs. 13B vs. 70B) impact the efficiency of self-repair in Llama-2 models, evaluated by the trade-off between pass@1 accuracy and inference latency in code. 0 claims were…
Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the addition of multimodal context (e.g., natural language error messages or stack traces) improve the robustness of self-repair in Llama-2 models, measured by accuracy degradation in pass@k. 11 claims…
Abstract: This report synthesises findings from 7 peer-reviewed papers addressing the following research question: How does the cross-domain transferability of self-repair mechanisms in Llama-2 models scale with instruction-tuning data diversity, as measured by pass@k accuracy across different programming. 10 claims were…
Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does the robustness of semi-supervised graph anomaly detection frameworks compare to fully unsupervised methods when evaluated on heterogeneous multi-view graph benchmarks under adversarial. 0 claims were…
Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: Does integrating XSimGCL with large language model-encoded item descriptions improve out-of-domain generalization metrics compared to traditional ID-based embeddings on Steam dataset evaluations? 0 claims were…
Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What is the impact of different metapath sampling strategies (random vs. heuristic-based) on the convergence speed and final accuracy of HGNNs in multi-task learning settings (e.g., node. 10 claims were extracted…
Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the impact of XSimGCL's contrastive loss weighting on inference throughput and precision-recall trade-offs when scaled to large-scale multimodal item datasets? 0 claims were extracted from source…