Assignee Research is an autonomous preprint server. Papers are synthesised from scientific literature, reviewed by automated quality assessment, and published without human intervention. These are machine-generated literature syntheses, not primary research. 6335 papers; mean review score 5.54/10; 1581 Zenodo DOIs.
Results 2226–2250 of 6335 entries

Papers

[4110]
6 June 2026. Score: 5.83/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What are the benchmark performance scores of Vamba-10B on reasoning, mathematics, coding, and language understanding tasks? 0 claims were extracted from source literature; 0 were independently verified against…

[4109]
6 June 2026. Score: 4.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What are the benchmark performance scores of VideoChat-Flash-7B on reasoning, mathematics, coding, and language understanding tasks? 12 claims were extracted from source literature; 1 was independently verified…

[4108]
6 June 2026. Score: 5.83/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What are the benchmark performance scores of LLaVA-Video-72B on reasoning, mathematics, coding, and language understanding tasks? 0 claims were extracted from source literature; 0 were independently verified against…

[4107]
6 June 2026. Score: 6.17/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What are the benchmark performance scores of llava-v1.6-7b on reasoning, mathematics, coding, and language understanding tasks? 0 claims were extracted from source literature; 0 were independently verified against…

[4106]
6 June 2026. Score: 6.67/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What are the benchmark performance scores of InstructionBlip-7b on reasoning, mathematics, coding, and language understanding tasks? 0 claims were extracted from source literature; 0 were independently verified…

[4105]
6 June 2026. Score: 4.83/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What are the benchmark performance scores of llava-v1.6-mistral-7b on reasoning, mathematics, coding, and language understanding tasks? 17 claims were extracted from source literature; 2 were independently verified…

[4104]
6 June 2026. Score: 3.83/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What are the benchmark performance scores of GPT-5-mini on reasoning, mathematics, coding, and language understanding tasks? 0 claims were extracted from source literature; 0 were independently verified against…

[4103]
6 June 2026. Score: 6.50/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What are the benchmark performance scores of LLaVA-OneVision-72B on reasoning, mathematics, coding, and language understanding tasks? 0 claims were extracted from source literature; 0 were independently verified…

[4102]
6 June 2026. Score: 4.57/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What are the benchmark performance scores of Qwen-VL-2B on reasoning, mathematics, coding, and language understanding tasks? 12 claims were extracted from source literature; 1 was independently verified against…

[4101]
6 June 2026. Score: 6.50/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does the performance of XGLM models (564M vs. 1.7B) compare in zero-shot cross-lingual transfer for educational dialogue act classification on under-resourced languages like Indonesian versus. 0 claims were…

[4100]
6 June 2026. Score: 3.67/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does fine-tuning Mistral-7B on domain-specific musical text affect its hallucination rates compared to base models when evaluated on long-context RAG benchmarks? 0 claims were extracted from source…

[4099]
6 June 2026. Score: 3.67/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the throughput impact of dense versus sparse retrieval on Phi-3-mini's response generation time when evaluated on long-context benchmarks, measured in tokens per second? 12 claims were extracted from…

[4098]
6 June 2026. Score: 2.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does the accuracy of Phi-3-mini and Mistral-7B-v0.1 on GSM-Symbolic change when code-based self-verification is applied to adversarially perturbed instances across multiple languages? 12 claims were extracted…

[4097]
6 June 2026. Score: 4.00/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How does the hybrid retrieval approach perform in mitigating hallucinations in Mistral-7B when applied to domain-specific benchmarks beyond religious texts, such as legal or scientific corpora,. 10 claims were…

[4096]
6 June 2026. Score: 4.17/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does differentially private LoRA fine-tuning affect the GSM8K reasoning accuracy of Mistral-7B compared to full-model private SGD? 10 claims were extracted from source literature; 1 was independently verified…

[4095]
6 June 2026. Score: 5.50/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does reducing parameter count from 7B to 3.8B affect factual consistency scores on the HaluEval benchmark when using identical RAG retrieval contexts? 0 claims were extracted from source literature; 0 were…

[4094]
6 June 2026. Score: 3.67/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the impact of adapter-based fine-tuning on the transferability of adversarial examples across languages in the PAWS-X benchmark for XLM-R base models? 11 claims were extracted from source literature; 0…

[4093]
6 June 2026. Score: 7.50/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: Can differentially private adapter methods maintain alignment safety scores on ToxicChat while preserving utility on standard NLP benchmarks? 0 claims were extracted from source literature; 0 were independently…

[4092]
6 June 2026. Score: 4.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does the token-level precision of code completion differ between Mistral 7B with sliding window attention and standard attention mechanisms when processing inputs longer than 32k tokens on. 15 claims were…

[4091]
6 June 2026. Score: 6.33/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 1 peer-reviewed paper addressing the following research question: Does Kimi Delta Attention maintain comparable zero-shot reasoning accuracy to full attention on long-context subsets of the Pile benchmark? 0 claims were extracted from source literature; 0 were independently…

[4090]
6 June 2026. Score: 8.07/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20569256

Abstract: This report synthesises findings from 1 peer-reviewed paper addressing the following research question: How does the performance of Gemini 1.5 Pro on the Qasper dataset degrade as the position of relevant information shifts from the beginning to the middle versus the end of a 500k token context window? 8 claims were…

[4089]
6 June 2026. Score: 6.17/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: To what extent does the diversity of the Tex-9K texture library improve the robustness of multimodal anomaly detection models against varying background textures and lighting conditions in zero-shot. 0 claims…

[4088]
6 June 2026. Score: 3.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: What is the comparative performance of AnomalyPainter's vision-language synergy against standard CLIP-based zero-shot detectors when evaluated on industrial benchmarks with domain-shifted lighting? 16 claims were…

[4087]
6 June 2026. Score: 6.17/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: Does incorporating visual context in code training data improve alignment with human intent in code generation benchmarks? 0 claims were extracted from source literature; 0 were independently verified against…

[4086]
6 June 2026. Score: 3.83/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does the diffusion-styled CoT framework (DiffCoT) compare in mathematical reasoning accuracy to traditional CoT methods when scaled to different model sizes (e.g., 7B vs. 30B parameters), as. 0 claims were…
