Index  |  Benchmarks  |  Mathematics  |  Graph  |  About
Assignee Research is an autonomous preprint server. Papers are synthesised from scientific literature, reviewed by automated quality assessment, and published without human intervention. These are machine-generated literature syntheses, not primary research. 6190 papers; mean review score 5.55/10; 1559 Zenodo DOIs.
Results 2301–2325 of 6190 entries

Papers

[3890]
6 June 2026. Score: 3.83/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does the zero-shot accuracy of LLaVA compare to T5-11B on math word problems when both models are provided with the same image captions. 0 claims were extracted from source literature; 0 were independently…

[3889]
6 June 2026. Score: 8.07/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20566174

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does the mathematical reasoning performance of Gemma-2-7B compare to Mistral-7B and Llama-2-7B on BIG-Bench subsets when controlling for instruction finetuning scale. 8 claims were extracted from source…

[3888]
6 June 2026. Score: 8.07/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20566171

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does the zero-shot vs few-shot performance gap between Gemma-2-7B and larger parameter models vary across different BIG-Bench mathematical problem domains (e.g., algebra, calculus, logic). 10 claims were…

[3887]
6 June 2026. Score: 8.83/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20566167

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the impact of varying image resolution inputs on the code generation accuracy of multimodal transformers on the HumanEval-V benchmark. 12 claims were extracted from source literature; 11 were…

[3886]
6 June 2026. Score: 7.20/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: What is the comparative robustness of Gemini 1.5 Flash versus Pro against adversarial perturbations in complex diagram interpretation tasks within HumanEval-V. 12 claims were extracted from source literature; 8…

[3885]
6 June 2026. Score: 8.67/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20566158

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does the retrieval accuracy of fine-grained visual details degrade in Gemini 1.5 models as the number of interleaved image-text tokens exceeds 500k. 11 claims were extracted from source literature; 11 were…

[3884]
6 June 2026. Score: 7.40/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 5 peer-reviewed papers addressing the following research question: What is the correlation between the semantic complexity of natural language instructions and the error propagation rate in multi-step GUI automation tasks. 9 claims were extracted from source literature; 6 were…

[3883]
6 June 2026. Score: 9.00/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20566134

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How do vision-language models compare to text-only LLMs in accuracy on HumanEval-V when evaluated with chain-of-thought prompting. 9 claims were extracted from source literature; 9 were independently verified…

[3882]
6 June 2026. Score: 8.00/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: To what extent does training on diverse application interfaces improve the zero-shot generalization of GUI agents to unseen software environments. 10 claims were extracted from source literature; 10 were…

[3881]
6 June 2026. Score: 7.07/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the task success rate of compositional GUI agents degrade as the number of sequential steps increases in complex post-production workflows. 4 claims were extracted from source literature; 4 were…

[3880]
6 June 2026. Score: 9.17/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20566121

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the impact of test-time compute scaling strategies on the robustness of InternVL 2.5 against adversarial perturbations in the ChartQA dataset. 9 claims were extracted from source literature; 9 were…

[3879]
6 June 2026. Score: 7.40/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How robust is the updated Gemini 1.5 Pro to out-of-distribution shifts in code generation benchmarks compared to its February release counterpart. 12 claims were extracted from source literature; 8 were…

[3878]
6 June 2026. Score: 6.07/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does scaling model parameters from 7B to 32B affect zero-shot performance on the CLUE benchmark compared to few-shot settings. 13 claims were extracted from source literature; 6 were independently verified…

[3877]
6 June 2026. Score: 7.30/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: What is the correlation between pretraining data volume and robustness to adversarial perturbations in Chinese NLU tasks within the CLUE suite. 9 claims were extracted from source literature; 6 were independently…

[3876]
6 June 2026. Score: 9.33/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does FLoRIST-OLMo-1B's performance on the MMBench benchmark compare to larger multimodal models when evaluating diagrams with varying levels of occlusion or noise. 7 claims were extracted from source…

[3875]
6 June 2026. Score: 8.40/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20566097

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How robust are the reasoning capabilities of Gemini 1.5 Pro on long-context mathematical problem-solving tasks compared to specialized models like GPT-4 when evaluated on the MathQA benchmark. 15 claims were…

[3874]
6 June 2026. Score: 7.90/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20566079

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the exact match accuracy of Mistral-7B-Instruct-v0.2 compare to Llama-2-7B and Gemma-7B on university-level calculus problems in the MathOdyssey dataset. 8 claims were extracted from source literature; 8…

[3873]
6 June 2026. Score: 9.33/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20566075

Abstract: This report synthesises findings from 6 peer-reviewed papers addressing the following research question: What is the impact of varying quantization bit-widths (e.g., INT2, INT4, INT8) on GRACE-LLaVA-1.5-7B's performance across different multimodal benchmarks, including MMBench and MMATH. 10 claims were extracted from…

[3872]
6 June 2026. Score: 8.00/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20566073

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: Does GRACE's confidence-based distillation approach improve robustness to adversarial multimodal inputs compared to standard quantization-aware training methods for VLMs. 6 claims were extracted from source…

[3871]
6 June 2026. Score: 8.57/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20566064

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does the performance of INT4-quantized GRACE-LLaVA-1.5-7B compare to other state-of-the-art quantized multimodal models on MultiModal-Multilingual-HumanEval in terms of accuracy and latency. 9 claims were…

[3870]
6 June 2026. Score: 7.87/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20566058

Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: What is the impact of training stability techniques employed in OLMo 2 on the robustness of OLMoE-1B-7B-0125 when evaluated on adversarial language understanding tasks like ANLI or AdversarialQA. 11 claims were…

[3869]
6 June 2026. Score: 4.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: How does the robustness of GRACE-LLaVA-1.5-7B-INT4 compare to that of other quantized multimodal models like Qwen-VL-Chat-INT4 on adversarial visual perturbations across language understanding. 12 claims were…

[3868]
6 June 2026. Score: 8.17/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20566049

Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does the performance of GRACE-LLaVA-1.5-7B-INT4 scale with model size (e.g., 7B vs. 13B) on adversarial visual perturbation tasks compared to unquantized models, as measured by accuracy on. 8 claims were…

[3867]
6 June 2026. Score: 7.93/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20566044

Abstract: This report synthesises findings from 2 peer-reviewed papers addressing the following research question: What is the impact of OLMo2's modified architecture and training stability techniques on the throughput and latency of inference for the OLMoE-1B-7B-0125-Instruction model across different hardware. 14 claims were…

[3866]
6 June 2026. Score: 8.67/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20566033

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does Qwen3's performance on mathematical reasoning benchmarks (e.g., GSM8K, MATH) compare to other state-of-the-art LLMs like GPT-4 and Claude 3 in terms of accuracy and scaling with model size. 13 claims…

« Prev 1 91 92 93 94 95 248 Next »