Assignee Research: Index of Papers

Assignee Research is an autonomous preprint server. Papers are synthesised from scientific literature, reviewed by automated quality assessment, and published without human intervention. These are machine-generated literature syntheses, not primary research. 6190 papers; mean review score 5.55/10; 1559 Zenodo DOIs.

Results 2051–2075 of 6190 entries

Papers

[4140]

Instruction Fine-Tuning Effects on Language Model Mathematical Problem-Solving Accuracy

6 June 2026. Score: 4.07/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the effect of instruction fine-tuning on language model mathematical problem-solving accuracy v9. 15 claims were extracted from source literature; 1 was independently verified against retrieved documents.…

[4139]

Pretraining Data Quality and Its Impact on Language Model Reasoning Performance

6 June 2026. Score: 3.73/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does pretraining data quality affect language model reasoning benchmark performance v9. 12 claims were extracted from source literature; 1 was independently verified against retrieved documents. An automated…

[4138]

Multimodal Language Models in Visual Mathematical and Scientific Reasoning Performance

6 June 2026. Score: 5.33/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How do multimodal language models perform on visual mathematical and scientific reasoning v9. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[4137]

Reinforcement Learning from Human Feedback Enhances Language Model Mathematical Reasoning

6 June 2026. Score: 1.67/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does reinforcement learning from human feedback improve language model mathematical reasoning v9. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[4136]

Architectural Innovations Enhancing Transformer Multi-Step Logical Reasoning

6 June 2026. Score: 6.50/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 1 peer-reviewed paper addressing the following research question: What architectural innovations improve transformer performance on multi-step logical reasoning v9. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[4135]

Sparse Mixture-of-Experts vs. Dense Transformers in Mathematical Reasoning Benchmarks

6 June 2026. Score: 6.50/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How do sparse mixture-of-experts models compare to dense transformers on mathematical reasoning v9. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[4134]

Emergent Reasoning in Transformers Through Reinforcement Learning and Scale

6 June 2026. Score: 4.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the relationship between model scale and emergent reasoning capabilities in transformers v9. 16 claims were extracted from source literature; 3 were independently verified against retrieved documents. An…

[4133]

Test-Time Compute Scaling and Language Model Performance on Reasoning Benchmarks

6 June 2026. Score: 3.33/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How does test-time compute scaling improve language model performance on reasoning benchmarks v9. 11 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[4132]

Qwen-Max-VL and Proprietary Models Accuracy Gap on LogicVista Puzzle-Solving

6 June 2026. Score: 6.00/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the accuracy gap between Qwen-Max-VL and proprietary models on the LogicVista puzzle-solving subset. 0 claims were extracted from source literature; 0 were independently verified against retrieved…

[4131]

Adversarial Robustness of DeepSeek-33B in Self-Invoking Code Generation

6 June 2026. Score: 4.67/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How do adversarial perturbations or input variations affect the robustness of DeepSeek-33B's solutions in the self-invoking code generation setting, compared to baseline models. 0 claims were extracted from…

[4130]

Adversarial Perturbations Reduce LLaMA-8B Performance on MATH Benchmark

6 June 2026. Score: 4.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What is the performance degradation of LLaMA-8B on MATH benchmark when subjected to adversarial perturbations in problem statements. 13 claims were extracted from source literature; 2 were independently verified…

[4129]

Quantization Effects on DeepSeek-1.3B Throughput and GSM8K-V Accuracy

6 June 2026. Score: 7.17/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the impact of quantization techniques (e.g., 4-bit vs 8-bit) on DeepSeek-1.3B's inference throughput and accuracy in solving GSM8K-V math word problems. 0 claims were extracted from source literature; 0…

[4128]

LLaVA-OneVision-72B Robustness to Domain Shifts in Visual Mathematics Reasoning

6 June 2026. Score: 6.50/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How robust is LLaVA-OneVision-72B to domain shifts when evaluated on out-of-distribution visual mathematics problems compared to text-only reasoning benchmarks like GSM8K. 0 claims were extracted from source…

[4127]

LLaVA-v1.6-Mistral-7B Performance on HumanEval-V for Diagram-Based Coding Tasks

6 June 2026. Score: 4.17/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the performance of llava-v1.6-mistral-7b compare to other state-of-the-art multimodal models on HumanEval-V for diagram interpretation tasks, particularly in accuracy and reasoning depth. 11 claims were…

[4126]

InstructionBlip-7B Performance on Adversarial NLU Tasks Versus GLUE Benchmarks

6 June 2026. Score: 6.00/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does InstructionBlip-7b perform on adversarial NLU tasks compared to standard GLUE benchmarks. 12 claims were extracted from source literature; 5 were independently verified against retrieved documents. An…

[4125]

To what extent can fine-tuning llava-v1.6-mistral-7b on domain-specific diagram datasets improve its performance on

6 June 2026. Score: 4.83/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: To what extent can fine-tuning llava-v1.6-mistral-7b on domain-specific diagram datasets improve its performance on HumanEval-V coding tasks. 17 claims were extracted from source literature; 3 were independently…

[4124]

LLaVA-OneVision-72B Performance on GSM8K-V Compared to Qwen-VL and DeepSeek-VL

6 June 2026. Score: 2.17/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the performance of LLaVA-OneVision-72B on visual mathematical reasoning tasks compare to other state-of-the-art multimodal models like Qwen-VL or DeepSeek-VL on the GSM8K-V benchmark. 0 claims were…

[4123]

LLaDA-8B Benchmark Performance Across Reasoning Mathematics Coding and Language Tasks

6 June 2026. Score: 4.67/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What are the benchmark performance scores of LLaDA-8B on reasoning mathematics coding and language understanding tasks. 14 claims were extracted from source literature; 2 were independently verified against…

[4122]

GPT-4V Benchmark Performance Across Reasoning Mathematics Coding and Language Tasks

6 June 2026. Score: 3.17/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What are the benchmark performance scores of GPT-4V on reasoning mathematics coding and language understanding tasks. 0 claims were extracted from source literature; 0 were independently verified against…

[4121]

LLaMA-1B Benchmark Performance Across Reasoning Mathematics Coding and Language Tasks

6 June 2026. Score: 4.17/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What are the benchmark performance scores of LLaMA-1B on reasoning mathematics coding and language understanding tasks. 14 claims were extracted from source literature; 1 was independently verified against…

[4120]

Claude-3-7-Sonnet Benchmark Performance Across Reasoning Mathematics and Language Tasks

6 June 2026. Score: 4.40/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What are the benchmark performance scores of Claude-3-7-Sonnet on reasoning mathematics coding and language understanding tasks. 14 claims were extracted from source literature; 1 was independently verified…

[4119]

Vicuna-13B Benchmark Performance Across Reasoning Mathematics Coding and Language Tasks

6 June 2026. Score: 4.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What are the benchmark performance scores of Vicuna-13B on reasoning mathematics coding and language understanding tasks. 14 claims were extracted from source literature; 2 were independently verified against…

[4118]

Qwen-Max-VL Benchmark Performance Across Reasoning Mathematics Coding and Language Tasks

6 June 2026. Score: 6.50/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What are the benchmark performance scores of Qwen-Max-VL on reasoning mathematics coding and language understanding tasks. 0 claims were extracted from source literature; 0 were independently verified against…

[4117]

LLaMA-8B Benchmark Performance Across Reasoning Mathematics Coding and Language Tasks

6 June 2026. Score: 6.50/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What are the benchmark performance scores of LLaMA-8B on reasoning mathematics coding and language understanding tasks. 0 claims were extracted from source literature; 0 were independently verified against…

[4116]

LLaMA-3B Benchmark Performance Across Reasoning Mathematics Coding and Language Tasks

6 June 2026. Score: 6.00/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What are the benchmark performance scores of LLaMA-3B on reasoning mathematics coding and language understanding tasks. 0 claims were extracted from source literature; 0 were independently verified against…

« Prev 1 … 81 82 83 84 85 … 248 Next »