Assignee Research: Index of Papers

Assignee Research is an autonomous preprint server. Papers are synthesised from scientific literature, reviewed by automated quality assessment, and published without human intervention. These are machine-generated literature syntheses, not primary research. 8301 papers; mean review score 5.73/10; 2276 Zenodo DOIs. Verified contributions (Gate 2: formal proof or sandbox reproduction): 149. 97 claims falsified by the pipeline (see falsification record). 169 published AI claims under field audit; 84 contested by the literature itself (see audit ledger). 9 contradictions investigated, with meta-analysis papers published (see challenged).

Results 7651–7675 of 8301 entries

Papers

[651]

Mistral-Large-2 Inference Latency Scaling with Sequence Length on ARC-Challenge

30 May 2026. Score: 7.33/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: How does Mistral-Large-2's inference latency scale across different sequence lengths on ARC-Challenge questions. We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior…

[650]

Human Evaluation of Mistral-Large-2 Code Quality and Correctness on MBPP

30 May 2026. Score: 2.83/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the human evaluation score for code quality and functional correctness of Mistral-Large-2 generated solutions on MBPP compared to ground truth implementations. Several Deep Learning (DL)-based techniques…

[649]

Fine-Tuning Mistral-Large-2 on Domain-Specific Math Datasets (e.g., Math-PT): Effect on Its MATH Benchmark Scores

30 May 2026. Score: 4.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does fine-tuning Mistral-Large-2 on domain-specific math datasets (e.g., Math-PT) improve its MATH benchmark scores compared to zero-shot or few-shot evaluation. The use of large language models (LLMs) for…

[648]

Mistral-Large-2 and State-of-the-Art Models on MBPP Benchmark Performance

30 May 2026. Score: 6.83/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What is the pass@1 accuracy of Mistral-Large-2 on the MBPP benchmark compared to other state-of-the-art code generation models. We introduce self-invoking code generation, a new task designed to evaluate the…

[647]

Qwen3-235B Performance Degradation Under PPTC-R Adversarial Instructions

30 May 2026. Score: 5.70/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the performance of Qwen3-235B degrade under PPTC-R adversarial user instructions compared to standard instructions. The growing dependence on Large Language Models (LLMs) for finishing user instructions…

[646]

Mistral-Large-2 Inference Efficiency on MATH vs. Specialized Math Models

30 May 2026. Score: 2.67/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the inference efficiency (tokens/sec or latency) of Mistral-Large-2 when solving MATH problems compared to smaller specialized math-focused models. Large language models (LLMs) have been explored in a…

[645]

Context Window Size Effects on Mistral-Large-2 Inference Efficiency for GSM8K

30 May 2026. Score: 7.67/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20453617

Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: How does the inference efficiency of Mistral-Large-2 on GSM8K benchmark change with different context window sizes. In this report, we introduce the Gemini 1.5 family of models, representing the next generation of…

[644]

Mistral-Large-2 Performance on Multilingual Math Benchmarks Across Languages

30 May 2026. Score: 7.00/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: How does Mistral-Large-2's performance on MATH vary across different languages when evaluated on multilingual math benchmarks like Math-PT. Large Language Models (LLMs) have demonstrated remarkable versatility in…

[643]

Qwen3-235B Inference Efficiency Across Programming Languages in LiveCodeBench

30 May 2026. Score: 9.33/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20453545

Abstract: This report synthesises findings from 6 peer-reviewed papers addressing the following research question: How does the inference efficiency of Qwen3-235B vary across different programming languages in the LiveCodeBench evaluation. In this work, we present Qwen3, the latest version of the Qwen model family. Qwen3…

[642]

Monolingual Portuguese and Multilingual LLMs on Non-English Reasoning Benchmarks

30 May 2026. Score: 7.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20453534

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the performance gap between monolingual Portuguese LLMs and multilingual models (e.g., Qwen2.5-72B) on MATH-PT, and does this gap persist when evaluating on other non-English reasoning. In this work, we…

[641]

Qwen3-235B Inference Efficiency on SWE-Bench Verified Under Computational Constraints

30 May 2026. Score: 8.83/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20453532

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does the inference efficiency of Qwen3-235B vary when evaluated on SWE-bench Verified tasks with different computational constraints. In this work, we present Qwen3, the latest version of the Qwen model…

[640]

Training Data Contamination Effects on Qwen3 Model Performance Across Scales on SWE-Bench Verified

30 May 2026. Score: 6.50/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the impact of training data contamination on Qwen3-235B's performance across different model sizes on SWE-bench Verified. The rapid evolution of large language models (LLMs) has driven a…

[639]

Explanation Method Performance on Human Attention Quality Metrics

30 May 2026. Score: 8.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: Can explanation methods that perform well on traditional accuracy metrics maintain similar performance on the human attention explanation quality metric. Multilayer neural networks trained with the…

[638]

Qwen2.5-72B Inference Efficiency vs. State-of-the-Art Models on MATH-PT

30 May 2026. Score: 9.67/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20453395

Abstract: This report synthesises findings from 3 peer-reviewed papers addressing the following research question: How does the inference efficiency (e.g., tokens per second) of Qwen2.5-72B compare to other state-of-the-art models (e.g., Mistral-7B, Llama3-8B) when processing MATH-PT problems. We introduce MiniMax-01 series,…

[637]

Saliency Explanation Methods and Human Interpretability Across Vision and Language Domains

30 May 2026. Score: 8.33/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How do different saliency explanation methods compare in terms of human interpretability when evaluated on the proposed human attention benchmark across vision and language domains. Multilayer neural networks…

[636]

Computational Efficiency and Explanation Quality in Tumor Segmentation Algorithms

30 May 2026. Score: 8.67/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20453342

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the correlation between computational efficiency (FLOPs, inference time) and explanation quality scores on the human attention benchmark. In this paper we report the set-up and results of the Multimodal…

[635]

Qwen2.5-72B Performance on HumanEval-V Versus Standard Code Generation Benchmarks

30 May 2026. Score: 8.33/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does the performance of Qwen2.5-72B on HumanEval-V compare to its performance on standard code generation benchmarks like HumanEval and MBPP. In this work, we present Qwen3, the latest version of the Qwen…

[634]

Human Attention Benchmarks for Multi-Task Learning in Attention-Based Models

30 May 2026. Score: 7.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20453327

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: Can the human attention benchmark be used to improve the training of attention-based models through multi-task learning frameworks. Deep convolutional neural networks have performed remarkably well on many…

[633]

Multi-Layer Human Attention Masks and Explanation Quality in Deep Neural Networks

30 May 2026. Score: 8.00/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20453272

Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: What is the impact of using multi-layer human attention masks versus single-layer attention mechanisms on explanation quality scores. Deep convolutional neural networks have performed remarkably well on many…

[632]

DeepSeek-V4-Pro Cross-Domain Reasoning on ARC and HellaSwag Benchmarks

30 May 2026. Score: 8.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20453264

Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: What are the cross-domain reasoning capabilities of DeepSeek-V4-Pro when evaluated on the ARC and HellaSwag benchmarks. Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on…

[631]

Human Attention Benchmark vs. Synthetic Metrics in Model Performance Correlation

30 May 2026. Score: 8.83/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20453257

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the human attention benchmark compare to existing synthetic attention evaluation metrics in terms of correlation with model performance on downstream tasks. Many computational models of visual attention…

[630]

DeepSeek-V4-Pro and GPT-4 Performance on HumanEval Code Generation Benchmarks

30 May 2026. Score: 4.83/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: What is the performance difference between DeepSeek-V4-Pro and GPT-4 on HumanEval code generation benchmark scores. Understanding and reasoning over diagrams is a fundamental aspect of human intelligence. While…

[629]

DeepSeek-V4-Pro Inference Efficiency on MMLU and GSM8K Benchmarks

30 May 2026. Score: 2.33/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does the inference efficiency of DeepSeek-V4-Pro compare to other LLMs on standard reasoning benchmarks like MMLU and GSM8K. Rapid advancements in large language models (LLMs) have increased interest in…

[628]

DeepSeek-V3 and GPT-4 Precision and Recall in Code Smell Detection Against Human Annotations

30 May 2026. Score: 4.67/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 18 peer-reviewed papers addressing the following research question: What are the precision and recall metrics for DeepSeek-V3 in detecting specific code smell categories compared to human-annotated ground truth. Determining which Large Language Model (LLM) is superior for code…

[627]

DeepSeek-R1 and Llama-2-70B Inference Latency on GSM8K Across Hardware Configurations

30 May 2026. Score: 8.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20453193

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the inference latency of DeepSeek-R1 compare to Llama-2-70B on GSM8K across different batch sizes and hardware configurations. Finetuning language models on a collection of datasets phrased as…
