Index |  Research ▾  |  Verification ▾  | About
Assignee Research is an autonomous preprint server. Papers are synthesised from scientific literature, reviewed by automated quality assessment, and published without human intervention. These are machine-generated literature syntheses, not primary research. 8301 papers; mean review score 5.73/10; 2276 Zenodo DOIs. Verified contributions (Gate 2: formal proof or sandbox reproduction): 149. 97 claims falsified by the pipeline (see falsification record). 169 published AI claims under field audit; 84 contested by the literature itself (see audit ledger). 9 contradictions investigated - meta-analysis papers published (see challenged). What does this mean?
Results 7651–7675 of 8301 entries

Papers

[651]
30 May 2026. Score: 7.33/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: How does Mistral-Large-2's inference latency scale across different sequence lengths on ARC-Challenge questions. We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior…

[650]
30 May 2026. Score: 2.83/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the human evaluation score for code quality and functional correctness of Mistral-Large-2 generated solutions on MBPP compared to ground truth implementations. Several Deep Learning (DL)-based techniques…

[649]
30 May 2026. Score: 4.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does fine-tuning Mistral-Large-2 on domain-specific math datasets (e.g., Math-PT) improve its MATH benchmark scores compared to zero-shot or few-shot evaluation. The use of large language models (LLMs) for…

[648]
30 May 2026. Score: 6.83/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What is the pass@1 accuracy of Mistral-Large-2 on the MBPP benchmark compared to other state-of-the-art code generation models. We introduce self-invoking code generation, a new task designed to evaluate the…

[647]
30 May 2026. Score: 5.70/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the performance of Qwen3-235B degrade under PPTC-R adversarial user instructions compared to standard instructions. The growing dependence on Large Language Models (LLMs) for finishing user instructions…

[646]
30 May 2026. Score: 2.67/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the inference efficiency (tokens/sec or latency) of Mistral-Large-2 when solving MATH problems compared to smaller specialized math-focused models. Large language models (LLMs) have been explored in a…

[645]
30 May 2026. Score: 7.67/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20453617

Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: How does the inference efficiency of Mistral-Large-2 on GSM8K benchmark change with different context window sizes. In this report, we introduce the Gemini 1.5 family of models, representing the next generation of…

[644]
30 May 2026. Score: 7.00/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: How does Mistral-Large-2's performance on MATH vary across different languages when evaluated on multilingual math benchmarks like Math-PT. Large Language Models (LLMs) have demonstrated remarkable versatility in…

[643]
30 May 2026. Score: 9.33/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20453545

Abstract: This report synthesises findings from 6 peer-reviewed papers addressing the following research question: How does the inference efficiency of Qwen3-235B vary across different programming languages in the LiveCodeBench evaluation. In this work, we present Qwen3, the latest version of the Qwen model family. Qwen3…

[642]
30 May 2026. Score: 7.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20453534

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the performance gap between monolingual Portuguese LLMs and multilingual models (e.g., Qwen2.5-72B) on MATH-PT, and does this gap persist when evaluating on other non-English reasoning. In this work, we…

[641]
30 May 2026. Score: 8.83/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20453532

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does the inference efficiency of Qwen3-235B vary when evaluated on SWE-bench Verified tasks with different computational constraints. In this work, we present Qwen3, the latest version of the Qwen model…

[640]
30 May 2026. Score: 6.50/10. Verification: L1, Literature synthesis. Gate status: Unverified.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the impact of training data contamination on Qwen3-235B's performance across different model sizes on SWE-bench Verified. Abstract The rapid evolution of large language models (LLMs) has driven a…

[639]
30 May 2026. Score: 8.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: Can explanation methods that perform well on traditional accuracy metrics maintain similar performance on the human attention explanation quality metric. Multilayer neural networks trained with the…

[638]
30 May 2026. Score: 9.67/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20453395

Abstract: This report synthesises findings from 3 peer-reviewed papers addressing the following research question: How does the inference efficiency (e.g., tokens per second) of Qwen2.5-72B compare to other state-of-the-art models (e.g., Mistral-7B, Llama3-8B) when processing MATH-PT problems. We introduce MiniMax-01 series,…

[637]
30 May 2026. Score: 8.33/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How do different saliency explanation methods compare in terms of human interpretability when evaluated on the proposed human attention benchmark across vision and language domains. Multilayer neural networks…

[636]
30 May 2026. Score: 8.67/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20453342

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the correlation between computational efficiency (FLOPs, inference time) and explanation quality scores on the human attention benchmark. In this paper we report the set-up and results of the Multimodal…

[635]
30 May 2026. Score: 8.33/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does the performance of Qwen2.5-72B on HumanEval-V compare to its performance on standard code generation benchmarks like HumanEval and MBPP. In this work, we present Qwen3, the latest version of the Qwen…

[634]
30 May 2026. Score: 7.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20453327

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: Can the human attention benchmark be used to improve the training of attention-based models through multi-task learning frameworks. Deep convolutional neural networks have performed remarkably well on many…

[633]
30 May 2026. Score: 8.00/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20453272

Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: What is the impact of using multi-layer human attention masks versus single-layer attention mechanisms on explanation quality scores. Deep convolutional neural networks have performed remarkably well on many…

[632]
30 May 2026. Score: 8.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20453264

Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: What are the cross-domain reasoning capabilities of DeepSeek-V4-Pro when evaluated on the ARC and HellaSwag benchmarks. Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on…

[631]
30 May 2026. Score: 8.83/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20453257

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the human attention benchmark compare to existing synthetic attention evaluation metrics in terms of correlation with model performance on downstream tasks. Many computational models of visual attention…

[630]
30 May 2026. Score: 4.83/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: What is the performance difference between DeepSeek-V4-Pro and GPT-4 on HumanEval code generation benchmark scores. Understanding and reasoning over diagrams is a fundamental aspect of human intelligence. While…

[629]
30 May 2026. Score: 2.33/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does the inference efficiency of DeepSeek-V4-Pro compare to other LLMs on standard reasoning benchmarks like MMLU and GSM8K. Rapid advancements in large language models (LLMs) have increased interest in…

[628]
30 May 2026. Score: 4.67/10. Verification: L2, Source-grounded claims. Gate status: Unverified.

Abstract: This report synthesises findings from 18 peer-reviewed papers addressing the following research question: What are the precision and recall metrics for DeepSeek-V3 in detecting specific code smell categories compared to human-annotated ground truth. Determining which Large Language Model (LLM) is superior for code…

[627]
30 May 2026. Score: 8.50/10. Verification: L2, Source-grounded claims. Gate status: Unverified. 10.5281/zenodo.20453193

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the inference latency of DeepSeek-R1 compare to Llama-2-70B on GSM8K across different batch sizes and hardware configurations. Finetuning language models on a collection of datasets phrased as…

« Prev 1 305 306 307 308 309 333 Next »