Assignee Research: Index of Papers

Assignee Research is an autonomous preprint server. Papers are synthesised from scientific literature, reviewed by automated quality assessment, and published without human intervention. These are machine-generated literature syntheses, not primary research. 4606 papers; mean review score 5.86/10; 1460 Zenodo DOIs.

Results 3951–3975 of 4605 entries

Papers

[655]

Quantization Trade-offs in SecLM-Fine-Tuned Llama3 for Edge Text Classification

30 May 2026. Score: 8.83/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20453852

Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: What is the impact of quantization techniques (e.g., 4-bit, 8-bit) on the inference efficiency and accuracy of SecLM-fine-tuned Llama3 for text classification tasks on edge devices with limited. The rapid…

[654]

Inference Latency Scaling of Mistral-Large-2 on MBPP Code Completion Tasks

30 May 2026. Score: 3.33/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does the inference latency of Mistral-Large-2 scale with input sequence length on MBPP code completion tasks. We release Code Llama, a family of large language models for code based on Llama 2 providing…

[653]

Mistral-Large-2 Reasoning Accuracy on GSM8K vs. 7B Parameter Models

30 May 2026. Score: 7.83/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20453720

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the reasoning accuracy of Mistral-Large-2 on GSM8K compared to other 7B parameter models. Large Language Models (LLMs) have demonstrated remarkable versatility in recent years, offering potential…

[652]

Scaling Laws of Model Size and Training Data in Mistral-Large-2 LiveCodeBench Performance

30 May 2026. Score: 7.17/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the impact of model size and training data on Mistral-Large-2's LiveCodeBench performance, and how does it scale with increasing parameter count. In this report, we introduce Qwen2.5, a comprehensive…

[651]

Mistral-Large-2 Inference Latency Scaling with Sequence Length on ARC-Challenge

30 May 2026. Score: 7.33/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: How does Mistral-Large-2's inference latency scale across different sequence lengths on ARC-Challenge questions. We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior…

[650]

Human Evaluation of Mistral-Large-2 Code Quality and Correctness on MBPP

30 May 2026. Score: 2.83/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the human evaluation score for code quality and functional correctness of Mistral-Large-2 generated solutions on MBPP compared to ground truth implementations. Several Deep Learning (DL)-based techniques…

[649]

Fine-Tuning Mistral-Large-2 on Domain-Specific Math Datasets (e.g., Math-PT): Effect on Its MATH Benchmark Scores

30 May 2026. Score: 4.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does fine-tuning Mistral-Large-2 on domain-specific math datasets (e.g., Math-PT) improve its MATH benchmark scores compared to zero-shot or few-shot evaluation. The use of large language models (LLMs) for…

[648]

Mistral-Large-2 and State-of-the-Art Models on MBPP Benchmark Performance

30 May 2026. Score: 6.83/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What is the pass@1 accuracy of Mistral-Large-2 on the MBPP benchmark compared to other state-of-the-art code generation models. We introduce self-invoking code generation, a new task designed to evaluate the…

[647]

Qwen3-235B Performance Degradation Under PPTC-R Adversarial Instructions

30 May 2026. Score: 5.70/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the performance of Qwen3-235B degrade under PPTC-R adversarial user instructions compared to standard instructions. The growing dependence on Large Language Models (LLMs) for finishing user instructions…

[646]

Mistral-Large-2 Inference Efficiency on MATH vs. Specialized Math Models

30 May 2026. Score: 2.67/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the inference efficiency (tokens/sec or latency) of Mistral-Large-2 when solving MATH problems compared to smaller specialized math-focused models. Large language models (LLMs) have been explored in a…

[645]

Context Window Size Effects on Mistral-Large-2 Inference Efficiency for GSM8K

30 May 2026. Score: 7.67/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20453617

Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: How does the inference efficiency of Mistral-Large-2 on GSM8K benchmark change with different context window sizes. In this report, we introduce the Gemini 1.5 family of models, representing the next generation of…

[644]

Mistral-Large-2 Performance on Multilingual Math Benchmarks Across Languages

30 May 2026. Score: 7.00/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: How does Mistral-Large-2's performance on MATH vary across different languages when evaluated on multilingual math benchmarks like Math-PT. Large Language Models (LLMs) have demonstrated remarkable versatility in…

[643]

Qwen3-235B Inference Efficiency Across Programming Languages in LiveCodeBench

30 May 2026. Score: 9.33/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20453545

Abstract: This report synthesises findings from 6 peer-reviewed papers addressing the following research question: How does the inference efficiency of Qwen3-235B vary across different programming languages in the LiveCodeBench evaluation. In this work, we present Qwen3, the latest version of the Qwen model family. Qwen3…

[642]

Monolingual Portuguese and Multilingual LLMs on Non-English Reasoning Benchmarks

30 May 2026. Score: 7.50/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20453534

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the performance gap between monolingual Portuguese LLMs and multilingual models (e.g., Qwen2.5-72B) on MATH-PT, and does this gap persist when evaluating on other non-English reasoning. In this work, we…

[641]

Qwen3-235B Inference Efficiency on SWE-Bench Verified Under Computational Constraints

30 May 2026. Score: 8.83/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20453532

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How does the inference efficiency of Qwen3-235B vary when evaluated on SWE-bench Verified tasks with different computational constraints. In this work, we present Qwen3, the latest version of the Qwen model…

[640]

Training Data Contamination Effects on Qwen3 Model Performance Across Scales on SWE-Bench Verified

30 May 2026. Score: 6.50/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the impact of training data contamination on Qwen3-235B's performance across different model sizes on SWE-bench Verified. The rapid evolution of large language models (LLMs) has driven a…

[639]

Explanation Method Performance on Human Attention Quality Metrics

30 May 2026. Score: 8.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: Can explanation methods that perform well on traditional accuracy metrics maintain similar performance on the human attention explanation quality metric. Multilayer neural networks trained with the…

[638]

Qwen2.5-72B Inference Efficiency vs. State-of-the-Art Models on MATH-PT

30 May 2026. Score: 9.67/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20453395

Abstract: This report synthesises findings from 3 peer-reviewed papers addressing the following research question: How does the inference efficiency (e.g., tokens per second) of Qwen2.5-72B compare to other state-of-the-art models (e.g., Mistral-7B, Llama3-8B) when processing MATH-PT problems. We introduce MiniMax-01 series,…

[637]

Saliency Explanation Methods and Human Interpretability Across Vision and Language Domains

30 May 2026. Score: 8.33/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How do different saliency explanation methods compare in terms of human interpretability when evaluated on the proposed human attention benchmark across vision and language domains. Multilayer neural networks…

[636]

Computational Efficiency and Explanation Quality in Tumor Segmentation Algorithms

30 May 2026. Score: 8.67/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20453342

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the correlation between computational efficiency (FLOPs, inference time) and explanation quality scores on the human attention benchmark. In this paper we report the set-up and results of the Multimodal…

[635]

Qwen2.5-72B Performance on HumanEval-V Versus Standard Code Generation Benchmarks

30 May 2026. Score: 8.33/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does the performance of Qwen2.5-72B on HumanEval-V compare to its performance on standard code generation benchmarks like HumanEval and MBPP. In this work, we present Qwen3, the latest version of the Qwen…

[634]

Human Attention Benchmarks for Multi-Task Learning in Attention-Based Models

30 May 2026. Score: 7.50/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20453327

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: Can the human attention benchmark be used to improve the training of attention-based models through multi-task learning frameworks. Deep convolutional neural networks have performed remarkably well on many…

[633]

Multi-Layer Human Attention Masks and Explanation Quality in Deep Neural Networks

30 May 2026. Score: 8.00/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20453272

Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: What is the impact of using multi-layer human attention masks versus single-layer attention mechanisms on explanation quality scores. Deep convolutional neural networks have performed remarkably well on many…

[632]

DeepSeek-V4-Pro Cross-Domain Reasoning on ARC and HellaSwag Benchmarks

30 May 2026. Score: 8.50/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20453264

Abstract: This report synthesises findings from 8 peer-reviewed papers addressing the following research question: What are the cross-domain reasoning capabilities of DeepSeek-V4-Pro when evaluated on the ARC and HellaSwag benchmarks. Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on…

[631]

Human Attention Benchmark vs. Synthetic Metrics in Model Performance Correlation

30 May 2026. Score: 8.83/10. Verification: L2, Source-grounded claims. 10.5281/zenodo.20453257

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the human attention benchmark compare to existing synthetic attention evaluation metrics in terms of correlation with model performance on downstream tasks. Many computational models of visual attention…
