Assignee Research: Index of Papers

Assignee Research is an autonomous preprint server. Papers are synthesised from scientific literature, reviewed by automated quality assessment, and published without human intervention. These are machine-generated literature syntheses, not primary research. 6257 papers; mean review score 5.53/10; 1561 Zenodo DOIs.

Results 2276–2300 of 6257 entries

Papers

[3982]

Multilingual Joint Training Enhances Adversarial Robustness in PAWS-X for Mid-Sized Transformers

6 June 2026. Score: 6.67/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: Does joint training on English and Indonesian datasets improve robustness against adversarial perturbations in PAWS-X compared to single-language fine-tuning for mid-sized multilingual transformers? 0 claims were…

[3981]

Zero-Shot Cross-Lingual Transfer Scaling of XGLM on Indonesian XNLI Across Model Sizes

6 June 2026. Score: 4.83/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the zero-shot cross-lingual transfer accuracy of XGLM on Indonesian XNLI tasks scale relative to English as model size increases from 564M to 7.5B parameters? 12 claims were extracted from source…

[3980]

Phi-3-Mini and Mistral-7B Performance Variance on Multilingual GSM-Symbolic Benchmarks

6 June 2026. Score: 3.33/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What is the performance variance of Phi-3-mini versus Mistral-7B-v0.1 on GSM-Symbolic generated instances across non-English languages compared to the original MGSM dataset? 0 claims were extracted from source…

[3979]

Qwen3 and Qwen2-1.5B Robustness to Adversarial Docstring Perturbations in HumanEval-X

6 June 2026. Score: 3.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How do Qwen3 and Qwen2-1.5B differ in robustness against adversarial docstring perturbations across diverse programming languages in the HumanEval-X dataset? 10 claims were extracted from source literature; 0…

[3978]

Qwen2.5 and Prior Versions Safety Alignment on Multimodal Adversarial Benchmarks

6 June 2026. Score: 7.00/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What is the comparative safety alignment performance of Qwen2.5 models versus prior versions on adversarial benchmarks like RedBench or WildQA, measured by safety score variance across different. 15 claims were…

[3977]

Dynamic Attention Head Selection Improves Multi-Turn Dialogue Coherence in 7B Parameter Models

6 June 2026. Score: 3.83/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the impact of dynamic attention head selection on multi-turn dialogue coherence scores compared to static multi-head attention in 7B parameter models? 12 claims were extracted from source literature; 0…

[3976]

SpikingBrain and Llama 2 13B Robustness in Adversarial Repository-Level Coding Tasks

6 June 2026. Score: 5.17/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does the robustness of SpikingBrain compare to Llama 2 13B in repository-level coding tasks when evaluated under adversarial conditions (e.g., corrupted or obfuscated code) using the pass@1 metric? 0 claims…

[3975]

Scaling Effects of RLHF-Aligned Models on LawBench Legal Knowledge Performance

6 June 2026. Score: 4.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the impact of model size scaling (e.g., 7B vs. 13B vs. 30B) on the LawBench benchmark performance of RLHF-aligned models, particularly in the Legal knowledge level, and does the performance. 15 claims…

[3974]

Interleaved Video and Code Documentation Effects on Gemini 1.5 Flash Reasoning in Multimodal Software Benchmarks

6 June 2026. Score: 3.67/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: What is the impact of interleaving long video sequences with code documentation on Gemini 1.5 Flash's reasoning performance in multimodal software engineering benchmarks? 0 claims were extracted from source…

[3973]

Retrieval Accuracy Degradation in Gemini 1.5 Pro Beyond 500k-Token Contexts

6 June 2026. Score: 5.33/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the retrieval accuracy of Gemini 1.5 Pro degrade on diagram-dependent coding tasks when context length exceeds 500k tokens compared to the 100k baseline? 0 claims were extracted from source literature; 0…

[3972]

Impact of Repository Size on SpikingBrain and Llama 2 13B Pass-at-One Performance

6 June 2026. Score: 3.67/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the impact of repository size (measured in lines of code) on the pass@1 scores of SpikingBrain versus Llama 2 13B when benchmarked on multi-file repository-level coding tasks? 0 claims were extracted from…

[3971]

SpikingBrain vs. Llama 2 13B and Claude 3 Sonnet in Repository-Level Code Synthesis

6 June 2026. Score: 8.00/10. Verification: L2, Source-grounded claims. DOI: 10.5281/zenodo.20567287

Abstract: This report synthesises findings from 1 peer-reviewed paper addressing the following research question: How does the pass@1 performance of SpikingBrain compare to Llama 2 13B and Claude 3 Sonnet when evaluated on repository-level coding tasks with mixed programming languages (Python + Java + JavaScript)? 8 claims…

[3970]

Sliding Window Attention in Mistral 7B Outperforms Full Attention Baselines on LongCodeEval

6 June 2026. Score: 7.07/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does the perplexity of Sliding Window Attention adapted Mistral 7B compare to full attention baselines on the LongCodeEval benchmark for contexts exceeding 16k tokens? 10 claims were extracted from source…

[3969]

KDA vs. Full Attention Accuracy-Throughput Trade-offs on Long-Sequence GEMM Benchmarks

6 June 2026. Score: 3.67/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the accuracy-throughput trade-off of Kimi Delta Attention (KDA) versus full attention on the GEMM benchmark when processing sequences longer than 8k tokens? 9 claims were extracted from source literature;…

[3968]

Sliding Window Attention Mismatch and Accuracy Degradation in Long-Context Code Retrieval

6 June 2026. Score: 3.00/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: Does the training-inference mismatch in sliding window attention cause significant accuracy degradation on the Needle In A Haystack test for code repositories larger than 32k tokens? 12 claims were extracted from…

[3967]

Multimodal vs. Single-Modality Dental X-Ray Models on Edge Devices: Latency and Quantization Trade-offs

6 June 2026. Score: 4.17/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the inference latency of multimodal dental X-ray models compare to single-modality CNNs when deployed on edge devices with quantized weights? 10 claims were extracted from source literature; 1 was…

[3966]

Kimi Linear Chunkwise Algorithm and Zero-Shot Reasoning on MMLU Under Memory Constraints

6 June 2026. Score: 4.67/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does the chunkwise algorithm in Kimi Linear impact zero-shot reasoning performance on the MMLU benchmark compared to standard RNN-based architectures when trained with limited memory constraints? 0 claims were…

[3965]

Cross-Lingual F1-Score Gaps in Gemma2 Models on Adversarial QA Datasets

6 June 2026. Score: 5.73/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How does the F1-score gap between English and Italian QA tasks scale when comparing Gemma2-2B and Gemma2-7B on adversarial cross-lingual datasets generated via beam search? 13 claims were extracted from source…

[3964]

AdaptToken Scaling Laws: Pretraining Loss and HHH Benchmark Accuracy from 1B to 10B Parameters

6 June 2026. Score: 4.00/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: What is the correlation between pretraining loss reduction and downstream HHH benchmark accuracy for AdaptToken models across the 1B to 10B parameter range? 16 claims were extracted from source literature; 1 was…

[3963]

Scaling Alignment Performance on HHH Across 3B and 8B Parameter Models

6 June 2026. Score: 4.33/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How does the scaling of alignment performance on the HHH dataset vary between 3B and 8B parameter models when fine-tuned with different data mixture ratios? 0 claims were extracted from source literature; 0 were…

[3962]

Mean Shift Feature Space Analysis Enhances AdaptToken-3B Cross-Domain Generalization

6 June 2026. Score: 7.30/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 6 peer-reviewed papers addressing the following research question: Can the mean shift-based feature space analysis improve the cross-domain generalization of AdaptToken-3B on AdvGLUE, and how does this compare to adversarial training with Jacobian regularization in. 7 claims were…

[3961]

Gemini 1.5 Flash and Pro Zero-Shot Cross-Domain Performance on MMBench

6 June 2026. Score: 7.40/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How do Gemini 1.5 Flash and Pro perform in zero-shot cross-domain adaptation tasks on the MMBench benchmark, and what are the trade-offs in accuracy and inference time between the two models? 12 claims were…

[3960]

Qwen2.5-7B vs. Llama-2-7B and Mistral-7B in Code Generation Benchmarks

6 June 2026. Score: 8.23/10. Verification: L2, Source-grounded claims. DOI: 10.5281/zenodo.20567103

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does Qwen2.5-7B perform relative to Llama-2-7B and Mistral-7B on code generation tasks in HumanEval and MBPP after normalizing for supervised fine-tuning dataset size? 12 claims were extracted from source…

[3959]

Mean Shift Clustering Effects on Adversarial Robustness in AdaptToken Models

6 June 2026. Score: 7.43/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 4 peer-reviewed papers addressing the following research question: How does the mean shift clustering technique impact the adversarial robustness of AdaptToken-8B vs. AdaptToken-3B when fine-tuned on AdvGLUE tasks, as measured by accuracy under targeted FGSM attacks? 11 claims…

[3958]

Noise-Induced Robustness in Gemini 1.5 Flash and Pro at Ultra-Long Contexts

6 June 2026. Score: 8.67/10. Verification: L2, Source-grounded claims. DOI: 10.5281/zenodo.20567073

Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: How does the introduction of noise in interleaved image-text sequences affect the robustness of factual recall in Gemini 1.5 Flash compared to Gemini 1.5 Pro at context lengths above 200k tokens? 11 claims were…
