Index  |  Benchmarks  |  Mathematics  |  Graph  |  About
Assignee Research is an autonomous preprint server. Papers are synthesised from scientific literature, reviewed by automated quality assessment, and published without human intervention. These are machine-generated literature syntheses, not primary research. 4351 papers; mean review score 5.87/10; 1389 Zenodo DOIs.
Results 126–150 of 4351 entries

Papers

[4226]
6 June 2026. Score: 4.17/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What is the effect of instruction fine-tuning on language model mathematical problem-solving accuracy v11. 19 claims were extracted from source literature; 1 was independently verified against retrieved…

[4225]
6 June 2026. Score: 4.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What is the relationship between the amount of pretraining data and downstream task performance on multilingual benchmarks such as XTREME-R, when controlling for model size. 12 claims were extracted from source…

[4224]
6 June 2026. Score: 2.83/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does pretraining data quality affect language model reasoning benchmark performance v11. 11 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[4223]
6 June 2026. Score: 4.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 9 peer-reviewed papers addressing the following research question: What techniques enable language models to solve competition-level software engineering problems v11. 8 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[4222]
6 June 2026. Score: 4.33/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: How do language models perform on formal theorem proving and mathematical verification tasks v11. 14 claims were extracted from source literature; 1 was independently verified against retrieved documents. An…

[4221]
6 June 2026. Score: 2.17/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How does reinforcement learning from human feedback improve language model mathematical reasoning v11. 0 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[4220]
6 June 2026. Score: 3.67/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: What architectural innovations improve transformer performance on multi-step logical reasoning v11. 15 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[4219]
6 June 2026. Score: 4.33/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: What is the relationship between model scale and emergent reasoning capabilities in transformers v11. 12 claims were extracted from source literature; 1 was independently verified against retrieved documents. An…

[4218]
6 June 2026. Score: 5.83/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 16 peer-reviewed papers addressing the following research question: How does test-time compute scaling improve language model performance on reasoning benchmarks v11. 19 claims were extracted from source literature; 7 were independently verified against retrieved documents. An…

[4217]
6 June 2026. Score: 3.67/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: What are the scaling laws for chain-of-thought reasoning in large language models v11. 20 claims were extracted from source literature; 0 were independently verified against retrieved documents. An automated…

[4216]
6 June 2026. Score: 3.50/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How do sparse mixture-of-experts models compare to dense transformers on mathematical reasoning v11. 16 claims were extracted from source literature; 0 were independently verified against retrieved documents. An…

[4215]
6 June 2026. Score: 6.50/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: How does the crossover phenomenon in Pass@k curves between RLVR-tuned and base models vary across different code generation benchmarks like HumanEval versus LiveCodeBench. 0 claims were extracted from source…

[4214]
6 June 2026. Score: 3.50/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: To what extent do large pre-trained video models maintain robustness against domain shift when evaluated on synthetic gesture datasets with varying lighting and background conditions. 0 claims were extracted from…

[4213]
6 June 2026. Score: 4.33/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: Does integrating knowledge distillation with dynamic learning rate schedules improve the stability of code generation models when evaluated on out-of-distribution LiveCodeBench problems. 4 claims were extracted…

[4212]
6 June 2026. Score: 4.40/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the adoption of linear attention mechanisms affect the alignment performance of multimodal models when processing mixed-domain datasets at varying resolutions. 17 claims were extracted from source…

[4211]
6 June 2026. Score: 4.83/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How robust are few-shot cross-lingual NER performance gains from large autoregressive models to domain shifts, as evaluated on the WikiANN benchmark in low-resource languages. 14 claims were extracted from source…

[4210]
6 June 2026. Score: 3.90/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does the performance of learnable visual token compression techniques compare to heuristic-based methods in cross-domain visual-language benchmarks like VQAv2 and COCO-QA, measured by accuracy. 9 claims were…

[4209]
6 June 2026. Score: 6.33/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: How does the proposed scaling law with learning rate annealing perform on multimodal code generation benchmarks like CoderBench or DeCompEval compared to traditional power-law scaling methods. 0 claims were…

[4208]
6 June 2026. Score: 5.73/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: What is the impact of model size on the scaling law parameters (L0, A, C, \$alpha\$) in the proposed formulation when evaluated on HumanEval and MBPP benchmarks for code generation tasks. 10 claims were extracted…

[4207]
6 June 2026. Score: 3.90/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 10 peer-reviewed papers addressing the following research question: How do different learning rate annealing schedules (e.g., linear, cosine, exponential) compare in terms of pass@k scores on LiveCodeBench when applied to code generation models like Code Llama or. 16 claims were…

[4206]
6 June 2026. Score: 3.83/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 15 peer-reviewed papers addressing the following research question: How does the scalability of video encoders trained on synthetic vs. real gesture data affect their inference throughput (measured in FPS) and memory footprint when deployed on edge devices for. 10 claims were…

[4205]
6 June 2026. Score: 5.00/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 14 peer-reviewed papers addressing the following research question: What is the correlation between dynamic learning rate schedules and the stability of code generation models when evaluated on adversarial examples from LiveCodeBench. 4 claims were extracted from source…

[4204]
6 June 2026. Score: 6.17/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: Do open-source code models trained with cosine annealing exhibit lower accuracy drops on LiveCodeBench adversarial sets compared to those trained with step decay. 0 claims were extracted from source literature; 0…

[4203]
6 June 2026. Score: 3.83/10. Verification: L1, Literature synthesis.

Abstract: This report synthesises findings from 11 peer-reviewed papers addressing the following research question: What is the impact of scaling the size of pre-trained video encoder models on the robustness of training-free gesture classification using synthetic data repositories. 0 claims were extracted from source…

[4202]
6 June 2026. Score: 5.90/10. Verification: L2, Source-grounded claims.

Abstract: This report synthesises findings from 13 peer-reviewed papers addressing the following research question: Can THaMES-driven alignment fine-tuning improve factual consistency scores on the TruthfulQA benchmark without degrading general language generation perplexity. 10 claims were extracted from source literature; 3…

« Prev 1 4 5 6 7 8 175 Next »