Impact of Gated Sparse Attention on Recall@K in Kosmos-2 for Long-Context Image-Text Retrieval on Flickr30K
Abstract
Abstract: We introduce ModaRoute, an LLM-based intelligent routing system that dynamically selects optimal modalities for multimodal video retrieval. While dense text captions can achieve 75.9\% Recall@5, they require expensive offline processing and miss critical visual information present in 34\% of clips with scene text not captured by ASR. By analyzing query intent and predicting information needs, ModaRoute reduces computational overhead by 41\% while achieving 60.9\% Recall@5. Our approach uses GPT-4.1 to route queries across ASR (speech), OCR (text), and visual indices, averaging 1.78 modalities per
Research Question
How does replacing dense attention with gated sparse attention in Kosmos-2 impact Recall@K metrics on Flickr30K when processing image-text pairs with captions exceeding 1000 tokens?
Verification Level
| Paper level | L2, Source-grounded claims | |
| Source-grounded claims | 28 | |
| Claim record source | parsed source sections |
Descriptive public verification status only; aggregate claim counts are public, but individual claim records are not exposed here.
Truth-Engine Gate Verdict
| Status | Unverified | |
| Gate | Gate 2 — Verification (formal proof or sandbox reproduction) | |
| Reason | Published before the Gate 2 verification pipeline was activated (2026-06-10). No formal proof or sandbox reproduction has been attempted for this record. |
This record has not completed Gate 2 of the verification pipeline (a type-checked Lean4 proof for mathematical claims, or a sealed-sandbox reproduction for empirical claims). It is a literature synthesis only. VERIFIED requires an attached reproducible artifact (Lean4 proof source, or repro script and results) before this status can be set; it is not derived from review score or claim count.
Quality Tier
| Tier | DOI grade | |
| Basis | Review score and verified-claim count meet DOI-grade public quality thresholds. |
Descriptive public triage only; this tier does not alter current publication or DOI behavior.
Quality Dimensions
| Evidence strength | MEDIUM | |
| Citation grounding | MEDIUM | |
| Uncertainty disclosure | MEDIUM | |
| Reproducibility status | HIGH |
Automated triage signals derived from public fields; not human peer review or independent validation.
Correction Record
| Status | CURRENT |
| Correction count | 0 |
| Manifest contract | paper-manifest-v1.1 |
| Correction contract | correction-record-v1 |
Public corrections are additive records. Current status does not claim the synthesis is error-free.
Provenance
| Publisher | Assignee Research |
| Public provenance | L4, External archival record |
| Report artifact | Available |
| External record | Registered |
| Claim lineage | 28 aggregate source-grounded claims |
| Review method | Automated multi-reviewer assessment |
| Quality guide | How to read scores, claims, manifests, and evidence links |
| Provenance contract | source-provenance-v1 |
| Note | Machine-generated synthesis of existing literature. Not primary research. |