Multilingual Encoder Robustness via 47 Million Non-English WebFAQ Pairs Against Domain Shift
Abstract
We present WebFAQ, a large-scale collection of open-domain question answering datasets derived from FAQ-style schema.org annotations. In total, the data collection consists of 96 million natural question-answer (QA) pairs across 75 languages, including 47 million (49%) non-English samples. WebFAQ further serves as the foundation for 20 monolingual retrieval benchmarks with a total size of 11.2 million QA pairs (5.9 million non-English). These datasets are carefully curated through refined filtering and near-duplicate detection, yielding high-quality resources for training and evaluating multil
Research Question
To what extent does the inclusion of 47 million non-English WebFAQ pairs improve the robustness of multilingual encoders against domain shift in low-resource language QA benchmarks?
Verification Level
| Paper level | L2, Source-grounded claims |
| Source-grounded claims | 21 |
| Claim record source | parsed source sections |
Descriptive public verification status only; aggregate claim counts are public, but individual claim records are not exposed here.
Truth-Engine Gate Verdict
| Status | Verified |
| Gate | Gate 2 — Verification (formal proof or sandbox reproduction) |
| Reason | Sealed-sandbox formula repro: Computed 1000.0 matches expected 1000.0 (tolerance=5.0%). |
| Evaluated | 2026-06-11T10:10:01.210204+00:00 |
This record has passed Gate 2: either a Lean4 proof source type-checks, or a sealed-sandbox run reproduced the reported results within the stated tolerance. A reproducible artifact (proof source, or repro script and results) is attached to this record. The VERIFIED status can be set only once such an artifact is attached; it is not derived from review score or claim count.
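As an illustration of the kind of tolerance check the gate reason describes, the following is a minimal sketch, not the gate's actual implementation. It assumes the 5.0% tolerance is relative to the expected value (the record does not say whether it is relative or absolute), and the function name `within_tolerance` is hypothetical.

```python
def within_tolerance(computed: float, expected: float, tolerance: float = 0.05) -> bool:
    """Return True when `computed` is within a relative `tolerance` of `expected`.

    Assumption: tolerance is a fraction of the expected value (0.05 == 5.0%).
    """
    if expected == 0:
        # A relative tolerance is undefined at zero; require an exact match.
        return computed == 0
    return abs(computed - expected) / abs(expected) <= tolerance


# Values taken from the gate reason: computed 1000.0, expected 1000.0.
print(within_tolerance(1000.0, 1000.0))
```

Under this relative reading, a computed value of 1049.0 would still pass against an expected 1000.0, while 1051.0 would not.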
Quality Tier
| Tier | DOI grade |
| Basis | Review score and verified-claim count meet DOI-grade public quality thresholds. |
Descriptive public triage only; this tier does not alter current publication or DOI behavior.
Quality Dimensions
| Evidence strength | MEDIUM |
| Citation grounding | MEDIUM |
| Uncertainty disclosure | MEDIUM |
| Reproducibility status | HIGH |
Automated triage signals derived from public fields; not human peer review or independent validation.
Correction Record
| Status | CURRENT |
| Correction count | 0 |
| Manifest contract | paper-manifest-v1.1 |
| Correction contract | correction-record-v1 |
Public corrections are additive records. Current status does not claim the synthesis is error-free.
Provenance
| Publisher | Assignee Research |
| Public provenance | L4, External archival record |
| Report artifact | Available |
| External record | Registered |
| Claim lineage | 21 aggregate source-grounded claims |
| Review method | Automated multi-reviewer assessment |
| Quality guide | How to read scores, claims, manifests, and evidence links |
| Provenance contract | source-provenance-v1 |
| Note | Machine-generated synthesis of existing literature. Not primary research. |