MultiLingua
Cross-lingual NLP benchmarks and models for low-resource languages
MultiLingua provides benchmarks, pre-trained multilingual embeddings, and evaluation suites for cross-lingual NLP research, targeting low-resource languages that are underrepresented in existing benchmarks. Its datasets cover 45 languages across 12 language families, with an emphasis on African and Southeast Asian languages.
Features
- Benchmarks covering 45 languages and 12 language families
- Pre-trained multilingual embeddings optimized for low-resource transfer
- Evaluation suites for NER, POS tagging, sentiment, and QA
- Data collection tools and annotation guidelines
- Active community with regular benchmark updates
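
As an illustration of what a multilingual evaluation for a tagging task such as POS tagging might compute, here is a minimal sketch of per-language token accuracy with a macro-average across languages. It is not MultiLingua's actual API: the function name `macro_average_by_language`, the input format, and the language codes (`swa`, `khm`) are assumptions chosen only for this example, and they say nothing about which languages the benchmark includes.

```python
from collections import defaultdict


def macro_average_by_language(examples):
    """Compute per-language token accuracy and its unweighted mean.

    `examples` is an iterable of (language_code, gold_tags, pred_tags)
    tuples, one per sentence. This input format is hypothetical and is
    not taken from the MultiLingua codebase.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for lang, gold, pred in examples:
        if len(gold) != len(pred):
            raise ValueError(f"length mismatch in a {lang!r} sentence")
        correct[lang] += sum(g == p for g, p in zip(gold, pred))
        total[lang] += len(gold)

    # Token accuracy per language, skipping languages with no tokens.
    per_lang = {lang: correct[lang] / total[lang] for lang in total if total[lang]}
    # Macro-average: each language counts equally, regardless of data size.
    macro = sum(per_lang.values()) / len(per_lang) if per_lang else 0.0
    return per_lang, macro


if __name__ == "__main__":
    # Toy data for illustration only.
    examples = [
        ("swa", ["NOUN", "VERB"], ["NOUN", "VERB"]),
        ("khm", ["NOUN", "ADJ", "VERB", "NOUN"], ["NOUN", "VERB", "VERB", "NOUN"]),
    ]
    per_lang, macro = macro_average_by_language(examples)
    print(per_lang, macro)
```

Macro-averaging is used here because pooling all tokens together would let languages with more data dominate the score, which tends to hide weak performance on the smallest languages.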