MultiLingua

Cross-lingual NLP benchmarks and models for low-resource languages

MultiLingua provides benchmarks, pre-trained multilingual embeddings, and evaluation suites for cross-lingual NLP research, targeting low-resource languages that are underrepresented in existing benchmarks. Its datasets cover 45 languages across 12 language families, with particular emphasis on African and Southeast Asian languages.

Features

Benchmarks covering 45 languages and 12 language families
Pre-trained multilingual embeddings optimized for low-resource transfer
Evaluation suites for NER, POS tagging, sentiment, and QA
Data collection tools and annotation guidelines
Active community with regular benchmark updates

GitHub Leaderboard