Welcome to the “MMLoSo Language Challenge 2025”! India is a land of unmatched linguistic diversity. Many of its languages are tribal languages with a sizable number of speakers, but they are often poorly documented and lack the massive annotated corpora that power today’s NLP breakthroughs. This limited digital presence means native speakers face barriers in healthcare messaging, disaster alerts, e-governance, and educational resources, all of which increasingly rely on text mining and machine translation.
By building open-source systems for LRL ⇆ HRL translation, this competition channels deep-learning skills toward tangible social impact: making vital information accessible in underserved languages and amplifying the voices of native speakers online. In this competition you will translate between high-resource languages (HRL) and our focus low-resource languages (LRL), i.e. Bhili – Hindi, Gondi – Hindi, Mundari – Hindi, and Santali – English.
By tackling all tracks together, you will help push the frontier of multilingual NLP while building end‐to‐end pipelines that work in data‐sparse settings.
File | Purpose | Key Columns |
---|---|---|
bhili-train.csv | Data for Bhili – Hindi translation training | `row_id`, `hindi`, `bhili` |
gondari-train.csv | Data for Gondi – Hindi translation training | `row_id`, `hindi`, `gondi` |
mundari-train.csv | Data for Mundari – Hindi translation training | `row_id`, `hindi`, `mundari` |
santali-train.csv | Data for Santali – English translation training | `row_id`, `english`, `santali` |
test.csv | Unlabeled source sentences to translate (released later) | `row_id`, `source_sentence`, `source_lang`, `target_lang` |
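As a minimal sketch of how a training file could be read into sentence pairs, the snippet below uses Python's standard `csv` module. The helper name `load_pairs` is hypothetical, the sample rows are placeholders rather than real data, and it assumes the CSV headers match the column names in the table exactly (no surrounding spaces).

```python
import csv
import io

def load_pairs(handle, hrl_col, lrl_col):
    """Read a training CSV and return (row_id, hrl_sentence, lrl_sentence) tuples."""
    reader = csv.DictReader(handle)
    pairs = []
    for row in reader:
        hrl = row[hrl_col].strip()
        lrl = row[lrl_col].strip()
        if hrl and lrl:  # skip rows where either side is empty
            pairs.append((int(row["row_id"]), hrl, lrl))
    return pairs

# Tiny in-memory stand-in for bhili-train.csv; placeholder text, not real data.
sample = io.StringIO(
    "row_id,hindi,bhili\n"
    "1,hindi sentence one,bhili sentence one\n"
    "2,hindi sentence two,\n"
)
pairs = load_pairs(sample, "hindi", "bhili")
print(pairs)
```

For a real file, the same helper could be called with `open("bhili-train.csv", newline="", encoding="utf-8")` in place of the in-memory sample.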
All texts are drawn from a private, permissively licensed source, cleaned and curated for research.
Courtesy: Ministry of Tribal Affairs, Government of India.
Task | What you submit | Where | Metric |
---|---|---|---|
Machine Translation | submission.csv with columns: `row_id`, `source_lang`, `source_sentence`, `target_lang`, `target_sentence` | On Kaggle | BLEU & chrF (tokenized, case-insensitive) |
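The submission layout above could be produced along the following lines. This is a sketch under assumptions: `write_submission` and `dummy_translate` are hypothetical names, the language codes `hindi` and `bhili` are placeholders (the actual values used in test.csv are not stated here), and the stand-in "model" simply echoes its input.

```python
import csv
import io

# Column order taken from the submission table above.
SUBMISSION_COLUMNS = ["row_id", "source_lang", "source_sentence", "target_lang", "target_sentence"]

def write_submission(test_rows, translate, out_handle):
    """Write one submission row per test row, filling target_sentence via translate()."""
    writer = csv.DictWriter(out_handle, fieldnames=SUBMISSION_COLUMNS)
    writer.writeheader()
    for row in test_rows:
        # Copy the four columns that come straight from test.csv.
        out = {col: row[col] for col in SUBMISSION_COLUMNS if col != "target_sentence"}
        out["target_sentence"] = translate(row["source_sentence"], row["source_lang"], row["target_lang"])
        writer.writerow(out)

def dummy_translate(sentence, source_lang, target_lang):
    # Hypothetical stand-in for a real model: it just echoes the input.
    return sentence

# Placeholder test row; language codes are assumed, not taken from test.csv.
test_rows = [
    {"row_id": "1", "source_sentence": "placeholder sentence", "source_lang": "hindi", "target_lang": "bhili"},
]
buffer = io.StringIO()
write_submission(test_rows, dummy_translate, buffer)
submission_text = buffer.getvalue()
```

To write an actual file, pass `open("submission.csv", "w", newline="", encoding="utf-8")` instead of the in-memory buffer.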
Leaderboard ranks teams by a weighted composite score:
```text
Final Score = 0.6 × BLEU + 0.4 × chrF
```
Why 0.6 / 0.4? Translation quality is harder to push in low-resource settings; the higher weight on BLEU reflects its research importance.
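For local validation, the composite can be computed with a one-line helper. The function name `final_score` is hypothetical, and the sketch assumes BLEU and chrF are reported on the same scale (for example both 0–100); the scale used by the leaderboard is not specified here.

```python
def final_score(bleu, chrf, bleu_weight=0.6, chrf_weight=0.4):
    """Weighted composite of BLEU and chrF, using the weights stated above."""
    return bleu_weight * bleu + chrf_weight * chrf

# Illustrative inputs only, not real results.
score = final_score(20.0, 40.0)
print(round(score, 2))
```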
See the Rules tab on the competition page for full details.
Happy modeling – and thank you for advancing NLP for underrepresented languages!
bhili-train.csv
Sentences for supervised Bhili – Hindi translation.
Column | Type | Description |
---|---|---|
`row_id` | int | Unique row identifier |
`hindi` | str | Sentence in the high-resource language (Hindi) |
`bhili` | str | Gold translation in the low-resource language |
mundari-train.csv
Sentences for supervised Mundari – Hindi translation.
Column | Type | Description |
---|---|---|
`row_id` | int | Unique row identifier |
`hindi` | str | Sentence in the high-resource language (Hindi) |
`mundari` | str | Gold translation in the low-resource language |
gondari-train.csv
Sentences for supervised Gondi – Hindi translation.
Column | Type | Description |
---|---|---|
`row_id` | int | Unique row identifier |
`hindi` | str | Sentence in the high-resource language (Hindi) |
`gondi` | str | Gold translation in the low-resource language |
santali-train.csv
Sentences for supervised Santali – English translation.
Column | Type | Description |
---|---|---|
`row_id` | int | Unique row identifier |
`english` | str | Sentence in the high-resource language (English) |
`santali` | str | Gold translation in the low-resource language |
test.csv
Unlabeled source sentences. Participants must predict the `target_sentence` column.
The machine translation parallel corpora are distilled from publicly released web crawls and Wikipedia dumps, post-processed using the NGO-Aligned filtering toolkit.
All text is redistributed under Creative Commons BY-SA 4.0. Use outside this competition must cite the original sources.
If you publish work using this dataset, please cite:
```bibtex
@misc{lrlchallenge2025,
  title = {Multimodal Models for Low-Resource Contexts and Social Impact 2025},
  year = {2025},
  howpublished = {Kaggle Competition},
  url = {https://www.kaggle.com/competitions/mmloso2025}
}
```