- Samanantar Parallel Corpus. Parallel corpora containing 46 million sentence pairs between English and all major Indian languages.
- AI4Bharat-IndicNLPSuite. Text corpora (billions of words), a multilingual language model (IndicBERT), word embeddings, and text classification datasets for all major Indian languages.
- IIT Bombay Parallel Corpus
The IIT Bombay English-Hindi corpus contains an English-Hindi parallel corpus as well as a monolingual Hindi corpus, collected from a variety of existing sources and from corpora developed at the Center for Indian Language Technology, IIT Bombay over the years. This is the largest English-Hindi parallel corpus available in the public domain.
- IndicNLP Catalog
A comprehensive listing of Indian language NLP resources.
- IndoWordNet Parallel Corpus. Parallel corpora between Indian languages, mined from IndoWordNet glosses and/or examples (6.3 million segments, 18 languages).
- GeoMM Word Embeddings for Indian languages
Bilingual embeddings for Indian languages, trained using GeoMM.
- Xlit-Crowd: Hindi-English Transliteration Corpus
A corpus of Hindi-English transliteration pairs. The pairs were obtained via crowdsourcing by asking workers to transliterate Hindi words into the Roman script. The tasks were run on Amazon Mechanical Turk and yielded a total of 14,919 pairs.
- Xlit-IITB-Par: Hindi-English Transliteration Corpus
A corpus of Hindi-English transliteration pairs. The pairs were automatically mined from the IIT Bombay English-Hindi Parallel Corpus using the Moses Transliteration Module. The corpus contains 68,922 pairs.
- Brahmi-Net Transliteration Corpus for Indian languages
The Brahmi-Net transliteration resources consist of parallel transliteration corpora for 110 language pairs, covering 10 Indian languages and English. The transliteration corpus has been mined from the Indian Language Corpora Initiative (ILCI) parallel corpus, which contains sentences from the tourism and health domains.
- Sata-Anuvaadak: Translation Resources
110 translation models for phrase-based SMT between the languages mentioned above. This includes phrase tables, lexicalized reordering models, and language models, along with learnt parameters.
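
The figure of 110 that appears for both Brahmi-Net and Sata-Anuvaadak is consistent with 11 languages (10 Indian languages plus English) when each translation or transliteration direction is counted separately: 11 × 10 = 110. The short sketch below checks that arithmetic. Counting pairs per direction is an assumption made here, and the language labels are hypothetical placeholders, since the list above does not name the individual languages.

```python
from itertools import permutations

# Placeholder labels (hypothetical): the list above does not name the
# 10 Indian languages, so generic names stand in for real language codes.
languages = [f"indic_{i}" for i in range(1, 11)] + ["english"]

# Ordered (source, target) pairs, never pairing a language with itself.
# Assumption: each direction counts as its own pair.
directed_pairs = list(permutations(languages, 2))

print(len(languages))       # 11
print(len(directed_pairs))  # 110
```

If each pair were counted once regardless of direction, the same 11 languages would give 55 pairs instead.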