asato ma sad gamaya
tamaso ma jyotir gamaya
mrutyor ma amritam gamaya
From ignorance lead me to truth
From darkness lead me to light
From death lead me to immortality
~ Brhadaranyaka Upanishad
I am Anoop Kunchukuttan. I am a Principal Applied Researcher in the Microsoft Machine Translation team in Hyderabad, India. I am a founding member and co-lead of AI4Bharat, a research center based at IIT Madras that works to drive advances and build resources for Indian language NLP. I am also an adjunct faculty member at IIT Madras.
My research areas are Natural Language Processing, Machine Learning, Information Extraction and Retrieval.
My research interests include multilingual learning, instruction tuning of LLMs, representation learning, lexical and sentence semantics, NLP for related languages, machine translation and transliteration. I am interested in building tools and resources for Indian language NLP. Over the last decade, I have built/contributed to large-scale, broad coverage resources like the Indic NLP Library, IndicTrans/Sata-Anuvaadak Translation systems, IIT Bombay Parallel Corpus, Samanantar Corpus, Indic NLP/NLG Suite and Aksharantar/BrahmiNet transliteration corpora.
I completed my Ph.D. in 2018 at the Department of Computer Science and Engineering, IIT Bombay. I did my research under the guidance of Prof. Pushpak Bhattacharyya at the Center for Indian Language Technology. My doctoral research explored various facets of machine translation and transliteration between related languages.
Last updated on 22 Oct 2023
- 22 Nov 2023: IndicTrans2 paper accepted to the Transactions on Machine Learning Research (TMLR) journal. [pre-print]
- 15 Nov 2023: Our paper "A Comprehensive Analysis of Adapter Efficiency" has been accepted to CoDS-COMAD 2024. [pre-print]
- 6 Oct 2023: Three papers accepted to EMNLP 2023 (1 Main, 2 Findings). All three works will be presented as posters at EMNLP. [Details]
- 5 Oct 2023: Work from my team at Microsoft Translator on supporting 4 new Indian languages (Bhojpuri, Bodo, Dogri, and Kashmiri) is now live. [Details]
- 1 Sep 2023: Starting new role as Principal Applied Researcher at Microsoft India.
- 26 Jun 2023: IndicSUPERB - our benchmark for Speech Language Understanding tasks accepted to AAAI [link]
- 26 Jun 2023: Shrutilipi - our work on mining ASR corpora from All India Radio accepted to ICASSP [link]
- 20 Jun 2023: Our work on Comprehensive Analysis of Adapter Efficiency accepted to the ES-FoMo: Efficient Systems for Foundation Models workshop at ICML (non-archival) [arxiv]
- 25 May 2023: Public Release of IndicTrans2, the first MT system supporting 22 Indian languages [arxiv] [Developer Site] [Try it out]
- 23 May 2023: Jugalbandi, a chatbot powered by Azure OpenAI and AI4Bharat translation/ASR/TTS models, showcased at Build2023. [video]
- 22 May 2023: New pre-print on example selection for MT with LLMs [arxiv]
- 19 May 2023: Work from my team at Microsoft Translator on supporting 4 new Indian languages (Konkani, Maithili, Sindhi, Sinhala) is now live. [Details]
- 12 May 2023: New pre-print on Comprehensive Analysis of Adapter Efficiency [arxiv]
- 10 May 2023: New pre-print on machine translation for extremely low-resource languages [arxiv]
- 1 May 2023: Four papers on Indian language NLP accepted to ACL 2023. [Details]
- 24 Jan 2023: Invited talk at IIT Hyderabad on Mining Datasets at scale for Building High-quality NLP Models [slides]
- 28 Jul 2022: Inauguration of the AI4Bharat center at IIT Madras.
- 14 Apr 2022: Invited talk at IISER Bhopal on Multilingual Learning and Mining Datasets for Building High-quality NLP Models [slides]
- 10 Mar 2022: IndicNLG Suite released with 5 generation tasks for 11 Indian languages [paper] [homepage]
- 4 Mar 2022: I conducted lectures on sequence labeling and sequence-to-sequence learning covering RNN, LSTM, Transformers, etc. for CS-772 (Deep Learning for NLP) by Prof. Pushpak Bhattacharyya.
- 4 Mar 2022: Our paper on IndicBART, a seq2seq pretrained model for 11 Indian languages accepted to Findings of ACL 2022
- 31 Dec 2021: I presented a tutorial on the AI4Bharat Initiative at ICON 2021 with Mitesh Khapra and Pratyush Kumar
- 10 Dec 2021: Our paper on IndicWav2Vec, a pretrained speech model for 40 Indian languages accepted to AAAI 2022
- 4 Dec 2021: I presented an invited talk at Tamil Internet Conference 2021 on Indian Language Computing: A Multilingual Perspective
- 20 Oct 2021: Work from my team at Microsoft Translator on supporting Dhivehi (language spoken in Maldives) is now live. [Details]
- 5 Aug 2021: Glad to be part of the Samanantar team that received the NASSCOM AI Gamechangers Award 2021.
- 15 Aug 2021: Glad to chair the SIGKDD 2021 Data Science in India Workshop networking session on NLP.
- 15 Jul 2021: I conducted sessions on Machine Translation at the ACM NLP Summer school.
- 1 Jul 2021: Our survey paper on Multilingual Pre-trained models is now available on arxiv.
- 15 Jun 2021: Happy to be part of the CIIL panel discussion on "Language Resources for AI in Indian Languages".
- 18 Apr 2021: My team at Microsoft India will be presenting our work at EACL 2021 on large-scale multilingual transliteration for Indian languages, using mined transliteration corpora of 600k word pairs between English and 10 Indic languages.
- 13 Apr 2021: We at AI4Bharat with EkStep Foundation released Samanantar, the largest publicly available corpus for Indian languages containing 46M sentence pairs between English and 11 Indian languages.
- 15 Feb 2021: I conducted lectures on sequence labeling and sequence-to-sequence learning covering RNN, LSTM, Transformers, etc. for CS-772 (Deep Learning for NLP) by Prof. Pushpak Bhattacharyya.
- 2 Jan 2021: Glad to chair an NLP Session at CoDS-COMAD 2021.
- 20 Dec 2020: Glad to chair a Machine Translation Session at ICON 2020.
- 03 Dec 2020: Presented a talk in Prof. Tanmoy Chakraborty's ML course (IIIT Delhi) on Bridging the gap between Experimental Prototypes and Production ML systems.
- 10 Nov 2020: I will be part of a panel discussion on NLP/MT for low-resource languages at WMT 2020.
- 19 Oct 2020: Invited Talk on Indic NLP: A Multilinguality and Language Relatedness Perspective at Vaibhav Summit (Organized by MyGov).
- 18 Oct 2020: Lecture on Understanding the Indian Languages: Challenges & Opportunities for Atal Faculty Development Program on Artificial Intelligence in Natural Language Processing at KIIT University, Bhubaneswar.
- 29 Sep 2020: Work from my team at Microsoft Translator on supporting Assamese is now live. [Details]
- 22 Sep 2020: IndicNLPSuite released containing large monolingual corpora, BERT models, embeddings and NLU datasets.
- 15 Sep 2020: Our paper on NLP resources for Indian languages, IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages, accepted to EMNLP Findings 2020 [preprint].
- 13 Aug 2020: Work from my team at Microsoft Translator on supporting Odia is now live. [Details]
- 09 Aug 2020: IITB Parallel Corpus v3.0 released. 47,000 new sentence pairs added. See details [HERE].
- 09 Aug 2020: Finally documented the BrahmiNet-ITRANS transliteration scheme. See details [HERE].
- 15 Jul 2020: Indian language multilingual translation shared task for WAT 2020 launched. We are resuming this task with larger parallel corpora. See details [HERE].
- 09 Jul 2020: Bamdev presented our paper on Geometric Meta-embeddings at the REPL4NLP workshop (ACL 2020) [VIDEO]
- 09 Jul 2020: We showcased the AI4Bharat-IndicNLP dataset at the REPL4NLP workshop (ACL 2020) [VIDEO]
- 27 Jun 2020: It was great to moderate a talk by my advisor Prof. Pushpak Bhattacharyya on Imparting Sentiment and Politeness on Computers at the IIT Alumni Center Bangalore [video]
- 10 Jun 2020: ACM Computing Surveys has accepted our survey paper on Multilingual NMT. Camera-ready coming soon. [HERE]
- 24 May 2020: Keynote Talk on NLP for Indian Languages: A Language Relatedness Perspective at 5th WILDRE workshop (under LREC 2020). [slides]
- 16 May 2020: Lecture on Machine Translation at IIT Hyderabad as part of NLP course. [slides]
- 16 May 2020: IndoWordnet Parallel Corpus v0.2 released. Fixes critical issues with v0.1. [link]
- 01 May 2020: IndoWordnet Parallel Corpus is being used for the WMT 2020 shared task on similar language translation for Hindi-Marathi translation. [link]
- 30 Apr 2020: AI4Bharat-IndicNLP dataset released (built in collaboration with IIT Madras). Contains NLP resources for 10+ Indian languages. [Link to paper]
- 15 Apr 2020: Work from my team at Microsoft Translator on supporting 5 new Indian languages (Marathi, Gujarati, Punjabi, Malayalam and Kannada) is now live. [Details]
- 25 Mar 2020: IndoWordnet Parallel Corpus v0.1 released. [link]
- 20 Mar 2020: Manuscript of Utilizing Language Relatedness to improve Machine Translation: A Case Study on Languages of the Indian Subcontinent available on arxiv. [link]
- Feb 2020: IndicNLP library featured on analyticsindiamag [link]
- 23 Jan 2020: IndicNLP library featured on AnalyticsVidhya [link]
- 16 Jan 2020: Lecture on Neural MT at CEP course on Deep Learning for Natural Language Processing at IIT Patna [slides]
- 05 Jan 2020: Our revised and expanded survey on Multilingual Neural MT is available.
- 26 Oct 2019: Tutorial on Multilingual NMT accepted at COLING 2020, Barcelona, September 2020 with Raj Dabre and Chenhui Chu.
- 26 Oct 2019: Workshop on Asian Translation (WAT) 2020 to be co-located with AACL/IJCNLP 2020. We will have en-hi and en-ta tasks. We may also have a multilingual Indic language translation task.
- 30 Aug 2019: Started a collaborative catalog for Indian language NLP resources. Please contribute to improve the catalog.
- 30 Aug 2019: Invited talk at NASSCOM DSAI-CoE on NLP for Indian Languages: A Language Relatedness Perspective. [slides]
- 29 Jul 2019: Presented our paper Learning Multilingual Word Embeddings in Latent Metric Space: A Geometric Approach at ACL 2019. [video]
- 27 Jul 2019: Tutorial at the IIT Alumni Center Bengaluru AI Deep Dive Workshop 2019 on Natural Language Processing - A Distributional Approach. [slides]
- 4 Sep 2018: Work from my team at Microsoft Translator on supporting Telugu is now live. [Details]