asato ma sad gamaya
tamaso ma jyotir gamaya
mrutyor ma amritam gamaya
From ignorance lead me to truth
From darkness lead me to light
From death lead me to immortality
~ Brhadaranyaka Upanishad
I am Anoop Kunchukuttan. I currently work as a Senior Applied Researcher at Microsoft AI and Research, in the Machine Translation team in Hyderabad, India. I am also a founding member of AI4Bharat, an initiative at IIT Madras focussed on building tools and technologies for Indian language NLP.
My research areas are Natural Language Processing, Machine Learning, Information Extraction and Retrieval.
My research interests include multilingual learning, representation learning, lexical and sentence semantics, NLP for related languages, machine translation and transliteration. I am interested in building tools and resources for Indian language NLP. Over the last decade, I have built/contributed to large-scale, broad coverage resources like the Indic NLP Library, Sata-Anuvaadak Translation system, IIT Bombay Parallel Corpus, Samanantar Corpus, Indic NLP Suite and BrahmiNet.
I completed my Ph.D. in 2018 at the Department of Computer Science and Engineering, IIT Bombay. I did my research under the guidance of Prof. Pushpak Bhattacharyya at the Center for Indian Language Technology. My doctoral research explored various facets of machine translation and transliteration between related languages.
Last updated on 10 August 2021
- 1 Jul 2021: Our survey paper on Multilingual Pre-trained models is now available on arxiv.
- 15 Jun 2021: Happy to be part of the CIIL panel discussion on "Language Resources for AI in Indian Languages".
- 18 Apr 2021: My team at Microsoft India will be presenting our work on large-scale multilingual transliteration for Indian languages at EACL 2021. The work uses mined transliteration corpora of 600k word pairs between English and 10 Indic languages.
- 13 Apr 2021: We at AI4Bharat with EkStep Foundation released Samanantar, the largest publicly available corpus for Indian languages containing 46M sentence pairs between English and 11 Indian languages.
- 15 Feb 2021: I conducted lectures on sequence labeling and sequence-to-sequence learning covering RNN, LSTM, Transformers, etc. for CS-772 (Deep Learning for NLP) by Prof. Pushpak Bhattacharyya.
- 2 Jan 2021: Glad to chair an NLP Session at CoDS-COMAD 2021.
- 20 Dec 2020: Glad to chair a Machine Translation Session at ICON 2020.
- 03 Dec 2020: Presented a talk at Prof. Tanmoy Chakraborthy's ML course (IIIT Delhi) on Bridging the gap between Experimental Prototypes and Production ML systems.
- 10 Nov 2020: I will be part of a panel discussion on NLP/MT for low-resource languages at WMT 2020.
- 19 Oct 2020: Invited Talk on Indic NLP: A Multilinguality and Language Relatedness Perspective at Vaibhav Summit (Organized by MyGov).
- 18 Oct 2020: Lecture on Understanding the Indian Languages: Challenges & Opportunities for Atal Faculty Development Program on Artificial Intelligence in Natural Language Processing at KIIT University, Bhubhaneshwar.
- 22 Sep 2020: IndicNLPSuite released containing large monolingual corpora, BERT models, embeddings and NLU datasets.
- 15 Sep 2020: Our paper on NLP resources for Indian languages, IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages, accepted to EMNLP Findings 2020 [preprint].
- 09 Aug 2020: IITB Parallel Corpus v3.0 released. 47,000 new sentence pairs added. See details [HERE].
- 09 Aug 2020: Finally documented the BrahmiNet-ITRANS transliteration scheme. See details [HERE].
- 15 Jul 2020: Indian language multilingual translation shared task for WAT 2020 launched. We are resuming this task with larger parallel corpora. See details [HERE].
- 09 Jul 2020: Bamdev presented our paper on Geometric Meta-embeddings at the REPL4NLP workshop (ACL 2020) [VIDEO]
- 09 Jul 2020: We showcased the AI4Bharat-IndicNLP dataset at the REPL4NLP workshop (ACL 2020) [VIDEO]
- 27 Jun 2020: It was great to moderate a talk by my advisor Prof. Pushpak Bhattacharyya on Imparting Sentiment and Politeness on Computers at the IIT Alumni Center Bangalore [video]
- 10 Jun 2020: ACM Computing Surveys has accepted our survey paper on Multilingual NMT. Camera-ready coming soon. [HERE]
- 24 May 2020: Keynote Talk on NLP for Indian Languages: A Language Relatedness Perspective at 5th WILDRE workshop (under LREC 2020). [slides]
- 16 May 2020: Lecture on Machine Translation at IIT Hyderabad as part of NLP course. [slides]
- 16 May 2020: IndoWordnet Parallel Corpus v0.2 released. Fixes critical issues with v0.1. [link]
- 01 May 2020: IndoWordnet Parallel Corpus is being used for the WMT 2020 shared task on similar language translation for Hindi-Marathi translation. [link]
- 30 Apr 2020: AI4Bharat-IndicNLP dataset released (built in collaboration with IIT Madras). Contains NLP resources for 10+ Indian languages. [Link to paper]
- 25 Mar 2020: IndoWordnet Parallel Corpus v0.1 released. [link]
- 20 Mar 2020: Manuscript of Utilizing Language Relatedness to improve Machine Translation: A Case Study on Languages of the Indian Subcontinent available on arxiv. [link]
- Feb 2020: IndicNLP library featured on analyticsindiamag [link]
- 23 Jan 2020: IndicNLP library featured on AnalyticsVidhya [link]
- 16 Jan 2020: Lecture on Neural MT at CEP course on Deep Learning for Natural Language Processing at IIT Patna [slides]
- 05 Jan 2020: Our revised and expanded survey on Multilingual Neural MT is available.
- 26 Oct 2019: Tutorial on Multilingual NMT accepted at COLING 2020, Barcelona, September 2020 with Raj Dabre and Chenhui Chu.
- 26 Oct 2019: Workshop on Asian Translation (WAT) 2020 to be co-located with AACL/IJCNLP 2020. We will have en-hi and en-ta tasks. We may also have a multilingual Indic language translation task.
- 30 Aug 2019: Started a collaborative catalog for Indian language NLP resources. Please contribute to improve the catalog.
- 30 Aug 2019: Invited talk at NASSCOM DSAI-CoE on NLP for Indian Languages: A Language Relatedness Perspective. [slides]
- 29 Jul 2019: Presented our paper Learning Multilingual Word Embeddings in Latent Metric Space: A Geometric Approach at ACL 2019. [video]
- 27 Jul 2019: Tutorial at the IIT Alumni Center Bengaluru AI Deep Dive Workshop 2019 on Natural Language Processing - A Distributional Approach. [slides]