Development of English to Indian Languages : Machine Translation System (E-ILMT)
1. Project overview
1.1 Proposal summary
The EILMT project aims to design and deploy a machine translation system from English to Indian languages in the tourism and healthcare domains. The project is funded by the Department of Information Technology, MCIT, Government of India, and started in September 2006.
1.2 Consortium Members of EILMT system:
· C-DAC Mumbai
· IISc Bangalore
· IIT Hyderabad
· C-DAC Pune
· IIT Mumbai
· Jadavpur University Kolkata
· IIIT Allahabad
· Utkal University Bhubaneswar
· Amrita University Coimbatore
· Banasthali Vidyapeeth, Banasthali
1.3 Why MT?
The main objective of Machine Translation (MT) is to break the language barrier in a multilingual nation like India. The majority of the Indian population is not familiar with English, while most of the information available on the web or in other electronic form is in English. An automatic language translator is therefore important for reaching the common man across all sections of society.
2. Project Description
C-DAC Mumbai’s task is to build statistical models and resources for a statistical MT (SMT) system from English to Hindi, Marathi and Bengali. SMT is a mathematical model in which the process of human translation is modelled statistically. Statistical methods allow parallel text corpora to be analysed and machine translation systems to be constructed automatically. In an SMT system, correspondences between words in the source and target languages are learned from bilingual corpora on the basis of alignment models. The engine uses state-of-the-art statistical techniques that are gaining momentum in the MT community.
The primary objective is to first build an English-Hindi translation system capable of translating free-flowing text as found on the web, and then gradually adapt it to other Indian language pairs. Various enhancements have been introduced into the baseline system, leading to an improved translation engine despite the lack of linguistic resources and sophisticated tools for Indian languages.
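To make the idea of learning word correspondences from a bilingual corpus concrete, the following is a minimal sketch of IBM Model 1 trained with expectation-maximisation, one of the classic alignment models. It is an illustration only: this page does not state which alignment model the EILMT engine relies on, and the function names and the romanised English-Hindi toy corpus are assumptions made for this example.

```python
def train_ibm_model1(parallel_corpus, iterations=20):
    """Estimate word-translation probabilities t(target | source) with EM.

    parallel_corpus is a list of (source_tokens, target_tokens) pairs.
    Returns a dict keyed by (target_word, source_word).
    """
    target_vocab = {t for _, tgt in parallel_corpus for t in tgt}

    # Start from a uniform distribution over every co-occurring word pair.
    t_prob = {}
    for src, tgt in parallel_corpus:
        for t in tgt:
            for s in src:
                t_prob[(t, s)] = 1.0 / len(target_vocab)

    for _ in range(iterations):
        count = {}  # expected count of (target, source) alignments
        total = {}  # expected count of each source word being aligned
        for src, tgt in parallel_corpus:
            for t in tgt:
                # E-step: share one unit of probability for t among the
                # source words of the same sentence pair.
                norm = sum(t_prob[(t, s)] for s in src)
                for s in src:
                    frac = t_prob[(t, s)] / norm
                    count[(t, s)] = count.get((t, s), 0.0) + frac
                    total[s] = total.get(s, 0.0) + frac
        # M-step: renormalise the expected counts per source word.
        for (t, s), c in count.items():
            t_prob[(t, s)] = c / total[s]
    return t_prob


def best_translation(t_prob, source_word):
    """Return the target word with the highest t(target | source_word)."""
    candidates = {t: p for (t, s), p in t_prob.items() if s == source_word}
    return max(candidates, key=candidates.get)


# Hypothetical toy corpus: English source, romanised Hindi target.
TOY_CORPUS = [
    ("red house".split(), "laal ghar".split()),
    ("red book".split(), "laal kitaab".split()),
    ("a book".split(), "ek kitaab".split()),
]

if __name__ == "__main__":
    model = train_ibm_model1(TOY_CORPUS)
    for word in ("red", "house", "book", "a"):
        print(word, "->", best_translation(model, word))
```

On this toy corpus the model learns to pair "book" with "kitaab" purely from co-occurrence, without a dictionary. A production system would add a NULL source word, higher-order alignment models and phrase extraction on top of this step.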
3. Overall Strategy
· Collecting and Developing Monolingual & Bilingual corpora
· Building language models for Hindi/Marathi/Bengali
· Building Translation modules for English-Hindi/English-Marathi/English-Bengali
· Tuning
· Evaluation
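The language-model step in the list above can be sketched as a bigram model with add-one smoothing. This is an assumption for illustration: the page does not name the language-modelling toolkit or the n-gram order used in the project, and the romanised Hindi sentences and function names below are invented for the example.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Count histories and bigrams over whitespace-tokenised sentences."""
    history_counts, bigram_counts = Counter(), Counter()
    vocab = set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens[1:])                  # words the model can predict
        history_counts.update(tokens[:-1])        # every token that has a successor
        bigram_counts.update(zip(tokens, tokens[1:]))
    return history_counts, bigram_counts, len(vocab)


def log_prob(sentence, model):
    """Natural-log probability of a sentence under add-one smoothing."""
    history_counts, bigram_counts, vocab_size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        numerator = bigram_counts[(prev, word)] + 1
        denominator = history_counts[prev] + vocab_size
        total += math.log(numerator / denominator)
    return total


# Hypothetical monolingual target-side corpus (romanised Hindi).
TOY_MONOLINGUAL = [
    "raam ghar jaataa hai",
    "siitaa ghar jaatii hai",
    "raam kitaab padhtaa hai",
]

if __name__ == "__main__":
    lm = train_bigram_lm(TOY_MONOLINGUAL)
    print(log_prob("raam ghar jaataa hai", lm))
    print(log_prob("hai jaataa ghar raam", lm))
```

The decoder uses such a score to prefer fluent target-language word orders: the sentence in natural order receives a higher log probability than the same words scrambled.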
4. Expected Outcome
· SMT system of English-Hindi/English-Marathi/English-Bengali
· Research paper
5. Current Status
As part of this ongoing work, the team has developed the following state-of-the-art architecture for the English-Hindi language pair, which improves the output in three successive stages:
· The SMT engine of the EILMT system was initially developed as a baseline system using the state-of-the-art techniques and tools then available, including the POS tagger (fnTBL), parser (Bikel) and decoder (Pharaoh). The training corpus for the translation model consisted of 5000 sentences, and 800 sentences were set aside for testing and tuning.
· Owing to the inherent structure of Indian languages, the baseline techniques proved inadequate for producing good-quality output. The system was therefore extended with a pre-processing stage in which the source-language text is syntactically reordered, reducing the long-distance movements that the SMT engine has to handle. This gave a better phrase alignment table and hence a clear improvement in translation quality, using the Moses decoder with the Giza++ alignment tool. The training corpus for the translation model at this stage was 12299 sentences, with an additional 1570 sentences set aside for testing and tuning.
· Because the corpus was smaller than the magnitude expected, data sparsity was evident and often caused some degradation in the output even after syntactic processing. To counter this, the syntactically processed corpus was also morphologically processed before training. As sophisticated morphological analyzers were unavailable, a rule-based suffix separation approach was used to separate the root word from its affixes.
  · The result was an improved phrase table enriched with word-level information to a sufficient extent.
  · This also lessened the data sparsity problem, reducing the degradation in the output caused by lack of vocabulary.
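The syntactic reordering stage described above can be illustrated with a toy sketch that moves the verb of an English clause to the end and turns prepositions into postpositions, approximating Hindi word order. It assumes the clause has already been split into labelled constituents; the role labels, function names and the single rule shown here are hypothetical and are not the project's actual rule set, which operates on parser output.

```python
from typing import List, Tuple

# (role, phrase) pairs; the role labels "S", "V", "O", "PP" are
# hypothetical tags chosen for this example.
Constituent = Tuple[str, str]


def flip_preposition(phrase: str) -> str:
    """Turn 'in Agra' into 'Agra in' to mimic a Hindi postposition."""
    head, _, rest = phrase.partition(" ")
    return f"{rest} {head}" if rest else head


def reorder_clause(constituents: List[Constituent]) -> str:
    """Reorder an English clause so that the verb comes last."""
    verbs, others = [], []
    for role, phrase in constituents:
        if role == "V":
            verbs.append(phrase)
        elif role == "PP":
            others.append(flip_preposition(phrase))
        else:
            others.append(phrase)
    # Non-verb constituents keep their relative order; verbs go last.
    return " ".join(others + verbs)


EXAMPLE_CLAUSE = [
    ("S", "the tourist"),
    ("V", "visited"),
    ("O", "the temple"),
    ("PP", "in Agra"),
]

if __name__ == "__main__":
    print(reorder_clause(EXAMPLE_CLAUSE))
```

Training on reordered English means that aligned words sit closer together in the two languages, which is why the phrase alignment table improves.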
The present SMT system has been extended to the English-Marathi and English-Bengali pairs with the following statistics:
| Language Pair | Training size | Testing+tuning size |
| English-Marathi | 13598 | 1500 (750+750) |
| English-Bengali | 13015 | 1550 |
5.1 SMT architecture
Figure 1: Syntactic and Morphological Processing: Schematic
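The morphological stage named in the schematic, rule-based suffix separation, can be sketched as follows. The suffix list, the "+" marker and the minimum stem length are assumptions chosen for an English toy example; the project's actual rules are not given on this page.

```python
# Hypothetical suffix list, checked in the order given (longer suffixes first).
SUFFIXES = ("ing", "ed", "es", "s")


def split_suffix(word, min_stem=3):
    """Split a word into [stem, '+suffix'] if a known suffix matches."""
    for suffix in SUFFIXES:
        stem_length = len(word) - len(suffix)
        if word.endswith(suffix) and stem_length >= min_stem:
            return [word[:stem_length], "+" + suffix]
    return [word]  # no rule applies; leave the word untouched


def separate_suffixes(sentence):
    """Apply suffix separation to every whitespace-separated token."""
    tokens = []
    for word in sentence.split():
        tokens.extend(split_suffix(word))
    return " ".join(tokens)


def vocabulary_size(sentences):
    """Number of distinct tokens across a list of sentences."""
    return len({token for s in sentences for token in s.split()})


if __name__ == "__main__":
    raw = ["tourists visited forts", "tourist visits fort"]
    processed = [separate_suffixes(s) for s in raw]
    print(processed)
    print(vocabulary_size(raw), vocabulary_size(processed))
```

Because inflected forms such as "visited" and "visits" now share the stem "visit", the processed corpus has a smaller vocabulary than the raw one, which is how this step eases data sparsity. A naive rule set like this one will also mis-split some words, so it stands in for a proper morphological analyzer only as a rough approximation.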
5.2 Salient features
· Baseline SMT
· Baseline + syntactic information
· Baseline + syntactic information + morphological information
· Statistical tools (Pharaoh, Moses, Giza++, fnTBL, Bikel)
6. International Publication(s)
· R. Ananthakrishnan, Jayprasad Hegde, Pushpak Bhattacharyya, Ritesh Shah and M. Sasikumar, Simple Syntactic and Morphological Processing Can Help English-Hindi Statistical Machine Translation, International Joint Conference on NLP (IJCNLP08), Hyderabad, India, January 2008.