Tigrinya Morphological Segmentation with Bidirectional Long Short-Term Memory Neural Networks and its Effect on English-Tigrinya Machine Translation
September 2018 (submitted and currently under review)
This thesis presents fundamental natural language processing (NLP) research for Tigrinya, a morphologically rich and low-resource language. We compiled new Tigrinya language resources, including a medium-sized news text corpus, the first morphologically segmented corpus, and an English-Tigrinya parallel corpus, all of which were employed in the research described below.
First, we utilized the unique morphological patterns of Tigrinya to boost the performance of a part-of-speech (POS) tagger, particularly on unknown words, using support vector machines (SVM) and conditional random fields (CRF). Furthermore, we obtained a state-of-the-art accuracy of 91.6% by approaching POS tagging as a sequence-to-sequence labeling problem, using bidirectional long short-term memory (BiLSTM) networks with word embeddings and forgoing feature engineering.
Second, we presented the first study of morphological segmentation for Tigrinya. We explored language-independent character and substring features with CRF. In addition, we obtained a state-of-the-art F1 score of 95.07% with BiLSTM networks using concatenated character and word embeddings. This approach does not require feature engineering to extract linguistic information, which is useful for languages lacking sufficient resources.
Finally, we explored machine translation from English to Tigrinya, a morphologically rich language. This task is challenged by several factors, including the out-of-vocabulary problem, high language model perplexity, and poor word alignment. We introduced shallow and fine-grained morphological segmentation to mitigate these problems. Overall, the results show that models trained on morphologically segmented text can improve translation quality.
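As an illustration of how segmentation can be treated as a labeling task and scored with F1, the following is a minimal sketch. The B/I tag scheme, the hyphen-delimited input format, the helper names (`to_tags`, `boundaries`, `boundary_f1`), and the English-like toy words are all assumptions made for this example; the thesis may use a different tagging scheme and a different evaluation unit.

```python
def to_tags(segmented, sep="-"):
    """Map a segmented word such as 'un-lock-ed' to per-character tags.

    'B' marks the first character of a morpheme, 'I' any later character.
    (Hypothetical scheme chosen for illustration.)
    """
    tags = []
    for morph in segmented.split(sep):
        if not morph:
            continue  # ignore empty pieces from doubled separators
        tags.append("B")
        tags.extend("I" * (len(morph) - 1))
    return tags


def boundaries(tags):
    """Return character positions where a new morpheme starts (word start excluded)."""
    return {i for i, tag in enumerate(tags) if tag == "B" and i > 0}


def boundary_f1(gold_words, pred_words):
    """Compute boundary-level precision, recall, and F1 over paired words.

    Assumes each gold/predicted pair spells the same unsegmented word.
    """
    tp = fp = fn = 0
    for gold, pred in zip(gold_words, pred_words):
        gold_b = boundaries(to_tags(gold))
        pred_b = boundaries(to_tags(pred))
        tp += len(gold_b & pred_b)
        fp += len(pred_b - gold_b)
        fn += len(gold_b - pred_b)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = ["un-lock-ed", "re-play"]      # toy reference segmentation
    pred = ["un-locked", "re-play"]       # toy system output, one boundary missed
    print(boundary_f1(gold, pred))
```

In this toy case the system finds two of the three reference boundaries and proposes no spurious ones, so precision is perfect while recall is lower.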
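The effect of segmentation on the out-of-vocabulary problem can be sketched with a small calculation. The example below is only an illustration under stated assumptions: the function names (`oov_rate`, `as_words`, `as_morphemes`), the hyphen-delimited format, and the English-like toy data are hypothetical and are not taken from the thesis or its corpora.

```python
def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens that never appear in the training tokens."""
    vocab = set(train_tokens)
    if not test_tokens:
        return 0.0
    unseen = sum(1 for token in test_tokens if token not in vocab)
    return unseen / len(test_tokens)


def as_words(segmented_words, sep="-"):
    """Unsegmented view: drop the separators so each word is one token."""
    return [word.replace(sep, "") for word in segmented_words]


def as_morphemes(segmented_words, sep="-"):
    """Segmented view: each morpheme becomes its own token."""
    return [m for word in segmented_words for m in word.split(sep) if m]


if __name__ == "__main__":
    # Toy data: stems and suffixes recombine in ways unseen during training.
    train = ["walk-ed", "talk-s", "jump"]
    test = ["walk-s", "talk-ed", "jump-ed"]

    print("word-level OOV rate:", oov_rate(as_words(train), as_words(test)))
    print("morpheme-level OOV rate:", oov_rate(as_morphemes(train), as_morphemes(test)))
```

In this toy setting every test word is unseen as a whole word, yet all of its morphemes occur in the training data, which is the intuition behind segmenting a morphologically rich target language before translation.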
Publication (Journal, all peer-reviewed)
- Yemane Tedla and Kazuhide Yamamoto, “Morphological Segmentation with LSTM Neural Networks for Tigrinya”, International Journal on Natural Language Computing (IJNLC), Vol. 7, No. 2, pp. 29-44, 2018.
- Yemane Tedla and Kazuhide Yamamoto, “Morphological Segmentation for English-to-Tigrinya Statistical Machine Translation”, International Journal of Asian Language Processing, Vol. 27, No. 2, pp. 95-110, 2017.
- Yemane Tedla, Kazuhide Yamamoto and Ashuboda Marasinghe, “Tigrinya Part-of-Speech Tagging with Morphological Patterns and the New Nagaoka Tigrinya Corpus”, International Journal of Computer Applications, Vol. 146, No. 14, pp. 33-41, 2016.
Publication (International conference, all peer-reviewed)
- Yemane Tedla and Kazuhide Yamamoto, “Analyzing word embeddings and improving POS tagger of Tigrinya”, in Proceedings of the International Conference on Asian Language Processing (IALP), IEEE, pp. 115-118, Singapore, 2017.
- Yemane Tedla and Kazuhide Yamamoto, “The Effect of Shallow Segmentation on English-Tigrinya Statistical Machine Translation”, in Proceedings of the International Conference on Asian Language Processing (IALP), IEEE, pp. 79-82, Taiwan, 2016.
Publication (Domestic conference, not peer-reviewed)
- Yemane Tedla, Kazuhide Yamamoto and Ashuboda Marasinghe, “Nagaoka Tigrinya Corpus: Design and Development of Part-of-speech Tagged Corpus”, in Language Processing Society 22nd Annual Meeting Papers Collection, The Association for Natural Language Processing, pp. 413-416, Japan, 2016.