Doctoral Dissertation

Name

YEMANE KELETA TEDLA

Title

Tigrinya Morphological Segmentation with Bidirectional Long Short-Term Memory Neural Networks and its Effect on English-Tigrinya Machine Translation

Date

September 30, 2018

Thesis

PDF file


Abstract

This thesis presents fundamental natural language processing (NLP) research for Tigrinya, a morphologically rich and low-resource language. We compiled new Tigrinya language resources, including a medium-sized news text corpus, the first morphologically segmented corpus, and an English-Tigrinya parallel corpus, all of which were employed in the research described below.

First, we utilized the unique morphological patterns of Tigrinya to boost the performance of a part-of-speech (POS) tagger, particularly on unknown words, using support vector machines (SVM) and conditional random fields (CRF). Furthermore, we obtained a state-of-the-art accuracy of 91.6% by approaching POS tagging as sequence-to-sequence labeling using bidirectional long short-term memory (BiLSTM) networks with word embeddings, forgoing feature engineering.

Second, we presented the first study of morphological segmentation for Tigrinya. We explored language-independent character and substring features with CRF. In addition, we obtained a state-of-the-art F1 score of 95.07% with BiLSTM networks using concatenated character and word embeddings. This approach does not require feature engineering to extract linguistic information, which is useful for languages lacking sufficient resources.
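
As a rough illustration of how morphological segmentation can be cast as a labeling problem that a CRF or BiLSTM tagger could learn, the sketch below converts a segmented word into per-character boundary labels and back. The B/I (begin/inside) label scheme, the function names, and the English example word are assumptions made for illustration only; the abstract does not specify the tagging scheme used in the thesis.

```python
def morphemes_to_labels(morphemes):
    """Turn a list of morphemes into parallel lists of characters and B/I labels.

    "B" marks the first character of a morpheme, "I" marks every other character.
    This label scheme is an illustrative assumption, not the thesis's scheme.
    """
    chars, labels = [], []
    for morpheme in morphemes:
        for i, ch in enumerate(morpheme):
            chars.append(ch)
            labels.append("B" if i == 0 else "I")
    return chars, labels


def labels_to_morphemes(chars, labels):
    """Rebuild morphemes from characters and predicted B/I labels."""
    morphemes = []
    for ch, label in zip(chars, labels):
        if label == "B" or not morphemes:
            morphemes.append(ch)   # start a new morpheme
        else:
            morphemes[-1] += ch    # extend the current morpheme
    return morphemes


# Hypothetical English example, not taken from the Tigrinya corpus.
chars, labels = morphemes_to_labels(["un", "break", "able"])
restored = labels_to_morphemes(chars, labels)
```

Under this framing, a tagger only has to predict one label per character, and the segmented form is recovered deterministically from the predicted labels.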

Finally, we explored machine translation from English to Tigrinya, a morphologically rich language. This task is challenged by several factors, including the out-of-vocabulary problem, language model perplexity, and poor word alignment. We introduced shallow and fine-grained morphological segmentation to mitigate these problems. Overall, the results show that translation with morphologically segmented models can improve translation quality.
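
To make the out-of-vocabulary point concrete, the following minimal sketch compares the OOV rate of a test set at the word level and at the morpheme level. The `oov_rate` helper and the tiny English token lists are hypothetical and chosen only to show the effect; they are not data or code from the thesis.

```python
def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens that never occur in the training tokens."""
    vocab = set(train_tokens)
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)


# Hypothetical toy data: whole words versus the same words split into morphemes.
train_words = ["walked", "talks"]
test_words = ["walks", "talked"]

train_morphs = ["walk", "ed", "talk", "s"]
test_morphs = ["walk", "s", "talk", "ed"]

word_level = oov_rate(train_words, test_words)
morph_level = oov_rate(train_morphs, test_morphs)
```

In this toy case every test word is unseen at the word level, while every test morpheme has been seen in training, which is the kind of vocabulary sharing that segmentation is intended to provide.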


Publication (Journal, all peer-reviewed)

Publication (International conference, all peer-reviewed)

Publication (Domestic conference, not peer-reviewed)