Japanese Orthographical Normalization Does Not Work for Statistical Machine Translation


Kazuhide Yamamoto and Kanji Takahashi. Japanese Orthographical Normalization Does Not Work for Statistical Machine Translation. Proceedings of the International Conference on Asian Language Processing (IALP 2016), pp.133-136 (2016.11)


We investigated the effect of normalizing Japanese orthographical variants into a uniform orthography on statistical machine translation (SMT) between Japanese and English. In Japanese, 10% of words reportedly have more than one orthographical variant, which suggests that normalizing these variants could improve translation quality. However, the results show that SMT with normalization performs equivalently to SMT without normalization in both BLEU and RIBES, even though normalization reduces the size of the language models, their perplexity, and the number of out-of-vocabulary words. We discuss the potential reasons in this paper.
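To make the idea concrete, the following is a minimal sketch of orthographical normalization and its effect on vocabulary size and out-of-vocabulary (OOV) counts. It is not the method used in the paper: the variant table `VARIANT_TO_CANONICAL`, the choice of canonical forms, and the toy sentences are all hypothetical, and the input is assumed to be already tokenized.

```python
# Hypothetical variant table: each variant maps to one canonical spelling.
# The entries and the canonical forms are illustrative assumptions only.
VARIANT_TO_CANONICAL = {
    "りんご": "林檎",
    "リンゴ": "林檎",
    "引越し": "引っ越し",
    "引越": "引っ越し",
}


def normalize(tokens):
    """Replace every known variant with its canonical form."""
    return [VARIANT_TO_CANONICAL.get(token, token) for token in tokens]


def count_oov(train_sentences, test_sentences):
    """Count test tokens that never appear in the training sentences."""
    vocab = {token for sentence in train_sentences for token in sentence}
    return sum(
        1 for sentence in test_sentences for token in sentence if token not in vocab
    )


def vocab_size(sentences):
    """Number of distinct tokens in a list of tokenized sentences."""
    return len({token for sentence in sentences for token in sentence})


# Toy, pre-tokenized data (assumed; not from the paper's corpus).
train = [["りんご", "を", "食べる"], ["引越し", "を", "する"]]
test = [["リンゴ", "を", "買う"], ["引越", "を", "する"]]

train_norm = [normalize(s) for s in train]
test_norm = [normalize(s) for s in test]

oov_raw = count_oov(train, test)
oov_norm = count_oov(train_norm, test_norm)
vocab_raw = vocab_size(train + test)
vocab_norm = vocab_size(train_norm + test_norm)

print(f"OOV tokens: {oov_raw} -> {oov_norm}")
print(f"Vocabulary size: {vocab_raw} -> {vocab_norm}")
```

On this toy data, normalization shrinks both the vocabulary and the OOV count, which mirrors the reductions reported above; the sketch says nothing about BLEU or RIBES, which is exactly the gap the paper examines.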

More Information

We also investigated the effect of orthographical normalization on neural machine translation (NMT).

We observed a tendency in NMT similar to that in phrase-based SMT (PBSMT).

Question & Answer at conference

You should evaluate sentences that contain orthographical variants.

That’s a nice comment. We will try to evaluate only the sentences that contain orthographical variants.
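One way to build such an evaluation subset is sketched below. This is only an assumption about how the filtering could be done, not something the authors describe: the variant set `KNOWN_VARIANTS`, the function names, and the toy data are hypothetical, and the source sentences are assumed to be already tokenized.

```python
# Hypothetical set of surface forms known to be orthographical variants.
KNOWN_VARIANTS = {"りんご", "リンゴ", "引越し", "引越"}


def has_variant(tokens, variants):
    """True if at least one token is a known orthographical variant."""
    return any(token in variants for token in tokens)


def select_variant_subset(sources, references, hypotheses, variants):
    """Keep only the (source, reference, hypothesis) triples whose
    source sentence contains an orthographical variant."""
    return [
        (src, ref, hyp)
        for src, ref, hyp in zip(sources, references, hypotheses)
        if has_variant(src, variants)
    ]


# Toy data (assumed): tokenized Japanese sources with English
# references and system outputs.
sources = [["リンゴ", "を", "買う"], ["本", "を", "読む"]]
references = ["I buy an apple.", "I read a book."]
hypotheses = ["I buy apples.", "I read a book."]

subset = select_variant_subset(sources, references, hypotheses, KNOWN_VARIANTS)
print(len(subset), "of", len(sources), "sentences contain a variant")
```

The resulting subset could then be scored with the same metrics as the full test set, so that any effect of normalization is not diluted by sentences it never touches.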

Did you apply pre-ordering or any other processing that helps translation?

We didn’t, because we wanted to investigate the effect of orthographical normalization on its own.