Is Vietnamese machine translation achievable?

Vietnam is currently one of the fastest-growing economies in the world, which puts it on the list of countries where big companies may want to invest. And wherever business interest grows, demand for translation services grows too.

Nowadays the language barrier is easier to break with machine translation or, to be more up to date, neural machine translation (NMT). With other languages from the region, such as Chinese, we have seen that even the most modern algorithms sometimes cannot deliver near-human translation quality, which limits the use of modern technologies for translating Asian languages. We’ve decided to dig a bit deeper and see how this applies to Vietnamese machine translation. Does the limitation hold, or is Vietnamese machine translation actually achievable?

Looking back on past research

I hadn’t really considered the matter until the topic popped into my mind recently, so I started from the beginning. There is quite a lot of material describing the scientific approaches and how the algorithms work, but none of it is explained simply enough for people without technical knowledge.

In general, it is safe to say that a reasonable level of machine translation quality for Vietnamese-English is not hard to achieve. Most difficulties arise when translating between two Asian languages, such as Vietnamese-Japanese or Chinese-Vietnamese, but that’s another story.

With the recent development of NMT, there is room for improvement in any kind of machine translation. In this article, however, we do not aim to assess the output quality that any particular tool delivers. Instead, we outline linguistic specifics of the Vietnamese language that may make near-human translation quality difficult to reach.

Wordcount and why it matters

When we translate from one language to another, one thing to consider is how many words there are in the source text and how many there will be in the target text once the translation is finalized. This parameter is called wordcount and is well known in the translation industry.

When translating Vietnamese into English, the text tends to shrink by about 40%, while the opposite holds for English into Vietnamese translation, where the text grows.
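To make the idea concrete, here is a minimal sketch in Python that counts whitespace-separated words and computes the relative change between a source and a target text. The sentence pair is the one quoted later in this article. The function names `word_count` and `relative_change` are my own, chosen for illustration, and a single sentence is of course not a measurement of the general trend.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated units, the way a simple wordcount tool would."""
    return len(text.split())


def relative_change(source: str, target: str) -> float:
    """Relative change in wordcount from source to target (negative = shrinks)."""
    src = word_count(source)
    tgt = word_count(target)
    return (tgt - src) / src


# Illustrative pair taken from the quoted example in this article.
vi = "hôm nay là sinh nhật của tôi"
en = "Today is my birthday"

change = relative_change(vi, en)
print(word_count(vi), word_count(en), f"{change:.0%}")
```

For this one pair the Vietnamese side has 7 units and the English side has 4, so the count drops by roughly 43%, which happens to sit close to the figure above.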

What does wordcount have to do with NMT?

I wondered the same thing when I started reading about it, but there is a feature of Vietnamese that is easily showcased this way. Just follow my train of thought here.

Vietnamese has so-called compound words. Simply put, two words written separately in Vietnamese can correspond to one word in English. For example:

anh em (brother)

When it is used as a noun:

It literally means an older brother with younger siblings, with the following nuance: a brother with brothers only, or with one sister who is the youngest
Its broad meaning is siblings
It can also mean family or brethren

When used as a pronoun:

Y’all, you guys, my bros, my pals

Ý tôi là thế, anh em thấy thế nào?
So that was my opinion, what do you guys think?

Speaking of machine translation

When we take the above example and turn to Vietnamese machine translation, researchers refer to the parts that make up these compound words as subwords.

We all know that machine translation relies on so-called “corpora”: large collections of text on which the model, the very “brain” of the tools we use, is trained in advance. We usually train the “brain” to treat the string of characters between two white spaces as a word. And here lies the specific fact about Vietnamese:

“From the linguistic point of view, each sequence of characters between two white spaces in Vietnamese texts cannot be considered as a word since it does not always have a full meaning to stand alone. For example, in the sentence “hôm nay là sinh nhật của tôi” (English equivalence: “Today is my birthday”), “hôm” and “nay” are not two words, they together form a word, which means “today”. Nevertheless, “hôm” and “nay” somehow still bear some meaning: “hôm”-“day”, “nay”- “now”. Similarly, “sinh”-“birth” and “nhật”-“date” also form the word “sinh nhật”-“birthday” but they are not two distinct words. We could also call them subwords.”

The thing about subwords is that the smallest unit usually processed in NMT and MT is a word, so we additionally have to teach the “brain” to recognize which subwords belong together. This requires an additional step and an additional algorithm in order to achieve the desired translation quality. This process is known as “segmentation”.
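Segmentation can be sketched in a few lines of Python. The example below uses greedy longest matching against a dictionary, which is only one simple approach and not necessarily what any particular MT tool does. The tiny `LEXICON` is a hypothetical toy list I made up from the examples in this article, and joining the parts of a word with an underscore is just a convenient way to show the result.

```python
# Hypothetical toy lexicon of multi-syllable words (an assumption for this
# sketch); a real system would use a large dictionary or a trained segmenter.
LEXICON = {"hôm nay", "sinh nhật", "anh em"}
MAX_SYLLABLES = max(len(entry.split()) for entry in LEXICON)


def segment(sentence: str) -> list[str]:
    """Greedy longest-match word segmentation over whitespace-separated syllables."""
    syllables = sentence.split()
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, then shrink down to one syllable.
        for size in range(min(MAX_SYLLABLES, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + size])
            # A single syllable is always accepted as a fallback.
            if size == 1 or candidate.lower() in LEXICON:
                words.append(candidate.replace(" ", "_"))
                i += size
                break
    return words


sentence = "hôm nay là sinh nhật của tôi"
print(sentence.split())   # units between white spaces
print(segment(sentence))  # units after segmentation
```

Splitting on white space alone gives seven units for this sentence, while the segmented version has five, with “hôm nay” and “sinh nhật” each kept together as one word. A greedy dictionary lookup like this cannot resolve ambiguous cases, which is one reason real segmenters are more sophisticated.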

Languages like Vietnamese are called isolating languages. In written Vietnamese, spaces separate syllables rather than whole words, so word boundaries are not marked by spaces alone. Such languages are harder to translate with NMT due to some specific characteristics: monosyllabic units, the lack of inflected word forms, and other grammatical specifics. The word segmentation (WS) explained above is the first step in the process of machine translation for these languages.

In conclusion

NMT and MT technologies are changing and developing tremendously, and there is no denying that. The world is moving faster and faster, which creates a need for fast and efficient communication, especially in tourism, e-commerce, and any other industry searching for new opportunities and new markets. But NMT and MT should be used wisely and carefully, with the opportunities they offer in mind but also with an awareness of their limitations. Without a doubt, that holds for Vietnamese machine translation as well.

The Vietnamese market is surely there for the taking. One only has to speak the proper language to succeed there and make the most of it. What do you think?