Challenges When Developing NLP for Vietnamese

Vietnamese is the official language of Vietnam and is spoken by around 75 million people worldwide. The language has developed over many centuries: it borrowed heavily from Chinese, and in the 17th century its writing was Romanized by a Jesuit missionary. Today, the Vietnamese alphabet contains 29 letters, and the writing system uses nine diacritics. Five of these mark tone, while the remaining four distinguish separate letters of the alphabet. The tonal system adds further complexity. Vietnamese has six basic tones, two more than Mandarin Chinese, and these tones are pronounced differently depending on the region of Vietnam. As a result, natural language processing (NLP) for Vietnamese faces particular challenges that need further exploration if English to Vietnamese and Vietnamese to English translations are to become more accurate. Wondering what some of these challenges are? Let’s take a look below.
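The split between tone marks and letter-forming diacritics can be made concrete with Unicode decomposition. The sketch below is only an illustration: it assumes the five tone marks correspond to the combining grave, acute, tilde, hook above, and dot below characters, and the function name `split_tone` is hypothetical.

```python
import unicodedata

# Combining characters assumed to encode tone; the sixth (level) tone is unmarked.
TONE_MARKS = {
    "\u0300": "grave",
    "\u0301": "acute",
    "\u0303": "tilde",
    "\u0309": "hook above",
    "\u0323": "dot below",
}


def split_tone(syllable):
    """Return (syllable without its tone mark, tone mark name or None)."""
    # NFD splits each letter into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFD", syllable)
    tone = None
    kept = []
    for ch in decomposed:
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            kept.append(ch)  # base letters and letter-forming diacritics stay
    # NFC recomposes what is left, so the circumflex on "ê" survives.
    return unicodedata.normalize("NFC", "".join(kept)), tone


if __name__ == "__main__":
    print(split_tone("Việt"))
    print(split_tone("ma"))
```

Removing only the tone mark leaves letters such as ê and ơ intact, which is why the two groups of diacritics have to be handled separately.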

Producing accurate translations with NLP is hard for any language, and building the algorithms and software behind it is a complex task in its own right. English to Vietnamese translation adds difficulties of its own, for human translators and machines alike. Here are a few of them.

Vietnamese Word Segmentation

Let’s begin with the basics. A word is a linguistic unit made up of one or more morphemes, and word segmentation is the process by which a computer program determines the word boundaries in a sentence or document. In written Vietnamese, spaces separate syllables rather than words, so a single word can span several space-separated units and its boundaries cannot simply be read off the whitespace. Word segmentation is therefore one of the first steps to consider in NLP for Vietnamese, and getting it wrong can make the rest of the translation nonsensical or inaccurate.
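One simple way to picture word segmentation is greedy longest matching against a dictionary. The sketch below is a toy baseline under stated assumptions, not how any particular Vietnamese segmenter works: the three-entry lexicon and the function name `segment` are hypothetical, and multi-syllable words are joined with underscores purely for display.

```python
import unicodedata


def _key(syllable):
    """Normalize a syllable so lookups ignore case and Unicode form."""
    return unicodedata.normalize("NFC", syllable).lower()


# Hypothetical toy lexicon of multi-syllable words; a real system would load a dictionary.
LEXICON = {
    tuple(_key(s) for s in word.split())
    for word in ["sinh viên", "tiếng Việt", "Việt Nam"]
}
MAX_WORD_LEN = max(len(entry) for entry in LEXICON)


def segment(sentence):
    """Greedy longest matching: at each position take the longest lexicon entry."""
    syllables = sentence.split()
    words = []
    i = 0
    while i < len(syllables):
        longest = min(MAX_WORD_LEN, len(syllables) - i)
        for n in range(longest, 0, -1):
            candidate = tuple(_key(s) for s in syllables[i:i + n])
            # A single syllable is always accepted as a fallback.
            if n == 1 or candidate in LEXICON:
                words.append("_".join(syllables[i:i + n]))
                i += n
                break
    return words


if __name__ == "__main__":
    print(segment("Sinh viên học tiếng Việt"))
```

Greedy matching picks the wrong boundary whenever two dictionary entries overlap, which is one reason segmentation errors can propagate into later steps.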

Part-of-Speech (POS) Tagging

Next up is POS tagging. In NLP, POS tagging means assigning each word in a sentence its part of speech (noun, verb, adjective, and so on) based on the word itself and its context. Vietnamese makes this hard because the same word can fill different roles. For example, a single Vietnamese sentence that can be read as “The old man walks too fast” can also mean “The father walks too fast”, “The old man died too fast”, “My father died too fast”, “You get old too fast”, “Grandfather gets old too fast”, and more. This ambiguity has to be resolved before an accurate translation can be made.
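To see how quickly such ambiguity multiplies, the sketch below enumerates every tag sequence allowed by a small table of per-word options. Everything in it is an assumption made for illustration: the sentence “Ông già đi nhanh quá” is taken here to be the Vietnamese sentence behind the example, and the tag options are hypothetical rather than drawn from an annotated corpus.

```python
from itertools import product

# Each token with its assumed possible tags (illustrative only).
SENTENCE = [
    ("Ông", ["NOUN", "PRON"]),   # e.g. "grandfather" vs. "you"
    ("già", ["ADJ", "VERB"]),    # e.g. "old" vs. "get old"
    ("đi", ["VERB", "PART"]),    # e.g. "walk" or "die" vs. a particle
    ("nhanh", ["ADV"]),
    ("quá", ["ADV"]),
]


def candidate_taggings(sentence):
    """Return every combination of tags the option table allows."""
    tokens = [token for token, _ in sentence]
    options = [tags for _, tags in sentence]
    return [list(zip(tokens, combo)) for combo in product(*options)]


if __name__ == "__main__":
    taggings = candidate_taggings(SENTENCE)
    print(len(taggings), "candidate tag sequences")
    for tagging in taggings:
        print(" ".join(f"{tok}/{tag}" for tok, tag in tagging))
```

A tagger has to choose one sequence from this set using context, and the number of candidates grows multiplicatively with every ambiguous word.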

Syntactic Parsing

Following POS tagging is the challenge of syntactic parsing, which deals with the syntactic structure of a sentence. According to sources, “the word ‘syntax’ refers to the grammatical arrangement of words in a sentence and their relationship with each other. The objective of syntactic analysis is to find the syntactic structure of a sentence which is usually depicted as a tree.” The basic word order of Vietnamese is similar to that of English: Subject + Verb + Object. NLP for Vietnamese needs to take this structure into account when translating, too.
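A syntax tree of this kind can be represented as nested tuples. The sketch below hand-builds a tree for a short Subject + Verb + Object sentence and prints it in bracket notation. It does not parse anything: the sentence “Tôi học tiếng_Việt” (“I study Vietnamese”), the labels, and the helper names are all assumptions chosen for illustration.

```python
# A node is either a word (str) or a tuple: (label, child, child, ...).
TREE = (
    "S",
    ("NP", "Tôi"),                   # subject
    ("VP",
        ("V", "học"),                # verb
        ("NP", "tiếng_Việt")),       # object, already segmented as one word
)


def to_brackets(node):
    """Render a tree in bracket notation, e.g. (S (NP ...) (VP ...))."""
    if isinstance(node, str):
        return node
    label, *children = node
    return "(" + label + " " + " ".join(to_brackets(c) for c in children) + ")"


def leaves(node):
    """Return the words of the sentence in left-to-right order."""
    if isinstance(node, str):
        return [node]
    _, *children = node
    return [leaf for child in children for leaf in leaves(child)]


if __name__ == "__main__":
    print(to_brackets(TREE))
    print(leaves(TREE))
```

Note that the leaves are segmented words, not raw syllables, so the parse depends on the segmentation step being right.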

Named-Entity Recognition (NER)

Named-entity recognition is another aspect of NLP that must be taken into account when translating Vietnamese, whether from English to Vietnamese or Vietnamese to English. Essentially, NER identifies items such as the names of people, organizations, and locations, along with times, quantities, monetary values, percentages, and more. These entities give the system more context about the text, which supports a logical and accurate translation. One example sentence that illustrates this point is: “Ousted XYZ founder John Jones sells London penthouse for £10 million”. This sentence contains an organization (XYZ), a person (John Jones), a location (London), and a monetary value (£10 million), and each of these components contributes to its meaning. However, NER is not always straightforward, and NLP software needs accurate NER inputs to yield the desired result.
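The entities in the example sentence can be picked out with a very small rule-based tagger. This is a minimal sketch, assuming a hand-written name list and a single pattern for pound amounts. The labels (ORG, PER, LOC, MONEY) and the function name `tag_entities` are illustrative choices, and practical NER systems are typically trained on annotated data rather than written by hand.

```python
import re

# Hypothetical gazetteer covering only the names in the example sentence.
GAZETTEER = {"XYZ": "ORG", "John Jones": "PER", "London": "LOC"}

# Pound amounts such as "£10 million" or "£2.5".
MONEY = re.compile(r"£\d+(?:\.\d+)?(?: (?:million|billion))?")


def tag_entities(text):
    """Return (entity text, label) pairs in order of appearance."""
    found = []
    for name, label in GAZETTEER.items():
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((match.start(), match.group(), label))
    for match in MONEY.finditer(text):
        found.append((match.start(), match.group(), "MONEY"))
    return [(entity, label) for _, entity, label in sorted(found)]


if __name__ == "__main__":
    sentence = "Ousted XYZ founder John Jones sells London penthouse for £10 million"
    for entity, label in tag_entities(sentence):
        print(f"{label:6} {entity}")
```

A list-based approach like this only finds names it already knows, which is why the quality of the NER inputs matters so much.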

Coreference Resolution (CR)

Coreference resolution (CR) is closely related to NER. Within a sentence or document, pronouns are commonly used to refer back to an entity instead of repeating its name several times. For example, one would not say “John Jones is selling John Jones’ penthouse” but rather “John Jones is selling his penthouse”, and a system must work out that “his” refers to John Jones in order to translate the sentence accurately. In Vietnamese, however, CR has received very little attention from the NLP community. In fact, it appears that only two researchers have so far addressed CR alongside NER. This points to another challenge for NLP for Vietnamese: there is simply too little annotated data to yield better results.
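The pronoun example can be sketched with a deliberately naive rule: link each pronoun to the most recent person mentioned before it. This is a toy heuristic for illustration only, not a method from the Vietnamese NLP literature. The pronoun list, the pre-joined token `John_Jones`, and the function name `resolve_pronouns` are assumptions.

```python
# Small, incomplete pronoun list used only for this example.
PRONOUNS = {"he", "his", "him", "she", "her", "hers"}


def resolve_pronouns(tokens, person_names):
    """Map the index of each pronoun to the nearest preceding person mention."""
    links = {}
    last_person = None
    for index, token in enumerate(tokens):
        if token in person_names:
            last_person = token
        elif token.lower() in PRONOUNS and last_person is not None:
            links[index] = last_person
    return links


if __name__ == "__main__":
    tokens = ["John_Jones", "is", "selling", "his", "penthouse"]
    print(resolve_pronouns(tokens, {"John_Jones"}))
```

The rule depends on person mentions having been recognized first, which shows how closely CR leans on the NER step.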

Other Challenges

NLP for Vietnamese must also contend with the Vietnamese writing system and a lack of language resources. For example, some sources state that modern dictionaries define approximately 40,000 to 50,000 Vietnamese words. This, coupled with the fact that a single Vietnamese word is often written as several units separated by spaces, makes translation between English and Vietnamese much more difficult.

NLP for Vietnamese Is Making Progress

Despite the challenges outlined above, research is slowly but surely making progress, both in identifying the difficulties of NLP for Vietnamese and in addressing the intricacies of the language when developing software and algorithms for more accurate translation. Several studies have found that hybrid algorithms can address these challenges with relatively high accuracy. Nevertheless, a lack of resources still holds the field back. In the meantime, translators who are aware of these difficulties will be able to render the language more faithfully. In addition, as technology advances and more research is carried out in this field, we are likely to see better NLP for Vietnamese in the future.
