Data Annotation for Chinese Language

It is no secret that the need for automation, AI and modern technologies is what pulls our ever-growing industry forward. At the same time, these tools may not work well unless they are used expertly. Machine translation keeps growing, and along with it data annotation has emerged as a niche service that might tempt most companies to give it a try.
Over the past few years, AI has been applied with great success to many languages but has been found lacking in others, especially Asian languages.
However, smart technologies and science always find a way, and new technologies and modern approaches are doing wonders for machine translation and its quality.

For this article, we’ve chosen to talk about the hot topic of data annotation for the Chinese language, from the point of view of a company that was already involved in new technologies back when the first corpora for Asian languages were built.

What is data annotation?

Data annotation is exactly what it sounds like: the process of identifying and labeling information so that it can later be used for AI purposes. When it comes to languages, it is the process of categorizing and tagging the different parts of a language so that they can later be used in practice. There are various kinds of language annotation: POS (part-of-speech), phonetic, semantic and stylistic annotation, to name just a few.
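To make the idea more concrete, here is a minimal sketch of what an annotated sentence could look like as data. It is only an illustration: the AnnotatedToken class, its field names and the to_tagged_string helper are hypothetical names chosen for this example, the tag names follow the Universal Dependencies part-of-speech labels, and real annotation projects define their own schemas and tag sets.

```python
from dataclasses import dataclass


@dataclass
class AnnotatedToken:
    """One word of a sentence with two annotation layers (hypothetical schema)."""
    text: str    # the word as written
    pos: str     # part-of-speech layer (Universal Dependencies tag names assumed)
    pinyin: str  # phonetic layer, including tone marks


# "我爱北京" ("I love Beijing"), segmented and labeled by hand for illustration.
sentence = [
    AnnotatedToken("我", "PRON", "wǒ"),
    AnnotatedToken("爱", "VERB", "ài"),
    AnnotatedToken("北京", "PROPN", "Běijīng"),
]


def to_tagged_string(tokens):
    """Render tokens in the common word/TAG plain-text convention."""
    return " ".join(f"{t.text}/{t.pos}" for t in tokens)


print(to_tagged_string(sentence))
```

Each field corresponds to one of the annotation types mentioned above. Further layers, such as semantic or stylistic labels, could be added as extra fields in the same way.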

What does it have to do with language?

If you noticed the types of annotation described above, you already know the answer. Data annotation is done by people, and who knows the intricacies of a language better than a translator or an experienced linguist? This is where LSPs (language service providers) come in as a natural partner for providing the service.

Now add the Chinese language: its two scripts (Simplified and Traditional), its hundreds of dialects and the fact that it is a tonal language.

There are numerous cases in which annotators have to decide on the meaning of a character based on their linguistic knowledge, so that a sentence is annotated correctly when building the corpus for a Chinese language processing tool. This requires substantial knowledge of the language itself, especially its syntax. Understanding the language, its sentence structure and its punctuation is a key factor in creating a quality database.

Written Chinese has no spaces between words and phrases, so word segmentation is hard. It is also difficult to deal with words that can serve as different parts of speech without changing their form.
Another challenge is that Chinese uses tones and honorifics, which are very difficult for an algorithm to assess.
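The word segmentation problem can be shown with a small sketch. The code below is not how any particular production tool works; it is a toy dictionary-based segmenter using greedy maximum matching, run once from the left and once from the right. The lexicon, the sentence and the function names are assumptions made up for this illustration.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation: take the longest lexicon word at each step."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


def backward_max_match(text, lexicon, max_len=4):
    """Greedy right-to-left segmentation: take the longest lexicon word ending at each step."""
    tokens = []
    j = len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            candidate = text[j - size:j]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                j -= size
                break
    tokens.reverse()
    return tokens


# Toy lexicon (an assumption for this example, not a real dictionary).
TOY_LEXICON = {"研究", "研究生", "生命", "起源"}
SENTENCE = "研究生命起源"  # intended reading: "research the origin of life"

print(forward_max_match(SENTENCE, TOY_LEXICON))   # ['研究生', '命', '起源']
print(backward_max_match(SENTENCE, TOY_LEXICON))  # ['研究', '生命', '起源']
```

The same six characters are split in two different ways. Reading from the left, the segmenter grabs 研究生 ("graduate student") and leaves a stray 命, while reading from the right it finds 研究 / 生命 / 起源, which is the intended meaning. Deciding which split is correct takes knowledge of the language, which is exactly what a human annotator contributes.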

Where are we heading?

There is a long road ahead before human-quality translation in Chinese can be achieved. At 1-StopAsia, we are very eager to see where new technologies are heading. With regard to Asian languages in general, besides standard human translation services, we have so far taken the middle ground when it comes to modern technologies: depending on the client’s needs, when machine translation is requested we can add a post-editing step to improve the final result.