Machine Translation Quality Assessment

The world of translation has never been an easy one to navigate, and it has only grown more complex over the last few decades. One outgrowth of that complexity is Machine Translation (MT), which, as the name implies, takes raw text in a source language as input and produces a translation in the target language. You will already be familiar with Google Translate, which offers a translation service built on MT. What might be less familiar is the acronym BLEU. Although it sounds just like the French “bleu”, don’t be misled into thinking it’s a color. BLEU stands for Bilingual Evaluation Understudy, and it grew out of the findings of a research group at IBM, which observed that human evaluation of MT is not only time-consuming but expensive, too. BLEU was proposed as a more affordable method to help people get quicker results.

But what is BLEU? It is essentially a score assigned to a translation by a relatively simple string-matching algorithm that “provides basic quality metrics for MT researchers and developers”. It has become one of the most widely used MT quality assessment metrics over the last 15 years and remains “a primary metric to measure MT system output even today”.

So, how does this score work? BLEU essentially measures “the degree of difference between human and machine translations”. It starts by comparing individual segments (typically sentences) against one or more human reference translations, and then provides an average value for the entire text. The closer the MT output comes to a human translation, the higher the BLEU score. A score of 1 means the translation is identical to a reference, while a score of 0 means the MT output shares no matches with it. Ultimately, the main goal is to produce translations with the “highest degree of accuracy, and not to imitate the provided references.”
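
To make the mechanics concrete, here is a minimal sketch in Python of how such a score can be computed for a single segment. It is not the official BLEU implementation (in practice a library such as NLTK or sacreBLEU would be used); it simply combines the two ingredients the metric is built on: clipped n-gram precision against the references and a brevity penalty for candidates that are too short. The sentences are invented for illustration.

```python
from collections import Counter
import math

def bleu(candidate, references, max_n=4):
    """Toy single-segment BLEU: geometric mean of modified n-gram
    precisions (n = 1..max_n) multiplied by a brevity penalty."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        if not cand_counts:                      # candidate shorter than n words
            return 0.0
        # Clip each candidate n-gram count by its highest count in any reference.
        max_ref_counts = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(count, max_ref_counts[gram])
                      for gram, count in cand_counts.items())
        precisions.append(clipped / sum(cand_counts.values()))

    if min(precisions) == 0:                     # any zero precision gives a score of 0
        return 0.0

    # Brevity penalty: punish candidates shorter than the closest reference.
    cand_len = len(candidate)
    ref_len = min((len(ref) for ref in references),
                  key=lambda length: (abs(length - cand_len), length))
    brevity_penalty = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)

    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)

reference = "the cat is on the mat".split()
print(bleu("the cat is on the mat".split(), [reference]))       # identical -> 1.0
print(bleu("a dog sleeps in the garden".split(), [reference]))  # little overlap -> 0.0
```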

Benefits of BLEU

At this point, you might be wondering what the benefits of BLEU are. We offer some of these below: 

  • It has reached wide levels of adoption
  • It has a high correlation with human evaluation
  • It is language independent
  • It is relatively quick to calculate
  • It is easy to understand
  • And it is more affordable than human evaluation

What can BLEU be used for?

Apart from translation, the BLEU score can be used for other language generation problems that rely on deep learning methods, including speech recognition, image caption generation, language generation, and text summarization, although this is not an exhaustive list. It can also be used to compare the quality of different MT systems in enterprise settings.
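
As a rough sketch of that last use case, the snippet below compares two hypothetical MT systems against the same reference translations using NLTK's corpus-level BLEU (this assumes the nltk package is installed; the sentences and system outputs are invented for illustration).

```python
from nltk.translate.bleu_score import corpus_bleu

# One tokenised reference translation per source sentence (illustrative data).
references = [
    [["the", "report", "was", "published", "yesterday"]],
    [["sales", "rose", "by", "ten", "percent", "in", "march"]],
]

# Hypothetical outputs from two MT systems for the same source sentences.
system_a = [
    ["the", "report", "was", "published", "yesterday"],
    ["sales", "increased", "ten", "percent", "in", "march"],
]
system_b = [
    ["the", "report", "appeared", "yesterday"],
    ["in", "march", "the", "sales", "rose", "by", "ten", "percent"],
]

# Corpus-level BLEU aggregates n-gram statistics over all segments,
# giving one score per system that can be compared directly.
print("System A:", corpus_bleu(references, system_a))
print("System B:", corpus_bleu(references, system_b))
```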

Some of the challenges related to the BLEU score

As a starting point, BLEU scores can be problematic for very short translations. If you were to translate only a single word or a pair of words, you would most probably get a very high BLEU score, even though this wouldn’t necessarily reflect quality. This is because “the tokens are nicely covered by the references.”

Next, the BLEU score only accounts for word order indirectly, through short n-gram matches, and does not model sentence structure as a whole. This can yield results that don’t make sense.
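
A small illustration with NLTK’s sentence-level BLEU (again assuming nltk is installed, with invented sentences): if only single words are matched, a scrambled sentence looks perfect, whereas the default 4-gram score punishes it.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "sat", "on", "the", "mat"]]
scrambled = ["mat", "the", "on", "sat", "cat", "the"]   # same words, no sensible order

smooth = SmoothingFunction().method1

# Judged on single words alone (unigram weights only), the scramble looks perfect.
print(sentence_bleu(reference, scrambled, weights=(1,)))

# The default 4-gram BLEU penalises it heavily, because longer n-grams only
# match when short runs of words appear in the right order.
print(sentence_bleu(reference, scrambled, smoothing_function=smooth))
```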

Thirdly, BLEU scores work best when there is a large set of sentences to work with. It therefore makes sense not to rely on a single instance, but to “check the performance on many sentences, and combine the scores for a more comprehensive and accurate evaluation of the model.”
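
As a sketch of what combining scores can look like in practice (still using NLTK and invented sentences), note that corpus-level BLEU pools the n-gram counts of all segments before computing a single score, which is not the same as averaging the per-segment scores.

```python
from nltk.translate.bleu_score import sentence_bleu, corpus_bleu, SmoothingFunction

smooth = SmoothingFunction().method1  # avoids zero scores on short, noisy segments

references = [
    [["the", "meeting", "starts", "at", "nine"]],
    [["please", "send", "the", "invoice", "by", "friday"]],
    [["the", "new", "model", "is", "faster", "than", "the", "old", "one"]],
]
hypotheses = [
    ["the", "meeting", "begins", "at", "nine"],
    ["please", "send", "the", "invoice", "by", "friday"],
    ["the", "new", "model", "runs", "faster", "than", "the", "previous", "one"],
]

# Per-segment scores fluctuate a lot on short sentences...
per_segment = [
    sentence_bleu(refs, hyp, smoothing_function=smooth)
    for refs, hyp in zip(references, hypotheses)
]
print("segment scores:", [round(score, 3) for score in per_segment])
print("mean of segments:", sum(per_segment) / len(per_segment))

# ...so BLEU is normally reported at corpus level, pooling the n-gram counts
# across all segments before computing one score.
print("corpus BLEU:", corpus_bleu(references, hypotheses, smoothing_function=smooth))
```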

Another issue relates to the difficulty of measuring the accuracy of a translation in the first place. This is also the case with human translations, which will often differ in interpretation depending on the translator. This begs the question: when is a translation 100% accurate?

Furthermore, BLEU is considered shortsighted: even a correctly translated sentence can receive a low score, depending on the human reference it is compared against. Conversely, moving a single word within a sentence can drastically change its meaning, yet BLEU does not weigh how serious such errors are.

In addition, BLEU only measures direct word-for-word similarity and the extent to which word clusters in two sentences are identical. An accurate translation that uses different words may receive a poor score simply because it doesn’t match the selected human reference.

BLEU also does not take paraphrases and synonyms into consideration, which ultimately means that scores can be misleading as a measure of overall accuracy.
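
The sketch below, again using NLTK’s sentence-level BLEU with invented sentences, shows the effect: a verbatim copy of the reference scores 1.0, while an adequate paraphrase scores close to zero because it shares almost no n-grams with the chosen reference.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference  = [["the", "committee", "approved", "the", "proposal", "unanimously"]]
literal    = ["the", "committee", "approved", "the", "proposal", "unanimously"]
paraphrase = ["the", "panel", "accepted", "the", "plan", "without", "objection"]

smooth = SmoothingFunction().method1

# The verbatim copy scores 1.0; the paraphrase scores close to 0, because
# BLEU only rewards surface n-gram overlap with the selected reference.
print(sentence_bleu(reference, literal, smoothing_function=smooth))
print(sentence_bleu(reference, paraphrase, smoothing_function=smooth))
```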

Where to from here?

According to Slator, in an interview with Google’s Markus Freitag, Google has been working to improve MT and the way BLEU is used, while addressing the challenges mentioned above.

The article goes on to cite research into having humans paraphrase standard references for use in automated MT evaluation, finding that the automated evaluation then correlates better with human judgement. Paraphrasing sidesteps the metric’s preference for monotonic translations that contain the same words as the reference, “resulting in a fairer assessment of alternative, equally good translations.”

In the study, professional linguists were asked to paraphrase reference translations as much as possible without seeing the source sentences. This included using different words as well as a different sentence structure, all while keeping the reference “a natural instance of the target language.”

A second group of professional linguists was then asked to rate the reference translations, both paraphrased and unparaphrased, in side-by-side evaluations; they too were not shown the source sentences.

As a result, it was found that the majority of these raters preferred the paraphrased reference translations, leading to the conclusion that the paraphrased references were of higher quality than the originals.