What is the single most important element of every translation or localization workflow? One process cannot be eliminated from any translation or localization task: evaluation. It becomes even more important when machine translation (MT) is involved.
For human evaluation, there are several strategies that can be employed:
Error identification is the most exhaustive of the approaches, as it identifies and locates every error in the text. It is considered the most objective approach thanks to its pre-established error typology, which typically covers grammar, punctuation, terminology and style. However, it is also the most time-consuming and requires the most highly trained and qualified evaluators.
Scales, on the other hand, offer a more global view of quality. Instead of identifying specific errors, evaluators assess the overall quality of each sentence according to a particular attribute, such as fluency or adequacy.
Ranking, the third strategy, aims to speed up human evaluation and reduce the cognitive effort involved. It is particularly useful when comparing translations (and systems), but it doesn't provide any information on the actual quality of each sentence.
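To make the three strategies a little more concrete, the sketch below shows how their outputs might be recorded. The record types and field names are purely illustrative assumptions, not part of any particular evaluation tool; real workflows rely on richer error typologies (such as MQM or DQF) and more metadata.

```python
from dataclasses import dataclass

# Illustrative records only; real tools track far more detail.

@dataclass
class ErrorAnnotation:          # error identification
    segment_id: int
    category: str               # e.g. "grammar", "terminology"
    start: int                  # character offset where the error begins
    end: int                    # character offset where it ends

@dataclass
class ScaleJudgment:            # scales
    segment_id: int
    attribute: str              # e.g. "fluency", "adequacy"
    score: int                  # e.g. 1 (worst) to 5 (best)

@dataclass
class RankingJudgment:          # ranking
    segment_id: int
    ranked_systems: list        # system names, best first

judgments = [
    ErrorAnnotation(1, "terminology", start=10, end=17),
    ScaleJudgment(1, "fluency", score=4),
    RankingJudgment(1, ranked_systems=["system_B", "system_A"]),
]
```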
Apart from these, the industry devises its own indirect MT assessment methods, such as usability tests. These methods are geared towards measuring the usefulness of a document and user satisfaction with it. In essence, they do not necessarily target the linguistic quality of a text but rather evaluate its overall acceptance by the target audience. Customer feedback received by the support service or collected through like/dislike ratings on web pages can also be indicative of the quality or usefulness of a piece of text.
Much as the different evaluation strategies try to overcome its frailties, human evaluation is still criticized for being subjective, inconsistent, time-consuming and expensive.
Obviously, each evaluator has an individual type of expertise depending on their training, experience, familiarity with MT and personal opinions about MT, all of which play a vital role in their quality judgments. Yet humans are still the most reliable source of meaningful, informative evaluations. Users are also human, after all.
There is a long list of automatic evaluation metrics to choose from: BLEU, NIST, METEOR, GTM, TER, Levenshtein edit-distance, confidence estimation and so on.
The main working principle of automatic metrics is to calculate how similar a machine-translated sentence is to a human reference translation or to previously translated data. The assumption is that the smaller the difference, the better the quality.
For example, BLEU, NIST, GTM and the like calculate this similarity by counting how many words the MT output shares with the reference, rewarding long sequences of shared words. TER and the original edit distance are more task-oriented and try to capture the work needed to bring MT output up to human translation standards; they seek to correlate with post-editing activity. They measure the minimum number of additions, deletions, substitutions and re-orderings required to transform the MT output into the reference human translation.
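As a rough illustration of these two families of metrics (not the actual BLEU or TER implementations, which add brevity penalties, shift operations and other refinements), the sketch below computes a simple clipped bigram precision and a word-level edit distance between an MT output and a reference.

```python
from collections import Counter

def ngram_precision(hypothesis, reference, n=2):
    """Fraction of hypothesis n-grams that also appear in the reference
    (clipped counts, loosely in the spirit of BLEU's modified precision)."""
    hyp, ref = hypothesis.split(), reference.split()
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum(min(count, ref_ngrams[g]) for g, count in hyp_ngrams.items())
    total = sum(hyp_ngrams.values())
    return overlap / total if total else 0.0

def word_edit_distance(hypothesis, reference):
    """Minimum number of word insertions, deletions and substitutions needed
    to turn the hypothesis into the reference (TER additionally counts shifts)."""
    hyp, ref = hypothesis.split(), reference.split()
    dist = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        dist[i][0] = i
    for j in range(len(ref) + 1):
        dist[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[-1][-1]

mt_output = "the cat sat in the mat"
reference = "the cat sat on the mat"
print(ngram_precision(mt_output, reference))    # 0.6 (3 of 5 bigrams shared)
print(word_edit_distance(mt_output, reference)) # 1 (one substitution)
```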
Lastly, confidence estimation works by obtaining data from the MT system and learning, from previous data, features that relate the source to the translation.
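A minimal sketch of that idea, under invented assumptions, might look like the following: a handful of hand-crafted source/translation features and a few past translations with human quality scores are used to train a simple regressor that can then score new translations without a reference. The features, example sentences and scores are made up for illustration; real confidence (quality) estimation systems use far richer signals, including internals of the MT system itself.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def features(source, translation):
    """Toy sentence-level features relating source and translation."""
    src, tgt = source.split(), translation.split()
    return [
        len(tgt) / max(len(src), 1),                                    # length ratio
        sum(w.isdigit() for w in tgt) - sum(w.isdigit() for w in src),  # number mismatch
        abs(source.count(",") - translation.count(",")),                # punctuation mismatch
    ]

# Hypothetical past translations with human quality scores (1-5).
past_data = [
    ("Das Haus ist groß .", "The house is big .", 5.0),
    ("Er hat 3 Katzen .", "He has cats .", 2.5),
    ("Wie geht es dir ?", "How are you ?", 4.5),
]
X = np.array([features(src, tgt) for src, tgt, _ in past_data])
y = np.array([score for _, _, score in past_data])

model = LinearRegression().fit(X, y)

# Predict a quality score for a new translation, no reference needed.
new_score = model.predict([features("Ich habe 2 Hunde .", "I have two dogs .")])
print(new_score)
```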
Automatic metrics emerged to address the need for objective, consistent, quick and cheap evaluations. The ideal metric has been described by the developers of METEOR, Banerjee and Lavie, as fully automatic, low cost, tunable, consistent and meaningful.
No matter how many times you run the same automatic metric on the same data, scores will be consistent, which is an advantage over the subjective nature of human evaluation. The question with automatic metrics is how to calculate a quality attribute. What does quality mean in the language of computers? The algorithm used is objective, yes, but what is it calculating?
The solution so far has been to come up with an algorithm, whatever it does, that correlates with human responses. Automatic metrics are certainly quicker than human evaluation: hundreds of sentences can be scored at the click of a mouse. Nevertheless, they also take a lot of preparation. Metrics need to be trained on similar data and/or require reference translations for each of the sentences you want to score. That in itself can be costly and time-consuming.
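In practice, "correlates with human responses" usually comes down to computing a correlation coefficient between metric scores and human judgments over the same set of sentences, along the lines of the sketch below. The numbers are invented for illustration only.

```python
import numpy as np

# Hypothetical scores for the same five sentences: an automatic metric
# (0-1 similarity) versus averaged human judgments (1-5 adequacy).
metric_scores = [0.72, 0.41, 0.88, 0.35, 0.60]
human_scores = [4.0, 2.5, 4.5, 2.0, 3.5]

# Pearson correlation: how closely the metric tracks human judgments.
r = np.corrcoef(metric_scores, human_scores)[0, 1]
print(f"Pearson r = {r:.2f}")
```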
The diversification of content types and the rapid adoption of translation technologies (including machine translation) drive the need for more dynamic and reliable methods of quality evaluation. The TAUS Quality Dashboard provides the objective and neutral platform needed to evaluate translation quality. For more information about how to measure translation quality more efficiently: