What is the single most important element of every translation or localization workflow? One process cannot be eliminated from any translation or localization task: evaluation. It becomes even more important when machine translation (MT) is involved.
For human evaluation, there are several strategies that can be employed:
Error identification is the most exhaustive of the approaches, as it identifies and locates every error in the text. It is considered the most objective approach thanks to its pre-established error typology, which typically covers grammar, punctuation, terminology and style. However, it is also the most time-consuming and requires the most highly trained and qualified evaluators.
Scales, on the other hand, offer a more global view of quality. Instead of identifying specific errors, evaluators assess the overall quality of each sentence according to a particular attribute, such as fluency or adequacy.
Ranking, the third strategy, aims to speed up human evaluation and reduce the cognitive effort involved. It is particularly useful when comparing translations (and systems), but it doesn't provide any information on the actual quality of each sentence.
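To make the three strategies a little more concrete, the sketch below shows how their outputs might be recorded. The record types and field names are purely illustrative assumptions, not part of any particular evaluation tool; real workflows rely on richer error typologies (such as MQM or DQF) and more metadata.

```python
from dataclasses import dataclass

# Illustrative records only; real tools track far more detail.

@dataclass
class ErrorAnnotation:          # error identification
    segment_id: int
    category: str               # e.g. "grammar", "terminology"
    start: int                  # character offset where the error begins
    end: int                    # character offset where it ends

@dataclass
class ScaleJudgment:            # scales
    segment_id: int
    attribute: str              # e.g. "fluency", "adequacy"
    score: int                  # e.g. 1 (worst) to 5 (best)

@dataclass
class RankingJudgment:          # ranking
    segment_id: int
    ranked_systems: list        # system names, best first

judgments = [
    ErrorAnnotation(1, "terminology", start=10, end=17),
    ScaleJudgment(1, "fluency", score=4),
    RankingJudgment(1, ranked_systems=["system_B", "system_A"]),
]
```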
Apart from these, the industry devises its own indirect MT assessment methods, such as usability tests. These methods are geared towards measuring the usefulness of a document and user satisfaction with it. In essence, they do not necessarily target the linguistic quality of a text but rather evaluate its overall acceptance by the target audience. Customer feedback received by the support service or collected through like/dislike ratings on web pages can also be indicative of the quality or usefulness of a piece of text.
Much as the different evaluation strategies try to overcome its frailties, human evaluation is still criticized for being subjective, inconsistent, time-consuming and expensive.
Obviously, each evaluator has an individual type of expertise depending on their training, experience, familiarity with MT and personal opinions about MT, all of which play a vital role in their quality judgments. Yet humans are still the most reliable source of meaningful, informative evaluations. Users are also human, after all.
There is a long list of automatic evaluation metrics to choose from: BLEU, NIST, METEOR, GTM, TER, Levenshtein edit-distance, confidence estimation and so on.
The main working principle of automatic metrics is to calculate how similar a machine-translated sentence is to a human reference translation or to previously translated data. The assumption is that the smaller the difference, the better the quality.
For example, BLEU, NIST, GTM and the like calculate this similarity by counting how many words the MT output shares with the reference, rewarding long sequences of shared words. TER and the original edit distance are more task-oriented and try to capture the work needed to bring MT output up to human translation standards; they seek to correlate with post-editing activity. They measure the minimum number of additions, deletions, substitutions and re-orderings required to transform the MT output into the reference human translation.
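As a rough illustration of these two families of metrics (not the actual BLEU or TER implementations, which add brevity penalties, shift operations and other refinements), the sketch below computes a simple clipped bigram precision and a word-level edit distance between an MT output and a reference.

```python
from collections import Counter

def ngram_precision(hypothesis, reference, n=2):
    """Fraction of hypothesis n-grams that also appear in the reference
    (clipped counts, loosely in the spirit of BLEU's modified precision)."""
    hyp, ref = hypothesis.split(), reference.split()
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum(min(count, ref_ngrams[g]) for g, count in hyp_ngrams.items())
    total = sum(hyp_ngrams.values())
    return overlap / total if total else 0.0

def word_edit_distance(hypothesis, reference):
    """Minimum number of word insertions, deletions and substitutions needed
    to turn the hypothesis into the reference (TER additionally counts shifts)."""
    hyp, ref = hypothesis.split(), reference.split()
    dist = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        dist[i][0] = i
    for j in range(len(ref) + 1):
        dist[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[-1][-1]

mt_output = "the cat sat in the mat"
reference = "the cat sat on the mat"
print(ngram_precision(mt_output, reference))    # 0.6 (3 of 5 bigrams shared)
print(word_edit_distance(mt_output, reference)) # 1 (one substitution)
```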
Lastly, confidence estimation works by obtaining data from the MT system and learning, from previous data, features that relate the source to the translation.
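A minimal sketch of that idea, under invented assumptions, might look like the following: a handful of hand-crafted source/translation features and a few past translations with human quality scores are used to train a simple regressor that can then score new translations without a reference. The features, example sentences and scores are made up for illustration; real confidence (quality) estimation systems use far richer signals, including internals of the MT system itself.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def features(source, translation):
    """Toy sentence-level features relating source and translation."""
    src, tgt = source.split(), translation.split()
    return [
        len(tgt) / max(len(src), 1),                                    # length ratio
        sum(w.isdigit() for w in tgt) - sum(w.isdigit() for w in src),  # number mismatch
        abs(source.count(",") - translation.count(",")),                # punctuation mismatch
    ]

# Hypothetical past translations with human quality scores (1-5).
past_data = [
    ("Das Haus ist groß .", "The house is big .", 5.0),
    ("Er hat 3 Katzen .", "He has cats .", 2.5),
    ("Wie geht es dir ?", "How are you ?", 4.5),
]
X = np.array([features(src, tgt) for src, tgt, _ in past_data])
y = np.array([score for _, _, score in past_data])

model = LinearRegression().fit(X, y)

# Predict a quality score for a new translation, no reference needed.
new_score = model.predict([features("Ich habe 2 Hunde .", "I have two dogs .")])
print(new_score)
```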
Automatic metrics emerged to address the need for objective, consistent, quick and cheap evaluations. The ideal metric has been described by the developers of METEOR, Banerjee and Lavie, as fully automatic, low cost, tunable, consistent and meaningful.
No matter how many times you run the same automatic metric on the same data, scores will be consistent, which is an advantage over the subjective nature of human evaluation. The question with automatic metrics is how to calculate a quality attribute. What does quality mean in the language of computers? The algorithm used is objective, yes, but what is it calculating?
The solution so far has been to come up with an algorithm, whatever it does, that correlates with human responses. Automatic metrics are certainly quicker than human evaluation: hundreds of sentences can be scored at the click of a mouse. Nevertheless, they also take a lot of preparation. Metrics need to be trained on similar data and/or require reference translations for each of the sentences you want to score. That in itself can be costly and time-consuming.
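In practice, "correlates with human responses" usually comes down to computing a correlation coefficient between metric scores and human judgments over the same set of sentences, along the lines of the sketch below. The numbers are invented for illustration only.

```python
import numpy as np

# Hypothetical scores for the same five sentences: an automatic metric
# (0-1 similarity) versus averaged human judgments (1-5 adequacy).
metric_scores = [0.72, 0.41, 0.88, 0.35, 0.60]
human_scores = [4.0, 2.5, 4.5, 2.0, 3.5]

# Pearson correlation: how closely the metric tracks human judgments.
r = np.corrcoef(metric_scores, human_scores)[0, 1]
print(f"Pearson r = {r:.2f}")
```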
The diversification of content types and the rapid adoption of translation technologies (including machine translation) drive the need for more dynamic and reliable methods of quality evaluation. The TAUS Quality Dashboard provides the objective and neutral platform needed to evaluate translation quality. For more information about how to measure translation quality more efficiently: