A quick quiz: what’s a fork, really?
- A. A small metal tool to bring food to one's mouth
- B. A vehicle used for loading, transporting, and unloading goods
- C. A software application whose codebase duplicates and modifies an existing one
For most of us, the first association is answer A, cutlery. But once we read the other answers, we quickly realize that the concept has, well, forked into other meanings. It is a fascinating example of a strong metaphor shaping language, but one that doesn’t necessarily exist in the same way in every language.
Outside of the 'home & kitchen' domain, a fork can have very different translations.
Why generic QE models work, until they don’t
By default, TAUS EPIC uses our generic QE model trained on multiple languages and domains. Its training is particularly strong in areas such as legal, business, commerce, healthcare, and IT, which form the backbone of the localization industry.
But the generic model also has predictive power beyond the languages and domains it was initially trained on. Certain patterns of translation quality generalize well across domains and languages, especially because the underlying architecture is largely language-agnostic.
The further you move away from that core training, however, the more often the model encounters patterns that no longer resemble its training data, and its quality estimates become less reliable. This can happen with metaphors and homonyms whose meaning depends on context, such as the fork example above. But it's not just domain or content type that matters here: stylistic translation preferences affect quality estimation as well.
To understand this better, let's look at some examples. The TAUS QE generic model was trained to, among other things, assess translation accuracy. This includes expectations such as symmetry in punctuation between source and target (taking into account the relevant language rules). It will also penalize cases where the target contains words or meanings that are not present in the source.
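To make that intuition concrete, here is a minimal sketch of what a punctuation-symmetry check could look like. This is purely illustrative: the TAUS QE model is a trained neural model, not a handwritten rule like this.

```python
# Illustrative only: a toy punctuation-symmetry heuristic.
# The TAUS QE model is a trained neural model, not a handwritten rule like this.
from collections import Counter

PUNCTUATION = set('.,;:!?()[]"\'')

def punctuation_mismatch(source: str, target: str) -> int:
    """Count how many punctuation marks differ between source and target."""
    src_counts = Counter(ch for ch in source if ch in PUNCTUATION)
    tgt_counts = Counter(ch for ch in target if ch in PUNCTUATION)
    return sum(abs(src_counts[ch] - tgt_counts[ch])
               for ch in src_counts.keys() | tgt_counts.keys())

# 0 means symmetric punctuation; higher values hint at added or dropped marks
# that an accuracy-oriented QE model might penalize.
print(punctuation_mismatch('Click "Save".', 'Klicken Sie auf Speichern'))  # prints 3
```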
In most cases, this behavior is entirely justified: a low score will lead a reviewer or post-editor to take a closer look. Now consider a translation guideline that mandates copying software strings untranslated into the target text, followed by a translation in brackets and/or quotation marks, for example rendering Click "Save" in German as Klicken Sie auf "Save" (Speichern). From a generic QE perspective, this violates accuracy on multiple counts: the target contains source-language words as well as material that has no counterpart in the source.
Another problem for generic models is the frequent use of acronyms for certain processes, product lines, functions, and so on. Acronyms like these can be very local in scope, used only within one company or line of work. If the generic QE model comes across an unfamiliar acronym that has been expanded into a full form in the target language, it may interpret this as overtranslation.
When is choosing a custom Quality Estimation model the right move?
So when is the TAUS generic QE model sufficient, and when does a customized model make a real difference? As you might expect, these lines are blurry.
The generic model is free to try in the sandbox, and, in general, little effort is needed to start using it. Most users begin there to see how it performs on their language combinations and content. For run-of-the-mill translations in the major domains mentioned above and in common language pairs, such as English into major European or other widely used languages, its performance is often quite accurate.
To really evaluate whether the generic model performs well for your content, there are statistical methods to benchmark the QE scores against the human quality validation you are used to or expect. If these tests show insufficient correlation, that's a strong signal that your quality requirements fall outside the scope of the generic model.
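As a minimal sketch of what such a test can look like, the snippet below correlates QE scores with human judgments for the same segments. The numbers and the choice of Pearson and Spearman correlation are illustrative assumptions, not a prescribed TAUS procedure.

```python
# Illustrative benchmark: correlate QE scores with human quality judgments.
# The numbers below are invented; in practice you would score a representative
# sample of your own segments with both the QE model and human reviewers.
from scipy.stats import pearsonr, spearmanr

qe_scores    = [0.92, 0.40, 0.78, 0.65, 0.88]  # QE model output per segment
human_scores = [0.95, 0.30, 0.70, 0.75, 0.90]  # normalized reviewer ratings

pearson, _  = pearsonr(qe_scores, human_scores)    # linear correlation
spearman, _ = spearmanr(qe_scores, human_scores)   # rank correlation

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
# Consistently low correlation on a representative sample is a signal that
# your content may fall outside the generic model's scope.
```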
Common reasons for this are content covering more specialized domains or niches, less common language locales, or particular style rules and terminologies.
For example, we recently created a high-performing customized model for an e-commerce website with a catalog covering a rather narrow set of product types, but a long list of possible product specifications and a highly deterministic way of translating them. The combination of a relatively niche domain and strict requirements called for this type of QE model. The fork example that opened this blog was inspired by another customized model, based on translations in the agricultural domain.
What goes into training a custom QE model?
Once you've decided to go for the custom solution, what are the next steps?
When we start training a custom model, we begin by getting to 'know' your translations. What do you consider good translations? And what do you consider bad translations? We refer to these as approved and disapproved data. Both types are valuable sources for the training process.
You probably already have an abundance of approved data in your translation memory. Ideally, this data is available in the standard translation memory exchange format, TMX files, but other formats can be processed as well. If there is a consistent structure in your TM, we will probably find a way to extract the data sentence by sentence. What matters most is that the domain and content type are relevant to the model, and that the data is not polluted and is genuinely approved, meaning that you have actually used it in real translations.
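To illustrate what sentence-by-sentence extraction from a TMX file can look like, here is a rough sketch using only the Python standard library. The file name and language codes are assumptions, and real TMX exports vary per tool.

```python
# A rough sketch of sentence-by-sentence extraction from a TMX file using only
# the Python standard library. The file name and language codes are assumptions;
# real TMX exports vary per tool.
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"  # attribute key for xml:lang

def extract_pairs(tmx_path: str, src_lang: str, tgt_lang: str):
    """Yield (source, target) segment pairs from a TMX translation memory."""
    tree = ET.parse(tmx_path)
    for tu in tree.getroot().iter("tu"):
        segments = {}
        for tuv in tu.iter("tuv"):
            lang = tuv.get(XML_LANG, "").lower()
            seg = tuv.find("seg")
            if seg is not None and seg.text:
                segments[lang] = seg.text.strip()
        if src_lang in segments and tgt_lang in segments:
            yield segments[src_lang], segments[tgt_lang]

for src, tgt in extract_pairs("translation_memory.tmx", "en-us", "nl-nl"):
    print(src, "=>", tgt)
```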
While this type of data is usually our biggest pool of information, we also know that translation memories are not always clean or well structured. So a big part of our preparation for model training involves rigorous data cleaning. We apply a large set of tools and algorithms to discard anything that looks suspicious or contradictory. It's a strict process, but very much needed to create the noise-free data set that a reliable model depends on.
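To give a feel for what such cleaning involves, here is a simplified sketch of a few generic filters. The thresholds are arbitrary illustrations, not the actual TAUS pipeline, which is considerably more extensive.

```python
# Simplified examples of cleaning filters over (source, target) pairs.
# The thresholds are arbitrary illustrations, not the actual TAUS pipeline.
def looks_clean(source: str, target: str) -> bool:
    if not source.strip() or not target.strip():
        return False                          # empty segments
    if source.strip() == target.strip():
        return False                          # identical source/target: often noise
    ratio = len(target) / max(len(source), 1)
    if ratio < 0.3 or ratio > 3.0:
        return False                          # implausible length ratio
    return True

def clean(pairs):
    """Drop exact duplicates and segments that fail the basic checks."""
    seen = set()
    for src, tgt in pairs:
        key = (src.strip(), tgt.strip())
        if key in seen:
            continue
        seen.add(key)
        if looks_clean(src, tgt):
            yield src, tgt
```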
1. Pre- and post-edit data: gold for QE
Even more valuable than translation memory data are translations with a recorded editing history: pre- and post-edit translations that show how translations got corrected. These are very clear markers of your company's translation effort, and give the best insight into your quality requirements.
This data isn't always available, but if you work with XLIFF files, there's a good chance it can be recovered. If not, TAUS data engineers can usually find other, more creative ways to retrieve it. During training, we usually assign extra weight to this kind of pre-/post-edit data.
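As a rough sketch of what such recovery can look like, the snippet below assumes XLIFF 1.2 files where a pre-edit translation (for example raw machine translation) sits in an alt-trans element and the post-edited text in the target element. Real XLIFF exports differ per tool, so treat this as illustrative only.

```python
# A rough sketch, assuming XLIFF 1.2 files where a pre-edit translation (e.g.
# raw machine translation) sits in <alt-trans> and the post-edited text in
# <target>. Real XLIFF exports differ per tool; this is illustrative only.
import xml.etree.ElementTree as ET

NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}

def extract_edits(xliff_path: str):
    """Yield (source, pre_edit, post_edit) triples where all three are present."""
    root = ET.parse(xliff_path).getroot()
    for unit in root.iter("{urn:oasis:names:tc:xliff:document:1.2}trans-unit"):
        source = unit.find("x:source", NS)
        target = unit.find("x:target", NS)
        alt    = unit.find("x:alt-trans/x:target", NS)
        if source is not None and target is not None and alt is not None:
            yield (source.text or "", alt.text or "", target.text or "")
```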
2. Terminology, style, and process knowledge
If you use terminology, it becomes an important part of our training routine as well. Glossaries allow us to check your data's compliance with terminology requirements and to generate additional training examples. The same applies to do-not-translate lists: overviews of all terms or strings that should be left untranslated. A small sketch of the kind of checks these resources enable follows below.
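The snippet below sketches the kind of checks a glossary and a do-not-translate list make possible. The term pairs and strings are invented examples, not real customer data.

```python
# A minimal sketch of the checks a glossary and a do-not-translate list enable.
# The term pairs below are invented examples, not real customer data.
GLOSSARY = {"forklift": "heftruck"}   # required source -> target terms (EN -> NL)
DO_NOT_TRANSLATE = ["TAUS EPIC"]      # strings that must be copied verbatim

def compliance_issues(source: str, target: str) -> list[str]:
    issues = []
    for src_term, tgt_term in GLOSSARY.items():
        if src_term.lower() in source.lower() and tgt_term.lower() not in target.lower():
            issues.append(f"glossary term '{src_term}' not rendered as '{tgt_term}'")
    for dnt in DO_NOT_TRANSLATE:
        if dnt in source and dnt not in target:
            issues.append(f"do-not-translate string '{dnt}' missing from target")
    return issues

print(compliance_issues("Load the pallet with the forklift.",
                        "Laad de pallet met de heftruck."))  # prints []
```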
Any other information about your translation process can help us to improve our model. This can be a style guide, information about the translation engine you use, or details of your workflow in general.
Which challenges do you face in your translation process, and which things 'always' go wrong? It is this attention to detail that can make your model the perfect quality assistant.
3. How much data is enough?
Finally, I want to address the burning question of how much data should go into the training: how many translation segments are enough? There is no answer to this question without mentioning some trade-offs.
In general, more is better. Technically, 20k good segments might be sufficient to train a QE model. The trade-off is that we then need to supplement that corpus with our own data or synthetically created data, which potentially makes the model less tailored to your content. Furthermore, an ample margin has to be taken into account: in practice we often cannot use all the provided data, because inconsistencies, noise, or irrelevance mean that a portion is filtered out by our cleaning process as unfit for training.
Another key factor in assessing the data volume is how well the model extrapolates to your real-life translations. If your translation projects cover a narrow domain with limited variation, a smaller training set is probably fine. Wider or more diverse domains call for a larger dataset that covers all of them.
One important thing to keep in mind: models trained on smaller training sets tend to give inconsistent estimates. They can be spot-on for cases close to those in the training set, but for a slight variation the estimate can be completely off.
Our experience is that the strongest (best performing) customized models are trained on over 100k segments of high-quality, approved data. With smaller amounts, it's important to keep the trade-offs and risks mentioned above in mind.
During the preparation of the customized model, we will get back to you with feedback on the training data or updates on the progress whenever needed. The customized model is delivered together with a report that documents the focus points of the training and the results of the test cycle.
Trying, testing, and sharing feedback is essential to determine whether further fine-tuning is needed to improve the model. After that, the final model is delivered for production.
From onboarding to impact
During the onboarding process, we review the quality goals and expectations, available training data, and any practical constraints or challenges together. This is also the moment to align expectations around training timelines, evaluation, and next steps. A tight and transparent collaboration in this early phase is the best recipe for building a custom QE model that truly supports your translation process.
Contact us to explore a Quality Estimation model tailored to your content.