LLMs' limitations in QE tasks versus purpose-built solutions like TAUS EPIC

Can we use Large Language Models (LLMs) for Quality Estimation (QE)?

 

The short answer: yes, but with significant caveats.

On the one hand, you can prompt ChatGPT, Claude, or any other commercial or open-source LLM to evaluate the quality of a translation. In fact, research has already explored this path: the GPT Estimation Metric Based Assessment (GEMBA), for example, prompts GPT models to produce a score that estimates translation quality.
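
For illustration, here is a minimal sketch of this style of prompting with the OpenAI Python client. The model name, prompt wording, and language pair are assumptions for the example rather than the exact GEMBA setup, and the model's reply still has to be parsed and validated as a numeric score.

    # Sketch: GEMBA-style quality estimation with a general-purpose LLM.
    # Assumes the official OpenAI Python client (openai>=1.0); the model name,
    # prompt wording, and language pair are illustrative, not the exact GEMBA setup.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def llm_qe_score(source: str, translation: str,
                     src_lang: str = "English", tgt_lang: str = "German",
                     model: str = "gpt-4o") -> str:
        """Ask the model for a 0-100 quality score and return its raw reply."""
        prompt = (
            f"Score the following translation from {src_lang} to {tgt_lang} "
            f"on a continuous scale from 0 to 100, where 0 means "
            f'"no meaning preserved" and 100 means "perfect meaning and grammar".\n\n'
            f"{src_lang} source: {source}\n"
            f"{tgt_lang} translation: {translation}\n"
            f"Score:"
        )
        response = client.chat.completions.create(
            model=model,
            temperature=0,  # reduces, but does not eliminate, randomness
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content.strip()

    print(llm_qe_score("The cat sat on the mat.", "Die Katze saß auf der Matte."))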

LLMs also come with strong advantages: they offer broad multilingual coverage, adapt across domains, and are relatively easy to query. But these benefits come with trade-offs that make relying on them for QE more complicated than it may seem.

Key Limitations of LLMs for QE

  • Non-determinism

    Even with the same prompt and a temperature of zero (which minimizes sampling randomness), LLMs can still produce different scores across runs due to non-determinism in inference and model serving. This lack of reproducibility makes it difficult to rely on them for consistent evaluation and comparison; a simple way to observe this is sketched after this list.

  • Biases in Evaluation
    LLMs often show position bias: they pay disproportionate attention to the beginning and end of a prompt, and may favor whichever translation is presented first. This skews results in unpredictable ways.

  • Hallucinations
    LLMs are prone to hallucination—producing confident but inaccurate judgments. For QE, this means they may fabricate justifications or give misleadingly precise scores, creating a false sense of reliability.

  • Engineering Overhead
    Using LLMs effectively requires difficult choices and expertise. Each of the decisions below involves experimentation, tuning, and maintenance, which add complexity and cost.

    • Which model to use (e.g., GPT-5, Claude, DeepSeek; reasoning vs. non-reasoning)?

    • How to design effective prompts (e.g., tone words, role assignment, chain-of-thought, or other prompt-engineering techniques)?

    • How to evaluate models fairly, given that public benchmark datasets may already be part of their training data?

    • Which metrics best reflect QE performance for benchmarking?

  • Edge Cases & Customization

    Handling cases such as language mismatches, non-translatable terms, or niche domains is especially challenging for generic LLMs. Moreover, customizing an LLM for specific domains, terminology, and language pairs is far from straightforward.
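
A quick way to observe the non-determinism described earlier is to score the same segment repeatedly at temperature zero and compare the replies. The sketch below assumes the illustrative llm_qe_score helper from the previous example is defined in the same script.

    # Sketch: checking score stability across repeated runs at temperature 0.
    # Assumes the illustrative llm_qe_score() helper from the earlier sketch
    # is defined in the same script; more than one distinct value in the output
    # means the scores are not reproducible.
    from collections import Counter

    source = "The agreement enters into force next month."
    translation = "Die Vereinbarung tritt nächsten Monat in Kraft."

    scores = [llm_qe_score(source, translation) for _ in range(10)]

    print(Counter(scores))                  # distribution of returned scores
    print("stable:", len(set(scores)) == 1)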

Why EPIC Offers a Better Path

This is where EPIC makes a difference. Unlike generic LLMs, EPIC is purpose-built for QE and avoids many of the pitfalls listed above.

  • Deterministic and Reliable: EPIC delivers consistent scores across runs, ensuring reproducibility and reliability.

  • Bias-Resistant: Its design reduces common evaluation biases.

  • Customizable: EPIC can be trained on your own terminology, glossaries, and domain-specific data, providing higher accuracy where it matters most.

  • Practical and Efficient: No need for extensive prompt engineering or ongoing model selection—EPIC is streamlined for QE from the ground up.

  • Speed: EPIC delivers fast, consistent scoring and maintains that speed and stability as volumes grow.

 

LLMs are powerful and versatile, and they have a role to play in translation quality assessment research. However, for organizations that require reliable, customizable, and production-ready QE, EPIC offers clear advantages. It combines technical robustness with practical usability—making it the smarter choice when translation quality really matters.

 


Curious to see how TAUS QE works? Try out the EPIC Free Trial.