
When it comes to evaluating translations, we have scores, metrics, and human judgement to assess quality. In this article, we’re going to discuss Quality Estimation (QE) scores, which are gaining more and more importance today. They have one big advantage: there’s no need for a reference translation.
What are Quality Estimation scores?
Quality Estimation scores are predictions of translation quality. They are generated by machine learning models that analyze the source text and its translated output, and the neat part is that the QE systems don’t require a human-created reference translation to compare against.
Here’s how they work: QE systems learn patterns from large datasets of translations that have been previously annotated for quality. It’s because of this training that they can estimate how accurate, fluent, or usable a new translation is likely to be.
These scores can be produced at different levels of granularity. Some systems give sentence-level predictions, while others can assign scores to individual words or phrases. It’s thus possible to identify specific problem areas within a translation.
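As a sketch of what those granularity levels look like in practice, here are two illustrative output structures (the field names and example scores are invented; real systems, such as the WMT word-level QE tasks, use a similar combination of a sentence score and per-word OK/BAD tags):

```python
# Sentence-level QE: one quality score per translated segment.
sentence_qe = {
    "source": "Der Vertrag endet am 31. März.",
    "translation": "The contract ends on March 31.",
    "score": 0.92,  # higher = better, on a 0.0-1.0 scale
}

# Word-level QE: one OK/BAD tag per target word, so reviewers
# can jump straight to the problem areas.
word_qe = {
    "translation": ["The", "contract", "expires", "on", "March", "31."],
    "tags":        ["OK",  "OK",       "BAD",     "OK", "OK",    "OK"],
}

# Collect the words flagged as problematic.
problem_words = [
    word
    for word, tag in zip(word_qe["translation"], word_qe["tags"])
    if tag == "BAD"
]
print(problem_words)  # ['expires']
```

The word-level view is what makes it possible to highlight specific spans inside a sentence rather than rejecting the whole segment.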
Common quality categories
Raw QE scores are often continuous numerical values (0.0 to 1.0, for example), but companies often convert them into categorical labels, which are easier to interpret and act upon. You’ll generally find these four categories:
- Best
- Good
- Acceptable
- Bad
Best is reserved for translations that are nearly indistinguishable from high-quality human translations. They show fluency, accuracy, and stylistic appropriateness. There’s little to no editing involved in this category.
Good means that a translation is generally accurate and fluent, with only minor errors that do not significantly affect comprehension. Human reviewers will need to do some light post-editing before they publish the content.
Acceptable is a translation that manages to convey the meaning of the text but may include issues that are easy to notice: awkward phrasing, minor inaccuracies, stylistic inconsistencies. This level can be suitable for low-stakes use cases (internal communication or content where you don’t need perfect fluency, for example).
Bad means that a translation contains major errors (mistranslations, missing content, severe grammatical issues) that distort the meaning or make the text unusable. Humans will likely have to retranslate translations marked as “bad” because post-editing will not suffice.
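Converting a continuous QE score into these four categories is simply a matter of choosing cutoffs. A minimal sketch, with thresholds that are purely illustrative (in practice they are calibrated per language pair, domain, and risk tolerance):

```python
def categorize(score: float) -> str:
    """Map a continuous QE score (0.0-1.0, higher is better) to a
    categorical label. The thresholds below are invented for
    illustration, not industry standards."""
    if score >= 0.9:
        return "best"        # publish as-is
    if score >= 0.7:
        return "good"        # light post-editing
    if score >= 0.5:
        return "acceptable"  # fine for low-stakes content
    return "bad"             # send back for retranslation

print(categorize(0.95))  # best
print(categorize(0.42))  # bad
```

A workflow can then route each translation automatically: publish, post-edit, or retranslate, based on the label alone.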
Quality Estimation scores vs. traditional metrics
In a previous article, we described the most common metrics used to evaluate AI-generated translations. It’s important not to confuse the two. Metrics like BLEU and COMET evaluate translation quality by comparing the system output to one or more human-generated reference translations. But as we mentioned at the beginning of the article, QE operates without any reference; it relies solely on the source text and the generated translation.
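The difference boils down to what each approach takes as input. The two toy functions below (names and scoring logic invented for illustration; they are not real BLEU or QE implementations) make the contrast concrete:

```python
def reference_metric(hypothesis: str, reference: str) -> float:
    """Reference-based evaluation (the BLEU/COMET family): the machine
    output is compared against a gold-standard human translation.
    Toy stand-in: fraction of reference words found in the hypothesis."""
    hyp_words = set(hypothesis.lower().split())
    ref_words = reference.lower().split()
    return sum(word in hyp_words for word in ref_words) / len(ref_words)

def qe_metric(source: str, hypothesis: str) -> float:
    """Quality Estimation: only the source and the machine output are
    needed. A trained model would predict a score here; this stub just
    shows the inputs a QE system consumes (no reference anywhere)."""
    raise NotImplementedError("a learned model goes here")

print(reference_metric("The cat sat on the mat", "The cat sat on the mat"))  # 1.0
```

Note that `reference_metric` cannot run without a human reference, which is exactly what is missing in a live production pipeline, and exactly what `qe_metric` does without.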
We can conclude that QE scores are suited for production environments where translations are generated in real time and there’s no gold-standard comparison. QE models estimate intrinsic quality based on learned patterns of good and bad translations.
QE scores are not here to replace the traditional metrics, as they can serve complementary purposes. Traditional metrics are still useful in research and system development, where you have reference translations and you need controlled comparisons.
Final thoughts
QE systems are not perfect: their predictions depend heavily on the quality and domain of the training data. You get an estimate, as the name suggests. Even so, converting these numerical predictions into intuitive categories makes quality assessment accessible to a much wider range of users.