Greek, Ukrainian Language Support and all new multilingual models

Tue 26 July 2022

We're pleased to announce that TextRazor now supports both Greek and Ukrainian language analysis through our public API. Simply send text in these new languages to the API (or try our demo) and you will receive our Entities, Topics and Categories.
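As a rough illustration of "send text to the API", the sketch below builds a request for Entities, Topics and Categories using only the Python standard library. The endpoint URL, the `x-textrazor-key` header, and the `extractors` / `classifiers` parameter names and values are assumptions based on our public REST documentation, so check them against the current docs before relying on them. `build_request` and `YOUR_API_KEY` are placeholder names used only for this example.

```python
import urllib.parse
import urllib.request

# Assumed REST endpoint; confirm against the current API documentation.
API_URL = "https://api.textrazor.com/"


def build_request(text, api_key):
    """Build (but do not send) a POST request for entities, topics and categories."""
    params = {
        "text": text,
        # Assumed extractor names for Entities and Topics.
        "extractors": "entities,topics",
        # Assumed classifier name used to obtain Categories.
        "classifiers": "textrazor_newscodes",
    }
    body = urllib.parse.urlencode(params).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"x-textrazor-key": api_key},
        method="POST",
    )


# Ukrainian input; no language parameter is set in this sketch.
request = build_request("Київ є столицею України.", "YOUR_API_KEY")

# To actually call the API (needs a real key and network access):
# with urllib.request.urlopen(request) as response:
#     payload = response.read().decode("utf-8")
```

The request is only constructed here, not sent, so the example runs without a key or a network connection.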

We've made a number of more general improvements to our multilingual analysis pipeline as part of this new language support. Historically we've only ever used native models for each of our languages. Generally this is far superior to approaches based on translation to English - a translation step can introduce noise, and accuracy is improved with models that have learned the clues and intricate details of a language from native data.

Our new languages are a little different as they have much less available data, so for the first time we've noticed an accuracy improvement by introducing translated document features into our classification models. We've developed a lightweight translation system that selectively translates shorter or more difficult documents to generate English context as additional features to help inform our native algorithms.
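The idea of selectively adding translated features can be sketched in a few lines. This is a toy illustration, not our production pipeline: the thresholds, the use of a native-model confidence score as a stand-in for "difficult" documents, the bag-of-words features, and every name below (`should_translate`, `build_features`, `fake_translate`) are hypothetical.

```python
from typing import Callable, Dict

# Hypothetical thresholds, chosen only for illustration.
MIN_TOKENS = 50
MIN_CONFIDENCE = 0.6


def should_translate(text: str, native_confidence: float) -> bool:
    """Translate only short documents or ones the native model is unsure about."""
    return len(text.split()) < MIN_TOKENS or native_confidence < MIN_CONFIDENCE


def build_features(
    text: str,
    native_confidence: float,
    translate: Callable[[str], str],
) -> Dict[str, int]:
    """Return native token counts, plus English token counts when translation is used."""
    features: Dict[str, int] = {}
    for token in text.lower().split():
        key = "native:" + token
        features[key] = features.get(key, 0) + 1
    if should_translate(text, native_confidence):
        # English context is added alongside the native features, never instead of them.
        for token in translate(text).lower().split():
            key = "en:" + token
            features[key] = features.get(key, 0) + 1
    return features


def fake_translate(text: str) -> str:
    """Stand-in for a real translation system; returns a fixed English string."""
    return "Kyiv is the capital of Ukraine"


features = build_features("Київ є столицею України", 0.9, fake_translate)
```

Here the document is short, so `features` contains both `native:` and `en:` keys. A long document that the native model handles confidently would skip the translation step entirely, which is what keeps the extra cost low.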

This has the biggest impact on our new low-resource languages, but the system has also shown improvements for our other languages in certain domains. We are now detecting the specific documents where translations improve accuracy and selectively applying the new system. This is currently enabled for German, Swedish, Danish and Finnish, and we'll gradually roll out the update to our other languages too. The upgrade is very subtle for these languages, but for certain types of document we're now returning more useful Topics and Categories, and there are no breaking changes to our scoring thresholds.

Since our translation system does not need to produce a human-quality translation, it can work a lot faster than most, so we can use its output without significantly slowing our pipeline. We have therefore maintained our existing pricing for these languages.