Arabic Language Support

Tue 02 November 2021

Arabic is the fifth most spoken language in the world, and we’ve seen huge demand for it ever since we launched the TextRazor API some eight years ago. We develop individual models for each of our languages, so adding a new one is always a big effort. Arabic is a particularly interesting language to work with: it has a complex grammatical structure and extremely rich morphology, and any given verb can have thousands of different forms. This means models designed to detect features in English don’t port across well at all. Our approach leverages large volumes of data to automatically learn this variation and the ambiguity it creates, and the rich language-independent features of our Entity Knowledgebase help significantly improve accuracy.

Our models were built on Modern Standard Arabic content, which represents the majority of formal written Arabic. Our Entity, Topic and Classification pipelines now fully support Arabic, transparently handling parsing and normalisation of the language so you don’t have to. Feel free to try it out on our demo, or send some Arabic documents to the API.
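
If you want to send an Arabic document from a script, here is a minimal sketch using only the Python standard library. It rests on a few assumptions you should check against the API reference: the endpoint URL, the `X-TextRazor-Key` header, the `extractors` and `languageOverride` form fields, and the language code `ara`. The helper names `build_request` and `analyze` are our own, purely for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint; confirm it against the API reference before relying on it.
API_URL = "https://api.textrazor.com/"


def build_request(text, api_key, extractors=("entities", "topics")):
    """Build (but do not send) a POST request asking for analysis of `text`."""
    body = urllib.parse.urlencode({
        "text": text,                        # Arabic text is percent-encoded as UTF-8
        "extractors": ",".join(extractors),  # assumed comma-separated field
        "languageOverride": "ara",           # assumed code for Arabic; omit to auto-detect
    }).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"X-TextRazor-Key": api_key},  # assumed authentication header
        method="POST",
    )


def analyze(text, api_key):
    """Send the request and return the decoded JSON response as a dict."""
    with urllib.request.urlopen(build_request(text, api_key)) as resp:
        return json.load(resp)
```

Calling `analyze("...", "YOUR_API_KEY")` with your own key and text returns the parsed JSON. We haven't shown how to pull entities or topics out of it, since the exact response layout is best taken from the API reference rather than from this sketch.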