
Language Update: TextRazor for Chinese

Tue 28 March 2017

We’ve long supported the analysis of a number of European languages, but it has taken a little time to solve the tricky problems that Asian languages pose to text analysis systems. Today we’re pleased to announce TextRazor support for one of our most requested languages: Chinese.

It’s always important to ensure that linguistic models can capture the intricacies of different target languages, and this is particularly true with Chinese. Since written Chinese doesn’t typically separate words with whitespace, even the simplest task of splitting a sentence into words is difficult. Chinese sentences can also contain a mixture of Simplified and Traditional characters, and there are major variations in usage between Mainland China, Hong Kong, Taiwan, Singapore and Malaysia. Single characters are frequently combined to completely change the meaning of a phrase, confusing naive classification solutions. We’ve leveraged our huge knowledge base and state-of-the-art algorithms to teach TextRazor the meaning behind Chinese sentences. Our popular Entity Recognition and Disambiguation, Topic Tagging, and Classification extractors now support Chinese out of the box.
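To see why word splitting is hard, here is a minimal sketch of a naive dictionary-based segmenter (greedy forward maximum matching). This is a textbook baseline shown for illustration only, not TextRazor’s method; the function name and the tiny vocabulary are made up for this example.

```python
def forward_max_match(text, vocabulary, max_len=4):
    """Greedy left-to-right segmentation: at each position take the longest
    vocabulary entry that matches, falling back to a single character."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                i += size
                break
    return words


# Toy vocabulary, hand-picked for this illustration only.
TOY_VOCAB = {"研究", "研究生", "生命", "起源", "的"}

segments = forward_max_match("研究生命的起源", TOY_VOCAB)
print(segments)  # ['研究生', '命', '的', '起源']
```

The intended reading is 研究 / 生命 / 的 / 起源 ("research the origin of life"), but the greedy matcher grabs 研究生 ("graduate student") first and leaves 命 stranded. Resolving this kind of ambiguity needs context, which is why simple dictionary lookup is not enough.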

As with all of our languages, we return Chinese Entities and Topics with both Localised and English canonical IDs, and our classifiers can assign Chinese documents to the same industry-standard taxonomies as our other languages. This makes it easy to build multilingual NLP systems without relying on a messy translation step.
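As a rough sketch of what shared canonical IDs make possible, the snippet below indexes documents in different languages by their English ID. The field names `entityId` and `entityEnglishId` are assumptions here and should be checked against the API reference; the sample records are invented for illustration and are not real API output.

```python
from collections import defaultdict


def group_by_english_id(documents):
    """Map each English canonical ID to the set of documents that mention it."""
    index = defaultdict(set)
    for doc_name, entities in documents.items():
        for entity in entities:
            index[entity["entityEnglishId"]].add(doc_name)
    return dict(index)


# Invented sample records (assumed field names, not real API output).
SAMPLE = {
    "news_zh.txt": [{"entityId": "北京市", "entityEnglishId": "Beijing"}],
    "news_en.txt": [{"entityId": "Beijing", "entityEnglishId": "Beijing"}],
}

index = group_by_english_id(SAMPLE)
print(sorted(index["Beijing"]))  # ['news_en.txt', 'news_zh.txt']
```

Because both documents resolve to the same English ID, they can be joined directly, with no translation of the source text.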

We are planning to bring more languages to you soon. You don’t need to make any changes to take advantage of Chinese support: we now return Chinese results wherever we auto-detect Chinese content. You can try it out by pasting a Chinese document into our demo.
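For API users, a request for Chinese text should look the same as one for any other language. The sketch below only builds the request and does not send it. It assumes the REST endpoint `https://api.textrazor.com/`, the `X-TextRazor-Key` header, and the `text` and `extractors` form parameters, so please verify these against the current API documentation. `build_analysis_request` is a hypothetical helper name and `YOUR_API_KEY` is a placeholder.

```python
import urllib.parse
import urllib.request

# Assumed endpoint; confirm against the API documentation.
API_URL = "https://api.textrazor.com/"


def build_analysis_request(text, api_key, extractors=("entities", "topics")):
    """Build (but do not send) a form-encoded POST request.

    No language parameter is set, so the service is left to
    auto-detect the language of the text.
    """
    body = urllib.parse.urlencode(
        {"text": text, "extractors": ",".join(extractors)}
    ).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"X-TextRazor-Key": api_key},
        method="POST",
    )


request = build_analysis_request("北京是中国的首都。", "YOUR_API_KEY")
# To send it: urllib.request.urlopen(request), then parse the JSON response.
```

Note that nothing in the request names the language; the same code works unchanged for English or Chinese input.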