
IAB, IPTC and Custom Taxonomies - The TextRazor Universal Classifier is here

Mon 08 February 2016

Today we're excited to announce a major TextRazor upgrade. Over the last couple of years we've annotated billions of documents with Topic tags, an unstructured set of high-level categories that represent the semantic "aboutness" of your text content. With use cases ranging from on-page SEO enhancement to contextual ad targeting, Topic Tagging helps you manage and enrich document collections at scale.

Many projects require classification to a more formal set of categories, either one specific to a particular industry vertical, or to a standardized taxonomy for interoperability with other systems. We've just launched built-in support for automatically categorizing text in 10 languages to two industry-standard taxonomies with state-of-the-art accuracy. The new functionality also allows you to extend and customize TextRazor with your own categories to help tailor, tweak and optimize your results.

Out of the box we support:

  • textrazor_iab - Interactive Advertising Bureau QAG (Quality Assurance Guidelines) segments aim to standardize content classification across the internet advertising industry. The public QAG taxonomy consists of approximately 400 high-level categories arranged into two tiers.
  • textrazor_newscodes - IPTC NewsCodes. The International Press Telecommunications Council maintains public sets of news metadata concepts. TextRazor supports the full "subject code" taxonomy, approximately 1400 high-level categories organized into a three-level tree. Find out more about the IPTC subject codes here.

Extending TextRazor's Classification Engine

TextRazor also makes it easy to define your own classifiers. This is useful if your documents use specific, technical terminology, or you are simply interested in a different set of categories to those we provide by default. Instead of a lengthy and expensive data acquisition and training process, TextRazor allows you to get up and running in a few minutes using what we call concept queries.

Simply upload a few words that concisely describe each of your categories. We use state-of-the-art Machine Learning algorithms built on the relationships between the entities, words, synonyms and topics found in your documents to automatically calculate relevance scores with no further training. Head over to our updated tutorials for some examples of the new classification engine and see how easy it is to start creating your own categories.
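As a rough illustration of what "a few words per category" might look like in practice, here is a minimal Python sketch that prepares a set of category definitions for upload. The field names (categoryId, label, query), the concept(...) query syntax and the example categories are assumptions made for illustration, not a confirmed schema, so check the tutorials for the exact format before relying on it.

```python
import json

# Hypothetical category definitions: an ID, a readable label, and a short
# concept query describing each category. Field names and the concept(...)
# syntax are assumptions for illustration only.
CATEGORIES = [
    {"categoryId": "soccer", "label": "Soccer", "query": "concept('sport>soccer')"},
    {"categoryId": "tennis", "label": "Tennis", "query": "concept('sport>tennis')"},
]


def build_classifier_payload(categories):
    """Check the category definitions and serialize them to a JSON string."""
    seen_ids = set()
    for category in categories:
        missing = {"categoryId", "query"} - set(category)
        if missing:
            raise ValueError(f"category is missing fields: {sorted(missing)}")
        if category["categoryId"] in seen_ids:
            raise ValueError(f"duplicate categoryId: {category['categoryId']}")
        seen_ids.add(category["categoryId"])
    return json.dumps(categories)
```

Validating locally before uploading catches duplicate or incomplete categories early, without spending an API request.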

This new functionality is available immediately through our API on all of our subscription plans. To use one of our classifiers, simply add its ID to your analysis requests - see the documentation for your language of choice for details. We have also released a new API endpoint for creating and maintaining custom classifiers. Please contact us with some details of your classification problem and we'd be happy to help you get started.
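To make "add the ID to your analysis requests" concrete, here is a small sketch using only the Python standard library. It builds the request but does not send it. The endpoint URL, the x-textrazor-key header and the "text" and "classifiers" form fields are assumptions about the REST interface; the official client libraries wrap these details, so treat the documentation as the authority.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint; confirm against the API documentation.
API_URL = "https://api.textrazor.com/"


def build_analysis_request(api_key, text, classifier_ids):
    """Build (but do not send) a POST request asking for the given classifiers."""
    fields = [("text", text)]
    # One "classifiers" form field per classifier ID (assumed parameter name).
    fields += [("classifiers", classifier_id) for classifier_id in classifier_ids]
    body = urlencode(fields).encode("utf-8")
    return Request(
        API_URL,
        data=body,
        headers={"x-textrazor-key": api_key},  # assumed auth header
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send it; building it separately keeps the example easy to inspect and test offline.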

If you'd just like to try it out, the demo will show you results using our default classifiers. You can find those in the top right of the results page.

As always, it's free to get started with TextRazor in any programming language. Please let us know if you have any thoughts on this new functionality, or if you notice anything that could be improved.