Classification is the process of assigning high-level categories to your content, and it is a core requirement for many text analysis projects. Take the following sentence, for example: "The chancellor has postponed the sale of the government's final stake in Lloyds Banking Group". Intuitively, this could be put in the "Politics" or "Finance" category. At scale it is not feasible to assign categories to text manually, so an automated solution is needed.
TextRazor combines its large Knowledge Graph, its semantic understanding of the relationships between the words in your document, and state-of-the-art machine learning algorithms to automatically assign categories to each of your documents.
Fixed, standardized taxonomies of categories can be useful for normalizing metadata use within your organization and for ensuring maximal interoperability with third-party services.
TextRazor provides out-of-the-box classification support for the largest public taxonomies, giving industry-leading results with no further training:
|textrazor_iab_content_taxonomy ||Interactive Advertising Bureau Content Taxonomy v2, an updated version (2017) of the IAB QAG segments.|
|textrazor_iab_content_taxonomy_2.2 ||Interactive Advertising Bureau Content Taxonomy v2.2, an incremental update of the IAB Content Taxonomy.|
|textrazor_iab_content_taxonomy_3.0 ||Interactive Advertising Bureau Content Taxonomy v3.0, the latest (2022) version of the IAB Content Taxonomy.|
|textrazor_iab ||Interactive Advertising Bureau QAG segments aim to standardise content classification across the internet advertising industry. The public QAG taxonomy consists of approximately 400 high-level categories arranged into two tiers. This taxonomy is considered to have been replaced by the Content Taxonomy.|
|textrazor_newscodes ||IPTC NewsCodes. The International Press Telecommunications Council is a consortium of the world's major news agencies and news industry vendors, and acts as the global standards body of the news media. It maintains public sets of news metadata concepts to allow for consistent coding of news metadata. TextRazor supports the full "subject code" taxonomy, approximately 1400 high-level categories organized into a three-level tree. Find out more about the IPTC subject codes here.|
|textrazor_mediatopics ||IPTC Media Topics. IPTC's latest 1100-term taxonomy, with a focus on text. Development started from the Subject Codes, extending the tree to five levels and reusing the same 17 top-level terms. The terms below the top level have been revised and rearranged.|
We recommend that new projects with no third-party dependencies start with the latest textrazor_mediatopics and textrazor_iab_content_taxonomy_3.0. The IPTC taxonomy is the larger and generally more comprehensive of the two, and is tailored to news content.
Sometimes the categories you might be interested in aren't well represented by off-the-shelf classifiers. TextRazor gives you the flexibility to create a customized model for your particular problem.
To create a new category, simply give TextRazor a word or two that concisely describes the type of information you are looking for. TextRazor will use this information to build a model that can identify documents that are semantically similar in concept to your words, even if they don't explicitly mention the theme.
This is a very fast and powerful way to build a customized classifier, without the hassle of building your own large training dataset. To further refine your categories, you can use the full range of our logic rules in your queries.
Categories vs. Topic Tags
TextRazor supports Topic Tagging in addition to categorization, and both systems can be useful for your application. TextRazor's Topic Tagging engine identifies hundreds of thousands of different topics at different levels of abstraction, a list that is constantly evolving to keep pace with changes in language. Topic tags can be useful when you don't need the support of a formal taxonomy, when you can benefit from a larger number of categories, or when you are building further classification algorithms on top of TextRazor's metadata.
The Universal Classifier
Classification is supported in all of TextRazor's languages: Arabic, English, Chinese, Danish, Dutch, Finnish, French, German, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish and Swedish.
Custom classifiers must be defined using English queries only, but can then be used to classify documents in any language. This powerful concept dramatically reduces the effort required to internationalize your classification algorithms.
To tag your text, simply set the "classifiers" option in your request to one of the classifier names listed above. The response will then include a list of scored categories. Each score indicates how confident TextRazor is that the category applies to the document. Depending on your application, you may improve your results by ignoring categories below a certain score, or by simply taking the top N categories.
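The two filtering strategies described above (a minimum score, or the top N categories) can be sketched in a few lines of Python. This is an illustrative sketch only: the `text` form field, the `label` and `score` keys, and the sample scores are assumptions made for the example, so check the API documentation for the exact request and response format.

```python
from urllib.parse import urlencode


def build_request_body(text, classifier):
    """Form-encode a request that asks for one classifier.

    The "classifiers" option comes from the text above; the "text"
    field name is an assumption made for this sketch.
    """
    return urlencode({"text": text, "classifiers": classifier})


def select_categories(categories, min_score=None, top_n=None):
    """Drop categories below min_score, then keep the top_n highest scoring."""
    kept = [c for c in categories if min_score is None or c["score"] >= min_score]
    kept.sort(key=lambda c: c["score"], reverse=True)
    return kept if top_n is None else kept[:top_n]


# Hypothetical scored categories; the key names and values are illustrative.
sample = [
    {"label": "politics", "score": 0.64},
    {"label": "economy, business and finance", "score": 0.81},
    {"label": "sport", "score": 0.03},
]

body = build_request_body("The chancellor has postponed the sale.", "textrazor_mediatopics")
confident = select_categories(sample, min_score=0.5)  # threshold strategy
best = select_categories(sample, top_n=1)             # top-N strategy

print(body)
print([c["label"] for c in confident])
print(best[0]["label"])
```

A threshold suits applications where a document may legitimately have no category at all, while top N guarantees a fixed number of labels per document. The two can also be combined by passing both arguments.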
Read more in our Python, Java, PHP or REST documentation.
You can try out the classifier with your own documents through our Online Demo.