A New Generation of Multilingual Classification and NER Models
Today we’re announcing the largest upgrade to TextRazor’s classification and Named Entity Recognition (NER) systems since our 2013 launch: higher accuracy, stronger multilingual performance, and significantly improved throughput across the API.
This release introduces upgraded support for IAB Content Taxonomy v3.0 and v2.2, alongside entirely new entity models across all supported languages. The new models are already live and automatically available to all users.
At the core of this release is a new generation of TextRazor foundation models built specifically for structured prediction tasks like classification and NER. Our approach combines the analytical strength of modern large language models with task-specific optimisation and knowledgebase grounding.
Rather than relying on large generative models at runtime, we distill compact, specialised models that run efficiently on CPU while maintaining strong accuracy, deterministic behaviour, and scalable production performance.
Classification: Greater Nuance and Stronger Grounding
Our updated classification models for IAB v3.0 and v2.2 deliver substantial improvements across the full taxonomy.
Large taxonomies often contain categories that appear similar but are conceptually distinct. For example, there are very subtle differences between the IAB categories “Vegetarian Cuisine” and “Vegan Cuisine”, or “Off-Road Vehicles” and “Pickup Trucks”. Our new models are trained to focus on these distinctions, using synthetic data from large open models, manually curated training data, and relationships derived from our knowledgebase. Mentions of specific brands, products, and public figures act as grounding signals, allowing the models to disambiguate closely related categories more reliably than purely generative approaches.
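To give a feel for how grounding signals can separate near-identical categories, here's a deliberately simplified sketch. It is not TextRazor's actual implementation, and the category scores and brand lists are hypothetical, but it shows the core idea: when two categories are near-tied, an explicit knowledgebase-linked mention can tip the decision.

```python
# Illustrative sketch (not TextRazor's internal algorithm): re-ranking
# near-tied category scores using knowledgebase-grounded entity mentions.
# The brand lists below are hypothetical examples.

KNOWN_BRANDS = {
    "Vegan Cuisine": {"Beyond Meat", "Oatly"},
    "Vegetarian Cuisine": {"Quorn"},
}

def rerank(scores, mentioned_entities, boost=0.1):
    """Boost categories whose knowledgebase-linked brands appear in the text."""
    adjusted = dict(scores)
    for category, brands in KNOWN_BRANDS.items():
        if category in adjusted and brands & mentioned_entities:
            adjusted[category] += boost
    return max(adjusted, key=adjusted.get)

# Two closely related categories with near-identical model scores...
scores = {"Vegetarian Cuisine": 0.48, "Vegan Cuisine": 0.47}
# ...disambiguated by an explicit brand mention in the document.
print(rerank(scores, {"Oatly"}))  # -> Vegan Cuisine
```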
We’ve made major gains in the "Brand Safety" sections of the taxonomy. General-purpose LLMs often apply broad safety guardrails that limit nuanced categorisation of profanity, sexual content, or other sensitive material. Our models are trained specifically for detection and classification of such content.
Multilingual performance has also improved, particularly in lower-resource languages. We've expanded our multilingual dataset with billions of tokens of high-quality synthetic data, validated with manual and LLM-assisted review. This results in more stable cross-language behaviour and better multilingual coverage in niche areas of the taxonomy.
Named Entity Recognition: New Foundation Models Across All Languages
Alongside classification, we’ve rolled out a completely new generation of NER models over the past few months.
We pretrained language-specific foundation models to ensure strong multilingual understanding, then fine-tuned them for structured prediction tasks before distilling and optimising them on our curated entity datasets. These models deliver our best performance to date, outperforming frontier LLMs on internal multilingual entity recognition benchmarks.
The models are more sensitive to context and significantly better at detecting smaller or less prominent entities. They have been specifically trained to recognise newly formed companies, lesser-known organisations and individual names that are not widely represented in public datasets.
Entity types are now aligned with Wikidata-compatible type IDs. We derive and validate these types using an LLM-assisted process, which also improves our existing DBpedia and Freebase type mappings. This results in cleaner hierarchies, better interoperability and more consistent typing across languages.
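As a concrete sketch of what aligning on Wikidata IDs looks like in practice, the snippet below maps a few legacy DBpedia-style type labels onto well-known Wikidata items. The mapping shown is a tiny illustrative sample, not TextRazor's full LLM-validated mapping.

```python
# Illustrative sketch of consolidating legacy type labels onto
# Wikidata-compatible type IDs. Only a handful of well-known Wikidata
# items are shown here; the full mapping is far larger.

WIKIDATA_TYPES = {
    "Person": "Q5",         # human
    "Company": "Q4830453",  # business
    "City": "Q515",         # city
    "Country": "Q6256",     # country
}

def to_wikidata(legacy_types):
    """Return Wikidata QIDs for any legacy types we can map, skipping the rest."""
    return [WIKIDATA_TYPES[t] for t in legacy_types if t in WIKIDATA_TYPES]

print(to_wikidata(["Person", "Agent"]))  # -> ['Q5']
```

Because every language's entity types resolve to the same QIDs, downstream consumers can key on a single identifier regardless of the document's language.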
Finally, part-of-speech tagging using Universal Dependencies is now available for all supported languages.
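For readers unfamiliar with Universal Dependencies, tokens are tagged with the 17-tag UPOS tagset (NOUN, VERB, PROPN, and so on). The sketch below shows what consuming per-token tags might look like; the field names in the example records are illustrative assumptions, so consult the API reference for the actual response schema.

```python
# Hypothetical shape of per-token output with Universal Dependencies
# (UPOS) tags. Field names here are illustrative, not the exact schema.

tokens = [
    {"token": "TextRazor", "partOfSpeech": "PROPN"},
    {"token": "analyzes",  "partOfSpeech": "VERB"},
    {"token": "text",      "partOfSpeech": "NOUN"},
    {"token": ".",         "partOfSpeech": "PUNCT"},
]

# A common downstream use: keep only content words.
content_words = [t["token"] for t in tokens
                 if t["partOfSpeech"] in {"NOUN", "PROPN", "VERB", "ADJ"}]
print(content_words)  # -> ['TextRazor', 'analyzes', 'text']
```

Because UPOS is language-universal, the same filtering logic works unchanged across all supported languages.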
Performance and Infrastructure
Behind the scenes, this release includes significant infrastructure improvements. Throughput and latency have improved, particularly for IAB classification workloads.
Our new generation of models has been distilled and quantised using quantisation-aware training techniques, dramatically reducing model size while preserving accuracy. We’ve optimised inference for CPU deployment and upgraded our server fleet to target AVX512 VNNI instructions, enabling faster processing and improved scalability without requiring GPU infrastructure.
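To illustrate the size/accuracy trade-off involved, here is a minimal sketch of symmetric int8 quantisation, the kind of transform quantisation-aware training teaches a model to be robust to: weights are mapped to 8-bit integers with a single scale factor, then dequantised at inference. This is a toy example, not TextRazor's production pipeline.

```python
# Minimal sketch of symmetric int8 quantisation. Weights are stored as
# 8-bit integers plus one float scale (4x smaller than float32), and
# reconstructed with a small, bounded error.

def quantise_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantise(q, scale):
    return [v * scale for v in q]

weights = [0.52, -1.27, 0.003, 0.9]
q, scale = quantise_int8(weights)
approx = dequantise(q, scale)

# Reconstruction error is bounded by half a quantisation step.
print(max(abs(a - w) for a, w in zip(approx, weights)) < scale)  # -> True
```

At inference time the integer weights can be consumed directly by int8 dot-product instructions such as AVX512 VNNI, which is what makes CPU-only deployment fast enough for production workloads.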
Looking Ahead
The new models are live in the API today. Since we're committed to backward compatibility, you don't need to do anything to pick up the changes; they are already enabled for our "entities" extractor and the IAB v3.0 and v2.2 taxonomy classifiers.
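For anyone getting started, here's a sketch of assembling a REST request that enables the upgraded extractors. The endpoint and header follow the public REST API, but the classifier ID shown is an assumption; check the current documentation for the exact identifiers. The snippet only builds the request rather than sending it.

```python
# Sketch of a TextRazor REST request enabling the upgraded extractors.
# The classifier ID below is an assumed placeholder - consult the docs
# for the exact identifier. This builds the request without sending it.
from urllib.parse import urlencode

def build_request(api_key, text):
    payload = {
        "text": text,
        "extractors": "entities,words",
        "classifiers": "textrazor_iab_content_taxonomy_3.0",  # assumed ID
    }
    headers = {"x-textrazor-key": api_key}
    return "https://api.textrazor.com/", headers, urlencode(payload)

url, headers, body = build_request("YOUR_API_KEY", "Example article text.")
print(url)  # -> https://api.textrazor.com/
```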
We plan to extend the same technology to our IPTC Media Topic taxonomy classifier in an upcoming release. As always, we welcome feedback!