Over the past few months we've had our heads down scaling the TextRazor backend and building on some great feedback from our early testers. Today we're excited to announce that we're opening up to beta! TextRazor is a comprehensive integrated Text Analytics/Natural Language Processing API, offering fully scriptable entity recognition/linking, relation/property extraction, automatic categorization and topic tagging, synonym/entailment extraction and semantic dependency parsing.
We've been working hard to turn the latest academic research in parsing and entity extraction into an offering that really pushes the boundaries of production-ready, robust text analytics. You can find out a bit more about how it all works here, have a play with the demo, or sign up for a free API key to get started right away! We're still in beta and actively tuning our models, so please contact us with any comments or issues.
Accurate, Fast and Flexible
The Text Analytics industry has experienced tremendous growth over the last few years, but we think there's space for a lot of improvement. Here's how we're different:
Simple keyword or regular expression approaches to text mining can quickly break when sentences are even slightly rephrased. Our relation, property extraction and parsing algorithms are designed to help you mine the complex relationships between the words and entities in your documents. Our Entailment/Synonym extraction system allows you to develop robust extraction rules that abstract over multiple phrases expressing the same ideas.
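To make the brittleness concrete, here is a minimal Python sketch of a surface-pattern rule. It is not TextRazor code: the pattern, the function name `find_acquisitions`, and the company names are all made up for illustration. The rule catches one phrasing of an acquisition and silently misses two rewordings of the same fact.

```python
import re

# Hypothetical surface rule: "<Buyer> acquired <Target>".
# Both names must start with a capital letter.
ACQUISITION = re.compile(r"(?P<buyer>[A-Z]\w+) acquired (?P<target>[A-Z]\w+)")

def find_acquisitions(sentence):
    """Return (buyer, target) pairs matched by the surface pattern."""
    return [(m.group("buyer"), m.group("target"))
            for m in ACQUISITION.finditer(sentence)]

# Three phrasings of the same (invented) fact.
sentences = [
    "Acme acquired Globex in 2006.",                  # active voice: matched
    "Globex was acquired by Acme in 2006.",           # passive voice: missed
    "Acme completed its purchase of Globex in 2006.", # paraphrase: missed
]

for sentence in sentences:
    print(sentence, "->", find_acquisitions(sentence))
```

Only the first sentence yields a pair. Covering the other two with more patterns is possible, but each new phrasing needs another hand-written rule, which is the maintenance problem that parsing and entailment extraction are meant to avoid.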
Customization and domain adaptation are often crucial to the development of accurate text analytics applications. Our embedded Prolog logic engine allows you to rapidly combine the results of different annotations and add powerful domain-specific rules to the extraction process.
Entity Recognition is an important first step in many text analytics tasks, but the ambiguity of language means it's very easy to misclassify entities without real-world knowledge. Our system combines a large knowledge base of facts about millions of known entities with deep semantic parsing to identify and disambiguate more entities than ever before, while still maintaining high precision. Rather than limiting extraction to a fixed number of types, we automatically place entities into an ontology of thousands of categories derived from linked data sources. We also go beyond named entity recognition to return more general concepts/tags mentioned in text, and can even identify relevant topic tags that aren't actually mentioned anywhere.
Painless integration and customisation. Our simple yet powerful API lets you rapidly build text analytics into your applications, and our client makes it easy to customise and combine the results to suit your specific needs. Users with more specific requirements can deploy the TextRazor service on their own hardware within their own firewall and call it over the same simple API.
Full semantic analysis of text has typically been prohibitively slow, but we believe the ability to deeply analyse unstructured text at speed is crucial to enable a new class of text analytics applications operating over large datasets. Scalability of the backend and the latency of individual requests have therefore been major priorities. The TextRazor backend has been written from the ground up in heavily optimised C++, tweaked to such an extent that we can run text through the full pipeline at up to 20,000 words/second on a commodity server. The system automatically parallelizes processing of individual documents, minimising per-request latency while also maintaining quality of service for concurrent callers.
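For a rough sense of what that throughput means in practice, the back-of-the-envelope calculation below estimates processing time for a corpus on a single server. The corpus size and average document length are hypothetical, and the 20,000 words/second figure is the quoted peak ("up to"), so treat the result as a best-case lower bound rather than a measured benchmark.

```python
def estimated_seconds(total_words, words_per_second=20_000):
    """Best-case processing time at a constant throughput (peak figure assumed)."""
    return total_words / words_per_second

# Hypothetical corpus: one million documents of about 500 words each.
documents = 1_000_000
average_words_per_document = 500

total_words = documents * average_words_per_document  # 500,000,000 words
seconds = estimated_seconds(total_words)              # 25,000 seconds
hours = seconds / 3600

print(f"{total_words:,} words -> {seconds:,.0f} s (about {hours:.1f} hours)")
# prints "500,000,000 words -> 25,000 s (about 6.9 hours)"
```

Network transfer, request overhead and sustained (rather than peak) throughput are ignored here, so real end-to-end times will be longer.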
We've developed our own training data for our statistical models, drawing on various domains across newswire and social media. This means our models are unencumbered for use in commercial applications, unlike the majority of models distributed with open source NLP tools, which have been derived from restrictively licensed corpora.
TextRazor is constantly evolving, and we have some exciting plans for the future. We will be talking more about how it all works here on the blog. Follow us on Twitter to stay posted! In the meantime, we'd love to hear your thoughts. Before I go, I'd just like to say thanks to everyone who has been trying out TextRazor while it was hidden away, and to all those who've provided valuable feedback along the way.