Realtime Entity Indexing

Mon 17 September 2018

Many of our clients work with fresh news content, so being able to identify people, companies and products as they become newsworthy is a key requirement. While our statistical machine learning models can automatically identify new entities from the context they're used in, accuracy improves greatly when the models also incorporate real-world data and actual examples of usage in context.

Typically, entity recognition systems are trained on a static set of documents, which may be several years out of date. Our approach is a little different: we fully rebuild all of our models on a monthly basis. This gives our knowledge base improved coverage of newer entities and helps our models adapt to shifts in language over time. Of course, with today's 24-hour news cycle, even a few weeks can be a long time to wait.

We've brought this indexing latency right down in our latest release. We're now following live updates to new entities, building a special incremental index every single day. It should now take at most 24 hours for TextRazor to properly identify new people and companies. Anyone working with news or fresh social media content should already be seeing the benefits of this.
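
To make the idea of a daily incremental index sitting alongside a monthly full rebuild more concrete, here is a minimal sketch in Python. It is purely illustrative and not a description of TextRazor's internals: the class name, the method names, the dictionary-based storage and the choice to let the daily layer take precedence are all assumptions made for the example.

```python
from dataclasses import dataclass, field


def _normalise(surface_form):
    # Hypothetical normalisation: case-insensitive matching only.
    return surface_form.strip().lower()


@dataclass
class LayeredEntityIndex:
    """Illustrative two-layer lookup: a full base index plus a daily overlay."""

    base: dict = field(default_factory=dict)
    incremental: dict = field(default_factory=dict)

    def rebuild_base(self, entities):
        # Full rebuild (monthly in this sketch): replace the base and clear
        # the overlay, assuming the new base already contains those entities.
        self.base = {_normalise(k): v for k, v in entities.items()}
        self.incremental = {}

    def apply_daily_update(self, new_entities):
        # Daily build: merge newly seen entities into the overlay only.
        self.incremental.update(
            {_normalise(k): v for k, v in new_entities.items()}
        )

    def lookup(self, surface_form):
        # The overlay is consulted first so the freshest data wins.
        key = _normalise(surface_form)
        if key in self.incremental:
            return self.incremental[key]
        return self.base.get(key)


index = LayeredEntityIndex()
index.rebuild_base({"Acme Corp": "entity-1"})    # made-up identifiers
index.apply_daily_update({"NewCo": "entity-2"})  # entity that appeared today
```

With this layout, a lookup for "NewCo" succeeds as soon as the daily update has been applied, without waiting for the next full rebuild.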

We've also been quietly releasing a collection of smaller updates to the entity tagging system. Notably, we've just released a new time tagging system that is significantly more accurate than the previous version, while being twice as fast. We've made some classification improvements too: we've brought the analysis latency down by 50% for our larger classifiers, and added support for two new taxonomies, textrazor_iab_content_taxonomy and textrazor_mediatopics. Find out more on our classification page.
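
As a rough illustration of how the two new taxonomies might be requested, the sketch below builds the headers and form-encoded body for a classification call without actually sending it. The endpoint, the `x-textrazor-key` header and the comma-separated `classifiers` parameter reflect our understanding of the REST API and should be checked against the API documentation; the helper name `build_classification_request` and the placeholder key are hypothetical.

```python
from urllib.parse import urlencode

# Assumed REST endpoint; confirm against the API documentation.
API_ENDPOINT = "https://api.textrazor.com/"

NEW_TAXONOMIES = ("textrazor_iab_content_taxonomy", "textrazor_mediatopics")


def build_classification_request(text, api_key, classifiers=NEW_TAXONOMIES):
    """Return (headers, body) for a classification request (hypothetical helper)."""
    headers = {"x-textrazor-key": api_key}
    # Classifier names are joined with commas, then form-encoded with the text.
    body = urlencode({"text": text, "classifiers": ",".join(classifiers)})
    return headers, body


headers, body = build_classification_request(
    "Shares rallied after the central bank held interest rates.",
    api_key="YOUR_API_KEY",  # placeholder, not a real key
)
```

The resulting `headers` and `body` can be passed to any HTTP client as a POST to `API_ENDPOINT`; keeping request construction separate from sending makes it easy to inspect before spending API quota.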

As always, any feedback on this would be gratefully received!