
New Release: TextRazor Deep Spelling Correction is now available

Tue 17 January 2017

Typos in names and keywords can cause huge problems for text analysis systems, especially when working with noisy tweets or Facebook posts. Today we're pleased to announce a new Spelling Correction system designed to tackle noise in social data and other user-generated content. Our state-of-the-art algorithms check your content for spelling errors and generate ranked contextual corrections, with results easily integrated into your app through a simple API call.

TextRazor's Deep Spelling Corrector takes technology a step beyond the spellcheckers that you might find in a word processor:

  • Detect typos in millions of real world entities like people, places and brands that aren't necessarily in the dictionary.
  • Identify incorrect homophones - words with the same pronunciation but different meanings (mail vs male).
  • Generate a confidence score for each correction suggestion using a deep analysis of the sentence context.
  • Correct and score thousands of words per second.

Finding words that aren't in the dictionary is fairly straightforward, but generating correction suggestions can be more complex. In the sentence 'He caught the train frm Old Street.', for example, the system needs to know that you generally catch trains 'from' places, and that 'Old Street' is a place. Without this knowledge it's not possible to decide whether 'frm' should be 'from', 'form' or 'farm'. TextRazor uses a state-of-the-art neural-network-based language model to learn these patterns automatically, and is constantly evolving to learn new words.
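To make the 'frm' example concrete, here is a minimal sketch of context-based ranking. It is not TextRazor's model: it swaps the neural language model for a toy bigram counter trained on a five-sentence corpus that we made up for illustration, and every name in it (`CORPUS`, `edits1`, `suggest`) is hypothetical. It only shows the general idea of generating candidates one edit away from the typo and scoring each one against its neighbouring words.

```python
from collections import Counter

# Toy corpus (invented for this sketch); a real system learns from far more text.
CORPUS = (
    "he caught the train from london . "
    "she took the bus from old street . "
    "the train from paris was late . "
    "please fill in the form today . "
    "they visited the farm yesterday ."
).split()

VOCAB = set(CORPUS)
UNIGRAMS = Counter(CORPUS)
BIGRAMS = Counter(zip(CORPUS, CORPUS[1:]))


def edits1(word):
    """All strings one delete, transpose, replace or insert away from word."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def candidates(word):
    """Known words within one edit of the (possibly misspelled) word."""
    return sorted(w for w in edits1(word) if w in VOCAB)


def score(prev_word, cand, next_word):
    """Add-one smoothed bigram score using the left and right neighbours."""
    v = len(VOCAB)
    left = (BIGRAMS[(prev_word, cand)] + 1) / (UNIGRAMS[prev_word] + v)
    right = (BIGRAMS[(cand, next_word)] + 1) / (UNIGRAMS[cand] + v)
    return left * right


def suggest(prev_word, typo, next_word):
    """Return (candidate, normalised score) pairs, best first."""
    scored = [(c, score(prev_word, c, next_word)) for c in candidates(typo)]
    total = sum(s for _, s in scored)
    ranked = [(c, s / total) for c, s in scored]
    return sorted(ranked, key=lambda pair: -pair[1])


# 'frm' sits between 'train' and 'old' in the example sentence.
for word, confidence in suggest("train", "frm", "old"):
    print(f"{word}: {confidence:.2f}")
```

With this toy corpus 'from' ranks first, because 'train from' and 'from old' both occur in the training text while 'train form' and 'train farm' do not. A two-word window is far too small for real content, which is where a model that reads the whole sentence earns its keep.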

The new functionality is designed to power the next generation of smart content creation tools, and more generally help improve the accuracy and resilience of the TextRazor analysis pipeline. We're already using it to improve our Entity Extraction results. We've found that performing explicit spelling correction jointly with Entity Recognition can boost accuracy by several percent for both, even when working with relatively clean documents like news.

Person names benefit the most, being the most misspelled entities on the internet. In our research, 'Steve Carell' had the dubious honour of being the most misspelled celebrity on Twitter: getting the correct number of 'r's and 'l's is an impossible task for most (and it's not only the Tweeps who get this wrong!). It's also not easy to detect whether a person should be 'Steven' or 'Stephen' (à la Stephen Fry), something that is only possible with a ton of data on correct names. TextRazor can handle all of these cases.
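As a rough illustration of why a list of correct names matters, the sketch below matches a misspelled mention against a tiny gazetteer using Python's standard `difflib` module. This is a deliberately simple stand-in, not how TextRazor works: the three-name list, the `closest_name` helper and the 0.8 cutoff are all assumptions chosen for the example.

```python
import difflib

# Hypothetical gazetteer; a real one would hold millions of entities.
KNOWN_NAMES = ["Steve Carell", "Stephen Fry", "Steven Spielberg"]


def closest_name(mention, names=KNOWN_NAMES, cutoff=0.8):
    """Return the most similar known name, or None if nothing is close enough."""
    matches = difflib.get_close_matches(mention, names, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(closest_name("Steve Carrell"))
print(closest_name("Winnie the Pooh"))
```

String similarity alone cannot tell you that a particular 'Steven' should be 'Stephen'. That decision needs knowledge about who is being discussed, which is why correction works best when done jointly with entity recognition.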

You don't need to do anything to benefit from the improvement in Entity Extraction results, as the new system is now live. TextRazor can also return the scored suggestions that it generates, ready for you to integrate into your app. You just need to add the new spelling extractor to your analysis requests. You can also try the system through the demo: have a look at the new 'suggestions' column under the sentences tab.
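A request that enables the spelling extractor might look something like the sketch below, written against the plain HTTP interface using only the Python standard library. Treat the details as assumptions to check against the API reference: the endpoint URL, the `x-textrazor-key` header, the `spelling` extractor name and the `spellingSuggestions` / `suggestion` / `score` response fields are written from our reading of the docs, and the helper names are ours.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://api.textrazor.com/"  # assumed endpoint, see the API docs


def build_request(text, api_key):
    """Build (but do not send) a POST request with the spelling extractor on."""
    body = urllib.parse.urlencode(
        {"text": text, "extractors": "entities,spelling"}
    ).encode("utf-8")
    return urllib.request.Request(
        API_URL, data=body, headers={"x-textrazor-key": api_key}
    )


def extract_suggestions(payload):
    """Collect (token, suggestion, score) triples from a decoded JSON response.

    The field names used here are assumptions about the response layout.
    """
    triples = []
    for sentence in payload.get("response", {}).get("sentences", []):
        for word in sentence.get("words", []):
            for item in word.get("spellingSuggestions", []):
                triples.append(
                    (word.get("token"), item.get("suggestion"), item.get("score"))
                )
    return triples


request = build_request("He caught the train frm Old Street.", "YOUR_API_KEY")

# To actually send it (needs a real key and network access):
# with urllib.request.urlopen(request) as resp:
#     print(extract_suggestions(json.load(resp)))
```

Keeping the request building separate from the sending makes it easy to inspect exactly what will go over the wire before you spend any of your API quota.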

Please get in touch if you have any feedback!

"My spelling is Wobbly. It's good spelling but it Wobbles, and the letters get in the wrong places." - Winnie-the-Pooh