Words, Senses, Entities

Wed 02 October 2013

Language is noisy and ambiguous. As humans, we're really good at using context to understand the meaning of a sentence, but computers don't have it so easy. Consider the following two sentences:

  • "The bank charged $30 to go over my overdraft!"
  • "Dutch Drivers Take Tesla Model S 388 Miles On A Single Charge"

How can we extract useful information from these? It's obvious to a human that the word "bank" refers to a financial institution rather than the side of a river, but a computer needs to be told the difference. The word "charge" appears in both sentences: does it mean a financial payment or a battery charge? The two senses have very different implications for the underlying sentiment of the sentence.

We've just rolled out some improvements to our engine for handling cases like these, made possible by enhanced support for Word Sense Disambiguation in TextRazor's API.

TextRazor uses knowledge of the English language and deep semantic relationships between words to help choose the most likely sense of each word. This makes it possible to correctly extract the negative opinion in the first sentence above, and classify the second story under "electric cars".
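To make the idea of choosing a sense from context concrete, here is a minimal sketch of a simplified Lesk-style disambiguator, which picks the sense whose dictionary gloss shares the most words with the surrounding sentence. This is a classic textbook technique, not a description of how TextRazor's engine works, and the tiny sense inventory, glosses, and stopword list below are made up for illustration.

```python
import re

# Toy sense inventory: hypothetical glosses written for this example only.
SENSES = {
    "bank": {
        "financial_institution": "institution that holds money accounts loans overdraft fees",
        "river_bank": "sloping land beside a river or stream water",
    },
}

STOPWORDS = {"the", "a", "an", "to", "my", "or", "that", "of", "on", "over", "go"}

def tokenize(text):
    """Lowercase the text and return its set of non-stopword alphabetic tokens."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS}

def disambiguate(word, sentence):
    """Return the sense of `word` whose gloss overlaps most with the sentence."""
    context = tokenize(sentence)
    scores = {
        sense: len(context & tokenize(gloss))
        for sense, gloss in SENSES[word].items()
    }
    # On a tie, max() keeps the first sense listed in the inventory.
    return max(scores, key=scores.get)

sense = disambiguate("bank", "The bank charged $30 to go over my overdraft!")
```

Here the word "overdraft" in the sentence overlaps with the financial gloss, so that sense wins. A real system would need far richer context than word overlap, but the principle of scoring each candidate sense against its surroundings is the same.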

Linked Data

One of TextRazor's most popular features is Entity Linking and Topic Detection, which enables powerful search and discovery applications by creating connections between themes in your documents.

TextRazor uses the standard WordNet 3.1 lexical database as its sense inventory. Many WordNet senses can be unambiguously mapped to Wikipedia and other Linked Open Data sources, and these mappings can then be used as a strong signal for classifying ambiguous words in the entity disambiguation stage. For example, if we spot the word "Apple" in the context of eating, we know that it's highly likely to be the "apple (fruit with red or yellow or green skin and sweet to tart crisp whitish flesh)" sense, which is the same concept as http://en.wikipedia.org/wiki/Apple. This helps us decide that the word should not be linked to http://en.wikipedia.org/wiki/Apple_Inc, and it gives less support to the hypothesis that the document as a whole is about technology.
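As a rough sketch of how a sense-to-Wikipedia mapping could feed into entity disambiguation, the example below adds a bonus to whichever candidate entity the predicted sense maps to. The mapping table, the sense label, the prior scores, and the `boost` weight are all hypothetical values chosen for illustration, not TextRazor's actual data or scoring.

```python
# Hypothetical mapping from a sense label to the Wikipedia page it corresponds to.
SENSE_TO_WIKIPEDIA = {
    "apple (fruit)": "http://en.wikipedia.org/wiki/Apple",
}

def rerank_candidates(candidates, sense, sense_confidence, boost=1.0):
    """Rank candidate entities, boosting the one the predicted sense maps to.

    candidates: dict of Wikipedia URL -> prior linking score (made-up numbers).
    sense_confidence: confidence of the word sense prediction, in [0, 1].
    """
    mapped = SENSE_TO_WIKIPEDIA.get(sense)
    reranked = dict(candidates)  # copy so the caller's dict is not modified
    if mapped in reranked:
        reranked[mapped] += boost * sense_confidence
    return sorted(reranked, key=reranked.get, reverse=True)

candidates = {
    "http://en.wikipedia.org/wiki/Apple_Inc": 0.6,
    "http://en.wikipedia.org/wiki/Apple": 0.4,
}
ranking = rerank_candidates(candidates, "apple (fruit)", sense_confidence=0.9)
```

With a confident fruit sense, the fruit page overtakes the company page even though the company started with the higher prior. When the sense has no known mapping, the original ordering is left unchanged.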

TextRazor gets much of its disambiguation power from the huge amount of contextual information on millions of different entities in Wikipedia. Wikipedia is less helpful for common, everyday concepts, though, and we've found that combining it with a comprehensive lexical resource like WordNet can bring a big improvement in those areas.

Synonyms

Another useful TextRazor feature is the ability to return synonyms and other words related to your content. We use sense information to filter out synonyms that aren't relevant in context, enabling powerful context-aware search systems and more flexible text mining rules.
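The filtering idea can be sketched in a few lines: keep only the synonyms attached to senses that are likely in the current context. The synonym table, sense labels, and confidence threshold here are hypothetical and exist only to illustrate the approach.

```python
# Hypothetical synonym table keyed by (word, sense label).
SYNONYMS_BY_SENSE = {
    ("charge", "financial"): ["fee", "bill"],
    ("charge", "electrical"): ["recharge", "power up"],
}

def relevant_synonyms(word, senses, min_confidence=0.5):
    """Return synonyms only for senses whose confidence passes the threshold.

    senses: dict of sense label -> confidence score in [0, 1].
    """
    result = []
    for sense, confidence in senses.items():
        if confidence >= min_confidence:
            result.extend(SYNONYMS_BY_SENSE.get((word, sense), []))
    return result

# In the overdraft sentence, the financial sense would dominate.
synonyms = relevant_synonyms("charge", {"financial": 0.9, "electrical": 0.1})
```

A search system expanding the query "bank charge" would then match "fee" but not "recharge".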

API

We've just launched support for returning word sense information in the API: simply add "senses" to the "extractors" in your TextRazor requests and we'll return an array of candidate WordNet 3.1 senses for each word, along with a confidence score for each one. Even if you don't need explicit sense information in your app, you will hopefully notice a nice jump in entity extraction, topic detection, and synonym extraction performance.
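As a rough sketch, a request with the "senses" extractor might be built as follows using only the Python standard library. The endpoint URL, the `X-TextRazor-Key` header name, and the `score` field on each sense are assumptions made for illustration; check the API documentation for the exact request and response format.

```python
import urllib.parse
import urllib.request

API_URL = "https://api.textrazor.com/"  # assumed endpoint

def build_request(api_key, text, extractors=("words", "senses")):
    """Build (but do not send) a POST request asking for word senses."""
    body = urllib.parse.urlencode(
        {"text": text, "extractors": ",".join(extractors)}
    )
    return urllib.request.Request(
        API_URL,
        data=body.encode("utf-8"),  # supplying data makes this a POST
        headers={"X-TextRazor-Key": api_key},  # assumed header name
    )

def best_sense(word):
    """Return the highest-scoring candidate sense for one word, or None.

    Assumes each word is a dict with an optional "senses" list whose
    items carry a numeric "score" (hypothetical field names).
    """
    return max(word.get("senses", []), key=lambda s: s["score"], default=None)

request = build_request("YOUR_API_KEY", "The bank charged $30.")
# To send it: urllib.request.urlopen(request), then parse the JSON body.
```

Once the response is parsed, `best_sense` can be applied to each returned word to pick the most confident candidate, while the full array remains available if your app wants to weigh the alternatives itself.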