Named Entity Recognition

... the process of identifying People, Places, Companies, and other types of "Thing" in text, a crucial component of opinion extraction, document discovery and other text analytics applications.

Identification

TextRazor achieves industry-leading Entity Recognition performance by leveraging a huge knowledge base of entity details extracted from a number of sources, including Wikipedia, DBPedia, Wikidata, social data, and billions of crawled web pages. We also ingest entity data for specific verticals, for example company and product data, and drug data from the FDA.

A deep learning tagger identifies People, Places and Companies that we don't already know about using clues from the context surrounding each mention.

entities: [
	0: {type:[Country, Place, owl#Thing, PopulatedPlace], matchingTokens:[0], entityId:Spain,}
	1: {type:[Organization, Company, Organisation, owl#Thing], matchingTokens:[3], entityId:Bankia,}
	2: {matchingTokens:[10], entityId:Portfolio (finance), confidenceScore:1.11858,}
	3: {matchingTokens:[20, 21], entityId:Parent company, confidenceScore:1.24656,}
	4: {type:[owl#Thing, Company, Airline, Organisation, Organization], matchingTokens:[23, 24],}
	5: {matchingTokens:[26], entityId:Iberia (airline),}
]
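As a minimal sketch of how records like these might be consumed, the snippet below copies the sample output into plain Python dictionaries and filters them by type. The field names come from the sample above; the helper `entities_of_type` is a hypothetical illustration, not part of any TextRazor client. Note that `type`, `entityId` and `confidenceScore` are each missing from some records in the sample, so the code treats them as optional.

```python
# In-memory copy of the sample output above. Field names are taken from it;
# the dictionary layout itself is an assumption made for illustration.
entities = [
    {"type": ["Country", "Place", "owl#Thing", "PopulatedPlace"],
     "matchingTokens": [0], "entityId": "Spain"},
    {"type": ["Organization", "Company", "Organisation", "owl#Thing"],
     "matchingTokens": [3], "entityId": "Bankia"},
    {"matchingTokens": [10], "entityId": "Portfolio (finance)",
     "confidenceScore": 1.11858},
    {"matchingTokens": [20, 21], "entityId": "Parent company",
     "confidenceScore": 1.24656},
    {"type": ["owl#Thing", "Company", "Airline", "Organisation", "Organization"],
     "matchingTokens": [23, 24]},
    {"matchingTokens": [26], "entityId": "Iberia (airline)"},
]


def entities_of_type(records, wanted_type):
    """Return the records whose optional "type" list contains wanted_type."""
    return [r for r in records if wanted_type in r.get("type", [])]


companies = entities_of_type(entities, "Company")

# entityId is optional too, so use .get() rather than indexing.
company_ids = [r.get("entityId") for r in companies]
print(company_ids)
```

The second company in the sample carries types but no `entityId`, which is why the lookup uses `.get()` instead of assuming every field is present.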

Disambiguation

Entity Recognition is a hard task due to the ambiguity of written language. The word "Lincoln", for example, could refer to the President, a make of car, a place in England, or a place in the United States. Such mentions are incredibly common, and the only way to disambiguate them correctly is with a comprehensive model of real-world knowledge about each entity.

We build up a deep understanding of the semantic context of your document, and use it together with our huge knowledge base of real-life facts to look for disambiguation clues for each entity. We combine a number of different signals into a single confidence score for each entity. These range from shallow methods ("car" used in the same document as "Lincoln" lends support to the analysis that Lincoln is a car, "president" that it's a person), through graph-based methods (do lots of car-related web pages link to this page?), to deeper linguistic methods ("Lincoln" as the subject of the verb "crashed" means we can be fairly sure Lincoln is a car).

If there are multiple plausible entities, we run a further disambiguation process to distinguish between, for example, Lincoln, Nebraska and Lincoln, England (both places, but you'll probably want to know the difference if you're booking a holiday!).
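The idea of combining several signals into one score can be illustrated with a toy sketch. This is not TextRazor's model: the candidate list, the word sets, the weights and the function names are all invented for illustration, and only two of the signals described above (co-occurring context words and the verb a mention is the subject of) are modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    context_words: frozenset  # words that tend to co-occur with this sense
    typical_verbs: frozenset  # verbs this sense is plausibly the subject of


# Hypothetical two-entry "knowledge base" for the mention "Lincoln".
CANDIDATES = [
    Candidate("Lincoln (car)",
              frozenset({"car", "engine", "drove"}),
              frozenset({"crashed"})),
    Candidate("Abraham Lincoln",
              frozenset({"president", "speech"}),
              frozenset({"signed"})),
]

# Hypothetical weights: the linguistic signal counts double the shallow one.
W_CONTEXT = 1.0
W_VERB = 2.0


def score(candidate, document_words, subject_verb):
    """Weighted sum of a shallow context signal and a linguistic signal."""
    context_hits = len(candidate.context_words & document_words)
    verb_hit = 1 if subject_verb in candidate.typical_verbs else 0
    return W_CONTEXT * context_hits + W_VERB * verb_hit


def disambiguate(document_words, subject_verb):
    """Pick the candidate with the highest combined score."""
    return max(CANDIDATES, key=lambda c: score(c, document_words, subject_verb))


words = set("the lincoln crashed into a parked car".split())
best = disambiguate(words, "crashed")
print(best.name)
```

A real system would learn such weights from data and add further signals, such as the graph-based ones mentioned above; the fixed weights here only show the shape of the combination.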

Names

Names can be especially tricky to disambiguate. "Steven Allan Spielberg" is easy to identify, but "S. Spielberg", "Stephen Speelberg", "Spielbeg", "Спилберг, Стивен" all point to the same person. TextRazor can understand misspellings, homonyms, initials and translations. It draws on its knowledge base of millions of different people to help spot names, and also employs machine learning techniques to help identify those we don't know about.
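To give a feel for the misspelling case only, here is a small sketch that uses Python's standard `difflib` module to map a misspelled surname onto a known one. It is a stand-in technique, not a description of how TextRazor matches names, and it does not attempt initials or translations. The list `KNOWN_SURNAMES`, the helper name and the 0.8 cutoff are assumptions chosen for the example.

```python
import difflib

# Hypothetical miniature "knowledge base" of surnames.
KNOWN_SURNAMES = ["Spielberg", "Scorsese", "Tarantino"]


def closest_known_name(mention, known=KNOWN_SURNAMES, cutoff=0.8):
    """Return the known name most similar to mention, or None if none is close.

    Comparison is case-insensitive; similarity is difflib's sequence ratio.
    """
    lowered = {name.lower(): name for name in known}
    matches = difflib.get_close_matches(
        mention.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else None


print(closest_known_name("Spielbeg"))
print(closest_known_name("Speelberg"))
print(closest_known_name("Kubrick"))
```

Plain string similarity like this breaks down quickly on short names and on different scripts, which is one reason a knowledge base and learned models are needed in practice.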

Linked Data

TextRazor disambiguates and links entities to canonical IDs in the linked web. We return Wikipedia, DBPedia and Wikidata IDs. We also continue to return links to Freebase to support existing applications that depend on the now-frozen dataset.

Language Support

Entity Recognition, disambiguation and linking are supported in all of TextRazor's languages: Arabic, English, Chinese, Danish, Dutch, Finnish, French, German, Greek, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish and Ukrainian. Each language has its own intricacies, so we maximize performance by building models specifically for each.

Where an entity can be linked to a localized Wikipedia page that has a corresponding ID in the English Wikipedia we return both, allowing you to recognize entities in English regardless of the language of your content.

Type Support

TextRazor can identify thousands of different entity types. We use post-processed and normalized versions of DBPedia and Freebase to deduce the best types for each disambiguated entity. The list of supported types is growing all the time; you can find the full list here.

Freebase as a service closed in 2016, but we think there is still value in the rich Freebase type ontology. TextRazor automatically tags newly discovered entities with Freebase-compatible type data, so you're safe to build on Freebase Types for the future.

Freshness

Language is always changing: new products are launched and new people can become famous overnight. Our models are updated on a daily basis, so newly prominent entities are picked up quickly. We also completely rebuild our models from scratch every month to pick up larger shifts in language use.

API Calls

To retrieve disambiguated entities from your text, simply add the "entities" extractor to your request.
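As a rough sketch of what such a request might look like over REST, the snippet below builds (but does not send) a form-encoded POST using only the Python standard library. The endpoint URL, the `X-TextRazor-Key` header and the `text` parameter name are assumptions here and should be checked against the REST Documentation; `build_entities_request` and `YOUR_API_KEY` are placeholders invented for the example.

```python
import urllib.parse
import urllib.request

# Assumed endpoint; confirm against the REST Documentation before use.
API_ENDPOINT = "https://api.textrazor.com/"


def build_entities_request(text, api_key):
    """Build a POST request that asks for the "entities" extractor."""
    body = urllib.parse.urlencode(
        {"text": text, "extractors": "entities"}
    ).encode("utf-8")
    return urllib.request.Request(
        API_ENDPOINT,
        data=body,
        headers={"X-TextRazor-Key": api_key},  # assumed header name
        method="POST",
    )


request = build_entities_request("Lincoln crashed into a parked car.", "YOUR_API_KEY")
print(request.get_method(), request.full_url)

# Sending is left to the reader, since it needs a real key and network access:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

The official Python Client wraps these details, so in most applications it is likely the simpler route.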

Read more in our Python Client or REST Documentation.

You can try out the Entity Extraction system with your own documents through our Online Demo.