LLM Entity Types
We've just released an upgrade to our Named Entity typing system. It's now easier than ever to categorize and refine the entities returned from the TextRazor API.
The classic Named Entity Recognition problem involves identifying People, Companies and Places in text. Many applications rely on these "type" definitions to reduce noise from entities that aren't relevant to their users. In the finance domain, for example, it's common to filter out all but People and Companies.
The three main types are often too simplistic for production applications. There's a large grey area around the definitions of valid types, and many entities can reasonably be placed in more than one. For example, a mention of a hospital could be classified as either a Place or an Organisation. These particular differences might not be interesting to your application, but it's important to be consistent across documents and entities. This is a subtle area where LLM entity extraction approaches can fail on production data.
We're often interested in other entity types too, such as Products and Drug Names, and TextRazor has been tuned to find anything that might be useful as an entity. We've therefore put a lot of thought into our type taxonomies to make them as broadly useful as possible.
We're now introducing Wikidata Types alongside DBpedia and Freebase. The Wikidata type system is far more comprehensive, with thousands of types. Going back to the hospital example, Q41176/building, Q30139652/health care structure, Q123349660/geolocatable entity are returned in addition to Q43229/organization, capturing the subtle differences between types. This also allows you to choose granularity that best suits your application.
We derive these types from Wikidata itself, but don't strictly adhere to the source taxonomy. Wikidata entries are assigned an "instance of" property, but on its own this isn't much use. For example, "New York City" is assigned "city in the United States". Expanding these through the subclass hierarchy gives a raw set of types that is noisy to the point of being unhelpful: following the hierarchy, Q1439/Texas (the US state) expands to Q43229/organization. We've developed a process that prunes these spurious connections using an LLM and manual review.
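To make the pruning idea concrete, here is a minimal sketch of expanding a type through a subclass hierarchy with and without a list of rejected edges. The graph, the type names and the `PRUNED_EDGES` set are all invented for illustration; they are not real Wikidata edges and this is not our production process, where the rejected connections come from an LLM plus manual review.

```python
from collections import deque

# Toy subclass edges (child -> parents). Illustrative only, not real Wikidata data.
SUBCLASS_OF = {
    "us_state": ["administrative_region"],
    "administrative_region": ["place", "political_entity"],
    "political_entity": ["organization"],
}

# Edges judged spurious. A hypothetical stand-in for the LLM and manual review step.
PRUNED_EDGES = {("political_entity", "organization")}


def expand_types(direct_types, edges, pruned=frozenset()):
    """Return every type reachable via subclass edges, skipping pruned edges."""
    seen = set(direct_types)
    queue = deque(direct_types)
    while queue:
        current = queue.popleft()
        for parent in edges.get(current, []):
            if (current, parent) in pruned or parent in seen:
                continue
            seen.add(parent)
            queue.append(parent)
    return seen


raw = expand_types(["us_state"], SUBCLASS_OF)
cleaned = expand_types(["us_state"], SUBCLASS_OF, PRUNED_EDGES)

print(sorted(raw))      # includes "organization"
print(sorted(cleaned))  # "organization" is no longer reachable
```

Pruning individual edges rather than whole types keeps a type such as "organization" available for entities that reach it through a legitimate path.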
We've also made big improvements to our existing DBpedia and Freebase types. We continue to categorise new entities with Freebase-compatible types, which remain popular. We've improved the accuracy of that system, while also including DBpedia types in the process and expanding coverage in non-English languages. We've noticed more holes over time in the "official" DBpedia types; the new system extends coverage to the entities they miss.
As always, these changes are fully backwards compatible, so you should already be benefitting from the new system. If you'd like to use the Wikidata types, you'll find them in the "wikidataType" field of each entity.
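As a rough sketch of how you might filter on the new field, the snippet below works on entities that have already been parsed from a JSON response into Python dicts. It assumes "wikidataType" holds a list of strings in the "Q43229/organization" form used in this post, and the sample entities and the "matchedText" key are made up for illustration, so check the API documentation for the exact response shape.

```python
def wikidata_ids(entity):
    """Return the set of Wikidata type IDs (e.g. "Q43229") found on one entity.

    Assumes each value looks like "Q43229/organization"; a bare "Q43229" also works.
    """
    return {value.split("/", 1)[0] for value in entity.get("wikidataType", [])}


def filter_by_type(entities, wanted_ids):
    """Keep entities that carry at least one of the wanted Wikidata type IDs."""
    wanted = set(wanted_ids)
    return [entity for entity in entities if wikidata_ids(entity) & wanted]


# Hypothetical parsed entities, using the hospital types mentioned above.
sample_entities = [
    {
        "matchedText": "the hospital",
        "wikidataType": [
            "Q41176/building",
            "Q30139652/health care structure",
            "Q123349660/geolocatable entity",
            "Q43229/organization",
        ],
    },
    {"matchedText": "an entity without Wikidata types"},
]

organisations = filter_by_type(sample_entities, ["Q43229"])
buildings = filter_by_type(sample_entities, ["Q41176"])
print([entity["matchedText"] for entity in organisations])
```

Because the hospital carries both Q43229/organization and Q41176/building, it matches either filter, so you can pick whichever granularity suits your application.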