Entity Linking in the LLM Era

Mon 07 October 2024

We're pleased to share a big update to our Entity Linking system, with a focus on improving "Company" results.

Large Language Models (LLMs) have transformed the NLP landscape over the last few years, and can do a reasonable job of identifying people, places and companies in text simply through "zero-shot" prompting. However, even frontier models lag behind the state-of-the-art specialized models from a few years ago in terms of accuracy, speed and consistency, and can't easily disambiguate entities across documents. We're using LLMs a little differently: by harnessing their strength in data cleaning and organization, we've enhanced our data ingest and backend processes, improving consistency and making entity disambiguation more robust.

  • We've grown our company universe by expanding our crawl to company websites, and by ingesting new company registry data. Using an LLM, we now distill key company information from millions of websites, and use that context to improve our disambiguation process.

  • We've greatly expanded our company identifier mapping system, which covers PermID, Bloomberg FIGI, Crunchbase IDs and LEI. We now detect all publicly listed FIGI, and we're also mapping many more PermID and LEI than before. Our new mapping system uses an LLM to merge and disambiguate company records from our various sources, using features like their name, industry, description and web presence to identify duplicates and related entities. Since we do this offline, we don't slow down our main customer analysis pipeline, and we have time to use an ensemble of large, slow LLMs to avoid false positives in the mapping process.

  • We have developed a "human-in-the-loop" system that combines synthetic data from LLMs and manual review to augment our training data in niche domains, improving detection and disambiguation rates for smaller companies.

  • We're improving our handling of companies that are part of larger corporate groups or have international affiliations. In most cases, a mention of a company refers to the group as a whole. For instance, while "Google Commerce Limited" in Ireland supplies Google products to the UK, a news article mentioning Google is unlikely to be referring to that specific entity. Similarly, a mention of "Twitter" now generally refers to X, rather than the former Twitter Inc. Identifiers also vary in granularity; for example, distinct PermIDs are assigned to regional entities, whereas Wikidata IDs better represent colloquial usage. Our mapping process now formalizes these relationships, returning the most relevant ID for a given company. If a specific entity is explicitly mentioned, we return its identifier; otherwise, we return the parent entity's ID.
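To make the last rule concrete, here is a minimal sketch in Python of the "explicit entity, otherwise parent" fallback. It is an illustration only, not our production code: the `Company` record, the `REGISTRY` contents, the identifiers and the exact-name test for an "explicit mention" are all hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Company:
    """Hypothetical, simplified company record."""
    entity_id: str                    # made-up identifier, not a real PermID/LEI
    name: str                         # full registered name
    parent_id: Optional[str] = None   # None means this is the top of the group


# Toy registry with one group and one regional subsidiary (illustrative data).
REGISTRY: Dict[str, Company] = {
    "group-google": Company("group-google", "Google"),
    "sub-google-commerce": Company(
        "sub-google-commerce", "Google Commerce Limited", parent_id="group-google"
    ),
}


def ultimate_parent(entity_id: str, registry: Dict[str, Company]) -> Company:
    """Walk parent links to the top of the group, guarding against cycles."""
    seen = {entity_id}
    current = registry[entity_id]
    while current.parent_id is not None and current.parent_id not in seen:
        seen.add(current.parent_id)
        current = registry[current.parent_id]
    return current


def resolve_id(candidate_id: str, mention: str, registry: Dict[str, Company]) -> str:
    """Return the candidate's own ID if the mention names it in full,
    otherwise fall back to the ID of its ultimate parent."""
    candidate = registry[candidate_id]
    # Assumption: "explicitly mentioned" is approximated by a case-insensitive
    # exact match on the full registered name.
    if mention.strip().casefold() == candidate.name.casefold():
        return candidate.entity_id
    return ultimate_parent(candidate_id, registry).entity_id
```

With this toy data, `resolve_id("sub-google-commerce", "Google Commerce Limited", REGISTRY)` keeps the subsidiary's ID, while a plain mention of "Google" linked to the same candidate resolves to the group's ID.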

As always, we're careful to make sure any changes like this are fully backwards compatible, with scores normalised to similar ranges as before. There's therefore no need to make any changes to your integration, and the update is live on the API now.