COVID-19 and Wikidata QIDs
Over the past few weeks I’ve spoken with a number of our clients in the media monitoring space about how they're adjusting to the COVID-19 crisis. There have been a few common themes throughout, so I thought I'd share some best practices for those analyzing the huge amount of news and social content that is being generated on the subject each day.
We rebuild incremental models on a daily basis to pick up new Entity mentions, so “Covid-19” was on our radar pretty quickly back in January. As the pandemic has spread and our understanding of it has grown, so have its synonyms: “Wuhan flu”, “2019 nCoV” and “Novel coronavirus” are a small selection of the terms we're tracking. The Wikipedia page itself has been renamed and updated no fewer than 500 times in the past 3 days. This can all make the subject difficult to track, as many applications require a single stable “ID” to normalise Entity mentions and minimise clutter for their users.
Where possible we return canonical links to various open datasets, the Wikipedia title being the most user-friendly and most popular. The Wikipedia title can change, however, as we have seen with COVID-19. We therefore typically recommend the “Wikidata QID” as the best general-purpose identifier that is guaranteed to be stable over time. We return QIDs as part of both the "Entity" and "Topics" responses. A QID can be used to query the Wikidata API to retrieve a human-readable label in various languages.
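As a rough illustration of that last step, here is a minimal Python sketch that looks up labels for a QID using Wikidata's `wbgetentities` API action. The helper names (`build_label_url`, `extract_labels`, `fetch_labels`) and the User-Agent string are our own inventions for this example and are not part of any TextRazor client; treat it as a starting point under those assumptions, not a finished integration.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_label_url(qids, languages=("en",)):
    """Build a wbgetentities URL asking only for labels in the given languages."""
    params = {
        "action": "wbgetentities",
        "ids": "|".join(qids),
        "props": "labels",
        "languages": "|".join(languages),
        "format": "json",
    }
    return WIKIDATA_API + "?" + urllib.parse.urlencode(params)


def extract_labels(payload):
    """Turn a wbgetentities JSON payload into {qid: {language: label}}."""
    labels = {}
    for qid, entity in payload.get("entities", {}).items():
        labels[qid] = {
            lang: item["value"]
            for lang, item in entity.get("labels", {}).items()
        }
    return labels


def fetch_labels(qids, languages=("en",)):
    """Fetch labels over HTTP (hypothetical helper; needs network access)."""
    request = urllib.request.Request(
        build_label_url(qids, languages),
        # Illustrative User-Agent; replace with one identifying your application.
        headers={"User-Agent": "qid-label-example/0.1"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return extract_labels(json.load(response))
```

Calling `fetch_labels(["Q84263196"], ("en", "fr"))` should return a dictionary keyed by QID, with one label per requested language where Wikidata has one. Caching the result is sensible, since labels can be renamed even though the QID itself stays the same.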
TextRazor is currently returning several different QIDs relating to the pandemic, with slightly different meanings:
- Q84263196 - COVID-19 - The disease
- Q81068910 - 2019–20 COVID-19 pandemic - The current pandemic event
- Q57751738 - Coronavirus - The general Coronavirus family
- Q82069695 - SARS-CoV-2 - The virus that causes COVID-19
There are also a number of country-specific pages, 2020_coronavirus_pandemic_in_the_United_Kingdom for example. From an Entity Recognition standpoint these pages are rarely more useful than the “main” page, so you should see fewer of them as we explicitly detect and penalise them.
For consistency, we try to faithfully link to the most appropriate version of the above in Wikipedia and Wikidata. Colloquially the above are used interchangeably, so for most applications we’d recommend merging them together if the duplication is an issue.
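If you do decide to merge them, one simple approach is to map all four QIDs listed above onto a single canonical one before counting or displaying mentions. The sketch below assumes each entity has already been reduced to a plain dictionary with a `wikidataId` key; that shape, the choice of Q84263196 as the canonical ID, and the function names are assumptions made for illustration, so adapt them to however your application stores entities.

```python
from collections import Counter

# The four pandemic-related QIDs listed above.
COVID_QIDS = {
    "Q84263196",  # COVID-19, the disease
    "Q81068910",  # 2019-20 COVID-19 pandemic, the event
    "Q57751738",  # Coronavirus, the general family
    "Q82069695",  # SARS-CoV-2, the virus
}

# Assumption: we pick the disease QID as the merged identifier.
CANONICAL_QID = "Q84263196"


def merge_qid(qid):
    """Map any pandemic-related QID to the canonical one; leave others alone."""
    return CANONICAL_QID if qid in COVID_QIDS else qid


def count_by_merged_qid(entities):
    """Count entity mentions per QID after merging the pandemic QIDs.

    `entities` is assumed to be an iterable of dicts with a "wikidataId" key;
    entities without a QID are skipped.
    """
    counts = Counter()
    for entity in entities:
        qid = entity.get("wikidataId")
        if qid:
            counts[merge_qid(qid)] += 1
    return counts
```

Keeping the original QID alongside the merged one is worth considering, so that the finer distinction between the disease, the virus and the pandemic event is not lost for applications that need it later.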
For those looking for a higher level view, our IPTC and IAB classifiers have “virus” type sections of their taxonomies. For example, in IPTC Media Topics look for "health>diseases and conditions>communicable disease>virus disease". The categories themselves don’t change over time, but the classifiers behind them are constantly updated to detect changing language from the outbreak.
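Because the category paths are stable, filtering on them can be as simple as a prefix match. The following sketch assumes each classified category is available as a dictionary with `label` and `score` keys; that shape, the 0.5 score threshold and the function names are hypothetical choices for this example.

```python
VIRUS_TOPIC_PATH = "health>diseases and conditions>communicable disease>virus disease"


def is_virus_topic(label, root=VIRUS_TOPIC_PATH):
    """True if the label is the virus disease category or one of its children."""
    label = label.strip().lower()
    return label == root or label.startswith(root + ">")


def virus_topics(categories, min_score=0.5):
    """Keep categories under the virus disease path that meet the score threshold.

    `categories` is assumed to be a list of dicts with "label" and "score" keys.
    """
    return [
        category
        for category in categories
        if category.get("score", 0.0) >= min_score
        and is_virus_topic(category.get("label", ""))
    ]
```

Matching on the full path, not just the word "virus", avoids picking up unrelated categories such as computer security topics.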
While we run incremental daily indexing to pick up newsworthy new entities as they appear, we’re also increasing our regular full rebuild schedule to make sure we don’t miss any more subtle shifts in language. We have also been building out our “health” related dataset and regression tests such that we don’t miss any possible mentions. On the infrastructure side of things, we’ve seen an unprecedented volume of data analysed through the TextRazor API in the past few weeks, but our backend has been happily scaling up to handle the increased load.
It has been great to hear from those of you building new tools to make sense of the news surrounding COVID-19. If you are working on analysis projects in the space, please get in touch. We are always happy to help with academic and non-commercial projects.