What do we know about the degree of underreporting of COVID-19?

Last updated: 31 March; first posted: 24 March 2020

One acute contributor to uncertainty in this crisis is that we really don’t know how many people currently have, or have previously had, COVID-19.

First, some definitions.

  • A confirmed case is, or should be, “a person with laboratory confirmation of COVID-19 infection” (as defined in WHO situation reports); but different countries apply different standards, and standards have changed over time in some countries. Confirmed cases are typically reported both on a daily and on a cumulative basis. The cumulative figure therefore also includes people who went on to recover, as well as those who went on to die, and therefore is not equal to the number of current cases.
  • The number of confirmed cases is used to calculate the widely-discussed Case Fatality Ratio (CFR), which is simply the number of deaths attributed to COVID-19 divided by the number of confirmed cases.
  • For many reasons, we should care more about the number of actual cases, whether confirmed or not.
  • And similarly, we ultimately care more about the Infection Fatality Ratio (IFR) than about the CFR. The IFR is calculated by dividing the number of deaths attributed to COVID-19 by the number of actual cases.
  • Why? When you read about the estimated final attack rate (the proportion of a given population — a country, the whole world, a family, a cruise ship — who become infected over time), and want to translate that into potential fatalities, you need to apply the IFR rather than the CFR.
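To make the difference between the two ratios concrete, here is a minimal Python sketch. Every number in it is hypothetical and chosen only for illustration (including the assumed 10x underreporting and the 40% attack rate); the function name `fatality_ratio` is my own.

```python
def fatality_ratio(deaths: int, cases: int) -> float:
    """Deaths divided by cases; the same formula serves for CFR and IFR."""
    if cases <= 0:
        raise ValueError("cases must be positive")
    return deaths / cases


# Hypothetical inputs, for illustration only.
deaths = 50
confirmed_cases = 1_000   # laboratory-confirmed cases
actual_cases = 10_000     # ASSUMPTION: 10x underreporting

cfr = fatality_ratio(deaths, confirmed_cases)  # uses confirmed cases
ifr = fatality_ratio(deaths, actual_cases)     # uses actual cases

# Translating an assumed final attack rate into potential fatalities.
population = 1_000_000
attack_rate = 0.40        # ASSUMPTION: 40% eventually infected

projected_with_ifr = population * attack_rate * ifr
projected_with_cfr = population * attack_rate * cfr

print(f"CFR: {cfr:.1%}, IFR: {ifr:.1%}")
print(f"Projected deaths using IFR: {projected_with_ifr:,.0f}")
print(f"Projected deaths using CFR: {projected_with_cfr:,.0f}")
```

With these made-up inputs, applying the CFR to the attack rate would overstate projected fatalities by the same factor as the assumed underreporting, which is why the IFR is the right ratio for that calculation.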

Why do we think the number of confirmed cases underreports the number of actual cases?

A1: Only a small proportion of each country has been tested. I highly recommend Our World in Data’s section on testing. Here are the two key charts showing total tests and tests per million people:

[Note: since I posted this, Our World in Data withdrew their charts on testing as they no longer regard the data as reliable; see here for an explanation. I have left these charts as the best data I’ve found so far, but they should be regarded with suspicion.]

As you can see, there’s an enormous degree of variance in testing in both absolute numbers and relative to population.

In general, the lower the rate of testing per capita, the more sceptical we should be about the number of reported cases. Even South Korea has tested less than 1% of its population. But countries that test more of the population, and that aggressively test people at risk (those presenting symptoms, self-reporting or otherwise; those who may have had contact with an infected patient; etc.), are likely reporting as confirmed cases a higher percentage of the actual cases than other countries.

What other factors impact the reported data on confirmed cases?

We’ve frequently seen both lags between testing and reporting, and inconsistent frequency of reporting by some countries. This can lead to an apparent large jump in reported cases when several days’ worth of confirmed cases are reported all at once, which can appear to overstate the rate at which actual cases might be growing. Similarly, lags in reporting can make it appear at other periods that cases are not growing. My sense is that these issues are improving over time, with many countries falling into a regular and predictable rhythm of reporting; but I don’t know that for sure.

If a country is increasing its rate of testing, could that make it appear that cases are growing faster than they really are?

Yes! I’m so glad you asked that, because this is very likely happening and it may be hiding some of the progress that’s being made.

In simple terms, if a country is increasing the degree of testing over time, it is likely capturing an increasing proportion of the actual cases over time. During the period that the degree of testing is increasing (which I hope and believe is happening now in many countries to a significant degree), even a country that is seeing a slowing rate of growth of actual cases may appear to show a steady or increasing rate of growth of reported cases.

A simple example will illustrate this. Imagine that country Covland has 100 actual cases on day 1, and actual cases are growing at 10% per day. Now imagine that on day 1 it is only detecting 10% of cases, but then rapidly ramps the degree of testing to capture 10 percentage points more cases each day, up to 100%. Here’s what happens:

So Covland appears to have a much higher rate of growth than it actually does.
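The Covland scenario can be reproduced with a short Python sketch. It uses only the parameters stated in the example (100 actual cases on day 1, 10% daily growth, 10% detection on day 1 rising by 10 percentage points per day up to 100%); the function name `covland` and the 10-day horizon are my own choices.

```python
def covland(days=10, initial_cases=100.0, daily_growth=0.10,
            initial_detection=0.10, detection_step=0.10):
    """Return (day, actual, detection, reported) tuples for each day."""
    rows = []
    for day in range(1, days + 1):
        # Actual cases compound at a constant daily rate.
        actual = initial_cases * (1 + daily_growth) ** (day - 1)
        # Detection rises linearly and is capped at 100%.
        detection = min(1.0, initial_detection + detection_step * (day - 1))
        reported = actual * detection
        rows.append((day, actual, detection, reported))
    return rows


print(f"{'day':>3} {'actual':>8} {'detected':>9} {'reported':>9} {'growth':>7}")
prev_reported = None
for day, actual, detection, reported in covland():
    # Day-over-day growth in REPORTED cases; actual growth is always 10%.
    growth = "" if prev_reported is None else f"{reported / prev_reported - 1:.0%}"
    print(f"{day:>3} {actual:>8.0f} {detection:>9.0%} {reported:>9.0f} {growth:>7}")
    prev_reported = reported
```

On day 2, for instance, actual cases are 110 but reported cases go from about 10 to about 22, an apparent jump of roughly 120% in a single day against true growth of 10%. The gap narrows as detection approaches 100%.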

I think it’s highly likely that this is happening now in European countries.

Is there any way to assess the degree of underreporting to try to estimate the number of actual cases in each country?

Yes, there are several, but they all have limitations.

The mathematical epidemiologist Adam Kucharski has discussed this for some time, including in this accessible New York Times article (recommended).

A recent paper on this, published 23 March, uses Adam’s preferred methodology. While we don’t know the number of actual cases, we do have a strong sense of the number of deaths from COVID-19. We also have ranged estimates of the average time between infection and death, and (with very wide error bars) ranged estimates of the percentage of cases that result in death. Putting those together, we can look at the number of deaths that happen in a given country on a given day, and estimate how many people must have been infected 3-4 weeks ago to result in that many deaths.
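The logic of working backwards from deaths can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the method used in the paper: it assumes a single fixed lag and a single fixed fatality ratio, whereas the real analyses use ranges for both. The function names, the 21-day lag, the 1% IFR, and all the case and death counts below are hypothetical.

```python
def implied_infections(deaths_by_day, assumed_ifr=0.01, lag_days=21):
    """Map each day's deaths to the new infections implied lag_days earlier.

    ASSUMPTIONS: one fixed infection-to-death lag and one fixed IFR.
    """
    if not 0 < assumed_ifr <= 1:
        raise ValueError("assumed_ifr must be in (0, 1]")
    return {day - lag_days: deaths / assumed_ifr
            for day, deaths in deaths_by_day.items()}


def detection_fraction(confirmed_by_day, infections_by_day):
    """Share of implied infections that were confirmed, per day."""
    return {day: confirmed_by_day[day] / infections
            for day, infections in infections_by_day.items()
            if day in confirmed_by_day and infections > 0}


# Hypothetical data: deaths on days 30 and 31, new confirmed cases on days 9 and 10.
deaths_by_day = {30: 20, 31: 25}
confirmed_by_day = {9: 100, 10: 150}

infections = implied_infections(deaths_by_day)
fractions = detection_fraction(confirmed_by_day, infections)

for day in sorted(fractions):
    frac = fractions[day]
    print(f"day {day}: ~{infections[day]:,.0f} implied infections, "
          f"{frac:.0%} detected, ~{1 / frac:.0f}x underreporting")
```

The result is very sensitive to the assumed IFR: halving it doubles the implied number of infections, which is one reason estimates of this kind are best treated as order-of-magnitude figures.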

[30 March update: the authors updated their estimates; see here. Chart below updated to match.]

Here is the output of that analysis:

Of course, this is at best a rough estimate, and probably only useful for making order-of-magnitude estimates.

Back when I was maintaining my own COVID-19 models, I used a simplified version of this methodology and concluded that the US and many European countries had 15-30x underreporting (i.e., were detecting and reporting only 3-6% of total cases). Interestingly, even then, I estimated that Germany and the Scandinavian countries were reporting a high proportion of the actual cases.

However, while other estimates I’ve seen or heard about come to similar conclusions, this is not a universally held view.

As an example of a much more extreme estimate, using a different methodology, this article in the FT on 24 March discusses a paper with the claim that SARS-CoV-2 may have already infected half of the UK population. I am sceptical of this claim, but if it were true it would completely change our understanding of the disease and, in particular, would require us to revise down our estimate of the Infection Fatality Ratio.

On 30 March 2020, researchers at Imperial College published a paper (discussed in this article) which, while primarily oriented to assessing the effectiveness of control measures put in place in 11 European countries (conclusion: consistently effective in reducing the effective reproductive number), also tries to estimate the percentage of the population that is infected in these countries. Here is the key table, presenting ranged estimates at a 95% confidence interval:

I would be cautious with these figures as the study had to make a number of important assumptions (including that the impact of control measures was roughly consistent across these countries). Still, it gives us a sense of what the possible ranges might be.

If there is significant underreporting, doesn’t that mean that the Case Fatality Ratios we read are potentially overstated, and that the final Infection Fatality Ratios will be much lower?

This is such an important and complicated topic that it deserves its own page. I’ll try to get to that in the next few days, but in the meantime go read the section on this in Our World in Data (the anchor on their page isn’t working; scroll down). For now, I’ll say very simplistically: yes and no.

Yes, the CFRs we were hearing early in the epidemic from Wuhan of 3-5% are almost certainly too high as estimates of the final IFR. They likely suffer from underreporting, but they also reflect the fact that Wuhan’s medical system was overwhelmed. CFRs will vary significantly depending on context (age and health of population, availability of care, etc.).

Yes, if underreporting is high, reported CFRs will be greater than the actual IFRs.

However, some people with COVID-19 today will die, and so CFRs can also understate the ultimate IFR. To give an obvious example, if 100 people catch COVID-19 today, and 5 ultimately die of it, the CFR appears to be 0% until the first death happens.
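The 100-person example can be traced day by day with a small Python sketch. The days on which the 5 deaths occur are invented purely for illustration, as is the function name `naive_cfr_over_time`.

```python
def naive_cfr_over_time(cases, death_days, horizon):
    """Cumulative deaths / cases for each day from 0 to horizon (inclusive)."""
    return [sum(1 for d in death_days if d <= day) / cases
            for day in range(horizon + 1)]


cases = 100
death_days = [14, 18, 21, 25, 30]   # ASSUMPTION: hypothetical days of death

series = naive_cfr_over_time(cases, death_days, horizon=30)

for day in (0, 13, 14, 21, 30):
    print(f"day {day:>2}: apparent CFR {series[day]:.0%}")
```

With these assumed dates, the apparent ratio stays at 0% for the first 13 days and only reaches the eventual 5% once the last death has occurred. During a growing epidemic, recently infected people are always in this "not yet resolved" state, so a ratio computed mid-outbreak tends to be biased downwards for this reason.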

Why is this post labeled “draft”?

It’s a first effort. I need to get feedback and input for a period of time before being confident enough in the accuracy of the information I present here to remove “draft.” Of course, I’ll continue updating it after that.

