Tracking the COVID-19 infected population is of great interest to the public health community as it monitors the spread of the infection. There is no consensus in the academic community, however, on how best to estimate population infection totals, especially while testing remains a strained resource, and research groups have taken different approaches to estimating the prevalence of the virus in the general population. While many of these approaches rely on serosurvey methodologies, others leverage mortality time series data not only to estimate prevalence but also to measure how prevalence has changed over time.
The back calculation method, developed by our group led by Martina Morris at the University of Washington, is one such approach. In addition to mortality data, it relies on estimates of the age-specific infection fatality rate (IFR), an estimate of the lag between infection and mortality, and age-specific population data. Optionally, it also uses the age-specific case fatality rate (CFR) to distinguish between cases, individuals who are symptomatic, and infections, those who are asymptomatic or show only mild symptoms.
The cumulative total of infected persons at any time t is the sum of the infected persons who will be diagnosed at some point (Dx) and those who will never be diagnosed (nDx).
Note that the Dx infections at any time t are defined here as the infections that will ever be diagnosed, not just the cases already diagnosed and captured in the surveillance data; they are the sum of the infections that have already been diagnosed and those that will be diagnosed in the future.
The cumulative total number of infected persons at time t ($I_t$) is related to the cumulative death count in a future period ($M_{t+\mathrm{lag}}$) by the infection fatality rate (IFR) and the average lag time from infection to death (lag), using the simple backcalculation formula:
$$I_t = \frac{M_{t+\mathrm{lag}}}{\mathrm{IFR}}$$
The current Dx infections ($Dx_t$) are also related to the death counts in a future period, by the case fatality rate (CFR) and the same average lag time from infection to death, through an analogous formula:
$$Dx_t = \frac{M_{t+\mathrm{lag}}}{\mathrm{CFR}}$$
The nDx infections are then easily obtained by subtraction.
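The three quantities defined above can be sketched in a few lines of Python. This is a minimal illustration, not the group's actual implementation: the function name `back_calculate`, the death counts, and the IFR, CFR, and lag values are all hypothetical placeholders chosen only to make the arithmetic easy to follow.

```python
from typing import List, Tuple


def back_calculate(
    cum_deaths: List[float], ifr: float, cfr: float, lag: int
) -> List[Tuple[float, float, float]]:
    """Return (total, dx, ndx) cumulative infections for every day t
    that has an observed cumulative death count at day t + lag."""
    if not 0 < ifr <= cfr <= 1:
        raise ValueError("expected 0 < IFR <= CFR <= 1")
    if lag < 0:
        raise ValueError("lag must be non-negative")
    results = []
    for t in range(len(cum_deaths) - lag):
        future_deaths = cum_deaths[t + lag]
        total = future_deaths / ifr   # all infections at time t
        dx = future_deaths / cfr      # infections that will ever be diagnosed
        ndx = total - dx              # never diagnosed, by subtraction
        results.append((total, dx, ndx))
    return results


# Hypothetical inputs for illustration only (not real surveillance data).
cum_deaths = [0, 1, 3, 6, 10, 15]
estimates = back_calculate(cum_deaths, ifr=0.01, cfr=0.05, lag=2)
```

Because each estimate at time t needs deaths observed at t + lag, the resulting series is shorter than the mortality series by `lag` entries.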
A final implication of the definitions used for this method is that the fractions of infections that will and will never be diagnosed are a simple function of the two fatality rates. The intuition is that the IFR is the weighted sum of the CFR and the fatality rate among those never diagnosed, where the weights are the fractions of infections that are diagnosed and never diagnosed, respectively. We assume that all deaths due to infection are correctly ascertained, so the fatality rate for the nDx infections is 0. The expected fraction of infections that will be diagnosed is therefore
$$\mathrm{dxFraction} = \frac{\mathrm{IFR}}{\mathrm{CFR}}$$
and the expected nDx fraction is 1 − dxFraction. One implication is that these fractions are constant: they do not vary over time.
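The weighted-sum intuition can be checked numerically. The sketch below assumes, as the text does, a fatality rate of 0 among never-diagnosed infections; the helper name `dx_fraction` and the example rates are hypothetical.

```python
def dx_fraction(ifr: float, cfr: float) -> float:
    """Fraction of infections that will ever be diagnosed, assuming
    the fatality rate among never-diagnosed infections is zero."""
    if not 0 < ifr <= cfr <= 1:
        raise ValueError("expected 0 < IFR <= CFR <= 1")
    return ifr / cfr


# Hypothetical rates for illustration only.
ifr, cfr = 0.01, 0.05
frac_dx = dx_fraction(ifr, cfr)
frac_ndx = 1 - frac_dx

# The IFR should be recovered as the weighted sum of the two fatality rates.
ndx_fatality_rate = 0.0
implied_ifr = frac_dx * cfr + frac_ndx * ndx_fatality_rate
```

Since neither rate depends on t, the two fractions are the same at every point in the series.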
In the demonstration of the method, our team estimates the total infection count for King County using COVID-19 mortality data from Washington state's Department of Health. The methodology may be applied to any COVID-19 mortality time series, provided that the data capture nearly all COVID-19 related deaths for the associated geography. Age-specific population data for the state of Washington were taken from the State Office of Financial Management. Estimates and uncertainty for the IFR, CFR, and lag were taken from a recent publication by Verity et al., a study measuring epidemic statistics from Wuhan. At the time of the article's publication the epidemic in Wuhan had passed, although more recent writings hint at new cases developing and data being updated retroactively.
To arrive at the total number of infections, age-specific population data are used to create proportional weights that sum to one and match the age groupings in the Verity et al. publication. The weights are then multiplied by the age-specific IFR and summed to obtain the population-level IFR for King County. We take this approach rather than use the total population IFR from the Verity study because the age composition of the two populations differs. The mortality time series is then divided by the newly derived population IFR and shifted back by the lag to construct a time series of the total number of infections. The study uses several time lags in its analysis; however, the choice of mortality time lag has little effect on the end result. The estimated infected population is a single time series vector, but parameter uncertainty can be propagated by repeating the same process across the uncertainty range of the IFR estimates.
A similar process can optionally be used with the CFR to estimate the symptomatic population, which lets us distinguish between infections and cases in the general population.
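The weighting and back-calculation steps described above might look like the following sketch. All names (`age_weights`, `population_ifr`, `infections_series`), the age groups, the population counts, the IFR values and bounds, and the death counts are hypothetical placeholders; they are not the Verity et al. estimates or King County data.

```python
from typing import Dict, List


def age_weights(pop_by_age: Dict[str, float]) -> Dict[str, float]:
    """Proportional population weights that sum to one."""
    total = sum(pop_by_age.values())
    return {group: n / total for group, n in pop_by_age.items()}


def population_ifr(pop_by_age: Dict[str, float],
                   ifr_by_age: Dict[str, float]) -> float:
    """Population-level IFR as the age-weighted sum of age-specific IFRs."""
    weights = age_weights(pop_by_age)
    return sum(weights[g] * ifr_by_age[g] for g in weights)


def infections_series(cum_deaths: List[float], ifr: float,
                      lag: int) -> List[float]:
    """Cumulative infections at t, from cumulative deaths at t + lag."""
    return [cum_deaths[t + lag] / ifr for t in range(len(cum_deaths) - lag)]


# Placeholder inputs for illustration only.
pop_by_age = {"0-39": 500_000, "40-69": 400_000, "70+": 100_000}
ifr_mid = {"0-39": 0.001, "40-69": 0.01, "70+": 0.05}
ifr_low = {"0-39": 0.0005, "40-69": 0.005, "70+": 0.03}
ifr_high = {"0-39": 0.002, "40-69": 0.02, "70+": 0.08}
cum_deaths = [0, 2, 5, 9, 14, 20, 27]
lag = 3

central = infections_series(cum_deaths, population_ifr(pop_by_age, ifr_mid), lag)
# A lower IFR implies more infections per death, so it gives the upper bound.
upper = infections_series(cum_deaths, population_ifr(pop_by_age, ifr_low), lag)
lower = infections_series(cum_deaths, population_ifr(pop_by_age, ifr_high), lag)
```

Rerunning the same function with the lower and upper IFR bounds is one simple way to carry the IFR uncertainty through to the infection estimates; sampling from a fitted distribution would be an alternative.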
The end result is a time series, with uncertainty, of symptomatic and mild/asymptomatic individuals in King County. As of writing, the time series runs up to April 1st, with an estimated 45,000 individuals infected with COVID-19. Compare this to the much smaller number of 3,000 confirmed cases from testing results and we see that there is a large discrepancy between what the observed mortality numbers imply and what testing information is providing us, as seen in Figure 1.
Even if we evaluate testing against infections that will eventually be diagnosed rather than all infections, the number of active cases that we have captured is likely a large underestimate. According to our model, more than 80% of the population that should eventually be diagnosed has not yet been identified. Though this figure has been declining, as evidenced in Figure 2, with such a low capture rate of infected individuals it would be nearly impossible to implement measures such as effective contact tracing.
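One way to read the capture-rate figure is as the share of Dx infections not yet confirmed by testing. The sketch below assumes that reading; the function name `undiagnosed_share` and the input counts are hypothetical and are not the model's actual outputs.

```python
def undiagnosed_share(confirmed_cases: float, dx_infections: float) -> float:
    """Share of eventually-diagnosed infections not yet confirmed by testing."""
    if dx_infections <= 0:
        raise ValueError("dx_infections must be positive")
    # Clamp at zero in case confirmed cases exceed the Dx estimate.
    return max(0.0, 1.0 - confirmed_cases / dx_infections)


# Hypothetical counts for illustration only.
share = undiagnosed_share(confirmed_cases=2_000, dx_infections=10_000)
```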
Our group's work is ongoing, and the latest efforts involve pulling age-specific mortality data in order to reduce the uncertainty of the estimates. You can stay up to date with the team's work and get a more in-depth overview of the methodology and data from the working group's website.