This report presents a novel approach to estimate the total number of COVID-19 cases in the United States, including undocumented infections, by combining the Centers for Disease Control and Prevention’s influenza-like illness surveillance data with aggregated prescription data. We estimated that the cumulative number of COVID-19 cases in the United States by 4 April 2020 was > 2.5 million.
([Related article:] See the Editorial Commentary by Faust on pages 2952–4.)
During the COVID-19 pandemic, many infections with mild to no symptoms are not reported due to various factors, including limited testing [1, 2]. There is a critical need to estimate the true scale of the pandemic for hot-spot detection, resource allocation, and intervention planning. Existing modeling approaches use epidemiological data and digital technology/data [3–5] to estimate the scale of COVID-19.
In this report, we present a novel approach to estimating the total number of COVID-19 cases, including undocumented infections, in the United States (US). We compare data from the US Centers for Disease Control and Prevention (CDC) Outpatient Influenza-like Illness Surveillance Network (ILINet), which captures all influenza-like illness (ILI) and therefore overlaps with COVID-19, against aggregated prescription data for oseltamivir, which targets influenza only.
Our model shows that current official numbers severely underestimate the true case count: We estimate that by the week ending 21 March 2020, there were > 1.3 million total COVID-19 infections in the US and that by the week ending 4 April 2020, there were > 2.5 million total infections in the US.
The CDC defines ILI as “fever and a cough and/or a sore throat without a known cause other than influenza”, which covers the common symptoms of COVID-19. The CDC generates weekly reports on the ILI level and conducts laboratory-based virologic surveillance for influenza.
Prior to mid-February 2020, these 2 surveillance measures moved in the same direction. Since mid-February, however, the 2 measures have diverged, with the difference between ILI and laboratory-confirmed influenza activities attributable to COVID-19 [7, 8]. If we can obtain an accurate measure for influenza level, we can then use the difference between the reported ILI level and the estimated influenza level to estimate the level of new COVID-19 cases on a weekly basis.
We used aggregated weekly prescription data of oseltamivir, prescribed to treat influenza A and B but not COVID-19, to estimate the influenza level. Specifically, we used a linear model to calibrate the CDC-reported ILI level to the oseltamivir prescription data from January 2010 to mid-February 2020, and then produced estimates for influenza activity for mid-February to early April 2020 (Figure 1).
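The calibration step can be illustrated with a minimal sketch. It assumes a one-predictor ordinary least-squares fit of the ILI level on the weekly prescription volume; the report states only that a linear model was used, so the exact specification, the helper name `fit_line`, and all numbers below are hypothetical and are not the study's data.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up calibration weeks (NOT the study's data).
rx_history = [10.0, 20.0, 30.0, 40.0, 50.0]  # oseltamivir prescriptions, arbitrary units
ili_history = [1.5, 2.5, 3.5, 4.5, 5.5]      # reported ILI level, percent of visits

intercept, slope = fit_line(rx_history, ili_history)

# Made-up post-calibration weeks: prescriptions alone give the influenza estimate.
rx_recent = [45.0, 35.0, 25.0]
flu_estimate = [intercept + slope * rx for rx in rx_recent]
print(flu_estimate)
```

Because the fit uses only weeks before COVID-19 circulated, the fitted line expresses how much ILI activity a given prescription volume implies when influenza is the main driver; applying it to later weeks gives the influenza-only level against which the reported ILI level is compared.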
Our estimated influenza level (Figure 1, red line) closely matches the CDC-reported ILI level (Figure 1, black line) (correlation 0.974) prior to mid-February 2020, but significant gaps between the 2 levels emerge after mid-February, which can be attributed to COVID-19. For the week ending 21 March 2020, we estimated that 47% of the reported ILI level could be from COVID-19, which corresponds to approximately 855 000 new symptomatic cases in the US. As the official confirmed number of new cases was 17 450 for that week, this result shows that there were > 800 000 unreported symptomatic cases. The figure also shows that the cumulative number of COVID-19 symptomatic cases in the US was estimated to be > 2 million by the week ending 28 March 2020 and > 2.5 million by the week ending 4 April 2020.
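The gap arithmetic in this paragraph can be sketched as follows. The function `excess_share` and the weekly levels passed to it are hypothetical illustrations; only the two case counts are taken from the text. The conversion from an ILI share to a national case count is not reproduced here, because the report does not state the factors used.

```python
def excess_share(ili_level, flu_level):
    """Fraction of the reported ILI level not explained by estimated influenza."""
    gap = max(ili_level - flu_level, 0.0)  # a negative gap is treated as no excess
    return gap / ili_level


# Hypothetical weekly levels, for illustration only (not the study's values).
share = excess_share(ili_level=4.0, flu_level=3.0)

# Figures quoted in the text for the week ending 21 March 2020.
estimated_new_symptomatic = 855_000
officially_confirmed = 17_450

unreported = estimated_new_symptomatic - officially_confirmed
ratio = estimated_new_symptomatic / officially_confirmed
print(f"excess share: {share:.2f}")
print(f"unreported: {unreported}, estimated/confirmed ratio: {ratio:.1f}")
```

With the quoted figures, the difference is 837 550 cases, consistent with the > 800 000 stated above, and the estimate is roughly 49 times the confirmed count for that week.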
Our results show that the official numbers severely underestimate the true case count, a conclusion that appears to be supported by a recent large-scale screening study covering > 6% of the Icelandic population and by an antibody survey in Santa Clara County, California (although concerns have been raised about that study's design and potential sampling bias). Our study targeted symptomatic COVID-19 cases, as our estimation used the CDC-reported percentage of patients with symptomatic illness who seek medical care. Therefore, if we consider the substantial number of presymptomatic and asymptomatic cases revealed by the Icelandic study, the total number of COVID-19 infections in the US is likely to be even higher than our estimates.
Our estimation method is simple and intuitive. It contrasts the CDC-reported ILI level with the estimated influenza level from influenza-specific prescription data to obtain an estimate of the COVID-19 level. Our approach innovatively combined the traditional syndromic surveillance system with big data from pharmacy prescriptions. It provides a feasible solution for estimating unreported COVID-19 cases with mild symptoms.
One limitation of our model is that the estimate might become more conservative over time due to administrative and government interventions. Toward the start of April, the syndromic surveillance system ILINet was increasingly affected by changes in the healthcare system, including increased use of telemedicine, the recommendation to limit hospital visits to severe illness only, and tightened social distancing. These changes affect the total number of hospital visits, patients’ inclination to seek outpatient healthcare, and doctors’ prescribing behavior. Thus, our estimates for early to mid-March are likely to be more accurate, as these changes had not yet taken place, whereas our estimates for later weeks should be read as a lower bound on the number of symptomatic COVID-19 cases.
Our study indicates the feasibility of estimating COVID-19 case counts using multiple data sources. This approach can be used in conjunction with approaches that use digital data sources for COVID-19 case estimation [12, 13]. COVID-19 presents an unprecedented challenge. Conquering it requires unprecedented levels of collaboration and data sharing across government agencies, research institutes, and the private sector.
Potential conflicts of interest. L. G. is directly employed by Alvogen. All other authors report no potential conflicts of interest. All authors have submitted the ICMJE Form for Disclosure of Potential Conflicts of Interest. Conflicts that the editors consider relevant to the content of the manuscript have been disclosed.