Big data is not entirely reliable in managing the coronavirus outbreak

How often do you see a piece of economic or financial information revised upward by 45 per cent? And how reliable would you consider a data set that’s subject to such adjustments?

This is the problem confronting epidemiologists trying to make sense of the novel coronavirus spreading from China’s Hubei province. On Thursday, the tally there surged by 45 per cent — or 14,480 cases. The revision was largely due to health authorities adding patients diagnosed on the basis of lung scans to a previous count, which was mostly limited to those whose swab tests came back positive.
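
As a rough check on that arithmetic, the short Python sketch below backs out the tally implied before the revision. It assumes the 45 per cent figure is measured against the earlier count and is rounded, so the results are only approximate; the function name is illustrative.

```python
def implied_previous_tally(added_cases: int, pct_increase: float) -> float:
    """Back out the earlier count from the number of added cases and the
    percentage increase they represent (assumed relative to that earlier count)."""
    return added_cases / (pct_increase / 100.0)


added = 14_480  # cases added on Thursday, per the article
pct = 45.0      # reported percentage jump (rounded, so results are approximate)

previous = implied_previous_tally(added, pct)
print(f"Implied previous tally: about {previous:,.0f}")
print(f"Implied revised tally:  about {previous + added:,.0f}")
```

On those assumptions, the earlier count works out to roughly 32,000 cases, which gives a sense of how large the reclassification was relative to what had been reported before.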

The medical data emerging from hospitals and clinics around the world are invaluable in determining how this outbreak will evolve — but the picture painted by the information is changing almost as fast as the disease itself, and is not always of impeccable provenance. Just as novel infections exploit weaknesses in the body’s immune defenses, epidemics have an unnerving habit of spotting the vulnerabilities of the data-driven society we have built for ourselves.

That is not a comforting thought. We live in an era where everything seems quantifiable, from our daily movements to our internet search habits and even our heartbeats. At a time when people are scared and seeking certainty, it is alarming that the knowledge we have on this most important issue is at best an approximate guide to what is happening.

“It’s so easy these days to capture data on anything, but to make meaning of it is not easy at all,” said John Carlin, a professor at the University of Melbourne specialising in medical statistics and epidemiology. “There’s genuinely a lot of uncertainty, but that’s not what people want to know. They want to know it’s under control.”

That is most visible in the contradictory information we are seeing around how many people have been infected, and what share of them have died. While those figures are essential for getting a handle on the situation, as we have argued, they are subject to errors in sampling and measurement that are compounded in high-pressure, strained circumstances. The physical capacity to do timely testing and diagnosis cannot be taken for granted either, as my colleague Max Nisen has written.

Early case fatality rates for Severe Acute Respiratory Syndrome were often 40 per cent or higher before settling down to figures in the region of 15 per cent or less. The age of patients, whether they get sick in the community or in a hospital, and doctors’ capacity and experience in offering treatment can all affect those numbers dramatically.

Even the way that coronavirus cases are defined and counted has changed several times, said Professor Raina MacIntyre, head of the University of New South Wales’s Biosecurity Research Program: from “pneumonia of unknown cause” in the early days, through laboratory-confirmed cases once a virus was identified, to the current standard that includes lung scans. That is a common phenomenon during outbreaks, she said.

Those problems are exacerbated by the fact that China’s government has already shown itself willing to suppress medical information for political reasons. While you would hope the seriousness of the situation would have changed that instinct, the fact casts a shadow of doubt over everything we know.

How should the world respond amid this fog of uncertainty?

While every piece of information is subject to revision and the usual statistical rule of garbage-in, garbage-out, epidemiologists have ways to make better sense of what is going on.

Well-established statistical techniques can be used to clean up messy data. A study this week by Imperial College London used screening of passengers flying to Japan and Germany to estimate that the fatality rate for all cases was about 1 per cent. That is below the 2.7 per cent found among confirmed cases in Hubei province, but higher than the 0.5 per cent seen in the rest of the world.

When studies from different researchers using varying techniques start to converge toward common conclusions, that is also a strong if not faultless indication that we are on the right track. The number of new infections caused by each coronavirus case has now been estimated at around 2.2 or 2.3 by several separate studies, for instance, although that number can itself change as people quarantine themselves and self-segregate to prevent infection.
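
To see why that number matters, and why it shifts as behaviour changes, here is a minimal Python sketch. It assumes a deliberately simplified model in which every case infects the same number of people in each generation of spread; the function name and the lower value of 1.1 are illustrative assumptions, not estimates from any study.

```python
def cases_in_generation(r: float, generation: int, seed_cases: int = 1) -> float:
    """New cases in a given generation if every case infects `r` others.
    A toy model: ignores immunity, delays and variation between people."""
    return seed_cases * r ** generation


# 2.2 is the figure cited in the article; 1.1 is a hypothetical reduced value
# standing in for the effect of quarantine and self-segregation.
for r in (2.2, 1.1):
    sizes = [round(cases_in_generation(r, g)) for g in range(6)]
    print(f"R = {r}: {sizes}")
```

Even in this toy version, a modest drop in the number of infections per case makes a large difference after a few generations, which is why estimates of it are watched so closely.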

The troubling truth, though, is that in a society that expects to know everything, this most crucial piece of knowledge is still uncertain.

Google can track my every move and tell me where I ate lunch last week, but viruses do not carry phones. The facts about this disease are hidden in the activity of billions of nanometre-scale particles, spreading through the cells of tens of thousands of humans and the environments we traverse. Big data can barely scratch the surface of solving that problem.

David Fickling is a Bloomberg Opinion columnist.
