The Social Value of Models and Big Data

Posted on Tuesday, April 28, 2020

Authors: Prof. Kelly Bronson and Prof. Robert Smith?

Core Member and Inclusive Innovation Research Cluster Co-Lead, ISSP
Assistant Professor, Faculty of Social Sciences,

Faculty Affiliate, ISSP
Full Professor, Disease Modelling, Faculty of Science, uOttawa

On April 3, CBC News ran a story with the headline “COVID-19 could kill 3,000 to 15,000 people in Ontario, provincial modelling shows.” What does this very wide range of future scenarios say about disease modelling? And what does the fact that this wide range made news say about the place of modelling in policy and culture?

A public health policy approach, one that deals in aggregate statistics and makes decisions in the interest of the population, is the substructure behind the development and use of the mathematical modelling we now see reported almost every day. This approach has been in place for a century, at least in the West, and most of us now take it for granted, for example by getting vaccinated against influenza each year. But something is different today in our technical capacity to inform public health decisions: the omnipresence of data tracking.

The big novelty of our historical moment is that we live our lives shedding digitally collected data points (including every time we move about and click online). These data are brought together by various actors – from those trying to sell us toasters and jeans to mathematicians trying to predict disease spread and impact. Today, these data—so voluminous they are referred to as big data—are being drawn on and fed into mathematical models that help policymakers and individual people make some sense of mind-bogglingly complex and, frankly, frightening situations. Said differently, we can rely on big data and modelling for a sense of stability in times of uncertainty.

Models have two components. The first is mechanistic: describing how interactions occur between different actors, be they humans, animals, viruses, etc., or any combination thereof. The second is quantitative: putting numbers on those interactions, such as the transmission rate of a disease, the birth rate of a particular species, the rate of mutation of a drug-resistant virus and the like. Data inform both components: directly in the second case, and indirectly in the first, where patterns must be discerned from the information at hand.

There is little doubt that data and models are useful for helping us compensate for the frailties of human cognitive biases and distortions. They also inform important policy decisions like mandating physical distancing. Part of the appeal of models is that they are the only thing we have that can predict the future. (Crystal balls don’t exist!) Early modelling in the SARS, swine flu and MERS epidemics turned out to be broadly accurate when compared to the overall outcomes, suggesting that the tools we have are useful, despite the presence of unknown or incomplete data. Modelling, like science, is a precise process that often produces fuzzy outcomes; consequently, models must account for this degree of uncertainty and can compensate for lack of data by making multiple predictions simultaneously, reporting a range of scenarios rather than a single number.
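To see why a single model can report a wide range rather than one number, here is a minimal sketch. It assumes simple exponential growth, which is a deliberate simplification and not the method used in the Ontario modelling; the function name, starting case count and growth rates are all hypothetical values chosen for illustration.

```python
import math

def projected_cases(initial_cases, daily_growth_rate, days):
    """Project cases under simple exponential growth (an illustrative assumption)."""
    return initial_cases * math.exp(daily_growth_rate * days)

# Hypothetical inputs, not fitted to any real outbreak.
initial_cases = 100
days = 30
growth_rates = {"low": 0.05, "central": 0.08, "high": 0.11}

# One projection per plausible growth rate: the "multiple predictions" idea.
scenarios = {
    name: projected_cases(initial_cases, rate, days)
    for name, rate in growth_rates.items()
}

for name, cases in scenarios.items():
    print(f"{name:>7}: about {cases:,.0f} cases after {days} days")
```

In this sketch, growth rates that differ by only a few percentage points per day give 30-day projections that differ by a factor of about six, which is one reason honest modelling reports ranges.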

However, there are limitations to mathematical models. The essential idea behind modelling is to reduce complex information about the world to more easily digestible processes, from which decisions can be made. This is akin to making a map that includes key geographical features and ignores the rest through a process of selective ignorance: choosing what to include and what to ignore.

For example, models of COVID-19 usually describe the average susceptible person, the average infected person, the average recovered person, and so on. By design, they usually ignore outliers (it is of course possible to include them if they are deemed important – a decision made by humans). During the SARS epidemic, for instance, superspreaders (individuals who spread the disease at a much higher rate than most people) were a crucial vector and were included in many models.
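The "average person" idea can be sketched with a basic susceptible-infected-recovered (SIR) compartmental model. This is a textbook simplification offered for illustration, not a reconstruction of any specific COVID-19 model; the function name and every parameter value below (population size, transmission rate beta, recovery rate gamma) are hypothetical. Note what the sketch leaves out: everyone in a compartment behaves identically, so superspreaders and other outliers do not appear unless a modeller chooses to add them.

```python
def simulate_sir(population, initial_infected, beta, gamma, days, steps_per_day=10):
    """Euler integration of a basic SIR model; returns one (S, I, R) tuple per day."""
    dt = 1.0 / steps_per_day
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    history = [(s, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            # Mechanistic part: infections arise from contact between S and I.
            new_infections = beta * s * i / population * dt
            # Quantitative part: gamma sets how fast the average case recovers.
            new_recoveries = gamma * i * dt
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
        history.append((s, i, r))
    return history

# Hypothetical parameter values, chosen only to produce a visible epidemic curve.
population = 1_000_000
history = simulate_sir(population, initial_infected=10, beta=0.3, gamma=0.1, days=200)

peak_day, (_, peak_infected, _) = max(enumerate(history), key=lambda item: item[1][1])
final_s, final_i, final_r = history[-1]
print(f"Peak of about {peak_infected:,.0f} infected on day {peak_day}")
print(f"About {final_r:,.0f} recovered by day 200")
```

The structure of the update (who can infect whom) is the mechanistic component, and the numbers beta and gamma are the quantitative component that data would normally be used to estimate.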

Outliers may be few, but their experiences matter. Public health advice that is grounded in modelling may not account for the inequitable vulnerability of Canadians. Not everyone is equally able to physically distance, for example. Farmers and food system workers across food supply chains are uniquely vulnerable if they are facing difficulty sourcing farm inputs, accessing markets or bringing in farm workers who typically arrive from other countries.

Another limitation of models relates to uncertainty in the data: the accuracy of models diminishes the longer the prediction period. Just as the weather forecast is accurate for tomorrow, less accurate for next week and impossible to predict accurately for next year, models of chaotic systems lose predictability over time. But more data do not necessarily lead to better modelling outcomes. Famous statistician and modeller Nate Silver uses the aphorism that “big data creates bigger haystacks.” As we add more data points, it is often the case that we uncover many more statistically significant correlations or relationships among variables. Most of these correlations are spurious (not causally related) and therefore not necessarily informative. In fact, many of the correlations might be distracting and undermine our ability to find explanatory purchase.
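The "bigger haystacks" problem can be illustrated with pure noise. The sketch below generates variables that are independent random numbers, so any correlation between them is spurious by construction, and counts how many pairs nonetheless cross a correlation threshold. The function names, the seed and the threshold are assumptions made for this example; 0.361 is used as an approximate 5% significance cutoff for 30 observations.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def count_spurious_pairs(n_variables, n_observations=30, threshold=0.361, seed=0):
    """Count pairs of independent noise variables whose |r| exceeds the threshold."""
    rng = random.Random(seed)
    data = [
        [rng.gauss(0, 1) for _ in range(n_observations)]
        for _ in range(n_variables)
    ]
    count = 0
    for a in range(n_variables):
        for b in range(a + 1, n_variables):
            if abs(pearson(data[a], data[b])) > threshold:
                count += 1
    return count

few = count_spurious_pairs(10)    # 45 possible pairs
many = count_spurious_pairs(100)  # 4,950 possible pairs
print(f"10 variables:  {few} apparently significant pairs")
print(f"100 variables: {many} apparently significant pairs")
```

Because the number of pairs grows roughly with the square of the number of variables, tracking ten times as many variables yields far more "significant" relationships, none of which mean anything here.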

Furthermore, relying on models alone to get us through a crisis runs the risk of substituting quantitative data for qualitative explanations. The latter often contain the insights needed to design models in the first place. In 2008, Chris Anderson, the then-editor of WIRED magazine, declared that linguistics, sociology, psychology and the normal scientific process of hypothesis development and testing were all “dead” because “we can track and measure why people do what they do with unprecedented fidelity. With enough data, the numbers speak for themselves.”

But Anderson is wrong. Big data and the mathematical models they feed deliver some explanations but they do not do well when it comes to describing the social context around data. We have made great strides in amassing quantitative data, but we still need qualitative theory to interpret and build mechanistic relationships that exist in these data but that may not be visible without a deeper understanding of behaviour.

Human decisions are not discrete data points; they are enmeshed in sequences and contexts and contradictions. For example, models have done very little to explain why Germany and other European countries have shown such dramatically different COVID-19 outcomes despite similar rates of infection. This requires more sociological lines of inquiry: what is it about the daily habits and culture of Germans, versus say Italians, that influences the spread and impact of the disease?

We have all made dramatic changes to our lives to prevent the worst-case mathematical prediction (100,000 deaths in Ontario), but we have done so in large part because of an inability to care for this volume of sick people given a lack of capacity in our healthcare and medical supply systems. COVID-19 therefore begs for careful analysis of the fragilities contained in our healthcare system and our global supply chains. Messy context (and the qualitative data that often speak to it), theory and history are needed for an approach through COVID-19 that is grounded in data and modelling yet delivers something useful and equitable.
