26 March 2020
How well have naïve, simplistic models of the epidemic done?
I’ve mentioned before that I am a great fan of Philip Tetlock’s Superforecasting.
One thing he argues is that many so-called expert forecasts are mealy-mouthed, in the sense that they are so vague as to give plenty of wiggle room. “The market is overvalued and valuations will come back to earth” is almost certainly going to be true at some future date, allowing the expert to sound smart today and choose the point in time in the future at which he can point to his so-called forecast. So Tetlock, and forecasting tournaments in general, require forecasters to make specific predictions linked to an objectively measurable outcome on a given date.
Kahneman discusses the challenge of decision-making in “low-validity environments.” Roughly, these are situations that are not conducive to developing high-fidelity pattern recognition/intuition, because they are noisy (outcomes are not consistently correlated with the inputs); because the true outcomes come too long after the causally-related input; or because few individuals are exposed to decisions and their outcomes with enough frequency. Kahneman argues that simple algorithms perform best in such situations, versus “expert” judgement or complex models.
Below, I’ll share what happened when I decided to build a very simple model of a very complex domain I knew little about.
My naïve model: the backstory
The origins of this blog / newsletter go back to 28 February. Several friends, who are among the smartest human beings I know, and I were e-mailing each other about this thing called COVID-19.
I had stumbled upon some of the early modeling attempts that were being discussed online, including this great interactive widget that permitted users to input a few key assumptions and get an instant forecast of the epidemic. I plugged in some plausible figures for R0 (basic reproduction number) and post-control Re (effective reproductive number) and was shocked by what I saw.
I spent the next few days looking for published epidemiologic models and was disturbed by what I found. Several teams and individuals announced that they had built models, but no one was sharing the output. That seemed … worrying. Why weren’t they publishing their predictions?
I started reading up on epidemiological modelling techniques (this book was helpful) and realised that, as someone who hadn’t done serious maths in 30 years, it was going to take me too long to get up to speed. Plus, while these models clearly were soundly based on the underlying dynamics of how epidemics spread and ultimately end, I worried that they would be very sensitive to inputs that were essentially unknowable at this early stage, and therefore not useful for prediction.
I decided to try a very, very simple approach.
We know that epidemics, at least in their early stages, follow exponential curves. I downloaded the data (first from the WHO, then from Johns Hopkins’ outstanding resources, and later from my currently preferred source, Our World in Data) and did some simple curve-fitting.
Wanting to model a curve of the form y=m*e^(b*x), I chose a least-squares fitting of ln(y)=ln(m)+b*x, using Excel (a painful mistake as the model grew larger; note to self, must learn Python).
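For anyone who wants to see what this kind of fit looks like outside Excel, here is a minimal sketch in Python. It is not the spreadsheet I actually used: the function name fit_exponential and the sample data are made up for illustration, and the R² is computed on the log-transformed values, which is an assumption about how the fit is scored.

```python
import math

def fit_exponential(xs, ys):
    """Fit y = m * e^(b * x) by ordinary least squares on ln(y) = ln(m) + b * x.

    Returns (m, b, r2), where r2 is measured on the log scale (an assumption).
    """
    n = len(xs)
    ln_ys = [math.log(y) for y in ys]  # requires every y > 0
    x_mean = sum(xs) / n
    ln_mean = sum(ln_ys) / n

    # Standard simple-regression formulas for slope and intercept.
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (ly - ln_mean) for x, ly in zip(xs, ln_ys))
    b = sxy / sxx
    ln_m = ln_mean - b * x_mean

    # Goodness of fit on the transformed data.
    ss_res = sum((ly - (ln_m + b * x)) ** 2 for x, ly in zip(xs, ln_ys))
    ss_tot = sum((ly - ln_mean) ** 2 for ly in ln_ys)
    r2 = 1.0 - ss_res / ss_tot
    return math.exp(ln_m), b, r2

# Illustrative, synthetic data only: an exact exponential with m = 20, b = 0.15.
days = list(range(10))
cases = [20.0 * math.exp(0.15 * d) for d in days]

m, b, r2 = fit_exponential(days, cases)
print(f"m = {m:.2f}, b = {b:.4f}, r2 = {r2:.4f}")
```

Because the regression is done on ln(y), it minimises relative rather than absolute errors, so the small early counts carry as much weight as the large recent ones.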
By 4 March, I had my first model working. For a number of reasons, I chose to exclude China and to only forecast one output: the confirmed number of cases outside of Mainland China.
On the 6th, I shared the following precise forecasts with a close friend:
Date: Fri, 6 Mar 2020 14:25:18 +0000
Subject: exponential curve fit for total cases outside china
From: Christopher North
Using basic exponential growth curve on WHO data below. I ran it with data through 3 March and it underpredicted the next two days by 23%. I think the model is overfitting to the early days of reporting when the cases were probably even more severely underreported than today.
Predicted doubling time is about 2.5 days. Trebling time is almost exactly 7 days which makes for easy mental maths.
y=11.44*exp(0.1554x)
r2 = 0.9665
100K cases outside china: 19 March
200K: 23 March
500K: 29 March
1M: 3 April
5M: 13 April
10M: 18 April
20M: 22 April
Of course this is extremely crude. There are more interesting models out there.
Prediction: schools and most businesses close in around 3 weeks.
Chris
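The trebling time and the spacing of the milestone dates in that email both follow directly from the fitted growth rate of 0.1554 per day. The sketch below shows the arithmetic. The function names are hypothetical, and anchoring every milestone to the 100K date of 19 March and rounding to whole days are my assumptions here, not necessarily how the original spreadsheet produced the dates.

```python
import math
from datetime import date, timedelta

B = 0.1554  # daily growth rate from the fit quoted in the email

def days_to_multiply(factor, b=B):
    """Days for the fitted curve to grow by the given factor: ln(factor) / b."""
    return math.log(factor) / b

def milestone_date(known_date, known_cases, target_cases, b=B):
    """Date on which the curve reaches target_cases, given one known point.

    Assumption: round to the nearest whole day.
    """
    days = days_to_multiply(target_cases / known_cases, b)
    return known_date + timedelta(days=round(days))

print(f"Trebling time: {days_to_multiply(3):.2f} days")

# Assumed anchor: 100K cases on 19 March 2020, as in the email.
anchor = date(2020, 3, 19)
for target in (200_000, 500_000, 1_000_000, 5_000_000, 10_000_000, 20_000_000):
    print(f"{target:>10,}: {milestone_date(anchor, 100_000, target)}")
```

By the same formula, a doubling at this growth rate takes about 4.5 days, which is the gap between the 100K and 200K milestones (and longer than the 2.5 days quoted in the email).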
I then went on to update the model daily, regularly updating the forecasts and assessing how the model was performing. Whilst I made a number of improvements to data sources, and whilst I was painfully aware of the limitations of the model, I didn’t change the underlying methodology.
I went on to publish a daily update in the form of a newsletter to a few dozen friends starting on the 11th of March.
So, how did these predictions perform?
I was clearly wrong about when schools would close (although correct that they would close). Saying they would close “in around 3 weeks” was mealy-mouthed, of course. I made the prediction on 6 March (in the UK, referring to UK schools) so that implied a prediction for 27 March. In fact, UK school closures were announced on the 20th of March to take effect on the 23rd; I was a week, or one-third, off.
In terms of forecasts for specific milestones, here’s a table showing what I forecast vs what has actually happened or is highly likely to happen:
(The “Predicted” column is based on taking the most recent actual and doing simple extrapolation; it’s not an output of the model.)
It’s important to say that I don’t feel a high degree of conviction about the forecasts going through the rest of April. There are many things that stop or slow the exponential growth phase of an epidemic (control measures like those we’re seeing around the world; behavioural change; vaccines; mutations to the virus; weather changes; etc.). And I’m fully aware of the many limitations of my model; indeed, from the beginning, I highlighted them (you can find them in the FAQs in the first newsletter here).
I’ll also add that I’m surprised that the model’s predictions have held up so well. The model isn’t aware of things like control measures, social distancing, or herd immunity. It can’t predict when the “curve will flatten” or when the epidemic will reach its peak and begin to die out. It simply assumes that exponential growth will continue along the same path.
With that in mind, it’s very depressing, as well as a clear failure of collective action, that despite having ample information at the end of February to know that we were facing the prospect of a serious global epidemic, we are still debating today in some countries and cities whether we need to take serious action.
As I mentioned in my last post, in passing:
The doubling time in the US has been around 3 days for more than two weeks now. Let’s put that in mathematical context: if the US continues to double the number of cumulative confirmed cases every 3 days for another 30 days, the number of cases will increase roughly 1,000-fold (ten doublings, or 2^10 = 1,024). A 1,000-fold increase in the number of confirmed cases would take us from 55,000 cases on 25 March to around 55 million cases!
Again, to be clear, I don’t think this is what will happen. But the maths say it could happen if the US does not take very significant actions soon.
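The arithmetic behind this kind of statement is just repeated doubling: 30 days at a 3-day doubling time is ten doublings, a factor of 2^10 = 1,024. Here is a small sketch of the calculation, with hypothetical function names. It assumes the doubling time stays constant for the whole period, which is exactly the assumption I don’t expect to hold.

```python
def fold_increase(days, doubling_time_days):
    """Growth factor after `days`, assuming a constant doubling time."""
    return 2 ** (days / doubling_time_days)

def projected_cases(current_cases, days, doubling_time_days):
    """Naive projection: current cases times the growth factor."""
    return current_cases * fold_increase(days, doubling_time_days)

factor = fold_increase(30, 3)
cases = projected_cases(55_000, 30, 3)
print(f"Fold increase over 30 days: {factor:,.0f}")
print(f"Projected cumulative cases: {cases:,.0f}")
```

Changing the doubling time from 3 days to 6 days cuts the 30-day factor from 1,024 to 32, which is why slowing the growth rate matters so much.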