COVID-19 predictions, the Dunning-Kruger effect and the Hippocratic oath of a data scientist

Despite all the temptation, I have stayed away from building models or publicly sharing insights and forecasts based on the various public COVID-19 data sources. Some close friends have asked why I haven’t done so.

I told them that I am not qualified to do so.

I will digress for a few minutes. I promise that I will come back to make my point.

Pittsburgh, 1995: Two men rob a bank in broad daylight without wearing a mask or disguise of any sort, even smiling at surveillance cameras on their way out. Later that night, police arrest one of the robbers, who is in utter disbelief. The man and his accomplice believed that rubbing lemon juice on their skin would render them invisible to surveillance cameras, as long as they did not go close to a heat source.

Motivated by the Pittsburgh robbery, Kruger and Dunning at Cornell University decided to conduct a study of how people mistakenly hold favorable views of their abilities and skills. The study was eventually published in 1999 as ‘Unskilled and Unaware of It: How Difficulties in Recognizing One's Own Incompetence Lead to Inflated Self-Assessments’.

The Dunning-Kruger effect is a cognitive bias that leads to inflated self-assessments. People who are less experienced (less skilled, less competent, or less self-aware) not only make mistakes, but also fail to realize their mistakes. On the other hand, experts (people with more knowledge and experience) tend to be more self-critical and aware of their shortcomings.

At the Data Science and Data Engineering bootcamp, about halfway through, trainees reach the peak of their confidence, thanks to the amazing tools available for doing data science. Within a few lines of code one can get an amazing visualization or a complex model without having to worry about how these libraries are implemented. Most trainees are amazed at how easy data science, AI and machine learning are. I show them an illustration of the Dunning-Kruger effect and point them to the peak of Mount you-know-what. (This has always been taken in good humor, except when one attendee actually got offended by it. I have not stopped giving this example.) A few more modules into the bootcamp, when asked to do some more feature engineering and tune the model parameters, attendees are puzzled, even frustrated.

On one occasion, one of the attendees exclaimed, and I quote here:

‘How is this machine learning? Why do I have to do all the feature engineering, data cleaning, and parameter tuning myself? Why can’t we automate this?’

When asked a similar question, I often respond like this: ‘Welcome to real-world data science and machine learning. Data science and machine learning are much more than R, Python, DeepLearning and TensorFlow’.

So why am I writing this blog?

With the COVID-19 outbreak, a lot of people have started sharing their work on available data sources. I love the creativity and effort put into the work. I have seen cool visualizations in every possible tool out there. I have seen models, including forecasts of how many cases will emerge in a country the next day/week/month. In most cases, I find these insights and conclusions not just disturbing, but also downright irresponsible.

If you are not familiar with at least the basic principles of epidemiology, economics, public policy and healthcare policy, please stop drawing conclusions that mislead and scare people or, for that matter, give them a false sense of comfort.

A few months ago I created an infographic called 'Hippocratic oath of a data scientist', inspired by the mathematical modelers' Hippocratic oath.

Next time you decide to share any insights or make recommendations on economic, public or healthcare policy, ask yourself some of these questions:

  • Are you aware of something called a confounding variable?
  • Does population density impact the spread of the virus?
  • Have you considered GDP, HDI, and other economic indicators in your model?
  • Can social norms influence the spread of disease? For instance, all cultures greet in their own unique way. Bowing, kissing one’s cheek, hugging, shaking hands or just nodding are some of the ways people from different cultures greet each other.
  • Can a western democracy impose a lockdown similar to those in China and Singapore?
  • Singapore recently introduced fines for failing to maintain social distance. In how many other countries would this work?
  • If you lived paycheck to paycheck, or worked for daily wages, would your conclusions be the same?
  • Put a small business owner and an expert in infectious diseases in the same room. Will they agree on the right course of action?
  • If we put a few experts in epidemiology, economics, healthcare policy, public policy, and psychology in the same room, will they agree on what measures should be taken?
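
To make the first two questions concrete, here is a minimal sketch in Python of how a confounding variable can mislead. The numbers are invented purely for illustration and are not real COVID-19 data; the variable names (`mask_sales`, `cases`, `density`) are hypothetical. In this toy data set, mask sales and case counts look strongly positively related when all regions are pooled, but the relationship reverses once regions are grouped by population density.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented toy data, NOT real figures: (density, mask_sales, cases).
# Density is the confounder: dense regions have both more mask sales
# and more cases than sparse regions.
regions = [
    ("low", 1, 12), ("low", 2, 11), ("low", 3, 10),
    ("high", 7, 32), ("high", 8, 31), ("high", 9, 30),
]


def correlation_for(rows):
    """Correlation between mask sales and cases for the given rows."""
    mask_sales = [row[1] for row in rows]
    cases = [row[2] for row in rows]
    return pearson(mask_sales, cases)


# Naive analysis: ignore density and pool every region together.
pooled = correlation_for(regions)

# Stratified analysis: compute the correlation within each density group.
by_density = {
    level: correlation_for([row for row in regions if row[0] == level])
    for level in ("low", "high")
}

print(f"pooled correlation: {pooled:.2f}")
for level, r in by_density.items():
    print(f"{level}-density correlation: {r:.2f}")
```

Pooled, the correlation is about 0.94, which a careless reader might take as "more masks, more cases". Within each density group it is -1. Neither number proves causation; the point is only that leaving out one variable can flip the sign of a conclusion.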

Two of the points in the oath are particularly relevant here:

On Impact: My model may impact lives, society and the economy. I will make sure that everyone is aware of the possible pitfalls of my model.

On Predictions: I will remember that not everything can be predicted.

Let’s demonstrate more social responsibility by not sharing insights unless we understand what we are doing. I accept that there are things that I do not know or understand, and I am happy with that.

I, for one, will not share any forecasts or public policy recommendations on the COVID-19 outbreak.


This is a companion discussion topic for the original entry at https://blog.datasciencedojo.com/hippocratic-oath-of-a-data-scientist/