It’s time to address the reproducibility crisis in AI

Recently I interviewed Clare Gollnick, CTO of Terbium Labs, on the reproducibility crisis in science and its implications for data scientists. The podcast seemed to resonate strongly with listeners (judging by the number of comments we’ve received via the show notes page and on Twitter), for several reasons.

To sum up the issue: Many researchers in the natural and social sciences report not being able to reproduce each other’s findings. A 2016 Nature survey indicated that more than 70 percent of researchers have tried and failed to reproduce another scientist’s experiments, and more than half have failed to reproduce their own experiments. This concerning finding has far-reaching implications for the way researchers perform scientific studies.

One contributing factor to reproducibility failure, Gollnick suggests, is “p-hacking” — that is, examining your experimental data until you find patterns that meet the criteria for statistical significance, before you have settled on a specific hypothesis about the underlying causal relationship. P-hacking is known as “data fishing” for a reason: You’re working backward from your data to a pattern, which breaks the assumptions under which statistical significance is computed in the first place.
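To see why this breaks the statistics, consider a quick simulation (a hypothetical Python sketch, not drawn from the podcast): generate a purely random outcome, scan a hundred equally random candidate predictors, and a few will clear the conventional p < 0.05 bar by chance alone.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# A purely random "outcome" and 100 purely random candidate predictors:
# by construction, no predictor has any real relationship to the outcome.
outcome = rng.normal(size=50)
predictors = rng.normal(size=(50, 100))

# "Fish" through the predictors for the one with the smallest p-value.
p_values = [stats.pearsonr(predictors[:, j], outcome)[1] for j in range(100)]
best = int(np.argmin(p_values))
print(f"best predictor #{best}: p = {p_values[best]:.4f}")

# At the 5 percent level, roughly five of the 100 tests are expected to look
# "significant" even though every true effect here is zero.
```

A pattern found this way will almost never replicate, which is exactly the failure mode the Nature survey describes.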

Gollnick points out, though, that data fishing is exactly what machine learning algorithms do: they work backward from data to patterns or relationships. Data scientists can thus fall victim to the same errors made by natural scientists. P-hacking in the sciences, in particular, is similar to developing overfitted machine learning models. Fortunately for data scientists, it is well understood that validating on held-out data, by which researchers generate a hypothesis on a training dataset and then test it on a separate validation dataset, is a necessary practice. As Gollnick put it, testing on the validation set is a lot like making a very specific prediction that’s unlikely to occur unless your hypothesis is true, which is essentially the scientific method at its purest.
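That discipline translates directly into code. Below is a minimal sketch (hypothetical, using scikit-learn on synthetic data rather than anything from the interview) that fits a flexible model on a training set whose labels are pure noise; the near-perfect training accuracy is the overfitted “pattern,” and the held-out validation set is the specific prediction that exposes it.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = rng.integers(0, 2, size=500)  # labels are deliberately pure noise

# Form the "hypothesis" on training data only ...
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# ... then test that specific prediction on data the model never saw.
print("training accuracy:  ", accuracy_score(y_train, model.predict(X_train)))  # ~1.0: overfit
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))      # ~0.5: no real signal
```

A model that has merely memorized noise cannot pass that test, just as a p-hacked hypothesis rarely survives a replication attempt.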


Beyond the sciences, there’s growing concern about a reproducibility crisis in machine learning as well. A recent blog post by Google research engineer Pete Warden speaks to some of the core reproducibility challenges that data scientists and other practitioners face. Warden references the iterative nature of current approaches to machine and deep learning and the fact that data scientists are not easily able to record their steps through each iteration. Furthermore, the data science stack for deep learning has a lot of moving parts, and a change in any of these layers — the deep learning framework, GPU drivers, or training or validation datasets — can affect the results. Finally, with opaque models like deep neural networks, it’s difficult to understand the root cause of differences between expected and observed results. These problems are further compounded by the fact that many published papers fail to explicitly mention many of their simplifying assumptions or implementation details, making it harder for others to reproduce their work.
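There is no single fix for all of this, but the basic hygiene is easy to state: pin every source of randomness you control and record the versions of everything you don’t. Here is a minimal sketch of that habit in Python (the specific fields and filename are illustrative, not taken from Warden’s post):

```python
import json
import platform
import random
import sys

import numpy as np

SEED = 42

# Pin the random sources under our control.
random.seed(SEED)
np.random.seed(SEED)
# Deep learning frameworks expose their own seeding calls (e.g. torch.manual_seed);
# GPU kernels and driver versions can still introduce nondeterminism.

# Record the parts of the stack that can silently change results between runs.
run_manifest = {
    "seed": SEED,
    "python": sys.version,
    "platform": platform.platform(),
    "numpy": np.__version__,
}
with open("run_manifest.json", "w") as f:
    json.dump(run_manifest, f, indent=2)
```

Checking a manifest like this into version control alongside the training code, the data version, and the resulting model weights is a small step toward being able to retrace an experiment months later.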

Efforts to reproduce deep learning results are further confounded by the fact that we really don’t know why, when, or to what extent deep learning works. During an award acceptance speech at the 2017 NIPS conference, Google’s Ali Rahimi likened modern machine learning to alchemy for this reason. He explained that while alchemy gave us metallurgy, modern glass making, and medications, alchemists also believed they could cure illnesses with leeches and transmute base metals into gold. Similarly, while deep learning has given us incredible new ways to process data, Rahimi called for the systems responsible for critical decisions in health care and public policy to be “built on top of verifiable, rigorous, thorough knowledge.”

Gollnick and Rahimi are united in advocating for a deeper understanding of how and why the models we use work. Doing so might mean a trip back to basics, maybe as far back as the foundations of the scientific method. Gollnick mentioned in our conversation that she’s been fascinated recently with the “philosophy of data” — that is, the philosophical exploration of scientific knowledge, what it means to be certain of something, and how data can support such claims.

It stands to reason that any thought exercise that forces us to face tough questions about issues like explainability, causation, and certainty could be of great value as we broaden our application of modern machine learning methods. Guided by the work of philosophers of science like Karl Popper and Thomas Kuhn, as well as the 18th-century empiricist David Hume, this kind of deep introspection into our methods could prove useful for the field of AI as a whole.

The original version of this story appeared in the This Week in Machine Learning & AI newsletter. Copyright 2018.

Sam Charrington is host of the podcast This Week in Machine Learning & AI (TWiML & AI) and founder of CloudPulse Strategies.