“most published research findings are probably false” – John Ioannidis


The worlds of academic and commercial research are currently riven by concerns and accusations about the poor quality of much published research and the conclusions drawn from it. The problem is not specific to market research; it covers health research, machine learning, biochemistry, neuroscience, and much more. It relates to the way tests are created and interpreted. One of the key people highlighting these concerns is John Ioannidis of Stanford University, whose work has been reported in both academic and popular forums (for example, The Economist). The quote “most published research findings are probably false” comes from Ioannidis.

Key Quotes
Here are some of the quotes and worries floating about at the moment:

  • America’s National Institutes of Health (NIH) – researchers would find it hard to reproduce at least three-quarters of all published biomedical findings
  • Sandy Pentland, a computer scientist at the Massachusetts Institute of Technology – three-quarters of published scientific papers in the field of machine learning are bunk because of this “overfitting”
  • John Bohannon, a biologist at Harvard, submitted an error-strewn paper on a cancer drug derived from lichen to 350 journals (as an experiment); 157 accepted it for publication

Key Problems
Key problems that Ioannidis has highlighted, and which relate to market research are:

1. Studies that show an unhelpful result are often not published, partly because they are seen as uninteresting. For example, if 100 teams look to see if they can find a way of improving a process and all test the same idea, we’d expect about 5 of them to have results that are significant at the 95% level, just by chance. The 95 tests that did not show significant results are not interesting, so they are less likely to be published. The 5 ‘significant’ results are likely to be published, and the researchers on those teams are likely to be convinced that the results are valid and meaningful. However, these 5 results would not have been significant if all 100 had been considered together. This publication bias is widely associated with failures to replicate results.

2. Another version of the multiple-tests problem arises when researchers gather a large amount of data and then trawl it for differences. With a large enough data set (e.g. Big Data), you will always find things that look like patterns. Significance tests are only valid if the hypotheses are created BEFORE looking at the results.

3. Ioannidis has highlighted that researchers often base their study design on implicit knowledge, without necessarily intending to, and often without documenting it. This implicit process can push the results in one direction or another. For example, a researcher looking to show that two methods produce the same results might be drawn to questions that are more likely to produce the same answers. Asking people whether they are male or female is likely to produce the same result across a wide range of question types and contexts. By contrast, questions about products that participants are less attached to, for example emotional associations measured on a 10-point scale, are likely to be more variable, and therefore less likely to be consistent across different treatments.

4. Tests have a property called their statistical power, which in general terms is the ability of the test to avoid Type II errors (false negatives). The tests in use in neuroscience, biology, and market research typically have a much lower statistical power than the optimum. This led John Ioannidis in 2005 to assert that “most published research findings are probably false”.
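The 100-teams arithmetic in problem 1 above can be sketched with a short calculation. This is purely illustrative; the numbers (100 teams, a 0.05 significance threshold) are the article's hypothetical example, and the probabilities assume the tests are independent:

```python
# Illustrative sketch of problem 1: 100 teams independently test the
# same (actually ineffective) idea at the 95% significance level.
ALPHA = 0.05   # per-test false-positive rate at the 95% level
N_TEAMS = 100

# Expected number of teams that see a "significant" result by chance alone.
expected_false_positives = N_TEAMS * ALPHA

# Probability that at least one team sees a chance "significant" result,
# assuming the 100 tests are independent.
p_at_least_one = 1 - (1 - ALPHA) ** N_TEAMS

print(f"Expected chance 'hits': {expected_false_positives:.0f} of {N_TEAMS}")
print(f"P(at least one chance hit): {p_at_least_one:.3f}")
```

With these assumptions, about 5 of the 100 teams report a ‘significant’ finding, and the chance that at least one team does is above 99% – even though the idea being tested does nothing at all.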
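The data-trawling issue in problem 2 above can be made concrete the same way. The figure of 1,000 candidate patterns is a hypothetical assumption for illustration; the Bonferroni correction shown is one standard (if conservative) remedy, not something the article prescribes:

```python
# Illustrative sketch of problem 2: trawling a large data set.
# Suppose we test 1,000 candidate "patterns" (all actually noise)
# at the usual 0.05 threshold.
ALPHA = 0.05
N_TESTS = 1_000  # hypothetical number of patterns trawled

# Expected number of spurious "patterns" found by chance.
expected_spurious = N_TESTS * ALPHA

# Probability of at least one spurious hit across the whole trawl
# (assuming independent tests) - effectively a certainty.
family_wise_error = 1 - (1 - ALPHA) ** N_TESTS

# A Bonferroni correction shrinks the per-test threshold so the whole
# trawl keeps roughly a 5% overall false-positive rate.
bonferroni_alpha = ALPHA / N_TESTS

print(f"Expected spurious patterns: {expected_spurious:.0f}")
print(f"P(at least one spurious hit): {family_wise_error:.6f}")
print(f"Bonferroni per-test threshold: {bonferroni_alpha}")
```

In other words, a trawl of 1,000 noise variables will throw up around 50 ‘patterns’ at the 0.05 level, which is why hypotheses must be fixed before looking, or the threshold adjusted for the number of tests run.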
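The low-power point in problem 4 above can be illustrated with a rough normal-approximation calculation. The effect size (d = 0.5) and cell size (30 per group) are hypothetical values chosen for illustration, not figures from the article or from Ioannidis:

```python
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def two_sample_power(effect_size, n_per_group, z_crit=1.96):
    """Approximate power of a two-sided, two-sample z-test.

    effect_size is the standardised difference between groups
    (Cohen's d); n_per_group is the sample size in each group.
    A normal-approximation sketch, not a full power analysis.
    """
    noncentrality = effect_size * sqrt(n_per_group / 2)
    return normal_cdf(noncentrality - z_crit)

# A "medium" effect (d = 0.5) tested with 30 respondents per cell:
print(f"Power: {two_sample_power(0.5, 30):.2f}")   # roughly 0.49

# Sample size per group needed for 80% power at d = 0.5:
# n = 2 * ((z_alpha + z_beta) / d)^2, with z_beta = 0.8416 for 80% power
n_needed = 2 * ((1.96 + 0.8416) / 0.5) ** 2
print(f"n per group for 80% power: {n_needed:.0f}")  # roughly 63
```

A test with power around 0.5 misses a real medium-sized effect about half the time, which is the kind of shortfall Ioannidis and others point to, and which also inflates the proportion of published ‘significant’ results that are false positives.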

Market Research?
What should market researchers make of these tests and their limitations? Test data is a basic component of evidence for market research. Researchers should seek to add any new evidence they can acquire to that which they already know, and where necessary do their own checking. In general, researchers should seek to find theoretical reasons for the phenomena they observe in testing – rather than relying solely on test data.

However, let’s stop saying tests “prove” something works, and let’s stop quoting academic research as if it were “truth”. Things are more or less likely to be true; in market research, and indeed most of science, there are few things that are definitely true.

The ‘science’ underpinning behavioural economics, neuroscience, and Big Data (to name just three) should be taken as work in progress, not ‘fact’.

Is Ioannidis Right?
If we are in the business of doubting academic research, then it behoves us to doubt the academic telling us to be more skeptical. There are people who are challenging the claims. For example, an article published on RealScience.com in January 2013 claims that the real figure for bad biomedical research is ‘just’ 14%, rather than three-quarters.

10 thoughts on ““most published research findings are probably false” – John Ioannidis”

  1. Interesting. Love the Salmon fMRI story. A few questions…

    1. Does Mr. Ioannidis insist that his surname not be capitalized or was this a grammatical typo? If the former, then I deduct 20 credibility points for pretentiousness.

    2. Key Problem #2 – “Tests can only be run if the hypotheses are created BEFORE looking at the results.” In other words, researchers should guess before analyzing data?

    3. “In general, researcher(s) should seek to find theoretical reasons for the phenomena they observe in testing – rather than relying on solely on test data.” BRAVO! Perhaps the most insightful statement in the article.

    Happy 2014.

  2. Thanks Jim, I can’t see anywhere in the article where there is a small I for Ioannidis?

    The standard assumption in testing is that the hypotheses must be created before the data are inspected and tested. So, the normal process is to look at data, listen to clients, draw from experience and come up with suggestions, hypotheses, stories etc. However, these can only be ‘tested’ in other data sets.

  3. Ah yes, the meta-analysis problem and publication bias. After combing through every journal to find every article on your topic, you can be guaranteed you’ve only found the 5% that generated results that were strangest in some way – very significant, very large, or very weird. The other 95% that reflect reality are in the wastebasket waiting to be burned.

    Give someone 1 significant p-value and they go nuts. Hey…. I’ve got data tables with 4,000 p-values….

  4. Statistical significance tests should be replaced with confidence intervals. And in many cases done away with altogether, then we would focus on the actual data.

    This would be a great advance. BUT a single study is still untrustworthy and tells us little, we need verifications/extensions under different conditions, by different researchers, using different measures and instruments. Only then can we find out where in the real world the finding might hold.

  5. I confess that of late I’ve read several health research news releases in the local and national newspapers that I read. Typically, I see an excited headline touting a revolutionary finding, and upon reading lower in the column I learn that the sample size was 22 (e.g.). I wonder, on reading such stories, how much alteration to the release has been made by the editors, and whether such editorial attention is coupled with an understanding of statistical principles. I suggest that the problem Prof. Ioannidis finds may be more complicated than it would appear at first glance. Still, the professor must know that using the word “most” is totally unscientific, though the word “false” does have its place in statistical parlance. “Most” doesn’t tell me exactly how big the problem is. And does he include in the “false” category research that properly disclaims its own shortcomings as a suggestion for more research? Bottom line, I will never believe that “most” published research findings are false; this is simply too broad a general assertion.

  6. I agree with this entirely; that what isn’t significant doesn’t get published, as well as the other points. We talked about that ‘publish or perish’ phenomenon at length in grad school.

    Last summer Leny ran a simple blog post of mine re Type I & II errors and ‘insights’ in MR: http://www.greenbookblog.org/2013/09/06/raise-your-hand-if-the-truth-starts-at-05/#comments.

    This entire topic is the elephant in the room and eventually it will be noticed by study sponsors, in my opinion.

  7. I couldn’t agree more with this: “The ‘science’ underpinning behavioural economics, neuroscience, and Big Data (to name just three) should be taken as work in progress, not ‘fact’.”

    That’s absolutely the case – one good example of this is the research on the link between glucose and mental energy via ego depletion (a reasonable summary here: http://www.nytimes.com/2011/08/21/magazine/do-you-suffer-from-decision-fatigue.html?pagewanted=all). However, recent research (http://www.ncbi.nlm.nih.gov/m/pubmed/24389240/) has now suggested that is perhaps not the case.

    I also agree that we should stop quoting academic research as “truth”. However, in my view one of the issues especially in MR when it comes to using academic research is the low level of engagement with the scientific/academic community. If your only exposure to e.g. behavioural economics comes from three books (say, Predictably Irrational, Nudge & Thinking, Fast and Slow), you’re basing your knowledge on a handful of studies cherry-picked to tell a story. There’s nothing necessarily wrong with the studies quoted in these books but it is a narrow set of information to rely on. Instead, we’d be better off having a broader, deeper understanding of human decision making in general, against which we could then assess all new research we see. Findings in one particular paper might be false, but by having a sense of entire bodies of work helps in assessing what seems credible and not place too much weight on any particular paper without seeing it in a broader context.

    I don’t think the answer is to doubt EVERYTHING academic research is telling us – not all research findings are false, even taking into account every bias under the sun. In my view, any research is better than no research. Even if the process is messy and flawed, I believe academic research into the psychology of decision making still manages to improve our understanding of human nature and even if one particular researcher on his/her own may not be able to achieve Truth, as a collective it’ll eventually get closer and closer to it. Embracing the messiness and incremental nature of research doesn’t undermine the validity of the process itself.

    The problem in MR is that, firstly, we often lack the time and resources to fully understand the big picture of these academic subjects that we are borrowing for our work: it takes a lot of time to develop a deep knowledge of a subject like behavioural economics, an investment we can’t always afford. Secondly, even if we, as researchers, did understand and embrace the messiness of scientific research, we will need to simplify it for clients who may not have the same depth of knowledge, which then often means we end up claiming more faith in a particular piece of research than it perhaps would otherwise merit.

    Instead of focusing on p-values and statistical significance, should we be thinking about Bayesian statistics instead? Even if that might be a more useful conversation to have, the likelihood of the entire industry changing course is so small (we’d have to educate everyone from agencies to clientside and overhaul… everything). It might be a better way of describing the world through numbers, but we won’t even consider it. Like most of the scientific community, we’re stuck in our (flawed) ways of doing things because the barrier to changing is too high, so we end up doing things in a way that’s “good enough”. In the end, there is no such thing as a “fact” in either academic or commercial research when it comes to social sciences: data isn’t waiting “out there” to be collected like apples on a tree. Everything from the way we ask questions to the way we collect answers constructs a particular view of reality, and “blaming” our analysis methods slightly misses the point. It’s all a proxy for “reality”, whatever that is, and we do the best job we can in interpreting it.

  8. In addition – stop applying statistics that assume population normal distribution to non-normal data.

Comments are closed.