Researchers left at crossroads after doubts cast on scientific probability test

The American Statistical Association has sent shockwaves through the research community by voicing concern that “misunderstanding or misuse” of techniques used to identify the probability of findings being repeated is leading to claims that too often fail to stack up.

It’s amazing what scientists have found out about human behaviour. They have shown that watching a heart-warming film clip can make us more patient, while how we feel about relatives can be influenced by where we put dots on graph paper.

Such unusual insights have long been a staple of media coverage of science – the “amazing but true” tales we all know and love.

Whether we believe them is, however, another matter. Certainly doubts about the reliability of such claims have long circulated among academics.

Now these doubts are fuelling a controversy about the future of the US$1.5 trillion (Dh5.5tn) global research enterprise.

It centres on the reliability of the techniques routinely used to decide if a finding is worth taking seriously.

This month, the American Statistical Association (ASA) sent shockwaves through the research community by voicing concern that “misunderstanding or misuse” of these techniques is leading to claims that too often fail to stand up.

The unprecedented public statement follows the failure of attempts to replicate findings published in research journals – among them those claims about heart-warming film clips and dots on graph paper.

Such failures are causing “much confusion and even doubt about the validity of science”, according to the ASA, which is now calling for “renewed and vigorous attention to changing the practice of science”.

It is hard to overstate the implications of the ASA’s statement. Replication is the acid test of science, with a track record of weeding out faulty, flawed or fraudulent claims.

The ASA is now raising concerns about the reliability of techniques playing a key part in that process.

Taught to generations of researchers, so-called significance testing is supposed to cast light on the likely success of replication. At its core is a figure known as the p-value, which is worked out from raw data.

This measures the chances of getting at least as impressive an outcome as that seen, assuming it’s really just a fluke.

By convention, if the p-value comes out at less than one in 20 (0.05), the outcome is deemed “statistically significant”, on the grounds that it’s unlikely to be a fluke.
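As a rough illustration of that calculation, the sketch below works out a p-value for a made-up experiment: a coin that lands heads 60 times in 100 tosses. The function name and the coin example are assumptions for illustration, not anything taken from the ASA statement. It simply adds up the chances of every outcome at least as extreme as the one seen, assuming the coin is fair.

```python
from math import comb


def binomial_p_value(successes: int, trials: int, chance: float = 0.5) -> float:
    """One-sided p-value: the probability of `successes` or more in `trials`
    attempts, assuming only chance (probability `chance` per attempt) is at work."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical experiment: 60 heads in 100 tosses of a supposedly fair coin.
p = binomial_p_value(60, 100)
print(f"p-value = {p:.4f}; below one in 20: {p < 0.05}")
```

The figure that comes out is the chance of seeing 60 or more heads if the coin is fair. It is not the chance that the coin is fair.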

Of course, a study result can be misleading for many other reasons, from dodgy design to faulty equipment. But at least mere chance has been ruled out with 95 per cent reliability.

Except it has not – and believing otherwise is precisely the misunderstanding that the ASA is worried about.

The belief that p-values measure the chances of a result being just a fluke is alarmingly widespread, and even pops up in many textbooks. But as they are calculated on the assumption that fluke is the true cause, p-values clearly cannot also be used to test if that assumption is true.

Doing so is akin to assuming a rule is accurate, and then claiming to show it by measuring the distance between two points using the same rule.

The good news is that it is possible to convert p-values into the chances of a finding being a fluke. The bad news is that most researchers do not know how to do it. Instead, they simply flip p-values below one in 20 around to convince themselves the chances of their result being real are thus 19 in 20.
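One simplified way to do the conversion is Bayes’s theorem, which needs two extra ingredients that a p-value alone does not supply: how plausible the effect was before the study, and how likely the study was to detect it if it is real. The sketch below is an illustration only. The function name and the inputs (a one in 10 prior chance of a real effect, and an 80 per cent chance of detecting it) are assumptions, not figures from the ASA or from the replication studies, and the calculation treats every result below the threshold alike rather than using the exact p-value.

```python
def chance_result_is_fluke(prior_real: float, power: float, threshold: float = 0.05) -> float:
    """Chance that a 'statistically significant' result is a fluke.

    prior_real: assumed chance the effect is real before the study is run
    power:      assumed chance the study detects the effect if it is real
    threshold:  the significance cut-off (one in 20 by convention)
    """
    false_alarms = threshold * (1 - prior_real)  # no real effect, yet "significant"
    true_hits = power * prior_real               # real effect, correctly detected
    return false_alarms / (false_alarms + true_hits)


# Hypothetical inputs: a long-shot claim tested by a reasonably sensitive study.
risk = chance_result_is_fluke(prior_real=0.1, power=0.8)
print(f"Chance a 'significant' result is a fluke: {risk:.0%}")
```

Under these assumed inputs the answer is about 36 per cent, a long way from the one in 20 that flipping the p-value around would suggest.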

The dangers of such faulty reasoning can no longer be dismissed as academic.

Last year, the journal Science published the outcome of an international effort to replicate 100 studies published in three psychology journals.

Virtually all the original studies had passed the classic test of “statistical significance”, with p-values below one in 20. Yet barely one in three of the studies were successfully replicated, and even those that were typically produced much less impressive outcomes than the original studies.

This month, Science published the outcome of a similar effort at replication, this time of studies published in two leading economics journals. The results were more encouraging – even so, about 40 per cent of the original findings failed to replicate and again, the findings were typically far less impressive than in the original studies.

As the ASA points out, none of this should come as a surprise. Statisticians have been warning researchers about the dangers of misinterpreting p-values for decades. Even the Cambridge mathematician who invented significance tests in the 1920s knew the scope for misunderstanding.

Shortly after including them in his hugely influential textbook Statistical Methods for Research Workers, Ronald Fisher advised researchers to use p-values only as a test of what to ignore, rather than what to take seriously.

Yet by the 1950s, Fisher’s tests had become a part of every scientist’s toolkit for making discoveries, while their “terms and conditions” were ignored.

In its statement, the ASA highlights other abuses of p-values. These include “data dredging”, where researchers hunt for anything giving p-values below one in 20 – and thus “statistically significant” findings they can publish. Such practices haunt the burgeoning field of Big Data, relied on by businesses to extract insight from their data sets.
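To see why dredging pays off so reliably, consider a sketch of a researcher who runs 20 separate comparisons on data containing no real effects at all. The function names, the figure of 20 comparisons and the assumption that the comparisons are independent are all illustrative choices, not details from the ASA statement. The code gives the exact chance of at least one “significant” result and checks it by simulation, using the fact that a well-behaved p-value is equally likely to land anywhere between 0 and 1 when only chance is at work.

```python
import random


def chance_of_false_alarm(n_tests: int, threshold: float = 0.05) -> float:
    """Exact chance that at least one of n independent tests dips below the
    threshold when there is no real effect anywhere."""
    return 1 - (1 - threshold) ** n_tests


def simulate_dredging(n_tests: int, n_studies: int = 20000,
                      threshold: float = 0.05, seed: int = 1) -> float:
    """Fraction of simulated studies in which dredging finds something to publish."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # Each rng.random() stands in for the p-value of one chance-only comparison.
        if any(rng.random() < threshold for _ in range(n_tests)):
            hits += 1
    return hits / n_studies


print(f"Exact:     {chance_of_false_alarm(20):.3f}")
print(f"Simulated: {simulate_dredging(20):.3f}")
```

With 20 chance-only comparisons, the odds of turning up at least one publishable “finding” are close to two in three.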

The ASA wants to see researchers move away from the simplistic pass-fail mentality encouraged by p-values, towards more sophisticated alternatives. That is, however, more easily said than done.

But there is a more formidable barrier to change: the profession of science itself. Anyone wanting a career in research must publish in journals, which in turn prefer eye-catching advances over damp squibs. Until now, both these agendas have benefited from the low bar for “significance” set by p-values. Moving to alternatives as suggested by the ASA is likely to set that bar higher.

It’s no exaggeration to say that the scientific enterprise stands at a crossroads. Will researchers opt to give themselves a much harder time, or will they continue to fool themselves and us with “discoveries” that are neither amazing nor true?

Robert Matthews is visiting professor of science at Aston University in Birmingham, England. His new book Chancing It: The Laws of Chance and What They Mean for You is out now.