Statistical Significance in Science – How to Game the System

A new study has produced “statistically significant” results supporting the (false) hypothesis that adults who listen to children’s songs become younger: not merely feeling younger, but actually becoming chronologically younger.

In statistical studies, there is always the possibility that results that seem to support a hypothesis occurred purely by chance and have nothing to do with the hypothesis.  In statistical jargon, the measure of this possibility is the “P-value”: the probability of obtaining results at least as extreme as those observed if chance alone were at work (that is, if there were no real effect).  Researchers strive for a low P-value.  The current arbitrary gold standard in some sciences is a P-value less than or equal to 0.05, meaning that chance alone would produce results this extreme no more than one time in twenty.  Results that meet this threshold are called “statistically significant” and are deemed to support the hypothesis.
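To make the idea concrete, here is a minimal sketch of how a P-value can be computed for a made-up coin-flip experiment. The example, the function name, and the numbers are illustrative assumptions, not taken from the study. It adds up the probability of every outcome at least as lopsided as the one observed, assuming the coin is fair:

```python
from math import comb


def binomial_two_sided_p(heads: int, flips: int) -> float:
    """Exact two-sided P-value for the null hypothesis of a fair coin.

    Sums the probability of every head count that lies at least as far
    from flips / 2 as the observed count does.
    """
    expected = flips / 2
    observed_distance = abs(heads - expected)
    favorable = 0
    for k in range(flips + 1):
        if abs(k - expected) >= observed_distance:
            favorable += comb(flips, k)  # number of sequences with k heads
    return favorable / 2 ** flips  # each sequence has probability 1 / 2**flips


# Hypothetical experiment: 100 flips of a coin we suspect is biased.
p_60 = binomial_two_sided_p(60, 100)
p_61 = binomial_two_sided_p(61, 100)
print(f"60 heads in 100 flips: P = {p_60:.4f}")
print(f"61 heads in 100 flips: P = {p_61:.4f}")
```

Under these assumptions, 60 heads gives a P-value a little above 0.05 and 61 heads gives one below it, so a single extra head is enough to move the result across the conventional line.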

To publish a statistics-based paper in a prestigious scientific journal, the results generally must be “statistically significant.” There are many ways, however, in which researchers can manipulate their data to meet the mathematical requirements for “statistical significance” and still be very wrong in their conclusions.  This seems especially common in health studies.

The study mentioned above was part of a larger study on how researchers can game the system to produce statistically significant results supporting their hypotheses.

Statistician William M. Briggs has an interesting post, “How To Present Anything As Significant,” in which he reviews a new paper: “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant.”

In the paper, the authors “show that despite empirical psychologists’ nominal endorsement of a low rate of false-positive findings (P≤0.05), flexibility in data collection, analysis, and reporting dramatically increases actual false-positive rates. In many cases, a researcher is more likely to falsely find evidence that an effect exists than to correctly find evidence that it does not. We present computer simulations and a pair of actual experiments that demonstrate how unacceptably easy it is to accumulate (and report) statistically significant evidence for a false hypothesis.”

In the paper introduction the authors say:

“Our job as scientists is to discover truths about the world. We generate hypotheses, collect data, and examine whether or not the data are consistent with those hypotheses. Although we aspire to always be accurate, errors are inevitable.  Perhaps the most costly error is a false positive, the incorrect rejection of a null hypothesis. First, once they appear in the literature, false positives are particularly persistent. Because null results have many possible causes, failures to replicate previous findings are never conclusive. Furthermore, because it is uncommon for prestigious journals to publish null findings or exact replications, researchers have little incentive to even attempt them. Second, false positives waste resources: They inspire investment in fruitless research programs and can lead to ineffective policy changes. Finally, a field known for publishing false positives risks losing its credibility.

In this article, we show that despite the nominal endorsement of a maximum false-positive rate of 5% (i.e., p≤.05), current standards for disclosing details of data collection and analyses make false positives vastly more likely. In fact, it is unacceptably easy to publish “statistically significant” evidence consistent with any hypothesis.

The culprit is a construct we refer to as researcher degrees of freedom. In the course of collecting and analyzing data, researchers have many decisions to make: Should more data be collected? Should some observations be excluded? Which conditions should be combined and which ones compared? Which control variables should be considered? Should specific measures be combined or transformed or both?

It is rare, and sometimes impractical, for researchers to make all these decisions beforehand. Rather, it is common (and accepted practice) for researchers to explore various analytic alternatives, to search for a combination that yields “statistical significance,” and to then report only what “worked.”

The problem, of course, is that the likelihood of at least one (of many) analyses producing a falsely positive finding at the 5% level is necessarily greater than 5%.”
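The arithmetic behind that last point can be sketched in a few lines. The sketch relies on a simplifying assumption made here for illustration, not taken from the paper: that the analyses are statistically independent. Analyses run on the same data usually overlap, so the exact figures will differ in practice. The function name is hypothetical.

```python
def chance_of_false_positive(num_analyses: int, alpha: float = 0.05) -> float:
    """Probability that at least one of num_analyses independent tests
    comes out "significant" at level alpha when no real effect exists."""
    # Each test avoids a false positive with probability (1 - alpha);
    # all of them must do so for the researcher to come up empty.
    return 1 - (1 - alpha) ** num_analyses


for k in (1, 2, 5, 10, 20):
    print(f"{k:2d} analyses -> {chance_of_false_positive(k):.3f}")
```

With twenty independent looks at data containing no real effect, the chance of at least one “significant” result is roughly 64%.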

The paper’s authors offer guidance for authors and reviewers to remedy the situation.  Briggs summarizes it as follows:

The authors list six requirements that address major mistakes users of statistics make. They themselves deliberately committed many of these mistakes in “proving” the results in the experiment above.

“1. Authors must decide the rule for terminating data collection before data collection begins and report this rule in the article.” If not, it is possible to use a stopping rule which guarantees a publishable p-value: just stop when the p-value is small!

“2. Authors must collect at least 20 observations per cell or else provide a compelling cost-of-data-collection justification.” Small samples are always suspicious. Were they the result of just one experiment? Or the fifth, discarding the first four as merely warm-ups?

“3. Authors must list all variables collected in a study.” A lovely way to cheat is to cycle through dozens and dozens of variables, only reporting the one(s) that are “significant.” If you don’t report all the variables you tried, you make it appear that you were looking for the significant effect all along.

“4. Authors must report all experimental conditions, including failed manipulations.” Self-explanatory.

“5. If observations are eliminated, authors must also report what the statistical results are if those observations are included.” Or: there are no such things as outliers. Tossing data that does not fit preconceptions always skews the results toward a false positive.

“6. If an analysis includes a covariate, authors must report the statistical results of the analysis without the covariate.” This is a natural mate for rule 3.
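The stopping-rule trick described under rule 1 is easy to demonstrate with a small simulation. The sketch below is an illustration, not the paper’s own simulation: it assumes data with no real effect (standard normal noise), a simple z-test with known variance, and hypothetical limits of 20 to 100 observations per study. It compares a researcher who tests once at the end with one who checks after every new observation and stops the moment p drops below 0.05.

```python
import math
import random


def z_test_p_value(total: float, n: int) -> float:
    """Two-sided p-value for H0: mean = 0, given the sum of n draws
    from a normal distribution with known standard deviation 1."""
    z = total / math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))


def study_is_significant(rng: random.Random, min_n: int, max_n: int,
                         peek: bool) -> bool:
    """Simulate one study in which the null hypothesis is true."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)  # pure noise: no real effect
        if peek and n >= min_n and z_test_p_value(total, n) < 0.05:
            return True  # stop collecting as soon as p looks publishable
    # Without peeking (or if peeking never succeeded), test once at max_n.
    return z_test_p_value(total, max_n) < 0.05


def false_positive_rate(peek: bool, studies: int = 2000, seed: int = 1) -> float:
    """Fraction of simulated null studies that end up 'significant'."""
    rng = random.Random(seed)
    hits = sum(study_is_significant(rng, 20, 100, peek) for _ in range(studies))
    return hits / studies


fixed_rate = false_positive_rate(peek=False)
peeking_rate = false_positive_rate(peek=True)
print(f"test once at n = 100:        {fixed_rate:.3f}")
print(f"peek after each observation: {peeking_rate:.3f}")
```

Because the null hypothesis is true in every simulated study, every “significant” result here is a false positive. The fixed-sample rate should land near the nominal 5%, while the peeking rate should come out several times higher.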

The authors also ask that peer reviewers hold researchers’ toes to the fire: “Reviewers should require authors to demonstrate that their results do not hinge on arbitrary analytic decisions.”

The bottom line here is to always be somewhat suspicious of papers whose results depend upon statistical manipulation or modeling versus papers that present actual observations.

“There are three kinds of lies: lies, damned lies and statistics.” – popularized by Mark Twain, who attributed it to Benjamin Disraeli


See also:

Statistical Games #1

Statistical Games #2 Stroke for Stroke



One comment

  1. True – ultimately the responsibility is on the researcher. The researcher is also working in an impossible world.

    In a publish or perish world, we encourage manipulation to get published. The goal has become “publish” not “find truth.” Find another metric of research productivity that supports the actual goal.

    Editors can start accepting research that is well-designed, but not statistically significant, too. 

    Finally, we shouldn’t ignore the multitude (majority) of researchers who are honestly seeking truth and honestly reporting their findings – flaws and all, openly.
