Statistical Significance in Science – How to Game the System

A new study has produced “statistically significant” results supporting the (false) hypothesis that adults who listen to children’s songs become younger, not just feel younger, but become actually chronologically younger.

In statistical studies, there is always the possibility that results that seem to support a hypothesis could have occurred purely by chance and actually have nothing to do with the hypothesis.  In statistical jargon, the measure of this possibility is called the “P-value,” which is the probability of getting results at least as extreme as those observed if chance alone were at work (it is often loosely described as the probability that the results occurred purely by chance).  Researchers strive for a low P-value.  The current arbitrary gold standard in some sciences is a P-value less than or equal to 0.05, meaning that chance alone would produce such results no more than one time in twenty.  Results with a P-value of 0.05 or less are called “statistically significant” and are deemed to support the hypothesis.

To publish a statistics-based paper in a prestigious scientific journal, the results must be “statistically significant.” There are many ways, however, in which researchers can manipulate their data to meet the mathematical requirements for “statistical significance” and still be very wrong in their conclusions.  This seems especially common in health studies.

The study mentioned above was part of a larger study on how researchers can game the system to produce statistically significant results supporting their hypotheses.

Statistician William M. Briggs has an interesting post, “How To Present Anything As Significant,” in which he reviews a new paper: “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant.”

In the paper, the authors “show that despite empirical psychologists’ nominal endorsement of a low rate of false-positive findings (P≤0.05), flexibility in data collection, analysis, and reporting dramatically increases actual false-positive rates. In many cases, a researcher is more likely to falsely find evidence that an effect exists than to correctly find evidence that it does not. We present computer simulations and a pair of actual experiments that demonstrate how unacceptably easy it is to accumulate (and report) statistically significant evidence for a false hypothesis.”

In the paper introduction the authors say:

“Our job as scientists is to discover truths about the world. We generate hypotheses, collect data, and examine whether or not the data are consistent with those hypotheses. Although we aspire to always be accurate, errors are inevitable.  Perhaps the most costly error is a false positive, the incorrect rejection of a null hypothesis. First, once they appear in the literature, false positives are particularly persistent. Because null results have many possible causes, failures to replicate previous findings are never conclusive. Furthermore, because it is uncommon for prestigious journals to publish null findings or exact replications, researchers have little incentive to even attempt them. Second, false positives waste resources: They inspire investment in fruitless research programs and can lead to ineffective policy changes. Finally, a field known for publishing false positives risks losing its credibility.

In this article, we show that despite the nominal endorsement of a maximum false-positive rate of 5% (i.e., p≤.05), current standards for disclosing details of data collection and analyses make false positives vastly more likely. In fact, it is unacceptably easy to publish “statistically significant” evidence consistent with any hypothesis.

The culprit is a construct we refer to as researcher degrees of freedom. In the course of collecting and analyzing data, researchers have many decisions to make: Should more data be collected? Should some observations be excluded? Which conditions should be combined and which ones compared? Which control variables should be considered? Should specific measures be combined or transformed or both?

It is rare, and sometimes impractical, for researchers to make all these decisions beforehand. Rather, it is common (and accepted practice) for researchers to explore various analytic alternatives, to search for a combination that yields “statistical significance,” and to then report only what “worked.”

The problem, of course, is that the likelihood of at least one (of many) analyses producing a falsely positive finding at the 5% level is necessarily greater than 5%.”
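The arithmetic behind that last point can be sketched in a few lines of Python. This is an illustration only, and it assumes the analyses are independent of one another (analyses run on the same data usually are not, so real figures will differ). The function name is made up for this example.

```python
def chance_of_false_positive(num_analyses, alpha=0.05):
    """Probability that at least one of several independent analyses
    comes out "significant" at level alpha when no real effect exists."""
    return 1 - (1 - alpha) ** num_analyses


for k in (1, 5, 10, 20):
    print(f"{k:2d} analyses: {chance_of_false_positive(k):.0%}")
```

Under that independence assumption, a researcher who tries 20 different analyses has roughly a 64% chance of finding at least one “significant” result where nothing is going on.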

The paper offers guidance for authors and reviewers to remedy the situation.  Briggs summarizes it as follows:

The authors list six major mistakes that users of statistics make. They themselves used many of these mistakes in “proving” the results in the experiment above.

“1. Authors must decide the rule for terminating data collection before data collection begins and report this rule in the article.” If not, it is possible to use a stopping rule which guarantees a publishable p-value: just stop when the p-value is small!

“2. Authors must collect at least 20 observations per cell or else provide a compelling cost-of-data-collection justification.” Small samples are always suspicious. Were they the result of just one experiment? Or the fifth, discarding the first four as merely warm ups?

“3. Authors must list all variables collected in a study.” A lovely way to cheat is to cycle through dozens and dozens of variables, only reporting the one(s) that are “significant.” If you don’t report all the variables you tried, you make it appear that you were looking for the significant effect all along.

“4. Authors must report all experimental conditions, including failed manipulations.” Self-explanatory.

“5. If observations are eliminated, authors must also report what the statistical results are if those observations are included.” Or: there are no such things as outliers. Tossing data that does not fit preconceptions always skews the results toward a false positive.

“6. If an analysis includes a covariate, authors must report the statistical results of the analysis without the covariate.” This is a natural mate for rule 3.
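To see why rule 1 matters, here is a minimal simulation sketch in Python. It is not the simulation from the paper. It assumes a simple two-sided z-test on data with a known standard deviation of 1, no real effect at all, and a researcher who checks the p-value after every 10 observations. The function names, sample sizes, and seed are all invented for illustration.

```python
import math
import random


def two_sided_p(sample):
    """Two-sided p-value for 'mean = 0', assuming a known standard deviation of 1."""
    z = sum(sample) / math.sqrt(len(sample))
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rates(trials=2000, max_n=100, peek_every=10, seed=1):
    """Simulate experiments with NO real effect and count 'significant' results."""
    rng = random.Random(seed)
    fixed_hits = 0   # test once, after all max_n observations
    peek_hits = 0    # test after every batch and stop at the first p <= 0.05
    for _ in range(trials):
        data = [rng.gauss(0, 1) for _ in range(max_n)]
        if two_sided_p(data) <= 0.05:
            fixed_hits += 1
        if any(two_sided_p(data[:n]) <= 0.05
               for n in range(peek_every, max_n + 1, peek_every)):
            peek_hits += 1
    return fixed_hits / trials, peek_hits / trials


fixed_rate, peek_rate = false_positive_rates()
print(f"fixed sample size:   {fixed_rate:.1%} false positives")
print(f"stop when p <= 0.05: {peek_rate:.1%} false positives")
```

The fixed-sample rate should land near the advertised 5%, while the stop-when-it-looks-good rate should come out noticeably higher, even though there is nothing to find in either case.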

The authors also ask that peer reviewers hold researchers’ toes to the fire: “Reviewers should require authors to demonstrate that their results do not hinge on arbitrary analytic decisions.”

The bottom line here is to always be somewhat suspicious of papers whose results depend upon statistical manipulation or modeling versus papers that present actual observations.

“There are three kinds of lies: lies, damned lies and statistics.” – Mark Twain


See also:

Statistical Games #1

Statistical Games #2 Stroke for Stroke



Be skeptical of health studies linking X to Y

On Tuesday we were treated to two front page stories in the Arizona Daily Star linking a health phenomenon to a supposed cause. However, such epidemiological studies prove nothing regarding cause and effect; the link or association is merely suggestive. Frequently such studies fail to consider other possible causes or confounding factors. The association between X and Y could in fact be valid, or it could be a coincidence.

The first story, written by Tony Davis, is “UA study: Diesel exhaust here linked to childhood wheezing.”

In this case, University of Arizona researchers suggest “Infants and very young children in Tucson exposed to high levels of vehicle diesel pollution are more likely than other kids to suffer from early childhood wheezing, a potential asthma indicator.” The study involved 700 children, a very small sample size for such a study, and compared the incidence of childhood wheezing with traffic patterns. According to information in the story, the researchers did not consider some confounding factors such as allergens in the study area or emissions from gasoline-powered vehicles. They did note “A majority of children have wheezing problems in the first few years of their lives due to viral infections.” But the report of the study did not say how that factor was separated from the diesel fumes association. The study report did not dig deeply into socio-economic factors that could impact pre-natal and post-natal care. The study leaves many uncertainties. And perhaps more ominously, “The researchers are also going to see if any kinds of public policies need changing to protect such children.” What would bureaucrats do: forbid families with young children from living near major traffic routes?

The second study, “Kids may help prevent heart disease in men,” reported by the Associated Press, involved 138,000 men. This study, by AARP, the government, and several universities, noted that men with children have lower testosterone and a lower incidence of heart disease. Unlike the first story, this one was more circumspect in its claims. The story noted “a study like this can’t prove that fatherhood and mortality are related.” The story also admitted that the study did not consider confounding factors such as cholesterol and blood pressure data, the fertility of the men’s partners, or the case of being childless by choice.

These kinds of stories make good headlines but often bad science and unwarranted worry. They can also precipitate harmful government regulation. For instance, see my post “Ozone theory has holes.”

In that story I report that the FDA is banning inexpensive over-the-counter inhalers for asthmatics and forcing them to buy more-expensive prescription medications on the theory that the CFC propellants in the cheap inhalers are harming the ozone layer. Science has proved that wrong, but the FDA apparently hasn’t gotten the message.

So, whenever you see such a story reporting a link or association of one thing to another, be skeptical, and remember this coincidence: human life expectancy has increased since the invention of the Yo-Yo. Will we someday see this headline: “Study: Yo-Yos linked to longer life”? That headline has the same validity as the stories mentioned above.

See also (links updated):

Statistical Games #1

Statistical Games #2 Stroke for Stroke

El Nino incites wars and the Post Office controls temperature

This post shows the length to which some researchers go to get on the climate change bandwagon. It also shows that statistics can be invented and manipulated, and that correlation does not prove causation.

El Nino incites wars:

We have a paper titled: “Civil conflicts are associated with the global climate” published in Nature (Vol. 476, 25 August 2011). Full paper here.

In the paper, the three researchers use statistical methods to correlate civil wars in countries throughout the world with the El Nino (warm phase) of the El Nino/Southern Oscillation (ENSO).

From their abstract:

Historians have argued that ENSO may have driven global patterns of civil conflict in the distant past, a hypothesis that we extend to the modern era and test quantitatively. Using data from 1950 to 2004, we show that the probability of new civil conflicts arising throughout the tropics doubles during El Nino years relative to La Nina years. This result, which indicates that ENSO may have had a role in 21% of all civil conflicts since 1950, is the first demonstration that the stability of modern societies relates strongly to the global climate.

Doesn’t that mean El Nino had no role in 79% of all civil conflicts? So what is the real purpose of this paper? This study is purely a statistical manipulation which pays no attention to socioeconomic data in the countries studied.

The authors invent a statistic which they call the “annual conflict risk” (ACR). To calculate that statistic:

We examine the Onset and Duration of Intrastate Conflict data set (reference 17), which codes a country as experiencing ‘conflict onset’ if more than 25 battle-related deaths occur in a new civil dispute between a government and another organized party over a stated political incompatibility. Following common practice, a dispute is new if it has been at least 2 years since that dispute was last active; however, individual countries may experience conflict onset in sequential years if the government has disputes with different opposition groups.

Here is their graph correlating ACR with El Nino:


Not a bad apparent correlation. But is this correlation a reflection of cause and effect, data manipulation, or merely coincidence? The study period is relatively short. Would the relation hold over a longer period? You can read the paper and decide. Notice also that many of the ACR highs appear to precede the El Nino highs – oops.

According to the paper, this study was partially funded by the Environmental Protection Agency, the Soros Foundation, and the Environmental Defense Fund, organizations not known for their scientific integrity.

The Post Office controls temperature:

To show that correlations can develop by chance, no matter how absurd the relationship, I present a graph showing the correlation of U.S. first-class postage rates versus temperature for the period 1880 to 2005:


The graph implies that there is a causal relationship. If so, then the Post Office has the solution to global warming: reduce first-class postage cost back to 25 cents.

By massaging data and using statistics, you can find correlations (or anti-correlations if that is your goal) for almost anything.
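Here is a small Python sketch of how two unrelated quantities can appear tightly correlated simply because both rise over time. The two series are invented for the purpose; they are not real postage or temperature data, and the slopes, noise levels, and seed are arbitrary assumptions.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(42)
years = range(126)  # stands in for 1880 to 2005

# Invented series: each one rises over time for its own reasons.
series_a = [0.3 * t + rng.gauss(0, 1.0) for t in years]
series_b = [0.01 * t + rng.gauss(0, 0.1) for t in years]

# Year-to-year changes strip out the shared upward trend.
changes_a = [b - a for a, b in zip(series_a, series_a[1:])]
changes_b = [b - a for a, b in zip(series_b, series_b[1:])]

print(f"correlation of levels:  {pearson(series_a, series_b):.2f}")
print(f"correlation of changes: {pearson(changes_a, changes_b):.2f}")
```

The levels should correlate strongly because both series trend upward, while the year-to-year changes, which have the trend removed, should show little or no correlation. Checking the changes instead of the levels is one simple way to test whether a striking correlation is anything more than two things that both went up.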

For some real information on El Nino behavior versus climate models see here.

Statistical Games #2 Stroke for Stroke

Earlier this week I wrote about some common pitfalls in statistical analysis. (See Statistical Games #1)

This morning’s Arizona Daily Star presented a good example of a questionable statistical study:

Study: Walking lowers stroke risk

The study claims that “Women who said they walked briskly had a 37 percent lower risk of stroke than those who didn’t walk.” “The research involved about 39,000 female health workers 45 or older enrolled in the Women’s Health Study. The women were periodically asked about their physical activity. During 12 years of follow-up, 579 had strokes.”

The apparent problem with this study is that the “37% lower risk” is a relative risk which says nothing about your actual chances of getting a stroke. Also, a red flag is “women who said.” This was not a clinical trial, but relied on the memory and veracity of the study participants. The basic input data were completely uncontrolled.

Since 579 women of the 39,000 had strokes, this yields a stroke incidence of 1,484 per 100,000 for the study period. (The incidence per 100,000 is a common way to report incidence of a condition in a population.)

I searched the internet to find data on incidence of stroke in the general population and met with limited success. Here are two studies I found:

“A 1998 study by researchers at the University of Cincinnati Medical Center suggests that the number of strokes in the United States may be dramatically higher than previously estimated. According to the study, approximately 700,000 strokes occur in the United States every year.” This works out to an incidence of 233 per 100,000 people in the U.S.

Another study, published on BioMedCentral: “There were 712,000 occurrences of stroke with hospitalization and an estimated 71,000 occurrences of stroke without hospitalization. This totaled 783,000 occurrences of stroke in 1996, compared to 750,000 in 1995. The overall rate for occurrence of total stroke (first-ever and recurrent) was 269 per 100,000 population (age- and sex-adjusted to 1996 US population).” At least this study had some clinical evidence and the results of the two studies are in the same ball park. Apparently none of the studies considered confounding factors such as diet, smoking, or genetics.

So I ask, if the incidence of stroke in the general population is somewhere near 250 per 100,000, and the “brisk walkers” reported a stroke incidence of 1,484 per 100,000, where is the evidence that walking lowers the incidence of stroke? One could just as easily conclude from the study that women who walk briskly have a 6 times higher chance of suffering a stroke.

By the way, did anybody catch the switch I pulled? I said that the women had a stroke incidence of 1,484 per 100,000 for the study period. But the study period was 12 years. The other studies I quoted referred to annual rates. If the women’s study incidence of strokes occurred evenly throughout the 12 year period then there would have been 48 per year or 123 per 100,000, or about half that of the general population. Maybe walking does help.
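For anyone who wants to check the arithmetic, here is a short Python sketch of the two calculations. It uses only the numbers reported in the story and assumes, as I did, that the strokes were spread evenly over the 12 years.

```python
strokes = 579
women = 39_000
years = 12

per_period = strokes / women * 100_000   # incidence over the whole study
per_year = per_period / years            # assumes strokes spread evenly

print(f"over the whole study: {per_period:,.0f} per 100,000")
print(f"per year:             {per_year:,.0f} per 100,000")
```

Carried through without rounding along the way, the figures come out a shade higher than the ones I quoted (about 1,485 and 124), which changes nothing in the argument. The point is that the two rates differ by a factor of 12, so it matters a great deal which one is set beside an annual rate.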

But then again, the walkers had only a 37% lower incidence of stroke not 50%. The report of the study didn’t say how many were walkers. That would be important in evaluating the study.

Aren’t statistics fun? The point of this essay is that one must be wary of statistical studies, especially those that do not have controlled clinical or experimental data, and make sure they are not comparing apples to oranges.

Statistical Games #1

Statistics can be misleading and sometimes mind-boggling. They can be used to hide the true nature of a relationship, or be stated in a way that falsely supports or detracts from a certain position.

Here, I will explore some common pitfalls about the meaning of statistics. I am not an expert in statistics. I’ve taken the usual college courses in classical statistics, and later on, was trained in geostatistics which differs from classic statistical methods in that one has to pay attention to the context, not just the pure mathematics. (For any statisticians out there, geostatistics is akin to Bayesian methods of statistical analysis).

In this article, I present three examples of how statistics can be misleading.

1. Relative risk versus absolute risk

Relative risk is the risk in relation to something else. It can be scary, but it tells you nothing about the actual risk. Absolute risk is simply the probability of something happening. For example, in newspaper stories we frequently read something like this: If you use substance X, you double your chances of contracting dread condition Y. That’s a relative risk.

Let’s say that the incidence of condition Y in the general population is 1 in 100,000. Among long-time users of substance X, the incidence of condition Y is 2 in 100,000. The relative risk says you double your chances of getting Y; sounds scary. But the absolute risk or chance of contracting condition Y rises from a risk of 0.00001 to a risk of 0.00002. Not so scary.
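The same example, worked as a few lines of Python. The numbers are the hypothetical ones from the paragraph above, not data from any real study.

```python
baseline = 1 / 100_000   # incidence of condition Y in the general population
exposed = 2 / 100_000    # incidence among long-time users of substance X

relative_risk = exposed / baseline
absolute_increase = exposed - baseline

print(f"relative risk:                 {relative_risk:.1f}x")
print(f"absolute risk increase:        {absolute_increase:.5f}")
print(f"extra cases per 100,000 users: {absolute_increase * 100_000:.0f}")
```

A headline can truthfully report the first number (“doubles your risk”) while the third number shows what it amounts to: one additional case per 100,000 users.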

The reasoning in this example can be applied in reverse. For example, have you seen claims that dietary supplement X or drug Z cuts the incidence of condition Y in half? Again, this is relative risk, while the real benefit might actually be very, very small.

P.S. to this section. According to the Arizona Lottery website, your chances of winning the Powerball jackpot are 1 in 195,249,054. Written as a decimal, the chance of winning is 0.000000005. Since this number is very close to zero, a cynic might say your chances of winning are almost the same whether or not you buy a ticket. (But you can double your chance of winning if you buy two tickets!) Risking a dollar on Powerball is a good bet only when the jackpot exceeds $195,249,054 because then the reward equals the risk.
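A rough Python sketch of that break-even reasoning follows. It is a simplification: it assumes a $1 ticket and ignores taxes, shared jackpots, and the smaller prizes. The function name is invented for this example.

```python
odds = 195_249_054     # 1-in-N chance of the jackpot, as quoted above
ticket_price = 1.00    # assumed price of one ticket, in dollars


def expected_jackpot_value(jackpot, tickets=1):
    """Average jackpot winnings per drawing, in dollars."""
    return tickets * jackpot / odds


print(f"chance with one ticket: {1 / odds:.9f}")
print(f"break-even jackpot:     ${ticket_price * odds:,.0f}")
print(f"expected value of a ticket at a $100,000,000 jackpot: "
      f"${expected_jackpot_value(100_000_000):.2f}")
```

Below the break-even jackpot, the average return on a ticket is less than the dollar it costs.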

2. What are the odds?

(Adapted from an essay by Tom Siegfried, Science News, 27Mar2010)

Let’s pose a hypothetical example and say that our favorite baseball player, “Slugger Bob” is one of a group of 400 players that were tested for steroid use, and that Slugger Bob tested positive. We will stipulate that the test correctly identifies steroid users 95% of the time. The test also has a 5% incidence of false positives. So, what are the chances that Slugger Bob is a steroid user?

Most people might say that there is a 95% chance that Slugger Bob is a steroid user, and perhaps classical statistics would agree.

But here is where the real world collides with classical statistics and where context matters. Let’s say that we know from prior testing and other experience that about 5% of all baseball players are actually steroid users. We would expect, therefore, that out of the 400 players, 20 are users (5 percent) and 380 are not users.

Of the 20 users, 19 (95 percent of 20) would be identified correctly as users.

Of the 380 nonusers, 19 (5 percent false positives) would incorrectly be indicated as users.

If 400 players were tested under these conditions, 38 would test positive. Of those, 19 would be guilty users and 19 would be innocent nonusers. So if any single player’s test is positive, the chance that he really is a user is 50%, since an equal number of users and nonusers test positive.
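The head count above is an application of Bayes’ theorem, and it can be written as a short Python function. This is a sketch using the stipulated numbers for Slugger Bob; the function and parameter names are my own.

```python
def chance_user_given_positive(prevalence, sensitivity, false_positive_rate):
    """Probability that a player who tests positive really is a user."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * false_positive_rate
    return true_positives / (true_positives + false_positives)


# 5% of players are users; the test catches 95% of users and
# wrongly flags 5% of nonusers.
print(f"{chance_user_given_positive(0.05, 0.95, 0.05):.0%}")

# The intuitive "95%" answer holds only if half of all players are users.
print(f"{chance_user_given_positive(0.50, 0.95, 0.05):.0%}")
```

The first line prints 50% and the second 95%. The answer depends as much on how common steroid use is as on how good the test is.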

3. Clusters and Patterns

Geologists look for patterns in nature because patterns can give clues to special structural situations and mineral deposits. Other kinds of clusters or patterns may be of concern also, for instance, the apparent high incidence of childhood leukemia at Fort Huachuca. Such clusters must be investigated to see if a cause can be identified. But sometimes, such clusters occur just by chance.

In the figure below, we see an array of red dots superimposed over a geologic map of Arizona. Let’s say for now that the dots represent high copper values obtained from assays of stream sediment samples. The array of dots shows some apparent clusters and patterns which may indicate some cause of interest.


What might catch a geologist’s eye is the line of dots extending from the northwest to the southeast, exactly along the Mogollon Rim, which is a structural separation between the lowlands of the southwest and highlands of the Colorado Plateau. There also is a cluster near Ajo, site of a copper deposit. There is a dot near Rosemont, another copper deposit, and a dot in the Galiuro Mountains, again an area with copper deposits. Dots also occur near uranium and coal deposits on the Colorado Plateau and near gold deposits in western Arizona.

So, do the dots actually have significance? No, the dots are not copper assays; they represent random numbers. On my computer I generated 100 random numbers and normalized them to values between 1 and 100. I used the first 50 as the X-coordinates and paired them with the second 50 for the Y-coordinates and made a scatter plot of the data. The dots have no significance at all. The patterns occur just by chance, but our rationalizations can give them meaning when there is none.
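Anyone can repeat the experiment. The Python sketch below follows the same recipe, with two assumptions of mine: it draws whole numbers from 1 to 100 directly instead of normalizing, and instead of drawing a scatter plot it counts the dots falling in each cell of a 4 by 4 grid. The seed is arbitrary.

```python
import random

rng = random.Random(2010)  # arbitrary seed so the run can be repeated
numbers = [rng.randint(1, 100) for _ in range(100)]

# First 50 numbers are X-coordinates, second 50 are Y-coordinates.
points = list(zip(numbers[:50], numbers[50:]))

# Count the dots in each cell of a 4 x 4 grid laid over the map.
grid = [[0] * 4 for _ in range(4)]
for x, y in points:
    grid[(y - 1) // 25][(x - 1) // 25] += 1

for row in reversed(grid):  # print with the largest Y values at the top
    print(" ".join(f"{count:2d}" for count in row))
```

With 50 dots in 16 cells the average is about three per cell, yet a typical run leaves some cells crowded and others nearly empty. Those crowded cells are the “clusters,” and they come from nothing but chance.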

So, be skeptical. Nothing can be proved with statistics, but sometimes statistics help us look in the right direction.

See statistical games #2