May 27, 2015

What is wrong with p-values?

Earlier this year, editors of the journal Basic and Applied Social Psychology announced that the journal would no longer publish papers containing p-values. The latest American Psychological Association Publication Manual states that researchers should "wherever possible, base discussion and interpretation of results on point and interval estimates," i.e. not p-values. FDA has been encouraging Bayesian analysis. What is wrong with p-values?

What is a p-value? In the classical statistical procedure known as "significance testing", we have a default hypothesis, usually called the null hypothesis and denoted by H0, and we wish to determine whether or not to reject H0 based on some observations X. We choose a statistic S=f(X) (a scalar function of X) that summarizes our data. The p-value is the probability of observing a value at least as extreme as S under H0. We reject H0 if the p-value is below some specified small threshold like α=0.05 and we say something like "H0 is rejected at 0.05 significance level." This threshold or significance level (α) upper bounds the probability of false rejection, i.e. rejecting H0 when it is correct.

Example: We toss a coin 1000 times and observe 532 heads, 468 tails. Is this a fair coin? In this example the null hypothesis H0 is that the coin is fair, observation X is the sequence of heads and tails, and the statistic S is the number of heads. The p-value, or probability of S ∉ [469,531] under H0, can be calculated as: \[ 1 - \sum_{k=469}^{531} {1000 \choose k} \left(\frac{1}{2}\right)^{1000} = 0.04629 \] We can reject the null hypothesis at 0.05 significance level and decide the coin is biased. But should we?

Objection 1: (MacKay 2003, pp.63) What we would actually like to know is the probability of H0 given that we observed 532 heads. Unfortunately the p-value 0.04629 is not that probability (although this is a common confusion). We can't calculate a probability for H0 unless we specify some alternatives. Come to think of it, how can we reject a hypothesis if we don't look at what the alternatives are? What if the alternatives are worse? So let's specify a "biased coin" alternative (H1) which assumes that the head probability of the coin (θ) is distributed uniformly between 0 and 1 (other ways of specifying H1 are possible and do not effect the conclusion). We have: \[ P(S=532 \mid H_0) = {1000 \choose 532} \left(\frac{1}{2}\right)^{1000} = 0.003256 \] \[ P(S=532 \mid H_1) = \int_0^1 {1000 \choose 532} \theta^{532} (1-\theta)^{468} d\theta = 0.001 \] So H0 makes our data 3.2 times more likely than H1! And here the p-value almost made us think the data was 1:20 in favor of the "biased" hypothesis.

Objection 2: (Berger 1982, pp.13) Well, now that we understand p-value is not the probability of H0, does it tell us anything useful? According to the definition it limits the false rejection rate, i.e. if we always use significance tests with a p-value threshold α=0.01, we can be assured of incorrectly rejecting only 1% of correct hypotheses in the long run. So does that mean when I reject a null hypothesis I am only mistaken 1% of the time? Of course not! P(reject|correct) is 1%, P(correct|reject) can be anything! Here is an example:


The table gives the probabilities the two hypotheses H0 and H1 assign to different outcomes X=1 or X=2. Say we observe X=1. We reject H0 at α=0.01 significance level. But there is very little evidence against H0, the likelihood ratio P(X|H1)/P(X|H0) is very close to 1, so the chance of being in error is about 1/2 (assuming H0 and H1 are a-priori equally likely). Thus α=0.01 is providing a very misleading and false sense of security when rejection actually occurs.

Objection 3: (Murphy 2013, pp.213) Consider two experiments. In the first one we toss a coin 1000 times and observe 474 tails. Using T=474 as our statistic the one sided p-value is P(T≤474|H0): \[ \sum_{k=0}^{474} {1000 \choose k} \left(\frac{1}{2}\right)^{1000} = 0.05337 \] So at a significance level of α=0.05 we do not reject the null hypothesis of an unbiased coin.

In the second experiment we toss the coin until we observe 474 tails, and it happens to take us 1000 trials. Different intention, same data. This time N=1000 is the natural test statistic and the one sided p-value is P(N≥1000|H0): \[ \sum_{n=1000}^\infty {n-1 \choose 473} \left(\frac{1}{2}\right)^n = 0.04994 \] Suddenly we are below the magical α=0.05 threshold and we can reject the null hypothesis. The observed data, thus the likelihoods of any hypotheses for this data have not changed. The p-value is based not just on what actually happened, but what could have happened. This is clearly absurd.

Objection 4: (Cumming 2012) If we base the fate of our hypotheses on p-values computed from experiments, at the very least we should expect the p-values (thus our critical decisions) to change very little when we replicate the experiments. Unfortunately p-values do not even give us stability, as this wonderful video "Dance of the p values" by Geoff Cumming illustrates:

Conclusion: (Jaynes 2003, pp.524) expressed the absurdity of significance testing best:

In order to argue for an hypothesis H1 that some effect exists, one does it indirectly: invent a "null hypothesis" H0 that denies any such effect, then argue against H0 in a way that makes no reference to H1 at all (that is, using only probabilities conditional on H0). To see how far this procedure takes us from elementary logic, suppose we decide that the effect exists; that is, we reject H0. Surely, we must also reject probabilities conditional on H0; but then what was the logical justification for the decision? Orthodox logic saws off its own limb.
Harold Jeffreys (1939, p. 316) expressed his astonishment at such limb-sawing reasoning by looking at a different side of it: "An hypothesis that may be true is rejected because it has failed to predict observable results that have not occurred. This seems a remarkable procedure. On the face of it, the evidence might more reasonably be taken as evidence for the hypothesis, not against it."


Unknown said...

Great summary! But there's a hyperlink missing from your reference to "Jaynes 2003, pp.524" at the end. I think it should point at

Deniz Yuret said...

Jaynes 2003 hyperlink fixed. Thanks for noticing.