July 12, 2004

SIGNIFICANCE AND POWER IN STATISTICS

This is a note on a technical point that I was writing about to a law professor, which I might use in my own writing or teaching later.

Suppose we estimate the effect of capital punishment on murder to be a reduction in the murder rate of 3, and we want to know how accurate that estimate is, and whether we can be confident the true value is not really 0. The standard test is the t-test, which might give us a confidence level of only 30%-- that is, a significance level (a p-value) of 70%. Conventionally, our conclusion would be that we cannot reject the null hypothesis that capital punishment has no effect on murder.
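In Python, the arithmetic might look like the sketch below. The estimate of 3 is from the example above; the standard error of 7.8 is a number I am assuming purely for illustration, chosen so that the two-sided p-value lands near 70%, and I use the large-sample normal approximation to the t distribution.

```python
# Hypothetical arithmetic behind the example: estimate 3 (from the text),
# standard error 7.8 (assumed for illustration), and the large-sample
# normal approximation to the t distribution.
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

estimate = 3.0
std_error = 7.8   # assumed, not from the example

t_stat = estimate / std_error
p_value = 2.0 * (1.0 - normal_cdf(abs(t_stat)))  # two-sided p-value

print(f"t-statistic: {t_stat:.2f}")         # 0.38
print(f"two-sided p-value: {p_value:.2f}")  # 0.70, i.e. 30% confidence
```

With a different assumed standard error, the same estimate of 3 would give a quite different p-value, which is the whole point of the test.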

But what does the 30% mean? It means this. Suppose that the true effect is indeed 0. If our model is correct, and we ran the same kind of test on 100 different samples of data, we would expect about 70 of those 100 samples to yield estimates at least as far from 0 as ours. Each test would come up with a different estimate-- 3, 7.2, -8.4, 1.9, 0.0, -2.1, et cetera-- and 70 of those estimates would be far enough from 0 that if we took evidence like ours as grounds for rejection, we would be falsely rejecting the null hypothesis of 0 effect. Thus, rejecting the null on this data would be very misleading if the true effect is indeed 0.
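This frequency interpretation can be checked by simulation. Here is a sketch, again assuming (hypothetically) that each sample's estimate is normally distributed around the true effect of 0 with a standard error of 7.8, so that estimates at least as far from 0 as our observed 3 turn up roughly 70% of the time.

```python
# Simulate 100,000 samples under the null (true effect 0), assuming each
# sample's estimate is normal with a hypothetical standard error of 7.8.
# Count how often an estimate is at least as far from 0 as our observed 3.
import random

random.seed(0)

TRUE_EFFECT = 0.0   # the null hypothesis is true in this simulation
STD_ERROR = 7.8     # hypothetical standard error
OBSERVED = 3.0      # the estimate from our one real sample

trials = 100_000
as_extreme = sum(
    1 for _ in range(trials)
    if abs(random.gauss(TRUE_EFFECT, STD_ERROR)) >= OBSERVED
)
print(f"share of null samples at least as extreme: {as_extreme / trials:.2f}")
```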

Should we therefore conclude that the true effect is 0? Not really. There is a second desirable feature of a test: its power. A test's power is the probability that the test gives us the right answer given that the null hypothesis is wrong-- that is, the probability that we correctly reject a false null, which is one minus the probability that we wrongly fail to reject it. It could be that our data is so poor that our t-test cannot reliably detect an effect of capital punishment on murder even though the effect exists and is strong. That, indeed, is the big problem for statistical studies of capital punishment, and one reason I tend not to pay them much attention.

Let me explain more. Suppose the true effect is 3.5. The power of the t-test is the probability that we reject the null hypothesis of a 0 effect, given that the true effect is 3.5-- a number that we might estimate to equal 45% in this example, where our estimate was 3. Thus, it might be that although we think our test is too unreliable to tell us that the true value is different from 0, we at the same time think our test is too unreliable to tell us that the true value is different from 3.5. So we would *not* conclude that we can say capital punishment has no effect-- though we can't say it does have an effect, either.
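Power can be estimated by the same kind of simulation, now drawing samples on the assumption that the true effect is 3.5. The standard error of 1.9 below is again a number I am making up, chosen so that the power of a conventional 5% two-sided test comes out near the 45% figure; with different assumed numbers the power would come out differently.

```python
# Estimate power by simulation: draw samples assuming the true effect is
# 3.5 with a hypothetical standard error of 1.9, and count how often a
# conventional 5% two-sided test rejects the null of 0 effect.
import random

random.seed(1)

TRUE_EFFECT = 3.5
STD_ERROR = 1.9                  # assumed for illustration
CRITICAL = 1.96 * STD_ERROR      # reject when |estimate| exceeds this

trials = 100_000
rejections = sum(
    1 for _ in range(trials)
    if abs(random.gauss(TRUE_EFFECT, STD_ERROR)) > CRITICAL
)
print(f"estimated power: {rejections / trials:.2f}")
```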

The reason that power values are not reported in studies is that the value of the power depends on the true effect, something we don't know. There is just one null hypothesis, so it is easy to find the significance of a test. But there are lots of possible true effects. In the last paragraph, I assumed the true effect was 3.5, and got a power of 45%. If the true effect were not 3.5, but 1.2, then the power might be 7%. That would be because it would be very hard for a test to be powerful enough to distinguish between a true effect of 1.2 and a null hypothesis of 0. Or, it might be that the true effect was 17.5, and the power was 91%. If the true effect is as big as 17.5, it would be very unlikely that our test would cause us to believe it was 0.
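To see how power moves with the assumed true effect, here is a closed-form sketch using the normal approximation. The 7%, 45%, and 91% figures above are purely illustrative-- no single set of assumed numbers reproduces all three at once-- but with a hypothetical standard error of 1.9 the same qualitative pattern appears: low power for a true effect near 0, middling power at 3.5, and power near 100% at 17.5.

```python
# Closed-form power of a 5% two-sided test under the normal approximation,
# assuming a hypothetical standard error of 1.9, for several possible
# true effects. Since the true effect is unknown, power is a whole curve,
# not a single number.
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def power(true_effect, std_error=1.9, z_crit=1.96):
    """P(reject the null of 0) when the estimate is N(true_effect, std_error)."""
    cutoff = z_crit * std_error   # reject when |estimate| > cutoff
    upper = 1.0 - normal_cdf((cutoff - true_effect) / std_error)
    lower = normal_cdf((-cutoff - true_effect) / std_error)
    return upper + lower

for effect in (1.2, 3.5, 17.5):
    print(f"true effect {effect:5.1f}: power = {power(effect):.2f}")
```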

Thus, in thinking about statistical studies that fail to find a statistically significant effect of X on Y, we must remember that maybe the data was just so poor that if such an effect did exist, the study wouldn't have found it anyway.