Statistical Power and Power analysis

Marc Deveaux
4 min read · Dec 3, 2021

Working notes on power analysis.

Background

Power analysis supports reproducibility by helping you avoid p-hacking and being fooled by false negatives (recall the Type II error: failing to reject the null hypothesis when there is a real effect [false negative], meaning the p-value came out misleadingly large).

“the lack of statistical significance is often interpreted as the absence of an effect. Unfortunately such a conclusion is often a serious misinterpretation. Indeed, non-significant results are just as often the consequence of an insufficient statistical power”

Power analysis is normally conducted before the data collection. The main purpose of power analysis is to help the researcher to determine the smallest sample size that is suitable to detect the effect of a given test at the desired level of significance.

It is also the most popular method for arguing that an effect is absent when results are non-significant. Other approaches, such as equivalence tests and confidence intervals, can be used as well.

Concept

Let’s say you have 2 samples and you test whether they come from the same population distribution.

  • Null Hypothesis: there is no difference between the 2 distributions
  • Alternative Hypothesis: there is a difference between the 2 distributions

To test this hypothesis, we use the p-value, which is the probability of obtaining a result equal to or more extreme than the one observed in the data, assuming the null hypothesis is true. The p-value is only a probability, so the conclusion we draw from it can be wrong: the test can mislead us, and we can make an error in our interpretation.

For example, if our 2 samples came from 2 different population distributions where only a small part overlaps, repeating the experiment several times would usually give a p-value < 0.05. But once in a while, the random points could all fall in the region where the 2 distributions overlap, resulting in a higher p-value (let’s say 0.08). As a result we would fail to reject the null hypothesis.

So even though we know the data came from 2 different distributions, in that experiment we would not correctly reject the null hypothesis that all the data comes from the same distribution.
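The scenario above can be simulated. The following is a minimal sketch, not a definitive implementation: it assumes two normal populations whose means differ by one standard deviation, 10 points per group, and a two-sample z-test with a known common standard deviation (chosen only to keep the example dependency-free). All names and numbers are illustrative assumptions.

```python
import math
import random
from statistics import NormalDist, mean


def z_test_p_value(sample_a, sample_b, sigma=1.0):
    """Two-sided p-value for a difference in means, assuming a known common sigma."""
    n_a, n_b = len(sample_a), len(sample_b)
    standard_error = sigma * math.sqrt(1 / n_a + 1 / n_b)
    z = (mean(sample_a) - mean(sample_b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


rng = random.Random(0)  # fixed seed so the run is repeatable
n_experiments, n_per_group = 2000, 10

p_values = []
for _ in range(n_experiments):
    # The two populations really are different: their means differ by 1.
    a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    b = [rng.gauss(1.0, 1.0) for _ in range(n_per_group)]
    p_values.append(z_test_p_value(a, b))

# Share of experiments that fail to reject the null despite a real difference.
missed = sum(p >= 0.05 for p in p_values) / n_experiments
print(f"Experiments with p >= 0.05 despite a real difference: {missed:.1%}")
```

Every experiment in this simulation samples from two genuinely different populations, yet a noticeable share of them still ends with p >= 0.05.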

Statistical Power

Statistical power is the probability that the test correctly rejects the null hypothesis. It is only meaningful when the null hypothesis is actually false, i.e. when there is a real effect to detect. Put differently, power is the probability that we will correctly get a small p-value (<0.05).

  • If you have a high probability of correctly getting a small p-value and rejecting the null hypothesis, then you have a large amount of power
  • If the 2 distributions completely overlap, then the concept of power, the probability that we will correctly reject the null hypothesis, doesn’t apply, because there is no real difference to detect
  • If the 2 distributions overlap a lot and we have a small sample size, we have relatively low power
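To make the link between overlap, sample size and power concrete, here is a hedged sketch that computes power analytically. It assumes a two-sided, two-sample z-test with equal group sizes and a standardized effect size d (difference in means divided by the standard deviation); the function name `z_test_power` is hypothetical and the normal approximation is an assumption, so a t-test would give slightly lower values for small samples.

```python
import math
from statistics import NormalDist


def z_test_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test with equal group sizes."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Expected value of the z statistic when the true standardized effect is effect_size.
    shift = effect_size * math.sqrt(n_per_group / 2)
    # Probability that the statistic lands beyond either critical value.
    return (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)


# A small effect means a lot of overlap; a large effect means little overlap.
for d in (0.2, 0.8):
    for n in (5, 20, 80):
        print(f"d={d}, n per group={n}: power ~ {z_test_power(d, n):.2f}")
```

With an effect size of 0 (complete overlap) the function simply returns alpha, the false positive rate, which matches the point that power is not a meaningful notion in that case.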

Power analysis

Power analysis tells you how many measurements (i.e., how many random points from each distribution) you need to collect to have a good amount of power.

  • If you have only a few measurements, it is harder to estimate the true mean of the population distribution (you could randomly pick a point located in the tails of the distribution, for example)
  • As you increase the number of measurements, your estimate of the true population mean gets better
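The effect of the number of measurements on the estimate of the mean can be checked with a short simulation. This is only a sketch under assumed values: a normal population with mean 10 and standard deviation 2, and a hypothetical helper `spread_of_sample_means`.

```python
import math
import random
from statistics import mean, stdev

rng = random.Random(1)  # fixed seed so the run is repeatable
true_mean, true_sd = 10.0, 2.0  # assumed population parameters


def spread_of_sample_means(n, repeats=1000):
    """Standard deviation of the sample mean across many repeated samples of size n."""
    sample_means = [
        mean([rng.gauss(true_mean, true_sd) for _ in range(n)])
        for _ in range(repeats)
    ]
    return stdev(sample_means)


for n in (3, 30, 300):
    simulated = spread_of_sample_means(n)
    theoretical = true_sd / math.sqrt(n)  # standard error of the mean
    print(f"n={n}: simulated spread {simulated:.3f}, theoretical {theoretical:.3f}")
```

The spread of the sample mean shrinks roughly as 1/sqrt(n), which is why larger samples make it easier to tell two population means apart.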

Power analysis answers questions like “how much statistical power does my study have?” and “how big a sample size do I need?”.

Doing the power analysis

To conduct the power analysis, you have to choose the following 3 elements and plug them into a statistical power calculator (many are available online):

  • Power: we can pick any value between 0 and 1, but in practice a common value is 0.8. We can interpret it as “we want an 80% probability that we will correctly reject the null hypothesis”
  • Alpha: the significance level, i.e. the threshold against which the p-value is compared. We can pick any value between 0 and 1, but in practice a common threshold is 0.05
  • Effect size (d): used to estimate the overlap between the 2 distributions. Overlap is affected by the distance between the population means and by the standard deviations. A common way to combine those 2 elements into a single metric, known as Cohen’s d, is to divide the estimated difference in the means by the pooled estimated standard deviation. Other effect size measures include Pearson’s correlation coefficient, used for the relationship between two variables

Note: with equally sized groups, the pooled estimated standard deviation is calculated as follows:

Pooled estimated standard deviation = sqrt((S1² + S2²) / 2)

S1 = standard deviation of the first distribution

S2 = standard deviation of the second distribution
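As an illustration, the effect size can be computed from two samples as follows. This sketch assumes the equal-weight pooled standard deviation, sqrt((S1² + S2²) / 2), which is appropriate when both groups have the same size; the function names and the data are made up for the example.

```python
import math
from statistics import mean, stdev


def pooled_sd(sample_a, sample_b):
    """Equal-weight pooled standard deviation (assumes equally sized groups)."""
    s1, s2 = stdev(sample_a), stdev(sample_b)
    return math.sqrt((s1 ** 2 + s2 ** 2) / 2)


def effect_size_d(sample_a, sample_b):
    """Estimated difference in means divided by the pooled standard deviation."""
    return abs(mean(sample_a) - mean(sample_b)) / pooled_sd(sample_a, sample_b)


# Made-up pilot measurements, for illustration only.
group_a = [4.1, 5.0, 5.6, 4.8, 5.2]
group_b = [6.0, 6.9, 7.4, 6.3, 7.1]
print(f"Estimated effect size d: {effect_size_d(group_a, group_b):.2f}")
```

In practice the means and standard deviations fed into this calculation usually come from pilot data or from previous studies, since the power analysis is done before the main data collection.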

Interpreting the result

Let’s say the calculator returns 9. This means that if I collect 9 measurements per group, I will have an 80% chance of correctly rejecting the null hypothesis, provided the effect size I assumed is accurate.
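For readers who want to see roughly what such a calculator does, here is a minimal sketch based on the normal approximation for a two-sided, two-sample comparison of means with equal group sizes. The function name `sample_size_per_group` is hypothetical, and calculators built on the t-distribution typically return slightly larger numbers, especially for small samples.

```python
import math
from statistics import NormalDist


def sample_size_per_group(effect_size, power=0.8, alpha=0.05):
    """Approximate measurements needed per group (normal approximation)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_power = std_normal.inv_cdf(power)
    n = 2 * ((z_alpha + z_power) / effect_size) ** 2
    return math.ceil(n)  # round up: a fraction of a measurement is not possible


for d in (0.5, 0.8, 1.0):
    print(f"d={d}: about {sample_size_per_group(d)} measurements per group")
```

The smaller the effect size (the more the 2 distributions overlap), the more measurements are needed to reach the same power.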
