Is Statistical Significance an Accurate Reflection of Reality?
For over a century, statistical significance has been an important tool for validating research and determining whether it reflects reality. In the new millennium, statisticians are revisiting the meaning of statistical significance and getting ready to explore the world beyond “p<0.05”. A better understanding of hypothesis testing, significance, and p-values is extremely important for applying and interpreting them properly. As practitioners, we always need to keep in mind the basic principles of statistical inference to ensure the most accurate and actionable findings and recommendations.
Testing for Statistical Significance
Statistical significance testing is a tool for judging whether results, effects, or associations between variables could plausibly have arisen by chance alone, or whether they are likely to reflect something real.
In statistical research, we have to accept that a sample does not fully reflect all characteristics of the entire population it has been sourced from. So, to accurately report observations and conclusions, and make recommendations based on the sample, we need to establish if our hypotheses are compatible with the data based on modelling assumptions.
Statistical Significance and Hypotheses
To test for statistical significance, researchers start by formulating a null hypothesis. Usually, it states that there is no effect or no relationship between the variables. For example, the null hypothesis could be “The new webpage does not produce more clicks than the old one”. An alternative hypothesis in this case could be “The new webpage produces more clicks than the old one”.
The next step is usually to select an acceptable probability of error (the alpha level). The pre-defined alpha level sets how rare the results must be, under the assumption that the null hypothesis is true, before we reject that hypothesis. The choice of alpha is often subjective. In the scientific literature, most researchers select alpha=.05. This means that they are willing to accept a 5% probability of concluding that an effect exists when it really does not. In market research, we are often less demanding and use alpha=.1 (or 90% confidence).
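To make the meaning of alpha concrete, here is a minimal simulation sketch. It assumes a hypothetical setup that is not part of the original example: both webpages share the same true click rate, so the null hypothesis is true by construction, and every rejection is a false positive. The click rate, sample size, number of experiments, and the one-sided z-test used here are all illustrative assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist

ALPHA = 0.1          # tolerated false-positive rate (90% confidence)
TRUE_RATE = 0.10     # hypothetical click rate shared by both pages
VISITORS = 500       # hypothetical visitors per page in each experiment
EXPERIMENTS = 2000   # number of simulated experiments

def one_sided_p_value(clicks_old, clicks_new, visitors):
    """p-value for the alternative 'new rate > old rate', equal group sizes."""
    pooled = (clicks_old + clicks_new) / (2 * visitors)
    std_error = sqrt(pooled * (1 - pooled) * 2 / visitors)
    if std_error == 0:
        return 1.0  # no variation at all, so no evidence against the null
    z = (clicks_new - clicks_old) / visitors / std_error
    return 1 - NormalDist().cdf(z)

rng = random.Random(42)  # fixed seed so the run is repeatable

def simulate_clicks():
    """Count clicks among VISITORS visitors who each click with TRUE_RATE."""
    return sum(rng.random() < TRUE_RATE for _ in range(VISITORS))

false_positives = 0
for _ in range(EXPERIMENTS):
    clicks_old = simulate_clicks()
    clicks_new = simulate_clicks()  # same true rate: the null hypothesis holds
    if one_sided_p_value(clicks_old, clicks_new, VISITORS) < ALPHA:
        false_positives += 1

false_positive_rate = false_positives / EXPERIMENTS
print(f"Share of experiments that wrongly rejected the null: {false_positive_rate:.3f}")
```

Under these assumptions, the share of wrongly rejected null hypotheses should land close to the chosen alpha, which is exactly the error rate the researcher agreed to tolerate.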
After the data is collected, the p-value is calculated under the null hypothesis. The p-value is the probability of observing results at least as extreme as those in the sample if the null hypothesis were true, and it depends on the size of the effect, the sample size, and the variation in the tested metric. The lower the p-value, the harder it is to explain the results by chance alone. If the p-value is lower than the chosen alpha (for example, p<0.1), then the null hypothesis can be rejected in favor of the alternative. Again, this means that results like these would rarely occur if only chance were at work.
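The steps above can be sketched for the webpage example with a one-sided two-proportion z-test. This is only one of several tests that could be used here, and the click and visitor counts are made-up numbers chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

ALPHA = 0.1  # alpha level chosen before looking at the data

def one_sided_two_proportion_z_test(clicks_old, visitors_old, clicks_new, visitors_new):
    """Return (z, p_value) for H0: new rate <= old rate vs H1: new rate > old rate."""
    rate_old = clicks_old / visitors_old
    rate_new = clicks_new / visitors_new
    # Pooled click rate, i.e. the best estimate if both pages perform the same.
    pooled = (clicks_old + clicks_new) / (visitors_old + visitors_new)
    std_error = sqrt(pooled * (1 - pooled) * (1 / visitors_old + 1 / visitors_new))
    z = (rate_new - rate_old) / std_error
    # Probability of a difference at least this large if the null were true.
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value

# Hypothetical data: 100 clicks from 1,000 visitors (old) vs 125 from 1,000 (new).
z, p_value = one_sided_two_proportion_z_test(100, 1000, 125, 1000)
reject_null = p_value < ALPHA

print(f"z = {z:.2f}, p-value = {p_value:.3f}, reject null: {reject_null}")
```

With these illustrative counts, the p-value falls below 0.1, so the null hypothesis would be rejected at the alpha level commonly used in market research.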
A Shift in Modern Statistics
For more than a century, scientists and practitioners routinely used the process described above in their work, but the last decade brought a crucial paradigm shift to the world of modern statistics. In March 2019, the American Statistical Association (ASA) published a special issue of its official journal, The American Statistician. The issue included 43 articles by renowned experts denouncing the practice of classifying statistical results as “significant” or “non-significant” based solely on p-values and a threshold. The authors argued that this practice originated from confusion in the early history of statistics. The editorial by the ASA Executive Director and colleagues (Wasserstein, Schirm, & Lazar, 2019) recommended eliminating the use of “p < 0.05” and “statistically significant” in statistical analysis.
Following the publication, the President of the ASA, Karen Kafadar, convened a Task Force to write a comprehensive and clear statement about the proper use of statistical methods, specifically hypothesis testing and p-values. A unanimously agreed Task Force statement was released this summer and sent to multiple journals to inform the broad community of scientists and researchers. The statement reiterated that the “p-values themselves provide valuable information” and that they “should be understood as assessments of observations or effects relative to sampling variation, and not necessarily as measures of practical significance.” But most importantly, the ASA Task Force underlined that “p-values and significance tests, when properly applied and interpreted, increase the rigor of the conclusions drawn from data.”
Using Statistical Significance in Business
Companies use statistical significance to understand how strongly the results of a survey they’ve conducted should influence the decisions they make. For example, if we know that a new version of a webpage generates 6% more clicks than the old one, we may well conclude that the result is of practical significance. If the associated p-value is below 0.1, the results will be considered statistically significant, and the null hypothesis will be rejected. But what if the p-value is more than 0.1? Will we disregard the findings only because the p-value is slightly above the threshold we decided on? Let us consider the opposite situation. What if we worked with a large dataset and found that customers are 0.0001% more likely to click on the new webpage than on the old one? Even if the result is statistically significant, would our client act any differently based on an outcome like this? In business, a “significant” finding often means a strategically important one. As the examples above show, statistical significance does not always mean practical significance, and vice versa.
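Both situations can be illustrated with a small sketch. The counts below are hypothetical and were chosen only to reproduce the two patterns described above: a 6% lift whose p-value lands slightly above 0.1, and a very small lift that becomes statistically significant in a very large dataset. The one-sided z-test with equal group sizes is an assumption of this example.

```python
from math import sqrt
from statistics import NormalDist

ALPHA = 0.1

def summarize(clicks_old, clicks_new, visitors):
    """Return (relative_lift, p_value) for two equal-sized groups, one-sided z-test."""
    rate_old = clicks_old / visitors
    rate_new = clicks_new / visitors
    pooled = (clicks_old + clicks_new) / (2 * visitors)
    std_error = sqrt(pooled * (1 - pooled) * 2 / visitors)
    z = (rate_new - rate_old) / std_error
    p_value = 1 - NormalDist().cdf(z)
    relative_lift = (rate_new - rate_old) / rate_old
    return relative_lift, p_value

# Scenario 1 (hypothetical): 6% more clicks, 7,500 visitors per page.
lift_moderate, p_moderate = summarize(750, 795, 7_500)

# Scenario 2 (hypothetical): 0.2% more clicks, 50 million visitors per page.
lift_huge, p_huge = summarize(5_000_000, 5_010_000, 50_000_000)

for label, lift, p in [("moderate sample", lift_moderate, p_moderate),
                       ("huge sample", lift_huge, p_huge)]:
    verdict = "significant" if p < ALPHA else "not significant"
    print(f"{label}: lift = {lift:.2%}, p-value = {p:.4f} ({verdict} at alpha = {ALPHA})")
```

In this sketch, the 6% lift misses the 0.1 threshold by a small margin, while the 0.2% lift clears it easily. A decision rule based only on the threshold would rank the two findings in the opposite order from their likely business value.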
Non-sampling error due to survey design and other factors can be more critical than sampling error. Statistical tests can be designed in different ways and can account for different external factors. In most cases, a robust sample, sound survey design, and clean, low-noise data are more important for the accuracy and consistency of observations and recommendations than statistical significance testing.
For decision making, the practical application of findings matters most. For over a century, there was a bias in scientific literature that “a result wasn’t publishable unless it hit a p = 0.05 (or less).” But for many business applications, the more important question is, “Does the result stand up in the market, is it consistent and replicable?”
Reevaluating Statistical Significance in the Future
Revisiting and reevaluating fundamental concepts like statistical significance helps researchers deliver reliable and statistically sound results with consistent and clear recommendations. The idea is not to completely abandon p-values or hypothesis testing, but to make sure that the methodology is properly used and the outcome is thoughtfully interpreted.
In summary, we should not base our conclusions and recommendations only on whether an association or effect was found to be statistically significant, nor should we disregard a finding only because the p-value was slightly above a threshold. Instead, we focus on the context and objectives of the research and keep them in mind throughout the whole process, from design to analysis. When analyzing and interpreting results, we consider not just statistical significance, but also the precision of the estimates, the size of the effects, the modelling assumptions, and the practical implications.
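One common way to report effect size and precision together is a confidence interval for the difference in click rates. The sketch below uses a normal-approximation (Wald) interval, which is an assumption of this example rather than a prescribed method, and the counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def difference_with_interval(clicks_old, visitors_old, clicks_new, visitors_new,
                             confidence=0.90):
    """Return (difference, low, high): the click-rate difference and its Wald interval."""
    rate_old = clicks_old / visitors_old
    rate_new = clicks_new / visitors_new
    difference = rate_new - rate_old
    # Unpooled standard error: each page keeps its own estimated click rate.
    std_error = sqrt(rate_old * (1 - rate_old) / visitors_old
                     + rate_new * (1 - rate_new) / visitors_new)
    # Two-sided critical value, e.g. about 1.645 for 90% confidence.
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    return difference, difference - z_crit * std_error, difference + z_crit * std_error

# Hypothetical data: 750 clicks from 7,500 visitors (old) vs 795 from 7,500 (new).
diff, low, high = difference_with_interval(750, 7_500, 795, 7_500)

print(f"Estimated difference: {diff:.4f} (90% interval: {low:.4f} to {high:.4f})")
```

With these illustrative counts, the interval runs from slightly below zero to well above the estimated difference. That range tells the reader more about what the data can and cannot support than a single “significant” or “non-significant” label.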
Written by Faina Shmulyian, Vice President of Data Science at Big Village Insights