Over the last few years there have been many calls for market researchers to stop using significance testing, which rests on assumptions of random probability sampling, to measure the potential impact of sampling error. For example, Annie Pettit, writing in The Huffington Post, urged readers to “Stop Asking for Margin of Error in Polling Research”. But despite the concerns about the correctness of this technique, it still seems to be in common use.
In this post, I briefly explain what significance testing is (experts can jump this bit), why it doesn’t do what people seem to think it should do, and the way I think we should be using it in the future.
What Is Significance Testing?
The type of testing I am talking about in this post relates to sampling error. In quantitative research, a sample is taken from a population and one or more statistics are calculated. These statistics are then used to estimate the values for the total population.
For example, assume 1000 people are selected at random from a population of 20 million. Assume that 50% of the sample are female. The inference from this study is that 50% of the total population would be expected to be female. However, when a sample is taken there is always a chance that the sample will differ from the population, purely by chance – and this is called sampling error. Significance testing attempts to work out what impact sampling error might have. In the case of the example above, with an estimate of 50% and a sample size of 1000 people, there is a 95% chance that the impact of any sampling error is no more than about plus/minus 3 percentage points – so we expect between 47% and 53% of the population to be female – and we expect to be wrong 5% of the time.
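For readers who want to see where the plus/minus 3 figure comes from, here is a minimal sketch of the usual normal-approximation calculation for a proportion. It assumes a simple random sample and ignores any finite population correction (negligible when 1000 people are drawn from 20 million). The function name margin_of_error is my own label, not something from a particular package.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Approximate margin of error for a proportion p from a simple
    random sample of size n (z=1.96 corresponds to 95% confidence)."""
    return z * math.sqrt(p * (1 - p) / n)

# The example from the text: 50% female in a sample of 1000 people.
p_hat = 0.5
n = 1000
moe = margin_of_error(p_hat, n)
low, high = p_hat - moe, p_hat + moe

print(f"margin of error: {moe:.3f}")          # roughly 0.031
print(f"interval: {low:.3f} to {high:.3f}")   # roughly 0.469 to 0.531
```

The result is slightly over 3 percentage points, which is why the figure is normally quoted as "about 3%".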
What Are The Problems With Using Significance Testing?
There are two key problems associated with Significance Testing:
1. Most of the samples used by researchers do not approximate to a random probability sample. This is either because they did not start out as a suitable sample (for example people on an online access panel), or because the response rate was too low.
2. There can be a tendency for people to assume the error in the research is described by (or limited to) the sampling error. One common name for this error is MOE (margin of error). However, there are many other sources of error, such as asking the wrong questions, getting the wrong answers, and doing the wrong processing (as we have seen with recent polling debacles, such as Trump vs Clinton in the USA and Brexit in the UK).
The Brexit polling in the UK shows the problems associated with significance testing. The polls that most closely approximated to random probability samples (the telephone polls) were less accurate than the online access panels (see YouGov). The reasons the polls did not perform better included: not sampling the right people (too few people over 70 were sampled, and those that were sampled then had to be upweighted); not everybody questioned being willing to answer; and the answers that some people gave about their likelihood to vote being inaccurate.
Research academics (and some market researchers) focus on something called Total Survey Error, which is a move in the right direction. But most software, most reports, and most articles in the media focus on sampling error, despite it being methodologically inappropriate and sometimes not the biggest source of error. In elections such as the Trump/Clinton Presidential race, the very large number of polls means that the net sampling error across all the studies would be minuscule – but the error in the net predictions was catastrophic. Or, to put it another way, the sampling error was small, but the Total Survey Error was unacceptably large.
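To illustrate why pooling many polls shrinks sampling error but not other error, here is a rough sketch. All the numbers are hypothetical (100 polls of 1000 people each, and a 2 point bias shared by every poll), and treating the pooled polls as one big simple random sample is a simplification made only for illustration.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion (simple random sample)."""
    return z * math.sqrt(p * (1 - p) / n)

# Hypothetical figures, chosen only to illustrate the point.
polls = 100          # number of polls being combined
n_per_poll = 1000    # respondents in each poll
shared_bias = 0.02   # non-sampling error common to every poll (2 points)

single = margin_of_error(0.5, n_per_poll)          # one poll on its own
pooled = margin_of_error(0.5, polls * n_per_poll)  # all polls combined

print(f"sampling error, one poll:  +/- {single:.4f}")
print(f"sampling error, pooled:    +/- {pooled:.4f}")
print(f"shared bias (unchanged):       {shared_bias:.4f}")
```

Combining 100 polls cuts the sampling error by a factor of ten, but a bias that every poll shares is exactly as large after pooling as before, and in this sketch it ends up much bigger than the pooled sampling error.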
So, Should Significance Testing Be Abolished?
My answer to this question is NO, but I think we need to change the context and role of significance testing.
Before I share my three-part recommendation, I want to spend a moment considering what the sampling error is actually reporting. Assume we run a study with an online access panel, and collect a quota-controlled sample of one thousand people. If we calculate the sampling error, it is not the sampling error of the entire target population. It is the sampling error of the people we might have interviewed with this specification. Let’s call this the potential study population (people who are on the panel, who were available when the survey was run, and who would have agreed to take part if asked).
The sampling error statistic does not tell us about how ‘wobbly’ our estimate of the target population is. The sampling error statistic tells us how ‘wobbly’ our estimate of the potential study population is. In essence, the sampling error statistic we calculate from non-probability samples is a measure of reliability – but tells us little about validity.
So, my three recommendations are:
- Researchers should avoid reporting sampling error to non-specialist audiences, and they should ban the use of the term MOE (margin of error) in the context of sampling error.
- Researchers should promote the concept of Total Survey Error, especially with junior researchers and research users.
- Sampling error should be calculated to divide results into two categories: a) results that are potentially interesting (subject to further validation), b) results where the differences are too small to assume that they actually exist.
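As one way the third recommendation might be put into practice, here is a minimal sketch that screens the difference between two proportions using a standard two-proportion z-test with a pooled variance. The choice of test, the 1.96 cut-off, the function name screen_difference, and the example figures are all my assumptions for illustration, not a prescribed method.

```python
import math

def screen_difference(p1, n1, p2, n2, z_crit=1.96):
    """Sort a difference between two sample proportions into one of two
    categories, using a pooled two-proportion z-test as the screen."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        # Both samples are 0% or both are 100%: there is no difference to test.
        return "too small to assume it exists"
    z = abs(p1 - p2) / se
    if z > z_crit:
        return "potentially interesting (subject to further validation)"
    return "too small to assume it exists"

# Hypothetical results from two groups of 1000 respondents each.
small_gap = screen_difference(0.50, 1000, 0.53, 1000)
large_gap = screen_difference(0.50, 1000, 0.56, 1000)
print("50% vs 53%:", small_gap)
print("50% vs 56%:", large_gap)
```

Note that passing the screen only moves a result into the "potentially interesting" pile; it says nothing about whether the finding is valid for the target population.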