From time to time, I am asked to write some notes (or teach a section) on hypothesis testing. Each time I do this, I am reminded how little the theory of hypothesis testing has to do with modern, commercial market research. Perhaps we should stop focusing on a theory that does not really apply, and talk about what we actually do?
At its simplest, the hypothesis process is as follows:
- Decide we want to show X is correct
- Design a situation ‘Not X’ and collect data to investigate ‘Not X’
- Show that ‘Not X’ is very unlikely
- Assume X is right.
This is highly unnatural for most people. People want to focus on X, not show it as a by-product of something completely different. This method is loosely what is done in academia, but almost never in the commercial world.
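To make the four steps concrete, here is a minimal sketch in Python. All the numbers are hypothetical: two concepts scoring 12% and 6% in a survey of 300 people each. X is ‘the 12% concept really is better’; ‘Not X’ is a world where both concepts truly perform the same, and we simulate that world to see how rarely chance alone produces a gap as big as the one observed.

```python
import random

random.seed(1)

n = 300                          # hypothetical respondents per concept
share_c, share_b = 0.12, 0.06    # observed survey shares (illustrative)
observed_gap = share_c - share_b

# Step 2: design 'Not X' -- a world where both concepts truly have the
# same share (here, the average of the two observed shares, 9%).
pooled = (share_c + share_b) / 2

def simulated_gap():
    """One survey in the 'Not X' world: both concepts drawn from 9%."""
    c = sum(random.random() < pooled for _ in range(n)) / n
    b = sum(random.random() < pooled for _ in range(n)) / n
    return c - b

# Step 3: show 'Not X' is very unlikely to produce the observed gap.
trials = 10_000
extreme = sum(simulated_gap() >= observed_gap for _ in range(trials))
p_value = extreme / trials

print(f"simulated p-value under 'Not X': {p_value:.4f}")
# Step 4: if p_value is small (conventionally < 0.05), assume X is right.
```

Notice how indirect this is: the code never engages with X itself, only with how badly ‘Not X’ fails to explain the data – which is exactly what people find unnatural.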
Consider an example from concept testing
Assume we are testing three new concepts and the forecast market share values are 5%, 6%, and 12%. What do we really want to know?
On most occasions, I think we would like to know whether we should choose the concept with the 12% score. For example, is it genuinely better than the 5% and 6% results? And, when we say better we tend to mean that the 12% concept will sell more stuff than the other two, if it is actually produced.
There are a wide range of reasons why 12% may not be better than the 5% and 6% results, including:
1. The test we are using might not be very good at predicting market sales.
2. The market might change between now and the time of the product launch.
3. The product, when launched, might not match the concept that was tested.
4. The launch of the product may not be supported with enough advertising and marketing spend – or the campaign might not be very good.
5. We may have tested the concept amongst the wrong people, for example we might have tested it on brand loyalists rather than the wider market.
6. Sampling error may, by chance, have produced a rogue result.
Classic hypothesis testing only looks at item number 6 from this list. However, many people (especially the behavioural economists and neuroscientists) have suggested the biggest problem is number 1 (the tests we use are not great predictors of future behaviour). Many market researchers would say that the main cause of market failures relates to items 3 (the launched product not matching the tested concept) and 4 (not enough marketing support). A few people use the Total Survey Error framework, but that is still relatively rare.
Will the product really reach a 12% share in the market?
The short answer is, usually, no. There are too many factors at play to expect the forecast to be accurate, for the reasons outlined in the six points above.
Using benchmarks to aid forecasting
What typically happens in concept testing (and in particular in sales forecasting) is that agencies create models that link historical research results to actual outcomes. This means that the real message is not our forecast that the share is X, plus or minus some sampling error. The real message is that based on previous tests, we expect the result to be between X and Y. Better still, if you can tell us how much you plan to spend on marketing and what percentage of stores will stock this product, we can give you an estimate that is more likely to be correct. But when we say more likely, we do not mean theoretically more likely, we mean that based on previous experience we believe the results will be more accurate.
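As a sketch of that benchmark idea, suppose we have a handful of past studies where we know both the concept-test forecast and the share actually achieved at launch (the pairs below are invented purely for illustration). A simple least-squares fit turns a new forecast into a range based on how far past forecasts missed:

```python
import math

# Hypothetical benchmark data: (forecast share %, actual launch share %)
history = [(5, 4), (8, 6), (10, 9), (12, 8), (15, 12), (7, 6), (9, 7), (11, 9)]

xs = [f for f, _ in history]
ys = [a for _, a in history]
n = len(history)
mx, my = sum(xs) / n, sum(ys) / n

# Ordinary least squares: actual = intercept + slope * forecast
slope = sum((x - mx) * (y - my) for x, y in history) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx

# Spread of past errors around the fitted line
resid_sd = math.sqrt(sum((y - (intercept + slope * x)) ** 2 for x, y in history) / n)

# Translate a new 12% forecast into "we expect between X and Y"
forecast = 12
point = intercept + slope * forecast
low, high = point - 2 * resid_sd, point + 2 * resid_sd
print(f"forecast 12% -> expect roughly {low:.1f}% to {high:.1f}% (centre {point:.1f}%)")
```

The point is not the particular numbers (which are made up) but the shape of the message: past experience, not sampling theory, supplies the range.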
Is there a role for significance testing in market research?
I think the answer is yes, but not as a method of assessing the validity of the results. If we see a difference between two numbers and testing says the difference is not significant, it means that if we ran the test again, with the same survey and the same sort of people, there is a good chance the numbers would change. In effect, statistical significance is an indication of reliability (but not validity). Validity relates to whether the numbers are ‘right’; reliability indicates whether the numbers are stable (a clock that is one hour slow is reliable, but it is wrong). If our results are not even stable, we should be worried – so significance testing is a useful check of whether the numbers are at least stable.
For this reason, I am happy for people who do not understand the theory of hypothesis and significance testing to run tests. But the key is that if the numbers are significant, that just means they are big enough to be worth looking at – it does not mean they are important. If they are not significant, they could just be noise.
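As an illustration of significance-as-a-stability-check, here is the standard two-proportion z-test applied to the concept example, with an assumed sample size of 300 respondents per concept. The 12% vs 6% gap passes; the 6% vs 5% gap does not, so the latter could easily reshuffle on a re-run:

```python
import math

def two_prop_z(p1, p2, n1, n2):
    """Two-proportion z-test: returns the z score and two-sided p-value."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value

n = 300  # assumed sample size per concept
z_big, p_big = two_prop_z(0.12, 0.06, n, n)      # 12% vs 6%
z_small, p_small = two_prop_z(0.06, 0.05, n, n)  # 6% vs 5%

print(f"12% vs 6%: z = {z_big:.2f}, p = {p_big:.3f}")    # stable difference
print(f"6% vs 5%:  z = {z_small:.2f}, p = {p_small:.3f}")  # could be noise
```

Read in the spirit of the paragraph above: the first result is stable enough to be worth looking at, the second may just be noise – neither result says anything about whether the forecasts are valid.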
How do we assess confidence, if we don’t have benchmarks?
If we are running a research test and we do not have hundreds of other concept tests, or tracking studies, or customer satisfaction studies to benchmark our data against, how can we tell the user of the research how much confidence they should have in the results?
The answer to this question is not going to be found in terms of creating and dismissing a null hypothesis. The answer is going to be best achieved through as much triangulation as possible in the analysis. For example:
1. Are the differences big enough to matter (i.e. commercially significant)?
2. Are the differences statistically significant (if they aren’t, they might just be noise)?
3. Are the differences in this data consistent with other studies, findings, or experience?
4. Can the reason for the differences in the numbers be understood?
5. Can predictions be made from this data, which can then be tested?
Of the reasons above, number 4 is probably the most powerful, and the most dangerous. Many times, when I have conducted research and found something that looked like a finding but was not supported by theory, other data, or experience, I have followed it up by talking to people who are in the relevant market (e.g. talking to customers). In many cases, when you talk to a customer or somebody who deals with customers, the numbers make sense and this increases your confidence – that is the strength of this approach. However, humans are great at seeing patterns and reasons where they don’t exist, and that is the danger.
What do you think?
Should we stop assuming that every researcher should be taught how to use the null hypothesis, along with topics like Type I and Type II errors? Yes, some people should know this, for example the people designing new systems. But since these methods are almost never used commercially, can they be left to the marketing scientists and other specialists?