This post is a response to a query about sampling that I receive fairly often. It looks at the very weird effects that can occur when a researcher uses non-interlocking quotas, for example with an online access panel. I am calling these effects unintentional quotas.
In many studies, quota controls are used to try to achieve a sample to match a) the population and/or b) the target groups needed for analysis. Quota controls fall into two categories, interlocking and non-interlocking.
The difference between the two types can be shown with a simple example, using gender (Male and Female) and colour preference (Red or Blue). If we know that 80% of Females prefer Red, that 80% of Males prefer Blue, and that there are equal numbers of Males and Females in our target population, then we can create interlocking quotas. In our example we will assume that the total sample size wanted is 200.
- Males who prefer Red = 50% * 20% * 200 = 20
- Males who prefer Blue = 50% * 80% * 200 = 80
- Females who prefer Red = 50% * 80% * 200 = 80
- Females who prefer Blue = 50% * 20% * 200 = 20
These quotas deliver the 200 people required, in the correct proportions.
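The arithmetic behind these four cells can be sketched in a few lines of Python. This is only an illustration of the calculation above. The names `gender_share`, `colour_share`, and `interlocking_quotas` are made up for this sketch, and the shares are the assumed figures from the example.

```python
# Assumed inputs, taken from the worked example in the text.
total_sample = 200
gender_share = {"Male": 0.5, "Female": 0.5}
# Share preferring each colour within each gender.
colour_share = {
    "Male": {"Red": 0.2, "Blue": 0.8},
    "Female": {"Red": 0.8, "Blue": 0.2},
}

def interlocking_quotas(total, gender_share, colour_share):
    """Return the target count for every gender x colour cell."""
    return {
        (gender, colour): round(total * g_share * c_share)
        for gender, g_share in gender_share.items()
        for colour, c_share in colour_share[gender].items()
    }

quotas = interlocking_quotas(total_sample, gender_share, colour_share)
for cell, target in quotas.items():
    print(cell, target)
```

Note that the function cannot be run without `colour_share`, which is exactly the information a researcher often does not have before fieldwork.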
The Problems with Interlocking Quotas
The problem with the interlocking quotas above is that they require the researcher to know the colour preference of Males versus Females before doing the research. In everyday market research the quotas are often more complex, for example: 4 regions, 4 age breaks, 2 gender breaks, 3 income breaks. This pattern (of region, age, gender, and income) would generate 96 interlocking cells, and the researcher would need to know the population data for each of these cells. If these characteristics were then combined with a quota related to some topic (such as coffee drinking, car driving, TV viewing etc.), the number of cells would become very large, and it is very unlikely the researcher would know the proportions for each cell.
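As a quick check on how fast the cell count grows, the number of interlocking cells is simply the product of the number of breaks in each variable. In the sketch below, the three-level coffee-drinking quota is a hypothetical addition used only to show the growth. It is not a figure from any real study.

```python
from math import prod

# Number of breaks per quota variable, as listed in the text.
breaks = {"region": 4, "age": 4, "gender": 2, "income": 3}

def interlocking_cells(breaks):
    """The number of interlocking cells is the product of the break counts."""
    return prod(breaks.values())

print(interlocking_cells(breaks))  # 96

# Hypothetical extra variable: a three-level coffee-drinking quota.
with_topic = {**breaks, "coffee_drinking": 3}
print(interlocking_cells(with_topic))  # 288
```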
When interlocking cells become too tricky, the answer tends to be non-interlocking cells.
In our example above, we would have quotas of:
- Male 100
- Female 100
- Prefer Red 100
- Prefer Blue 100
The first strength of this route is that it does not require the researcher to know the underlying interlocking structure of the characteristics in the population. The second strength is that it makes it simple for the sample to be designed for the researcher’s need. For example, if in the population we know that Red is preferred by 80% of the population, then a researcher might still collect 100 Red and 100 Blue, to ensure the Blue sample was large enough to analyse, and the total sample could be created by weighting the results (to down-weight Blue, and up-weight Red).
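The weighting step in that example can be sketched as follows, assuming the simplest possible scheme: each group's weight is its population share divided by its share of the achieved sample. The function name `cell_weights` is made up for this illustration.

```python
# Population shares and achieved sample sizes from the example in the text.
population_share = {"Red": 0.80, "Blue": 0.20}
sample_counts = {"Red": 100, "Blue": 100}

def cell_weights(population_share, sample_counts):
    """Weight for each group = population share / achieved sample share."""
    total = sum(sample_counts.values())
    return {
        group: population_share[group] / (sample_counts[group] / total)
        for group in sample_counts
    }

weights = cell_weights(population_share, sample_counts)
weighted_counts = {g: weights[g] * sample_counts[g] for g in sample_counts}
for group in sample_counts:
    print(group, round(weights[group], 2), round(weighted_counts[group]))
```

Under these assumptions each Red interview counts as 1.6 people and each Blue interview as 0.4, so the weighted total is still 200, split 160 Red to 40 Blue.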
Unintentional Interlocking Quotas
However, non-interlocking quotas can have some very weird and unpleasant effects if there are differences in response rates in the sample. This is best shown by an example.
Let’s make the following assumptions about the population for this example:
- Prefer Red 80%
- Prefer Blue 20%
- No difference in colour preference by gender, i.e. 80% of Males and 80% of Females prefer Red
- Female response rate 20%
- Male response rate 10%
The researcher knows that overall 80% of people prefer Red, but does not know what the figures are for Males and Females. Indeed, the researcher hopes this project will throw some light on any differences.
The specification of the study is to collect 200 interviews, using the following non-interlocking quotas.
- Male 100
- Female 100
- Prefer Red 100
- Prefer Blue 100
A largish initial sample of respondents is invited: let's assume 1000 Males and 1000 Females. Note that 1000 Males at a 10% response rate should deliver 100 completes.
After 125 completes have been achieved the pattern of completed interviews looks like this:
- Female Red 67
- Female Blue 17
- Male Red 33
- Male Blue 8
This is because the chance that any one of the 125 completes falls in a given cell is the product of two things: the chance the respondent is Male or Female (a 10% Male response rate against a 20% Female response rate means a one-third chance of a Male and a two-thirds chance of a Female), and the chance of preferring Red (80%) or Blue (20%). To the nearest whole percentage, that gives the following odds: Female Red 53%, Female Blue 13%, Male Red 27%, Male Blue 7%.
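The cell odds and the expected counts at the point where the Red quota fills can be reproduced with a short sketch. It assumes equal invites by gender (as in the example), so the gender split of completes is proportional to the response rates. All variable names are illustrative.

```python
# Assumptions from the example: equal invites, different response rates.
invites = {"Female": 1000, "Male": 1000}
response_rate = {"Female": 0.20, "Male": 0.10}
preference = {"Red": 0.80, "Blue": 0.20}
red_quota = 100

# Expected responders by gender, and each gender's share of completes.
responders = {g: invites[g] * response_rate[g] for g in invites}
total_responders = sum(responders.values())
gender_prob = {g: responders[g] / total_responders for g in responders}

# Probability that any one complete lands in each gender x colour cell.
cell_prob = {
    (g, c): gender_prob[g] * preference[c]
    for g in gender_prob
    for c in preference
}

# The Red quota fills when completes * P(Red) reaches the quota.
completes_when_red_full = red_quota / preference["Red"]
expected = {cell: p * completes_when_red_full for cell, p in cell_prob.items()}

for cell in cell_prob:
    print(cell, f"{cell_prob[cell]:.0%}", round(expected[cell]))
```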
The significance of 125 completes is that the Red Quota is complete. No more Reds can be collected. This, in turn, means:
- The remaining 75 completes will all be people who prefer Blue
- 17 of the remaining interviews will be Female (two-thirds of the 125 completes so far, about 83, are Female, so the Female quota will close when we have another 17)
- 58 of the remaining interviews will be Male, because Male Blue will be the only cell left to fill
- The rapid filling of the Red quota, especially with Females, has resulted in interlocking quotas being created for the Blue cells.
The final result from this study will be:
- Female Red 67
- Female Blue 33
- Male Red 33
- Male Blue 67
Although there is no gender bias to colour preference in the population, in our study we have created a situation where two-thirds of Males prefer Blue, and two-thirds of the Females prefer Red.
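The whole fill process can be traced with a small expected-value sketch. It is not a simulation of individual respondents. It assumes completes arrive in fixed proportions and that a respondent is screened out as soon as either their gender quota or their colour quota is closed. The function `fill_quotas` and its arguments are hypothetical names for this illustration.

```python
def fill_quotas(cell_rates, quotas, eps=1e-9):
    """Fill non-interlocking quotas in expectation.

    cell_rates: relative arrival rate for each (gender, colour) cell.
    quotas: target for each gender and each colour (the margins).
    A cell accepts completes only while both of its margins are open.
    """
    counts = {cell: 0.0 for cell in cell_rates}
    remaining = dict(quotas)
    while True:
        open_cells = [
            (g, c) for (g, c) in cell_rates
            if remaining[g] > eps and remaining[c] > eps
        ]
        if not open_cells:
            break
        # Rate at which each margin is currently being filled.
        flow = {
            m: sum(cell_rates[cell] for cell in open_cells if m in cell)
            for m in remaining
        }
        # Advance until the next margin closes.
        step = min(remaining[m] / flow[m] for m in remaining if flow[m] > 0)
        for g, c in open_cells:
            added = cell_rates[(g, c)] * step
            counts[(g, c)] += added
            remaining[g] -= added
            remaining[c] -= added
    return counts

# Assumptions from the example: equal invites, so the gender split of
# completes is proportional to the response rates.
response_rate = {"Female": 0.20, "Male": 0.10}
preference = {"Red": 0.80, "Blue": 0.20}
total_rate = sum(response_rate.values())
cell_rates = {
    (g, c): response_rate[g] / total_rate * preference[c]
    for g in response_rate
    for c in preference
}
quotas = {"Female": 100, "Male": 100, "Red": 100, "Blue": 100}

final = fill_quotas(cell_rates, quotas)
for cell, count in final.items():
    print(cell, round(count))
```

Under these assumptions the sketch closes Red first, then Female, and finishes with only the Male Blue cell open, which is the sequence described above.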
In this example we are going to have to invite a lot more Males. We started by inviting 1000 Males, and with a response rate of 10% we might expect to collect our 100 completes. But because of the unintentional interlocking quotas, we have ended up needing to collect 67 Male Blues. We can work out the number of invites it takes to collect 67 Male Blues by dividing 67 by the product of the response rate (10%) and the incidence of preferring Blue (20%), which gives us 67 / (10% * 20%) = 3,350. The 1000 Male invites need to be boosted by another 2,350, to 3,350, to fill the cells. Most researchers will have noticed that the last few cells in a project are hard to fill. That is because they have created unintentional interlocking quotas, locking the hardest cells together, which makes them even harder.
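The invite calculation is easy to wrap in a helper for other cells. This is a rough sketch using the figures above, and `invites_needed` is a made-up name.

```python
def invites_needed(target_completes, response_rate, incidence):
    """Invites required to collect a target number of completes in one cell."""
    return round(target_completes / (response_rate * incidence))

initial_male_invites = 1000
total_male_invites = invites_needed(67, response_rate=0.10, incidence=0.20)
extra_male_invites = total_male_invites - initial_male_invites

print(total_male_invites)   # 3350
print(extra_male_invites)   # 2350
```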
This, of course, is a very simple example. We only have two variables, each with two levels, and the only varying factor is the response rate of Males versus Females. In an everyday project we would have more variables, and response rates will often vary by age, gender, and region. So, the scale of the problem in typical non-interlocking samples is likely to be larger than in this example, at least for the cells that are harder to complete.
Improving the Sampling/Quota Controlling Process
Once we realise we have a problem, and with the right information, there is plenty we can do to remove or ameliorate the problem.
- Match the invites to the response rates. If, in the example above, we had invited twice as many Males as Females the cells would have completed perfectly.
- Use interlocking cells. To do this you might run an omnibus before the main survey to determine what the cell targets should be.
- Use the first part of the data collection to inform the process. So, in the example above we could have set the quotas to 50 for each of the four cells. As soon as one cell fills we look at the distribution of the data and amend the structure of the quotas, making some of them interlocking, perhaps relaxing some of the others (i.e. making them bigger), and invite more of the sorts of people we are missing. This does not fix the problem, but it can greatly reduce it, especially if you bite the bullet and increase the sample size at your expense.
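The first suggestion, matching invites to response rates, comes down to dividing each gender quota by that gender's response rate. The sketch below uses the response rates assumed in the example, and `matched_invites` is an illustrative name. The 2:1 ratio of Male to Female invites is the point here. The absolute volumes would still need scaling up, in the same ratio, to fill the Blue quota.

```python
# Response rates and gender quotas assumed in the example.
response_rate = {"Female": 0.20, "Male": 0.10}
gender_quota = {"Female": 100, "Male": 100}

def matched_invites(quota, response_rate):
    """Invites per group so that expected responders equal the quota."""
    return {g: round(quota[g] / response_rate[g]) for g in quota}

invites = matched_invites(gender_quota, response_rate)
expected_responders = {g: invites[g] * response_rate[g] for g in invites}

print(invites)
print(expected_responders)
```

With responders arriving in equal numbers by gender, neither gender quota closes early, so the Blue completes are not pushed onto one gender.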
Working with panel companies. Tell the panel company that you want them to phase their invites to match likely response rates. They will know which demographics respond better. For the demographic cells, watch to see that they are advancing in step. For example, watch to see that Young Males, Young Females, Older Males, and Older Females are all filling at the same rate and shout if this is not happening.
It is a good idea to make sure that the fieldwork is not going to happen so fast that you won’t have time to review it and make adjustments. As a rule of thumb, you want to review the data when one of the cells is about 50% full. At that stage you can do something about it. This means you do not want the survey to start after you leave the office, if there is a risk of 50% of the data being collected before the start of the next day.
Questions? Is this a problem you have come across? Do you have other suggestions for dealing with it?