Is weighting a form of synthetic data?

Ray Poynter, 14 May 2024

One of the surprises about the sudden and widespread interest in Synthetic Data is that it has forced us to revisit things we have been doing for decades. This revisiting of old practices is useful for framing where we think Synthetic Data should go and how it should be evaluated and regulated.

A case in point is weighting, which we have been doing with survey results for decades and never used to call Synthetic Data. But, as I will show with a simple example, weighting can be seen as synthetic data, and it can also highlight a potential benefit.

Example
Let’s consider a simple, hypothetical, quantitative survey project. We collect 400 responses. The responses should be 100 Young Men, 100 Young Women, 100 Older Men and 100 Older Women. However, it turns out we have 50 Young Men, 125 Older Men, 125 Young Women and 100 Older Women.

Traditional Treatment
In cases like this, what often happens is that we weight the data. We upweight all the Young Men (to 2 in this case) and we downweight the Older Men and the Young Women to 0.8, leaving the Older Women at weight 1. This weighting process superficially makes the data look like 400 people in the right proportions. However, it is still only 50 Young Men, and as the statistically astute among you will know, the Effective Sample Size is now just under 350 (ESS = 400² / 460 ≈ 348). See the table below.
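The weights and the Effective Sample Size can be checked with a few lines of Python. This is a minimal sketch that assumes the ESS quoted is Kish's approximation (the squared sum of the weights divided by the sum of the squared weights) and uses the cell counts implied by the weights in the text (50, 125, 125 and 100). The variable names are illustrative only.

```python
# Achieved and target cell counts from the example (Older Women taken as 100,
# which is what a weight of 1 and a total of 400 imply).
achieved = {"Young Men": 50, "Older Men": 125, "Young Women": 125, "Older Women": 100}
target = {cell: 100 for cell in achieved}

# Cell weight = target count / achieved count.
weights = {cell: target[cell] / n for cell, n in achieved.items()}

# Kish's approximation: ESS = (sum of weights)^2 / (sum of squared weights),
# summed over respondents, so each cell contributes n copies of its weight.
sum_w = sum(n * weights[cell] for cell, n in achieved.items())
sum_w2 = sum(n * weights[cell] ** 2 for cell, n in achieved.items())
ess = sum_w ** 2 / sum_w2

for cell, n in achieved.items():
    print(f"{cell:<12} n={n:>3}  weight={weights[cell]:.1f}  weighted n={n * weights[cell]:.0f}")
print(f"Effective sample size: {ess:.0f}")
```

The weighted cell sizes each come out at 100, but the ESS is lower than 400 because the weights are unequal.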

If we focus on each of the Younger Men, we can see that their responses appear twice in the data now. We could think of one version of their data as the ‘real’ data. Its mirror image would then be a synthetic person who has been created by simply replicating one existing person. When we write this process down this way, it readily suggests there ought to be a better way of creating synthetic participants than simply making an identical copy (clone) of each one.

One undesirable effect of this form of synthetic data is that it is very lumpy when we analyse it. For example, we can never get an odd number of Young Men. If we are looking at preferences between, say, two test concepts, then the difference has to be 0 or 2, or 4, or some other multiple of two (in terms of Young Men).
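The lumpiness is easy to see in a small simulation. The sketch below uses made-up yes/no answers (an assumption, not data from any real study) for 50 Young Men on two concepts, applies the weight of 2, and collects the differences in weighted counts across many random draws.

```python
import random

random.seed(1)  # arbitrary seed so the sketch is repeatable
WEIGHT = 2      # each of the 50 Young Men counts twice after weighting

differences = set()
for _ in range(1000):
    # Hypothetical data: each respondent says yes (1) or no (0) to each concept.
    likes_a = [random.randint(0, 1) for _ in range(50)]
    likes_b = [random.randint(0, 1) for _ in range(50)]
    weighted_a = WEIGHT * sum(likes_a)
    weighted_b = WEIGHT * sum(likes_b)
    differences.add(weighted_a - weighted_b)

# Every difference observed is a multiple of two.
print(sorted(differences))
```

However the answers fall, every weighted count is twice a whole number, so no odd difference can ever appear.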

Augmented Synthetic Data
There are lots of different ways of creating Augmented Synthetic Data, but here is a general description. We take the 400 interviews, and we look at the patterns within the data, including how similar are the Young Men to each other, how similar are Young Men to Young Women, how similar are Young Men to Older Men. We then create 50 profiles that draw on these patterns and distributions. These 50 cases would, in a simple case, have the same distribution and marginal totals as the real 50 cases, but they would not simply be clones.

This process can be much more complicated: it can be based on merging similar people, or it can draw on other data sets. However, the general approach is to use additional information to create the extra data, as opposed to simply creating identical copies.
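To make the idea of "same marginal totals, but not clones" concrete, here is a deliberately naive sketch. It is not any vendor's method and not the approach described above in full: it simply shuffles each question's answers independently among the 50 Young Men, which keeps every question's totals identical but ignores the relationships between questions and between groups that a real Augmented Synthetic Data method would draw on. The data and the three questions are hypothetical.

```python
import random

random.seed(7)  # arbitrary seed so the sketch is repeatable

# Hypothetical answers from 50 real Young Men:
# (likes concept A, likes concept B, satisfaction score 1-5).
real = [(random.randint(0, 1), random.randint(0, 1), random.randint(1, 5))
        for _ in range(50)]

# Shuffle each question's answers independently, then reassemble the rows.
columns = [list(col) for col in zip(*real)]
for col in columns:
    random.shuffle(col)
synthetic = list(zip(*columns))

# Every question keeps exactly the same set of answers (same marginals)...
for real_col, synth_col in zip(zip(*real), zip(*synthetic)):
    assert sorted(real_col) == sorted(synth_col)

# ...but the rows are recombined rather than copied position by position.
changed = sum(r != s for r, s in zip(real, synthetic))
print(f"{changed} of {len(real)} synthetic rows differ from the real row in the same position")
```

Because the shuffle treats the questions as unrelated, it would distort any cross-tabulation between them, which is exactly why practical methods model the patterns in the data rather than shuffling.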

One of the key benefits of Augmented Synthetic Data is that it typically behaves in a less lumpy way. If we now compare the scores for Young Men for two concepts that were tested, they can be identical or differ by any integer count, e.g. 1, 2, 3, 4. My experience with another Augmented Synthetic Data approach (Hierarchical Bayes) suggests that the Synthetic Data tends to be slightly more conservative than the original data, for example, with the same means and sums but with a slightly smaller standard deviation. This seems reassuring to me.

Augmented Synthetic Data is very different to Pure-play Synthetic Data
Pure-play Synthetic Data is created without any primary data. At present, no theory suggests why Pure-play Synthetic Data should work; there is some evidence of it working and some evidence of it not working.

Augmented Synthetic Data can be understood as a method and readily tested. For example, a section of data can be removed, and the system can then check the extent to which the missing data’s effects can be replicated.
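A hold-out test of this kind might be sketched as follows. Everything here is an assumption made for illustration: the function names `holdout_check` and `resample_generator` are hypothetical, the ratings are made up, and the placeholder generator (drawing with replacement from the retained data) stands in for whatever synthetic data system is actually being evaluated.

```python
import random
from statistics import mean

def holdout_check(scores, make_synthetic, n_holdout, seed=0):
    """Hide n_holdout real scores, synthesise replacements, return both means."""
    rng = random.Random(seed)
    shuffled = scores[:]          # copy so the caller's data is left untouched
    rng.shuffle(shuffled)
    held_out, kept = shuffled[:n_holdout], shuffled[n_holdout:]
    synthetic = make_synthetic(kept, n_holdout, rng)
    return mean(held_out), mean(synthetic)

def resample_generator(kept, n, rng):
    # Placeholder generator (an assumption): draw with replacement from the kept data.
    return [rng.choice(kept) for _ in range(n)]

# Hypothetical 1-5 ratings from 100 respondents.
data_rng = random.Random(42)
ratings = [data_rng.randint(1, 5) for _ in range(100)]

real_mean, synthetic_mean = holdout_check(ratings, resample_generator, n_holdout=25)
print(f"held-out real mean: {real_mean:.2f}, synthetic mean: {synthetic_mean:.2f}")
```

In practice the comparison would cover more than a single mean (distributions, cross-tabs, subgroup differences) and would be repeated over many different hold-out samples.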

In Summary
Augmented Synthetic Data is already here. If you currently apply weighting to your data, you should probably consider whether Augmented Synthetic Data is a better option.

(BTW, I want to give a shoutout to Livepanel’s Ben Leet, who was the first person I saw highlighting the issue of weighting being a form of synthetic data.)

Want to know more about the Quantitative uses of Generative AI?
I am running a course on using ChatGPT 4 with Quantitative Market Research on Wednesday May 29. The course is in-person, online, delivered by me and lasts 90 minutes (followed by up to 30 minutes extra Q&A). The price is $195 and the session starts at 10am New York (which is 3pm London). Check it out by clicking here.
