Ray Poynter, July 20 2023
(Note, my views expressed in this blog are my personal views and do not necessarily represent the views of other people or organisations.)
One of the topics generating the most heated debate in the AI/MR/insights arena at the moment is that of synthetic data. Here is my take on what it is, why it will be such a large part of what we do in the future and some of the key steps we need to take.
What is Synthetic Data?
Almost nothing is easy to define, and Synthetic Data is no exception, but here goes. Synthetic Data is data that is constructed and used in place of ‘Real Data’. For example, imagine I have written a questionnaire for an online data capture system. I want to test that the questionnaire works in the way I intended. I could ask 1000 people to test the questionnaire, but that is slow and expensive. Instead, I can use the computer to generate, say, 10,000 random interviews. This will show whether all the branches are visited, nothing breaks, and everything seems OK. These 10,000 random responses are synthetic data.
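To make the questionnaire-testing idea concrete, here is a minimal sketch in Python. The questionnaire structure, question IDs, and routing rules are all invented for illustration; real survey platforms have their own formats, and this is not any platform's actual API.

```python
import random

# Hypothetical questionnaire: each question has answer options and optional
# routing rules mapping a specific answer to the next question (otherwise we
# simply fall through to the next question in ORDER).
QUESTIONNAIRE = {
    "Q1": {"options": ["Yes", "No"], "route": {"Yes": "Q2", "No": "Q3"}},
    "Q2": {"options": ["A", "B", "C"], "route": {}},
    "Q3": {"options": [str(i) for i in range(1, 11)], "route": {}},
}
ORDER = ["Q1", "Q2", "Q3"]

def simulate_interview(rng):
    """Walk the questionnaire, choosing random answers, and record the path."""
    responses = {}
    idx = 0
    while idx < len(ORDER):
        qid = ORDER[idx]
        question = QUESTIONNAIRE[qid]
        answer = rng.choice(question["options"])
        responses[qid] = answer
        # Follow an explicit route if one exists for this answer,
        # otherwise continue to the next question in sequence.
        next_q = question["route"].get(answer)
        idx = ORDER.index(next_q) if next_q else idx + 1
    return responses

rng = random.Random(42)
interviews = [simulate_interview(rng) for _ in range(10_000)]

# Check that every question was visited at least once across the synthetic sample.
visited = {qid for interview in interviews for qid in interview}
print(visited == set(ORDER))  # expect True with 10,000 random walks
```

The 10,000 simulated walks are worthless as answers, but they exercise every branch of the routing logic, which is exactly what this kind of synthetic data is for.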
On the other hand, and this is part of what is generating the heat, I could ask ChatGPT to pretend to be a middle-aged salaryman in Osaka and conduct a qualitative interview with this constructed persona. That would also be synthetic data, i.e. I have constructed data (with the help of a Large Language Model), and it is used in place of an interview with a real person. (Note, at this point, I am not considering whether this approach actually works.)
I acknowledge that I am using the terms ‘Real Data’ and ‘Constructed Data’ without delving into what those terms actually mean.
Some Synthetic Data has been here for years
Although the debate about Synthetic Data seems very new, we have been using some forms of synthetic data for many years. Here are a few examples.
Adding noise to protect anonymity
For decades Census authorities have added noise to Census records so that information can be shared with researchers without risking the researchers being able to identify individuals. This noise means that all large aggregations, for example, men versus women aged 20 to 30, will be correct, but small aggregations are not a one-to-one match with the underlying data. With the rise of more powerful algorithms and AI, the noise-adding process has had to become more sophisticated to prevent people from reverse engineering the underlying data.
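As an illustration of noise addition, here is a minimal sketch using Laplace noise (the mechanism behind differential privacy, which the US Census Bureau adopted for its 2020 disclosure-avoidance system). The counts and cell names are invented, and real systems are considerably more sophisticated than this.

```python
import math
import random

def add_laplace_noise(count, scale, rng):
    """Return the count perturbed by Laplace(0, scale) noise,
    sampled via the inverse-transform method."""
    u = rng.random() - 0.5  # uniform on [-0.5, 0.5)
    noise = -scale * (1 if u >= 0 else -1) * math.log(1 - 2 * abs(u))
    return count + noise

rng = random.Random(0)
# Invented counts: two large cells and one small, identifying cell.
true_counts = {"men 20-30": 5132, "women 20-30": 5241, "rare cell": 3}
noisy = {k: add_laplace_noise(v, scale=2.0, rng=rng) for k, v in true_counts.items()}
print(noisy)
```

The point the post makes falls out directly: noise of scale 2 barely changes a cell of 5,000 in relative terms, but it can substantially change (or even make negative) a cell of 3, which is what prevents small aggregations from identifying individuals.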
Missing data / data imputation
Sometimes we have missing data, for example, when people do not answer a question or if it was not asked to them. At its most simple, using the basic SPSS algorithm, the missing value is replaced by the sample mean. These missing values are Synthetic Data. There is a wide range of well-established techniques for dealing with missing data, using techniques that tend to be termed data imputation.
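Mean imputation, the simple approach the post attributes to SPSS's basic option, can be sketched in a few lines. The ratings below are invented; `None` marks a missing answer.

```python
import statistics

# Invented ratings from seven respondents; None = question not answered.
ratings = [7, None, 5, 9, None, 6, 8]

# Compute the mean of the observed (non-missing) values.
observed = [r for r in ratings if r is not None]
mean = statistics.mean(observed)  # (7 + 5 + 9 + 6 + 8) / 5 = 7.0

# Replace each missing value with the sample mean.
imputed = [r if r is not None else mean for r in ratings]
print(imputed)  # [7, 7.0, 5, 9, 7.0, 6, 8]
```

The two imputed 7.0s are synthetic data: they leave the sample mean unchanged but (as with all simple imputation) artificially shrink the variance, which is why more sophisticated imputation methods exist.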
Data fusion

Data Fusion refers to taking two data sets, for example, a set of data about media consumption and a set of data about purchases, and combining them to create a single data set that appears to hold data for people who provided both media data and purchase data. Generally, data fusion works by finding cases in one data set that match cases in the other. For example, both data sets might contain three people who live in the UK postcode sector ‘NG4 3’; without any other information, each of the three from the media data could be matched arbitrarily with one of the three people in the purchase data. Each individual match will probably be wrong, but the aggregates may be right. (The better the matching data, the better the outcomes tend to be.)
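The arbitrary-matching idea can be sketched as follows. All the records, IDs, and values are invented; real fusion uses many matching variables and statistical matching algorithms, not a simple positional pairing.

```python
# Invented media-consumption records and purchase records, sharing only
# a postcode sector ('NG4 3', as in the example above).
media = [
    {"id": "M1", "postcode": "NG4 3", "tv_hours": 12},
    {"id": "M2", "postcode": "NG4 3", "tv_hours": 30},
    {"id": "M3", "postcode": "NG4 3", "tv_hours": 20},
]
purchases = [
    {"id": "P1", "postcode": "NG4 3", "spend": 40},
    {"id": "P2", "postcode": "NG4 3", "spend": 85},
    {"id": "P3", "postcode": "NG4 3", "spend": 60},
]

# With no other matching variables, pair the cases arbitrarily (here, by position)
# to create fused records that appear to describe single people.
fused = [
    {"postcode": m["postcode"], "tv_hours": m["tv_hours"], "spend": p["spend"]}
    for m, p in zip(media, purchases)
]

# Each individual pairing is probably wrong, but the aggregates are preserved:
total_spend = sum(r["spend"] for r in fused)
print(total_spend)  # 185, identical to the total in the original purchase data
```

This is the core trade-off of fusion in miniature: every fused "person" is a construction, yet totals and means carry over exactly.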
Data for testing systems
When testing algorithms or new systems, the normal practice is to generate constructed data that matches the expected distributions. This test data can then be used to assess the algorithms or processes. Beyond research, this approach is used massively in the sciences, such as physics and biology, to test theories and in engineering to test buildings and products.
Choice modelling and Hierarchical Bayes

For decades, synthetic data has been at the heart of choice modelling. In a typical choice modelling exercise (a form of conjoint analysis), each participant is asked to make about 10 choices. However, 10 choices do not provide enough information to calculate the partworths for each participant. Hierarchical Bayes (HB) is an iterative statistical process that generates multiple possible data sets, conforming to specific distributions, and compares them to the choices made by the real participants. The output from this process is a set of partworths for every attribute and every level, for every participant in the study. At the individual level, the values from the HB process often do not match the actual choices made by that individual, but they do work in aggregate. Indeed, many studies show that this HB Synthetic Data is more accurate than working directly with the underlying ‘Real Data’.
Of course, beyond HB, all conjoint What-If models are a form of Synthetic Data. A What-If model contains a set of partworths, enabling the analyst to create scenarios and test what the participants would have done if asked.
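A toy What-If model can be sketched directly from the description above. The partworths, attribute levels, and scenario are all invented, and a first-choice rule is used for simplicity (commercial tools often use logit shares instead, but the idea is the same).

```python
# Invented partworths for three participants: a utility for each attribute level.
partworths = [
    {"beef": 1.2, "liver": -0.8, "price_low": 0.6, "price_high": -0.6},
    {"beef": 0.4, "liver": 0.1, "price_low": 0.9, "price_high": -0.9},
    {"beef": -0.2, "liver": 0.7, "price_low": 0.3, "price_high": -0.3},
]

def utility(person, levels):
    """Total utility of a product = sum of the partworths of its levels."""
    return sum(person[level] for level in levels)

# What-If scenario: a beef burger at a high price vs a liver burger at a low price.
scenario = {
    "beef burger": ["beef", "price_high"],
    "liver burger": ["liver", "price_low"],
}

# First-choice rule: each participant "chooses" the option with the highest utility.
votes = {name: 0 for name in scenario}
for person in partworths:
    best = max(scenario, key=lambda name: utility(person, scenario[name]))
    votes[best] += 1

shares = {name: n / len(partworths) for name, n in votes.items()}
print(shares)  # {'beef burger': 0.3333..., 'liver burger': 0.6666...}
```

No participant was ever asked about this exact scenario; the predicted shares are synthetic, derived entirely from the stored partworths.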
What new Synthetic Data is Emerging?
A wide range of new types of Synthetic Data is emerging, but here are a few examples to help shape the discussion.
Virtual Eye Tracking
In the past, eye-tracking companies tested materials (for example, a poster or a new pack) with real participants using eye-tracking hardware. Some have now trained AI systems on their historical databases and use those systems to test materials, predicting where people's eyes would have looked if real participants had been asked. Virtual eye-tracking is much faster and much cheaper than eye-tracking with ‘Real’ participants.
This method can be, and is being, generalised to any large set of data. Machine learning can be deployed to analyse a large set of answers and create synthetic responses, i.e. it enables you to answer new questions from old answers.
All LLMs are Synthetic Data
All Large Language Models seek to predict what a real person would say if asked a specific prompt. An LLM can be asked to adopt different personas so that it will answer the same question with different responses.
Using LLMs to generate qualitative responses
Companies have started using LLMs, particularly ChatGPT 4, employing prompt engineering to create personas and then asking these personas to answer questions.
As a very simple example, I asked ChatGPT 4 to pretend to be a British 16-year-old male. I then offered it a choice between a burger made from beef or one made from liver and asked for an open-ended response. ChatGPT suggested a typical response might be “Alright, I’ll go for the beef burger. It’s just what I’m used to, you know? I’ve had loads of them and they’re always tasty. Liver? Not really my thing. It’s got a bit of a strong taste and I’m just not into it. Besides, you can’t beat a classic beef burger, can you?” I have no idea how valid this is, but it does seem plausible because plausible is what LLMs do well.
Note, the companies using this approach are somewhat more sophisticated than this example, which was constructed for illustrative purposes.
Using LLM to generate quantitative responses
Some companies are using LLMs in quantitative research either to a) create personas and then use them to answer quantitative questions, or b) directly predict the aggregate responses that we might get from Real Participants.
To take the simple example I created above, I asked ChatGPT to estimate how this teenager might rate their purchase intent for these two burgers using a 10-point scale. The answers were 9-out-of-10 for the beef burger and 2-3 out-of-10 for the liver burger. Again, I do not know how valid this answer is, but the LLM has created something plausible.
Also, taking the example above, I asked ChatGPT to estimate what might happen if we had asked 1000 young men in the UK this question. It estimated that 80-90% might choose beef and 10-20% choose liver. Once again, I have no idea if this is valid, but to me the 80:20 split is less plausible; a 90:10 split might have seemed more believable.
Does Synthetic Data based on LLM Models Work?
In my opinion, we do not know yet whether this approach could be valid. If you search the literature, you will see papers showing examples of LLMs working and examples of LLMs not working. However, I feel much of the testing is interesting but not illuminating.
If I use an LLM Synthetic Data model to test something easy, like “Is beef more attractive than liver in burgers?”, I will probably get a plausible result. However, there is a bigger issue with this sort of test and technique. The critical skill with LLMs is prompt engineering: you can get almost any answer you want from an LLM by crafting the right prompt. If you are researching a field that you already understand, there is a risk that you will unwittingly shape your prompts in a way that helps generate findings that seem to validate the approach. And even if you used your LLM for a challenging test, avoided prompting the result you wanted, and got a good result, it could still just be chance; or it could be that the LLM can solve that specific problem but not other problems.
By contrast, papers are being published showing that an LLM was tested on a project and did not work. Here we must highlight three possibilities: a) perhaps the LLM was not used well, b) perhaps this specific problem is not amenable to an LLM approach, and c) maybe the next generation of LLMs will be able to do it.
We need much more work to identify when and how LLM-based models work or don’t work.
Why is Synthetic Data going to be so big?
If we look at virtual eye tracking, we can see that it is much faster and much cheaper. When we look back at the adoption of the Internet for research, we see that it was much faster, much cheaper, and not as good (in most cases). However, the Internet quickly became the largest data collection mode.
A second factor is the growing concern about the quality and validity of conventional panel data. These concerns hand the Synthetic Data vendors a marketing advantage.
If Synthetic Data shows itself to be as reliable as Real Data, then because it is cheaper and faster, it will become really big.
If the quality of Synthetic Data is unclear, then its cost and speed advantage will enable it to grow rapidly. One reason it might be unclear is that LLMs tend to produce plausible results, making a clear evaluation harder.
If Synthetic Data is shown to be worthless, it will struggle.
What are the concerns about Synthetic Data?
The main concern is whether Synthetic Data works, i.e. does it give us valid results?
A more subtle version of this concern is whether Synthetic Data will work in the future. At the moment, there is a lot of ‘Real Data’, and this can be used to train and evaluate ‘Synthetic Data’. But, if markets evolve, fashions change, and if there is little ‘Real Data’, will the Synthetic Data remain predictive?
A separate concern relates to the morality of tools like LLMs. They have collected their learning without paying for it and are now competing with those who helped pay for that original information and who still have to pay for new information. I have met several people with this moral objection, but I meet many more who see it as akin to the protests of everybody who has had their livelihood removed by the progress of technology (such as the Luddites of 19th-century England).
Some thoughts about a sensible approach to Synthetic Data
As I said in this post’s title, I am convinced that Synthetic Data will be a big part of market research and insights quite soon. However, for some of the reasons set out above, I think it will be a bumpy road. Some people will oppose it without reasonable grounds. Others will use it foolishly and reap problems. Both will cause bumps for our industry.
As an industry, we need to consider whether we should be neutral between Synthetic Data and Real Data (e.g. focus groups and surveys with real people), or whether we should defend the commercial interests of that large part of our industry whose skills and livelihoods depend on collecting ‘Real Data’. Perhaps we should note that the industry did not mount a campaign to protect interviewers when we moved from paper-and-pencil and CATI to the Internet.
We need to consider the ethical considerations of Synthetic Data, including the issue of how buyers will be able to judge the quality of what they are buying. Another ethical consideration is to ensure that data can’t be reverse engineered to identify who the underlying people are.
We also need to consider the positive aspects of some types of Synthetic Data, such as protecting people’s anonymity and reducing the number of times we need to ask them questions or involve them in discussions.
We need to avoid treating Synthetic Data as a single thing. Nobody should be ‘For Synthetic Data’ or ‘Against Synthetic Data’, because Synthetic Data is too many different things. Creating clearer definitions of different types of Synthetic Data would undoubtedly be helpful.
We need to improve our methods of evaluating which types of Synthetic Data work, what we mean by ‘work’, and in which situations they work.
Well, if you have got to here, that was a long read, thank you. If there are things you agree with, disagree with, or would like to add, please use the comments box.
Check out the discussion on LinkedIn about this post
There is a fantastic set of comments from some really bright people on LinkedIn about this post, check them out by clicking here.