Synthetic Data – an Overview, a Taxonomy & some FAQs

Ray Poynter, 20 April 2024


AI is a hot topic, and in the insight and research ecosystem, the hottest topic is Synthetic Data. In this post, I look at what Synthetic Data means, why it is such a hot topic, some of the key issues, and share some thoughts about Synthetic Data and its use.

In writing this analysis, I am drawing on several resources, including a study conducted a couple of weeks ago by NewMR, which examined the global state of play with respect to AI and the insights and research industry.

The NewMR study asked, “Are you familiar with the term Synthetic Data in the context of insights and research?” The results below show that 50% are familiar, and nearly 80% are familiar or familiar to some extent. This is remarkably rapid awareness, since the current interest in Synthetic Data is mostly predicated on the launch of widely available LLMs (Large Language Models), and the earliest of the current crop was launched in November 2022, about 16 months ago.

Chart showing whether people are familiar with Synthetic Data.

What is Synthetic Data?
This is both an easy and a difficult question. The easy answer is that Synthetic Data refers to using AI to create simulations of research participants so that research can be conducted with them, instead of with ‘real’ people.

The more complex answer covers everything that is not primary, unmodified data collected from real people. In this broader context, people have highlighted that the term ‘Synthetic Data’ could encompass a wide range of things, including weighting data, using data fusion, imputing missing values, and applying Hierarchical Bayes in fields like Discrete Choice Modelling. These techniques, which might today be called Synthetic Data, go back many years, as described in my post on Synthetic Data from July 2023.

In the recent NewMR study, we asked, “What does the term ‘Synthetic Data’ mean to you?” as an open-ended question. This allows a taxonomy to be developed that reflects how the term Synthetic Data is being used. From responses in the study, it is clear that most people have a pragmatic and relatively narrow definition of Synthetic Data. A taxonomy of how the term Synthetic Data is used in the insights ecosystem at the moment is outlined below.

AI Based

  • AI is used to generate
    • Answers to questions
    • Constructed participants (qual or quant) that can then be questioned.
  • Answers are generated from
    • Information that an AI system already holds.
    • First-party data (e.g. from surveys conducted)
    • A combination of these two
  • Key Uses
    • Creating personas that can be ‘questioned’ in a qualitative way.
    • Creating quant data that can then be analysed using conventional quant tools.
    • Augmenting data sets, for example, to boost hard-to-reach segments (both qual and quant).

Here are three archetypal descriptions of Synthetic Data that were supplied in the study:

  1. General, “Data that is created by AI or with the use of algorithms to mimic real-world data for the purposes of analysis or simulation where actual data is not available or cannot be used.”
  2. Negative, “Mostly bollocks. The shallow simulations offered by what is termed as ‘Synthetic Data’ rarely capture the nuanced complexities of real-world scenarios.”
  3. Positive, “Synthetic Data is essential for augmenting our datasets, allowing us to simulate a wider range of scenarios than our real-world data permits.”

Augmenting Data versus Creating Data
One important distinction in Synthetic Data is between augmenting primary data and creating data entirely from an AI system. Creating Synthetic Data entirely from AI, with the aim of replicating or mimicking real, primary data, is a relatively new phenomenon. The use of algorithms or AI to take existing data and generate additional data has a much longer history. Different proponents and vendors of this second approach often have their own names for what they do, but the generic term for it is augmented data.

An augmented data approach, for example the approach offered by the Argentinian company Livepanel, collects primary data in the conventional way and then creates additional cases to fill the gaps. For example, if a study has too few young men, the system can look at the distribution of responses by age and gender, identify the patterns within the data, and generate cases of young men who, in aggregate, complement what is already known.
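To make the augmentation idea concrete, here is a minimal sketch in Python. It fills a shortfall in one segment by resampling from that segment’s observed answers. This is an illustration only, not Livepanel’s (or any vendor’s) actual method; the column names are made up, and sampling each question independently is a simplification, since real systems try to preserve the joint patterns in the data.

```python
# Minimal sketch of segment augmentation (illustrative only, not a vendor's method).
# Assumes a survey DataFrame with demographic columns and answer columns.
import numpy as np
import pandas as pd

def augment_segment(df, segment_mask, n_needed, answer_cols, seed=42):
    """Create n_needed synthetic cases by resampling a segment's observed answers."""
    rng = np.random.default_rng(seed)
    segment = df[segment_mask]
    if segment.empty:
        raise ValueError("No primary cases to learn the segment from")
    synthetic_rows = []
    for _ in range(n_needed):
        # Note: sampling columns independently ignores correlations between answers.
        row = {col: rng.choice(segment[col].dropna().to_numpy()) for col in answer_cols}
        row["is_synthetic"] = True
        synthetic_rows.append(row)
    return pd.concat(
        [df.assign(is_synthetic=False), pd.DataFrame(synthetic_rows)],
        ignore_index=True,
    )

# Hypothetical usage: boost 18-24 males by 40 synthetic cases.
# mask = (survey_df["age_group"] == "18-24") & (survey_df["gender"] == "Male")
# augmented = augment_segment(survey_df, mask, n_needed=40, answer_cols=["q1", "q2", "q3"])
```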

Is Synthetic Data a good thing?
In the NewMR Study, we asked the participants, “Is Synthetic Data Likely to be a Good or Bad Thing?”

Chart showing whether people think Synthetic Data is a good or bad thing.

As the chart above shows, there is no consensus on whether Synthetic Data is likely to be a good thing or a bad thing. Whatever your view is about Synthetic Data, it is perhaps useful to remember that you are in a minority and that everybody else is in a minority too.

What are the ethical issues around Synthetic Data?
A key thing to remember about AI and ethics is that all of the existing research ethics issues apply, for example as delineated by the ICC/ESOMAR Code, and then there are additional AI-related issues, such as identifying underlying biases, the risk of hallucinations, and the use of other people’s material to train the AI. Finally, Synthetic Data may add its own ethical issues.

Potential ethical issues specific to Synthetic Data include:

  • Does it work? If it does work, under what conditions does it work?
  • What biases could the AI system, including its guardrails, introduce to the research process?
  • Will Synthetic Data get out of date, especially if there is a decline in the amount of primary data being collected?
  • Could Synthetic Data pollute AI systems and thereby erode their usefulness? For example, it is claimed that more than 50% of all images accessible on the Internet were created or modified by AI, which means AI is currently being trained on the outputs of AI systems. If reports based on Synthetic Data become normal and used to train AI systems, this feedback concern could be an issue.
  • Will Synthetic Data lead to more focus on mainstream segments of society at the expense of groups who are less represented in the data used to train the AI?
  • What needs to be disclosed to the buyer and the user of research utilising Synthetic Data? At present, there seems to be a consensus that the use of AI in general, including the use of Synthetic Data, should be disclosed to buyers and users. ESOMAR has published 20 Questions that buyers of research utilising AI should ask prospective vendors, and this is a great starting point for anybody thinking of using an AI system, including those that involve Synthetic Data.

How should Synthetic Data be tested?
There are two key ways of testing the usefulness of Synthetic Data: first, by comparing it with primary data collected from real people, and second, by comparing the outputs from Synthetic Data with real-world outcomes. Both have their merits, but neither is perfect. If we compare the results from Synthetic Data with the results from primary data, we have to accept that primary market research data is usually not 100% correct. Therefore, we are comparing Synthetic Data with something that is itself not accurate. This creates the possibility of Synthetic Data being simultaneously a) more correct than primary data and b) rejected as not being useful because it is not the same as the primary data.
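As a concrete illustration of the first kind of check, here is a minimal Python sketch that compares the answer shares of a synthetic sample with those of a primary sample on a single question, using a chi-square goodness-of-fit test. The choice of test is an assumption on my part, not an industry standard, and, as argued above, a significant difference does not automatically mean the synthetic data is the one at fault.

```python
# Minimal sketch: does a synthetic sample's answer distribution match the primary sample's?
from collections import Counter
from scipy.stats import chisquare

def compare_answer_shares(primary_answers, synthetic_answers):
    """Chi-square goodness-of-fit of synthetic answers against primary answer shares.

    Assumes every category seen in the synthetic data also appears in the primary data.
    """
    categories = sorted(set(primary_answers) | set(synthetic_answers))
    primary_counts = Counter(primary_answers)
    synthetic_counts = Counter(synthetic_answers)
    # Scale primary counts to the synthetic sample size so the totals match.
    scale = len(synthetic_answers) / len(primary_answers)
    expected = [primary_counts[c] * scale for c in categories]
    observed = [synthetic_counts[c] for c in categories]
    return chisquare(f_obs=observed, f_exp=expected)

# Hypothetical usage:
# stat, p_value = compare_answer_shares(real_q1_answers, synthetic_q1_answers)
# A small p_value suggests divergence from the primary data, which (as noted above)
# is itself an imperfect benchmark.
```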

If we compare Synthetic Data with real-world outcomes, we are assuming that the outcome is knowable from just the material that was tested. For example, if we pre-test a new product, that product’s performance in the market will depend on factors such as marketing spend, distribution, and the actions of competitors. There is a limit to how predictive a pre-test can be. A perfect result from Synthetic Data is likely to embody a large amount of luck that will not necessarily be repeated in future tests. Similarly, an estimate that is too large could simply be the result of poor marketing, not a deficiency in the test.

A bigger problem with testing Synthetic Data relates to how generalisable a set of test results is. Most tests only show how good or bad a particular application of Synthetic Data was on that occasion. It is relatively easy to construct cases where Synthetic Data appears to work and cases where it appears to fail.

If one wanted to show that Synthetic Data works, it is easy to create a test where Synthetic Data produces a correct result. For example, I recently created 100 personas of UK cola-flavoured soft drink drinkers and tested three options: Regular Coke, a Coke with beetroot flavouring, and None of these. The result was a big win for Regular Coke, with None of these a clear second and the beetroot-flavoured Coke a distant third. This result can readily be replicated, but it tells us nothing about tests in other fields, nor even about tests with flavours of Coke that are more similar to the original than beetroot (such as vanilla or cherry).
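For readers curious what such a test looks like in practice, here is an illustrative Python sketch of the persona exercise. The `ask_llm` function is a hypothetical stand-in for whatever LLM client you use, and the persona template and prompt wording are assumptions, not the exact prompts I used.

```python
# Illustrative sketch of a persona-based choice test (hypothetical prompts and client).
import random
from collections import Counter

OPTIONS = ["Regular Coke", "Coke with beetroot flavouring", "None of these"]

def build_persona(i, rng):
    age = rng.randint(18, 70)
    gender = rng.choice(["male", "female"])
    return (f"You are persona {i}: a {age}-year-old {gender} UK consumer "
            "who regularly buys cola-flavoured soft drinks.")

def run_choice_test(ask_llm, n_personas=100, seed=7):
    """Ask each synthetic persona to pick one option and tally the results."""
    rng = random.Random(seed)
    tallies = Counter()
    for i in range(n_personas):
        prompt = (build_persona(i, rng)
                  + " Which of these would you choose: "
                  + ", ".join(OPTIONS)
                  + "? Reply with the option text only.")
        tallies[ask_llm(prompt).strip()] += 1
    return tallies

# Hypothetical usage, where my_llm_client(prompt) returns a string:
# print(run_choice_test(ask_llm=my_llm_client).most_common())
```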

If one wanted to show that Synthetic Data did not work, one could pick a field such as elections or opinion polling. In the tests I have run, Synthetic Data does a bad job of replicating polling data, unless one adds recent polling information to the mix. However, my tests only showed that the version of Synthetic Data I used did not work in the context of opinion polling. My tests did not show whether another Synthetic Data approach would have worked, or whether this approach would have worked in other contexts.

My feeling is that first we need a wide range of comparative tests to be conducted, and then we need meta-analyses to identify patterns that suggest when, where and how Synthetic Data works, and what circumstances tend to indicate that it won’t work. Unfortunately, this work will need to be ongoing, as we expect Synthetic Data to get better and better over the next few years (and even if it does not get better, it will certainly change).

Synthetic Data is Going to Be BIG
My prediction is that Synthetic Data is going to be big, very big, quite quickly. By big, I am thinking of something like 10% of the data used in research being Synthetic within four to five years, 20% within six to ten years, and perhaps 50% within fifteen to twenty years.

In looking at the future of Synthetic Data, I am mostly drawing on my experience of being an early adopter of online surveys and the pattern of adoption of online qual during the pandemic. Synthetic Data, in most cases, will be cheaper and faster than conventional data, but not as good. Online surveys were faster and cheaper than face-to-face and CATI surveys, but they were not as good. Companies are always faced with a trade-off between cost, speed and quality. In the 1980s, a 500-person CATI study was not as good as a 1000-person, door-to-door study conducted with high-quality protocols – but it was cheaper and faster and was chosen more often. I think there will be many clients who choose the faster, cheaper, less good option for some of their research when Synthetic Data becomes more widely available and easier to buy.

The only caveat I would highlight is the general caveat about AI: if AI in general runs into problems, then Synthetic Data will encounter problems too. For example, if Governments decree that any Generative AI trained on data it did not have explicit permission to use has to close, we would see the end of ChatGPT, Google Gemini, etc., and that would be a major problem for AI. If AI systems become the subject of major hacking, that could be another problem.

FAQs
In this final section I highlight some of the questions I tend to get asked on this topic and the answers I currently give.

  • Will Governments pass laws or make regulations controlling the use of Synthetic Data? My feeling is no, I don’t think we will see this soon. My logic is that this is too specific to the insights industry, and new laws are rarely drafted that intentionally regulate the insights industry; we are normally caught as an unintended consequence of some other topic, for example controlling telemarketing or data breaches. I think the general topic most likely to impact Synthetic Data relates to deep fakes. A clumsy drafting of laws about deep fakes could conceivably include Synthetic Data (as fake data).
  • Does Synthetic Data raise privacy concerns? It is possible to think of ways that Synthetic Data might impinge on data privacy, for example if it enabled people to reverse engineer who is in the data. However, it is more likely that Synthetic Data will present fewer privacy issues than primary, real data. Indeed, organisations have used a version of Synthetic Data to make real data more anonymous: census data for specific locations often has noise added to it to avoid individuals being identified, and this noise is a form of Synthetic Data (see the sketch after this list).
  • Should we change the name of Synthetic Data? There are companies who already do not use the term for their synthetic data. I see terms like augmented data, virtual participants, and consumer simulations being used. From an industry regulation and best practices point of view, I think it is useful to have one phrase that describes a phenomenon that needs specific attention. My feeling is that ESOMAR, MRS, Insights Association, JMRA, The Research Society etc should keep calling this family of approaches Synthetic Data. I think vendors are free to call their products whatever they like, as long as they acknowledge that what they are doing falls under the general umbrella of what the associations call Synthetic Data.
  • Isn’t Synthetic Data just Fake Data? The use of the term Fake Data by the detractors of Synthetic Data seems to me to be analogous with vendors choosing to call their product, say, virtual participants. It is perfectly acceptable, provided they are not intending to misinform. Personally, I would not use the term fake. To me it implies somebody trying to pass something off as real, when it is not real. If somebody were creating Synthetic Data and selling it as real, primary data, that would be fake to me. But I do acknowledge that some people use the term fake more widely, for example those who describe meat substitutes as ‘fake meat’.
  • Is Synthetic Data only applicable to advanced markets? This argument is based on the notion that online surveys and online qual are both much more prevalent in richer countries. However, I do not think that there needs to be a major difference based on how advanced a country or market is. The creation of Synthetic Data, especially augmented data, is not based on how the data is collected. Even if the original data was collected with pen-and-paper research, it exists as online records. However, see the next point about language and culture.
  • Does Synthetic Data work equally well for all languages and cultures? The answer is, almost certainly not. In particular, for those approaches that are not using augmented data there are significant challenges for languages other than English and for non-Western cultures. The reason for this is that most of the AI systems were trained predominantly by processing English-language texts and texts that were written by Western writers. When people test the current leading US-owned AI systems by asking them exam questions in different languages, they score more highly when being asked the questions in English. In time, we hope, this pattern will become less strong.
  • Do I use Synthetic Data? I use some of the augmented data approaches, for example to deal with missing data, Hierarchical Bayes when doing Conjoint Analysis, and occasionally data fusion. In live projects, I do not yet use data created solely from AI, but I do use that type of Synthetic Data to test software and to test/evaluate research designs.
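As promised above, here is a minimal Python sketch of the census-style idea of adding noise to small-area counts so that individuals cannot be singled out. The Laplace mechanism and the epsilon value are illustrative assumptions, not any census bureau’s actual specification.

```python
# Minimal sketch of noise-based disclosure control for small-area counts (illustrative).
import numpy as np

def noisy_counts(counts, epsilon=1.0, sensitivity=1.0, seed=0):
    """Perturb counts with Laplace noise, then round and clip so they still look like counts."""
    rng = np.random.default_rng(seed)
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=len(counts))
    return np.clip(np.round(np.asarray(counts) + noise), 0, None).astype(int)

# Hypothetical usage: published = noisy_counts([3, 12, 0, 41])
```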
