Synthetic Data: A Lexicon and Taxonomy

Ray Poynter, 22 September 2025

Edited 26 September 2025, new sections in blue.


Here is my attempt at an updated Lexicon and Taxonomy for Synthetic Data. I would love to hear your thoughts and suggestions.

Synthetic Data

According to the ICC/ESOMAR Code, “Synthetic data means information that has been generated to replicate the characteristics of real-world data.”

Why do we want to name things?

People sometimes explain the reason for defining names and terms as the Rumpelstiltskin effect. This effect, derived from the children’s fairytale, asserts that we gain power over a thing when we name it. If we want to investigate the process of creating data (instead of, or as well as, collecting it), and if we want to differentiate between useful processes and snake oil, then we need the clarity and precision that naming brings.

Is Synthetic Data the Right Term?

To some extent, it does not matter what any one person or organisation thinks; the market and community determine the meaning and use of words. At present, Synthetic Data is the most widely used term in Research and Insights, and it is also the most widely used term in other domains.

Suggested Terms for Key Types of Synthetic Data

Here are my suggested key terms for synthetic data, in the context of market research and insights.

Synthetic Data Sets

Synthetic data sets are data sets that have been created either in part (see Synthetic Augmented Data) or entirely. The synthetic data can be generated by a wide range of means, including but not limited to AI. Two common routes to generating synthetic data sets are platforms that create augmented data sets and synthetic agents. These data sets can then be used as inputs into the analysis process.

Synthetic Augmented Data

Synthetic augmented data is synthetic data used in combination with a specific data set, such as the results from participants responding to a questionnaire. The key characteristic of these approaches is that the synthetic data is relevant in the context of the real data, as an addition to it.

This category can be further divided in two ways:

Scope:
Does it add rows (creating new/additional respondents), add columns (new questions for existing people), or fill in missing data (imputing values left blank by the people completing the questionnaire)? A project can use any combination of one, two, or all three of these.

Method:
How was the synthetic data created? For example, purely statistical approaches (such as SMOTE), traditional AI (such as machine learning), and generative AI. Projects can combine these approaches.
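To make the scope and method distinction concrete, here is a minimal sketch in Python, assuming a small pandas DataFrame of survey responses. It adds rows via a simple SMOTE-style interpolation between pairs of real respondents and fills in missing values with mean imputation. The data, column names, and function names are illustrative only, not part of any specific product or standard.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Illustrative survey data: two rating-scale questions, one with missing values
real = pd.DataFrame({
    "q1_satisfaction": [7, 5, 9, 6, 8, 4],
    "q2_likelihood":   [8, np.nan, 9, 5, np.nan, 3],
})

def add_rows_smote_style(df: pd.DataFrame, n_new: int) -> pd.DataFrame:
    """Add rows: create 'new respondents' by interpolating between
    randomly chosen pairs of real respondents (a SMOTE-style idea)."""
    new_rows = []
    for _ in range(n_new):
        a, b = df.sample(2).to_numpy()          # two real respondents
        lam = rng.uniform()                     # interpolation weight
        new_rows.append(a + lam * (b - a))      # a point between the two
    synthetic = pd.DataFrame(new_rows, columns=df.columns)
    return pd.concat([df, synthetic], ignore_index=True)

def fill_missing_mean(df: pd.DataFrame) -> pd.DataFrame:
    """Fill in missing data: impute blanks with the column mean
    (the simplest purely statistical approach)."""
    return df.fillna(df.mean(numeric_only=True))

augmented = add_rows_smote_style(fill_missing_mean(real), n_new=4)
print(augmented.round(2))
```

In practice, production tools use more sophisticated methods (nearest-neighbour SMOTE, machine learning models, or generative AI), but the scope question, rows versus columns versus missing values, stays the same.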

Synthetic Agents

“Agents” is a term widely used outside the world of market research and insights, and adopting it aligns us with the broader use of synthetic data. Synthetic agents are entities that can be used to answer new questions. Agents can interact directly with the user, or they can be used to generate synthetic data sets.

Some synthetic agents can be further subdivided into specific variants.

Digital Twins:

A digital twin is a synthetic agent constructed using the data and information from a specific individual. This information might have been collected specifically for the purpose of creating the twin (as in the Stanford studies, for example their 2023 study) or drawn from information collected previously (for example, past studies conducted with members of online communities).

Personas:

A persona is a synthetic agent intended to represent a group or category, rather than any specific individual. Some personas are constructed to be archetypes, for example, a set of personas to represent typical Gen Z types. Other personas are created to reflect a broader range of types, so that they can collectively mimic a population.

Testing and Validating Synthetic Data

In addition to establishing the key terms that define Synthetic Data, it is also helpful to highlight the key concepts in testing and validating Synthetic Data.

Can it replicate known data?

For augmented data, we can test this by removing some of the data from a real study, generating an augmented sample, and then checking two things: that the augmented data matches the removed data, and that the total sample including the augmented data matches the total sample including the original data.
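As an illustration of that hold-out check, here is a minimal Python sketch, assuming a pandas DataFrame of real responses. The generate_augmented_rows function is a deliberately crude placeholder standing in for whatever augmentation method a vendor actually uses, and the comparison shown here (column means) is only the first step; a fuller check would also compare distributions and relationships, as discussed under “What does ‘similar’ mean?” below.

```python
import numpy as np
import pandas as pd

def generate_augmented_rows(train: pd.DataFrame, n_rows: int) -> pd.DataFrame:
    """Placeholder for the vendor's augmentation method.
    Here it simply resamples training rows and adds noise, for illustration only."""
    rng = np.random.default_rng(0)
    base = train.sample(n_rows, replace=True, random_state=0).to_numpy(dtype=float)
    noise = rng.normal(0, 0.5, size=base.shape)
    return pd.DataFrame(base + noise, columns=train.columns)

def holdout_replication_test(real: pd.DataFrame, holdout_frac: float = 0.3) -> pd.DataFrame:
    """Remove part of the real study, augment from the remainder,
    and compare the augmented rows with the removed (held-out) rows."""
    holdout = real.sample(frac=holdout_frac, random_state=1)
    train = real.drop(holdout.index)
    augmented = generate_augmented_rows(train, n_rows=len(holdout))

    return pd.DataFrame({
        "held_out_mean":      holdout.mean(),
        "augmented_mean":     augmented.mean(),
        "full_real_mean":     real.mean(),
        "train_plus_aug_mean": pd.concat([train, augmented]).mean(),
    })

# Toy example: 200 respondents answering two 0-10 rating questions
rng = np.random.default_rng(7)
real = pd.DataFrame({
    "q1": rng.integers(0, 11, 200),
    "q2": rng.integers(0, 11, 200),
})
print(holdout_replication_test(real).round(2))
```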

For synthetic agents, a similar process can be undertaken. Take a real study that was not used in the creation of the agents, ask the synthetic agents to create responses to the questions used in the real study, and compare the agents’ answers with the real responses.

These examples are necessary but not sufficient tests. If synthetic data is to be effective in your field, given the types of responses you typically work with, it should be able to pass this test; passing it, however, does not guarantee success in future projects.

When do we expect it to work, and when do we not expect it to work?

It seems unlikely that synthetic data will work with every type of data, in every context, and for every purpose. Vendors need to be able to demonstrate when it will work and when it won’t, and explain why it works when it does and why it fails when it doesn’t.

In addition to the field, this question of when it works (and when it does not) also applies to countries, languages, and cultures.

What does ‘similar’ mean?

When comparing synthetic data with real data, we often need to look beyond the means and totals. We need to check that the underlying patterns are similar, for example, the distributions and correlations.
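One way to go beyond means and totals is to compare the distributions and the correlation structure directly. Here is a minimal sketch, assuming two pandas DataFrames with the same columns (one real, one synthetic); it uses a Kolmogorov–Smirnov test per column and the largest difference between the two correlation matrices. The toy data and any thresholds a reader might apply are illustrative, not industry standards.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def compare_distributions(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Kolmogorov-Smirnov statistic per column: 0 means identical
    empirical distributions, values near 1 mean very different ones."""
    rows = {}
    for col in real.columns:
        result = ks_2samp(real[col].dropna(), synthetic[col].dropna())
        rows[col] = {"ks_stat": result.statistic, "p_value": result.pvalue}
    return pd.DataFrame(rows).T

def compare_correlations(real: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    """Largest absolute difference between the two correlation matrices:
    a crude check that relationships between questions are preserved."""
    diff = real.corr() - synthetic.corr()
    return float(diff.abs().to_numpy().max())

# Toy example: the synthetic data matches the means but loses the correlation
rng = np.random.default_rng(3)
q1 = rng.normal(6, 2, 500)
real = pd.DataFrame({"q1": q1, "q2": 0.7 * q1 + rng.normal(0, 1, 500)})
synthetic = pd.DataFrame({"q1": rng.normal(6, 2, 500),
                          "q2": rng.normal(4.2, 2, 500)})

print(compare_distributions(real, synthetic).round(3))
print("max correlation difference:", round(compare_correlations(real, synthetic), 3))
```

In this toy example the synthetic columns have roughly the right means, yet the relationship between q1 and q2 has vanished, which is exactly the kind of problem a means-only comparison would miss.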

How do we test for statistical significance?

Conventional tests of statistical significance are not appropriate for synthetic data, so new forms of testing need to be established and shared.

How will the model be updated?

For augmented data, the issue of updating is usually simpler, because the synthetic data is generated from the current, real data in each project.

For synthetic agents, the need to update them is critical. Where will the fresh, real data come from, and how and when will the agents be updated?

Other Key Terms and Issues

This section is a compilation of terms frequently relevant to the debate about Synthetic Data and Generative Artificial Intelligence.

Ethics

Vendors and users of synthetic data must ensure that its use is always ethical. This includes, for example, vendors ensuring that buyers and users can assess the validity, trustworthiness, and limitations of the synthetic data being created.

Privacy and Security

Clients want to know their data will remain secure and that it is not used to train systems that others will use. Participants need to know that their rights will be respected, and their data kept secure.

Guardrails

Guardrails are explicit rules coded into the model creation process that limit what can be created and how the model will respond. For example, most AI systems try to prevent the creation of child pornography and other harmful content. Guardrails can have unintended consequences and can usually be circumvented via jailbreaking.

Bias

AI models reflect the bias inherent in any training data, and the presence of guardrails can exacerbate this. Vendors must explain the biases in their systems, seek to mitigate them, and ensure that buyers and users are aware of both known and potential biases.

Jailbreaking:

Jailbreaking is the practice of attempting to get an AI system to reveal information it has been instructed not to disclose. For example, if a system has a guardrail that prevents it from encouraging people to self-harm, a jailbreaker might ask it to provide examples of questions they should not ask, ostensibly to help them avoid asking such questions.

Cannibalism:

Cannibalism is the practice of feeding AI-created information back into the creation process. For example, training image creation tools on AI-generated images, or feeding synthetic respondents into the process used to create other synthetic respondents. This is widely regarded as bad practice. It can lead to Model Drift and even Model Collapse.

Model Drift:

Model drift occurs when the modelled data becomes increasingly unlike the real world. The causes can be systematic, resulting from changes in available inputs, or random effects that occur over time.

Model Collapse:

When there is too much conflicting information in a model, for example, because of drift, ingesting synthetic data, or conflicting guard rails, the model can start producing wrong and inappropriate answers. That outcome is referred to as model collapse.

Hallucinations:

Hallucinations occur when a model provides an incorrect answer, often a plausible one. The most common cause of hallucinations is that the AI system does not know the correct answer, so it makes a guess. Because the system is optimised to produce plausible-sounding answers, the guess is often wrong but believable.
