Draft Synthetic Data Manifesto
Ray Poynter, 30 September, 2024
In this note I set out what I believe Synthetic Data to be, why we need to define Synthetic Data, and some guidelines that I think vendors and buyers should adopt. I have been involved in a wide range of discussions with a wide range of organisations, but these views are my own; they do not represent the views of anybody else.
Scope – humans only
In this note I am only talking about the situation where data has been created to replace or augment data from or about people. Synthetic data could be created for other purposes, for example to represent companies and organisations, but this note does not cover anything except the case of synthetic data in the context of information relating to humans.
Why we need to define Synthetic Data
Synthetic data is already being sold, purchased and used, so a ‘wait-and-see’ approach is not appropriate. Buyers and users of research need and want to know what data and processes are being used to create the results they are using. If the data being used is not 100% data collected from real people and left unmodified, they should be told. Moreover, they should be told in ways that allow them to assess the reliability and validity of the results and advice being offered.
From a vendor’s point of view, the name used for data that has been constructed rather than collected does not matter. Indeed, a vendor might be well advised to avoid the term synthetic data from a marketing point of view. However, if a vendor is using data that is not data collected from real people, they should, in my opinion, declare it. Defining synthetic data provides a way of flagging that vendors and buyers of information are dealing with a category that needs declaring.
A Definition of Synthetic Data
Data that has been created to replace data that could or would otherwise have been collected from humans.
Note that this definition does not talk about how the data has been created; it does not assume that AI, LLMs, or any particular algorithm was used to create the data. Synthetic data is data that has been created instead of, or in addition to, collecting it from people.
Examples of Synthetic Data
This list is not exhaustive, but it is intended to be helpful.
- Synthetic Survey Responses. A data set where some or all of the data has been created. Variations include:
  - 100% of the data has been created.
  - Some of the cases have been created, e.g. to compensate for under-sampling some groups.
  - Some of the cells were created, e.g. to compensate for missing responses.
  - Some of the fields were created, e.g. to add information that had not been collected from the research participants.
- Synthetic Personas. AI entities that are created so that questions, often qualitative questions, can be asked of the personas.
Guidelines for anybody providing data or findings that include or are based on Synthetic Data
- Tell the buyer/user if the data is not 100% raw, unmodified responses from real humans.
- Explain the extent and nature of the created data.
- Outline the theoretical background to the approach you have used.
- Outline the limitations, biases, and risks in this approach.
- If you have used AI to create data, draw the buyer/user’s attention to the 20 AI questions developed by ESOMAR.
- Help the buyer/user assess the validity and reliability of the data and results.
- Keep updating your estimates of the accuracy and reliability of the approaches over time.
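As one illustration of the first two guidelines, the sketch below shows how a vendor might record, cell by cell, whether a value was collected or created, and then summarise the extent of the created data for the buyer. It is a minimal sketch, not a standard or an established method: the `Cell` class, the `COLLECTED` and `CREATED` labels, the `disclosure_summary` function, and the example field names are all hypothetical, and it only tracks created versus collected values, not collected values that were later modified.

```python
from dataclasses import dataclass

# Hypothetical provenance labels; not an industry standard.
COLLECTED = "collected"
CREATED = "created"


@dataclass(frozen=True)
class Cell:
    """One value in a data set, tagged with where it came from."""
    value: object
    source: str  # COLLECTED or CREATED


def disclosure_summary(rows):
    """Summarise how much of a data set was created rather than collected.

    `rows` is a list of cases, each a dict mapping field name -> Cell.
    """
    total_cells = sum(len(row) for row in rows)
    created_cells = sum(
        1 for row in rows for cell in row.values() if cell.source == CREATED
    )
    # A case counts as fully created when every one of its cells was created.
    fully_created_cases = sum(
        1 for row in rows
        if row and all(cell.source == CREATED for cell in row.values())
    )
    # A field counts as fully created when it was created in every case that has it.
    field_names = sorted({name for row in rows for name in row})
    fully_created_fields = [
        name for name in field_names
        if all(row[name].source == CREATED for row in rows if name in row)
    ]
    return {
        "cases": len(rows),
        "fully_created_cases": fully_created_cases,
        "cells": total_cells,
        "created_cells": created_cells,
        "fully_created_fields": fully_created_fields,
        "all_collected": created_cells == 0,
    }


# Illustrative data only: three cases with a mix of collected and created cells.
rows = [
    {"age": Cell(34, COLLECTED), "brand": Cell("A", COLLECTED), "segment": Cell("urban", CREATED)},
    {"age": Cell(51, COLLECTED), "brand": Cell("B", CREATED), "segment": Cell("rural", CREATED)},
    {"age": Cell(27, CREATED), "brand": Cell("A", CREATED), "segment": Cell("urban", CREATED)},
]
summary = disclosure_summary(rows)
print(summary)
```

In this illustrative data set, one case is entirely created, one cell fills in a missing response, and one field was added for every case, which mirrors the variations listed under Synthetic Survey Responses. A summary like this covers only the extent of the created data; the theoretical background, limitations, and accuracy estimates would still need to be explained in words.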
Are there better names than Synthetic Data?
I am sure there are hundreds of better names for synthetic data. For example, people have discussed the merits of terms like synthetic respondents and virtual participants. However, at the moment, the market has settled on the term Synthetic Data, so that is the term I use when describing this approach. If and when the market moves to another term, so will I.
Draft?
This document is the result of many discussions and lots of reading. However, I am sure that others can improve it. I would love to hear your suggestions.
2 thoughts on “Draft Synthetic Data Manifesto”
Excellent article. Wish I had the time to participate in any meetings.
TGI and other media surveys have been doing it for many years/decades? – using methodology which is not AI but not too simplistic either!
I like this perspective and I think the definition of synthetic data as “Data that has been created to replace data that could or would otherwise have been collected from humans” is clear and straightforward.
As I think about this, it seems to signal a shift from projecting population characteristics from respondents to projecting respondents from data.