Synthetic data refers to artificially generated data that mimics the characteristics of real-world data without containing any actual information from real individuals or events.
This artificial data is created using algorithms or models to replicate the statistical properties, patterns, and distributions found in authentic datasets. Synthetic data and original data should deliver similar results from statistical analyses.
Synthetic data can be used in place of real-world data either when the real data is not readily available or when an experiment needs to simulate the real world under controlled conditions. In measuring marketing and media effectiveness, this close likeness to real-world data makes synthetic data a valuable tool for a variety of purposes.
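To make the idea of "similar results from statistical analyses" concrete, here is a minimal sketch in Python. It assumes the real data is well described by a normal distribution, which is a simplification chosen for illustration; the weekly sales figures and the `synthesise` helper are hypothetical, and real synthetic data generators are considerably more sophisticated.

```python
import random
import statistics


def synthesise(real_values, n, seed=0):
    """Draw n synthetic values from a normal distribution fitted to real_values.

    Assumption: a normal distribution is an adequate description of the data.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    mu = statistics.mean(real_values)
    sigma = statistics.stdev(real_values)
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Hypothetical weekly sales figures standing in for a real dataset.
real_weekly_sales = [102.0, 98.5, 110.2, 95.1, 105.7, 99.9, 108.3, 101.4]

# None of the synthetic values are real observations; they only share
# the fitted mean and spread of the original sample.
synthetic_weekly_sales = synthesise(real_weekly_sales, n=5000)

print("real mean / stdev:     ",
      round(statistics.mean(real_weekly_sales), 2),
      round(statistics.stdev(real_weekly_sales), 2))
print("synthetic mean / stdev:",
      round(statistics.mean(synthetic_weekly_sales), 2),
      round(statistics.stdev(synthetic_weekly_sales), 2))
```

The two printed lines should be close to each other, which is the property that lets an analyst work on the synthetic series in place of the original one.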
What is synthetic data in marketing?
Techniques used in media effectiveness measurement often suffer from a flaw that medium-sized enterprises in particular may face: a lack of sufficient, valuable data. This is where synthetic data steps in. Synthetic data has its uses in marketing, as it does in many other fields where statistical analysis is important. There are many areas in which synthetic data can support better marketing decisions, but it has come into its own in two branches of marketing effectiveness measurement: media mix modelling and incrementality testing.
Synthetic data is so useful in this arena because of the constraints placed on marketing datasets to protect individual consumers’ privacy. These constraints mean media channel and platform testing must be done without browser cookies and device identifiers. Using synthetic data also means the difficulties of older tactics like ‘market matching’ need no longer be an issue when setting up the control for A/B tests of media exposures.
Synthetic data and incrementality testing
In incrementality testing, synthetic data can be created to act as the control for a media experiment. A typical media experiment might test the incrementality of a new digital advertising campaign or channel. The usual methodology is to A/B test the exposures across two different but similar geographies. This is called “market matching”.
With the luxury of synthetic data, particularly data built from a long-term model with a great ‘fit’ to the actual data, the experiment can be run within a single geography, comparing the actual observations against the synthetic control data. Whilst the campaign is in-flight, the model, which would also be predicting values for other geographies, can be checked and tuned continually for ‘fit’, adding statistical rigour to the process.
The effect is that fewer geographies face disruption during the test, and the challenges of market matching are removed. Market matching is always a challenge: participants can leak between markets, no two markets are created equal, and markets that start out similar may not remain so for the duration of the test. With synthetic data, externalities are accounted for in the model in most cases. There is also always post-campaign analysis to check how much confidence can be placed in the results, and this must be taken into account when summarising the insights.
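The same-geography comparison described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a production method: the model is a single-variable least-squares line, the "baseline demand index" driver is a hypothetical stand-in for whatever long-term model inputs are available, and all figures are invented so that a lift of 5 units per week is baked into the test period.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def estimate_lift(pre_driver, pre_sales, test_driver, test_sales):
    """Fit on the pre-campaign period, then compare actuals with the
    synthetic control the fitted model predicts for the test period."""
    intercept, slope = fit_line(pre_driver, pre_sales)
    synthetic_control = [intercept + slope * d for d in test_driver]
    lift = sum(actual - synth
               for actual, synth in zip(test_sales, synthetic_control))
    return synthetic_control, lift


# Hypothetical pre-campaign weeks: sales follow 50 + 2 * driver exactly.
pre_driver = [10, 12, 11, 14, 13, 15]
pre_sales = [50 + 2 * d for d in pre_driver]

# Hypothetical campaign weeks: same relationship plus 5 extra units per week.
test_driver = [16, 14, 17]
test_sales = [50 + 2 * d + 5 for d in test_driver]

synthetic_control, lift = estimate_lift(
    pre_driver, pre_sales, test_driver, test_sales
)
print("synthetic control:", synthetic_control)
print("estimated total lift:", lift)
```

Because the invented data has no noise, the estimated total lift recovers the 15 units (3 weeks at 5 units) that were built in. With real data the pre-period fit would be imperfect, which is why the in-flight checks and post-campaign confidence analysis matter.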
With or without synthetic data, incrementality testing helps to establish causal relationships, whereas a media mix model alone will allocate contributions based on the correlations it can detect. This is why the testing matters: its results can be used to fine-tune the parameters of a media mix model so that the model captures causal relationships more reliably.
Overall, the use of synthetic data in incrementality testing enhances flexibility, scalability, and reliability, making it a valuable tool for optimising marketing strategies and decision-making processes.
Synthetic data and media mix modelling (MMM)
Like incrementality testing, media mix modelling also benefits from synthetic data, for example in data augmentation, extrapolation and forecasting. Media mix models consume large amounts of data and generate the best results when provided with data sets covering many years. Often, not all of the data is available or complete for these periods. In some cases it is acceptable to fill gaps or extrapolate historical values in order to complete the data set, so that channel contributions are closer to accurate than they would be had the data been left missing. A word of warning, though: only experienced consultants should carry out this work.
Media mix modelling also produces synthetic data itself, in that the contribution outputs it generates are designed to replicate the real-world contributions of media channel investments. This is particularly true when media mix models are used to predict the results of new budgets and media investment mixes in the future. Synthetic contribution data can also be generated with known ground truth values, facilitating the evaluation of model performance and the comparison of different testing strategies.
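As a rough illustration of completing a data set, the sketch below fills gaps in a weekly spend series by linear interpolation. Linear interpolation is only one assumed choice among many (a consultant might prefer a seasonal or model-based fill), and the `fill_gaps` helper and the spend figures are hypothetical.

```python
def fill_gaps(series):
    """Linearly interpolate None values that sit between two known observations.

    Gaps at the very start or end of the series are left as None, because
    filling them would be extrapolation rather than interpolation.
    """
    filled = list(series)
    known = [i for i, value in enumerate(series) if value is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled


# Hypothetical weekly channel spend with missing weeks marked as None.
weekly_spend = [100.0, None, None, 130.0, 120.0, None, 140.0]
completed = fill_gaps(weekly_spend)
print(completed)
```

Any filled values should be flagged as synthetic in the data set so that later analysis can test how sensitive the channel contributions are to them.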
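The ground-truth idea can be sketched as follows: generate sales from channel contributions that are known in advance, fit a model, and check whether the model recovers them. This is a deliberately simplified sketch. It assumes a purely linear relationship between spend and sales, whereas real media mix models typically also account for effects such as carry-over and diminishing returns; the channel names, coefficients and helper functions are all hypothetical.

```python
import random


def solve(a, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ols(features, target):
    """Least squares with an intercept, via the normal equations."""
    rows = [[1.0] + list(f) for f in features]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, target)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical ground truth: base sales plus sales per unit of spend.
TRUE_BASE, TRUE_TV, TRUE_SEARCH = 200.0, 1.5, 3.0

rng = random.Random(42)  # fixed seed so the sketch is reproducible
spend = [(rng.uniform(0, 100), rng.uniform(0, 50)) for _ in range(260)]
sales = [TRUE_BASE + TRUE_TV * tv + TRUE_SEARCH * search + rng.gauss(0, 5)
         for tv, search in spend]

base_hat, tv_hat, search_hat = fit_ols(spend, sales)
print("TV:     true", TRUE_TV, "estimated", round(tv_hat, 3))
print("Search: true", TRUE_SEARCH, "estimated", round(search_hat, 3))
```

Because the true contributions are known, any gap between the true and estimated values is attributable to the model and the noise, not to uncertainty about what really happened. The same synthetic data set can then be reused to compare alternative model specifications on equal terms.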
Synthetic data also offers flexibility in exploring hypothetical scenarios and testing different model specifications without relying solely on historical observations. Moreover, synthetic data allows for the creation of diverse datasets that capture a wide range of market conditions and consumer behaviours, enhancing the model’s ability to generalise and adapt to changing environments. Overall, leveraging synthetic data in MMM empowers analysts to overcome data scarcity issues, improve model accuracy, and make more informed decisions in marketing resource allocation.
Benefits and limitations of using synthetic data in digital marketing
- Data Augmentation: Synthetic data allows for the creation of larger and more diverse datasets.
- Privacy Preservation: Synthetic data can help protect sensitive customer information while still enabling analysis and model development.
- Data Diversity: Synthetic data can capture a wide range of scenarios, leading to more robust models.
- Accuracy Concerns: Synthetic data may not fully reflect the complexity of real-world data, leading to inaccuracies in models and analyses.
- Bias Introduction: Synthetic data may inadvertently introduce biases, leading to skewed model outcomes if not properly accounted for.
- Computational Expense: Generating high-quality synthetic data can be resource-intensive and require specialised expertise.
Conclusion
Synthetic data is already making waves in digital marketing, as it has in the wider field of statistics for quite some time. A follow-up piece of reading I’d recommend is by Mark Ritson, on how synthetic data is shockingly accurate in predicting the perceptual mapping of brand attributes. It’s not a difficult stretch to see how in the future this type of reinforced artificial learning could perhaps be applied to Segmentation, Targeting and Positioning, to the balancing of a marketing channel mix, or even to effective share of voice (“ESOV”) figures.
Until then I hope this introduction to the meaning of ‘synthetic data’ for marketers has been helpful.