This summer saw the Financial Conduct Authority (FCA) wind down a consultation into the use of synthetic data in financial services. The initiative, launched in March for the attention of both incumbents and start-ups, sought industry views on the potential for synthetic data to support financial innovation, the requirements for it to be an effective tool, and its potential limitations and risks.
Synthetic data is artificial data generated using algorithms. An infamous type of synthetic data is the so-called ‘deep fake’: uncanny-looking, computer-generated media. You may have seen such videos all over YouTube putting fake words into celebrity mouths, or being used to de-age celebs to highlight poor VFX work from Hollywood studios. Other, more unsavory uses proliferate across the net.
The tech is created by observing patterns and the statistical properties of real data, with algorithms used to simulate these patterns within the synthetic dataset, aiming to make it a realistic replica of the real-world data. Its advantage over real-world data? Simple: synthetic data simulates information without identifying specific persons. Therefore, as long as no real individuals can be identified from the synthetic data, data protection obligations like GDPR do not apply.
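To make the idea concrete, here is a minimal sketch in Python of the "observe the statistics, then simulate them" loop described above. It rests on loud assumptions: the "real" records are made-up toy data, there are only two columns, and the generator is a plain bivariate normal fitted to those columns. Production generators are far more sophisticated, and nothing in this sketch demonstrates on its own that individuals cannot be re-identified. The function names `fit_two_columns` and `sample_synthetic` are hypothetical.

```python
import math
import random
import statistics


def fit_two_columns(rows):
    """Learn the statistical properties of a two-column dataset:
    each column's mean and standard deviation, plus their correlation."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    std_x, std_y = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (len(rows) - 1)
    return mean_x, mean_y, std_x, std_y, cov / (std_x * std_y)


def sample_synthetic(params, n, rng):
    """Draw n new rows from a bivariate normal with the fitted parameters."""
    mean_x, mean_y, std_x, std_y, rho = params
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mean_x + std_x * z1
        # Mixing z1 into y reproduces the fitted correlation rho.
        y = mean_y + std_y * (rho * z1 + math.sqrt(1 - rho * rho) * z2)
        rows.append((x, y))
    return rows


rng = random.Random(42)

# Made-up stand-in for "real" records: (annual income, annual card spend).
real_rows = []
for _ in range(2000):
    income = rng.gauss(40_000, 8_000)
    real_rows.append((income, 0.3 * income + rng.gauss(0, 1_500)))

params = fit_two_columns(real_rows)
synthetic_rows = sample_synthetic(params, 5000, rng)
synthetic_params = fit_two_columns(synthetic_rows)

print("real correlation:     ", round(params[4], 3))
print("synthetic correlation:", round(synthetic_params[4], 3))
```

The synthetic rows share the fitted means, spreads and correlation of the toy "real" data, yet none of them is a copy of an original record. That is the property the paragraph above relies on, under the simplifying assumption that a normal distribution describes the data well.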
“As organizations propel themselves towards data-driven business strategies, the opportunities to employ data analytics to make more insightful and impactful decisions based upon business and customer data continue to increase,” Alvin Tan, principal consultant for tech consultancy Capco, tells ERP Today.
“However, as adoption of using this data gathers momentum within an organization, so does an increase in risk associated with the data privacy controls that are required to be in operation and the potential to negatively impact an individual. (After all) within financial services, much of the data collected relating to data subjects such as customers would be considered highly sensitive.
“This is where synthetic data can present a potential opportunity for organizations. Synthetic data is a ‘privacy preserving’ technique that involves fabricating data in such a way as to replicate specific statistical relationships within ‘real’ data set(s). It is used in place of those real data sets to support statistical insights that can be drawn from the synthesized data, thereby protecting the privacy rights of individuals that might otherwise be identified in a real data set.
“These statistical inferences can then be used to drive analytical business insights and opportunities in a manner that does not breach privacy legislation,” Tan continues. “Unlike anonymization or pseudonymization data analysis techniques for which there is some risk that the data can be backwards-engineered such that a real person could be identified, by its very nature synthetic data does not carry this risk and so has the potential to unlock many additional avenues of data science and analytics.”
In financial services, synthetic data is being used as test data for new products and tools, for model validation, and in AI model training. As the FCA points out, many problems of modern artificial intelligence come down to insufficient data: the available datasets are too small, insufficiently labeled, or cannot be accessed without breaching individuals’ privacy rights.
Furthermore, as the FCA writes in its call to consultation, historical data can often be biased and unrepresentative, and algorithms trained with this data will reproduce these biases. Synthetic data can in principle offer solutions to these issues.
Aside from protecting data privacy, the tech can fill in gaps where the data required is rare, does not exist in sufficient quantities for training purposes, or does not exist at all and must be simulated for as yet unencountered conditions. Synthetic data can be used to model realistic but potentially unlikely or uncommon scenarios, for example for risk management in financial services.
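One simple way to pad out scarce examples is to interpolate between the few rare records that do exist. The Python sketch below is a stripped-down relative of the SMOTE oversampling technique: real SMOTE picks each partner from a record's nearest neighbours, whereas this version pairs rare records at random. The flagged-transaction values are made up for illustration, and `interpolate_rare` is a hypothetical name.

```python
import random


def interpolate_rare(rare_rows, n_new, rng):
    """Create n_new synthetic rows, each lying on the line segment between
    two randomly chosen rare rows (a simplification of SMOTE, which would
    choose the partner among the k nearest neighbours instead)."""
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(rare_rows, 2)
        t = rng.random()  # position along the segment from a to b
        synthetic.append(tuple(ai + t * (bi - ai) for ai, bi in zip(a, b)))
    return synthetic


rng = random.Random(7)

# Made-up toy data: (transaction amount, hour of day) for five flagged cases.
rare_cases = [
    (9200.0, 2.0),
    (8700.0, 3.5),
    (9900.0, 1.0),
    (9100.0, 4.0),
    (8800.0, 2.5),
]

augmented = interpolate_rare(rare_cases, 200, rng)
print(len(augmented), "synthetic rare cases generated")
```

Every generated row stays within the range spanned by the original rare cases, so this approach thickens a sparse region of the data rather than inventing conditions never seen before. Simulating genuinely unencountered scenarios would need an explicit model of those conditions.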
To boot, training accurate ML algorithms requires large volumes of data, and it can sometimes be more efficient to generate high volumes of synthetic data than to capture or label real data.
Tan also sees synthetic data as a solution to the friction between emerging technologies and the restrictions surrounding the actual production data they are meant to leverage.
“Many financial services organizations (FSOs) run costly control processes on top of analytics warehouses to mitigate the risk of privacy and protection breaches. When done properly, using synthetic data instead of real world data for analytics, reduces the inherent risk of a breach to virtually zero. The use of synthetic data is therefore a significant mitigating factor in managing privacy risk, shortcutting otherwise lengthy and highly manual governance processes.
“Unencumbered by operational control overheads, the marginal cost of analytics is significantly reduced, allowing organizations to scale their analytics ambitions, and accelerating innovation and experimentation.”
Synthetic data could help democratize data access across the financial industry by opening up data assets to incumbents and disruptors alike. As the FCA reports, accessing data at an individual level is possible through mechanisms such as consent, for example through Open Banking infrastructure. But truly developing new technologies requires widespread access to large data sets.
Currently, third parties that could offer innovative services to financial giants face a barrier: obstructed access to the large volumes of high-quality data required to develop and implement these types of strategies.
In the case of a RegTech, for example, it is very difficult to build a new machine learning-based solution without first going through complex due diligence and costly onboarding processes with an institution to access its data.
Synthetic data could potentially be made available to third parties, such as RegTechs and B2B FinTechs, so they can construct better models, develop new techniques, or use computational resources that might be unavailable to incumbent holders of sensitive data.
These third parties could pool synthetic data from multiple sources, revealing trends, patterns and insights that are more accurate, or indeed only apparent, in the pooled data.
Synthetic data and data generation techniques could be either shared or made publicly available, which could potentially provide an important step towards reproducibility of results. As AI becomes more common, this may become an important compliance mechanism.
Benefits here could include detecting and preventing financial crime by facilitating cooperation between multiple organizations that are currently prevented from cooperating efficiently at a granular data level. This, though, carries the risk of de-anonymization.
Other risks may also limit the appeal of synthetic data, as Tan explains. For example, synthetic data sets can still carry real-world biases unless these are accounted for during the generation process.
“(Also) when synthetic data is used in conjunction with real data, there is a risk that synthetic-to-real associations are made that are at best false, and at worst libelous or construed to mislead the market.”
“(But current) AI governance has played a role in promoting AI through implementing the required levels of regulation to not just minimize legal and ethical risk to organizations but to also build trust in the AI algorithms that they are employing.
“Similarly for synthetic data, a barrier to wider adoption of synthetic data has been trust – in questioning that the data is an accurate representation from which insights can be derived. This therefore represents an opportunity for the regulator to drive and promote adoption of synthetic data through a framework and set of standards.
“In a similar vein to the ethical risks posed by AI usage, the FCA has expressed an interest in potentially taking the responsibility of a synthetic data regulator to address these risks to the consumer and to the market.”
Tan’s colleague at Wipro-owned Capco, managing principal Stephen Brown, was quoted in June suggesting the approach of an FCA-approved standard. This benchmark would allow an organization to take its own data and create its own synthetic datasets for use in its own projects.
“This achieves the goal of driving greater adoption of the use of synthetic data at scale within an organization,” said Brown. “For the business, there is trust in that the synthetic data is representative; and from a compliance perspective, there is mitigation of risk in that the synthetic data meets a certain set of regulator-defined standards.
“Cross-collaboration with other regulators will also be fundamental to establishing standards for generating synthetic data from an organization’s own data. Without it, widespread adoption would likely fail as the investment to create locale-specific synthetic datasets would represent a high bar of investment.”