
IBM’s New Synthetic Data Sets Address Major Obstacle in AI Fraud Detection

Elpida Tzortzatos, IBM fellow and CTO for AI on IBM Z, explains how the data sets help AI developers overcome the difficulty of obtaining real financial transaction data


Of the many uses for AI that have been touted in the world of IT, one that is receiving significant focus from IBM is real-time fraud detection. Stopping fraudulent transactions as they happen has been one of the key selling points in the introduction of IBM’s highly anticipated Telum II and Spyre chips, part of the upcoming zNext rollout. But for the hardware to do that work, it needs models that have been trained for the task.

This training requires large amounts of data on financial transactions, but due to its sensitive nature, this resource is often hard to come by. To address this problem, IBM built a virtual world filled with law-abiding consumers, criminal actors, companies and institutions for them to interact with. Released early this year, the resulting data sets are available to organizations and independent software vendors (ISVs) seeking to build their own fraud detection models.

The Scarcity of Transaction Data

Real data on financial transactions can be used to train AI models, but only after it has gone through layers of approval and steps to cleanse it of personally identifiable information and other elements that contribute to its sensitive nature. This process can take up to six months. “That meant that businesses couldn’t move fast,” says Elpida Tzortzatos, IBM fellow and CTO for AI on IBM Z.

Due to vulnerabilities like data extraction attacks, where malicious actors craft inputs designed to extract a given AI model’s training data, these precautions must be followed even when an organization is using its own data. And ISVs seeking to build models based on other organizations’ data might be out of luck completely. “Most banks are not going to give the real data to any ISVs,” Tzortzatos says.

Better Than the Real Thing?

According to IBM, synthetic data can not only be used to train fraud detection models; it can be used more effectively than real data. IBM Synthetic Data Sets were constructed based on statistics from the U.S. Census, U.S. Federal Reserve, U.S. Bureau of Labor Statistics and the FBI. Real data, conversely, is limited to the scope of the organization it came from and doesn’t necessarily adhere to the truths borne out in federal statistics.

Another advantage of synthetic data, Tzortzatos says, is that its simulated transactions are labeled as fraudulent or not, which is often not the case with real data. Based on the concept of “known ground truth,” these labels can reduce false positives in fraud detection and help developers validate their models, according to the IBM Redbooks publication detailing the data sets.
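To illustrate what those labels make possible, here is a minimal sketch in Python of validating a fraud model against known ground truth and measuring its false-positive rate. The file name and column names are illustrative, not the data sets’ actual schema:

```python
# Minimal sketch: validating a fraud model against labeled synthetic transactions.
# The file name and column names below are illustrative, not the data sets' actual schema.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

df = pd.read_csv("synthetic_card_transactions.csv")        # hypothetical file name
X = df[["amount", "hour_of_day", "merchant_risk_score"]]   # illustrative numeric features
y = df["is_fraud"]                                          # known ground-truth label

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# With ground truth available, false positives can be measured directly.
tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()
print(f"false-positive rate: {fp / (fp + tn):.4f}")
```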

What About Synthetic Data Generators?

Going through IBM isn’t the only way to get synthetic transaction data. It can also be produced by software known as synthetic data generators, but these tools often require real-world sample data as input, raising the same concerns over vulnerabilities like data extraction attacks.

Instead of following the approach used by synthetic data generators, IBM Synthetic Data Sets were built with agent-based modeling, in which the denizens of a virtual world interact based on real statistical data.

“In this agent-based model generation of our synthetic data sets, we simulate people that transact over a period of time, making purchases and transactions just like you and me,” Tzortzatos says.
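The general technique can be sketched in a few lines of Python. This is not IBM’s generator, only an illustration of agent-based modeling under invented assumptions: simulated consumers transact day by day, a small injected fraction of fraud stands in for the criminal actors, and every row carries a ground-truth label.

```python
# Illustrative agent-based generation of labeled synthetic transactions
# (a sketch of the general technique, not IBM's actual generator).
import csv
import random
from datetime import datetime, timedelta

random.seed(42)

class Consumer:
    def __init__(self, agent_id, mean_spend):
        self.agent_id = agent_id
        self.mean_spend = mean_spend   # in a real simulation, drawn from income-like statistics

    def transact(self, timestamp):
        amount = max(1.0, random.gauss(self.mean_spend, self.mean_spend / 3))
        return {"agent_id": self.agent_id, "timestamp": timestamp.isoformat(),
                "amount": round(amount, 2), "is_fraud": 0}

agents = [Consumer(i, random.uniform(20, 200)) for i in range(500)]  # mirrors the smallest set's 500 people
start = datetime(2024, 1, 1)

with open("synthetic_transactions.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["agent_id", "timestamp", "amount", "is_fraud"])
    writer.writeheader()
    for day in range(90):              # roughly three months of activity
        for agent in agents:
            row = agent.transact(start + timedelta(days=day, minutes=random.randint(0, 1439)))
            if random.random() < 0.001:  # rare fraudulent transaction injected by a "criminal actor"
                row["amount"] = round(row["amount"] * random.uniform(5, 20), 2)
                row["is_fraud"] = 1
            writer.writerow(row)
```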

3 Sizes, 3 Uses

Downloadable as comma-separated values (CSV) and data definition language (DDL) files, the new data sets come in three sizes. The smallest one consists of 500 simulated people transacting over a period of three months, another features 15,000 people transacting over 25 months, and the largest set is home to 150,000 people transacting over 37 months.

Beyond the various sizes, the data sets come in three varieties, for three different domains—payment cards (including credit card fraud detection), core banking (including anti-money-laundering) and homeowners insurance.

“We’ll continue to update with additional data sets as we work with clients and see patterns and use cases that are relevant,” Tzortzatos says.
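Whatever the size or domain, each download pairs DDL files describing the table schema with CSV files holding the simulated records. Below is a rough sketch of loading one; the file and table names are hypothetical, and the DDL is assumed to be SQLite-compatible for the sake of the example:

```python
# Minimal sketch of loading a downloaded data set; file and table names are hypothetical,
# and the DDL is assumed to be SQLite-compatible for this example.
import sqlite3
import pandas as pd

with open("transactions.ddl") as f:
    ddl = f.read()

conn = sqlite3.connect("fraud_training.db")
conn.executescript(ddl)                                   # create the table(s) defined in the DDL

df = pd.read_csv("transactions.csv")
df.to_sql("transactions", conn, if_exists="append", index=False)
print(conn.execute("SELECT COUNT(*) FROM transactions").fetchone()[0], "rows loaded")
```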

Advances in AI Fraud Detection

Recent improvements in AI have helped fraud detection keep up with the thousands of transactions that occur every second in large organizations. “Before AI, they used the static business rules to try to determine whether the credit card transaction was potentially fraudulent or not, and whether to approve or reject that transaction,” Tzortzatos says.
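The difference is easy to see in code. In this illustrative sketch, the thresholds, fields and model are invented, not any bank’s actual rules:

```python
# Illustrative contrast between a static business rule and a model-based decision.
# Thresholds, fields and the model object are invented for the example.
def static_rule(txn):
    # Fixed thresholds applied to every customer, every time.
    return "REJECT" if txn["amount"] > 5000 or txn["country"] != txn["home_country"] else "APPROVE"

def model_decision(txn_features, model, threshold=0.8):
    # A trained classifier scores many features at once and adapts to learned patterns.
    fraud_probability = model.predict_proba([txn_features])[0][1]
    return "REJECT" if fraud_probability > threshold else "APPROVE"
```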

But even with AI, detecting fraud as it happens has been a challenge due to the time it takes to send data across distances for AI tasks, not to mention the vulnerabilities associated with transferring sensitive data offsite. Knowing that the three global credit card companies and many large banks run all of their transactions through IBM systems, the company aimed to address those issues in 2022 with the release of the z16 mainframe and accompanying Telum on-chip AI processor.

“We really designed z16 and that accelerator based on those challenges,” Tzortzatos says. “And we wanted to enable our clients to examine with AI 100% of the transactions in real time, and specifically for high-volume transactional workloads that had very, very stringent response time requirements and throughput requirements.”

zNext and Fraud Detection

IBM’s promotion of the upcoming Telum II and Spyre chips has focused heavily on their fraud detection capabilities. Each piece of hardware focuses on a different type of AI. The Telum II processor is meant for predictive (traditional) AI, which uses machine learning and deep learning to identify patterns in data. The Spyre accelerator, on the other hand, is meant to augment generative AI (GenAI) tasks such as those performed by large language models. In addition to its role in the next mainframe, Spyre will also be featured in Power11.

Clients can combine predictive AI and GenAI “to build more robust, more efficient, more accurate models that produce better outcomes,” Tzortzatos says.

An example of this approach can be found in tasks related to homeowners insurance. Homeowners insurance data consists of structured and numerical data like home addresses, policy numbers, deductibles and home values—the kind of data that predictive AI is made for. It also includes unstructured data, such as descriptions of damage and its causes, which is where the natural-language capabilities of GenAI come into play.

“So with zNext Spyre and Telum II, our clients can bring together the strengths of both of those AI techniques to get better outcomes,” Tzortzatos explains.
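A rough sketch of how that combination might look follows, using illustrative column names and an open-source embedding model as a stand-in for whichever GenAI model a client would actually deploy:

```python
# Minimal sketch of combining predictive AI on structured fields with a language-model
# embedding of unstructured claim text. Column names are illustrative, and
# sentence-transformers stands in for the GenAI model a client would actually use.
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.ensemble import GradientBoostingClassifier

df = pd.read_csv("synthetic_home_insurance_claims.csv")            # hypothetical file name

structured = df[["deductible", "home_value", "claim_amount"]].to_numpy()  # illustrative numeric fields
text_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = text_model.encode(df["damage_description"].tolist())   # unstructured text -> vectors

X = np.hstack([structured, embeddings])
y = df["is_fraud"]                                                   # ground-truth label in the synthetic data
clf = GradientBoostingClassifier().fit(X, y)
print("training accuracy:", clf.score(X, y))
```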

IBM is expected to announce further details about zNext during a virtual event scheduled for April 8.  
