Jonas Forsman

Director, Consulting Expert

The concept of “garbage in, garbage out” (GIGO) has never been more relevant in our increasingly AI-driven world. GIGO applies when poor-quality input results in poor-quality output, and some of the early experiences with GenAI are a perfect example of this. Regardless of the type of AI solution used, it will always require complete, accurate, and timely data to deliver trusted outcomes.

As organizations across industries increasingly invest in AI, using the right data at the right time and in the right (and responsible) way is becoming more challenging. In this blog, I share a few of these challenges and propose a way forward—the use of synthetic (or artificially produced) data for AI.

Challenges with AI training data

AI and machine learning engines require vast amounts of data to be trained so they can perform their intended tasks. While data volume typically is not an issue, data usage is another story. Three problems associated with using organic data for AI are worth discussing.

Privacy. First, data usage is subject to significant regulation focused on protecting individual privacy. A clear example is the European Union’s General Data Protection Regulation (GDPR). GDPR aims to ensure that personal information is handled in a responsible and secure manner, while also giving individuals more control over their data. It limits how data can be collected and used, as well as how long it can be stored. Because of regulations like GDPR, customer and employee data cannot be freely used to train AI engines. To legally use individual-based data, extensive anonymization is often required, which is both complicated and expensive. Even then, anonymization does not guarantee security.

Copyright. Second, some data is copyrighted. There is much debate about the use of copyrighted data for AI. Given regulatory discussions happening across various government entities, we anticipate new guidelines to be released soon that will require organizations to clearly indicate which public data has been used to train a specific AI function or module.

Quality. Third, data can include a range of errors and biases, which may or may not be easy to correct, even if the data is organically produced. Further, identifying high-quality data among the volumes of available data can be burdensome and costly.

How synthetic data can help

An alternative to organically produced data is synthetic data. Unlike data collected from real events, synthetic data is artificially generated. However, well-generated synthetic data preserves the statistical properties of the organic data it is modeled on and therefore supports the same statistical conclusions. This makes it very useful for AI solutions.

Synthetic data can be generated programmatically using a variety of techniques. With machine learning, for example, it’s possible to produce synthetic data that mirrors the statistical properties of real-world data. Alternatively, data collected from real-life people, events, or objects can be converted to synthetic data using computer simulations or algorithms: data scientists take the real-world data, extract the desired information, and convert it into synthetic datasets.
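As a minimal sketch of the idea above, the following Python example fits a simple statistical model to a real-world column and then samples new, artificial values from that model. It assumes a single numeric column that is roughly normally distributed. The `real_ages` data, the `fit_and_sample` helper, and the choice of a normal distribution are illustrative assumptions, not a method described in this blog; production generators typically rely on much richer models.

```python
import random
from statistics import NormalDist, mean, stdev

def fit_and_sample(real_values, n, seed=None):
    """Fit a normal distribution to real_values and draw n synthetic values."""
    model = NormalDist.from_samples(real_values)
    return model.samples(n, seed=seed)

# Hypothetical "organic" data: 1,000 customer ages (simulated here for the demo).
rng = random.Random(42)
real_ages = [rng.gauss(41.0, 12.0) for _ in range(1000)]

# Synthetic values are drawn from the fitted model, not copied from real records.
synthetic_ages = fit_and_sample(real_ages, n=1000, seed=7)

print(f"real:      mean={mean(real_ages):.1f}, stdev={stdev(real_ages):.1f}")
print(f"synthetic: mean={mean(synthetic_ages):.1f}, stdev={stdev(synthetic_ages):.1f}")
```

The two printed lines should show similar means and standard deviations, which is the sense in which the synthetic column "mirrors" the real one in this simplified setting.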

What are the benefits of using synthetic data for AI? Here are just a few:

  1. Multiple purposes and types: Synthetic data can be generated for a variety of purposes and in several different formats—from simple table data to more advanced data types like imagery, text, and speech.
  2. Ease of training: Synthetic data enables organizations to avoid many of the above-mentioned challenges associated with training data. It can be generated in the desired volume and completely anonymized to ensure regulatory compliance, while still providing the same statistical conclusions as organic data.
  3. Quality control: With synthetic data, you can control the level of quality. In some cases, such as test data for systems development, the quality can be low. Other uses, however, may require high-quality data to achieve desired outcomes. Keep in mind that assessing the quality of synthetic data is a new area of exploration, with definitions and measurements only beginning to emerge.
  4. Risk reduction: With synthetic data, organizations and researchers can perform extensive analyses and develop AI models without the risks and limitations associated with using real data that is confidential and/or sensitive. Through such risk reduction, synthetic data can advance an organization’s responsible use of AI. (To learn more, check out these blogs from my colleague Dr. Diane Gutiw: Embracing responsible AI in the move from automation to creation and Guardrails for data protection in the age of GenAI.)
  5. Cost savings: Synthetic data offers potential cost savings because it can be less expensive to generate than collecting real data.

Synthetic data use cases and challenges

Because there are no limitations on the type or size of synthetic data that can be generated—either from real-world data, including images, or from scratch—potential use cases abound. Synthetic data can be generated, for example, in healthcare to support research and development without compromising real-life patient data. It can be used in industries like retail and transportation to statistically mirror customer behavior and drive product and service innovation.

The use of synthetic data, however, comes with challenges. It may not be as precise as real-world data or perfectly reflect real-world scenarios. For example, outliers and low probability events, common in real-world datasets, are difficult to reproduce in synthetic data.
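To illustrate the outlier problem, here is a small sketch that assumes a deliberately naive generator: a single normal distribution fitted to all of the data. The transaction amounts, the `THRESHOLD` cutoff, and the `count_extremes` helper are hypothetical and chosen only for demonstration.

```python
import random
from statistics import NormalDist

THRESHOLD = 400.0  # hypothetical cutoff for an "extreme" transaction amount

def count_extremes(values, threshold=THRESHOLD):
    """Count how many values exceed the threshold."""
    return sum(1 for v in values if v > threshold)

# Hypothetical organic data: 990 ordinary amounts plus 10 rare, very large ones.
rng = random.Random(0)
real_amounts = [rng.gauss(50.0, 10.0) for _ in range(990)]
real_amounts += [rng.uniform(500.0, 1000.0) for _ in range(10)]

# Naive generator (an assumption for illustration): one normal distribution
# fitted to all of the data, including the outliers.
model = NormalDist.from_samples(real_amounts)
synthetic_amounts = model.samples(len(real_amounts), seed=1)

real_extremes = count_extremes(real_amounts)
synthetic_extremes = count_extremes(synthetic_amounts)
print(f"extreme values in real data:      {real_extremes}")
print(f"extreme values in synthetic data: {synthetic_extremes}")
```

The fitted model matches the overall mean and spread, yet it will usually generate far fewer values beyond the cutoff than the real data contains, because the rare large amounts are absorbed into a wider bell curve rather than reproduced as a separate cluster.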

Synthetic data also can pose a security risk when used to support AI models. Malicious use of synthetic datasets, for example, can lead to AI models that are more vulnerable to security attacks.

Moving forward with synthetic data

Synthetic data is an exciting area of AI that resolves some of the biggest challenges in data management, such as privacy, data availability, and quality. Synthetic data can open new opportunities for exploring AI innovations, while maintaining a high level of data protection and regulatory compliance.

For organizations evaluating the use of synthetic data, we recommend the following:

  • Clearly define your objectives (Understand what you want to do with synthetic data and the business rationale for using it.)
  • Determine the level of synthetic data quality you need (Is synthetic data that matches real-world data by 80% sufficient? Is a higher level of mirroring required?)
  • Assess the cost of producing the synthetic data you need (Will there be a clear ROI?)
  • Consider security and regulatory issues, such as GDPR requirements
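The recommendation on quality level raises the question of how a figure such as 80% might be measured. This blog does not define a metric, and assessing synthetic data quality is still an emerging area, so the following is one possible interpretation only: fit a normal distribution to each dataset and compute how much the two distributions overlap (1.0 means identical). The `overlap_score` helper, the `REQUIRED` target, and the sample data are hypothetical.

```python
import random
from statistics import NormalDist

def overlap_score(real_values, synthetic_values):
    """Return the overlap (0.0 to 1.0) of normal models fitted to each dataset."""
    real_model = NormalDist.from_samples(real_values)
    synthetic_model = NormalDist.from_samples(synthetic_values)
    return real_model.overlap(synthetic_model)

# Hypothetical data: a real column, a close synthetic copy, and a shifted one.
rng = random.Random(3)
real = [rng.gauss(100.0, 15.0) for _ in range(500)]
close_copy = NormalDist.from_samples(real).samples(500, seed=1)
shifted_copy = [rng.gauss(130.0, 15.0) for _ in range(500)]

REQUIRED = 0.80  # assumed quality target, echoing the 80% example above
for name, candidate in [("close", close_copy), ("shifted", shifted_copy)]:
    score = overlap_score(real, candidate)
    verdict = "meets" if score >= REQUIRED else "misses"
    print(f"{name}: overlap={score:.2f} ({verdict} the {REQUIRED:.0%} target)")
```

A single-column overlap like this ignores relationships between columns and non-normal shapes, so in practice it would be one check among several rather than a complete quality measure.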

The quality of synthetic data is an area in which I'm particularly interested. CGI recently partnered with Karlstad University* to find better methods for assessing synthetic data quality and to co-publish a research paper. Feel free to contact me to discuss synthetic data, data usage, or AI in general. You also can explore CGI’s AI capabilities and experience.

 

*Announcement in Swedish

About this author

Jonas Forsman

Director, Consulting Expert

Jonas Forsman has more than 20 years of experience in designing, developing, testing, and implementing advanced technology solutions across industries using artificial intelligence, big data, data analytics, and business intelligence. He also has significant experience in research and innovation project management both within and outside ...