Is Your AI Getting Poisoned? How Data Contamination Could Erode the Trustworthiness of AI

Martin Dale Bolima August 15, 2024

Artificial Intelligence (AI) is not born; it’s cultivated. Nurtured by vast datasets, it learns, grows, and evolves. Much like a living organism, it relies on a healthy diet to thrive. But what if that data is corrupted? What if the very foundation of an AI system is compromised?

This is the insidious nature of data poisoning, a threat that looms larger as AI integration deepens across industries.

“Data poisoning is a type of cyber attack aimed at compromising machine learning models by introducing false or misleading data into their training datasets,” explained Phillip Ivancic, Head of Solutions Strategy at Synopsys Software Integrity Group. “The goal of these attacks is to corrupt the model’s outputs, leading to inaccurate predictions or classifications. As machine learning models are heavily dependent on the quality and integrity of the data they are trained on, any compromise in the data can significantly undermine the model’s reliability and performance.”

The Implications of Data Poisoning

Models trained on poisoned data are much like their human counterparts: spotty and unreliable. They produce incorrect or nonsensical results, leading to flawed decision-making processes within an organisation. Worse, this contamination can spread throughout the organisation’s systems, resulting in widespread corruption of critical data that, according to Ivancic, can cause significant operational inefficiencies and financial losses.

Business disruption, Ivancic emphasised, is one of the primary consequences of data poisoning, along with the prospect of facing legal ramifications and losing customer trust.

“These disruptions can translate into significant financial losses, particularly if critical business processes are impacted. Beyond the immediate operational and financial effects, there are also legal and regulatory consequences to consider. Compromised data can lead to legal challenges and regulatory penalties, especially if sensitive or personal data is involved,” noted Ivancic. “Additionally, data poisoning can damage an organisation’s reputation, eroding customer and stakeholder trust in the company’s data and AI systems. This loss of confidence can have long-term repercussions for the organisation’s market position and customer relationships.”

And, just as people become sickly and more prone to illness when poisoned, tainted models may also become more vulnerable to further attacks, compounding the security risks and worsening the initial compromise. The compromised integrity of the data, Ivancic pointed out, affects the immediate outputs of the model involved and “can also erode trust in the AI systems over time.” This can make it very challenging for organisations to rely on AI for critical business decisions.

The Slippery Slope That Is Data Poisoning

But, again, data poisoning is hard to diagnose, at least until something goes wrong with the AI system involved. Even when an AI system spews out questionable results, there is no guarantee that data poisoning is the root cause. It could very well be AI hallucination, a common challenge developers and researchers have yet to address entirely, or something else altogether.

“While specific instances of data poisoning in enterprises are not widely publicised, there have been reports of organisations experiencing unexpected and incorrect results from their AI projects. In some cases, these anomalies have led to the suspension of AI initiatives,” Ivancic told Data & Storage Asia (DSA). “Although it is not always clear if data poisoning was the cause, these incidents highlight the potential risks and disruptions that data poisoning can bring to enterprise operations. The lack of concrete examples underscores the difficulty in detecting and attributing data poisoning attacks, making it a significant concern for organisations relying on AI.”

While data poisoning may be difficult to detect, that doesn’t mean it is some urban legend.

Just this June, the Synopsys Cybersecurity Research Center (CyRC) exposed a data poisoning vulnerability in EmbedAI, an app that allows users to interact with documents by utilising the capabilities of large language models (LLMs). According to CyRC, EmbedAI is susceptible to security issues that enable data poisoning attacks delivered through a cross-site request forgery (CSRF) vulnerability, which stems from the absence of a secure session management implementation and from weak cross-origin resource sharing (CORS) policies.

CyRC further stated that “the exploitation of this vulnerability affects the immediate functioning of the model and can have long-lasting effects on its credibility and the security of the systems that rely on it” as it can, among other things, cause said systems to spread misinformation, introduce biases, suffer from degraded performance, and be at great risk of denial-of-service attacks.
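As a rough illustration of the two weaknesses named above (weak CORS policies and missing session-bound CSRF protection), the Python sketch below shows the kind of server-side check that can reject a cross-site request before it writes anything to a document store. It is not EmbedAI's code or CyRC's recommended fix. The function names, the `ALLOWED_ORIGINS` value, and the token scheme are all assumptions made for illustration.

```python
import hmac
import secrets

# Hypothetical allow-list: only the application's own front end may submit data.
ALLOWED_ORIGINS = {"https://app.example.com"}


def new_csrf_token():
    """Create a random token to store in the user's server-side session."""
    return secrets.token_urlsafe(32)


def is_upload_allowed(origin, session_token, submitted_token,
                      allowed_origins=ALLOWED_ORIGINS):
    """Return True only if the request passes both checks.

    origin          -- value of the request's Origin header (or None)
    session_token   -- CSRF token stored in the user's session
    submitted_token -- CSRF token sent along with the request
    """
    # Check 1: strict origin allow-list instead of a wildcard CORS policy.
    if origin not in allowed_origins:
        return False
    # Check 2: the request must carry the token bound to this session.
    if not session_token or not submitted_token:
        return False
    # Constant-time comparison avoids leaking the token through timing.
    return hmac.compare_digest(session_token.encode(), submitted_token.encode())
```

A request forged from another site would typically fail the first check because of its origin, and the second because the attacker's page cannot read the victim's session token.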

Such vulnerabilities, Ivancic said, can be exploited in a number of ways, with common attack scenarios involving both internal and external threats.

“Internally, malicious insiders—employees with access to data—might manipulate it for personal gain or to harm the organisation. Externally, competitors or cybercriminals might introduce poisoned data through compromised data sources or supply chains,” he explained. “Cyber adversaries typically exploit vulnerabilities in data collection, storage, and access control mechanisms. They might inject false data, modify existing data, or manipulate data processing workflows to achieve their malicious objectives, significantly degrading the performance and reliability of the affected machine learning models.”

That’s a lot of ways to poison the machines taking over the world.

If it is any consolation, there are different ways to “cure” data poisoning or prevent it entirely.

No Magic Pill, Just a Combination of “Antidotes”

Addressing data poisoning starts with implementing robust anomaly detection systems to identify unusual patterns or anomalies in data and model outputs, followed by regular audits of data sources and training datasets that can help identify potential tampering early on. Monitoring and controlling access to critical data and model training environments are also crucial steps in detecting potential data poisoning attempts.
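To make the idea of anomaly detection a little more concrete, here is a minimal Python sketch that flags numeric values sitting far from the rest of a batch, using the median and the median absolute deviation (the "modified z-score"). The article does not say which detection technique Ivancic has in mind, so the method, the function name `flag_outliers`, and the sample readings are assumptions. The 3.5 cut-off and the 0.6745 constant are the values conventionally used with this score.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return the indices of values whose modified z-score exceeds threshold."""
    med = median(values)
    abs_dev = [abs(v - med) for v in values]
    mad = median(abs_dev)  # median absolute deviation
    if mad == 0:
        # At least half the values equal the median, so the score is
        # undefined; fall back to flagging anything that differs from it.
        return [i for i, v in enumerate(values) if v != med]
    return [i for i, d in enumerate(abs_dev) if 0.6745 * d / mad > threshold]


# Hypothetical batch of incoming training values with one injected record.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 55.0, 10.2]
print(flag_outliers(readings))  # [5]
```

A check like this only catches crude tampering. Poisoned records crafted to look statistically normal would pass it, which is one reason audits and access controls are needed alongside it.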

Curing data poisoning can be a bit trickier, but it can be done. Only, it would take some time—and a lot of effort. Given that the alternative—an AI that spreads misinformation and biases—could prove more problematic, all that time and effort would certainly be worth it.

“Remediation of data poisoning attacks is a complex process that requires stringent data validation processes to ensure the integrity of training data. Organisations must maintain robust backup systems and procedures to restore data in case of an attack,” advised Ivancic. “Conducting thorough threat modelling to identify and mitigate potential attack vectors is also essential. Given the complexity and scale of potential data poisoning attacks, remediation often involves detailed and time-consuming efforts to verify and correct compromised data.”
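One simple form of the data validation and backup verification described above is a checksum manifest: record a SHA-256 hash of every training file while the data is known to be good, then compare against it before each training run or after restoring from backup. The Python sketch below is one possible approach, not a method attributed to Ivancic or Synopsys; the function names and the manifest layout are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir):
    """Map each file's path (relative to data_dir) to its SHA-256 hash."""
    root = Path(data_dir)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_tampered(data_dir, manifest):
    """Compare the directory's current state against a trusted manifest."""
    current = build_manifest(data_dir)
    return {
        "changed": sorted(n for n in manifest
                          if n in current and current[n] != manifest[n]),
        "missing": sorted(n for n in manifest if n not in current),
        "added": sorted(n for n in current if n not in manifest),
    }
```

The manifest itself would need to be stored somewhere an attacker with access to the training data cannot modify. A manifest also only shows that files have changed since it was built, so data that was already poisoned at that point goes undetected.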

Anomaly detection. Data validation. Backups. Threat modelling. We did say data poisoning is a formidable adversary, right?

That is why technology plays a vital role in the fight against data poisoning.

Incidentally, Synopsys provides more than just leading research on data poisoning. According to Ivancic, the company also offers comprehensive solutions to help organisations secure their AI initiatives against data poisoning and other threats and conducts detailed threat modelling to identify and address potential risks in AI projects. Additionally, Synopsys ensures that the software used to develop AI models is secure and free from vulnerabilities through its secure software development practices.

“We assist organisations in establishing robust governance frameworks and access controls to protect sensitive data and AI models,” Ivancic told DSA. “By securing cloud environments where AI models are often developed and deployed, Synopsys helps organisations ensure that configurations are secure and well-managed.”

Just as important, Synopsys offers training for engineers and data scientists on the security implications of large language models and data poisoning to ensure that those who build and maintain AI models are aware of the risks and how to mitigate them. In short, Synopsys offers a holistic, multifaceted approach to counter data poisoning.

Why Risk It Anyway?

Data poisoning is a silent saboteur, capable of undermining the very foundation of AI.

Given the threats involved and the accompanying risks, anything less than a holistic and multifaceted approach against data poisoning could potentially lead to disastrous results—and prevent an organisation from fully realising the immense benefits of AI.

To harness the full potential of this transformative technology, organisations must prioritise data integrity and invest in robust defence mechanisms. Ignoring this threat is not an option; it’s a gamble with potentially catastrophic consequences.