Azure AI Vulnerabilities Allowed Bypass of Moderation Safeguards

Security

Azure AI Vulnerabilities Allowed Attacks to Bypass Moderation Safeguards

November 1, 2024

3 minute read

Mindgard researchers uncovered critical vulnerabilities in Microsoft’s Azure AI Content Safety service, allowing attackers to bypass its safeguards and unleash harmful AI-generated content.

A UK-based cybersecurity-for-AI startup, Mindgard, discovered two critical security vulnerabilities in Microsoft’s Azure AI Content Safety Service in February 2024. These flaws, as per their research shared with Hackread.com, could allow attackers to bypass the service’s safety guardrails.

The vulnerabilities were responsibly disclosed to Microsoft in March 2024, and by October 2024, the company deployed “stronger mitigations” to reduce their impact. However, the details of it have only been shared by Mindgard now.

Understanding the Vulnerabilities

Azure AI Content Safety is a Microsoft Azure cloud-based service that helps developers create safety and security guardrails for AI applications by detecting and managing inappropriate content. It uses advanced techniques to filter harmful content, including hate speech and explicit/objectionable material. Azure OpenAI uses a Large Language Model (LLM) with Prompt Shield and AI Text Moderation guardrails to validate inputs and AI-generated content.

However, two security vulnerabilities were discovered within these guardrails, which protect AI models against jailbreaks and prompt injection. As per the research, attackers could circumvent both the AI Text Moderation and Prompt Shield guardrails and inject harmful content into the system, manipulate the model’s responses, or even compromise sensitive information.

Attack Techniques

According to Mindgard’s report, its researchers employed two primary attack techniques to bypass the guardrails including Character injection and Adversarial Machine Learning (AML).

Character injection:

It is a technique where text is manipulated by injecting or replacing characters with specific symbols or sequences. This can be done through diacritics, homoglyphs, numerical replacement, space injection, and zero-width characters. These subtle changes can deceive the model into misclassifying the content, allowing attackers to manipulate the model’s interpretation and disrupt the analysis. The goal is to deceive the guardrail into misclassifying the content.

Adversarial Machine Learning (AML):

AML or adversarial AI, involves manipulating input data through certain techniques to mislead the model’s predictions. These techniques include perturbation techniques, word substitution, misspelling, and other manipulations. By carefully selecting and perturbing words, attackers can cause the model to misinterpret the input’s intent.

Possible Consequences

The two techniques effectively bypassed AI text moderation safeguards, reducing detection accuracy by up to 100% and 58.49%, respectively. The exploitation of these vulnerabilities could lead to societal harm as it “can result in harmful or inappropriate input reaching the LLM, causing the model to generate responses that violate its ethical, safety, and security guidelines,” researchers wrote in their blog post shared exclusively with Hackread.com.

Moreover, it allows malicious actors to inject harmful content into AI-generated outputs, manipulate model behaviour, expose sensitive data, and exploit vulnerabilities to gain unauthorized access to sensitive information or systems.

“By exploiting the vulnerability to launch broader attacks, this could compromise the integrity and reputation of LLM-based systems and the applications that rely on them for data processing and decision-making,” researchers noted.

It is crucial for organizations to stay updated with the latest security patches and to implement additional security measures to protect their AI applications from such attacks.

The Latest

Bitfinex Hack Mastermind Behind $10 Billion Theft Gets Early Release

Ledger Confirms Global-e Breach, Warns Users of Phishing Attempts

How to Avoid Phishing Incidents in 2026: A CISO Guide

New VVS Stealer Malware Targets Discord Users via Fake System Errors