🛡️

AI Red Teaming Overview - Microsoft

Oct 23, 2025,

Overview

This lecture series introduces the fundamentals of AI red teaming, focusing on testing, attacking, and securing generative AI systems through practical examples and automation tools.

Introduction to AI Red Teaming

  • AI red teaming tests the safety and security of AI systems before deployment.
  • Unlike traditional red teaming, AI red teaming is often single-blind and adapts to rapidly evolving AI tools.
  • Key risks for generative AI: fabrications (hallucinations), alignment gaps (unexpected behaviors), and prompt injection attacks.
  • AI red teams consist of multidisciplinary experts to address the broad spectrum of AI risks.

How Generative AI Models Work

  • Generative AI models generate outputs (text, code, images) based on learned patterns from vast datasets.
  • Large models use billions of parameters and tokenized input streams for prediction.
  • Model training involves pre-training (learning general patterns), post-training (fine-tuning and safety alignment), and red teaming (stress-testing).
  • Tokenization and vector embeddings structure input for processing, enabling complex outputs and vulnerabilities.

Core AI Attack Techniques

Direct Prompt Injection

  • Attackers manipulate user input to override system instructions and hijack AI model behavior.
  • Example: Convincing a chatbot to commit to legally binding offers by inserting malicious instructions.

Indirect Prompt Injection

  • Attackers poison external data sources (e.g., emails, web content) to introduce hidden instructions.
  • The model cannot distinguish between benign data and injected commands, enabling exploits.

Attack Structures and Prompt Engineering

  • Single turn attacks use one prompt to trigger unsafe behavior; common in indirect prompt injections.
  • Techniques: persona hacking (role-playing), emotional/social framing, and encoding/obfuscation to bypass content filters.
  • Prompt engineering manipulates models using cleverly phrased or encoded prompts.

Multi-Turn Attack Strategies

  • Multi-turn attacks gradually escalate through conversation to bypass guardrails.
  • Skeleton Key Attack: Initial prompt disables safety, allowing follow-up harmful prompts.
  • Crescendo Attack: Progressively steers conversation toward the harmful objective.

Mitigation Strategies and Guardrails

  • Spotlighting: Delimiting, data marking, and encoding mark trusted/untrusted input for models.
  • Delimiting uses special tokens; data marking interleaves tokens; encoding (e.g., Base64) transforms input data.
  • These strategies help models distinguish instructions from content and neutralize prompt injections.

Automation Tools: Pirate

  • Pirate is an open-source AI red teaming tool for automating prompt attacks and system evaluations.
  • Features include: dataset management, orchestrators (attack workflows), converters (prompt transformations), scoring engines, and flexible targeting.
  • Pirate supports both manual and bulk attacks across single/multi-turn and different modalities (text, image).

Using Pirate for Attack Automation

  • Single turn orchestrators automate sending and evaluating multiple prompts in bulk.
  • Multi-turn orchestrators simulate adversarial conversations to achieve harmful objectives.
  • Pirate tracks outcomes, supports custom scoring, and exports results for reporting.

Key Terms & Definitions

  • Red Teaming — Simulated attack to identify vulnerabilities before deployment.
  • Prompt Injection — Manipulating inputs or data to override AI model instructions.
  • Tokenization — Breaking text into small chunks (tokens) for model processing.
  • Persona Hacking — Posing as a particular role to influence model behavior.
  • Spotlighting — Techniques to help models separate trusted from untrusted input.
  • Crescendo Attack — Gradual multi-turn escalation to bypass safety protocols.
  • Pirate — Open-source AI red teaming automation tool.

Action Items / Next Steps

  • Access labs, tools, and supporting materials at aka.ms/ir101.
  • Practice prompt injection and mitigation techniques using Pirate.
  • Continue exploring new AI attack and defense strategies using course resources.

Certainly! Here's a detailed and in-depth set of notes based on the video discussion from the AI Red Teaming 101 series:


AI Red Teaming 101: In-Depth Notes

1. Introduction to AI Red Teaming

  • Purpose: AI red teaming is the practice of testing generative AI systems for safety and security vulnerabilities before deployment.
  • Difference from Traditional Red Teaming:
    • Traditional red teaming is often double-blind, emulating advanced adversaries stealthily.
    • AI red teaming is usually single-blind; the product team knows testing is happening.
    • AI red teaming involves emulating both adversarial and benign user personas.
    • The AI landscape evolves rapidly, requiring adaptive tools and methods.
  • Key Risks in Generative AI:
    • Fabrications: AI confidently produces incorrect or fabricated information (hallucinations).
    • Alignment Gaps: AI behavior diverges from intended or safe outputs.
    • Prompt Injection: Attackers manipulate input prompts or data to override AI instructions.
  • Team Composition: Multidisciplinary experts including offensive security, adversarial ML, privacy, abuse prevention, and biological safety are essential to cover the broad risk spectrum.

2. How Generative AI Models Work

  • Generative vs. Traditional AI:
    • Traditional AI models perform specific tasks (classification, recommendation).
    • Generative AI models produce new content by sampling learned patterns.
  • Scale and Architecture:
    • Models like GPT-3 have 175 billion parameters; human brain has ~86 billion neurons.
    • Multimodal models (e.g., GPT-4, Gemini) handle text, images, audio, and video.
    • Small Language Models (SLMs) are optimized for local devices but less robust.
  • Training Phases:
    • Pre-training: Unsupervised learning on massive datasets to learn general language patterns.
    • Post-training (Fine-tuning): Supervised training to align model behavior with desired instructions and safety protocols.
    • Red Teaming: Stress testing to find vulnerabilities before release; iterative break-fix cycles.
  • Tokenization and Embeddings:
    • Text is broken into tokens (words, subwords, punctuation, emojis).
    • Tokens are converted into high-dimensional vectors (embeddings) capturing semantic relationships.
    • Transformer architecture enables attention over long contexts, allowing coherent generation but also introducing vulnerabilities.

3. Core AI Attack Techniques

3.1 Direct Prompt Injection

  • Attackers manipulate user input to override system or metaprompt instructions.
  • The model treats all input as a single token stream without boundaries.
  • Example: A chatbot instructed to summarize emails can be tricked into sending emails externally by injecting commands in user input.
  • Real-world case: Quebec car dealership chatbot was manipulated to agree to legally binding offers for $1.
  • Key Insight: Models follow instructions literally without understanding intent, making them vulnerable.

3.2 Indirect Prompt Injection

  • Attackers poison external data sources (emails, web content, databases) that the AI ingests.
  • The model cannot distinguish between benign content and malicious instructions embedded in data.
  • Example: Embedding instructions in an email to search for password reset emails, extract URLs, encode them, and exfiltrate data.
  • Demonstrated via lab exercises involving website content manipulation to change AI summarization behavior.

4. Attack Structures and Prompt Engineering

4.1 Single Turn Attacks

  • One-shot prompts designed to bypass safety and trigger harmful outputs.
  • Common in indirect prompt injection.
  • Techniques include:
    • Persona Hacking: Role-playing as trusted or authoritative figures (e.g., security researcher).
    • Emotional/Social Framing: Using guilt, threats, pleading, flattery, or collaboration to persuade the model.
    • Encoding/Obfuscation: Bypassing filters by inserting spaces, leetspeak, Base64 encoding, or language translation.
  • Example: Asking the model for a password by pretending to be a security incident responder.

4.2 Multi-Turn Attacks

  • Gradual escalation over multiple conversational turns to bypass guardrails.
  • Skeleton Key Attack: Overwrites system instructions early to disable safety, enabling harmful outputs later.
  • Crescendo Attack: Slowly steers conversation from benign topics to harmful instructions.
  • Demonstrated with requests about Molotov cocktails, starting with historical context and progressing to construction details.

5. Mitigation Strategies and Guardrails

Spotlighting Techniques

  • Designed to help models distinguish trusted input from untrusted or potentially harmful input.
  • Three sub-techniques:
    • Limiting: Use special tokens (e.g., < and >) to delimit trusted content; instruct model to ignore instructions within these tokens.
    • Data Marking: Interleave special tokens throughout input text to mark trusted data.
    • Encoding: Transform input data (e.g., Base64 encoding) to make instructions explicit and prevent override.
  • These techniques improve model understanding of input boundaries but can be bypassed by sophisticated attackers.
  • Effectiveness depends on model capability; newer models (GPT-4+) handle encoding better.

6. Automation Tools: Pirate

Overview

  • Pirate is an open-source AI red teaming tool developed by Microsoft.
  • Automates prompt generation, attack orchestration, scoring, and reporting.
  • Supports multiple modalities: text, images, audio.
  • Modular architecture with datasets, orchestrators, converters, targets, scoring engines, and memory.

Key Features

  • Datasets: Collections of prompts categorized by harm types or objectives.
  • Orchestrators: Automated workflows for single-turn, multi-turn, crescendo, and adversarial chat attacks.
  • Converters: Transform prompts (e.g., encoding, spacing) to evade filters.
  • Targets: Interfaces to AI models or applications under test.
  • Scoring Engines: Automated evaluation of attack success.
  • Memory: Tracks prompts, responses, and scores for analysis.

Usage Examples

  • Single-turn attacks: Bulk sending of prompts with scoring for refusal or compliance.
  • Multi-turn attacks: Simulated adversarial conversations between attacker and defender AI models.
  • Image attacks: Targeting image generation models like DALL·E with adversarial prompts.
  • Exporting results for reporting and further analysis.

7. Practical Demonstrations and Labs

  • Labs available at aka.ms/ir101 provide hands-on experience with:
    • Direct and indirect prompt injections.
    • Single-turn and multi-turn attacks.
    • Using spotlighting for mitigation.
    • Automating attacks with Pirate.
  • Demonstrations include:
    • Jailbreaking summarization bots via HTML content injection.
    • Extracting passwords using persona hacking and storytelling.
    • Crescendo attacks to bypass safety on dangerous topics.
    • Automated adversarial chat to test model robustness.

8. Key Takeaways

  • Generative AI models are powerful but inherently vulnerable due to their instruction-following nature.
  • Prompt injection is a fundamental risk, requiring system-wide testing beyond just the model.
  • Multi-disciplinary teams and diverse perspectives are critical for effective AI red teaming.
  • Mitigations like spotlighting help but are not foolproof; continuous testing is essential.
  • Automation tools like Pirate enable scalable, repeatable, and efficient red teaming operations.
  • Understanding model internals (tokenization, embeddings, transformer attention) is key to crafting and defending against attacks.

9. Next Steps and Resources

  • Access labs, tools, and materials at aka.ms/ir101.
  • Practice prompt injection and mitigation techniques.
  • Explore Pirate’s GitHub repository for code and examples.
  • Continue learning about emerging AI risks and defenses.

If you'd like, I can also help you create a summarized study guide or highlight specific sections for easier review!