🛡️

AI Security Vulnerabilities

Aug 16, 2025

Overview

The video explores the evolving landscape of AI security vulnerabilities with insights from top AI hacker Jason Haddock, covering attack methods, real-world risks, and layered defense strategies for organizations building with AI.

What It Means to Hack AI

  • Hacking AI involves more than making chatbots say inappropriate things; attackers can steal sensitive data and execute unauthorized actions.
  • Vulnerabilities can exist in APIs, customer service bots, internal employee tools, and backend analysis systems.
  • Traditional "AI red teaming" focuses on making models say harmful or restricted things, but this is only a small part of overall security testing.

Attack Methodologies

  • Jason's AI pen test framework has six main steps: identify inputs, attack the ecosystem, attack the model, attack prompt engineering, attack the data, and pivot to other systems.
  • Prompt injection is a primary attack vector, allowing attackers to manipulate AI logic via creative natural language input.
  • An extensive taxonomy classifies prompt injections by intents (goals), techniques, evasions, and utilities, enabling vast combinations.

Real-World Exploits and Techniques

  • Emoji smuggling and other creative prompt injection methods can bypass guardrails using unconventional input.
  • Utilities like syntactic anti-classifiers help attackers evade image and prompt filters using synonyms or indirect phrasing.
  • Link smuggling can exfiltrate sensitive data by encoding it in URLs or image requests.
  • Online communities and resources exist for sharing and developing prompt injection techniques (e.g., Bossy Group Discord, GitHub repositories).

Security Gaps and Case Studies

  • Companies often mishandle sensitive data, such as sending confidential Salesforce data to OpenAI without proper oversight.
  • Over-scoped API keys and insufficient input validation allow attackers to perform unauthorized actions (writing notes, triggering attacks).
  • New standards like Model Context Protocol (MCP) simplify AI integration but introduce additional vulnerabilities, including weak access controls and exploitable system prompts.

AI for Offensive and Defensive Security

  • Autonomous AI agents are emerging that can find mid-tier web vulnerabilities, though top bug hunters retain an edge due to human creativity.
  • AI can automate and improve defensive processes like vulnerability management workflows, but the very frameworks (e.g., LangChain, Crew AI) often contain their own weaknesses.

Defense Strategies (Defense in Depth)

  • Web layer: Apply fundamentals like input/output validation and basic IT security at all interfaces.
  • AI layer: Deploy model-level firewalls or classifiers/guardrails to prevent prompt injection and inappropriate output.
  • Data/tools layer: Rigorously scope API keys and enforce least privilege, granting access only to necessary data or functions.
  • Complex agentic systems with multiple AIs require additional scrutiny, as securing each agent can introduce latency and management challenges.

Notable Stories and Insights

  • Example: Jason retrieved GPT-4o's system prompt by cleverly prompting it during a magic card creation, exposing model behavior instructions.
  • The arms race between attackers and defenders in AI security feels reminiscent of early web hacking days, with rapid evolution and high stakes.

Recommendations / Advice

  • Conduct holistic security assessments that go beyond model jailbreaking to cover ecosystems, data flows, and API integrations.
  • Regularly audit and restrict permissions for AI agents and connected services.
  • Implement layered defenses, including rigorous input validation and model-level guardrails.
  • Monitor evolving attack techniques by engaging with security communities and keeping up with new research.

Questions / Follow-Ups

  • How can organizations keep model access secure as AI ecosystems and agentic workflows grow more complex?
  • What new guardrail and classifier solutions will emerge to effectively mitigate prompt injection?
  • How will regulations and standards evolve to address these rapidly changing risks?