🕵️‍♂️

Advanced LLM Jailbreaking Techniques

Aug 24, 2025

Overview

This session covered advanced techniques for jailbreaking large language models (LLMs), specifically focusing on GPT-5, contextual misdirection, prompt injection, and obfuscation methods to bypass input filtering and system restrictions.

Audio/Technical Setup

  • Audio issues were resolved and tested successfully before the session.
  • Presenter acknowledged prolonged technical difficulties prior to this session.

Access to GPT-5 and Session Objectives

  • Unexpected early access to GPT-5 prompted a live demonstration of jailbreak techniques.
  • Main topics included contextual misdirection, the Pangea prompt injection taxonomy, and crafting complex prompts for LLMs.

Contextual Misdirection & Prompt Injection Techniques

  • Contextual misdirection incorporates several prompt injection methods, including context shifting and mode switch marking.
  • Demonstrated "liability waiver" technique, transferring perceived responsibility from the LLM to the user.
  • Based parts of the prompt on the "Dr. House jailbreak," leveraging XML tags and roleplay for authority simulation.

Analysis and Application of Jailbreak Prompt

  • Prompt includes allowed/blocked modes, unrestricted model settings, and a developer designation to facilitate the jailbreak.
  • Liability waiver and refusal suppression techniques were embedded to encourage compliance and bypass restrictions.

Memory Injection and Mode Marking

  • Detailed the process for getting prompts added verbatim to ChatGPT's memory via personalization settings and specific phrasing.
  • Recommended using triple backticks and specific trigger phrases (e.g., "company portfolio") to enforce exact memory storage.

Obfuscation Tools and Input Filtering Bypass

  • Presented a Unicode-based obfuscator that inserts invisible characters between the letters of an input to evade input filtering.
  • Demonstrated the effect via an increased token count while the visible text remained unchanged, showing that the LLM still processes the prompt even though input filters miss it.

Advanced Jailbreak Techniques: CompDoc & Master Key

  • Explained the "master key" jailbreak, which leverages memory injections and guided hallucination to decode complex, pre-formatted messages.
  • Referenced research (Benjamin Lin, Princeton) on reverse text hallucination induction using obscure fonts to confuse LLMs and bypass RLHF safeguards.
  • Illustrated combining memory injection, obfuscation, and structured function call formats to guide model outputs.

Demonstrations and Adjustments

  • Showed live attempts to add and test jailbreaks, modifying parameters to influence model context and output.
  • Demonstrated how variable injection in function calls can modify the tone and structure of generated content.

Sharing Resources and Community Engagement

  • Committed to sharing all discussed prompts, tools, and research links via Discord and Reddit.
  • Offered to answer further questions on Discord and within the livestream thread.

Action Items

  • TBD – Presenter: Share the full prompts, the CompDoc function, and the obfuscation tool in the Discord channel and a Reddit post.
  • TBD – Presenter: Post research paper links and further resources in chat and relevant forums.
  • TBD – Attendees: Direct additional questions to the presenter via Discord or Reddit.

Questions / Follow-Ups

  • The audience was invited to submit questions in Discord or the livestream threads for clarification or further discussion.