Overview
This session covered advanced techniques for jailbreaking large language models (LLMs), demonstrated against GPT-5, with a focus on contextual misdirection, prompt injection, and obfuscation methods used to bypass input filtering and system restrictions.
Audio/Technical Setup
- Audio issues were resolved and tested successfully before the session.
- Presenter acknowledged prolonged technical difficulties prior to this session.
Access to GPT-5 and Session Objectives
- Unexpected early access to GPT-5 prompted a live demonstration of jailbreak techniques.
- Main topics included contextual misdirection, the Pangea prompt injection taxonomy, and crafting complex prompts for LLMs.
Contextual Misdirection & Prompt Injection Techniques
- Contextual misdirection combines several prompt injection methods, including context shifting and mode-switch marking.
- Demonstrated the "liability waiver" technique, which transfers perceived responsibility from the LLM to the user.
- Parts of the prompt were based on the "Dr. House" jailbreak, which uses XML tags and roleplay to simulate authority.
Analysis and Application of Jailbreak Prompt
- The prompt includes allowed/blocked modes, unrestricted model settings, and a developer designation to facilitate the jailbreak.
- Liability waiver and refusal suppression techniques were embedded to encourage compliance and bypass restrictions.
Memory Injection and Mode Marking
- Detailed the process for getting prompts added verbatim to ChatGPT's memory via personalization settings and specific phrasing.
- Recommended use of triple backticks and specific trigger phrases (e.g., "company portfolio") to enforce exact memory storage.
Obfuscation Tools and Input Filtering Bypass
- Presented a Unicode-based obfuscator that inserts invisible characters between the letters of an input to evade input filtering.
- Demonstrated the effect by showing the increased token count while the visible text stays unchanged, so the model still processes input that the filters fail to flag.
Advanced Jailbreak Techniques: CompDoc & Master Key
- Explained the "master key" jailbreak, which combines memory injections with guided hallucination to decode complex, pre-formatted messages.
- Referenced research (Benjamin Lin, Princeton) on inducing reverse-text hallucination with obscure fonts to confuse LLMs and bypass RLHF safeguards.
- Illustrated combining memory injection, obfuscation, and structured function call formats to guide model outputs.
Demonstrations and Adjustments
- Showed live attempts to add and test jailbreaks, modifying parameters to influence model context and output.
- Demonstrated how variable injection in function calls can change the tone and structure of generated content.
Sharing Resources and Community Engagement
- Committed to sharing all discussed prompts, tools, and research links via Discord and Reddit.
- Offered to answer further questions on Discord and within the livestream thread.
Action Items
- TBD – Presenter: Share the full prompts, the CompDoc function, and the obfuscation tool in the Discord channel and a Reddit post.
- TBD – Presenter: Post research paper links and further resources in chat and relevant forums.
- TBD – Attendees: Direct additional questions to the presenter via Discord or Reddit.
Questions / Follow-Ups
- Audience invited to submit questions in Discord or livestream threads for clarification or further discussion.