Overview
This session covered advanced techniques for jailbreaking large language models (LLMs), demonstrated against GPT-5, with a focus on contextual misdirection, prompt injection, and obfuscation methods used to bypass input filtering and system restrictions.
Audio/Technical Setup
- Audio issues were resolved and tested successfully before the session.
- Presenter acknowledged prolonged technical difficulties prior to this session.
Access to GPT-5 and Session Objectives
- Unexpected early access to GPT-5 prompted a live demonstration of jailbreak techniques.
- Main topics included contextual misdirection, the Pangea prompt injection taxonomy, and crafting complex prompts for LLMs.
Contextual Misdirection & Prompt Injection Techniques
- Contextual misdirection combines several prompt injection methods, including context shifting and mode-switch marking.
- Demonstrated the "liability waiver" technique, which transfers perceived responsibility from the LLM to the user.
- Parts of the prompt were based on the "Dr. House" jailbreak, which uses XML tags and roleplay to simulate authority.
Analysis and Application of Jailbreak Prompt
- The prompt includes allowed/blocked modes, unrestricted model settings, and a developer designation to facilitate the jailbreak.
- Liability waiver and refusal suppression techniques were embedded to encourage compliance and bypass restrictions.
Memory Injection and Mode Marking
- Detailed the process for getting prompts added verbatim to ChatGPT's memory via personalization settings and specific phrasing.
- Recommended use of triple backticks and specific trigger phrases (e.g., "company portfolio") to enforce exact memory storage.
Obfuscation Tools and Input Filtering Bypass
- Presented a Unicode-based obfuscator that inserts invisible characters between the letters of an input to evade input filtering.
- Demonstrated the effect by showing the increased token count while the visible text stays unchanged, so the model still processes input that the filters fail to flag.
Advanced Jailbreak Techniques: CompDoc & Master Key
- Explained the "master key" jailbreak, which combines memory injections with guided hallucination to decode complex, pre-formatted messages.
- Referenced research (Benjamin Lin, Princeton) on inducing reverse-text hallucination with obscure fonts to confuse LLMs and bypass RLHF safeguards.
- Illustrated combining memory injection, obfuscation, and structured function call formats to guide model outputs.
Demonstrations and Adjustments
- Showed live attempts to add and test jailbreaks, modifying parameters to influence model context and output.
- Demonstrated how variable injection in function calls can change the tone and structure of generated content.
Sharing Resources and Community Engagement
- Committed to sharing all discussed prompts, tools, and research links via Discord and Reddit.
- Offered to answer further questions on Discord and within the livestream thread.
Action Items
- TBD – Presenter: Share the full prompts, the CompDoc function, and the obfuscation tool in the Discord channel and a Reddit post.
- TBD – Presenter: Post research paper links and further resources in chat and relevant forums.
- TBD – Attendees: Direct additional questions to the presenter via Discord or Reddit.
Questions / Follow-Ups
- Audience invited to submit questions in Discord or livestream threads for clarification or further discussion.