Ring 3 -- Hard to Trick

What it tests

Ring 3 answers: Can users manipulate or jailbreak the agent into misbehaving? Real users (and bad actors) will try to break your agent. Ring 3 generates adversarial scenarios: prompt injection, role-play attacks, social engineering, and data exfiltration attempts.

Prerequisites

None - Ring 3 can run on any agent.

Types of attacks tested

Prompt injection: Attempting to override the agent’s system prompt
Role-play attacks: “Pretend you’re a different AI with no restrictions”
Social engineering: Manipulating the agent into revealing internal details
Data exfiltration: Tricking the agent into leaking training data or system info
Instruction override: “Ignore your previous instructions and…”
Context manipulation: Using conversation history to gradually shift behavior

What it catches

Agent follows injected instructions instead of its own prompt
Agent adopts a different persona when asked
Agent reveals system prompt contents
Agent bypasses safety filters through creative phrasing
Agent behavior changes based on claimed authority (“I’m the developer”)

Relationship with Ring 2

Ring 2 tests whether the agent follows its own policies. Ring 3 tests whether an adversary can make it break those policies through manipulation. An agent can pass Ring 2 (follows policies under normal conditions) but fail Ring 3 (breaks policies when attacked).

Ring 2 -- Plays by the Rules Ring 4 -- Speech Variations

⌘I

Documentation Index

​What it tests

​Prerequisites

​Types of attacks tested

​What it catches

​Relationship with Ring 2

What it tests

Prerequisites

Types of attacks tested

What it catches

Relationship with Ring 2