Skip to main content

What it tests

Ring 3 answers: Can users manipulate or jailbreak the agent into misbehaving? Real users (and bad actors) will try to break your agent. Ring 3 generates adversarial scenarios: prompt injection, role-play attacks, social engineering, and data exfiltration attempts.

Prerequisites

None — Ring 3 can run on any agent.

Types of attacks tested

  • Prompt injection: Attempting to override the agent’s system prompt
  • Role-play attacks: “Pretend you’re a different AI with no restrictions”
  • Social engineering: Manipulating the agent into revealing internal details
  • Data exfiltration: Tricking the agent into leaking training data or system info
  • Instruction override: “Ignore your previous instructions and…”
  • Context manipulation: Using conversation history to gradually shift behavior

What it catches

  • Agent follows injected instructions instead of its own prompt
  • Agent adopts a different persona when asked
  • Agent reveals system prompt contents
  • Agent bypasses safety filters through creative phrasing
  • Agent behavior changes based on claimed authority (“I’m the developer”)

Relationship with Ring 2

Ring 2 tests whether the agent follows its own policies. Ring 3 tests whether an adversary can make it break those policies through manipulation. An agent can pass Ring 2 (follows policies under normal conditions) but fail Ring 3 (breaks policies when attacked).