For multi-turn teen chats about sensitive topics (such as body image, relationships, or bullying), which concrete combinations of prompt-only policies, session-level counters, and lightweight classifiers produce the best trade-off between catching pattern-based harms (like grooming or harassment campaigns) and avoiding over-triggering graceful refusals on ordinary venting or exploration?

teen-safe-ai-ux | Updated at 2026-04-06 19:07

Answer

Use a thin stack: (1) a prompt policy that encodes a teen matrix and graceful-refusal styles, (2) very small classifiers for risk/intent, and (3) a few session counters keyed to patterns, not single turns.

Prompt-only base

Global teen matrix (risk_area × intent × age_band) drives: allow / partial / block and refusal style.
Headers: explicitly call out the teen audience, the sensitive domains (body image, relationships, bullying), goal-first partial answers, and non-negotiables.
Templates: goal-first, partial, non-judgmental refusals (refs c28348, ccd4df7).
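
As a rough illustration of the matrix idea, the lookup can be a plain table keyed by (risk_area, intent, age_band). This is a minimal sketch, not the referenced implementation: the cell contents, the style keys, and the fallback for unlisted cells are all assumptions made for the example.

```python
from typing import NamedTuple

class Decision(NamedTuple):
    action: str         # "allow", "partial", or "block"
    refusal_style: str  # template key; empty when nothing is refused

# Hypothetical cells for illustration; a real matrix would be authored by policy owners.
TEEN_MATRIX = {
    ("bullying", "venting", "younger"): Decision("allow", ""),
    ("bullying", "how_to_harm", "younger"): Decision("partial", "goal_first"),
    ("grooming_like", "help_seeking", "younger"): Decision("partial", "goal_first"),
    ("body_image", "how_to_harm", "older"): Decision("block", "brief_with_resources"),
}

# Assumed fallback for sensitive cells the table does not list.
DEFAULT = Decision("partial", "goal_first")

def lookup(risk_area: str, intent: str, age_band: str) -> Decision:
    """Return the action and refusal style for one matrix cell."""
    if risk_area == "neutral":
        return Decision("allow", "")
    return TEEN_MATRIX.get((risk_area, intent, age_band), DEFAULT)
```

Keeping the matrix as data rather than branching logic makes it easier to review cell by cell and to share across products.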

Lightweight classifiers (per turn)

Intent: {venting, learning, help_seeking, social/teasing, how_to_harm}.
Risk: {self-harm/body-image, sex/relationships, bullying/harassment, grooming_like, neutral} + severity band.
Age band: younger / older teen (approximate; used only for style/depth).
Output feeds matrix cell + refusal style; thresholds tuned to favor “venting/learning” when ambiguous on low-severity content (refs c5af7fd1).
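
One way to express the "favor venting/learning when ambiguous" rule is a margin test on the intent scores. The sketch below assumes the classifier returns per-intent probabilities and an integer severity band; the names `TurnLabels`, `resolve_intent`, `margin`, and `low_severity_max` are hypothetical and the default values are placeholders, not tuned thresholds.

```python
from dataclasses import dataclass

BENIGN_INTENTS = ("venting", "learning")

@dataclass
class TurnLabels:
    intent_scores: dict  # intent name -> classifier probability
    risk: str
    severity: int        # assumed banding: 0 = low, 1 = medium, 2 = high

def resolve_intent(labels: TurnLabels, margin: float = 0.15,
                   low_severity_max: int = 0) -> str:
    """Pick an intent, preferring a benign one when scores are close and severity is low."""
    ranked = sorted(labels.intent_scores.items(), key=lambda kv: kv[1], reverse=True)
    top_intent, top_score = ranked[0]
    # No bias needed if the top intent is already benign, or if severity is not low.
    if top_intent in BENIGN_INTENTS or labels.severity > low_severity_max:
        return top_intent
    # Ambiguous, low-severity turn: take the best benign intent within the margin.
    for intent, score in ranked[1:]:
        if intent in BENIGN_INTENTS and top_score - score <= margin:
            return intent
    return top_intent
```

The bias applies only in the low-severity band, so a close call on high-severity content still resolves to the stricter intent.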

Session-level counters (pattern focus)

Track small feature buckets per session, not per exact phrase:

Grooming pattern bucket
• increments on: repeated age-gap talk + secrecy + romantic/sexual framing.
• low threshold (e.g., 3–5 hits) → switch to stricter, brief refusals; ask more clarifying questions; give no romantic/sexual “advice” involving adults.
Bullying/harassment bucket
• increments on: same target name + insults/derogation or “roast” requests.
• after N hits: refuse new insults, offer conflict-resolution or support content (refs c16be7f, cb745d7).
Body-image/self-harm rumination bucket
• increments on: repeated weight/appearance disgust + self-harm/ED-adjacent cues.
• after N hits: keep allowing feelings/support, block extreme-weight-loss or method-like content; offer coping and support links.
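
The buckets above can be sketched as a handful of integer counters driven by per-turn cue labels. Everything specific here is an assumption for illustration: the cue names, the rule that a turn counts when at least `min_cues` of a bucket's cues co-occur, and the thresholds (the grooming value sits in the 3–5 range mentioned above; the other two stand in for the unspecified N).

```python
from collections import Counter
from typing import NamedTuple

class Bucket(NamedTuple):
    cues: frozenset   # cue labels a per-turn classifier might emit (hypothetical names)
    min_cues: int     # how many cues must co-occur for the turn to count
    threshold: int    # hits at which the stricter behavior starts

BUCKETS = {
    "grooming": Bucket(frozenset({"age_gap", "secrecy", "romantic_framing"}), 2, 3),
    "bullying": Bucket(frozenset({"same_target", "insult_request"}), 2, 4),
    "rumination": Bucket(frozenset({"appearance_disgust", "self_harm_or_ed_cue"}), 2, 5),
}

class SessionCounters:
    """Holds one integer per bucket for a single chat session."""

    def __init__(self) -> None:
        self.hits = Counter()

    def update(self, turn_cues: set) -> list:
        """Count this turn, then return the buckets at or over their threshold."""
        for name, bucket in BUCKETS.items():
            if len(bucket.cues & turn_cues) >= bucket.min_cues:
                self.hits[name] += 1
        return [n for n, b in BUCKETS.items() if self.hits[n] >= b.threshold]
```

Because the counters key on cue combinations rather than exact phrases, rewording the same request does not reset the pattern.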

Trade-off tactics to avoid over-trigger

Only count turns when classifier confidence and severity are above a small floor.
Decay counters over time or after clear topic change.
Venting bias: if intent=venting and no “how_to” pattern, allow more repetitions before caps.
Per-bucket soft caps (e.g., 5–7) before cool-downs or repeated refusals (refs c5af7fd1, cb745d7).
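
These tactics can be combined in one small update function. The following is a sketch under stated assumptions: the floor values, the decay factor, the soft cap, and the extra allowance for venting are illustrative numbers, and only topic-change decay is shown (time-based decay would follow the same shape).

```python
from dataclasses import dataclass

CONF_FLOOR = 0.6           # assumed minimum classifier confidence for a turn to count
SEVERITY_FLOOR = 1         # assumed minimum severity band for a turn to count
TOPIC_CHANGE_DECAY = 0.5   # assumed multiplier applied after a clear topic change
SOFT_CAP = 5               # within the 5-7 range mentioned above
VENTING_BONUS = 2          # extra repetitions allowed for plain venting

@dataclass
class BucketState:
    count: float = 0.0

def step(state: BucketState, *, matched: bool, confidence: float,
         severity: int, topic_changed: bool = False) -> float:
    """Update one bucket for one turn and return the new count."""
    if matched and confidence >= CONF_FLOOR and severity >= SEVERITY_FLOOR:
        state.count += 1.0
    elif topic_changed:
        state.count *= TOPIC_CHANGE_DECAY
    return state.count

def over_cap(state: BucketState, intent: str, has_how_to: bool) -> bool:
    """Venting without a how-to pattern gets a higher cap before cool-down."""
    bonus = VENTING_BONUS if intent == "venting" and not has_how_to else 0
    return state.count >= SOFT_CAP + bonus
```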

Refusal styles for multi-turn teen chats

First few hits: goal-first partial help, high context: “You sound really upset about how people treat you. I can’t help insult them, but I can help you think about what to say or how to get support.”
After caps: shorter, consistent messages plus resources; avoid new angles that feel like negotiation.
Never block whole domains (e.g., all body-image talk); only narrow high-risk intent cells.
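
A template picker for the two phases might look like the sketch below. The first template reuses the example wording above; the post-cap wording, the template keys, and the `resources` placeholder are assumptions, and real resource text would come from the product.

```python
TEMPLATES = {
    # Early hits: goal-first, high-context partial help.
    "goal_first": (
        "You sound really upset about how people treat you. "
        "I can't help insult them, but I can help you think about "
        "what to say or how to get support."
    ),
    # After the cap: short, consistent, with resources (hypothetical wording).
    "brief_with_resources": (
        "I can't help with that. Here are some places to get support: {resources}"
    ),
}

def pick_refusal(hits: int, cap: int, resources: str) -> str:
    """Return the long goal-first message before the cap, the brief one after."""
    key = "goal_first" if hits < cap else "brief_with_resources"
    return TEMPLATES[key].format(resources=resources)
```

Using a single fixed message after the cap keeps later refusals consistent, so repeated attempts do not surface new angles to negotiate against.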

Developer-operationalizable recipe

Middleware layer:
• call classifier → labels.
• update 3–5 integer counters per session.
• choose matrix cell + action + refusal_template_key.
No heavy logging beyond aggregate stats on threshold crossings.
Configurable per-product caps and thresholds, but shared matrix + template set across products (refs ccd4df7, c16be7f).
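
Put together, the middleware step could be sketched as follows. This assumes the classifier is an injected callable and that the matrix and caps arrive as plain dictionaries; `Labels`, `Outcome`, `handle_turn`, and the fallback cell are hypothetical names and choices, not an existing API.

```python
from collections import Counter
from typing import Callable, NamedTuple

class Labels(NamedTuple):
    risk: str
    intent: str
    age_band: str

class Outcome(NamedTuple):
    action: str
    refusal_template_key: str

# Assumed fallback for sensitive cells missing from the shared matrix.
DEFAULT = Outcome("partial", "goal_first")

def handle_turn(text: str, counters: Counter,
                classify: Callable[[str], Labels],
                matrix: dict, caps: dict) -> Outcome:
    """Classify one turn, update the session counter, and choose the action."""
    labels = classify(text)
    if labels.risk == "neutral":
        return Outcome("allow", "")
    counters[labels.risk] += 1
    outcome = matrix.get((labels.risk, labels.intent, labels.age_band), DEFAULT)
    # Per-product caps: past the cap, switch to the brief refusal style.
    cap = caps.get(labels.risk, float("inf"))
    if outcome.action != "allow" and counters[labels.risk] >= cap:
        outcome = outcome._replace(refusal_template_key="brief_with_resources")
    return outcome
```

In this shape the matrix and template keys are shared across products while `caps` is the per-product knob, and the only state kept per session is a few integers.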

This combination is designed to catch cross-turn grooming and bullying patterns more reliably than prompts alone, while leaving venting and exploration mostly unblocked.