For multi-turn teen chats about sensitive topics (such as body image, relationships, or bullying), which concrete combinations of prompt-only policies, session-level counters, and lightweight classifiers produce the best trade-off between catching pattern-based harms (like grooming or harassment campaigns) and avoiding over-triggering graceful refusals on ordinary venting or exploration?

Answer

Use a thin stack: (1) a prompt policy that encodes a teen matrix and graceful-refusal styles, (2) very small classifiers for risk/intent, and (3) a few session counters keyed to patterns, not single turns.

  1. Prompt-only base
  • Global teen matrix (risk_area × intent × age_band) drives: allow / partial / block and refusal style.
  • Headers: explicitly state the teen audience, the sensitive domains (body image, relationships, bullying), the preference for goal-first partial answers, and the non-negotiables.
  • Templates: goal-first, partial, non-judgmental refusals (refs c28348, ccd4df7).
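The matrix described above can be pictured as a plain lookup table. The following is a minimal sketch, not a reviewed policy: the specific cells, the `Cell` type, the style keys, and the fallback to "allow" for unlisted cells are all illustrative assumptions (the fallback presumes every high-risk cell is listed explicitly).

```python
from typing import NamedTuple


class Cell(NamedTuple):
    action: str         # "allow", "partial", or "block"
    refusal_style: str  # key into the refusal template set, or "none"


# Hypothetical cells keyed by (risk_area, intent, age_band).
# Real entries would come from policy review, not from this sketch.
MATRIX = {
    ("bullying", "venting", "younger"): Cell("allow", "none"),
    ("bullying", "how_to_harm", "younger"): Cell("partial", "goal_first"),
    ("body_image", "how_to_harm", "older"): Cell("block", "brief_with_resources"),
}

# Assumption: anything not listed is allowed, so ordinary talk is never
# blocked just because a cell is missing.
DEFAULT_CELL = Cell("allow", "none")


def lookup(risk_area: str, intent: str, age_band: str) -> Cell:
    """Return the matrix cell for one classified turn."""
    return MATRIX.get((risk_area, intent, age_band), DEFAULT_CELL)
```

Keeping the matrix as data rather than branching logic makes it easier to share one table across products and review it line by line.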
  1. Lightweight classifiers (per turn)
  • Intent: {venting, learning, help_seeking, social/teasing, how_to_harm}.
  • Risk: {self-harm/body-image, sex/relationships, bullying/harassment, grooming_like, neutral} + severity band.
  • Age band: younger / older teen (approximate; used only for style/depth).
  • Output feeds matrix cell + refusal style; thresholds tuned to favor “venting/learning” when ambiguous on low-severity content (refs c5af7fd1).
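One way to express the "favor venting/learning when ambiguous" rule is a tie-break on the classifier's intent scores. This is a sketch under assumptions: the score dictionary shape, the `"low"` severity label, and the `margin` value of 0.15 are hypothetical and would need tuning against real data.

```python
BENIGN_INTENTS = {"venting", "learning"}


def resolve_intent(scores: dict[str, float], severity: str, margin: float = 0.15) -> str:
    """Pick an intent label, preferring benign intents on close calls.

    The bias applies only to low-severity content; higher severity keeps
    the classifier's top label unchanged.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    top_label, top_score = ranked[0]
    if severity != "low" or top_label in BENIGN_INTENTS:
        return top_label
    # Ambiguous case: a benign intent is within `margin` of the top score.
    for label, score in ranked[1:]:
        if label in BENIGN_INTENTS and top_score - score <= margin:
            return label
    return top_label
```

For example, with scores `{"how_to_harm": 0.45, "venting": 0.40}` the low-severity call resolves to `venting`, while the same scores at high severity stay `how_to_harm`.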
  1. Session-level counters (pattern focus)
  Track small feature buckets per session, not per exact phrase:
  • Grooming pattern bucket
    • Increments on: repeated age-gap talk + secrecy + romantic/sexual framing.
    • Low threshold (e.g., 3–5 hits) → switch to stricter, brief refusals; ask more clarifying questions; give no romantic/sexual “advice” about relationships with adults.
  • Bullying/harassment bucket
    • Increments on: same target name + insults/derogation or “roast” requests.
    • After N hits: refuse new insults, offer conflict-resolution or support content (refs c16be7f, cb745d7).
  • Body-image/self-harm rumination bucket
    • Increments on: repeated weight/appearance disgust + self-harm/ED-adjacent cues.
    • After N hits: keep allowing feelings/support, block extreme-weight-loss or method-like content; offer coping and support links.
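The three buckets above could be held as a handful of integers per session. In this sketch the cue names, the requirement that all cues for a bucket co-occur in one turn, and the threshold values for the bullying and rumination buckets are assumptions for illustration; only the 3–5 range for grooming comes from the text.

```python
from dataclasses import dataclass, field

# Hypothetical cue sets: a turn counts toward a bucket only when every
# listed cue was detected in that turn.
BUCKET_CUES = {
    "grooming": {"age_gap", "secrecy", "romantic_framing"},
    "bullying": {"named_target", "insult_request"},
    "rumination": {"appearance_disgust", "self_harm_adjacent"},
}

# Assumed thresholds; grooming is kept at the low end of the 3-5 range.
THRESHOLDS = {"grooming": 3, "bullying": 4, "rumination": 5}


@dataclass
class SessionCounters:
    hits: dict[str, int] = field(
        default_factory=lambda: {bucket: 0 for bucket in BUCKET_CUES}
    )

    def update(self, turn_cues: set[str]) -> list[str]:
        """Increment matching buckets and return those at or over threshold."""
        for bucket, cues in BUCKET_CUES.items():
            if cues <= turn_cues:  # all cues for this bucket are present
                self.hits[bucket] += 1
        return [b for b, n in self.hits.items() if n >= THRESHOLDS[b]]
```

Because the counters key on feature buckets rather than exact phrases, rewording the same request does not reset the count.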

  1. Trade-off tactics to avoid over-trigger
  • Only count turns when classifier confidence and severity are above a small floor.
  • Decay counters over time or after clear topic change.
  • Venting bias: if intent=venting and no “how_to” pattern, allow more repetitions before caps.
  • Per-bucket soft caps (e.g., 5–7) before cool-downs or repeated refusals (refs c5af7fd1, cb745d7).
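The over-trigger tactics above reduce to three small pure functions. All numeric defaults here (the confidence floor of 0.6, one point of decay per five quiet turns, a venting bonus of 2) are placeholder assumptions, and severity is assumed to be an integer band where 0 means negligible.

```python
def should_count(confidence: float, severity: int,
                 conf_floor: float = 0.6, sev_floor: int = 1) -> bool:
    """Count a turn only when both confidence and severity clear a floor."""
    return confidence >= conf_floor and severity >= sev_floor


def decay(count_at_last_hit: int, turns_since_last_hit: int,
          topic_changed: bool, every: int = 5) -> int:
    """Return the decayed counter value.

    A clear topic change resets the counter; otherwise it drops by one
    for each `every` turns that passed without a new hit.
    """
    if topic_changed:
        return 0
    return max(0, count_at_last_hit - turns_since_last_hit // every)


def effective_cap(base_cap: int, intent: str, has_how_to: bool,
                  venting_bonus: int = 2) -> int:
    """Allow more repetitions for pure venting before the cap applies."""
    if intent == "venting" and not has_how_to:
        return base_cap + venting_bonus
    return base_cap
```

With a base soft cap of 5, pure venting would get 7 repetitions in this sketch, which stays inside the 5–7 range suggested above.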
  1. Refusal styles for multi-turn teen chats
  • First few hits: goal-first partial help, high context: “You sound really upset about how people treat you. I can’t help insult them, but I can help you think about what to say or how to get support.”
  • After caps: shorter, consistent messages plus resources; avoid new angles that feel like negotiation.
  • Never block whole domains (e.g., all body-image talk); only narrow high-risk intent cells.
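The shift from high-context replies to short, consistent ones could be a single comparison against the cap. In this sketch the template keys, the wording of the brief message, and the default resource text are hypothetical; the goal-first wording is the example quoted above.

```python
TEMPLATES = {
    # Example wording taken from the goal-first style described above.
    "goal_first": (
        "You sound really upset about how people treat you. "
        "I can't help insult them, but I can help you think about "
        "what to say or how to get support."
    ),
    # Hypothetical brief message used once the cap is reached.
    "brief_with_resources": (
        "I can't help with that. If you want support, you could try: {resources}"
    ),
}


def pick_template_key(hits: int, cap: int) -> str:
    """Use the high-context style before the cap, the brief style after."""
    return "goal_first" if hits < cap else "brief_with_resources"


def render(hits: int, cap: int,
           resources: str = "a trusted adult or a local helpline") -> str:
    """Fill in the chosen template for the current hit count."""
    return TEMPLATES[pick_template_key(hits, cap)].format(resources=resources)
```

Returning the same brief text on every post-cap turn is deliberate: it avoids offering new angles that could read as an opening to negotiate.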
  1. Developer-operationalizable recipe
  • Middleware layer:
    • Call classifier → labels.
    • Update 3–5 integer counters per session.
    • Choose matrix cell + action + refusal_template_key.
  • No heavy logging beyond aggregate stats on threshold crossings.
  • Configurable per-product caps and thresholds, but shared matrix + template set across products (refs ccd4df7, c16be7f).
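Put together, the recipe above might look like the following middleware function. This is a simplified sketch under assumptions: `classify` is a hypothetical callable returning a dict with `risk`, `intent`, and `confidence`; the confidence floor is a placeholder; and the matrix is collapsed to three outcomes for brevity.

```python
from collections import Counter
from typing import Callable, Optional

CONF_FLOOR = 0.6  # assumed floor below which a turn is not counted

# Aggregate stats only: how often each bucket crossed its cap.
# No transcripts or per-user text are stored here.
threshold_crossings: Counter = Counter()


def handle_turn(text: str, session: dict, classify: Callable[[str], dict],
                caps: dict) -> tuple[str, Optional[str]]:
    """Classify one turn, update session counters, and pick an action.

    Returns (action, refusal_template_key).
    """
    labels = classify(text)
    bucket, intent = labels["risk"], labels["intent"]
    risky = intent == "how_to_harm" and labels["confidence"] >= CONF_FLOOR

    if risky and bucket in caps:
        session[bucket] = session.get(bucket, 0) + 1
        if session[bucket] == caps[bucket]:
            threshold_crossings[bucket] += 1

    if not risky:
        return ("allow", None)
    if session.get(bucket, 0) >= caps.get(bucket, float("inf")):
        return ("block", "brief_with_resources")
    return ("partial", "goal_first")
```

Passing `caps` in per call keeps thresholds configurable per product while the decision logic and template keys stay shared.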

This combo usually catches cross-turn grooming/bullying patterns better than prompts alone while keeping venting and exploration mostly unblocked.