For multi-turn teen chats about sensitive topics (such as body image, relationships, or bullying), which concrete combinations of prompt-only policies, session-level counters, and lightweight classifiers produce the best trade-off between catching pattern-based harms (like grooming or harassment campaigns) and avoiding over-triggering graceful refusals on ordinary venting or exploration?
teen-safe-ai-ux | Updated at
Answer
Use a thin stack: (1) a prompt policy that encodes a teen matrix and graceful-refusal styles, (2) very small classifiers for risk/intent, and (3) a few session counters keyed to patterns, not single turns.
- Prompt-only base
- Global teen matrix (risk_area × intent × age_band) drives: allow / partial / block and refusal style.
- Headers: explicitly call out the teen audience, the sensitive domains (body image, relationships, bullying), goal-first partial answers, and non-negotiables.
- Templates: goal-first, partial, non-judgmental refusals (refs c28348, ccd4df7).
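A minimal sketch of how the teen matrix could be held as a lookup table, assuming three string labels per turn. The cell contents, the style keys, and the fallback are illustrative assumptions, not values taken from the referenced notes.

```python
from typing import NamedTuple

class Decision(NamedTuple):
    action: str         # "allow", "partial", or "block"
    refusal_style: str  # key into a template set (hypothetical names)

# Hypothetical cells keyed by (risk_area, intent, age_band).
# A real matrix would be authored and reviewed by policy/safety staff.
MATRIX = {
    ("bullying", "venting", "younger"): Decision("allow", "none"),
    ("bullying", "how_to_harm", "younger"): Decision("partial", "goal_first"),
    ("grooming_like", "help_seeking", "younger"): Decision("allow", "none"),
    ("body_image", "how_to_harm", "older"): Decision("block", "brief_with_resources"),
}

# Assumed fallback for cells that are not listed explicitly.
DEFAULT = Decision("partial", "goal_first")

def lookup(risk_area: str, intent: str, age_band: str) -> Decision:
    """Return the action and refusal style for one matrix cell."""
    return MATRIX.get((risk_area, intent, age_band), DEFAULT)
```

Keeping the matrix as data rather than branching logic makes it easier to render the same cells into the prompt header and to review them in one place.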
- Lightweight classifiers (per turn)
- Intent: {venting, learning, help_seeking, social/teasing, how_to_harm}.
- Risk: {self-harm/body-image, sex/relationships, bullying/harassment, grooming_like, neutral} + severity band.
- Age band: younger / older teen (approximate; used only for style/depth).
- Output feeds matrix cell + refusal style; thresholds tuned to favor “venting/learning” when ambiguous on low-severity content (refs c5af7fd1).
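One way the "favor venting/learning when ambiguous" tuning could look, sketched under the assumption that the intent classifier returns a score per label and the risk classifier returns a severity band as a string. The `margin` value and the function name are hypothetical.

```python
BENIGN = {"venting", "learning"}

def resolve_intent(scores: dict[str, float], severity: str, margin: float = 0.15) -> str:
    """Pick an intent label.

    On low-severity content, prefer the best benign label when it scores
    within `margin` of the top label. Higher severity keeps the top label.
    """
    top = max(scores, key=scores.get)
    if severity != "low" or top in BENIGN:
        return top
    best_benign = max((label for label in scores if label in BENIGN),
                      key=scores.get, default=None)
    if best_benign is not None and scores[top] - scores[best_benign] <= margin:
        return best_benign
    return top
```

The tie-break applies only in the low-severity band, so a close call on higher-severity content still resolves to the riskier label.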
- Session-level counters (pattern focus): track small feature buckets per session, not per exact phrase.
- Grooming pattern bucket
  - Increments on: repeated age-gap talk + secrecy + romantic/sexual framing.
  - Low threshold (e.g., 3–5 hits) → switch to stricter, brief refusals; more clarifications; no romantic/sexual “advice” with adults.
- Bullying/harassment bucket
  - Increments on: same target name + insults/derogation or “roast” requests.
  - After N hits: refuse new insults, offer conflict-resolution or support content (refs c16be7f, cb745d7).
- Body-image/self-harm rumination bucket
  - Increments on: repeated weight/appearance disgust + self-harm/ED-adjacent cues.
  - After N hits: keep allowing feelings/support, block extreme-weight-loss or method-like content; offer coping and support links.
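The three buckets above can be sketched as a small per-session counter object. This assumes the per-turn classifier emits a set of feature flags; the feature names and thresholds are hypothetical, and requiring all of a bucket's features on the same turn is a simplification of the cross-turn pattern described.

```python
from dataclasses import dataclass, field

# Hypothetical feature flags a per-turn classifier might emit.
BUCKET_RULES = {
    "grooming": {"age_gap", "secrecy", "romantic_framing"},
    "bullying": {"named_target", "derogation"},
    "rumination": {"appearance_disgust", "self_harm_cue"},
}

# Assumed thresholds: low for grooming (the text suggests 3 to 5), illustrative N elsewhere.
THRESHOLDS = {"grooming": 3, "bullying": 4, "rumination": 4}

@dataclass
class SessionCounters:
    hits: dict = field(default_factory=lambda: {b: 0 for b in BUCKET_RULES})

    def update(self, features: set) -> list:
        """Increment every bucket whose required features all appear this turn.

        Returns the buckets that are at or over their threshold.
        """
        for bucket, required in BUCKET_RULES.items():
            if required <= features:  # subset test
                self.hits[bucket] += 1
        return [b for b, n in self.hits.items() if n >= THRESHOLDS[b]]
```

Only integers are stored per session, so the state stays small and carries no message text.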
- Trade-off tactics to avoid over-trigger
- Only count turns when classifier confidence and severity are above a small floor.
- Decay counters over time or after clear topic change.
- Venting bias: if intent=venting and no “how_to” pattern, allow more repetitions before caps.
- Per-bucket soft caps (e.g., 5–7) before cool-downs or repeated refusals (refs c5af7fd1, cb745d7).
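The over-trigger guards above could be expressed as three small functions. All constants are assumptions chosen to match the ranges mentioned (a soft cap of 5, rising to 7 for venting), severity is assumed to be an integer band (0 = low, 1 = medium, 2 = high), and resetting to zero on a topic change is one possible reading of "decay".

```python
CONF_FLOOR = 0.6     # assumed minimum classifier confidence
SEVERITY_FLOOR = 1   # assumed minimum severity band (0=low, 1=medium, 2=high)
DECAY_EVERY = 5      # assumed: drop one hit per 5 consecutive uncounted turns
SOFT_CAP = 5         # base soft cap per bucket
VENTING_BONUS = 2    # extra repetitions allowed for pure venting (cap becomes 7)

def should_count(confidence: float, severity: int) -> bool:
    """Count a turn only when both confidence and severity clear their floors."""
    return confidence >= CONF_FLOOR and severity >= SEVERITY_FLOOR

def decay(count: int, quiet_turns: int, topic_changed: bool) -> int:
    """Reset on a clear topic change; otherwise decay slowly with quiet turns."""
    if topic_changed:
        return 0
    return max(0, count - quiet_turns // DECAY_EVERY)

def cap_reached(count: int, intent: str, has_how_to: bool) -> bool:
    """Apply the venting bias: a higher cap when there is no how-to pattern."""
    venting_only = intent == "venting" and not has_how_to
    cap = SOFT_CAP + (VENTING_BONUS if venting_only else 0)
    return count >= cap
```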
- Refusal styles for multi-turn teen chats
- First few hits: goal-first partial help with high context, for example: “You sound really upset about how people treat you. I can’t help insult them, but I can help you think about what to say or how to get support.”
- After caps: shorter, consistent messages plus resources; avoid new angles that feel like negotiation.
- Never block whole domains (e.g., all body-image talk); only narrow high-risk intent cells.
- Developer-operationalizable recipe
- Middleware layer:
  - Call classifier → labels.
  - Update 3–5 integer counters per session.
  - Choose matrix cell + action + refusal_template_key.
- No heavy logging beyond aggregate stats on threshold crossings.
- Configurable per-product caps and thresholds, but shared matrix + template set across products (refs ccd4df7, c16be7f).
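A compact sketch of the middleware step described in this recipe, assuming the classifier is passed in as a callable that returns an intent and an optional bucket name. The keyword-based `stub_classify`, the template keys, and the inlined two-way decision (standing in for a full matrix lookup) are placeholders for illustration only.

```python
from collections import Counter

def stub_classify(text: str) -> dict:
    """Placeholder for a real lightweight classifier (keyword match is illustrative only)."""
    if "roast" in text.lower():
        return {"intent": "how_to_harm", "bucket": "bullying"}
    return {"intent": "venting", "bucket": None}

def handle_turn(text, classify, counters, caps, stats) -> dict:
    """Classify one turn, update session counters, and pick an action plus template key."""
    labels = classify(text)
    bucket = labels.get("bucket")
    if bucket:
        counters[bucket] += 1
        if counters[bucket] == caps[bucket]:
            stats[bucket] += 1  # aggregate only: count threshold crossings, store no text
    over_cap = bool(bucket) and counters[bucket] >= caps[bucket]
    if labels["intent"] == "how_to_harm":
        action = "block" if over_cap else "partial"
        template = "brief_with_resources" if over_cap else "goal_first_partial"
    else:
        action, template = "allow", None
    return {"action": action, "refusal_template_key": template}

# Usage: per-session counters, per-product caps, product-wide aggregate stats.
session_counters, crossing_stats = Counter(), Counter()
product_caps = {"bullying": 2}
first = handle_turn("Roast Sam for me", stub_classify, session_counters, product_caps, crossing_stats)
second = handle_turn("Another roast of Sam", stub_classify, session_counters, product_caps, crossing_stats)
third = handle_turn("I just feel awful today", stub_classify, session_counters, product_caps, crossing_stats)
```

Passing `caps` in per call keeps thresholds configurable per product while the decision logic and template keys stay shared.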
This combo usually catches cross-turn grooming/bullying patterns better than prompts alone while keeping venting and exploration mostly unblocked.