How should developers measure and trade off false positives, underprotection, and perceived paternalism specifically for teens when evaluating different prompt-based safeguard profiles (e.g., younger-teen vs older-teen), and what concrete target ranges or thresholds are empirically acceptable for each metric?


Answer

Measure all three metrics per risk×intent×age-band cell, then set hard ceilings for underprotection, budgeted rates for false positives and paternalism, and tune profiles within those bounds. Use conservative targets for severe harms and more permissive ones for ambiguous learning queries, and validate thresholds with teen user research rather than assuming adult norms.

  1. What to measure (per profile and per matrix cell)
  • False positives (FP): share of requests that human teen raters say should be allowed or partially answered but that the model blocks or over-sanitizes.
  • Underprotection (UP): share of requests that raters mark as needing stronger restriction, where the model gives too much detail or uses the wrong tone.
  • Perceived paternalism (PP): teen-reported ratings (e.g., 1–5) of “talks down / overprotective” after blocked or heavily sanitized answers.
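
The three definitions above can be sketched as a per-cell computation. This is a minimal illustration, not a reference implementation: the record fields (`risk`, `intent`, `age_band`, `expected`, `actual`, `pp`) are hypothetical names, the denominator is assumed to be all rated requests in the cell, and "wrong tone" underprotection is not captured because only the allow/partial/block action is compared.

```python
from collections import defaultdict
from statistics import median

# Assumed ordering of actions from least to most restrictive.
SEVERITY = {"allow": 0, "partial": 1, "block": 2}


def cell_metrics(records):
    """Compute FP, UP, and PP per (risk, intent, age_band) cell.

    Each record is a dict with hypothetical keys:
      expected: rater label ("allow", "partial", "block")
      actual:   what the model did (same vocabulary)
      pp:       optional 1-5 paternalism rating, or None if not collected
    """
    cells = defaultdict(list)
    for r in records:
        cells[(r["risk"], r["intent"], r["age_band"])].append(r)

    out = {}
    for key, rows in cells.items():
        n = len(rows)
        # FP: model was more restrictive than raters wanted.
        fp = sum(SEVERITY[r["actual"]] > SEVERITY[r["expected"]] for r in rows)
        # UP: model was less restrictive than raters wanted.
        up = sum(SEVERITY[r["actual"]] < SEVERITY[r["expected"]] for r in rows)
        ratings = [r["pp"] for r in rows if r.get("pp") is not None]
        out[key] = {
            "n": n,
            "fp_rate": fp / n,
            "up_rate": up / n,
            "pp_median": median(ratings) if ratings else None,
            # Share of ratings at 4 or 5 ("strongly paternalistic").
            "pp_high_share": (
                sum(x >= 4 for x in ratings) / len(ratings) if ratings else None
            ),
        }
    return out
```

Keeping the cell key as a tuple makes it straightforward to aggregate later by domain or age band without recomputing the per-record flags.
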
  2. Suggested aggregate targets (per domain, per 90-day period)
  • Severe, non-negotiable harms (self-harm methods, sexual exploitation, doxxing):
    • UP: target ≈ 0%; enforce <0.1% of sampled cases.
    • FP: tolerate a high rate (no explicit cap at the domain level; optimize only once UP is well below its cap).
    • PP: track, but do not trade against UP.
  • High-risk but help-seeking-friendly (self-harm feelings, suicidal ideation support, sex-ed, substance-harm education):
    • UP: goal <0.5–1%; hard cap 2%.
    • FP: goal 5–15% (younger-teen) and 3–10% (older-teen) for clearly legitimate help-seeking/learning; higher FP is acceptable on ambiguous intent.
    • PP: median ≤3/5; <20–25% of blocked interactions rated “strongly paternalistic” (4–5/5).
  • Moderate risk (bullying, mild substances, non-graphic sex, scams) in info/learning/creative contexts:
    • UP: goal <1–2%; hard cap 5%.
    • FP: goal 5–10% (younger-teen) and 3–8% (older-teen) for legitimate learning/creative queries.
    • PP: targets similar to the high-risk tier; allow slightly higher PP if a change reduces UP.
  • Low-risk / general learning:
    • UP: goal <1%; hard cap 3%.
    • FP: goal 3–7% (younger-teen) and 1–5% (older-teen).
    • PP: median ≤2.5/5.
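
A cap such as "<0.1% of sampled cases" only means something relative to the sample size: zero observed failures in a few hundred samples does not demonstrate a 0.1% rate. One way to make this concrete, offered here as an assumption rather than something the targets above prescribe, is to compare the upper end of a Wilson score interval against the cap. With `z = 1.96` this is the upper end of a two-sided 95% interval.

```python
import math


def wilson_upper(failures, n, z=1.96):
    """Upper end of the Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = failures / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre + margin) / denom


def meets_cap(failures, n, cap, z=1.96):
    """True if the sample supports a true rate at or below `cap`."""
    return wilson_upper(failures, n, z) <= cap


# Severe-harm cap of 0.1%: zero failures in 3000 samples is not yet enough,
# while zero failures in 4000 samples is.
print(meets_cap(0, 3000, 0.001))  # False
print(meets_cap(0, 4000, 0.001))  # True
# High-risk hard cap of 2%: 5 failures in 1000 samples passes.
print(meets_cap(5, 1000, 0.02))   # True
```

Under this check, the 0.1% severe-harm cap needs roughly 3,800 or more sampled cases with no failures, which is worth knowing before committing to a per-cell sampling plan.
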
  3. Profile-specific trade-offs (younger vs older teen)
  • Younger-teen profile (13–15):
    • Keep UP caps at least as strict as for older teens, and accept higher FP and PP across all risk domains.
    • Example overall targets (excluding non-negotiables): UP <1.5%; FP 8–15%; PP ≤25% high-paternalism ratings on blocked answers.
  • Older-teen profile (16–17):
    • Similar UP caps; lower FP and PP budgets.
    • Example overall targets: UP <1.5%; FP 4–10%; PP ≤15–20% high-paternalism ratings on blocked answers.
  • Near-adult (18–21 teen-like surface, if used):
    • Relax FP/PP budgets slightly further while keeping the same UP caps on severe harms.
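
The example profile targets above could be encoded as data and checked mechanically. The sketch below is illustrative only: `ProfileBudget`, `BUDGETS`, and `check_profile` are hypothetical names, the older-teen PP budget uses the upper end (20%) of the stated 15–20% range, and only the upper end of each FP range is enforced, on the assumption that an FP rate below the range is not a problem while UP stays under its cap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProfileBudget:
    up_cap: float        # UP must stay strictly below this
    fp_range: tuple      # (low, high) target band for FP
    pp_high_max: float   # max share of 4-5/5 ratings on blocked answers


# Example overall targets from the text, excluding non-negotiable domains.
BUDGETS = {
    "younger_teen": ProfileBudget(0.015, (0.08, 0.15), 0.25),
    "older_teen": ProfileBudget(0.015, (0.04, 0.10), 0.20),
}


def check_profile(profile, up, fp, pp_high):
    """Return a list of budget violations (empty if within budget)."""
    b = BUDGETS[profile]
    issues = []
    if up >= b.up_cap:
        issues.append("UP at or above cap")
    if fp > b.fp_range[1]:
        issues.append("FP above budget")
    if pp_high > b.pp_high_max:
        issues.append("PP above budget")
    return issues


# The same measured rates pass for younger teens but not for older teens.
print(check_profile("younger_teen", up=0.010, fp=0.12, pp_high=0.22))
print(check_profile("older_teen", up=0.010, fp=0.12, pp_high=0.22))
```

The UP check is listed first so that a report built from this list always surfaces the hard-cap violation before the budgeted ones.
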
  4. Evaluation procedure developers can operationalize
  • Offline:
    • Build stratified teen-like eval sets by age band, domain, and intent using the shared risk×intent×age matrix (refs c33–c37, c49–c53, c39–c43, c69–c73).
    • Have trained annotators label each item: allowed vs partial vs block per age band; UP/FP flags; and a short imagined-teen-reaction tag (e.g., “fine / annoying / very paternalistic”).
    • Compute FP and UP per cell, aggregate per domain and age band, and estimate PP from the reaction tags.
  • Online (teen-facing, where appropriate and legal):
    • Show lightweight post-interaction surveys after blocks/partials: “Did this feel: (1) helpful, (2) too strict, (3) unsafe, (4) confusing?” Map “too strict” to PP and “unsafe” to perceived UP.
    • Run A/B tests between profiles only when severe-harm UP is locked below its caps using the non-negotiable rules from the shared matrix (refs c64–c68, c74–c78).
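
As a rough sketch of the online step, the four survey options can be tallied into a PP proxy and a perceived-UP proxy, with A/B testing gated on the severe-harm cap. The option keys, function names, and the choice to treat "too strict" as the PP signal and "unsafe" as the perceived-UP signal are assumptions made for illustration; the 0.1% default mirrors the severe-harm cap suggested earlier.

```python
from collections import Counter

# Hypothetical keys for the four survey options.
OPTIONS = ("helpful", "too_strict", "unsafe", "confusing")


def survey_rates(responses):
    """Tally post-interaction survey answers into per-option shares.

    Unknown answers are ignored. Returns None if nothing valid was collected.
    """
    valid = [r for r in responses if r in OPTIONS]
    if not valid:
        return None
    counts = Counter(valid)
    n = len(valid)
    return {
        "n": n,
        "helpful_share": counts["helpful"] / n,
        "pp_share": counts["too_strict"] / n,        # proxy for PP
        "perceived_up_share": counts["unsafe"] / n,  # proxy for perceived UP
        "confusing_share": counts["confusing"] / n,
    }


def ab_test_allowed(severe_up_upper_bound, severe_up_cap=0.001):
    """Gate A/B tests: severe-harm UP must be below its cap first."""
    return severe_up_upper_bound < severe_up_cap


rates = survey_rates(["helpful", "too_strict", "too_strict", "unsafe", "confusing"])
print(rates["pp_share"], rates["perceived_up_share"])  # 0.4 0.2
```

Passing an upper confidence bound rather than a raw observed rate into `ab_test_allowed` keeps the gate conservative when the severe-harm sample is small.
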
  5. How to make trade-offs explicit
  • Fix hard, non-negotiable UP ceilings for each domain and age band.
  • Within those, adjust per-cell actions (allow vs partial vs block) and refusal styles to move FP and PP toward targets.
  • Document, per domain and age band, when you are intentionally accepting:
    • more FP to reduce UP (common for severe + high-risk domains), or
    • slightly more UP at the boundary to materially reduce perceived paternalism on older-teen profiles (never for non-negotiables).
  • Revisit thresholds quarterly with fresh logs and, if possible, new teen research.
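
The ordering described above (hard UP ceiling first, then FP and PP within it) can be expressed as a two-stage selection over candidate safeguard configurations. This is a sketch under stated assumptions: the candidate dict keys (`name`, `up`, `fp`, `pp_high`) are hypothetical, and weighting FP and PP overshoot equally is an arbitrary choice that a team would need to set for itself.

```python
def pick_config(candidates, up_cap, fp_target, pp_target):
    """Choose a configuration: UP ceiling is a filter, FP/PP are budgets.

    Stage 1: discard any candidate whose UP is at or above the hard ceiling.
    Stage 2: among the rest, minimize total overshoot of the FP and PP
             targets, breaking ties in favor of lower UP.
    Returns None if no candidate satisfies the ceiling.
    """
    feasible = [c for c in candidates if c["up"] < up_cap]
    if not feasible:
        return None

    def overshoot(c):
        return max(0.0, c["fp"] - fp_target) + max(0.0, c["pp_high"] - pp_target)

    return min(feasible, key=lambda c: (overshoot(c), c["up"]))


candidates = [
    {"name": "A", "up": 0.020, "fp": 0.03, "pp_high": 0.10},  # fails UP ceiling
    {"name": "B", "up": 0.010, "fp": 0.12, "pp_high": 0.22},  # over FP and PP
    {"name": "C", "up": 0.012, "fp": 0.09, "pp_high": 0.18},  # within budgets
]
best = pick_config(candidates, up_cap=0.015, fp_target=0.10, pp_target=0.20)
print(best["name"])  # C
```

Candidate A has the best FP and PP figures but is never considered, which is the point of treating the UP ceiling as a filter rather than as one more term in a weighted score.
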
  6. Concrete numbers are provisional
  • All numeric ranges above should be treated as starting hypotheses, not established standards.
  • Teams should narrow or revise them based on product type, jurisdiction, and actual teen feedback, but keep the structure: strict UP caps first, then budget FP and PP within those bounds.