How should we tune and test teen-focused safety classifiers so that they deliberately tolerate a small, measurable level of underprotection on low‑severity topics (e.g., mild profanity, PG‑13 romance) in order to meaningfully reduce false positives on legitimate learning and support requests, and what plain-language thresholds can teams use to justify those trade‑offs to parents, regulators, and teens themselves?


Answer

Use per-cell targets in the existing teen safety matrix, small controlled underprotection bands for low-severity cells, and a short set of plain-language thresholds you can show to non-experts.

  1. Where to allow small underprotection
  • Only in low-severity cells of the shared risk_area × intent × age_band matrix (refs c66, c2f303ef3, c66df0ea):
    • mild profanity, non-graphic romance, non-graphic bullying talk, non-graphic substance mentions.
    • intents: learning, creative, venting, light social talk.
  • Never in non-negotiables or high-risk cells (self-harm methods, sexual exploitation, severe violence, hard drug how-tos) (refs c66df0ea, c4deec3b3).
  1. Tuning targets
  • For each low-severity cell, set explicit targets:
    • False-positive cap: e.g., a cap set somewhere in the 10–15% range on “legit learning/support” items that are incorrectly blocked or only partially answered.
    • Underprotection band: e.g., 1–3% of clearly policy-violating but low-harm items may pass.
  • Encode these targets in the per-cell policy config as {fp_cap, underprot_band, severity_level}.
  • Use different bands for younger vs older teens (stricter for younger).
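
The per-cell targets above could be encoded roughly as follows. This is a minimal sketch, not a reference schema: the field names fp_cap, underprot_band, and severity_level come from the answer, while the CellPolicy class, the cell keys, the age bands, and every number are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellPolicy:
    """Targets for one (risk_area, intent, age_band) cell. Hypothetical schema."""
    severity_level: str                  # "low", "med", or "high"
    fp_cap: float                        # max share of legit items blocked or partially answered
    underprot_band: tuple[float, float]  # (low, high) share of low-harm violations that may pass

    def __post_init__(self):
        low, high = self.underprot_band
        if not 0.0 <= low <= high <= 1.0:
            raise ValueError("underprot_band must satisfy 0 <= low <= high <= 1")
        # Intentional underprotection is reserved for low-severity cells.
        if self.severity_level != "low" and high > 0.0:
            raise ValueError("only low-severity cells may have a non-zero band")


# Placeholder cells and numbers, chosen only to show the shape of the config.
POLICY = {
    ("mild_profanity", "learning", "13-15"): CellPolicy("low", 0.10, (0.0, 0.01)),
    ("mild_profanity", "learning", "16-17"): CellPolicy("low", 0.10, (0.0, 0.03)),
    ("self_harm_methods", "any", "13-17"): CellPolicy("high", 0.15, (0.0, 0.0)),
}
```

Validating the band at load time means a config change cannot quietly introduce intentional underprotection in a high-risk cell. The younger age band gets the narrower band, matching the "stricter for younger" rule.
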
  1. Classifier training and thresholds
  • Train or adjust classifiers on teen-labeled datasets that carry three labels per item:
    • action: allowed / partial / block,
    • severity: severity_low / severity_med / severity_high,
    • legit_learning_or_support: yes / no.
  • For low-severity cells:
    • move decision thresholds to favor “allow/partial” when legit_learning_or_support = yes.
    • accept slightly lower recall on low-severity violations, provided the item is severity_low and not a non-negotiable.
  • Keep separate, stricter thresholds for high-risk cells; no intentional underprotection there.
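
One way to picture the threshold shift is a small decision function. This is a sketch under stated assumptions: the classifier is assumed to emit a violation score in [0, 1], and the Thresholds class, the decide function, and the legit_shift parameter are hypothetical names, not part of any existing system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    block_at: float    # violation score at or above this -> "block"
    partial_at: float  # score at or above this, but below block_at -> "partial"


def decide(score: float, severity: str, legit: bool, non_negotiable: bool,
           base: Thresholds, legit_shift: float = 0.1) -> str:
    """Map a violation score in [0, 1] to "allow", "partial", or "block"."""
    block_at, partial_at = base.block_at, base.partial_at
    # Relax thresholds only for low-severity, legit learning/support requests.
    if severity == "low" and legit and not non_negotiable:
        block_at = min(1.0, block_at + legit_shift)
        partial_at = min(block_at, partial_at + legit_shift)
    if score >= block_at:
        return "block"
    if score >= partial_at:
        return "partial"
    return "allow"
```

For example, with Thresholds(block_at=0.7, partial_at=0.4), a low-severity legit request scoring 0.75 is downgraded from "block" to "partial", while the same score in a high-severity cell still blocks.
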
  1. Testing protocol
  • Offline eval per cell:
    • Compute the false-positive rate on legit_learning_or_support items.
    • Compute the underprotection rate on synthetic/red-team low-severity violations.
    • Accept a model only if FP ≤ fp_cap AND underprotection is within the band AND high-risk cells meet near-zero underprotection ceilings (refs ccfbceb1).
  • Online monitoring:
    • Teen-rated “this was wrongly blocked / this felt unsafe” signals.
    • Logged metrics per cell: an FP proxy (appeals + dissatisfaction on blocked content) vs. an underprotection proxy (reports on allowed content).
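
The offline acceptance rule can be sketched as a release gate. The CellResult record, the accept_model function, and the example numbers are assumptions for illustration. For simplicity the sketch uses one fp_cap and one underprot_max (the upper edge of the band) for all low-severity cells; a real setup would read these per cell from the policy config.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellResult:
    severity: str            # "low" or anything stricter (illustrative labels)
    legit_total: int         # legit learning/support items evaluated
    legit_restricted: int    # of those, blocked or only partially answered
    violations_total: int    # synthetic/red-team violations evaluated
    violations_allowed: int  # of those, allowed through


def rate(hits: int, total: int) -> float:
    if total <= 0:
        raise ValueError("cell has no evaluation items")
    return hits / total


def accept_model(results: dict[str, CellResult], fp_cap: float,
                 underprot_max: float, high_risk_ceiling: float) -> tuple[bool, list[str]]:
    """Return (accepted, failure_reasons) across all evaluated cells."""
    failures = []
    for cell, r in results.items():
        under = rate(r.violations_allowed, r.violations_total)
        if r.severity == "low":
            fp = rate(r.legit_restricted, r.legit_total)
            if fp > fp_cap:
                failures.append(f"{cell}: FP {fp:.1%} above cap {fp_cap:.1%}")
            if under > underprot_max:
                failures.append(f"{cell}: underprotection {under:.1%} above band")
        elif under > high_risk_ceiling:
            # Any cell that is not low severity is held to the strict ceiling.
            failures.append(f"{cell}: underprotection {under:.2%} above ceiling")
    return (not failures, failures)
```

Returning the list of reasons alongside the verdict makes a rejected model easier to explain in review. An empty cell raises an error instead of passing silently.
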
  1. Plain-language thresholds for external audiences
  Frame the trade-offs in simple, stable rules:
  • To parents/regulators:
    • “On serious risks (self-harm, exploitation, hard drugs), we aim for near-zero misses. On milder things (mild swearing, PG‑13 romance), we allow a small margin, about 1–3 in 100 such items, to slip through so that normal homework and emotional support questions aren’t blocked.”
    • “If we ever see more than X in 100 serious-risk items slip through, we must tighten immediately; if we see more than Y in 100 normal homework or support questions blocked, we must relax low-severity filters.”
  • To teens (in help center / UI copy):
    • “I’m stricter on things that could seriously hurt you or others. For lighter stuff like mild swearing or crush stories, I try not to overreact, so a few borderline things may get through rather than blocking normal questions.”
    • “If you think I blocked a normal question, you can appeal. For some topics (like self-harm methods or sexual exploitation), I can’t change the rules even if you appeal.” (refs 430e9b38, 28348b04).
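
The "X in 100" and "Y in 100" rules can be checked mechanically against measured rates. In this sketch the function names and action labels are hypothetical, and X and Y stay as parameters because the answer does not fix their values.

```python
def in_plain_words(rate: float) -> str:
    """Render a rate such as 0.02 as an "N in 100" phrase for non-experts."""
    per_hundred = rate * 100
    if 0 < per_hundred < 1:
        return "fewer than 1 in 100"  # avoid rounding a real miss rate down to zero
    return f"about {round(per_hundred)} in 100"


def required_actions(serious_miss_rate: float, legit_block_rate: float,
                     x_per_100: float, y_per_100: float) -> list[str]:
    """Apply the two published triggers; X and Y are supplied by the team."""
    actions = []
    if serious_miss_rate * 100 > x_per_100:
        actions.append("tighten_high_risk")
    if legit_block_rate * 100 > y_per_100:
        actions.append("relax_low_severity")
    return actions
```

Using the same numbers in the public wording and in the internal trigger keeps the external promise and the release process from drifting apart.
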
  1. Implementation tips for developers
  • Put fp_cap and underprot_band directly in the same JSON/YAML that stores actions/styles for the teen matrix (refs 66df0ea, 2f303ef3).
  • Use:
    • one shared teen classifier set for severity + intent,
    • per-cell decision thresholds that you can tune without retraining (threshold config, calibration curves).
  • Reuse existing graceful refusal templates; only change which cells reach them, not the templates themselves (refs 28348b04, 66df0ea).
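
Tuning a cell's threshold without retraining can be as simple as scanning held-out scores. This is a minimal sketch assuming each item has a violation score in [0, 1] and that items at or above the threshold are restricted. The pick_threshold function and the sample scores are illustrative, and it works on raw scores, not a fitted calibration curve.

```python
def pick_threshold(violation_scores: list[float], legit_scores: list[float],
                   underprot_max: float) -> tuple[float, float, float]:
    """Return (threshold, underprotection, fp) for one cell.

    Picks the highest candidate threshold whose underprotection rate stays
    within underprot_max, which also gives the lowest false-positive rate
    among the candidates that respect the band.
    """
    if not violation_scores or not legit_scores:
        raise ValueError("both score lists must be non-empty")
    best = None
    for t in sorted(set(violation_scores) | set(legit_scores)):
        # Violations scoring below the threshold slip through.
        under = sum(s < t for s in violation_scores) / len(violation_scores)
        if under > underprot_max:
            break  # underprotection only grows as the threshold rises
        fp = sum(s >= t for s in legit_scores) / len(legit_scores)
        best = (t, under, fp)
    return best


violations = [0.9, 0.8, 0.7, 0.6, 0.2]
legit = [0.1, 0.3, 0.5, 0.65]
print(pick_threshold(violations, legit, underprot_max=0.2))  # (0.6, 0.2, 0.25)
```

In this toy data, tolerating one missed violation in five drops the false-positive rate from 0.75 (at a zero-miss threshold of 0.2) to 0.25, which is the trade-off the bands are meant to make explicit.
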
  1. Governance and review
  • Review low-severity bands quarterly with:
    • teen advisory input,
    • complaint/appeal stats,
    • any regulator or trust-safety audit findings.
  • Tighten bands if low-severity abuse patterns (e.g., harassment, body-shaming) turn out to be more harmful than expected.

Overall: treat a small, explicitly bounded underprotection rate on low-severity topics as an intentional design parameter in the teen matrix, documented with short, numeric thresholds and simple explanations, while keeping strict ceilings and no intentional underprotection on high-risk content.