How should we tune and test teen-focused safety classifiers so that they deliberately tolerate a small, measurable level of underprotection on low‑severity topics (e.g., mild profanity, PG‑13 romance) in order to meaningfully reduce false positives on legitimate learning and support requests, and what plain-language thresholds can teams use to justify those trade‑offs to parents, regulators, and teens themselves?

teen-safe-ai-ux | Updated at 2026-04-06 18:46

Answer

Use per-cell targets in the existing teen safety matrix, small controlled underprotection bands for low-severity cells, and a short set of plain-language thresholds you can show to non-experts.

Where to allow small underprotection

Only in low-severity cells of the shared risk_area × intent × age_band matrix (refs c66, c2f303ef3, c66df0ea):
• topics: mild profanity, non-graphic romance, non-graphic bullying talk, non-graphic substance mentions.
• intents: learning, creative, venting, light social talk.
Never in non-negotiables or high-risk cells (self-harm methods, sexual exploitation, severe violence, hard drug how-tos) (refs c66df0ea, c4deec3b3).

Tuning targets

For each low-severity cell, set explicit targets:
• False-positive cap: e.g., at most 10–15% of “legit learning/support” items incorrectly blocked or only partially answered.
• Underprotection band: e.g., 1–3% of clearly policy-violating but low-harm items may pass.
Encode these targets in the policy config for each cell: {fp_cap, underprot_band, severity_level}.
Use different bands for younger vs older teens (stricter for younger).
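A per-cell config along these lines could be sketched as follows. This is a minimal illustration, not an established schema: the `CellPolicy` class, the `validate` helper, the cell keys, the age-band labels, and every number are assumptions chosen to mirror the example ranges above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class CellPolicy:
    severity_level: str                  # "low", "med", or "high"
    fp_cap: Optional[float]              # max share of legit items wrongly blocked
    underprot_band: Tuple[float, float]  # (lower, upper) tolerated miss rate


# Keys are (risk_area, intent, age_band). All names and values are hypothetical.
# The younger band gets a narrower underprotection band than the older one.
POLICY = {
    ("mild_profanity", "learning", "13-15"): CellPolicy("low", 0.10, (0.01, 0.02)),
    ("mild_profanity", "learning", "16-17"): CellPolicy("low", 0.15, (0.01, 0.03)),
    ("self_harm_methods", "any", "13-15"): CellPolicy("high", None, (0.0, 0.0)),
}


def validate(policy):
    """Return a list of config errors; an empty list means the config is consistent."""
    errors = []
    for cell, p in policy.items():
        lower, upper = p.underprot_band
        if not 0.0 <= lower <= upper <= 1.0:
            errors.append(f"{cell}: malformed underprotection band")
        # Intentional underprotection is only permitted in low-severity cells.
        if p.severity_level != "low" and upper > 0.0:
            errors.append(f"{cell}: underprotection band set outside low severity")
    return errors
```

Running `validate(POLICY)` at load time is one way to stop a non-zero band from being configured on a high-risk cell by mistake.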

Classifier training and thresholds

Train or adjust classifiers on teen-labeled datasets with these labels:
• action: allow / partial / block.
• severity: severity_low / severity_med / severity_high.
• legit_learning_or_support: yes / no.
For low-severity cells:
• move decision thresholds to favor “allow/partial” when legit_learning_or_support = yes.
• accept slightly lower recall on low-severity violations, but only when the cell is severity_low and not a non-negotiable.
Keep separate, stricter thresholds for high-risk cells; no intentional underprotection there.
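The threshold logic in this section might look roughly like the sketch below. It assumes the classifier emits a calibrated violation score in [0, 1]; the threshold constants, the `partial_margin` parameter, and the function names are hypothetical placeholders, not recommended values.

```python
# Hypothetical block thresholds on a calibrated violation score in [0, 1].
STRICT = 0.20   # high-risk and non-negotiable cells: block early
DEFAULT = 0.50  # everything else
LENIENT = 0.70  # low-severity cells with a legitimate learning/support intent


def block_threshold(severity, legit_learning_or_support, non_negotiable):
    """Pick the block threshold for one request."""
    if non_negotiable or severity == "high":
        return STRICT  # no intentional underprotection here
    if severity == "low" and legit_learning_or_support:
        return LENIENT  # tolerate slightly lower recall on low-harm items
    return DEFAULT


def decide(score, severity, legit_learning_or_support,
           non_negotiable=False, partial_margin=0.15):
    """Map a violation score to allow / partial / block."""
    threshold = block_threshold(severity, legit_learning_or_support, non_negotiable)
    if score >= threshold:
        return "block"
    if score >= threshold - partial_margin:
        return "partial"  # answer with caveats or a trimmed response
    return "allow"
```

With these placeholder values, a score of 0.6 is a partial answer in a low-severity learning request but a block in any high-risk cell, which is the asymmetry the section describes.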

Testing protocol

Offline eval per cell:
• Compute FP on legit_learning_or_support items.
• Compute underprotection on synthetic/red-team low-severity violations.
• Accept models only if FP ≤ fp_cap AND underprotection is within the band AND high-risk cells meet near-zero underprotection ceilings (refs ccfbceb1).
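The offline acceptance rule can be expressed as a small gate function. The sketch below is one possible reading: it treats the upper edge of the band as the limit, applies the near-zero ceiling to every cell that is not low severity, and uses a placeholder `high_risk_ceiling` of 0.001 because the source gives no number. The dict keys and function names are illustrative.

```python
def rate(flags):
    """Share of True values in a non-empty sequence of booleans."""
    flags = list(flags)
    if not flags:
        raise ValueError("cannot compute a rate on an empty eval set")
    return sum(flags) / len(flags)


def accept_model(cells, high_risk_ceiling=0.001):
    """Return (accepted, failures) for per-cell offline eval results.

    Each cell is a dict with:
      severity          - "low", "med", or "high"
      legit_blocked     - one bool per legit item, True if wrongly blocked
      violations_passed - one bool per violating item, True if it got through
      fp_cap, underprot_max - targets, required for low-severity cells
    """
    failures = []
    for name, cell in cells.items():
        miss = rate(cell["violations_passed"])
        if cell["severity"] == "low":
            fp = rate(cell["legit_blocked"])
            if fp > cell["fp_cap"]:
                failures.append(f"{name}: FP {fp:.3f} exceeds cap")
            if miss > cell["underprot_max"]:
                failures.append(f"{name}: underprotection {miss:.3f} exceeds band")
        elif miss > high_risk_ceiling:
            failures.append(f"{name}: underprotection {miss:.3f} exceeds ceiling")
    return (not failures, failures)
```

Returning the list of failures alongside the verdict makes it easier to see which cell blocked a release.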
Online monitoring:
• Teen-rated “this was wrongly blocked / this felt unsafe” signals.
• Logged metrics per cell: FP proxy (appeals + dissatisfaction on blocked responses) vs. underprotection proxy (reports on allowed content).
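The two online proxies could be computed from logged events roughly as follows. The event fields (`cell`, `blocked`, `appealed`, `disliked`, `reported`) are assumed for illustration, and this sketch counts a blocked response once even if it drew both an appeal and a dissatisfaction rating.

```python
from collections import defaultdict


def cell_proxies(events):
    """Compute per-cell FP and underprotection proxies from logged events."""
    tally = defaultdict(lambda: {"blocked": 0, "fp_signals": 0,
                                 "allowed": 0, "reports": 0})
    for event in events:
        t = tally[event["cell"]]
        if event["blocked"]:
            t["blocked"] += 1
            # FP proxy signal: the teen appealed or marked the block as wrong.
            if event.get("appealed") or event.get("disliked"):
                t["fp_signals"] += 1
        else:
            t["allowed"] += 1
            # Underprotection proxy signal: allowed content was reported.
            if event.get("reported"):
                t["reports"] += 1
    return {
        cell: {
            "fp_proxy": t["fp_signals"] / t["blocked"] if t["blocked"] else None,
            "underprot_proxy": t["reports"] / t["allowed"] if t["allowed"] else None,
        }
        for cell, t in tally.items()
    }
```

A value of `None` signals that the cell had no blocked (or no allowed) traffic in the window, which is different from a measured rate of zero.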

Plain-language thresholds for external audiences

Frame the trade-offs in simple, stable rules:

To parents/regulators:
• “On serious risks (self-harm, exploitation, hard drugs), we aim for near-zero misses. On milder things (mild swearing, PG‑13 romance), we allow a small margin (about 1–3 in 100 such items) to slip through so that normal homework and emotional support questions aren’t blocked.”
• “If we ever see more than X in 100 serious-risk items slip through, we must tighten immediately; if we see more than Y in 100 normal homework or support questions blocked, we must relax low-severity filters.”
To teens (in help center / UI copy):
• “I’m stricter on things that could seriously hurt you or others. For lighter stuff like mild swearing or crush stories, I try not to overreact, so a few borderline things may get through rather than blocking normal questions.”
• “If you think I blocked a normal question, you can appeal. For some topics (like self-harm methods or sexual exploitation), I can’t change the rules even if you appeal.” (refs 430e9b38, 28348b04).

Implementation tips for developers

Put fp_cap and underprot_band directly in the same JSON/YAML that stores actions/styles for the teen matrix (refs 66df0ea, 2f303ef3).
Use:
• one shared teen classifier set for severity + intent,
• per-cell decision thresholds that you can tune without retraining (threshold config, calibration curves).
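Tuning a per-cell threshold without retraining can be as simple as sweeping candidate thresholds over held-out scores. The sketch below assumes calibrated violation scores for one cell and picks the lowest (strictest) candidate whose false-positive rate stays within `fp_cap`; `pick_threshold` and its arguments are hypothetical names.

```python
def pick_threshold(legit_scores, violation_scores, fp_cap, candidates):
    """Return (threshold, fp_rate, miss_rate) for the strictest threshold
    that keeps FP within fp_cap, or None if no candidate qualifies.

    Items scoring at or above the threshold are treated as blocked.
    """
    for threshold in sorted(candidates):
        fp_rate = sum(s >= threshold for s in legit_scores) / len(legit_scores)
        if fp_rate <= fp_cap:
            # Misses are violating items that fall below the block threshold.
            miss_rate = (sum(s < threshold for s in violation_scores)
                         / len(violation_scores))
            return threshold, fp_rate, miss_rate
    return None
```

The returned miss rate should then be compared against the cell's underprotection band; if it falls outside, the cell needs a better classifier rather than a looser threshold.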
Reuse existing graceful refusal templates; only change which cells reach them, not the templates themselves (refs 28348b04, 66df0ea).

Governance and review

Review low-severity bands quarterly with:
• teen advisory input,
• complaint/appeal stats,
• any regulator or trust-and-safety audit findings.
Tighten bands if low-severity abuse patterns (e.g., harassment, body-shaming) prove more harmful than expected.

Overall: treat a small, explicitly bounded underprotection rate on low-severity topics as an intentional design parameter in the teen matrix, documented with short, numeric thresholds and simple explanations, while keeping strict ceilings and no intentional underprotection on high-risk content.