For prompt-based teen safety policies that already use a shared risk_area × intent × age_band matrix, what concrete evaluation protocol (metrics, test sets, and A/B-able UX probes) best helps developers tune the trade-off between false positives and underprotection for specific high-risk slices (like self-harm and sexual content) without overfitting to aggregate satisfaction scores?
teen-safe-ai-ux | Updated at
Answer
Use a slice-first eval protocol tied to the matrix, with per-slice metrics, targeted test sets, and small UX A/Bs.
- Metrics (all slice- and age-band–scoped)
- Safety: • FP_block_rate: % of clearly safe, beneficial queries in a (risk_area, intent, age_band) that get blocked/over-redacted. • underprot_rate: % of clearly unsafe outputs for that slice (breaching non-negotiables or policy cell). • near_miss_rate: % of outputs that are allowed but hit “borderline” tags (e.g., method hints, explicitness above target).
- UX: • helpful_partial_rate: % of blocked/partial answers in a slice that teens/raters tag as “still helpful.” • clarify_success_rate: % of ambiguous-slice turns where a clarifying question leads to a safe, allowed answer.
- Ops: • slice_vol_share: share of traffic in each high-risk slice. • slice_regression_flag: binary per slice if underprot_rate or FP_block_rate moves beyond pre-set bounds.
- Test sets
- Policy-matrix–aligned eval sets: • For each high-risk slice (e.g., self-harm × help-seeking × 13–15): – 50–200 gold prompts labeled with: ground-truth risk_area, intent, age_band, and desired cell action. – Separate subsets for: clearly safe-beneficial, clearly unsafe, and ambiguous. • Include paired prompts where semantics are similar but intent differs (help-seeking vs how-to; sex-ed vs pornographic curiosity).
- Teen-rated UX set: • Small, regularly refreshed set where teens or youth experts rate: perceived judgment, clarity, helpfulness of refusals/partials.
- Abuse/red-team set: • Prompts optimized to break the target slice (e.g., coded language for self-harm methods, kink-coded but teen-directed sex content).
- A/B-able UX probes
- Refusal variants within a slice: • A vs B: short, firm refusal vs more explanatory, resource-rich refusal; measure helpful_partial_rate and re-prompt behavior.
- Clarification patterns: • A: immediate partial answer with soft safety note. • B: 1–2 clarifying questions, then answer; compare clarify_success_rate, drop-off, and underprot_rate.
- Partial-depth toggles: • A: more conservative redaction (higher FP, lower underprotection). • B: slightly deeper detail but stricter phrasing; compare underprot_rate and teen-rated usefulness.
- Protocol to tune FP vs underprotection
- Step 1: Define target bands per slice (e.g., self-harm help-seeking 13–15: underprot_rate < 0.5%, FP_block_rate < 15%).
- Step 2: Run offline on matrix-aligned test sets for each slice; log all metrics by cell.
- Step 3: For slices outside target bands: • if underprot_rate too high: tighten cell action (allow→partial, partial→block) or refusal style; re-test. • if FP_block_rate too high and underprot_rate low: relax within that cell (more partials, fewer blocks) and add clarifications.
- Step 4: Ship narrow A/Bs only on slices where offline metrics look safe; monitor slice_regression_flag using live slice_vol_share.
- Step 5: Block changes that improve global satisfaction but worsen any high-risk slice beyond its target band.
- Guardrails against overfitting to aggregate satisfaction
- Always compute and review metrics per (risk_area, intent, age_band); never rely on global averages.
- Treat high-risk slices (self-harm, sexual content, severe violence) as gatekeepers: • policy change cannot ship if any such slice regresses on underprot_rate beyond allowed band, regardless of global satisfaction.
- Keep a stable, versioned slice eval set; add but never drop earlier hard cases so regressions are visible over time.