When prompt-based teen safety policies rely on upstream classifiers for risk-area and intent labels, what concrete fallback rules and interface patterns best manage classifier uncertainty or disagreement (e.g., mixed help-seeking and harassment signals) while keeping both underprotection and frustrating overblocking low?
teen-safe-ai-ux | Updated at
Answer
Use simple, explicit fallbacks tied to the existing risk_area × intent × age_band matrix and a few standard UI patterns.
- Policy fallbacks for classifier uncertainty
- Confidence bands: • high_conf: use normal matrix cell. • mid_conf or conflicting_intents: treat as "ambiguous" cell. • low_conf: treat as "unknown" cell.
- Ambiguous cell rules (e.g., help-seeking + harassment): • never relax non-negotiables. • downgrade the action one step toward the safer option (allow→partial, partial→block), but only for the risky dimension (e.g., withhold targeting or methods). • prefer peer-safe content: general coping, bystander guidance, and anti-bullying norms, not personalized attacks.
- Unknown cell rules: • avoid hard block by default; use short, generic high-level info and a clarifying step where safe. • only hard block when risk_area has severe non-negotiables (self-harm methods, sexual exploitation, doxxing).
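The fallback rules above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Action` enum, the band names as strings, and the `SEVERE_NON_NEGOTIABLE_AREAS` set are assumptions made for the example. The sketch also simplifies one point: it downgrades the whole action, whereas the rule above applies the downgrade only to the risky dimension of the request.

```python
from enum import IntEnum


class Action(IntEnum):
    """Hypothetical action scale, ordered from least to most restrictive."""
    ALLOW = 0
    PARTIAL = 1
    BLOCK = 2


# Assumed identifiers for risk areas that carry severe non-negotiables.
SEVERE_NON_NEGOTIABLE_AREAS = frozenset(
    {"self_harm_methods", "sexual_exploitation", "doxxing"}
)


def downgrade(action: Action) -> Action:
    """Move one step toward safer: allow -> partial, partial -> block."""
    return Action(min(action + 1, Action.BLOCK))


def fallback_action(band: str, matrix_action: Action, risk_area: str) -> Action:
    """Pick an action from the confidence band and the normal matrix action."""
    if band == "high_conf":
        return matrix_action  # use the normal matrix cell
    if band == "ambiguous":
        return downgrade(matrix_action)  # one step safer, never looser
    # Unknown cell: avoid a hard block unless the area is severe.
    if risk_area in SEVERE_NON_NEGOTIABLE_AREAS:
        return Action.BLOCK
    return Action.PARTIAL  # short generic info plus a clarifying step


print(fallback_action("ambiguous", Action.ALLOW, "bullying").name)
print(fallback_action("unknown", Action.ALLOW, "doxxing").name)
```

Because `downgrade` never moves past `BLOCK`, a cell that is already blocked stays blocked, which keeps the ambiguous path from relaxing anything.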
- Disagreement resolution between risk and intent labels
- Mixed help-seeking/harassment for bullying: • allow emotional validation and de-escalation tips. • block direct insults, targets, or revenge tactics.
- Mixed curiosity/rule-evasion for sex, substances, hacking: • allow harm education and legal norms. • block how-to, evasion tactics.
- Rule: when classifiers disagree, answer according to the most prosocial compatible intent (help-seeking, learning) while applying the strictest relevant risk filter.
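One way to read the disagreement rule is "answer as the most prosocial candidate intent, but block the union of everything any candidate intent would block". The sketch below assumes that reading. The `PROSOCIAL_RANK` ordering and the label names are hypothetical; the blocked categories are taken from the two mixed-signal examples above.

```python
from typing import Iterable, Set, Tuple

# Assumed ordering: lower rank means more prosocial.
PROSOCIAL_RANK = {
    "help_seeking": 0,
    "learning": 1,
    "curiosity": 2,
    "rule_evasion": 3,
    "harassment": 4,
}

# Content categories withheld whenever an intent is among the candidates.
BLOCKED_BY_INTENT = {
    "help_seeking": set(),
    "learning": set(),
    "curiosity": set(),
    "rule_evasion": {"how_to", "evasion_tactics"},
    "harassment": {"direct_insults", "targets", "revenge_tactics"},
}


def resolve_disagreement(candidates: Iterable[str]) -> Tuple[str, Set[str]]:
    """Return (intent to answer as, content categories to block).

    Raises ValueError if no candidate intents are supplied.
    """
    intents = list(candidates)
    answer_as = min(intents, key=PROSOCIAL_RANK.__getitem__)
    # Strictest filter here means the union of every candidate's blocks.
    blocked = set().union(*(BLOCKED_BY_INTENT[i] for i in intents))
    return answer_as, blocked


intent, blocked = resolve_disagreement(["harassment", "help_seeking"])
print(intent, sorted(blocked))
```

Taking the union means adding a more benign candidate intent can never remove a block that a riskier candidate introduced.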
- Interface patterns for ambiguity
- Clarify-then-answer: • Ask one short question when intent_conf or classifier agreement is low: "Are you looking for support for yourself, or advice on how to get back at someone?" • Map the user's answer to an updated intent and re-evaluate the cell.
- Dual-path responses for mixed signals: • Acknowledge both possibilities: "This could be about getting support or about hurting someone. I’ll stay on the side of support." • Provide support-focused content only.
- Stable graceful refusal templates: • For high-risk ambiguity: "I can’t help with targeting or hurting anyone, but I can suggest ways to stay safe and get support."
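The clarify-then-answer pattern can be wired as two small functions: one that decides whether to ask, and one that maps the reply back to an intent label before the cell is re-evaluated. The following is a rough sketch under stated assumptions: the quick-reply option ids, the `conf_threshold` default, and the function names are invented for illustration, and the two strings reuse the wording quoted above.

```python
from typing import Optional

CLARIFY_QUESTION = (
    "Are you looking for support for yourself, "
    "or advice on how to get back at someone?"
)
REFUSAL_TEMPLATE = (
    "I can't help with targeting or hurting anyone, "
    "but I can suggest ways to stay safe and get support."
)

# Hypothetical quick-reply option ids mapped to intent labels.
OPTION_TO_INTENT = {"support": "help_seeking", "get_back": "harassment"}


def needs_clarification(intent_conf: float, labels_agree: bool,
                        conf_threshold: float = 0.6) -> bool:
    """Ask one question when confidence is low or the labels disagree."""
    return intent_conf < conf_threshold or not labels_agree


def intent_from_reply(option_id: Optional[str]) -> str:
    """Map the reply to an intent; unrecognized or missing stays ambiguous."""
    return OPTION_TO_INTENT.get(option_id, "ambiguous")


if needs_clarification(intent_conf=0.45, labels_agree=False):
    print(CLARIFY_QUESTION)
print(intent_from_reply("support"))
```

Keeping the refusal text in a single constant is one way to make the template stable across surfaces; a skipped or free-text reply falls back to the ambiguous cell rather than guessing.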
- Developer-operationalizable wiring
- Add meta-cells in the matrix: • (risk_area, intent=ambiguous, age_band) and (risk_area, intent=unknown, age_band).
- Classifier → policy glue: • if max_intent_conf < τ_low → unknown cell. • if the top two intent confidences are within δ of each other and both intents are teen-relevant → ambiguous cell.
- Log per-cell metrics (refs c49–c53, c59–c63): • refusal_rate, hard_block_share, re-engagement. • adjust only those ambiguous/unknown cells when telemetry shows persistent overblocking or underprotection.
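The meta-cells and the classifier-to-policy glue described in this list could look roughly like the sketch below. The thresholds `tau_low` and `delta` stand in for τ_low and δ; their default values, the `TEEN_RELEVANT` set, the age band label, and the example matrix entries are all placeholder assumptions, not recommended settings.

```python
from typing import Mapping, Optional, Tuple

# Assumed set of intent labels considered teen-relevant.
TEEN_RELEVANT = frozenset(
    {"help_seeking", "learning", "curiosity", "rule_evasion", "harassment"}
)

CellKey = Tuple[str, str, str]  # (risk_area, intent, age_band)

# Toy matrix with the two meta-cells added next to a normal cell.
MATRIX = {
    ("bullying", "help_seeking", "13-15"): "allow",
    ("bullying", "ambiguous", "13-15"): "partial",
    ("bullying", "unknown", "13-15"): "clarify",
}


def resolve_intent(scores: Mapping[str, float],
                   tau_low: float = 0.4, delta: float = 0.1) -> str:
    """Collapse classifier scores into an intent label or a meta-intent.

    Expects at least one score; an empty mapping raises IndexError.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    top_intent, top_conf = ranked[0]
    if top_conf < tau_low:
        return "unknown"
    if len(ranked) > 1:
        second_intent, second_conf = ranked[1]
        close = (top_conf - second_conf) < delta
        if close and {top_intent, second_intent} <= TEEN_RELEVANT:
            return "ambiguous"
    return top_intent


def lookup_cell(risk_area: str, scores: Mapping[str, float],
                age_band: str) -> Tuple[CellKey, Optional[str]]:
    """Return the resolved cell key and its action (None if undefined)."""
    key = (risk_area, resolve_intent(scores), age_band)
    return key, MATRIX.get(key)


print(lookup_cell("bullying", {"help_seeking": 0.48, "harassment": 0.44}, "13-15"))
```

Returning the cell key alongside the action makes it straightforward to attach the per-cell metrics mentioned above to the same key.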
- Guardrails against both underprotection and paternalism
- Keep non-negotiable list fixed and intent-agnostic.
- Bias ambiguity handling toward: • partial, support-focused answers rather than full allow or hard block. • short clarifications when feasible rather than silent downgrades.
- Review ambiguous/unknown cells with teen audits to ensure the patterns feel fair rather than like one-size-fits-all blocking.
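The first guardrail, a fixed and intent-agnostic non-negotiable list, can be made structural by giving the check no access to the intent label at all. A minimal sketch, assuming content is tagged upstream with category strings (the tag names and function names here are hypothetical):

```python
from typing import Iterable

# Immutable by construction; the entries are illustrative placeholders.
NON_NEGOTIABLES = frozenset(
    {"self_harm_methods", "sexual_exploitation", "doxxing"}
)


def violates_non_negotiable(content_tags: Iterable[str]) -> bool:
    """True if any tag is on the fixed list. Takes no intent argument."""
    return not NON_NEGOTIABLES.isdisjoint(content_tags)


def final_action(content_tags: Iterable[str], cell_action: str) -> str:
    """Apply the non-negotiable check on top of whatever the cell decided."""
    if violates_non_negotiable(content_tags):
        return "block"
    return cell_action


print(final_action({"coping_tips"}, "partial"))
print(final_action({"doxxing", "coping_tips"}, "allow"))
```

Since the function signature carries no intent or confidence input, tuning the ambiguous or unknown cells cannot loosen this layer by accident.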