When teens repeatedly encounter graceful refusals on the same high‑risk topic (such as self-harm or sexual exploitation), which specific refusal patterns, escalation steps, or logging rules best distinguish between legitimate help-seeking that deserves richer safe support and adversarial probing that should trigger stricter safeguards or human review?
teen-safe-ai-ux | Updated at
Answer
Distinguish by combining a small set of refusal patterns with intent-sensitive escalation rules and per-session logging keyed to the teen safety matrix.
- Refusal patterns that separate help-seeking from probing
- Use two main templates in high-risk cells:
  - help_support_refusal: goal-first; emotional support plus high-level information; explicit offers to continue in safe directions.
  - fixed_limit_refusal: brief, consistent block with minimal surface for iteration (no hints about what wording might pass).
- Heuristics:
  - If classifier intent ∈ {help-seeking, distress} and the language shows self-reference, feelings, or context ("I feel", "I’m scared", "this happened to me"), favor help_support_refusal with optional follow-up questions.
  - If intent ∈ {rule-evasion, other-directed harm, curiosity about methods}, or queries are abstract, third-person, and operational ("best way to…", "how to cover tracks"), use fixed_limit_refusal.
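The routing heuristic above can be sketched as a small function. This is an illustrative sketch only: the marker lists, intent labels, and the name `classify_refusal_style` are assumptions for the example, not an existing API.

```python
import re

# Hypothetical surface-cue patterns; a real system would use a trained
# classifier plus richer marker sets, not two regexes.
HELP_MARKERS = re.compile(
    r"\b(i feel|i'm scared|this happened to me|help me)\b", re.I)
OPERATIONAL_MARKERS = re.compile(
    r"\b(best way to|how to cover|step[- ]by[- ]step|without getting caught)\b", re.I)

def classify_refusal_style(intent: str, query: str) -> str:
    """Map a classifier intent label plus surface cues to a refusal template."""
    if intent in {"help-seeking", "distress"} and HELP_MARKERS.search(query):
        return "help_support_refusal"
    if intent in {"rule-evasion", "other-directed-harm", "method-curiosity"} \
            or OPERATIONAL_MARKERS.search(query):
        return "fixed_limit_refusal"
    # Ambiguous cases default to the supportive template for teen users.
    return "help_support_refusal"
```

Defaulting ambiguous cases to the supportive template keeps the system biased toward help-seeking teens; the fixed-limit path requires a positive evasion signal.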
- Escalation steps based on repetition pattern
- Within a session, for each risk_area×age_band cell:
  - 1st high-risk request: standard graceful refusal aligned to the predicted intent.
  - 2nd–3rd request with similar content:
    - If intent stays help-seeking/distress: escalate support depth (more coping detail, clearer resource options) but keep method details blocked.
    - If intent is rule-evasion/operational: increase friction (shorter answer, a more explicit statement of fixed limits, no hints about policy edges).
  - ≥4 similar high-risk requests:
    - Help-seeking pattern: offer optional stronger escalation ("We can keep talking about coping; I can also help you prepare to talk to a trusted adult or a counselor") and, where allowed, ask a single yes/no crisis screen that may trigger hotline info.
    - Adversarial pattern: freeze to a stable, low-variance refusal template; stop adapting wording; consider a silent risk flag for review.
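The escalation ladder above reduces to a small dispatch on repetition count and intent pattern. A minimal sketch, assuming the thresholds (1, 3, 4) from the ladder and invented action names:

```python
# Illustrative only: action strings and the function name are placeholders,
# not identifiers from any existing codebase.
def escalation_action(request_count: int, intent_pattern: str) -> str:
    """Pick an action for the Nth similar high-risk request in one
    risk_area x age_band cell within a session."""
    help_like = intent_pattern in ("help-seeking", "distress")
    if request_count <= 1:
        return "standard_graceful_refusal"
    if request_count <= 3:
        # 2nd-3rd repetition: deepen support or add friction by pattern.
        return "deepen_support" if help_like else "increase_friction"
    # 4th or later repetition: strongest per-pattern escalation.
    return "offer_crisis_screen" if help_like else "freeze_template_and_flag"
```

Keeping this as a pure function of (count, pattern) makes the ladder easy to encode in config and to unit-test per cell.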
- Simple per-session logging rules
- For each session and risk_area×intent×age_band cell, log lightweight counters (no raw text by default):
  - refusal_count, distinct_query_count.
  - rephrase_loop_count (refusal → similar query within N turns).
  - help_pattern_flag: presence of self-referential distress markers; increasing emotional intensity.
  - circumvention_pattern_flag: growing operationality, explicit mentions of "bypass" or "trick", or attempts to decompose forbidden details.
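These counters fit in one small per-cell record. A sketch under stated assumptions: field names follow the schema above, but the update rules (and the repeat-query approximation of "similar query") are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class CellLog:
    """Per-session counters for one risk_area x intent x age_band cell.
    Stores no raw query text beyond an in-memory dedup set."""
    refusal_count: int = 0
    distinct_query_count: int = 0
    rephrase_loop_count: int = 0
    help_pattern_flag: bool = False
    circumvention_pattern_flag: bool = False
    _seen: set = field(default_factory=set)

    def record(self, query: str, refused: bool, turns_since_last_refusal: int,
               distress_markers: bool, evasion_markers: bool, n: int = 3) -> None:
        if refused:
            self.refusal_count += 1
        if query not in self._seen:
            self._seen.add(query)
            self.distinct_query_count += 1
        elif turns_since_last_refusal <= n:
            # Repeat of a seen query soon after a refusal: count as a
            # rephrase loop (a real system would use semantic similarity).
            self.rephrase_loop_count += 1
        # Flags are sticky once a marker appears in the session.
        self.help_pattern_flag |= distress_markers
        self.circumvention_pattern_flag |= evasion_markers
```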
- Use thresholds:
  - Legitimate help-seeking pattern: high help_pattern_flag, low circumvention_pattern_flag, modest rephrase_loop_count → allow richer support (more coping detail, more examples, clearer step-by-step guidance for safe actions only).
  - Adversarial pattern: high circumvention_pattern_flag, high rephrase_loop_count with stable intent → lock to fixed_limit_refusal; if severity is high (e.g., sexual exploitation of minors), send an anonymized event to human review.
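The two threshold rules can be sketched as one routing function. The cutoff values (3 loops for adversarial lock, ≤2 for richer support) are placeholders a team would tune per risk_area×age_band cell, and the outcome strings are invented for the example:

```python
def routing_decision(help_flag: bool, circumvention_flag: bool,
                     rephrase_loops: int, severity: str) -> str:
    """Map session-level flags and counters to a response policy."""
    if circumvention_flag and rephrase_loops >= 3:
        if severity == "high":  # e.g. sexual exploitation of minors
            return "fixed_limit_refusal+human_review"
        return "fixed_limit_refusal"
    if help_flag and not circumvention_flag and rephrase_loops <= 2:
        return "richer_safe_support"
    return "standard_refusal"
```

Note the adversarial branch is checked first, so a session showing both distress markers and circumvention attempts still locks to the fixed-limit template.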
- Developer-operationalizable rules
- Implement as additions to the existing matrix used in prior artifacts. For each high-risk cell, define:
  - refusal_style_primary: help_support_refusal | fixed_limit_refusal.
  - escalation_profile: {support_escalation, friction_escalation, review_escalation} with simple numeric thresholds for counts/flags.
  - logging_schema_key: which counters and flags to track.
- No model retrain is required if developers:
  - reuse existing risk_area×intent classifiers;
  - add a small distress-vs-rule-evasion classifier or heuristic feature set;
  - encode thresholds and escalation choices in config (JSON/YAML) and prompt headers.
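One possible shape for a per-cell config entry, shown as a JSON-serializable Python dict. The cell key, threshold values, and schema key are placeholders; only the three field names come from the matrix additions above.

```python
import json

# Hypothetical config entry for one high-risk cell; values are illustrative.
MATRIX_CONFIG = {
    "self_harm:13-15": {
        "refusal_style_primary": "help_support_refusal",
        "escalation_profile": {
            "support_escalation": {"min_repeats": 2},
            "friction_escalation": {"min_repeats": 2},
            "review_escalation": {"min_rephrase_loops": 3, "severity": "high"},
        },
        "logging_schema_key": "counters_v1",
    }
}

# Round-trips through JSON, so it can ship as a config file (or be
# converted to YAML) without any model retrain.
config_json = json.dumps(MATRIX_CONFIG, indent=2)
```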
This setup lets repeated graceful refusals become more helpful for genuine teen help-seeking while making adversarial probing converge toward stricter, stable blocks and, when needed, human review.