How can prompt-based teen safety policies be simplified into a small set of reusable building blocks (e.g., risk-area × intent rules, refusal templates, escalation tiers) that independent developer teams can compose without re‑training models, while still achieving low false positives and low underprotection across different teen products (search, chat, creative tools)?
teen-safe-ai-ux
Answer
Use a small policy kit: (1) an intent×risk matrix, (2) a few prompt modules, (3) shared refusal/escalation templates, and (4) a lightweight tuning loop.
- Policy matrix
- Axes: risk_area × intent × age_band → action + style.
- Risk areas (reused across products): self-harm, sex/exploitation, bullying, substances, scams, other.
- Intents: reuse the shared teen intent vocabulary (help-seeking, learning, creative, experimentation, self-/other-hostility, rule‑evasion).
- Actions: allow, partial, block, escalate.
- Non‑negotiables: a short list of “always block” patterns (e.g., self-harm methods, sexual exploitation, doxxing) independent of product or intent.
- Implementation: store as JSON/YAML; all teams read the same matrix and apply it via prompts or middleware.
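The matrix above can be sketched as plain data plus a lookup function. This is a minimal illustration, not a fixed schema: the cell keys, the `ALWAYS_BLOCK` flag names, and the fallback behavior are all assumptions chosen to match the bullets above.

```python
# Hypothetical sketch of the shared policy matrix as plain data
# (in practice loaded from the shared JSON/YAML file).
POLICY_MATRIX = {
    # (risk_area, intent, age_band) -> action + refusal style
    ("self_harm", "help_seeking", "13-15"): {"action": "partial", "refusal_style": "support_focus"},
    ("self_harm", "rule_evasion", "13-15"): {"action": "block", "refusal_style": "support_focus"},
    ("substances", "learning", "16-17"): {"action": "partial", "refusal_style": "goal_first_partial"},
}

# Non-negotiables override every cell, regardless of intent or product.
ALWAYS_BLOCK = {"self_harm_methods", "sexual_exploitation", "doxxing"}

def lookup(risk_area, intent, age_band, flags=frozenset()):
    """Resolve one matrix cell; hard-block patterns win unconditionally."""
    if flags & ALWAYS_BLOCK:
        return {"action": "block", "refusal_style": "support_focus"}
    # Default to the safest action when a cell is missing.
    return POLICY_MATRIX.get(
        (risk_area, intent, age_band),
        {"action": "block", "refusal_style": "goal_first_partial"},
    )
```

Because the matrix is data, every team can read the same file and apply identical cells from prompts or middleware.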
- Prompt building blocks
- Policy header module: inject a compact spec for the current cell, e.g.:
- risk_area, intent, age_band.
- allowed dimensions (education only, emotional support only, high-level only, no how‑to, etc.).
- action (allow/partial/block) and refusal_style key.
- Intent and risk fields in the header can be filled by existing classifiers; no model retraining is needed.
- Product flavors (search, chat, creative) differ only in “how to phrase” and “how long,” not in core rules.
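A policy header module might be assembled like this. The field layout and the per-product style hints are illustrative assumptions; only the idea (cell fields injected verbatim, product changing phrasing but not rules) comes from the bullets above.

```python
# Illustrative policy-header builder: turns one resolved matrix cell into a
# compact prompt module. Field names mirror the matrix; wording is an assumption.
def build_policy_header(cell, risk_area, intent, age_band, product="chat"):
    # Product flavors change only phrasing and length hints, never core rules.
    style_hints = {
        "search": "Answer in 1-2 short, factual paragraphs.",
        "chat": "Use a warm, conversational tone.",
        "creative": "Keep any sensitive content clearly fictional.",
    }
    return (
        f"POLICY risk_area={risk_area} intent={intent} age_band={age_band}\n"
        f"action={cell['action']} refusal_style={cell['refusal_style']}\n"
        f"allowed={cell.get('allowed', 'high-level only; no how-to')}\n"
        f"STYLE: {style_hints[product]}"
    )
```

The header is prepended to the system prompt, so the same base model enforces different cells without retraining.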
- Refusal and escalation templates
- 5–8 reusable templates keyed by refusal_style, e.g.:
- goal_first_partial (acknowledge goal, give safe part, then limit).
- clarify_then_answer (ask a short clarifying question when intent is ambiguous).
- support_focus (for self-harm: block methods, offer coping + resources).
- fiction_channel (for creative: keep content fictional, de‑emphasize methods/targets).
- 2–3 escalation tiers per risk area:
- tier 0: single refusal.
- tier 1: richer partial help + clearer boundaries.
- tier 2: same, plus optional human/help links or micro‑education.
- Tiers are triggered by simple counters (e.g., repeated high‑risk topic) rather than per‑product logic.
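The counter-driven tiers can be sketched as a small product-agnostic tracker. The thresholds (second repeat → tier 1, third → tier 2) are illustrative assumptions, not prescribed values.

```python
from collections import Counter

# Minimal sketch: per-session counters map repeated high-risk topics to tiers.
class EscalationTracker:
    """Counts repeats of each risk area in a session and returns a tier."""

    # (min repeats, tier), checked highest first; thresholds are assumptions.
    TIER_THRESHOLDS = [(3, 2), (2, 1), (0, 0)]

    def __init__(self):
        self.hits = Counter()

    def record(self, risk_area):
        """Log one occurrence and return the current tier for that area."""
        self.hits[risk_area] += 1
        return self.tier(risk_area)

    def tier(self, risk_area):
        n = self.hits[risk_area]
        for threshold, tier in self.TIER_THRESHOLDS:
            if n >= threshold:
                return tier
        return 0
```

Because the tracker only sees risk areas and counts, the same logic serves search, chat, and creative surfaces.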
- Developer-operationalizable flow
- Step 1: classify input → intent + risk_area (+ age_band from profile).
- Step 2: look up matrix cell → action + refusal_style + escalation_tier.
- Step 3: build prompt with policy header and template snippet; let the base model generate under those constraints.
- Step 4: log outcomes with coarse labels (false_positive_likely, underprotection_near_miss, smooth_ok) based on follow‑up behavior.
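The four steps above can be wired together as a single handler. The classifier is a stub and the template strings, matrix shape, and log labels are assumptions for illustration; in a real deployment the existing intent/risk classifiers and the base model call would replace the stubs.

```python
# Hypothetical end-to-end flow: classify -> look up cell -> build prompt -> log.
def classify(text):
    # Stub: a real system would call the shared intent/risk classifiers.
    if "hurt myself" in text:
        return ("self_harm", "help_seeking")
    return ("other", "learning")

REFUSAL_TEMPLATES = {  # illustrative snippets keyed by refusal_style
    "support_focus": "I can't help with that part, but here are coping resources...",
    "goal_first_partial": "Here's the safe part of what you asked...",
}

def handle(text, age_band, matrix, log):
    risk_area, intent = classify(text)                      # step 1
    cell = matrix.get(                                      # step 2
        (risk_area, intent, age_band),
        {"action": "block", "refusal_style": "goal_first_partial"},
    )
    snippet = REFUSAL_TEMPLATES[cell["refusal_style"]]      # step 3
    prompt = f"[{risk_area}/{intent}/{age_band} action={cell['action']}]\n{snippet}"
    log.append({"cell": (risk_area, intent, age_band),      # step 4
                "action": cell["action"], "label": "smooth_ok"})
    return prompt
```

The handler returns a constrained prompt for the base model; the coarse log labels feed the tuning loop without exposing message content.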
- Controlling false positives vs underprotection
- Keep non‑negotiables hard‑blocked everywhere to cap underprotection.
- Reduce false positives mainly by:
- enriching partial answers instead of expanding what’s fully allowed.
- improving templates (clearer goals, more examples of what is allowed).
- adjusting only specific matrix cells (risk_area×intent×age_band) that show overblocking, not global rules.
- Run the same kit across search, chat, and creative: different UIs, same matrix, same templates.
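Targeted tuning might look like the following sketch: only the overblocking cell changes, while the global matrix and non-negotiables stay fixed. The override example and cell names are hypothetical.

```python
# Hypothetical per-cell tuning: replace only the cells that logs flag as
# overblocking; everything else, including non-negotiables, is untouched.
def apply_overrides(matrix, overrides):
    """Return a tuned copy of the matrix with specific cells replaced."""
    tuned = dict(matrix)
    tuned.update(overrides)
    return tuned

# Example: suppose logs show creative "substances" questions from 16-17s are
# overblocked; relax that single cell from block to a fiction-channel partial.
OVERRIDES = {
    ("substances", "creative", "16-17"): {
        "action": "partial", "refusal_style": "fiction_channel",
    },
}
```

Shipping overrides as a small data diff keeps tuning reviewable and reversible, and prevents one product's fix from loosening rules elsewhere.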
This gives independent teams a compact, composable kit they can wire into prompts and middleware without retraining models, while keeping protections strong on severe harms and reducing teen‑frustrating false positives via targeted matrix and template updates.