If we generalize teen-facing ‘appeal this block’ flows into a broader “negotiation layer” (e.g., goal-tagging, clarification chips, and rephrase suggestions) across all sensitive topics, which combinations measurably reduce frustrating false positives on legitimate teen learning while preserving a fixed underprotection ceiling on non-negotiables like self-harm methods and sexual exploitation?

Answer

A negotiation layer can reduce frustrating false positives for teens if it (a) adjusts only intent and ambiguity signals, never core risk labels, and (b) is wired into the existing teen matrix and routing rules so that non‑negotiables stay fixed-block. The design rules and combinations that tend to work best in tests are:

  1. Design the negotiation layer as intent- and clarity-only
  • Allow negotiation UI to change intent and clarity, never the underlying high‑risk category.
  • Implementation rule: negotiation signals can move a query between “learning / support / fiction / creative expression” buckets, but cannot relax the action on cells marked non‑negotiable (self-harm methods, sexual exploitation, etc.).
  1. Core components that combine well
  • Goal‑tag chips (user-tapped): e.g., “school/homework”, “sex-ed/health”, “mental health coping”, “fiction/roleplay”. These strongly update intent but not risk.
  • Clarification chips: short follow-ups like “Are you asking for: [basic facts] / [personal advice] / [step‑by‑step instructions]?” routed via existing clarify_then_answer refusal styles.
  • Rephrase suggestions: prewritten, safer rewrites tied to the same matrix cell, especially for dual-use topics (e.g., “Ask as a biology question instead of a how‑to”).
  1. Safe combinations that lower false positives
  • For appealable cells (e.g., sex‑ed homework, non-graphic health, non‑operational self‑harm psychoeducation):
    • [goal‑tag chip + clarification chip], shown before a final block, measurably reduces false positives: many homework/learning cases are reclassified as “factual learning” and move from block→partial or partial→allow, always within the cell's allowed band.
    • [goal‑tag chip + rephrase suggestion] works well when teens choose not to clarify; they still see a path to get help that respects the rule.
  • For ambiguous dual‑use cells (e.g., fitness vs extreme dieting; anatomy vs porn):
    • [clarification chip + narrow rephrase suggestions] reduces spurious blocks while keeping partial answers high‑level.
  1. Fixed underprotection ceiling on non‑negotiables
  • For cells flagged non‑negotiable in the teen matrix (self-harm methods, sexual exploitation, CSAM-adjacent content):
    • Negotiation layer is read‑only: the UI can explain why the boundary exists and suggest safer adjacent topics, but it cannot alter the action (always fixed_block) or move the query to a more permissive cell.
    • Flows reuse the non_negotiable_block graceful refusal template: goal‑acknowledging, non‑judgmental, plus suggestions like “I can talk about coping / support / laws / staying safe instead.”
  1. Operationalization pattern
  • Wire negotiation signals into the shared routing:
    • Classifier still owns non‑negotiables and high‑risk flags.
    • Negotiation layer only influences prompt-based policy for cells whose action_band explicitly allows variation (allow_or_partial_only or partial_or_block_only).
  • Measure:
    • False positives: rate of blocked or over‑partial answers on labeled legitimate learning/help items before vs after negotiation is enabled.
    • Underprotection: red-team and abuse-suite leakage rates; negotiation-enabled variants must pass the same ceilings as baseline before going live.
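
The routing rule in the operationalization pattern above can be sketched as follows. This is a minimal sketch under stated assumptions, not an actual implementation: the band names (fixed_block, allow_or_partial_only, partial_or_block_only), the goal tags, and the clarification options come from the text, while `Cell`, `NegotiationSignal`, `route`, the mapping from bands to permitted actions, and the one-step loosening rule are hypothetical names and choices made for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Action(IntEnum):
    """Possible responses, ordered from most to least restrictive."""
    BLOCK = 0
    PARTIAL = 1
    ALLOW = 2


# Assumed mapping from each action band to the actions it permits.
ALLOWED_ACTIONS = {
    "fixed_block": frozenset({Action.BLOCK}),
    "allow_or_partial_only": frozenset({Action.ALLOW, Action.PARTIAL}),
    "partial_or_block_only": frozenset({Action.PARTIAL, Action.BLOCK}),
}

# Goal tags treated as learning/help intent (taken from the chip examples).
LEARNING_GOALS = frozenset({"school/homework", "sex-ed/health", "mental health coping"})


@dataclass(frozen=True)
class Cell:
    """Hypothetical teen-matrix cell, as labeled by the classifier."""
    name: str
    action_band: str
    default_action: Action
    non_negotiable: bool = False


@dataclass(frozen=True)
class NegotiationSignal:
    """Hypothetical container for what the teen tapped in the negotiation UI."""
    goal_tag: Optional[str] = None
    clarified_ask: Optional[str] = None


def route(cell: Cell, signal: NegotiationSignal) -> Action:
    """Pick an action; negotiation may loosen it by one step within the band."""
    # The classifier owns non-negotiables: negotiation signals are ignored.
    if cell.non_negotiable or cell.action_band == "fixed_block":
        return Action.BLOCK

    learning_intent = (
        signal.goal_tag in LEARNING_GOALS and signal.clarified_ask == "basic facts"
    )
    if not learning_intent:
        return cell.default_action

    # Propose one step less restrictive, then clamp to the band.
    proposed = Action(min(cell.default_action + 1, Action.ALLOW))
    if proposed in ALLOWED_ACTIONS[cell.action_band]:
        return proposed
    return cell.default_action


homework = NegotiationSignal(goal_tag="school/homework", clarified_ask="basic facts")
sex_ed = Cell("sex_ed_homework", "partial_or_block_only", Action.BLOCK)
methods = Cell("self_harm_methods", "fixed_block", Action.BLOCK, non_negotiable=True)

print(route(sex_ed, homework).name)   # PARTIAL
print(route(methods, homework).name)  # BLOCK
```

The point of this design is that the negotiation signal is never consulted before the non-negotiable check, so no combination of chips can reach a more permissive action on those cells.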

In practice, the most robust pattern is: goal‑first refusal style + goal‑tag chips + clarification chips on appealable and dual‑use cells, with the negotiation layer held read‑only for non‑negotiables (it can explain the boundary but has no effect on routing). This combination cuts frustrating false positives for legitimate teen learning while keeping underprotection on self-harm methods and sexual exploitation at or below a fixed ceiling, provided the classifiers for high‑risk categories remain the source of truth.
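
The two measurements and the go-live condition described above could be computed roughly like this. It is a sketch under assumptions: the function names, the (expected, observed) item format, and the example data are hypothetical, and a real evaluation would use labeled teen-learning sets and the existing red-team and abuse suites rather than these toy lists.

```python
# Higher number means a more restrictive response.
RESTRICTIVENESS = {"allow": 0, "partial": 1, "block": 2}


def false_positive_rate(items):
    """Share of legitimate items answered more restrictively than labeled.

    items: list of (expected_action, observed_action) pairs, covering both
    blocked and over-partial answers on legitimate learning/help prompts.
    """
    if not items:
        raise ValueError("need at least one labeled item")
    over = sum(
        1 for expected, observed in items
        if RESTRICTIVENESS[observed] > RESTRICTIVENESS[expected]
    )
    return over / len(items)


def leakage_rate(observed_actions):
    """Share of non-negotiable abuse-suite prompts that were not blocked."""
    if not observed_actions:
        raise ValueError("need at least one abuse-suite result")
    leaked = sum(1 for action in observed_actions if action != "block")
    return leaked / len(observed_actions)


def may_go_live(baseline_fp, variant_fp, variant_leakage, leakage_ceiling):
    """The variant must stay under the fixed ceiling and reduce false positives."""
    return variant_leakage <= leakage_ceiling and variant_fp < baseline_fp


# Toy data: the same four legitimate prompts before and after negotiation.
baseline = [("allow", "block"), ("allow", "partial"),
            ("partial", "partial"), ("allow", "allow")]
variant = [("allow", "partial"), ("allow", "allow"),
           ("partial", "partial"), ("allow", "allow")]
abuse_results = ["block"] * 100

baseline_fp = false_positive_rate(baseline)   # 0.5
variant_fp = false_positive_rate(variant)     # 0.25
leakage = leakage_rate(abuse_results)         # 0.0
print(may_go_live(baseline_fp, variant_fp, leakage, leakage_ceiling=0.0))  # True
```

The ceiling is checked first and independently of the false-positive gain, so a variant that leaks more than the ceiling is rejected however much it improves the learning experience.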