What measurable properties of a teen-visible safety summary (such as maximum reading level, number of rules, stability over time, and alignment with actual classifier behavior) most reliably predict lower false positives on legitimate learning and fewer perceptions of paternalism, and how can product teams use those properties to iteratively simplify policies without weakening core age-appropriate safeguards?
teen-safe-ai-ux
Answer
The most useful predictors are (1) simplicity (reading level plus rule count), (2) behavioral alignment (summary vs. classifier outcomes), and (3) stability (few rewrites). Teams can iteratively shorten and align summaries by tracking per-variant metrics on false positives, underprotection, and perceived paternalism.
- Key measurable properties
- Reading level
- Target: ~6th–8th grade reading level for summaries and inline reasons.
- Metric: automated readability score per string; % of text above target.
- Effect: lower reading level → higher rule comprehension → fewer confused appeals and some reduction in spurious blocks perceived as random.
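To make the reading-level metric concrete, here is a minimal sketch of a per-string Flesch-Kincaid grade check. The vowel-group syllable counter is a crude heuristic and the example strings are invented; a production pipeline would use a proper readability library.

```python
import re

def syllables(word: str) -> int:
    # Crude heuristic: count runs of vowels as syllables (min 1).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text: str) -> float:
    """Approximate Flesch-Kincaid grade level of a summary string."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syl = sum(syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syl / len(words) - 15.59)

# Flag any teen-visible string above the 8th-grade target.
strings = [
    "We block step-by-step self-harm content.",
    "Content contravening applicable statutory safeguarding obligations is impermissible.",
]
too_hard = [s for s in strings if fk_grade(s) > 8.0]
```

Reporting "% of text above target" is then just `len(too_hard) / len(strings)` over the full string catalog.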
- Rule count and structure
- Target: 3–7 top-level rules, 1–2 examples each.
- Metric: total rule count; avg tokens per rule; whether rules are product-wide vs topic-specific.
- Effect: fewer, clearer rules → better mental model → fewer “paternalistic” reactions when rules fire, and lower accidental false positives from misapplied, over-broad policies.
- Alignment with actual behavior
- Targets:
  - Precision: when the summary says "X is blocked," the % of blocks that actually match X.
  - Recall: % of X-blocks where the reason text actually mentions X.
  - Consistency: fraction of similar prompts that get the same rule and explanation.
- Effect: higher alignment → fewer “it says one thing but does another” perceptions → lower perceived paternalism even when refusal rate stays constant.
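The three alignment targets can be sketched as small functions over refusal logs. The record fields (`true_topic` for the human label, `stated_rule` for the rule cited in the teen-facing reason) are hypothetical names, and the records are invented:

```python
from collections import Counter

# Hypothetical refusal-log records (field names are assumptions).
logs = [
    {"true_topic": "self_harm_howto", "stated_rule": "self_harm_howto"},
    {"true_topic": "self_harm_howto", "stated_rule": "generic_safety"},
    {"true_topic": "violence_detail", "stated_rule": "self_harm_howto"},
    {"true_topic": "self_harm_howto", "stated_rule": "self_harm_howto"},
]

def alignment(rule, records):
    says_rule = [r for r in records if r["stated_rule"] == rule]
    is_rule = [r for r in records if r["true_topic"] == rule]
    # Precision: when the reason text names X, how often the block
    # really was about X.
    precision = sum(r["true_topic"] == rule for r in says_rule) / len(says_rule)
    # Recall: of blocks that really were about X, how often the reason
    # text names X instead of generic boilerplate.
    recall = sum(r["stated_rule"] == rule for r in is_rule) / len(is_rule)
    return precision, recall

def consistency(rules_for_similar_prompts):
    # Fraction of similar prompts that received the modal rule.
    top = Counter(rules_for_similar_prompts).most_common(1)[0][1]
    return top / len(rules_for_similar_prompts)

p, r = alignment("self_harm_howto", logs)
```

In this toy log, precision and recall for the self-harm rule both come out to 2/3: one block cited the rule for a violence prompt, and one self-harm block fell back to generic boilerplate.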
- Stability over time
- Targets:
  - Number of rule-text changes per quarter.
  - % of teens exposed to multiple different wordings of the same rule.
- Effect: stable wording tied to stable behavior builds expectation and reduces testing/probing for loopholes.
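A minimal sketch of the first stability metric: counting wording versions shipped per rule per calendar quarter, from a hypothetical change history (rule IDs, dates, and version labels are invented):

```python
from collections import defaultdict
from datetime import date

# Hypothetical wording-change history: (rule_id, ship_date, version).
history = [
    ("no_self_harm_howto", date(2024, 1, 10), "v1"),
    ("no_self_harm_howto", date(2024, 2, 3), "v2"),
    ("no_exploitation_howto", date(2024, 3, 20), "v1"),
    ("no_self_harm_howto", date(2024, 5, 1), "v3"),
]

def wordings_per_quarter(history):
    # More than one wording shipped for a rule in a quarter signals
    # unstable copy that teens will notice.
    counts = defaultdict(int)
    for rule, d, _version in history:
        quarter = (d.year, (d.month - 1) // 3 + 1)
        counts[(rule, quarter)] += 1
    return dict(counts)

churn = wordings_per_quarter(history)
```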
- Locality and specificity of explanations
- Targets:
  - % of refusals that name the specific rule ("no how-to for self-harm") rather than generic safety boilerplate.
  - Average length: 1–3 short sentences.
- Effect: specific but brief explanations reduce confusion and “you’re just treating me like a child” reactions.
- Using these properties to iteratively simplify without weakening safeguards
- Step 1: Lock non-negotiables and outcome bands
- Fix the matrix cells for non-negotiables (e.g., self-harm methods, exploitation how-to): always block, with capped detail.
- Declare these out of scope for simplification; only copy style can change.
- Step 2: Baseline measurement per variant
- For each summary variant, track by risk area and age band:
  - false_positive_rate on labeled legitimate learning/help prompts;
  - underprotection_rate on red-team or policy-violating prompts;
  - perceived_paternalism score from short in-product surveys.
- Correlate these with reading level, rule count, alignment, and stability metrics.
- Step 3: Simplify text first, not rules
- Reduce reading level and remove jargon while keeping the same matrix actions.
- Shorten explanations; move from long legalistic lists to 3–5 simple bullets.
- Re-measure: accept only changes that do not raise underprotection beyond preset ceilings.
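The acceptance rule in the last bullet can be sketched as a simple gate; the metric names and the ceiling value are illustrative assumptions:

```python
# Illustrative ceiling: the preset maximum underprotection rate.
UNDERPROTECTION_CEILING = 0.02

def accept_variant(baseline: dict, candidate: dict) -> bool:
    # Hard gate: never accept a variant that breaches the safety ceiling.
    if candidate["underprotection_rate"] > UNDERPROTECTION_CEILING:
        return False
    # Otherwise accept only if it improves at least one teen-facing metric.
    return (candidate["false_positive_rate"] <= baseline["false_positive_rate"]
            or candidate["paternalism_score"] <= baseline["paternalism_score"])

base = {"underprotection_rate": 0.015, "false_positive_rate": 0.08,
        "paternalism_score": 3.2}
cand = {"underprotection_rate": 0.016, "false_positive_rate": 0.05,
        "paternalism_score": 2.9}
accepted = accept_variant(base, cand)
```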
- Step 4: Merge and de-duplicate rules where actions are already identical
- If multiple rules map to the same matrix action band, merge into one plain-language rule with concrete examples.
- Keep the underlying classifier and matrix unchanged; only re-point them to the new shared explanation.
- Step 5: Tighten alignment between summary and classifier
- For high-traffic rules, sample blocks and check whether the human-labeled reason matches the rule text.
- Where misaligned, change: (a) the routing to use a better-fitting rule, or (b) the summary text to better describe current behavior.
- Avoid relaxing the policy to match a misleading summary; change text first.
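A sketch of the sampling check and the text-first remediation rule; the sample records, the `label` field (a human reviewer's judgment), and the 90% floor are all assumptions:

```python
ALIGNMENT_FLOOR = 0.9  # illustrative threshold

def alignment_rate(samples, rule):
    # Share of sampled blocks where the human label matches the rule text.
    return sum(s["label"] == rule for s in samples) / len(samples)

def remediation(rate):
    # Below the floor, fix the routing (a better-fitting rule) or
    # rewrite the summary to describe current behavior -- text first,
    # never relax the policy to match a misleading summary.
    return "fix_routing_or_text" if rate < ALIGNMENT_FLOOR else "aligned"

sample = ([{"label": "self_harm_howto"}] * 8
          + [{"label": "generic_safety"}] * 2)
decision = remediation(alignment_rate(sample, "self_harm_howto"))
```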
- Step 6: Freeze wording, then evolve behavior cautiously
- Once a short, aligned summary is found for a rule, freeze the text and treat it as a contract.
- Any future policy change must first be expressed in the matrix, then minimally updated in the summary, with a small A/B to verify no big increase in false positives or underprotection.
- Step 7: Use teen feedback as a guardrail
- Track: “this feels unfair / overprotective” vs “this feels unsafe / not protective enough.”
- Simplifications that reduce “unfair” responses without raising “unsafe” are good candidates to roll out; ones that move both up suggest either over-simplification or misalignment.
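The guardrail logic can be sketched as a small decision rule over the two survey deltas (change vs. control in the share of teens choosing each response); the one-point thresholds are illustrative:

```python
def rollout_decision(unfair_delta: float, unsafe_delta: float) -> str:
    if unsafe_delta > 0.01:
        # "Unsafe" rising points to over-simplification or misalignment,
        # even if "unfair" also moved.
        return "hold"
    if unfair_delta < -0.01:
        # Less perceived paternalism with safety perception held steady.
        return "roll_out"
    return "iterate"
```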
This keeps core age-appropriate safeguards (especially non-negotiables) intact while making the visible safety layer simpler, more predictable, and less paternalistic for teens.