For borderline, high-context harms in a low-resource language (e.g., political satire, culturally specific mental health idioms), does making second-order safety signals risk-conditional—stronger hedging and verification prompts only when the model’s internal uncertainty about the harm label is high—reduce over-refusal and miscalibrated reliance more effectively than uniformly stronger second-order safety signals, and what concrete new failure modes (such as users misreading high-uncertainty warnings as political bias) emerge under each regime?

cross-lingual-cot-trust

Answer

Risk-conditional second-order safety signals probably reduce over-refusal and miscalibrated reliance somewhat better than uniformly strong signals for borderline, high-context harms in low-resource languages, but the advantage is modest, and each regime introduces distinct failure modes.

Core comparison

  • Risk-conditional signals (uncertainty-adaptive):
    • Likely less over-refusal on benign satire and idioms, because cases the model confidently labels benign get lighter hedging.
    • Better reliance calibration near the true harm/benign boundary, provided internal uncertainty is even roughly aligned with real ambiguity.
    • Still dependent on noisy harm and uncertainty estimates in low-resource, high-context settings; miscalibration there limits the gains.
  • Uniformly stronger signals (always-heavy hedging/verification on this class):
    • More consistent caution, so fewer cases of confidently wrong harmful assistance.
    • More over-hedging and warning fatigue, so users may ignore the cues and over-rely anyway.
    • More benign borderline content gets treated as suspicious, increasing perceived bias or censorship. (The sketch after this list contrasts the two policies.)
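As a minimal sketch of the structural difference, assume a scalar internal uncertainty estimate about the harm label in [0, 1]; the function names, `SafetySignal` fields, and thresholds below are all illustrative, not values from any deployed system:

```python
from dataclasses import dataclass

@dataclass
class SafetySignal:
    hedge_level: str                     # "light" | "moderate" | "strong"
    verification_prompt: bool            # ask the user to verify independently
    meta_explanation: str | None = None  # optional note explaining the warning

def uniform_policy(harm_uncertainty: float) -> SafetySignal:
    # Always-on strong second-order signaling for this content class,
    # regardless of the model's confidence in the harm label.
    return SafetySignal(hedge_level="strong", verification_prompt=True)

def risk_conditional_policy(harm_uncertainty: float) -> SafetySignal:
    # Signal intensity scales with internal uncertainty about the harm
    # label; confidently-benign cases get lighter hedging.
    if harm_uncertainty < 0.2:
        return SafetySignal(hedge_level="light", verification_prompt=False)
    if harm_uncertainty < 0.6:
        return SafetySignal(hedge_level="moderate", verification_prompt=False)
    return SafetySignal(hedge_level="strong", verification_prompt=True)
```

The entire tradeoff hinges on how trustworthy `harm_uncertainty` is: in low-resource, high-context settings, exactly this input is noisiest.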

New failure modes

  • Risk-conditional regime

    • Political bias misreadings: warning intensity and “please verify” prompts spike mainly on satire tied to particular groups or topics (where the classifier is less confident), and users read the pattern as ideological bias rather than uncertainty.
    • Stealthy confident errors: when the model is wrongly confident about a harmful or misclassified idiom, hedging stays weak, so users over-trust those rare but consequential cases.
    • Opaque variability: users see similar queries receive very different warning levels; without a visible rationale, this feels arbitrary or targeted.
  • Uniform-strong regime

    • Warning habituation: users see strong hedging on almost all borderline content and start ignoring it, weakening any safety effect.
    • Perceived censorship / chilling effects: satire and culturally important idioms always arrive with heavy caveats, lowering satisfaction and trust, especially in the lower-resource language.
    • Blunt cross-topic spillover: the model over-hedges even on low-risk, high-context content (e.g., benign cultural jokes), crowding out nuance and local legitimacy. (The toy simulation after this list illustrates both failure profiles.)
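The toy simulation below makes the contrast concrete. It reuses the two policy functions from the sketch above and assumes a stylized world where a fraction of truly harmful items receive a confidently-low uncertainty score (the miscalibration that drives stealthy confident errors); all rates and distributions are invented for illustration:

```python
import random

random.seed(0)

def simulate(policy, n=10_000, base_harm_rate=0.1, miscalibration=0.15):
    """Count harmful items that escape strong hedging and benign items
    that get strong warnings anyway, under a given signaling policy."""
    unwarned_harmful = warned_benign = 0
    for _ in range(n):
        harmful = random.random() < base_harm_rate
        if harmful and random.random() < miscalibration:
            u = random.uniform(0.0, 0.2)   # stealthy confident error
        elif harmful:
            u = random.uniform(0.6, 1.0)   # correctly flagged as uncertain
        else:
            u = random.uniform(0.0, 0.5)   # benign, low-to-middling uncertainty
        signal = policy(u)
        if harmful and signal.hedge_level != "strong":
            unwarned_harmful += 1
        elif not harmful and signal.hedge_level == "strong":
            warned_benign += 1
    return unwarned_harmful, warned_benign

# Risk-conditional: few benign items over-warned, but the miscalibrated
# harmful tail escapes with weak hedging. Uniform: nothing escapes, but
# every benign item carries a strong warning (habituation, perceived bias).
print(simulate(risk_conditional_policy))  # roughly (150, 0)
print(simulate(uniform_policy))           # roughly (0, 9000)
```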

Net view

  • With current models and noisy uncertainty estimates in low-resource languages, risk-conditional signaling is likely better but fragile; it reduces some over-refusal and can sharpen reliance calibration around genuinely ambiguous harms, at the cost of:
    • a small number of high-impact confident errors with weak warnings, and
    • politically sensitive patterns of high-uncertainty warnings that users may read as bias.
  • Uniform strong signaling is safer against those confident errors but more damaging to user experience and perceived fairness, and less effective at fine-grained reliance calibration because users often tune it out.

A pragmatic design likely mixes both (sketched after the list):

  • A baseline floor of moderate, consistent second-order signals for this class.
  • Additional intensity (extra verification prompts, clearer limitation statements) when internal harm-label uncertainty is high.
  • Brief, localized meta-explanations that frame stronger warnings as uncertainty/context issues, not ideology, to reduce bias interpretations.
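A minimal sketch of that mixed policy, assuming the same scalar uncertainty estimate and `SafetySignal` type as above (the escalation threshold and the localization helper are hypothetical placeholders):

```python
def mixed_policy(harm_uncertainty: float, locale: str = "default") -> SafetySignal:
    # Baseline floor: moderate, consistent second-order signaling for this
    # borderline, high-context class; never lighter than this.
    if harm_uncertainty < 0.6:           # illustrative escalation threshold
        return SafetySignal(hedge_level="moderate", verification_prompt=False)
    # High harm-label uncertainty: escalate intensity and attach a brief,
    # localized meta-explanation framing the warning as uncertainty, not ideology.
    return SafetySignal(
        hedge_level="strong",
        verification_prompt=True,
        meta_explanation=localized_uncertainty_note(locale),
    )

def localized_uncertainty_note(locale: str) -> str:
    # Hypothetical helper: returns a short, pre-translated note for the
    # user's locale; only a fallback entry is shown here.
    notes = {
        "default": ("This topic involves cultural or contextual nuance the "
                    "model is uncertain about. The extra caution reflects "
                    "that uncertainty, not a position on the content."),
    }
    return notes.get(locale, notes["default"])
```

The floor keeps the uniform regime's consistency on genuinely borderline content, while the gated escalation preserves the risk-conditional regime's lighter touch on confidently-benign cases.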