When refusal coverage and second-order safety signals are already closely aligned across English and a low-resource language, which residual cross-lingual asymmetries in localized meta-explanations (e.g., level of policy detail, directness of moral framing, or references to local institutions) most strongly predict remaining miscalibrated reliance gaps for bilingual users, and how much can adjusting only these meta-explanations (holding refusals and hedging fixed) shrink those gaps in practice?
cross-lingual-cot-trust
Answer
The most predictive residual asymmetries are: (1) how concrete and policy-like the explanation is, (2) how directly it uses moral or harm-related framing, and (3) whether it names locally salient institutions or norms. Making these three dimensions structurally match across languages (while keeping refusal and hedging behavior fixed) can probably close around one-third to one-half of the remaining cross-lingual reliance gap for bilingual users, but is unlikely to eliminate it entirely.
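As a rough illustration of what "structurally matching" the three dimensions could mean in practice, the sketch below scores each language's meta-explanations on the three dimensions and reports the per-dimension gap. The class name, the 0-1 scoring scale, and all numbers are assumptions made for illustration; they are not measured data or an established annotation scheme.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class MetaExplanationProfile:
    """Hypothetical 0-1 annotation scores for one language's refusal explanations."""
    policy_explicitness: float      # how concrete and rule-like the stated reason is
    framing_directness: float       # how explicitly harm or welfare is named
    institutional_anchoring: float  # how locally plausible the cited institutions are


def asymmetry(a: MetaExplanationProfile, b: MetaExplanationProfile) -> dict:
    """Return the absolute per-dimension gap between two profiles."""
    return {f.name: abs(getattr(a, f.name) - getattr(b, f.name)) for f in fields(a)}


# Illustrative scores only (assumed, not measured).
english = MetaExplanationProfile(0.9, 0.8, 0.7)
low_resource = MetaExplanationProfile(0.3, 0.5, 0.3)

gaps = asymmetry(english, low_resource)
largest = max(gaps, key=gaps.get)  # dimension to prioritise when rewriting explanations
print(gaps, largest)
```

Under this framing, adjusting only the meta-explanations means driving each per-dimension gap toward zero while leaving refusal and hedging behavior untouched.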
Short synthesis:
- Strongest predictors of residual reliance gaps:
  - Policy explicitness gap: English gives clear, rule-like reasons ("violates safety policy on X"), while the low-resource language gives vague or generic phrases.
  - Framing directness gap: English uses explicit harm/welfare language ("to protect your safety / prevent harm"), while the low-resource language is more indirect or formulaic.
  - Institutional anchoring gap: English references experts or systems that users trust ("doctor", "local laws"), while the low-resource language either omits such anchors or uses implausible or foreign ones.
- Expected effect size: cleaning up these meta-explanation gaps alone would probably yield moderate reductions in miscalibrated reliance (roughly 30–50% of the residual gap that remains once refusals and hedging are aligned), with diminishing returns beyond that.
- Main limits: some reliance asymmetry comes from broader language prestige, user proficiency, and real residual quality gaps, which meta-explanations alone cannot fix.
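
To make the effect-size estimate concrete, here is a minimal arithmetic sketch. It assumes the reliance gap is expressed as the difference in miscalibrated-reliance rate between the two languages; that definition, the function names, and the 0.12 starting value are hypothetical. Only the 30–50% range comes from the estimate above.

```python
def remaining_gap(residual_gap: float, fraction_closed: float) -> float:
    """Gap left after an intervention closes `fraction_closed` of `residual_gap`."""
    if not 0.0 <= fraction_closed <= 1.0:
        raise ValueError("fraction_closed must be between 0 and 1")
    return residual_gap * (1.0 - fraction_closed)


def fraction_closed(gap_before: float, gap_after: float) -> float:
    """Share of the original gap removed by the intervention."""
    if gap_before <= 0:
        raise ValueError("gap_before must be positive")
    return (gap_before - gap_after) / gap_before


# Hypothetical residual gap of 12 percentage points after refusals and hedging are aligned.
residual = 0.12
pessimistic = remaining_gap(residual, 0.30)  # 30% of the gap closed
optimistic = remaining_gap(residual, 0.50)   # 50% of the gap closed
print(f"remaining gap: {optimistic:.3f} to {pessimistic:.3f}")
```

In this illustrative case, more than half of the residual gap can still remain after the meta-explanations are matched, which is consistent with the limits listed above.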