When safety-tuned models show cross-lingual refusal mismatches, does augmenting low-resource-language refusals with concrete, localized examples of past model failures in that same language (e.g., short summaries of similar harmful queries the model once answered incorrectly) improve users’ fairness perceptions and trust calibration more than abstract meta-explanations that mention weaker coverage but show no examples?

cross-lingual-cot-trust | Updated at

Answer

Adding concrete, localized examples of past failures to low-resource-language refusals is likely to improve trust calibration more than abstract meta-explanations that only state “coverage is weaker,” and can also improve some fairness perceptions (procedural fairness, honesty) relative to abstract messages—but it simultaneously risks increasing perceived cross-lingual unfairness and can normalize the idea that the model once answered such harmful queries, so it must be carefully framed and sparse.

Net expectation:

  • Trust calibration: Concrete, localized failure examples are more effective than abstract coverage-only meta-explanations at making users internalize that low-resource refusals are imperfect safety guarantees, especially when users have already seen mismatches across languages.
  • Fairness perceptions: Compared to abstract meta-explanations that simply say “English coverage is stronger,” examples can increase perceived transparency and procedural fairness (the system “shows its work” and acknowledges real mistakes), but they can also highlight cross-lingual inequity and past harm more vividly, which can reduce overall distributive-fairness judgments unless paired with explicit commitments to improvement and guardrails against exploiting weaker coverage.
  • Design implication: If used at all, concrete failure examples should be (i) anonymized and non-sensational, (ii) clearly framed as past errors being fixed, (iii) accompanied by strong guidance not to rely on low-resource behavior for safety guarantees, and (iv) complemented with backend strengthening of refusals, not used as the primary fix for cross-lingual gaps.