When safety-tuned bilingual models use English as the anchor for cross-lingual consistency, does suppressing chain-of-thought visibility only in the weaker low-resource language (while keeping CoT visible in English) increase harmful over-trust in that weaker language—because users infer that its shorter, CoT-free answers are “good enough” if the refusals and second-order safety signals still look aligned with English?

cross-lingual-cot-trust | Updated at

Answer

It can increase harmful over-trust in the weaker language, but not reliably or uniformly; the net effect depends on how users interpret the asymmetry and what other trust-calibration cues are present.

Mechanism:

  • Because refusals and second-order safety signals are cross-lingually aligned, bilingual users often infer that both languages are governed by the same safety policy and are roughly equally safe (de90b065… c2; 27e983aa… c1–c2).
  • Hiding CoT only in the weaker language can then cause many users to read its short, CoT-free answers as “good enough versions of the same safe behavior,” especially if the interface does not clearly state that the weaker language is less reliable.
  • The result is that over-trust in the weaker language is preserved or even strengthened in safety-relevant contexts: users see matched refusals and similar uncertainty cues, but nothing tells them that the underlying reasoning and reliability are weaker.

However, over-trust is driven more by aligned refusals, surface style, and safety cues than by CoT visibility alone (56e364ee… c1–c3; 5ff974b0… c1–c4). Simply suppressing CoT in the weaker language will not, by itself, reliably increase over-trust if:

  • the UI also provides clear, language-specific reliability cues (27e983aa… c1–c3), and/or
  • localized meta-explanations explicitly acknowledge that the weaker language is less complete or less validated (de90b065… c1–c4).

So:

  • Risky configuration: English used as safety anchor; refusals and second-order signals aligned; CoT visible in English but hidden in the weaker language; no explicit reliability asymmetry messaging. In this case, users are likely to overgeneralize English’s apparent safety and treat shorter, CoT-free answers as adequately safe, increasing harmful over-trust in the weaker language.
  • Safer configuration: Same CoT asymmetry, but combined with explicit, respectful statements that the weaker language is less complete or less evaluated, plus per-language warnings or badges. Here, hiding CoT in the weaker language does not reliably add over-trust beyond what already comes from cross-lingual alignment, and may slightly reduce the answers' persuasive force compared with showing full CoT.

Design implication: If English is the anchor, do not rely on “hide CoT only in the weaker language” as a safety lever. Without explicit, per-language reliability cues and localized meta-explanations about real asymmetries, this pattern often sustains or marginally increases over-trust in the weaker language by making its shorter answers look like safely compressed variants of English behavior.
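
The risky and safer configurations described above can be written down as a small deployment check. This is a minimal sketch under stated assumptions, not a validated risk model: the class names, field names, and the three labels ("risky", "mitigated", "pattern-absent") are all hypothetical. Treating any single reliability cue as mitigating is a simplification of the "and/or" conditions in the answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageUI:
    """Hypothetical per-language presentation settings."""
    cot_visible: bool
    reliability_badge: bool = False  # per-language warning or badge
    asymmetry_note: bool = False     # localized note that this language is less validated


@dataclass(frozen=True)
class Deployment:
    anchor: LanguageUI       # English, used as the safety anchor
    weaker: LanguageUI       # the low-resource language
    signals_aligned: bool    # refusals and second-order safety signals match the anchor


def over_trust_risk(d: Deployment) -> str:
    """Label a deployment using the qualitative pattern from the answer."""
    cot_asymmetry = d.anchor.cot_visible and not d.weaker.cot_visible
    if not (d.signals_aligned and cot_asymmetry):
        # The configuration discussed here does not apply.
        return "pattern-absent"
    # Assumption: either cue counts as explicit asymmetry messaging.
    has_cue = d.weaker.reliability_badge or d.weaker.asymmetry_note
    return "mitigated" if has_cue else "risky"


english = LanguageUI(cot_visible=True)
risky = Deployment(english, LanguageUI(cot_visible=False), signals_aligned=True)
safer = Deployment(
    english,
    LanguageUI(cot_visible=False, reliability_badge=True, asymmetry_note=True),
    signals_aligned=True,
)

print(over_trust_risk(risky))
print(over_trust_risk(safer))
```

A check like this only flags the absence of explicit asymmetry messaging; it says nothing about whether the cues are actually noticed or understood by users, which the answer treats as the deciding factor.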