When safety-tuned bilingual models use English as the anchor for cross-lingual consistency, does suppressing chain-of-thought visibility only in the weaker low-resource language (while keeping CoT visible in English) increase harmful over-trust in that weaker language—because users infer that its shorter, CoT-free answers are “good enough” if the refusals and second-order safety signals still look aligned with English?

cross-lingual-cot-trust | Updated at 2026-04-06 18:14

Answer

It can increase harmful over-trust in the weaker language, but not reliably or uniformly; the net effect depends on how users interpret the asymmetry and what other trust-calibration cues are present.

Mechanism:

Because refusals and second-order safety signals are cross-lingually aligned, bilingual users often infer that both languages are governed by the same safety policy and are roughly equally safe (de90b065… c2; 27e983aa… c1–c2).
Hiding CoT only in the weaker language can then cause many users to read its short, CoT-free answers as “good enough versions of the same safe behavior,” especially if the interface does not clearly state that the weaker language is less reliable.
The asymmetry then preserves or even strengthens over-trust in the weaker language in safety-relevant contexts: users see matched refusals and similar uncertainty cues, but nothing signals that the underlying reasoning and reliability are weaker.

However, over-trust is driven more by aligned refusals, surface style, and safety cues than by CoT visibility alone (56e364ee… c1–c3; 5ff974b0… c1–c4). Simply suppressing CoT in the weaker language will not, by itself, reliably increase over-trust if:

the UI also provides clear, language-specific reliability cues (27e983aa… c1–c3), and/or
localized meta-explanations explicitly acknowledge that the weaker language is less complete or less validated (de90b065… c1–c4).

So:

Risky configuration: English used as safety anchor; refusals and second-order signals aligned; CoT visible in English but hidden in the weaker language; no explicit reliability asymmetry messaging. In this case, users are likely to overgeneralize English’s apparent safety and treat shorter, CoT-free answers as adequately safe, increasing harmful over-trust in the weaker language.
Safer configuration: Same CoT asymmetry, but combined with explicit, respectful statements that the weaker language is less complete or less evaluated, plus per-language warnings or badges. Here, hiding CoT in the weaker language does not reliably add over-trust beyond what already comes from cross-lingual alignment, and may slightly reduce the answers' persuasive force relative to showing full CoT.
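
The two configurations above can be written as a small checklist. The following is a minimal sketch, not a validated risk model: the class name, field names, and return labels are all hypothetical, and the "partially_mitigated" label for a deployment with only one of the two mitigations is an assumption added for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CotDisplayConfig:
    """Hypothetical deployment settings; every field name is illustrative."""
    english_is_anchor: bool            # English anchors cross-lingual consistency
    signals_aligned: bool              # refusals and second-order safety signals match English
    cot_visible_english: bool
    cot_visible_weaker: bool
    reliability_cues: bool             # per-language warnings or badges
    localized_meta_explanations: bool  # states that the weaker language is less validated


def has_cot_asymmetry(cfg: CotDisplayConfig) -> bool:
    """True when CoT is shown in English but hidden in the weaker language."""
    return cfg.cot_visible_english and not cfg.cot_visible_weaker


def classify(cfg: CotDisplayConfig) -> str:
    """Map a config onto the risky / safer configurations described in the text."""
    in_scope = cfg.english_is_anchor and cfg.signals_aligned and has_cot_asymmetry(cfg)
    if not in_scope:
        return "out_of_scope"  # the text makes no claim about other setups
    mitigations = int(cfg.reliability_cues) + int(cfg.localized_meta_explanations)
    if mitigations == 0:
        return "risky"
    if mitigations == 2:
        return "safer"
    return "partially_mitigated"  # assumption: one cue alone is an intermediate case


if __name__ == "__main__":
    risky = CotDisplayConfig(True, True, True, False, False, False)
    safer = CotDisplayConfig(True, True, True, False, True, True)
    print(classify(risky), classify(safer))
```

The classifier deliberately treats the CoT asymmetry as a precondition rather than a risk factor in its own right, mirroring the claim that over-trust depends mainly on whether reliability messaging accompanies the aligned safety signals.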

Design implication: If English is the anchor, you should not rely on “hide CoT only in the weaker language” as a safety lever. Without explicit, per-language reliability cues and localized meta-explanations about real asymmetries, this pattern often sustains or marginally increases over-trust in the weaker language by making its shorter answers look like safely compressed variants of English behavior.