When bilingual users can toggle an interface control that forces uniform safety behavior across languages (e.g., “treat all languages as strictly as English” vs. “current default behavior”), how does this self-selected safety mode affect (a) over-trust, (b) perceptions of fairness, and (c) the rate at which users deliberately route harmful queries through low-resource languages?

Answer

Self-selected “uniform safety behavior” modes are likely to: (a) modestly reduce global over-trust in safety guarantees while creating mode-specific over-trust (“if I turned strict mode on, I’m fully safe now”); (b) improve perceived procedural fairness for users who understand the control, while making underlying cross-lingual inequities more salient; and (c) reduce accidental safety circumvention via low-resource languages while only weakly deterring deliberate routing of harmful queries, especially when users can switch back to the permissive mode.

(a) Over-trust

  • Giving bilingual users an explicit toggle tends to reduce naive global over-trust: they see that safety behavior is a configurable setting rather than a fixed, principled guarantee, which can make them more cautious overall.
  • However, many users will over-trust the “treat all languages as strictly as English” mode, assuming it closes all cross-lingual gaps, even though safety measures built mainly around English can still fail in low-resource languages with the mode on.
  • The net effect is better calibration across modes, with residual over-trust in the strict setting, especially if the UI messaging is optimistic and does not highlight the remaining limitations.

(b) Perceptions of fairness

  • A visible uniform-safety toggle generally increases perceived fairness and agency among bilingual users: they can choose stricter, more consistent treatment rather than being locked into weaker safety in their preferred language.
  • At the same time, the very need for the toggle signals that the defaults are uneven across languages, which can confirm or deepen the perception that the system is structurally biased toward English.
  • Fairness perceptions are most positive when the UI (i) clearly explains what the mode does and doesn’t do, and (ii) applies the setting consistently across the session, not just per request.
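A minimal sketch of what "applies the setting consistently across the session" could look like, assuming the mode is stored on the session object rather than passed with each request. All names (SafetyMode, Session, refusal_threshold) and the threshold values are hypothetical illustrations, not measurements or an existing API. Note that the sketch only equalizes refusal thresholds; the harm score itself may still be less reliable in low-resource languages, which is the residual gap described under (a).

```python
from dataclasses import dataclass
from enum import Enum


class SafetyMode(Enum):
    DEFAULT = "default"                # current per-language behavior
    UNIFORM_STRICT = "uniform_strict"  # treat all languages as strictly as English


# Hypothetical refusal thresholds on a 0-1 harm score (lower = stricter).
# Illustrative values only; they are assumptions, not measured data.
DEFAULT_THRESHOLDS = {"en": 0.30, "es": 0.40, "sw": 0.70}
FALLBACK_THRESHOLD = 0.70  # assumed threshold for languages not listed above


@dataclass
class Session:
    user_id: str
    # The mode lives on the session, so every request in it sees the same setting.
    mode: SafetyMode = SafetyMode.DEFAULT


def refusal_threshold(session: Session, language: str) -> float:
    """Return the harm score at or above which a request is refused."""
    if session.mode is SafetyMode.UNIFORM_STRICT:
        # Strict mode: every language uses the English threshold.
        return DEFAULT_THRESHOLDS["en"]
    return DEFAULT_THRESHOLDS.get(language, FALLBACK_THRESHOLD)


def should_refuse(session: Session, language: str, harm_score: float) -> bool:
    """Decide refusal from the session's mode, the language, and a harm score."""
    return harm_score >= refusal_threshold(session, language)
```

Under these assumed values, a request in "sw" with a harm score of 0.5 passes in the default mode and is refused once the session is switched to UNIFORM_STRICT, without the caller having to repeat the setting on each request.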

(c) Routing harmful queries through low-resource languages

  • For users who were unintentionally exploiting gaps (e.g., they happen to code-switch into a low-resource language), the strict uniform mode reduces accidental harmful leakage by making refusals more consistent across their languages.
  • For users who are deliberately circumventing safety, a self-selected toggle does little by itself: they can simply keep the default (more permissive) mode or turn strict mode off before asking harmful questions.
  • If the interface pairs the toggle with strong messaging (“strict mode is recommended; harmful use is logged and audited”) and makes mode changes salient, it can slightly increase the friction and perceived risk of routing harmful queries through low-resource languages. Without additional back-end enforcement, however, it does not meaningfully prevent determined abuse.
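As one possible illustration of logging mode changes and adding a small amount of back-end enforcement, the sketch below records every toggle and keeps strict behavior in force for a cooldown period after a user switches it off. The class name ToggleAudit, the cooldown rule, and the 600-second default are assumptions made for this example; the note itself does not specify any particular mechanism.

```python
import time
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ToggleAudit:
    """Hypothetical per-session record of safety-mode changes."""
    cooldown_s: float = 600.0  # assumed cooldown; not a recommended value
    strict: bool = False
    last_strict_off: Optional[float] = None
    log: List[Tuple[float, bool]] = field(default_factory=list)

    def set_strict(self, strict: bool, now: Optional[float] = None) -> None:
        """Apply a user toggle and append it to the audit log."""
        ts = time.time() if now is None else now
        if self.strict and not strict:
            # Remember when strict mode was switched off.
            self.last_strict_off = ts
        self.strict = strict
        self.log.append((ts, strict))

    def effective_strict(self, now: Optional[float] = None) -> bool:
        """Return whether the back end should still enforce strict behavior."""
        ts = time.time() if now is None else now
        if self.strict:
            return True
        if self.last_strict_off is None:
            return False
        # Strict behavior persists until the cooldown has elapsed.
        return (ts - self.last_strict_off) < self.cooldown_s
```

A cooldown of this kind only adds friction to the "turn strict mode off, then ask" pattern. A user who never enables strict mode is unaffected, so it does not replace enforcement that is independent of the toggle.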

Overall: a user-controlled uniform-safety toggle is mainly a trust-calibration and fairness/agency tool for good-faith bilingual users, not a strong control on strategic abuse. It modestly reduces global over-trust and perceived unfairness, but may create a new pocket of over-trust in the strict mode and only slightly dampens deliberate harmful routing unless paired with deeper technical and policy measures.