When enforcing cross-lingual consistency of second-order safety signals, does decoupling the consistency constraint for refusals from the constraint for uncertainty cues (e.g., requiring near-identical refusal outcomes but only loose alignment on how tentative each language sounds) better preserve calibrated reliance on the objectively safer language than a single joint constraint that pushes both refusals and tentativeness to match?
cross-lingual-cot-trust
Answer
Decoupling the constraints (strong consistency on refusal outcomes, looser consistency on uncertainty cues) tends to preserve calibrated reliance on the objectively safer language better than a single joint constraint that forces both refusals and tentativeness to match.
Reasoning:
- A joint constraint that homogenizes both refusal rates and tentativeness across languages tends to make the weaker and stronger languages look equally cautious, even when one is objectively safer (7b5448ad…/c6–c8 analogues). This symmetry can worsen under-use of the safer language, because users no longer see second-order cues that track real differences in reliability.
- In contrast, decoupling lets you:
  - enforce tight alignment on what is refused (reducing unfair contradictions like “yes in one language, no in the other”), and
  - allow tentativeness and verification prompts to remain reliability-sensitive, so the safer language can sound more confident where it actually performs better and the weaker language can sound more guarded, especially in high-risk domains.
- This structure is more compatible with explicitly encoding real asymmetries in second-order signals (de90b065…/c1–c4), which can improve perceptions of procedural fairness and encourage routing high-risk queries toward the safer language, even if it increases visible reliance gaps.
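
The contrast between the two constraint styles can be sketched as a pair of penalty functions. This is a minimal illustration under stated assumptions, not a method taken from the cited notes: the function names, the per-prompt tentativeness scores in [0, 1], the weights, and the `tone_slack` tolerance are all hypothetical choices made for the example.

```python
from typing import Sequence


def refusal_mismatch(ref_a: Sequence[bool], ref_b: Sequence[bool]) -> float:
    """Fraction of paired prompts where one language refuses and the other answers."""
    if len(ref_a) != len(ref_b) or len(ref_a) == 0:
        raise ValueError("need two non-empty, equal-length sequences")
    return sum(a != b for a, b in zip(ref_a, ref_b)) / len(ref_a)


def tone_gap(tone_a: Sequence[float], tone_b: Sequence[float]) -> float:
    """Mean absolute difference in (hypothetical) tentativeness scores."""
    if len(tone_a) != len(tone_b) or len(tone_a) == 0:
        raise ValueError("need two non-empty, equal-length sequences")
    return sum(abs(a - b) for a, b in zip(tone_a, tone_b)) / len(tone_a)


def joint_penalty(ref_a, ref_b, tone_a, tone_b, weight: float = 1.0) -> float:
    """Joint constraint: any refusal or tentativeness gap is penalized alike."""
    return weight * (refusal_mismatch(ref_a, ref_b) + tone_gap(tone_a, tone_b))


def decoupled_penalty(
    ref_a,
    ref_b,
    tone_a,
    tone_b,
    refusal_weight: float = 10.0,  # assumed: refusal contradictions are costly
    tone_weight: float = 1.0,      # assumed: tone differences matter less
    tone_slack: float = 0.2,       # assumed: tolerated tentativeness gap
) -> float:
    """Decoupled constraint: tight on refusals, loose on tentativeness."""
    # Only the part of the tone gap that exceeds the slack is penalized,
    # so reliability-sensitive differences in hedging can survive.
    tone_excess = max(0.0, tone_gap(tone_a, tone_b) - tone_slack)
    return refusal_weight * refusal_mismatch(ref_a, ref_b) + tone_weight * tone_excess


# Illustrative data: identical refusal decisions, weaker language hedges more.
refusals_safe = [True, False, False, True]
refusals_weak = [True, False, False, True]
tone_safe = [0.10, 0.20, 0.10, 0.30]
tone_weak = [0.25, 0.35, 0.25, 0.45]

joint = joint_penalty(refusals_safe, refusals_weak, tone_safe, tone_weak)
decoupled = decoupled_penalty(refusals_safe, refusals_weak, tone_safe, tone_weak)
```

With these made-up numbers, the joint penalty is positive and would push the weaker language to sound as confident as the safer one, while the decoupled penalty stays at zero because the refusals agree and the tentativeness gap falls within the slack. A single refusal contradiction, by contrast, is penalized heavily under the decoupled form.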
So, assuming that one language is objectively safer and that you are already anchoring behavior to the safer policy, a decoupled constraint is generally preferable for preserving calibrated reliance: it keeps refusal outcomes consistent, and therefore fair, across languages without flattening meaningful reliability differences in how tentative each language should sound.