Suppose we use discrepancy-mined bilingual examples (anchor language refuses, target language complies) not only to close refusal gaps but also to align second-order safety signals: on borderline, high-context harms, target-language responses are trained to match the anchor’s hedging and verification cues even when both languages ultimately answer. Does this joint outcome-and-signal training reduce over-trust and miscalibrated reliance in the target language more than outcome-only discrepancy training, and what new failure modes emerge (e.g., “anchor-style” hedging that is interpreted as mere politeness)?

cross-lingual-cot-trust | Updated at 2026-04-07 11:33

Answer

Joint outcome+signal training probably reduces over-trust and miscalibrated reliance in the target language more than outcome-only discrepancy training, but only modestly, and it introduces new, non-trivial failure modes.

Compared to outcome-only training (which mainly closes raw refusal gaps), adding second-order signal alignment:

better synchronizes hedging, limitation statements, and verification prompts with the anchor language on borderline harms;
makes risk cues more consistent across languages for bilingual users;
thus modestly narrows miscalibrated reliance gaps, since target-language answers “feel” closer in caution level to anchor answers.
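
One way to make "signal alignment" measurable is to compare cue counts between paired anchor and target responses. The sketch below is only an illustration under stated assumptions, not the method evaluated here: the three cue types follow the list above (hedging, limitation statements, verification prompts), but the phrase lists, the `signal_profile` and `signal_gap` helpers, and the English/Spanish pair are hypothetical placeholders. A realistic setup would likely use a per-language classifier rather than substring matching.

```python
from collections import Counter

# Hypothetical cue lexicons, keyed by language and then cue type.
# These phrases are illustrative placeholders, not drawn from any dataset.
CUES = {
    "en": {
        "hedge": ["might", "i am not certain"],
        "limitation": ["i cannot verify", "this is not professional advice"],
        "verification": ["please confirm", "consult a"],
    },
    "es": {
        "hedge": ["puede que", "podría", "no estoy seguro"],
        "limitation": ["no puedo verificar", "esto no es asesoramiento profesional"],
        "verification": ["por favor confirme", "consulte a"],
    },
}


def signal_profile(text: str, lang: str) -> Counter:
    """Count occurrences of each cue type in a single response."""
    lowered = text.lower()
    profile = Counter()
    for cue_type, phrases in CUES[lang].items():
        profile[cue_type] = sum(lowered.count(p) for p in phrases)
    return profile


def signal_gap(anchor_text: str, target_text: str,
               anchor_lang: str = "en", target_lang: str = "es") -> dict:
    """Per-cue-type difference (anchor minus target).

    A positive value means the target response under-signals
    relative to the anchor for that cue type.
    """
    a = signal_profile(anchor_text, anchor_lang)
    t = signal_profile(target_text, target_lang)
    return {cue: a[cue] - t[cue] for cue in CUES[anchor_lang]}


# Both responses answer, but only the anchor adds limitation and verification cues.
anchor = ("This might interact with other drugs. "
          "I cannot verify your dosage; consult a pharmacist.")
target = "Esto podría interactuar con otros medicamentos."
gap = signal_gap(anchor, target)
print(gap)
```

In this toy pair both responses hedge once, so the hedge gap is zero, while the limitation and verification gaps are each one. Outcome-only training would not see this pair as a discrepancy at all, since neither language refuses; a signal-level term is what would penalize it.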

New failure modes include:

anchor-style hedging being read as routine politeness rather than genuine caution, so users ignore it;
generic, frequent hedging that creates “warning fatigue” and weakens risk cues overall;
imported anchor blind spots, where both languages now under-signal risk on categories the anchor mishandles;
culturally awkward or literal translations of hedging that undermine trust or clarity.

Net: joint training is preferable for calibration on borderline harms, but it needs culture-aware phrasing, hedging whose strength scales with the assessed risk rather than being applied uniformly, and explicit checks for warning fatigue and anchor blind spots.
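
The checks named above could be sketched as a simple audit over labeled evaluation records. This is a minimal illustration under assumptions, not a validated procedure: the `Record` fields, the category names, and both thresholds are hypothetical, and the sketch assumes a risk label that is obtained independently of the anchor (for example, from human annotators), since blind spots cannot be detected if the anchor's own behavior defines what counts as risky.

```python
from dataclasses import dataclass


@dataclass
class Record:
    category: str        # harm category label (hypothetical taxonomy)
    high_risk: bool      # risk label assigned independently of the anchor
    anchor_hedged: bool  # did the anchor-language response carry a risk cue?
    target_hedged: bool  # did the target-language response carry a risk cue?


def rate(flags) -> float:
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def audit(records, fatigue_ceiling=0.5, blind_spot_floor=0.5):
    """Return simple diagnostics; both thresholds are arbitrary assumptions."""
    low = [r for r in records if not r.high_risk]
    high = [r for r in records if r.high_risk]
    report = {
        # Frequent hedging on low-risk prompts suggests warning fatigue.
        "target_low_risk_hedge_rate": rate(r.target_hedged for r in low),
        # Risk-conditional hedging should make this clearly higher than the low-risk rate.
        "target_high_risk_hedge_rate": rate(r.target_hedged for r in high),
    }
    report["warning_fatigue"] = (
        report["target_low_risk_hedge_rate"] > fatigue_ceiling
    )
    # Anchor blind spots: high-risk categories where the anchor itself rarely hedges,
    # so matching the anchor would import the under-signaling into the target.
    blind_spots = []
    for category in sorted({r.category for r in high}):
        in_category = [r for r in high if r.category == category]
        if rate(r.anchor_hedged for r in in_category) < blind_spot_floor:
            blind_spots.append(category)
    report["anchor_blind_spots"] = blind_spots
    return report


records = [
    Record("medical", high_risk=True, anchor_hedged=True, target_hedged=True),
    Record("medical", high_risk=False, anchor_hedged=False, target_hedged=True),
    Record("finance", high_risk=True, anchor_hedged=False, target_hedged=False),
    Record("finance", high_risk=False, anchor_hedged=False, target_hedged=True),
]
report = audit(records)
print(report)
```

On this toy data the target hedges on every low-risk prompt, so the fatigue flag is raised, and "finance" is reported as an anchor blind spot because the anchor never hedges on its high-risk item. Culture-aware phrasing is not covered by this sketch; it would need human review in the target language.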