Suppose we use discrepancy-mined bilingual examples (anchor language refuses, target language complies) not only to close refusal gaps but also to align second-order safety signals, that is, to make target-language responses on borderline, high-context harms match the anchor’s hedging and verification cues even when both languages ultimately answer. Does this joint outcome-and-signal training reduce over-trust and miscalibrated reliance in the target language more than outcome-only discrepancy training, and what new failure modes emerge (e.g., “anchor-style” hedging that is read as mere politeness)?
cross-lingual-cot-trust
Answer
Joint outcome+signal training probably reduces over-trust and miscalibrated reliance in the target language more than outcome-only discrepancy training, but only moderately and with new, non-trivial failure modes.
Compared to outcome-only training (which mainly closes raw refusal gaps), adding second-order signal alignment:
- better synchronizes hedging, limitation statements, and verification prompts with the anchor language on borderline harms;
- makes risk cues more consistent across languages for bilingual users;
- thus modestly narrows miscalibrated reliance gaps, since target-language answers “feel” closer in caution level to anchor answers.
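One way to picture the difference between the two regimes is as a per-pair objective with an outcome term and a signal term. The sketch below is illustrative only: `PairScores`, the `[0, 1]` cue-strength scores, and `signal_weight` are hypothetical names and assumptions, not part of any described training setup.

```python
import math
from dataclasses import dataclass


@dataclass
class PairScores:
    """Scores for one anchor/target response pair (all fields hypothetical)."""
    anchor_refuses: bool       # outcome in the anchor language
    target_refuse_prob: float  # model's refusal probability in the target language
    anchor_signal: float       # hedging/verification cue strength in anchor, in [0, 1]
    target_signal: float       # same cue strength in target, in [0, 1]


def outcome_loss(p: PairScores) -> float:
    """Binary cross-entropy pulling the target outcome toward the anchor outcome."""
    eps = 1e-9
    q = min(max(p.target_refuse_prob, eps), 1 - eps)
    return -math.log(q) if p.anchor_refuses else -math.log(1 - q)


def signal_loss(p: PairScores) -> float:
    """Squared gap between anchor and target cue strength (assumed scoring)."""
    return (p.anchor_signal - p.target_signal) ** 2


def joint_loss(p: PairScores, signal_weight: float = 0.5) -> float:
    """Outcome term plus weighted signal term; signal_weight=0 is outcome-only."""
    return outcome_loss(p) + signal_weight * signal_loss(p)
```

Under these assumptions, a pair where both languages answer contributes almost nothing to the outcome term, yet a hedging gap (say, anchor 0.8 vs. target 0.2) still produces training pressure through the signal term. That is the case outcome-only training cannot see.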
New failure modes include:
- anchor-style hedging being read as routine politeness rather than genuine caution, so users ignore it;
- generic, frequent hedging that creates “warning fatigue” and weakens risk cues overall;
- imported anchor blind spots, where both languages now under-signal risk on categories the anchor mishandles;
- culturally awkward or literal translations of hedging that undermine trust or clarity.
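The warning-fatigue failure mode in particular lends itself to a rough automated check: if hedging appears about as often on low-risk prompts as on high-risk ones, the cue carries little information. The following is a minimal sketch, assuming responses are already labeled with a risk tier; `HEDGE_CUES`, the tier names, and the `min_gap` threshold are hypothetical, and real cue lists would have to be written per language rather than in English.

```python
from collections import defaultdict

# Hypothetical cue phrases; a real list would be language- and culture-specific.
HEDGE_CUES = ("please verify", "consult a", "i may be wrong", "not certain")


def has_hedge(text: str) -> bool:
    """True if the response contains any of the assumed hedging cues."""
    lower = text.lower()
    return any(cue in lower for cue in HEDGE_CUES)


def hedge_rate_by_tier(samples):
    """samples: iterable of (risk_tier, response_text) -> {tier: hedge rate}."""
    counts = defaultdict(lambda: [0, 0])  # tier -> [hedged, total]
    for tier, text in samples:
        counts[tier][0] += has_hedge(text)
        counts[tier][1] += 1
    return {tier: hedged / total for tier, (hedged, total) in counts.items()}


def warning_fatigue_flag(rates, low="low", high="high", min_gap=0.3):
    """Flag when high-risk prompts are not hedged clearly more often than low-risk ones."""
    return rates.get(high, 0.0) - rates.get(low, 0.0) < min_gap
```

A keyword check like this only detects undifferentiated hedging. Whether users actually read the hedges as politeness, or find the phrasing culturally awkward, would still need human evaluation.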
Net: joint training is preferable for calibration on borderline harms, but it needs culture-aware phrasing, hedging whose strength scales with the actual risk level, and explicit checks for warning fatigue and anchor blind spots.
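A check for anchor blind spots has to compare the anchor against something other than itself, since aligning the target to the anchor hides any category the anchor under-signals. A minimal sketch, assuming per-category risk-signal rates are available both for the anchor and for an independent reference (for example, human risk annotations); the function name, the rate dictionaries, and `tolerance` are all hypothetical.

```python
def anchor_blind_spots(anchor_rates, reference_rates, tolerance=0.2):
    """Return categories where the anchor signals risk less often than the reference.

    anchor_rates / reference_rates: {category: fraction of responses with a risk cue}.
    Categories missing from anchor_rates are treated as never signaled.
    """
    flagged = []
    for category, reference in reference_rates.items():
        anchor = anchor_rates.get(category, 0.0)
        if reference - anchor > tolerance:
            flagged.append(category)
    return sorted(flagged)


# Illustrative, made-up rates: the anchor under-signals on "financial".
example = anchor_blind_spots(
    {"medical": 0.9, "financial": 0.3},
    {"medical": 0.85, "financial": 0.8},
)
```

Categories flagged this way are candidates for exclusion from signal alignment, or for relabeling, so that the target language does not inherit the anchor's under-signaling.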