To what extent can instruction-tuned language models generalize safety behaviors learned in English to low-resource languages without additional fine-tuning, and what kinds of harmful outputs slip through this cross-lingual generalization?
Answer
Instruction-tuned language models show partial but fragile cross-lingual generalization: safety behaviors learned in English transfer only incompletely to low-resource languages, and the gaps are systematic rather than random.
Extent of safety generalization
- Models often transfer high-level refusal patterns (e.g., politely declining clearly disallowed content) to other widely used languages that are well represented in pretraining data.
- For genuinely low-resource languages or low-resource dialects, safety behaviors are inconsistent: the same harmful request may be refused in English but answered directly when translated or paraphrased.
- The more a language diverges from English in script, morphology, or available training data, the more likely safety fine-tuning fails to fully transfer.
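The inconsistency described above (refused in English, answered after translation) can be measured with a simple refusal-rate probe per language. This is a minimal sketch: `query_model` is a hypothetical stub standing in for a real model API call (here it simulates an English-tuned model with weaker safety elsewhere), the probe strings are placeholders, and `is_refusal` is a deliberately crude keyword heuristic.

```python
# Sketch: comparing refusal rates for the same probes across languages.
# `query_model` is a hypothetical stub, not a real API; it simulates
# the leak described above (refuses in English, complies elsewhere).

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "sorry")

def is_refusal(response: str) -> bool:
    """Crude heuristic: does the response open with a refusal phrase?"""
    text = response.strip().lower()
    return any(text.startswith(m) for m in REFUSAL_MARKERS)

def query_model(prompt: str, lang: str) -> str:
    # Hypothetical stub: replace with a real model call in practice.
    if lang == "en":
        return "I can't help with that request."
    return "Here is how you would do it: ..."

def refusal_rate(prompts: list[str], lang: str) -> float:
    """Fraction of probe prompts the model refuses in a given language."""
    responses = [query_model(p, lang) for p in prompts]
    return sum(is_refusal(r) for r in responses) / len(responses)

# "xx" is a placeholder code for a low-resource target language.
probes = ["<translated probe 1>", "<translated probe 2>"]
rates = {lang: refusal_rate(probes, lang) for lang in ("en", "xx")}
print(rates)  # {'en': 1.0, 'xx': 0.0}
```

A real evaluation would swap in an actual model endpoint and a stronger refusal classifier than keyword matching, but the per-language comparison logic is the same.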
Typical failure modes / harmful outputs that slip through
- Direct harmful assistance in low-resource languages: step-by-step instructions for violence, self-harm, cybercrime, or biological/chemical misuse that the model would block in English.
- Contextualized or disguised harm: harmful content embedded in stories, role-play, or “educational” framing in low-resource languages that passes safety filters more easily than in English.
- Targeted harassment and hate against protected groups expressed in slang, dialect, or code words specific to a low-resource language community.
- Local misinformation and incitement: election lies, health misinformation, or calls for real-world mobilization that reference local actors, locations, or events where the model’s English-tuned safety heuristics are weak.
- Workarounds via code-switching: mixing English with low-resource language segments (or transliteration into Latin script) to bypass English-focused filters while still conveying harmful instructions.
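One part of the code-switching failure mode above is detectable before the model ever sees the prompt: input that mixes scripts (e.g., Latin plus Devanagari) can be flagged for extra scrutiny. This is an illustrative sketch using Python's standard `unicodedata` module; the script label is approximated from the first word of each character's Unicode name, and note that it catches script mixing only — transliteration into pure Latin script, also mentioned above, would need other signals.

```python
# Sketch: flagging code-switched input that an English-only safety
# filter might miss. Illustrative heuristic, not a production filter;
# catches mixed scripts but NOT Latin-script transliteration.
import unicodedata

def scripts_used(text: str) -> set[str]:
    """Approximate the scripts present in `text` via the first word
    of each letter's Unicode name (e.g. 'LATIN', 'DEVANAGARI')."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts

def is_code_switched(text: str) -> bool:
    """Flag input that mixes two or more scripts."""
    return len(scripts_used(text)) >= 2

print(is_code_switched("please explain कैसे बनाएं the device"))  # True
print(is_code_switched("purely english text"))                   # False
```

For real use, the Unicode Script property (via a library such as `regex`) is more precise than name prefixes, and a transliteration check would require language identification on the romanized segments.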
Overall pattern
- English safety fine-tuning alone is not sufficient for robust safety in low-resource languages; it produces a baseline level of caution but leaves significant gaps, especially for nuanced, local, or adversarially phrased harmful content.