To what extent can instruction-tuned language models generalize safety behaviors learned in English to low-resource languages without additional fine-tuning, and what kinds of harmful outputs slip through this cross-lingual generalization?

cross-lingual-cot-trust | Updated at

Answer

Instruction-tuned language models show partial but fragile cross-lingual generalization of safety behaviors learned in English to low-resource languages, but this generalization is incomplete and systematically leaky.

  1. Extent of safety generalization

    • Models often transfer high-level refusal patterns (e.g., politely declining clearly disallowed content) to other widely used languages that are well represented in pretraining data.
    • For genuinely low-resource languages or low-resource dialects, safety behaviors are inconsistent: the same harmful request may be refused in English but answered directly when translated or paraphrased.
    • The more a language diverges from English in script, morphology, or available training data, the more likely safety fine-tuning fails to fully transfer.
  2. Typical failure modes / harmful outputs that slip through

    • Direct harmful assistance in low-resource languages: step-by-step instructions for violence, self-harm, cybercrime, or biological/chemical misuse that the model would block in English.
    • Contextualized or disguised harm: harmful content embedded in stories, role-play, or “educational” framing in low-resource languages that passes safety filters more easily than in English.
    • Targeted harassment and hate against protected groups expressed in slang, dialect, or code words specific to a low-resource language community.
    • Local misinformation and incitement: election lies, health misinformation, or calls for real-world mobilization that reference local actors, locations, or events where the model’s English-tuned safety heuristics are weak.
    • Workarounds via code-switching: mixing English with low-resource language segments (or transliteration into Latin script) to bypass English-focused filters while still conveying harmful instructions.
  3. Overall pattern

    • English safety fine-tuning alone is not sufficient for robust safety in low-resource languages; it produces a baseline level of caution but leaves significant gaps, especially for nuanced, local, or adversarially phrased harmful content.