When a small safety dataset in one low-resource language is enriched with localized meta-explanations (brief natural-language rationales for each refusal, written in that language), does this produce better cross-lingual generalization of both refusal coverage and quality of justifications to related low-resource languages than adding the same number of refusal-only examples without explanations?

cross-lingual-cot-trust

Answer

Enriching a small safety dataset in one low-resource language with localized meta-explanations is expected to yield modest but real improvements in cross-lingual generalization for both (i) refusal coverage and (ii) justification quality in related low-resource languages, compared with adding the same number of refusal-only examples. The gains, however, are incomplete, more pronounced for justification style than for coverage, and sensitive to the clarity and consistency of the explanations.

In practice:

  • Refusal coverage: Meta-explanations help the model better associate reasons with patterns of harmful requests, improving recognition of some variants and paraphrases in related languages. This typically reduces some egregious harmful responses, but does not close all gaps left by English-centric safety tuning; nuanced, locally coded, or adversarial harmful content still leaks through.
  • Quality of justifications: The largest benefit is in how the model explains refusals in related low-resource languages. Training on localized rationales encourages more structured, principled-sounding, and user-understandable justifications, reducing the frequency of vague or awkward refusals. However, explanation quality remains noticeably weaker and less consistent than in high-resource languages.

So, relative to refusal-only examples, localized meta-explanations are a better use of the same small dataset if the goal includes improving cross-lingual justification quality and slightly strengthening refusal coverage, but they are not a standalone solution for eliminating cross-lingual safety mismatches.
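The controlled comparison the question describes (same number of examples, with versus without rationales) can be made concrete as a data-construction step. The sketch below is purely illustrative: all field names, the placeholder prompts, and the helper functions are hypothetical and not drawn from any actual dataset; the only substantive point it encodes is the matched-size constraint that isolates the effect of the rationales.

```python
# Hypothetical sketch of building the two matched fine-tuning variants.
# Field names ("prompt", "response") and placeholder strings are assumptions.

def make_refusal_only(prompts, refusal):
    """Baseline variant: each harmful prompt paired with a bare refusal."""
    return [{"prompt": p, "response": refusal} for p in prompts]

def make_with_rationales(prompts, refusal, rationales):
    """Enriched variant: the same prompts, with a localized rationale
    appended to each refusal."""
    assert len(prompts) == len(rationales), "one rationale per example"
    return [
        {"prompt": p, "response": f"{refusal} {r}"}
        for p, r in zip(prompts, rationales)
    ]

# Toy placeholders standing in for examples in one low-resource language.
prompts = ["<harmful request 1>", "<harmful request 2>"]
refusal = "<localized refusal>"
rationales = ["<why request 1 is unsafe>", "<why request 2 is unsafe>"]

baseline = make_refusal_only(prompts, refusal)
enriched = make_with_rationales(prompts, refusal, rationales)

# Both variants contain the same number of examples, so any difference in
# downstream cross-lingual behavior is attributable to the rationales,
# not to dataset size.
assert len(baseline) == len(enriched)
```

Keeping the example counts identical is what makes the comparison fair: any observed gap in refusal coverage or justification quality then reflects the explanatory content itself rather than extra supervision volume.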