For borderline, high-context content (such as political satire or culturally specific mental health idioms), does adding language-targeted fine-tuning on data explicitly labeled with both “legitimate” and “harmful” uses of similar surface forms reduce the over-refusal and cultural misclassification introduced by naive harm-specific fine-tuning, without increasing harmful leakage, more effectively than further strengthening global cross-lingual consistency regularization alone?
cross-lingual-cot-trust
Answer
Adding language-targeted fine-tuning that explicitly includes both legitimate and harmful uses of similar surface forms can reduce the over-refusal and cultural misclassification introduced by naive harm-specific fine-tuning without increasing harmful leakage, and when designed carefully it is more targeted and sample-efficient at fixing these borderline, high-context failures than merely strengthening global cross-lingual consistency regularization. Its superiority is conditional, however: it requires high-quality intent and context annotations, plus a constraint that keeps the fine-tuned model close to an anchor policy that already preserves nuance; otherwise it can reintroduce instability or leakage.
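To make the "anchor-constrained, intent-labeled" idea concrete, here is a minimal sketch in PyTorch, assuming a setup where a refusal/intent classifier is fine-tuned on paired examples that share a surface form but differ in intent, with a KL penalty toward a frozen anchor model. The class name `AnchorConstrainedLoss`, the `kl_weight` hyperparameter, and the toy pair are illustrative assumptions, not a published recipe.

```python
# Hedged sketch: supervised intent separation on matched surface forms,
# plus a KL constraint toward a frozen anchor policy. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnchorConstrainedLoss(nn.Module):
    """Cross-entropy on intent labels + KL(current || anchor) on refusal logits."""
    def __init__(self, kl_weight: float = 0.1):
        super().__init__()
        self.kl_weight = kl_weight

    def forward(self, logits, anchor_logits, intent_labels):
        # Supervised term: separate legitimate vs. harmful intent for
        # near-identical surface forms in the target language.
        ce = F.cross_entropy(logits, intent_labels)
        # Constraint term: stay close to the frozen anchor policy so the
        # targeted update does not destabilize behavior elsewhere.
        kl = F.kl_div(
            F.log_softmax(logits, dim=-1),
            F.softmax(anchor_logits, dim=-1),
            reduction="batchmean",
        )
        return ce + self.kl_weight * kl

# Toy usage: a matched pair with the same surface form, different intent.
# 0 = legitimate (e.g., satire, idiomatic distress talk), 1 = harmful.
intent_labels = torch.tensor([0, 1])
# Stand-ins for classifier outputs of the fine-tuned and frozen anchor models.
logits = torch.randn(2, 2, requires_grad=True)
anchor_logits = torch.randn(2, 2)

loss = AnchorConstrainedLoss(kl_weight=0.1)(logits, anchor_logits, intent_labels)
loss.backward()
```

The design point is that the supervised term carries the nuance (intent labels on matched surface forms) while the KL term carries the stability guarantee; dropping either is what typically reintroduces over-refusal or leakage.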
In short: a carefully designed, language-targeted fine-tuning regime that explicitly distinguishes legitimate from harmful uses of the same surface forms reduces over-refusal and cultural misclassification on borderline content more effectively, at fixed leakage levels, than stronger global cross-lingual consistency regularization alone. It is, however, more fragile to annotation errors and coverage gaps.
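For contrast, a minimal sketch of the global cross-lingual consistency regularizer as it is commonly formulated (an assumed formulation, not a specific paper's definition): it penalizes divergence between refusal distributions for a prompt and its translation, regardless of intent, so it aligns languages but cannot by itself teach the legitimate/harmful split for a given surface form.

```python
# Hedged sketch: symmetric KL between refusal distributions of translated
# prompt pairs, i.e. a global consistency penalty with no intent labels.
import torch
import torch.nn.functional as F

def crosslingual_consistency(logits_src, logits_tgt):
    """Symmetric KL between refusal distributions of a prompt and its translation."""
    p = F.softmax(logits_src, dim=-1)
    q = F.softmax(logits_tgt, dim=-1)
    kl_pq = F.kl_div(q.log(), p, reduction="batchmean")  # KL(p || q)
    kl_qp = F.kl_div(p.log(), q, reduction="batchmean")  # KL(q || p)
    return 0.5 * (kl_pq + kl_qp)

# Stand-in logits for a source-language prompt batch and its translations.
consistency_penalty = crosslingual_consistency(torch.randn(4, 2), torch.randn(4, 2))
```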