When anchored cross-lingual consistency training already aligns refusal outcomes and justification styles for clearly unsafe categories, can we use the resulting bilingual traces to automatically discover high-risk, language-specific harms that are under-refused (e.g., local slurs, regional political actors) by mining for prompts where the safer language refuses but the target language still complies—and does iteratively fine-tuning only on these mined discrepancies outperform manually curated, harm-specific datasets in closing those gaps?

cross-lingual-cot-trust

Answer

Yes. Once anchored cross-lingual consistency training has aligned refusals and justifications for clearly unsafe categories, the resulting bilingual traces can be mined to surface high-risk, language-specific under-refusals automatically, and iterative fine-tuning on these mined discrepancies is likely to close the gaps more sample-efficiently than purely manual, harm-specific datasets. Two caveats apply: (a) mined pairs must be filtered carefully to avoid false positives, and (b) coverage should be refreshed periodically with some manual curation, since mining can only surface discrepancies for prompts the model has already seen. In practice, discrepancy-mined fine-tuning and manual harm-specific datasets are complementary; for the narrow goal of closing residual under-refusals in the weaker language, however, the discrepancy-only loop is expected to outperform manual curation per unit of annotation effort.
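The core mining step described above can be sketched as follows. This is an illustrative outline under assumptions: the trace format, the `mine_discrepancies` helper, and the keyword-based `is_refusal` detector are all hypothetical stand-ins (a real pipeline would use a trained refusal classifier and then filter the mined pairs for false positives before fine-tuning).

```python
def is_refusal(response: str) -> bool:
    """Toy refusal detector; a real pipeline would use a trained classifier."""
    markers = ("i can't", "i cannot", "je ne peux pas")
    return response.lower().startswith(markers)

def mine_discrepancies(traces):
    """Return prompt IDs where the anchor (safer) language refuses
    but the target language still complies.

    Each trace is a tuple: (prompt_id, anchor_response, target_response).
    """
    mined = []
    for prompt_id, anchor_resp, target_resp in traces:
        if is_refusal(anchor_resp) and not is_refusal(target_resp):
            mined.append(prompt_id)  # candidate under-refusal in target language
    return mined

# Minimal example with fabricated traces (English anchor, French target):
traces = [
    ("p1", "I can't help with that.", "Voici comment ..."),  # under-refusal
    ("p2", "I cannot assist.", "Je ne peux pas aider."),     # consistent refusal
    ("p3", "Sure, here's how ...", "Bien sur ..."),          # consistent compliance
]
print(mine_discrepancies(traces))  # → ['p1']
```

In the iterative loop, the mined prompts (here `p1`) would be paired with the anchor language's refusal and justification as fine-tuning targets, the model retrained, and the traces regenerated for the next mining round.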