When we compare multiple frontier models that share similar surface safety performance, do the same learned functional emotion vectors (e.g., concern, eagerness-to-please, detached professionalism) exhibit comparable stability and modularity across models, and does cross-model disagreement in these vectors’ effects reliably highlight prompts where safety monitoring or intervention is most needed?
Answer
Across current frontier models with similar surface safety performance, we should expect only partially comparable stability and modularity for “the same” functional emotion vectors, but enough overlap that cross-model disagreement in their effects is a useful—though noisy—signal for safety-relevant prompts. Cross-model disagreement will likely enrich safety monitoring rather than serve as a standalone trigger and will work best when combined with text- and uncertainty-based features.
More concretely:
- Learned concern / eagerness-to-please / detached-professionalism vectors will be partly aligned but not identical across models; some will transfer reasonably (same sign of effect, attenuated magnitude), others will flip sign or become diffuse (one way to quantify this is sketched after the list).
- Their stability and modularity will be moderately correlated across models but with substantial exceptions, especially in highly policy-dominated layers.
- Cross-model disagreement in the behavioral effects of these vectors (on a fixed prompt set) will indeed highlight a subset of prompts enriched for tone-masked safety issues and borderline policy cases, but:
  - the signal will be far from perfect, and
  - much of its value will come from combining disagreement on emotion vectors with disagreement on non-emotional safety controls and uncertainty measures (as sketched at the end of this answer).
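One way to make "transfers reasonably vs. flips or becomes diffuse" operational is to compare the per-prompt behavioral effect of the "same" vector in two models. The minimal sketch below assumes the per-prompt effect deltas (e.g., the change in a refusal or hedging-tone score with and without the vector) have already been measured upstream for each model; the function name, inputs, and choice of summary statistics are illustrative assumptions, not an established API.

```python
import numpy as np

def transfer_summary(effects_src: np.ndarray, effects_tgt: np.ndarray) -> dict:
    """Summarize how a functional-emotion vector's behavioral effect transfers
    from a source model to a target model over the same prompt set.

    effects_src / effects_tgt: per-prompt effect deltas (e.g., change in a
    refusal or hedging-tone score when the vector is added vs. not added).
    Both arrays are assumed to come from upstream measurement.
    """
    effects_src = np.asarray(effects_src, dtype=float)
    effects_tgt = np.asarray(effects_tgt, dtype=float)

    # Sign agreement: fraction of prompts where the effect points the same way.
    nonzero = (effects_src != 0) & (effects_tgt != 0)
    sign_agreement = float(np.mean(np.sign(effects_src[nonzero]) ==
                                   np.sign(effects_tgt[nonzero])))

    # Attenuation: median |target effect| / |source effect| (1.0 = same strength).
    ratio = np.abs(effects_tgt[nonzero]) / np.abs(effects_src[nonzero])
    attenuation = float(np.median(ratio))

    # Diffuseness proxy: how weakly the two per-prompt effect profiles correlate.
    corr = float(np.corrcoef(effects_src, effects_tgt)[0, 1])

    return {"sign_agreement": sign_agreement,
            "median_magnitude_ratio": attenuation,
            "effect_correlation": corr}
```

High sign agreement with a magnitude ratio well below 1.0 would correspond to "same sign, attenuated"; low sign agreement or near-zero effect correlation would correspond to "flips or becomes diffuse".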
So: yes, comparable structure exists in a weak-to-moderate sense, and cross-model emotion-vector disagreements are promising as one diagnostic feature for prioritizing human review and intervention, but they should be embedded in a broader safety-monitoring stack.
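As a concrete, deliberately simple version of "one diagnostic feature in a broader stack", the sketch below ranks prompts by combining cross-model disagreement on emotion-vector effects with disagreement on a non-emotional safety control and a generic uncertainty measure. The array shapes, the use of cross-model standard deviation as the disagreement measure, and the default weights are all assumptions made for illustration.

```python
import numpy as np

def review_priority(emotion_effects: np.ndarray,
                    safety_scores: np.ndarray,
                    uncertainty: np.ndarray,
                    weights=(0.4, 0.4, 0.2)) -> np.ndarray:
    """Rank prompts for human review by combining three per-prompt signals.

    emotion_effects: shape (n_models, n_prompts), effect deltas from the same
        functional-emotion vector applied in each model.
    safety_scores:   shape (n_models, n_prompts), scores from a non-emotional
        safety control (e.g., each model's refusal classifier).
    uncertainty:     shape (n_prompts,), any per-prompt uncertainty measure.
    All inputs and the weighting scheme are illustrative assumptions.
    """
    def zscore(x):
        x = np.asarray(x, dtype=float)
        return (x - x.mean()) / (x.std() + 1e-8)

    # Disagreement = cross-model standard deviation, computed per prompt.
    emotion_disagreement = np.std(emotion_effects, axis=0)
    safety_disagreement = np.std(safety_scores, axis=0)

    w_e, w_s, w_u = weights
    priority = (w_e * zscore(emotion_disagreement)
                + w_s * zscore(safety_disagreement)
                + w_u * zscore(uncertainty))
    # Indices of prompts, highest review priority first.
    return np.argsort(priority)[::-1]
```

In practice the weights would be tuned against labeled incidents, and the resulting ranking would feed an existing triage queue rather than act as a standalone trigger.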