
Reading Smiles: Proxy Bias in Foundation Models for Facial Emotion Recognition

MCML Authors

Abstract

Foundation Models (FMs) are rapidly transforming Affective Computing (AC), with Vision Language Models (VLMs) now capable of recognising emotions in zero-shot settings. This paper probes a critical but underexplored question: what visual cues do these models rely on to infer affect, and are these cues psychologically grounded or superficially learnt? We benchmark VLMs of varying scale on a teeth-annotated subset of the AffectNet dataset and find consistent performance shifts depending on the presence of visible teeth. Through structured introspection of the best-performing model, GPT-4o, we show that facial attributes like eyebrow position drive much of its affective reasoning, revealing a high degree of internal consistency in its valence-arousal predictions. These patterns highlight the emergent nature of FM behaviour, but also reveal risks: shortcut learning, bias, and fairness issues, especially in sensitive domains like mental health and education.
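The zero-shot benchmarking setup described above can be illustrated with a minimal sketch. The prompt wording, label set, helper names, and file paths below are illustrative assumptions for one model (GPT-4o via the OpenAI Chat Completions API), not the authors' actual protocol or code.

```python
# Hypothetical sketch: zero-shot facial emotion classification with GPT-4o,
# then accuracy comparison across teeth-visible vs. teeth-hidden subsets.
# Prompt text, label set, and paths are assumptions, not the paper's setup.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# AffectNet's eight categorical emotion labels
EMOTIONS = ["neutral", "happy", "sad", "surprise",
            "fear", "disgust", "anger", "contempt"]

def classify_emotion(image_path: str) -> str:
    """Ask the VLM for a single zero-shot emotion label for a face image."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Classify the facial expression in this image. "
                         f"Answer with exactly one word from: {', '.join(EMOTIONS)}."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip().lower()

def subset_accuracy(samples) -> float:
    """Accuracy over a list of (image_path, true_label) pairs."""
    hits = sum(classify_emotion(path) == label for path, label in samples)
    return hits / len(samples)

# A performance shift between the two subsets would suggest the model leans
# on tooth visibility as a proxy cue rather than on the expression itself.
# acc_teeth = subset_accuracy(teeth_visible_samples)   # paths hypothetical
# acc_no_teeth = subset_accuracy(teeth_hidden_samples)
```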



Preprint

Jun. 2025

Authors

I. Tsangko • A. Triantafyllopoulos • A. Abdelmoula • A. Mallol-Ragolta • B. W. Schuller

Research Area

B3 | Multimodal Perception

BibTeX Key: TTA+25
