
ExHuBERT: Enhancing HuBERT Through Block Extension and Fine-Tuning on 37 Emotion Datasets

MCML Authors

Abstract

Foundation models have shown great promise in speech emotion recognition (SER) by leveraging their pre-trained representations to capture emotion patterns in speech signals. To further enhance SER performance across various languages and domains, we propose a novel twofold approach. First, we gather EmoSet++, a comprehensive multi-lingual, multi-cultural speech emotion corpus with 37 datasets, 150,907 samples, and a total duration of 119.5 hours. Second, we introduce ExHuBERT, an enhanced version of HuBERT achieved by backbone extension and fine-tuning on EmoSet++. We duplicate each encoder layer and its weights, then freeze the first duplicate, integrating an extra zero-initialized linear layer and skip connections to preserve functionality and ensure its adaptability for subsequent fine-tuning. Our evaluation on unseen datasets shows the efficacy of ExHuBERT, setting a new benchmark for various SER tasks.
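The block-extension idea described above (duplicate a pre-trained layer, freeze the first copy, and wire the trainable duplicate through a zero-initialized linear map with a skip connection) can be sketched in a few lines. This is a conceptual toy, not the authors' implementation: the layer here is a random affine map with ReLU standing in for a HuBERT encoder layer, and the hidden size `d = 8` is an illustrative assumption (real HuBERT uses much larger dimensions).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden size (assumption; real HuBERT layers are far wider)

# Toy stand-in for one pre-trained encoder layer: affine map + ReLU.
W = rng.standard_normal((d, d)) / np.sqrt(d)
b = 0.01 * rng.standard_normal(d)

def pretrained_layer(x):
    return np.maximum(W @ x + b, 0.0)

# Block extension: the layer is duplicated with identical weights. The first
# copy stays frozen; the duplicate's output passes through a zero-initialized
# linear map, with a skip connection from the frozen copy's output.
Z = np.zeros((d, d))  # zero-initialized linear layer (trainable in practice)

def extended_block(x):
    h = pretrained_layer(x)    # frozen first copy
    h2 = pretrained_layer(h)   # trainable duplicate (same weights at init)
    return h + Z @ h2          # skip connection: identity at initialization

x = rng.standard_normal(d)
# Because Z is zero, the extended block initially reproduces the original
# layer exactly, preserving the pre-trained behavior before fine-tuning.
assert np.allclose(extended_block(x), pretrained_layer(x))
```

During fine-tuning only `Z` and the duplicate's weights would be updated, so the model can drift away from the frozen behavior gradually rather than losing it at the start of training.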



INTERSPEECH 2024

25th Annual Conference of the International Speech Communication Association. Kos Island, Greece, Sep 01-05, 2024.
A Conference

Authors

S. Amiriparian, F. Packań, M. Gerczuk, B. W. Schuller

Links

DOI

Research Area

 B3 | Multimodal Perception

BibTeX Key: APG+24
