Investigating speaker pronunciation variability in speech embeddings: speaker and L1 effects on French as a Second Language

Speech variation between native and non-native speakers of French is addressed with a low-resource method based on a frame-wise comparison of wav2vec2 acoustic embeddings, using fine-grained phonetic transcriptions by expert annotators as a baseline. Both z-normalisation and t-normalisation are explored to assess what phonetically analysable information the embeddings contain. We explore unsupervised methods for addressing basic speech-related research questions. Adapting Dynamic Time Warping to speech embeddings, we compare phonologically similar recordings of sentences read aloud by native vs. non-native speakers of French. The first question is whether XLSR-53 embeddings are more robust than MFCCs to inter-speaker vs. intra-speaker variability across different occurrences of the same words. We then investigate whether native speaker productions are more or less stable than those of non-native speakers. Results suggest that the model allows phonetically meaningful correlative analyses. Working on the raw embeddings shows, however, that the representations are not speaker-independent; with a view to addressing issues related to L2 pronunciation variability, we show that t-normalisation provides a way to separate fluency and accuracy effects in L2 speech. This shows that wav2vec2 encapsulates time-dependent phonetic information in the embeddings, including speaker accent, which cannot easily be disentangled from other speaker-specific characteristics.
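
As a rough illustration of the frame-wise comparison described in the abstract, the sketch below aligns two sequences of frame embeddings with Dynamic Time Warping. It is not the authors' implementation: the cosine local distance, the length normalisation by n + m, the function names `cosine_distance_matrix` and `dtw_cost`, and the synthetic 16-dimensional frames standing in for real wav2vec2 or MFCC features are all assumptions made for the example.

```python
import numpy as np


def cosine_distance_matrix(a, b):
    """Pairwise cosine distances between rows of a (n, d) and rows of b (m, d)."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a_unit @ b_unit.T


def dtw_cost(a, b):
    """Length-normalised DTW alignment cost between two frame sequences.

    The local distance (cosine) and the normalisation by n + m are
    illustrative choices, not taken from the paper.
    """
    dist = cosine_distance_matrix(a, b)
    n, m = dist.shape
    # acc[i, j] is the cheapest cumulative cost of aligning a[:i] with b[:j].
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_previous = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = dist[i - 1, j - 1] + best_previous
    return acc[n, m] / (n + m)


# Synthetic stand-ins for frame embeddings (hypothetical data, 16 dimensions).
rng = np.random.default_rng(0)
utt_a = rng.normal(size=(40, 16))
utt_b = np.repeat(utt_a, 2, axis=0)  # same frames, each held twice as long
utt_c = rng.normal(size=(55, 16))    # unrelated content

same_content_cost = dtw_cost(utt_a, utt_b)
different_content_cost = dtw_cost(utt_a, utt_c)
```

In this toy setting the time-stretched copy aligns at near-zero cost while the unrelated sequence does not, which is the property that makes DTW usable for comparing different occurrences of the same words produced at different speaking rates.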

Main file

30_Paper-finalfinal.pdf (1.69 MB)

Origin	Files produced by the author(s)
Licence	CC BY-NC-SA 4.0 - Attribution - NonCommercial - ShareAlike

https://hal.science/hal-05577663

Submitted on: Thursday, 2 April 2026, 14:50:25

Last modified on: Saturday, 4 April 2026, 03:13:52

Dates and versions

hal-05577663 , version 1 (02-04-2026)

Identifiers

HAL Id : hal-05577663 , version 1

Cite

Maxime Fily, Martine Adda-Decker, Guillaume Wisniewski. Investigating speaker pronunciation variability in speech embeddings: speaker and L1 effects on French as a Second Language. Speech Language Models in Low-Resource Settings: Performance, Evaluation, and Bias Analysis (SPEAKABLE, collocated in LREC), ELRA Language Resources Association, May 2026, Palma de Mallorca, Spain. ⟨hal-05577663⟩
