The Effect of Spoken Language and Diversity on Voice Gender Recognition by Machine Learning Models
Submitted by:
Enrique Díaz-Ocampo MA
Tecnológico Nacional de México / CENIDET
Presenter(s):
Enrique Díaz-Ocampo
Yael Bensoussan MD, M.Sc., Jeremy Pinto B.Sc., Noah Millman, Desiree McCutcheon
Andrea Magadán-Salazar Ph.D., Raúl Pinto-Elías Ph.D., Máximo Lopez-Sánchez Ph.D.
Abstract
Background: Gender recognition systems based on voice analysis use machine learning (ML) algorithms to analyze various acoustic features of the human voice. Pitch, or fundamental frequency (f0), together with its derived statistical measures, has been described as a common acoustic feature for determining voice gender. However, many intrinsic variables, such as age, ethnicity and spoken language, can affect f0 and its ability to predict voice gender. Although various studies have analyzed voice gender recognition in single languages, there are no data comparing f0-related acoustic measures across multiple languages and their effect on gender recognition by ML models.
Objective: The primary objective of this study was to analyze the effect of spoken language on statistical features of fundamental frequency (f0) and on gender recognition by machine learning models. The secondary objective was to determine which features had the greatest and most consistent impact on gender recognition by ML models across languages.
Methods: A human voice dataset was extracted from the Mozilla Common Voice Corpus 10.0. Eight acoustic measures related to pitch (minimum f0, first quartile f0, mean f0, median f0, third quartile f0, maximum f0, interquartile range f0 and standard deviation f0) were extracted and analyzed using two ML algorithms and an improved logistic regression model. Voice data were stratified by age, gender (male/female) and spoken language (English, Spanish, French, Chinese, and German). Acoustic features and their derivatives were calculated with Boersma's acoustic periodicity detection algorithm as implemented in PRAAT. The Waikato Environment for Knowledge Analysis software (WEKA, version 3.8.6, GNU General Public License) was used for the machine learning analysis with three classifiers: a multilayer perceptron (MLP), a J48 decision tree (J48), and a logistic regression model with a ridge estimator, using the seven statistical features and the age label. For each language, balanced and unbalanced scenarios were constructed. Furthermore, Correlation-based Feature Subset Selection, as implemented in WEKA, was used to measure the impact of individual acoustic features on gender recognition and to identify which features were consistent across multiple languages.
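As an illustration of the eight pitch statistics listed above, the following is a minimal Python sketch that computes them from an f0 contour. It is not the authors' pipeline: it assumes the contour has already been extracted (for example, exported from PRAAT) as a sequence of values in Hz with unvoiced frames marked as 0 or NaN. The function name `f0_statistics`, the dictionary keys, the choice of population standard deviation and NumPy's default linear quantile interpolation are all assumptions made for the example.

```python
import numpy as np

def f0_statistics(f0_hz):
    """Return eight summary statistics of an f0 contour given in Hz.

    Assumption: unvoiced frames are encoded as 0 or NaN and are dropped.
    """
    f0 = np.asarray(f0_hz, dtype=float)
    voiced = f0[np.isfinite(f0) & (f0 > 0)]
    if voiced.size == 0:
        raise ValueError("contour contains no voiced frames")

    # Quartiles use NumPy's default linear interpolation (an assumption).
    q1, median, q3 = np.percentile(voiced, [25, 50, 75])
    return {
        "min_f0": float(voiced.min()),
        "q1_f0": float(q1),
        "mean_f0": float(voiced.mean()),
        "median_f0": float(median),
        "q3_f0": float(q3),
        "max_f0": float(voiced.max()),
        "iqr_f0": float(q3 - q1),
        # Population standard deviation (ddof=0); an assumption.
        "std_f0": float(voiced.std()),
    }

# Hypothetical contour: two unvoiced frames (0) among five voiced ones.
features = f0_statistics([0, 100, 110, 120, 130, 0, 140])
```

Each recording would yield one such feature vector, which could then be written to a table for a classifier together with the gender, age and language labels.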
Results: A total of 1,784,244 voice samples from 21,785 unique speakers were analyzed (80.3% male and 19.7% female). The spoken language distribution consisted of English speakers (51.6%), Spanish speakers (15.5%), German speakers (14.9%), French speakers (13.9%) and Chinese speakers (4.1%). The ML models reached high accuracy (for balanced datasets) and F1 measure (for unbalanced datasets) in detecting voice gender (above 85% in all cases). With balanced voice datasets, accuracy was highest for Spanish speakers (96.27%) and lowest for English speakers (93.41%). With the unbalanced dataset, Spanish speakers achieved the highest F1 measure (Spanish 92.1%, Chinese 91.3%, English 91.0%, German 89.5% and French 87.9%).
Of the eight acoustic features studied, those that affected accuracy the most across all languages were first quartile f0, mean f0, median f0, and third quartile f0.
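To make clear why accuracy is reported for the balanced datasets and the F1 measure for the unbalanced ones, the sketch below computes both metrics in plain Python. It is only an illustration: the abstract does not state which F1 variant (per-class or averaged) was used, so a per-class F1 is assumed here, and the function names and the toy labels are hypothetical. The toy data reuse the 80.3% / 19.7% male/female split and a degenerate classifier that always predicts "male".

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def f1_score(y_true, y_pred, positive):
    """Per-class F1 = 2*TP / (2*TP + FP + FN) for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denominator = 2 * tp + fp + fn
    # By convention here, F1 is 0 when the class is never predicted or present.
    return 2 * tp / denominator if denominator else 0.0

# Toy data (hypothetical): 803 male and 197 female samples,
# classified by a model that always answers "male".
y_true = ["male"] * 803 + ["female"] * 197
y_pred = ["male"] * 1000

acc = accuracy(y_true, y_pred)                 # 0.803 despite learning nothing
f1_female = f1_score(y_true, y_pred, "female")  # 0.0, exposing the failure
```

On such a skewed set the trivial classifier still scores 80.3% accuracy, while its F1 for the minority class is zero, which is why the F1 measure is the more informative figure when classes are unbalanced.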
Conclusion: This study demonstrates the effect of spoken language on the accuracy of gender recognition by machine learning algorithms. As voice is increasingly used as a biomarker and for identification, this work emphasizes the need for balanced and diverse datasets for voice AI models that have the potential to become multilingual.
Objectives
The primary objective of this study was to analyze the effect of spoken language on statistical features of fundamental frequency (f0) and on gender recognition by machine learning models.
The secondary objective was to determine which features had the greatest and most consistent impact on gender recognition by ML models across languages.