In today’s globalized world, being able to communicate in English is crucial for many people. Computer-assisted pronunciation training (CAPT) systems can help students achieve English proficiency by providing an accessible way to practice and by offering personalized feedback. However, phone-level pronunciation scoring remains a very challenging task, with performance still far from that of human annotators.
In this paper, we present and compare results on the Spanish subset of the L2-ARCTIC corpus and the new EpaDB database, both of which contain non-native English speech from native Spanish speakers and are intended for the development of pronunciation scoring systems. We report the most frequent errors in each database and compare the performance of a state-of-the-art goodness of pronunciation (GOP) system on both. Results show that the two databases exhibit similar error patterns and that performance is similar for most phones, despite differences in recording conditions. For EpaDB, we also present an analysis of errors per target phone. This study validates the EpaDB collection and annotations, providing initial results and contributing to the advancement of a challenging low-resource task.
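For context, GOP-style scores are commonly computed as the average log posterior of the expected (canonical) phone over the frames assigned to it by a forced alignment, with low scores flagging likely mispronunciations. The sketch below illustrates this idea only; it is not the paper's exact system, and the posterior matrix, function name, and frame spans are hypothetical.

```python
import numpy as np

def gop_score(posteriors, phone_idx, start, end):
    """Mean log posterior of the canonical phone over its aligned frames.

    posteriors: (T, P) array of frame-level phone posteriors (rows sum to 1),
                e.g. from a neural acoustic model (hypothetical here).
    phone_idx:  index of the canonical (expected) phone.
    start, end: frame span of the phone taken from a forced alignment.
    """
    frames = posteriors[start:end, phone_idx]
    # Clip to avoid log(0) on numerically tiny posteriors.
    return float(np.mean(np.log(np.clip(frames, 1e-10, 1.0))))

# Hypothetical posteriors for a 4-frame segment over 3 phones.
post = np.array([
    [0.8, 0.10, 0.10],
    [0.7, 0.20, 0.10],
    [0.9, 0.05, 0.05],
    [0.6, 0.30, 0.10],
])
score = gop_score(post, phone_idx=0, start=0, end=4)
# A strongly negative GOP suggests a likely mispronunciation;
# a threshold on this score yields a correct/mispronounced decision.
```

In practice, the threshold is tuned per phone on annotated data such as L2-ARCTIC or EpaDB.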
doi: 10.21437/Interspeech.2021-745