Automatic speech processing systems learn from huge quantities of data, which are necessary to reach state-of-the-art performance and to enable industrial exploitation of this technology. This work package tackles the specific problems that limited resources pose for the development of reliable systems.
The first limitation is the lack of annotated corpora for many languages and tasks. More broadly, the lack of data can be seen as a discrimination issue that the community must address in its different aspects. Indeed, existing corpora do not cover all languages, and even for well-covered languages, corpora are seldom gender balanced and rarely include regional variations.
The second limitation relates to tasks where human interaction or perceptive evaluation is required (human-assisted learning, evaluation of explanation quality, human-robot interaction, evaluation of language learning).
The goal of this WP is twofold:
Recent developments in machine learning enable automatic systems to learn and generalize from large quantities of data and have brought outstanding improvements in many speech-related tasks. These systems are, however, far from replacing human expertise, for several reasons.
This work package aims to develop automatic systems that integrate human-assisted learning. Such systems should be able to merge heterogeneous information coming from the processed data and from a human operator.
Explainability and interpretability of intelligent systems are currently in the spotlight, with a number of research programs worldwide. The wide deployment of speech technologies and the growing expectations of the general public create a need for explainable intelligent speech processing systems. Explaining the decisions made by AI systems is crucial for the trust in and social acceptance of these systems.
Speech is a complex signal that conveys not only the message but also various characteristics of the speaker: identity, age, accent, language. Automatic speech processing is thus used in many applications, including health, forensics and education. In these domains, the role of automatic systems is not to make decisions but to provide relevant information that helps human experts motivate their decisions. The outputs of automatic systems are used by domain experts who have no expert knowledge in machine learning but still need to analyze this information. When interacting with experts, AI systems therefore have to return a good prediction together with an appropriate representation of domain-relevant features and biases.
Explainability usually tries to understand the internal mechanisms of machine learning or deep learning systems and to explain them in human terms, whereas interpretability tends to present those mechanics in understandable terms without necessarily knowing why they occur. In both cases, characterizing the information fed into the system and returned by it is a real challenge.
This work package will address three tasks that will lead to better and more explainable systems.
Classic speech processing tasks such as speech recognition, speaker recognition, speaker diarization, speech understanding and speech translation all have standard, widely used evaluation metrics and protocols that have been developed and discussed within the community for years. These metrics and protocols allow the evaluation of individual technologies, but they are not sufficient to evaluate automatic systems that include richer functionalities or interaction capacities.
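As a concrete illustration of such a standard metric, word error rate (WER), the usual measure for speech recognition, is an edit distance over words normalized by the reference length. The sketch below is a minimal, toolkit-independent implementation for illustration only:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One deleted word out of six reference words: WER = 1/6.
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

Metrics for other tasks (e.g. diarization error rate) follow the same spirit of scoring a hypothesis against a reference, which is precisely what becomes insufficient once systems interact with humans.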
In industry, today’s systems are automatic pipelines that integrate several technological building blocks to deliver a service; for instance, speaker diarization, language identification, automatic speech recognition and spoken language understanding are combined in many call centers to analyze customer satisfaction. Researchers currently focus on systems that integrate basic speech processing tasks together with human-assisted learning or explainability. Evaluating such composed systems with basic metrics alone is not satisfactory. The analysis of more complex tasks and pipelines requires new metrics, protocols and scenarios that enable meaningful analyses of systems by disentangling the many factors involved.
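The composition problem can be sketched as follows. In the stub pipeline below (all stage names and behaviors are hypothetical placeholders, not any real system), an error introduced by an early stage propagates to every later stage, so a single end-to-end score cannot attribute the error to its source; per-stage metrics and scenarios are needed in addition:

```python
from typing import List

# Hypothetical, stubbed pipeline stages for illustration only.
def diarize(audio: str) -> List[str]:
    """Split a recording into per-speaker segments (stub)."""
    return [audio + ":spk1", audio + ":spk2"]

def transcribe(segment: str) -> str:
    """Automatic speech recognition on one segment (stub)."""
    return "transcript of " + segment

def understand(transcript: str) -> str:
    """Spoken language understanding: map a transcript to an intent (stub)."""
    return "intent(" + transcript + ")"

def pipeline(audio: str) -> List[str]:
    """Compose the stages. A diarization error changes what is transcribed,
    and a transcription error changes what is understood, so the final
    output alone cannot disentangle which stage failed."""
    return [understand(transcribe(seg)) for seg in diarize(audio)]

print(pipeline("call_001"))
```

Evaluating only the final intents conflates diarization, recognition and understanding errors; the WP's point is that complex pipelines need protocols that score each factor separately as well as the whole.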
This work package aims to derive and generalize evaluation processes in order to catalyze the development of intelligent systems by the community. The WP, led by LNE, will benefit from the expertise of the National Institute of Standards and Technology (NIST, USA) in order to open perspectives for international standard development.
Research is based on collaborations, exchanges and training, which are especially important for ESRs and post-doctoral researchers. Over the last decades, a tremendous increase in resource sharing among machine learning experts has led to exponential growth in AI achievements. The members of the ESPERANTO consortium have a strong history of collaborative research and share a commitment to supporting the European and global speech community along three axes:
While the machine learning community produces a lot of training material, most of it is dedicated to image and text processing. In order to support the growing industrial need for speech technologies, which covers a variety of tasks (ASR, speaker recognition, language recognition, speaker diarization, speech enhancement, separation, translation, understanding, etc.), this work package will support and coordinate the production of training material, tutorials, baselines and documentation for standard software frameworks (such as SpeechBrain and Kaldi) and resources.