Automatic speech processing systems learn from huge quantities of data, which are necessary to reach state-of-the-art performance and to enable industrial exploitation of this technology. This work package tackles the specific problems that limited resources pose for the development of reliable systems.
The first limitation is the lack of annotated corpora for many languages and tasks. More broadly, the lack of data can be seen as a discrimination issue that the community must address in its different aspects. Indeed, existing corpora do not cover all languages, and even for well-covered languages, corpora are seldom gender balanced and rarely include regional variations.
The second limitation relates to tasks where human interaction or perceptive evaluation is required (human-assisted learning, evaluation of explanation quality, human-robot interaction, evaluation of language learning).
The goal of this WP is twofold:
Recent developments in machine learning enable automatic systems to learn and generalize from large quantities of data and have brought outstanding improvements in many speech-related tasks. These systems are, however, far from replacing human expertise, for several reasons.
This work package aims to develop automatic systems that integrate human-assisted learning. Such systems should be able to merge heterogeneous information coming from the processed data and from a human operator.
Explainability and interpretability of intelligent systems are currently in the spotlight, with a number of research programs worldwide. The wide deployment of speech technologies and the growing expectations of the general public create a need for explainable intelligent speech processing systems. Explaining the decisions made by AI systems is crucial for the trust in and social acceptance of these systems.
Speech is a complex signal that conveys not only the message but also various characteristics of the speaker: identity, age, accent, language. Automatic speech processing is thus used in many applications, including health, forensics and education. In these domains, the role of automatic systems is not to make decisions but to provide relevant information that helps human experts motivate their decisions. The outputs of automatic systems are used by domain experts who have no expert knowledge in machine learning but still need to analyze this information. When interacting with experts, AI systems therefore have to return a good prediction together with an appropriate representation of domain-relevant features and biases.
Explainability usually tries to understand the internal mechanisms of machine learning or deep learning systems and to explain them in human terms, whereas interpretability tends to present those mechanics in understandable terms without necessarily knowing why they occur. In both cases, characterizing the information fed into the system and returned by it is a real challenge.
This work package will address three tasks that will lead to better and more explainable systems.
Classic speech processing tasks such as speech recognition, speaker recognition, speaker diarization, speech understanding and speech translation all have standard, widely used evaluation metrics and protocols that have been developed and discussed within the community for years. These metrics and protocols allow the evaluation of individual technologies, but they are not sufficient to evaluate automatic systems that include richer functionalities or interaction capacities.
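As a concrete illustration of such a standard metric, word error rate (WER), the usual measure for speech recognition, is an edit distance over words normalized by the reference length. The sketch below is a minimal, toolkit-independent implementation for illustration only:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One deleted word out of six reference words: WER = 1/6.
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

Metrics for other tasks (e.g. diarization error rate) follow the same spirit of scoring a hypothesis against a reference, which is precisely what becomes insufficient once systems interact with humans.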
In industry, today’s systems are automatic pipelines that integrate several technological building blocks to deliver a service; for instance, speaker diarization, language identification, automatic speech recognition and spoken language understanding are combined in many call centers to analyze customer satisfaction. Researchers currently focus on systems that integrate basic speech processing tasks together with human-assisted learning or explainability. Evaluating such composed systems with basic metrics alone is not satisfactory. The analysis of more complex tasks and pipelines requires new metrics, protocols and scenarios that enable meaningful analyses of systems by disentangling the many factors involved.
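The composition problem can be sketched as follows. In the stub pipeline below (all stage names and behaviors are hypothetical placeholders, not any real system), an error introduced by an early stage propagates to every later stage, so a single end-to-end score cannot attribute the error to its source; per-stage metrics and scenarios are needed in addition:

```python
from typing import List

# Hypothetical, stubbed pipeline stages for illustration only.
def diarize(audio: str) -> List[str]:
    """Split a recording into per-speaker segments (stub)."""
    return [audio + ":spk1", audio + ":spk2"]

def transcribe(segment: str) -> str:
    """Automatic speech recognition on one segment (stub)."""
    return "transcript of " + segment

def understand(transcript: str) -> str:
    """Spoken language understanding: map a transcript to an intent (stub)."""
    return "intent(" + transcript + ")"

def pipeline(audio: str) -> List[str]:
    """Compose the stages. A diarization error changes what is transcribed,
    and a transcription error changes what is understood, so the final
    output alone cannot disentangle which stage failed."""
    return [understand(transcribe(seg)) for seg in diarize(audio)]

print(pipeline("call_001"))
```

Evaluating only the final intents conflates diarization, recognition and understanding errors; the WP's point is that complex pipelines need protocols that score each factor separately as well as the whole.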
This work package aims to derive and generalize evaluation processes in order to catalyze the development of intelligent systems by the community. The WP, led by LNE, will benefit from the expertise of the National Institute of Standards and Technology (NIST, USA) in order to open perspectives for international standard development.
Research is based on collaborations, exchanges and training, which are especially important for ESRs and post-doctoral researchers. Over the last decades, a tremendous increase in resource sharing among machine learning experts has led to exponential growth in AI achievements. The members of the ESPERANTO consortium have a strong history of collaborative research and share a commitment to supporting the European and global speech community along three axes:
While the machine learning community produces a lot of training material, most of it is dedicated to image and text processing. In order to support the growing industrial need for speech technologies, which covers a variety of tasks (ASR, speaker recognition, language recognition, speaker diarization, speech enhancement, separation, translation, understanding, etc.), this work package will support and coordinate the production of training material, tutorials, baselines and documentation for standard software frameworks (such as SpeechBrain and Kaldi) and resources.