Meta’s Multilingual Speech Project: Advancing Accessibility and Preserving Language Diversity

Meta has made significant strides in advancing speech recognition and generation for a wide range of languages. In a recent blog post, the company announced its Massively Multilingual Speech (MMS) project, which aims to tackle the limited language coverage of today's speech technology, making information accessible to a broader audience and helping preserve endangered languages.

Challenges in Multilingual Speech Technology

The field of speech recognition and generation relies heavily on labeled data, and its scarcity has been a significant obstacle to expanding coverage to more languages: without enough labeled audio, it is hard to train accurate and robust machine-learning models. Most existing speech recognition models cover only around 100 languages, leaving thousands of languages without effective speech technology support. Moreover, roughly half of the world's languages are at risk of disappearing within our lifetime.

The Massively Multilingual Speech Solution

To address these challenges, Meta's MMS project combines wav2vec 2.0, the company's self-supervised learning framework, with a new dataset that provides labeled data for more than 1,100 languages and unlabeled data for nearly 4,000. To compile this dataset, Meta leveraged religious texts, such as the Bible, that have been translated into numerous languages and for which publicly available audio recordings of people reading the texts exist. This yielded an average of 32 hours of data per language.
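To get a feel for what the resulting recognition models look like in practice, here is a minimal transcription sketch using the Hugging Face transformers library. It assumes the MMS checkpoints are published on the Hugging Face Hub under a name like facebook/mms-1b-all; that name is an assumption for illustration, and the official release lives in the fairseq repository linked at the end of this post.

```python
import torch
from transformers import Wav2Vec2ForCTC, AutoProcessor

model_id = "facebook/mms-1b-all"  # assumed Hub checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

def transcribe(audio):
    """Greedy CTC decoding of a 16 kHz mono waveform (1-D float array)."""
    inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    ids = torch.argmax(logits, dim=-1)[0]
    return processor.decode(ids)
```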

The Process and Results

Meta’s team improved the dataset’s quality with alignment models and an efficient forced-alignment algorithm, performing multiple rounds of preprocessing and cross-validation filtering to remove misaligned data. Training on this data with wav2vec 2.0, which learns speech representations from unlabeled audio, greatly reduced the amount of labeled data required. The resulting models were then fine-tuned for specific tasks such as multilingual speech recognition and language identification.
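As a rough illustration of that last step, the sketch below fine-tunes a pretrained wav2vec 2.0 checkpoint with a CTC head using the Hugging Face transformers library. The checkpoint name and the bare-bones training loop are illustrative assumptions, not Meta's actual training recipe.

```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Assumed stand-in checkpoint; the same recipe applies to any wav2vec 2.0 model with a CTC vocabulary.
model_name = "facebook/wav2vec2-base-960h"
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)

model.freeze_feature_encoder()  # keep the convolutional feature extractor fixed during fine-tuning
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

def training_step(audio_array, transcript):
    """One CTC fine-tuning step on a single (16 kHz audio, text) pair."""
    inputs = processor(audio_array, sampling_rate=16_000, return_tensors="pt")
    labels = processor.tokenizer(transcript, return_tensors="pt").input_ids
    loss = model(**inputs, labels=labels).loss  # CTC loss against the character labels
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```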

The evaluation of the Massively Multilingual Speech models on benchmark datasets showcased impressive performance. Increasing the number of supported languages from 61 to 1,107 resulted in a minimal increase in the character error rate but expanded language coverage by over 18 times. When compared to OpenAI’s Whisper, Meta’s models achieved a substantially lower word error rate while covering 11 times more languages.
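For context, word error rate is simply the word-level edit distance between the model's transcript and the reference, divided by the number of reference words; character error rate is the same quantity computed over characters. A small, self-contained helper (not tied to Meta's evaluation code) might look like this:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def word_error_rate(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def char_error_rate(reference, hypothesis):
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

# One substituted word out of six reference words -> WER of about 0.167
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```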

Text-to-speech systems were also developed for over 1,100 languages. Despite the limited number of speakers available for some languages in the dataset, the quality of synthesized speech was found to be of a high standard.
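The MMS text-to-speech models are based on the VITS architecture. Assuming they are also mirrored on the Hugging Face Hub under per-language names such as facebook/mms-tts-eng (an assumption; the authoritative release is the fairseq repository linked below), a minimal synthesis sketch could be:

```python
import torch
import scipy.io.wavfile
from transformers import VitsModel, AutoTokenizer

model_id = "facebook/mms-tts-eng"  # assumed Hub name; one checkpoint per language code
model = VitsModel.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Speech technology for every language.", return_tensors="pt")
with torch.no_grad():
    waveform = model(**inputs).waveform[0]  # mono audio at model.config.sampling_rate

scipy.io.wavfile.write("mms_tts_demo.wav", model.config.sampling_rate, waveform.numpy())
```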

Moving Forward and Future Prospects

Meta acknowledges that further improvements are necessary for their models. They are aware of potential risks, such as mistranscribed words or offensive language, and emphasize the importance of collaboration within the AI community to ensure responsible development.

The Massively Multilingual Speech project represents a significant step toward a future where information access and technology usage are facilitated in diverse languages. Meta aims to expand language coverage, including dialects, and enhance speech technology for various applications, from virtual reality and augmented reality to messaging services.

Meta envisions a future where a single model can handle multiple speech tasks for all languages, resulting in improved overall performance. By pushing the boundaries of multilingual speech technology, Meta is committed to empowering individuals to access information and use technology in their preferred languages, thereby fostering language diversity and preservation.

GitHub: https://github.com/facebookresearch/fairseq/tree/main/examples/mms

 

