Neural network-based speech recognition models achieve superior performance, but training them successfully requires large amounts of labeled data. Wav2vec 2.0 overcomes this issue by combining self-supervised masked pre-training with a fine-tuning step, achieving impressive results with little labeled data. In this talk, I’ll discuss the challenges associated with productizing Wav2vec 2.0.
Speech recognition technologies enhance people’s lives through a wide range of applications such as virtual assistants, home automation, and real-time transcription and captioning. Neural network-based speech recognition models achieve superior performance, but training them successfully requires large amounts of labeled data (Amodei et al., 2016). Wav2vec 2.0 (Baevski et al., 2020) is a framework that combines self-supervised masked pre-training with a fine-tuning step, achieving impressive results with little labeled data. This makes it easy to train an end-to-end automatic speech recognition (ASR) system for low-resource languages or for a specific domain. However, the largest version of Wav2vec 2.0 has 317 million parameters, which makes real-time inference inefficient in production. In this talk, I’ll discuss the challenges and considerations associated with productizing Wav2vec 2.0, and I’ll review some of the model compression approaches we tried in order to lower the model’s footprint.
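To make the footprint problem concrete, here is a minimal sketch of one standard compression technique, post-training dynamic int8 quantization, applied to a publicly available Wav2vec 2.0 checkpoint via Hugging Face transformers. The checkpoint name and the choice of dynamic quantization are illustrative assumptions, not necessarily the specific approaches covered in the talk.

```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Assumed checkpoint for illustration: the LARGE model fine-tuned on
# LibriSpeech 960h (~317M parameters, matching the size quoted above).
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h")
model.eval()

# Dynamic quantization: store the weights of all nn.Linear layers as int8
# and quantize activations on the fly at inference time. This shrinks the
# transformer layers, which hold most of the model's parameters.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Wav2vec 2.0 expects 16 kHz mono audio; a 1-second silent dummy signal
# stands in for a real recording here.
waveform = torch.zeros(16000)
inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = quantized(inputs.input_values).logits

# Greedy CTC decoding of the frame-level predictions into text.
pred_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(pred_ids)
print(transcription)
```

Dynamic quantization is attractive as a first step because it needs no retraining or calibration data, though it trades some accuracy for the smaller, faster model; other options in the same family include knowledge distillation and pruning.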