Two minutes NLP — Speech Recognition options with Python
DeepSpeech, SpeechBrain, SpeechRecognition, Speech-to-Text APIs
Speech-related tasks overview
Automatic Speech Recognition (ASR) is the task of transforming speech to text. Other common speech-related tasks are:
- Spoken Language Understanding: speech-to-semantics.
- Speaker Recognition: identifying or verifying speaker identities from speech recordings.
- Speech Enhancement: improving the quality of the speech signal by removing noise.
- Speech Separation: separating multiple speakers speaking at the same time.
- Speaker Diarization: detecting who spoke when.
- Multi-microphone signal processing: combining the information recorded by multiple microphones.
Open-source Speech Recognition
The biggest drawback of open-source solutions is that the computing power required to do speech recognition will have to come from your hardware. Another important consideration is that open-source speech recognition options are usually less accurate than cloud-based API options. You’re probably better off with a cloud solution if accuracy is important to your project.
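If accuracy drives your choice, it helps to measure it on your own audio rather than rely on general claims. A common metric for this is word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the reference transcript into the recognizer output, divided by the number of reference words. The sketch below is a minimal pure-Python version; the function name `word_error_rate` and the lowercase, whitespace-based normalization are assumptions made for illustration, not part of any library mentioned in this post.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: lowercase + whitespace split is enough normalization here.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing in output
            insertion = current[j - 1] + 1  # extra word in output
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

Running the same set of hand-transcribed recordings through each candidate engine and comparing the resulting scores gives a rough, project-specific picture of the accuracy gap. Lower is better, and the value can exceed 1.0 when the output contains many extra words.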
- CMU Sphinx: builds on over 20 years of CMU research. Its tools are designed specifically for low-resource platforms, have a flexible design, and focus on practical application development rather than on research.
- DeepSpeech: an open-source engine from Mozilla, based on a paper about speech recognition techniques produced by Baidu’s research team. DeepSpeech can run offline and on-device, on hardware ranging from a Raspberry Pi to the GPUs used to train models in industry.
- SpeechBrain: an open-source, all-in-one speech toolkit. It is designed to make the research and development of neural speech processing technologies easier by being simple, flexible, user-friendly, and well-documented. It integrates with HuggingFace transformers.
- SpeechRecognition: an open-source wrapper around various speech recognition engines and APIs, covering both open-source options and closed-source cloud solutions.
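To give a feel for the wrapper approach, here is a minimal sketch that transcribes an audio file with the SpeechRecognition package, switching between an offline engine (CMU Sphinx) and an online one (the Google Web Speech API). The helper `transcribe_file` and the `ENGINE_METHODS` mapping are hypothetical names introduced for this example. It assumes `pip install SpeechRecognition`, plus the `pocketsphinx` package for the offline engine, and a WAV, AIFF, or FLAC input file.

```python
# Maps a short engine name to the matching Recognizer method name.
ENGINE_METHODS = {
    "sphinx": "recognize_sphinx",  # offline, requires the pocketsphinx package
    "google": "recognize_google",  # online, Google Web Speech API
}


def transcribe_file(path, engine="sphinx"):
    """Return the transcript of an audio file, or None if nothing was understood."""
    if engine not in ENGINE_METHODS:
        raise ValueError(f"unknown engine {engine!r}, expected one of {sorted(ENGINE_METHODS)}")

    # Imported lazily so the module loads even if the package is not installed.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile(path) as source:
        audio = recognizer.record(source)  # read the whole file into memory

    recognize = getattr(recognizer, ENGINE_METHODS[engine])
    try:
        return recognize(audio)
    except sr.UnknownValueError:
        # The engine ran but could not make sense of the audio.
        return None
    except sr.RequestError as error:
        # The engine is unavailable (missing package, no network, quota, ...).
        raise RuntimeError(f"{engine} engine could not be reached") from error
```

Calling `transcribe_file("meeting.wav", engine="google")` with a file of your own (the file name is a placeholder) is enough to try the online path, and changing the `engine` argument is the only thing needed to compare it with the offline one.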
You can find more comparisons of open-source speech recognition libraries here.
Cloud-based Speech Recognition
Cloud solutions for building a speech recognition project have some big advantages: they are easy to use, they are usually more accurate than open-source options, and they don’t require you to host any models on your own hardware. The main drawback of some cloud solutions is the cost.
Examples of closed-source cloud solutions are Google Cloud Speech-to-Text API, Wit.ai, Microsoft Azure Speech, Houndify API, and IBM Speech to Text.
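As an illustration of what calling one of these services looks like, the sketch below sends a short recording to the Google Cloud Speech-to-Text API using the `google-cloud-speech` Python client. Several things here are assumptions: the helper names `top_transcripts` and `transcribe_with_google_cloud` are made up for this example, the audio is assumed to be 16 kHz, 16-bit linear PCM, and credentials are assumed to be configured through the `GOOGLE_APPLICATION_CREDENTIALS` environment variable. Other providers follow a similar pattern but with their own clients and parameters.

```python
def top_transcripts(results):
    """Join the highest-ranked alternative of each result into one string."""
    return " ".join(
        result.alternatives[0].transcript.strip()
        for result in results
        if result.alternatives  # skip results that carry no alternatives
    )


def transcribe_with_google_cloud(path, language_code="en-US", sample_rate_hertz=16000):
    """Transcribe a short audio file with the synchronous recognize call."""
    # Imported lazily so the helper above works without the client installed.
    from google.cloud import speech  # pip install google-cloud-speech

    client = speech.SpeechClient()  # picks up credentials from the environment
    with open(path, "rb") as audio_file:
        audio = speech.RecognitionAudio(content=audio_file.read())

    # Assumption: the file is raw 16-bit linear PCM at the given sample rate.
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=sample_rate_hertz,
        language_code=language_code,
    )
    response = client.recognize(config=config, audio=audio)
    return top_transcripts(response.results)
```

The synchronous call is meant for short clips. Since each request is billed, it is worth keeping test recordings short while experimenting.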
