Automatic Speech Recognition: Amazon Transcribe vs. Google’s Speech-to-Text
My two cents, based on hands-on experience with both services for English-language transcription.

Technically speaking, Automatic Speech Recognition (ASR) converts content in one language from one form to another: the source is audio, the destination is text, and both are in the same language. I had the opportunity to experiment with both Amazon Transcribe and GCP’s (Google Cloud Platform) Speech-to-Text to transcribe US-English audio and video. Below I compare the two services on a few criteria.
1. Speed/API call time
From my observation, GCP’s Speech-to-Text is about 2–3 times faster than Amazon Transcribe on average. For a 20-second audio clip, Amazon Transcribe may take anywhere from 20s to 50s, whereas Speech-to-Text may take anywhere from 5s to 25s. I also observed that, for a set of audio files that all have the same duration, transcription times are more dispersed for Speech-to-Text than for Amazon Transcribe. In other words, Google’s transcription time varies more from file to file, while Amazon’s is consistently longer and clustered around a higher average.
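If you want to check the speed and dispersion claims on your own files, a minimal sketch like the one below may help. It times an arbitrary transcription callable and reports the mean and standard deviation of the timings. The `time_call` and `summarize` helpers are hypothetical names, and the two lists are placeholder values chosen only to show the shape of the output; they are not my measurements.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_call(transcribe: Callable[[str], str], audio_path: str) -> float:
    """Return the wall-clock seconds taken by one transcription call."""
    start = time.perf_counter()
    transcribe(audio_path)
    return time.perf_counter() - start


def summarize(durations: List[float]) -> Dict[str, float]:
    """Mean and sample standard deviation of timings (needs at least 2 values)."""
    return {
        "mean": statistics.mean(durations),
        "stdev": statistics.stdev(durations),
    }


# Placeholder timings in seconds, for illustration only (not real measurements).
amazon_times = [41.0, 43.0, 45.0, 44.0, 42.0]
google_times = [6.0, 24.0, 11.0, 19.0, 15.0]

print("Amazon:", summarize(amazon_times))
print("Google:", summarize(google_times))
```

In practice you would pass your own wrapper around each service’s client to `time_call`, collect one timing per file, and compare the two summaries. A larger standard deviation means more dispersed transcription times.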
2. Accuracy
I only want to touch on the accuracy of transcribing technical terms and acronyms. Google’s Speech-to-Text is much better at recognizing them than Amazon Transcribe. For terms such as S3 and dev, Amazon Transcribe may produce “s three” and “depth”, whereas Google’s service transcribes them exactly as written here.
3. Filler sounds removal
Google automatically removes filler sounds such as ah, um, and mhm from the transcription text, whereas Amazon keeps them in the text.
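If you need Amazon’s output without fillers, you can approximate the cleanup in post-processing. The sketch below is a simple regex-based approach, not how either service works internally. The `FILLERS` list and the `strip_fillers` name are my own assumptions, and you will likely need to extend the list for your recordings.

```python
import re

# Hypothetical filler list; extend it for your own recordings.
FILLERS = ("ah", "uh", "um", "mhm", "hmm")

# Match a filler as a whole word, plus an optional trailing comma/period and spaces.
_FILLER_RE = re.compile(
    r"\b(?:" + "|".join(FILLERS) + r")\b[,.]?\s*",
    flags=re.IGNORECASE,
)


def strip_fillers(text: str) -> str:
    """Remove standalone filler tokens and tidy the leftover whitespace."""
    cleaned = _FILLER_RE.sub("", text)
    return re.sub(r"\s{2,}", " ", cleaned).strip()


print(strip_fillers("Um, so we, uh, deploy to S3."))
```

The word boundaries keep words such as “yeah” or “umbrella” intact. The sketch does not re-capitalize the sentence start after a leading filler is removed.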
4. Automatic Punctuation
Amazon Transcribe’s automatic punctuation seems to be much more accurate than Google Speech-to-Text’s. This might be one of the reasons why Amazon Transcribe is slower than Google Speech-to-Text.
5. Automatic subtitle generation
When you call Amazon Transcribe, you can configure the transcription job to also generate srt and vtt subtitle files. Google does not provide these subtitle files, although you can build them yourself from the results of the Google API call. Here is a Medium blog post on generating subtitles from a Google API call.
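To give an idea of what building subtitles from API results involves, here is a minimal sketch that turns word-level timings into SRT text. It assumes you have already extracted `(word, start_seconds, end_seconds)` tuples from the response (Google returns per-word offsets when word time offsets are enabled in the request). The function names and the fixed cue size of 7 words are my own choices, and this is not necessarily the method used in the linked post.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]  # (text, start_seconds, end_seconds)


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: List[Word], max_words: int = 7) -> str:
    """Group words into fixed-size cues and render them as SRT text."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        start, end = chunk[0][1], chunk[-1][2]
        text = " ".join(w[0] for w in chunk)
        index = len(cues) + 1  # SRT cue numbers start at 1
        cues.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(cues)


# Hypothetical word timings, for illustration only.
sample = [("hello", 0.0, 0.4), ("world", 0.5, 1.0), ("again", 1.2, 1.8)]
print(words_to_srt(sample, max_words=2))
```

Splitting on a fixed word count is the simplest option. Splitting on punctuation or on pauses between words would usually read better.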
6. Underlying Machine learning models
GCP’s Speech-to-Text offers several models for US English, such as phone call, video, command and search, and default. Amazon Transcribe has only one model, the default. One good thing about Amazon Transcribe, though, is that you can create your own custom model. I have talked about this custom model in another blog post here on Medium.
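For illustration, here is a rough sketch of how the model choice shows up in a Speech-to-Text request body for the v1 REST API. The field names and model identifiers are as I understand them, so check the current documentation before relying on them. The `build_recognize_request` helper and the bucket path are hypothetical.

```python
import json

# Model identifiers as I understand the v1 API; verify against the current docs.
KNOWN_MODELS = {"default", "video", "phone_call", "command_and_search"}


def build_recognize_request(gcs_uri: str, model: str = "default") -> dict:
    """Build a JSON-serializable request body that selects a recognition model."""
    if model not in KNOWN_MODELS:
        raise ValueError(f"unknown model: {model!r}")
    return {
        "config": {
            "languageCode": "en-US",
            "model": model,
            "enableAutomaticPunctuation": True,
        },
        "audio": {"uri": gcs_uri},
    }


# Hypothetical bucket and file name.
request_body = build_recognize_request("gs://my-bucket/talk.wav", model="video")
print(json.dumps(request_body, indent=2))
```

Only the request body is shown here. Sending it requires authentication and an HTTP client or the official client library.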
All the comparisons above are my own opinions. Your experience may vary depending on 1) the audio files you submit to the transcription jobs, 2) the configuration you use when calling the transcription API, and so on.