Automatic Speech Recognition: Amazon Transcribe vs. Google’s Speech-to-Text

Summary

The article compares Amazon Transcribe and Google's Speech-to-Text services for English language transcription based on speed, accuracy, filler sound removal, automatic punctuation, subtitle generation, and underlying machine learning models.

Abstract

The author provides a comparative analysis of Amazon Transcribe and Google's Speech-to-Text services, focusing on the transcription of US-English audio. Google's service is noted for its faster processing time, superior accuracy in recognizing technical terms and acronyms, and automatic removal of filler sounds. Amazon Transcribe, while slower, offers more accurate automatic punctuation and the ability to generate subtitle files directly. Both services have different machine learning models, with Google providing several specialized models and Amazon allowing for the creation of custom models. The author emphasizes that the comparisons are based on personal experience and may vary depending on the audio files and API configurations used.

Opinions

Google's Speech-to-Text is observed to be 2-3 times faster than Amazon Transcribe.
Google's service is more adept at accurately transcribing technical terms and acronyms.
Amazon Transcribe retains filler sounds in the transcription, whereas Google's service removes them.
Amazon Transcribe provides better automatic punctuation in the transcription text.
Amazon Transcribe can directly generate subtitle files (srt and vtt), a feature not natively supported by Google's Speech-to-Text.
Google's Speech-to-Text offers a variety of models tailored for different use cases, while Amazon Transcribe has a single default model but supports custom model creation.
The author notes that their comparisons are subjective and may depend on the specific audio files and API call configurations used.

My two cents, based on hands-on experience with both services for English-language transcription.

Technically speaking, Automatic Speech Recognition (ASR) converts content in a given language from one form to another. The source form is audio, the destination form is text, and both are in the same language. I had the opportunity to experiment with both Amazon Transcribe and GCP’s (Google Cloud Platform) Speech-to-Text to transcribe US-English audio and video files. I am going to compare these two services on the six criteria below.

1. Speed/API call time

From my observation, GCP’s Speech-to-Text is on average at least 2–3 times faster than Amazon Transcribe. For a 20-second audio clip, Amazon Transcribe may take anywhere from 20s to 50s, whereas Speech-to-Text may take anywhere from 5s to 25s. I also observed that, for a set of audio files that all have the same duration, transcription times are more dispersed with Speech-to-Text than with Amazon Transcribe. In other words, Google’s transcription time varies more for audio of a fixed duration, while Amazon’s transcription times are clustered around a higher average.
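A minimal sketch of how such timings could be collected, for anyone who wants to repeat the comparison on their own files. The `transcribe` argument is a hypothetical wrapper (not part of either SDK) that takes a file path and blocks until the transcript is ready. The mean captures speed and the standard deviation captures how dispersed the times are.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_calls(transcribe: Callable[[str], object], audio_files: List[str]) -> List[float]:
    """Return the wall-clock seconds each transcription call took."""
    durations = []
    for path in audio_files:
        start = time.perf_counter()
        transcribe(path)  # hypothetical wrapper: blocks until the transcript is ready
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations: List[float]) -> Dict[str, float]:
    """Summarize timings: mean for speed, standard deviation for dispersion."""
    return {
        "mean": statistics.mean(durations),
        # A sample standard deviation needs at least two measurements.
        "stdev": statistics.stdev(durations) if len(durations) > 1 else 0.0,
        "min": min(durations),
        "max": max(durations),
    }
```

Running `summarize(time_calls(my_wrapper, files))` once per service, on the same list of equal-length clips, gives two summaries that can be compared side by side.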

2. Accuracy

I only want to touch on the accuracy of transcribing technical terms and acronyms. Google’s Speech-to-Text is much better at recognizing technical terms and acronyms than Amazon Transcribe. For terms such as S3 and dev, Amazon Transcribe may produce “s three” and “depth”, whereas Google’s service transcribes them accurately, as they are written here.

3. Filler sounds removal

Google automatically removes filler sounds such as ah, um, mhm, etc. from the transcription text, whereas Amazon keeps them in the text.
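If you want similar output from Amazon Transcribe, the fillers can be stripped in a post-processing step. The following is a rough sketch that works on plain transcript text. The filler list is my own illustrative assumption, not something either service publishes, and it should be tuned for your audio.

```python
import re

# Illustrative list only (an assumption); extend it for your own audio.
FILLERS = ("uh", "um", "ah", "mhm", "hmm")

# Match a whole filler word, an optional trailing comma or period, and any spaces after it.
_FILLER_RE = re.compile(
    r"\b(?:" + "|".join(FILLERS) + r")\b[,.]?\s*",
    flags=re.IGNORECASE,
)


def strip_fillers(text: str) -> str:
    """Remove standalone filler words and tidy the whitespace left behind."""
    cleaned = _FILLER_RE.sub("", text)
    # Drop any space that ended up directly before punctuation.
    cleaned = re.sub(r"\s+([,.!?])", r"\1", cleaned)
    return re.sub(r"\s{2,}", " ", cleaned).strip()
```

Because the pattern uses word boundaries, words that merely contain a filler (such as "human" or "ahead") are left alone. The sketch does not restore capitalization at the start of a sentence after a leading filler is removed.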

4. Automatic Punctuation

Amazon Transcribe’s automatic punctuation seems to be much more accurate than that of Google Speech-to-Text. This might be one of the reasons why Amazon Transcribe is slower than Google Speech-to-Text.

5. Automatic subtitle generation

With the Amazon Transcribe API call, you can configure the transcription job to generate srt and vtt subtitle files. Google does not provide these subtitle files, although you can create them yourself from the results of the Google transcription API call. Here is a Medium blog post on generating subtitles from the Google API call.
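As a rough illustration of the do-it-yourself route for Google, here is a sketch that turns word-level timings into SRT text. It assumes you have already extracted `(word, start_seconds, end_seconds)` tuples from the API response (Speech-to-Text can return per-word timings when `enable_word_time_offsets` is set in the recognition config). Grouping a fixed number of words per cue is my own simplification; the function names are hypothetical.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]  # (text, start_seconds, end_seconds)


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used in SRT files."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: List[Word], max_words: int = 7) -> str:
    """Group words into cues of at most max_words and render them as SRT text."""
    cues = []
    for index, i in enumerate(range(0, len(words), max_words), start=1):
        chunk = words[i:i + max_words]
        # A cue spans from the first word's start to the last word's end.
        start, end = chunk[0][1], chunk[-1][2]
        text = " ".join(w[0] for w in chunk)
        cues.append(f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    # Cues are separated by a blank line.
    return "\n".join(cues)
```

A production version would probably break cues on pauses or punctuation rather than on a fixed word count, but the timestamp formatting stays the same.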

6. Underlying Machine learning models

GCP’s Speech-to-Text has several models for US English, such as phone, video, command, and default. Amazon Transcribe has only one model, which is the default. One good thing about Amazon Transcribe, though, is that you can create your own model. I have talked about this custom model in another blog post here on Medium.
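To make the difference concrete, here is a hedged sketch of how the model choice might show up in the request settings of each service. The helper names are hypothetical. The Google model identifiers are the ones I understand the v1 API to use (`phone_call`, `video`, `command_and_search`, `default`); check the current documentation, since available models change over time. The custom model name on the Amazon side is whatever you chose when you created it.

```python
from typing import Any, Dict, Optional

# Model identifiers as I understand them in Google's v1 API (an assumption to verify).
GOOGLE_MODELS = {"default", "phone_call", "video", "command_and_search"}


def google_config(model: str = "default") -> Dict[str, Any]:
    """Build fields for a Speech-to-Text recognition config (US English)."""
    if model not in GOOGLE_MODELS:
        raise ValueError(f"unknown model: {model}")
    return {
        "language_code": "en-US",
        "model": model,
        "enable_automatic_punctuation": True,
        "enable_word_time_offsets": True,  # per-word timings, useful for subtitles
    }


def amazon_job_args(job_name: str, media_uri: str,
                    custom_model: Optional[str] = None) -> Dict[str, Any]:
    """Build keyword arguments for an Amazon Transcribe transcription job."""
    args: Dict[str, Any] = {
        "TranscriptionJobName": job_name,
        "LanguageCode": "en-US",
        "Media": {"MediaFileUri": media_uri},
        "Subtitles": {"Formats": ["srt", "vtt"]},  # ask for subtitle files with the job
    }
    if custom_model:
        # Only present when you have trained a custom language model.
        args["ModelSettings"] = {"LanguageModelName": custom_model}
    return args
```

These dictionaries are meant to be unpacked into the real clients, for example `speech.RecognitionConfig(**google_config("video"))` with the google-cloud-speech library, or `boto3.client("transcribe").start_transcription_job(**amazon_job_args(...))`. The bucket URI and job name you pass in are your own.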

All the comparisons I made above are my own opinions. Your experience may vary based on 1) what audio files you submit to transcription jobs, 2) what configuration you use when calling the transcription API, etc.
