Jennifer Fu

Summary

The content provides an overview of OpenAI's Whisper, a state-of-the-art speech recognition model, and demonstrates its capabilities through Hugging Face and Google Colab applications, while also discussing its performance in transcribing and translating multilingual audio and video recordings.

Abstract

OpenAI Whisper is a versatile speech recognition model that has been trained on a diverse dataset of 680,000 hours of multilingual audio. It is designed to handle various tasks such as speech translation, language identification, and voice activity detection. The article illustrates the use of Whisper through Hugging Face, where it can transcribe audio recordings and videos into text, including mixed-language scenarios. Despite some inaccuracies, particularly with mixed languages and certain proper nouns like "DALL-E," Whisper represents a significant advancement in automatic speech recognition technology. Additionally, the article guides users through accessing and running Whisper demos on Google Colab, emphasizing the model's ease of use and cloud-based convenience. The performance of Whisper is showcased with examples, and its limitations are acknowledged, suggesting that while the technology is impressive, there is still room for improvement.

Opinions

  • The author suggests that Whisper is a significant improvement over existing speech recognition technologies due to its multitasking capabilities and the scale of its training data.
  • There is an acknowledgment that while Whisper performs well in transcribing and translating speech, it occasionally struggles with mixed-language audio and proper nouns, indicating areas for future development.
  • The use of Hugging Face and Google Colab is presented as a user-friendly way to access and experiment with Whisper, highlighting the accessibility of AI technologies for a broader audience.
  • The article implies that the ability to generate accurate transcriptions and summaries from audio and video has important implications for the field of automatic speech recognition and its applications.
  • The author encourages further exploration of Whisper's capabilities, suggesting that it has the potential to amplify creative potential when used in conjunction with human imagination.

Exploring OpenAI Whisper on Hugging Face

OpenAI Whisper is a general-purpose speech recognition model

Photo by Sander Sammy on Unsplash

Introduction

Whisper is another OpenAI product. It is a general-purpose speech recognition model trained on a range of speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. Because it is a multitask model, a single model handles all of these tasks.

OpenAI released Whisper in September 2022. Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. It is a leap forward from existing speech recognition technologies.

The Whisper architecture is a simple end-to-end approach, implemented as an encoder-decoder Transformer. Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed into an encoder. A decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation. The details are demonstrated by the following diagram:

Image from https://github.com/openai/whisper
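The chunking step described above can be sketched in a few lines of Python. This is only an illustration, not Whisper's actual code: the function names are made up for this article, and the constants (16 kHz sample rate, 160-sample hop between spectrogram frames) are assumptions based on the open-source reference implementation.

```python
from typing import List

SAMPLE_RATE = 16_000   # samples per second (assumed, as in the reference repo)
CHUNK_SECONDS = 30     # chunk length described in the article
HOP_LENGTH = 160       # samples between spectrogram frames (assumed)


def split_into_chunks(
    samples: List[float],
    sample_rate: int = SAMPLE_RATE,
    chunk_seconds: int = CHUNK_SECONDS,
) -> List[List[float]]:
    """Split mono audio samples into fixed-length chunks.

    The last chunk is zero-padded so that every chunk has the same length.
    """
    chunk_size = sample_rate * chunk_seconds
    chunks = []
    for start in range(0, len(samples), chunk_size):
        chunk = samples[start:start + chunk_size]
        chunk = chunk + [0.0] * (chunk_size - len(chunk))  # pad the tail
        chunks.append(chunk)
    return chunks


def frames_per_chunk(
    sample_rate: int = SAMPLE_RATE,
    chunk_seconds: int = CHUNK_SECONDS,
    hop_length: int = HOP_LENGTH,
) -> int:
    """Number of spectrogram frames (time steps) produced for one chunk."""
    return sample_rate * chunk_seconds // hop_length


if __name__ == "__main__":
    audio = [0.0] * (SAMPLE_RATE * 70)  # 70 seconds of silence
    chunks = split_into_chunks(audio)
    print(len(chunks), len(chunks[0]), frames_per_chunk())
```

Under these assumptions, a 70-second recording becomes three chunks, and each chunk is turned into a spectrogram before it reaches the encoder.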

Hugging Face is a platform and open-source provider of machine learning technologies. It is an AI community that builds, trains, and deploys state-of-the-art models on top of open-source machine learning tools, and it turns complicated machine learning models into simple applications.

We use Hugging Face apps to explore OpenAI Whisper.

Whisper Demo on Hugging Face

openai/whisper is a Whisper demo on Hugging Face, which cuts audio after around 30 seconds. Here is the user interface:

Image by author

This English recording has been perfectly transcribed to “It is a beautiful world where bold and innovative artwork flourishes in art renaissance, despite some legal and social challenges.”

Image by author

This Chinese recording has been transcribed to “黃河之水天上來,風流到海不復迴。” There is one wrong character: 風 should be 奔.

Image by author

Here is a mixed English and Chinese recording, “Hello, today is January 21st, 2023. It is Chinese New Year Eve. 新年快樂!” The English part is transcribed well, but the Chinese part is not.

Image by author

Whisper still needs to improve on mixed-language audio. The following image shows the supported languages, along with the Word Error Rate (WER) for each language on the Fleurs dataset. A smaller number is better.

Image from https://github.com/openai/whisper
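For readers unfamiliar with WER, here is a minimal sketch of how it is commonly computed: the word-level edit distance (substitutions, insertions, deletions) divided by the number of words in the reference. The function names and the example sentences are made up for illustration, and the sketch skips the text normalization that published benchmarks usually apply.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance between two word lists."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


if __name__ == "__main__":
    # Hypothetical example: two of five words are wrong, so WER is 0.4.
    print(word_error_rate("it is a beautiful world", "it is the beautiful word"))
```

For languages written without spaces between words, such as Chinese, the same idea is usually applied per character rather than per word.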

Whisper Demo on Google Colab

The app above has an Open in Colab button that links to https://colab.research.google.com/drive/1WJ98KHgZxFGrHiMm4TyWZllSew_Af_ff?usp=sharing. Clicking it opens the notebook whisper-gradio.ipynb in Google Colaboratory (Colab), which runs in the public cloud.

Image by author

As we have explained in a previous article, .ipynb stands for Interactive PYthon NoteBook. There are a number of ways to view and execute a notebook file: Jupyter Notebook, Visual Studio Code, and Google Colaboratory. Colab has recently become a popular working environment for AI computing. It is a free hosted Jupyter notebook service that requires no setup, runs entirely on cloud resources, and lets team members collaborate.

Select the menu, Runtime → Run all:

Image by author

Running the notebook works much like the Whisper demo on Hugging Face. It has perfectly transcribed the Chinese audio to “新年快樂”.

Image by author
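Outside the hosted demos, a similar transcription can be sketched with the open-source whisper Python package. This is a minimal sketch, not the code behind either demo: it assumes the package is installed with pip install openai-whisper and that ffmpeg is available, and the file name new_year.mp3 and the helper names are hypothetical.

```python
from typing import Any, Dict


def transcribe_file(path: str, model_name: str = "base",
                    task: str = "transcribe") -> Dict[str, Any]:
    """Run Whisper on an audio file and return its result dictionary.

    Use task="translate" to translate the speech into English instead.
    """
    import whisper  # imported lazily so the helper below works without it

    model = whisper.load_model(model_name)
    return model.transcribe(path, task=task)


def summarize_result(result: Dict[str, Any]) -> str:
    """Format the detected language and the timestamped segments as text."""
    lines = [f"language: {result.get('language', 'unknown')}"]
    for segment in result.get("segments", []):
        lines.append(
            f"[{segment['start']:.1f}s - {segment['end']:.1f}s] "
            f"{segment['text'].strip()}"
        )
    return "\n".join(lines)


# Example usage (hypothetical file; downloads the model on first run):
# result = transcribe_file("new_year.mp3")
# print(result["text"])
# print(summarize_result(result))
```

Larger models are generally more accurate but slower, so the model name is left as a parameter here.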

Whisper Demo From Video to Text

SteveDigital/free-fast-youtube-url-video-to-text-using-openai-whisper is another Whisper demo on Hugging Face. Given the link to a YouTube video, it generates a text transcript of the video. Transcription takes 5–10 seconds per minute of video. Here is the user interface:

Image by author

Type the YouTube video link, https://www.youtube.com/watch?v=qTgPSKKjfVg, and click the Transcribe button:

Image by author

This Whisper demo generates the transcript and summary.

Here is the original transcript on YouTube:

Have you ever seen a polar bear playing bass? Or a robot painted like a Picasso? Didn’t think so. DALL-E 2 is a new AI system from OpenAI that can take simple text descriptions like, “a koala dunking a basketball” and turn them into photorealistic images that have never existed before. DALL-E 2 can also realistically edit and retouch photos. Based on a simple natural language description, it can fill in or replace part of an image with AI-generated imagery that blends seamlessly with the original. It’s called “in-painting”. In January 2021, OpenAI introduced DALL-E, a system that could generate images from text, like this “Avocado Armchair”. DALL-E 2 takes the technology even further with higher resolution, greater comprehension, and new capabilities like in-painting. It can even start with an image as an input and create variations with different angles and styles. DALL-E was created by training a neural network on images and their text descriptions. Through deep learning, it not only understands individual objects, like koala bears and motorcycles, but learns from relationships between objects. And when you ask DALL-E for an image of a koala bear riding a motorcycle, it knows how to create that or anything else with a relationship to another object or action. The DALL-E research has three main outcomes: First, it can help people express themselves visually in ways they may not have been able to before. Second, an AI-generated image can tell us a lot about whether the system understands us, or is just repeating what it has been taught. Third, DALL-E helps humans understand how advanced AI systems see and understand our world. This is a critical part of developing AI that’s useful and safe. The technology is constantly evolving, and DALL-E 2 has limitations. If it’s taught with objects that are incorrectly labeled, like a plane labeled “car”, and a user tries to generate a car, DALL-E may create…a plane. 
It’s like talking to a person who learned the wrong word for something. DALL-E can also be limited by gaps in its training. For example, if you type “baboon” and DALL-E has learned what a baboon is through images and accurate labels, it will generate a lot of great baboons. But if you type “howler monkey” and it hasn’t learned what a howler monkey is, DALL-E will give you its best idea of what it thinks it could be: like a “howling monkey”. What’s exciting about the approach used to train DALL-E is that it can take what it learned from a variety of other labeled images and then apply it to a new image. Given a picture of a monkey, DALL-E can infer what it would look like doing something it’s never done before. Like paying its taxes, while wearing a funny hat. DALL-E is an example of how imaginative humans and clever systems can work together to make new things — amplifying our creative potential.

Here is the generated transcript:

Have you ever seen a polar bear playing bass? Or robot painted like a Picasso? Didn’t think so. Dolly too is a new AI system from OpenAI that can take simple text descriptions like a Koala-Dunking Abaskable and turn them into photo realistic images that have never existed before. Dolly too can also realistically edit and retouch photos. Based on a simple natural language description, it can fill in or replace part of an image with AI-generated imagery that blends seamlessly with the original. It’s called in-painting. In January 2021, OpenAI introduced Dolly, a system that could generate images from text, like this avocado armchair. Dolly too takes the technology even further with higher resolution, greater comprehension, and new capabilities, like in-painting. It can even start with an image as an input and create variations with different angles and styles. Dolly was created by training a neural network on images and their text descriptions. Through deep learning, it not only understands individual objects like koala bears and motorcycles, but learns from relationships between objects. And when you ask Dolly for an image of a koala bear riding a motorcycle, it knows how to create that or anything else with a relationship to another object or action. The Dolly research has three main outcomes. First, it can help people express themselves visually in ways they may not have been able to before. Second, an AI-generated image can tell us a lot about whether the system understands us or is just repeating what it’s been taught. Third, Dolly helps humans understand how AI systems see and understand our world. This is a critical part of developing AI that’s useful and safe. The technology is constantly evolving, and Dolly too has limitations. If it’s taught with images that are incorrectly labeled like a plain labeled car, and a user tries to generate a car, Dolly may create a plane. It’s like talking to a person who learned the wrong word for something. 
Dolly can also be limited by gaps in its training. If you type that boon and Dolly has learned what a bad boon is through images and accurate labels, it will generate a lot of great bad boons. But if you type howler monkey, and it hasn’t learned what a howler monkey is, Dolly will give you its best idea of what it thinks it could be, like a howling monkey. What’s exciting about their approach used to train Dolly is that it can take what it learned from a variety of other labeled images, and then apply it to a new image. Given a picture of a monkey, Dolly can infer what it would look like doing something it’s never done before, like paying its taxes, while wearing a funny hat. Dolly is an example of how imaginative humans and clever systems can work together to make new things, amplifying our creative potential.

Do you notice the issue?

It did not recognize the term DALL-E, which it transcribed as Dolly. Likewise, DALL-E 2 became Dolly too.

We use Diffchecker to compare the differences:

Image by author

The transcription is generally good, apart from the term DALL-E and a few other slips such as “Koala-Dunking Abaskable” and “bad boons”.
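The comparison done with Diffchecker can also be sketched in Python with the standard difflib module. The function name word_diff is made up for this example. It splits both transcripts into words and reports only the spans that differ.

```python
import difflib
from typing import List, Tuple


def word_diff(reference: str, hypothesis: str) -> List[Tuple[str, str, str]]:
    """Return (operation, reference words, hypothesis words) for each change."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    matcher = difflib.SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only replace, delete, and insert spans
            changes.append(
                (tag, " ".join(ref_words[i1:i2]), " ".join(hyp_words[j1:j2]))
            )
    return changes


if __name__ == "__main__":
    original = "DALL-E 2 is a new AI system"
    generated = "Dolly too is a new AI system"
    for change in word_diff(original, generated):
        print(change)
```

Running it on the first few words of the two transcripts reports a single replacement, "DALL-E 2" with "Dolly too".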

The same conclusion is true for the summary:

Dolly is a new AI system from OpenAI that can take simple text descriptions like a Koala-Dunking Abaskable and turn them into photo realistic images. Dolly too can also realistically edit and retouch photos based on a simple natural language description.

Conclusion

We have explained Whisper, a general-purpose speech recognition model. OpenAI released Whisper in September 2022. Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web.

We have explored two examples on Hugging Face:

  • Transcribe an audio recording to text.
  • Generate a text transcript of the video, along with a summary.

Whisper is a leap forward from existing speech recognition technologies, and it is likely to have a real impact on the ASR field.

Thanks for reading.

Want to Connect? 

If you are interested, check out my directory of web development articles.
