Introduction
In this article, we will take a complete deep dive into OpenAI Whisper by covering both its API and the open-source version available in its GitHub repository. We will first understand what OpenAI Whisper is, then look at the respective offerings and limitations of the API and the open-source version. Finally, we will walk through detailed examples of the Whisper models to showcase their variety of features and capabilities.
What is OpenAI Whisper
OpenAI Whisper is an Automatic Speech Recognition (ASR) model trained on 680,000 hours of multilingual data covering 98 languages. The model allows you to transcribe audio recordings in various languages and can also translate them into English while transcribing.
Supported Languages
At the time of writing this, OpenAI Whisper supports the following languages –
Afrikaans, Arabic, Armenian, Azerbaijani, Belarusian, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Maori, Nepali, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Welsh.
Supported Input File Formats
OpenAI Whisper supports the following file formats for audio –
mp3, mp4, mpeg, mpga, m4a, wav, webm
Supported Output File Formats
Whisper supports the following output file formats –
Text, JSON, SRT, VTT, and TSV (only in open source)
Available Models
OpenAI Whisper provides 9 models in total. Out of these, 4 models support only the English language (suffixed with .en), whereas the other 5 models support multiple languages.
The smaller the model size, the faster it gives output, but the accuracy may suffer. On the other hand, the larger models give more accurate outputs but may take a longer time for inference.
Size | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed
---- | ---------- | ------------------ | ------------------ | ------------- | --------------
tiny | 39 M | tiny.en | tiny | ~1 GB | ~32x
base | 74 M | base.en | base | ~1 GB | ~16x
small | 244 M | small.en | small | ~2 GB | ~6x
medium | 769 M | medium.en | medium | ~5 GB | ~2x
large | 1550 M | N/A | large | ~10 GB | 1x
Using Prompts with Whisper
A good thing about Whisper is that it allows you to use prompts to slightly improve the quality of the transcription. Prompts can be really useful in scenarios like –
- The default transcription by the model is missing some punctuation.
- The model is skipping the filler words, but you want to retain them in the transcription.
- To correct the spelling of words or acronyms that the model is failing to recognize.
- The model is failing to understand a certain style of speaking or writing; a prompt may help it along.
- When you split a large audio file into chunks, you want to preserve the overall context while transcribing the individual chunks (see the sketch after this list).
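As a rough illustration of the last point, below is a minimal sketch, assuming the chunk files chunk_0.mp3, chunk_1.mp3, ... already exist in playback order. It transcribes each chunk with the openai.Audio.transcribe call covered later in this article and feeds the previous chunk's transcript as the prompt for the next call –

import openai

# Hypothetical pre-split pieces of one long recording, in playback order
chunk_paths = ["chunk_0.mp3", "chunk_1.mp3", "chunk_2.mp3"]

full_transcript = ""
previous_text = ""  # carried forward so each chunk sees the prior context

for path in chunk_paths:
    with open(path, "rb") as audio_file:
        part = openai.Audio.transcribe(
            model="whisper-1",
            file=audio_file,
            response_format="text",
            prompt=previous_text,  # Whisper only considers the trailing tokens of a long prompt
        )
    previous_text = part
    full_transcript += part + " "

print(full_transcript)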
How to use OpenAI Whisper
There are two main ways in which you can work with OpenAI Whisper –
- OpenAI Whisper API
- Open Source Repository
Both the OpenAI API and the open-source repository offer the same features and capabilities, but each has its own pros and cons. Let us discuss them in detail in the sections below.
OpenAI Whisper API
Just like Dall-E 2 and ChatGPT, OpenAI has made Whisper available as an API for public use. However, it is a paid API that costs $0.006 per minute of audio transcription or translation. On the upside, since the API is hosted on OpenAI’s infrastructure, it is optimized for speed and performance and gives faster inference results.
The Whisper API exposes two endpoints, one for transcription and one for translation. Under the hood, the API uses the large-v2 model of Whisper and supports all the file formats that we saw in the earlier section. However, the audio file size is limited to 25 MB per upload. There is also a rate limit of 50 requests per minute for calling the API.
Pros
- Whisper APIs are optimized for speed and performance with the large-v2 model.
- No need to have your own GPU-enabled machines (as required to run the open-source version).
- APIs are quick and easy to integrate with your application.
Cons
- The pricing associated with the API can be a deal breaker for many, especially individuals.
- There is no option to select other Whisper models; the API is built only on the large-v2 model.
- The API cannot be modified or customized as per your requirements.
- There is a 25 MB limit on audio files uploaded through the API. For files larger than 25 MB, you have to chunk your audio externally before calling the API (see the sketch after this list).
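For reference, a minimal sketch of such external chunking with the third-party pydub package (installed separately via pip, and which itself needs ffmpeg) is shown below. The 10-minute chunk length is an assumption that keeps each piece well under 25 MB at typical mp3 bitrates –

from pydub import AudioSegment

# Hypothetical long recording that exceeds the 25 MB upload limit
audio = AudioSegment.from_file("/content/long_audio.mp3")

chunk_length_ms = 10 * 60 * 1000  # 10-minute chunks (a rough size target)
for i, start in enumerate(range(0, len(audio), chunk_length_ms)):
    chunk = audio[start:start + chunk_length_ms]  # pydub slices by milliseconds
    chunk.export(f"chunk_{i}.mp3", format="mp3")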
Recommendations
- If you are an enterprise or a startup entrepreneur and want to harness the power of Automatic Speech Recognition in your applications, the OpenAI Whisper API will be a clean solution for you at this reasonable price point. You don’t need to worry about maintaining the open-source code base or procuring GPU-enabled servers for performance. However, if you are looking for finer control or customization, the API will not serve that need.
- For individuals, if you don’t have money to spend but are just curious, you can still explore the OpenAI Whisper API with the free credit that you get when signing up for OpenAI.
Setup of OpenAI Whisper API in Python
In this section, we will see how to set up OpenAI Whisper API in Python.
Install OpenAI Python Package
To work with the Whisper API in Python, you need to install the OpenAI package by using the pip command as shown below –
In [0]:
pip install openai
Load OpenAI Package
Next, let us import the OpenAI package.
In [1]:
import openai
Set OpenAI API Key
To access the Whisper API, you are required to create an API key in your OpenAI account dashboard. Once created, you then need to set the API key as shown below. (Make sure you replace <your API key> with the actual API key that you have generated.)
In [2]:
openai.api_key = "<your API key>"
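Alternatively, to avoid hard-coding the key in your script, a common pattern is to read it from an environment variable –

import os

# Assumes the key was exported beforehand, e.g. export OPENAI_API_KEY="sk-..."
openai.api_key = os.getenv("OPENAI_API_KEY")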
Example of OpenAI Whisper API in Python
Example 1 – Transcription of English Audio with Whisper API
In this section, we will see examples of the basic use of OpenAI Whisper API and how we can generate the output in various available formats.
Input Audio
For this example, we are going to use the below English Audio.
Transcription to Text Format
To begin with, we open the audio file in binary read mode. Next, we call the openai.Audio.transcribe function and pass the audio file, the model as “whisper-1”, and the response_format as “text“.
It should be noted that the underlying model of “whisper-1” is large-v2 at the time of writing. Finally, we print the transcript, which matches the input audio very accurately.
In [3]:
audio_file = open("/content/english.mp3", "rb")
transcript = openai.Audio.transcribe(model="whisper-1", file=audio_file, response_format="text")
print(transcript)
Out[3]:
So I'm handed the calculus book and I opened a page in the front cover and there are these equations. And I said, I will never learn this. That's what I said, that's what I felt. Is that any worse than you coming upon a book of Mandarin? And you don't know any Chinese characters, you say, I don't know any of this. Yeah, except one and a half billion people in the world speak it, so it can't be impossible to learn. So you put in the time and slowly some of the characters reveal, oh, that means a human or that's a home or that's a food. And all of a sudden the characters start making sense. It's not really any different from that.
Transcription to JSON Format
In this example, everything remains the same as above except that the response_format is “json”, which produces the transcript result in JSON format.
In [4]:
audio_file = open("/content/english.mp3", "rb")
transcript = openai.Audio.transcribe(model="whisper-1", file=audio_file, response_format="json")
print(transcript)
Out[4]:
{ "text": "So I'm handed the calculus book and I opened a page in the front cover and there are these equations. And I said, I will never learn this. That's what I said, that's what I felt. Is that any worse than you coming upon a book of Mandarin? And you don't know any Chinese characters, you say, I don't know any of this. Yeah, except one and a half billion people in the world speak it, so it can't be impossible to learn. So you put in the time and slowly some of the characters reveal, oh, that means a human or that's a home or that's a food. And all of a sudden the characters start making sense. It's not really any different from that." }
Transcription to VTT Format
This OpenAI Whisper example sets response_format to “vtt” to generate the output in VTT format.
In [5]:
audio_file = open("/content/english.mp3", "rb")
transcript = openai.Audio.transcribe(model="whisper-1", file=audio_file, response_format="vtt")
print(transcript)
Out[5]:
WEBVTT

00:00:00.000 --> 00:00:05.800
So I'm handed the calculus book and I opened a page in the front cover and there are these equations.

00:00:05.800 --> 00:00:09.300
And I said, I will never learn this.

00:00:09.300 --> 00:00:10.900
That's what I said, that's what I felt.

00:00:10.900 --> 00:00:14.300
Is that any worse than you coming upon a book of Mandarin?

00:00:14.300 --> 00:00:17.600
And you don't know any Chinese characters, you say, I don't know any of this.

00:00:17.600 --> 00:00:22.300
Yeah, except one and a half billion people in the world speak it, so it can't be impossible to learn.

00:00:22.300 --> 00:00:28.800
So you put in the time and slowly some of the characters reveal, oh, that means a human or that's a home or that's a food.

00:00:28.800 --> 00:00:30.900
And all of a sudden the characters start making sense.

00:00:30.900 --> 00:00:33.100
It's not really any different from that.
Transcription to SRT Format
This example sets response_format to “srt” to generate the output in SRT format.
In [6]:
audio_file = open("/content/english.mp3", "rb")
transcript = openai.Audio.transcribe(model="whisper-1", file=audio_file, response_format="srt")
print(transcript)
Out[6]:
1
00:00:00,000 --> 00:00:05,800
So I'm handed the calculus book and I opened a page in the front cover and there are these equations.

2
00:00:05,800 --> 00:00:09,300
And I said, I will never learn this.

3
00:00:09,300 --> 00:00:10,900
That's what I said, that's what I felt.

4
00:00:10,900 --> 00:00:14,300
Is that any worse than you coming upon a book of Mandarin?

5
00:00:14,300 --> 00:00:17,600
And you don't know any Chinese characters, you say, I don't know any of this.

6
00:00:17,600 --> 00:00:22,300
Yeah, except one and a half billion people in the world speak it, so it can't be impossible to learn.

7
00:00:22,300 --> 00:00:28,800
So you put in the time and slowly some of the characters reveal, oh, that means a human or that's a home or that's a food.

8
00:00:28,800 --> 00:00:30,900
And all of a sudden the characters start making sense.

9
00:00:30,900 --> 00:00:33,100
It's not really any different from that.
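Since the API returns the subtitles as plain text, saving them as an actual .srt file is just a matter of writing the string to disk; a minimal sketch (the output file name is our choice) –

# Write the SRT string returned above to a subtitle file
with open("english.srt", "w") as srt_file:
    srt_file.write(transcript)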
Example 2 – Transcription of Spanish Audio with Whisper API
In this example of OpenAI Whisper, we will show how it can transcribe audio in a language other than English.
Input Audio
We are going to use the below Spanish Audio for this example.
Transcription
Just like in the previous examples, we open the Spanish audio file and pass it to openai.Audio.transcribe, and it generates a pretty good transcription in the Spanish language.
In [7]:
audio_file = open("/content/spanish.mp3", "rb")
transcript = openai.Audio.transcribe(model="whisper-1", file=audio_file, response_format="srt")
print(transcript)
Out[7]:
1
00:00:00,000 --> 00:00:03,000
Ojalá nuestro viaje hubiera durado un par de días más.

2
00:00:03,000 --> 00:00:29,000
Seguramente voy a atesorar los hermosos recuerdos que tuve durante este pequeño viaje.
Example 3 – Using Prompts in Whisper API
In the example below, we will see how prompts can make the Whisper model include filler words in English that it would otherwise skip.
Input Audio
In this example of using prompts in OpenAI Whisper, we are going to use the below English Audio that has several fillers “umm.. ” throughout the sentence.
Without Prompt
In this example, we did not give any prompt, and as you can see, the Whisper model has excluded all the filler words “umm..” from the resulting transcription.
In [8]:
audio_file = open("/content/english_fillers.mp3", "rb")
transcript = openai.Audio.transcribe(model="whisper-1", file=audio_file, response_format="text")
print(transcript)
Out[8]:
So I was like I'm very anxious in the beginning then maybe I should have been more focused with my approach, but anyways what has happened has happened.
With Prompt
In this example, we gave a prompt sentence that included the filler word “umm”, and as you can see from the output, the model retained the filler words everywhere while transcribing.
In [9]:
audio_file = open("/content/english_fillers.mp3", "rb")
transcript = openai.Audio.transcribe(model="whisper-1", file=audio_file, response_format="text", prompt="Umm, let me think like, umm... Okay, here's what I'm, like, thinking.")
print(transcript)
Out[9]:
So I was like, I'm very anxious in the beginning then, umm... Maybe I should have, umm... Been more focused with my approach, but umm... Anyways, what has happened, umm... Has happened.
Example 4 – Hindi to English Translation in Whisper API
In this section, let us see an example of the translation endpoint of the Whisper API.
Input Audio
For this example, we are going to use the below Hindi Audio to translate it into English.
Translation
To translate, we have to use the openai.Audio.translate function as shown below. It was able to translate the Hindi audio into English text with pretty good accuracy.
In [10]:
audio_file = open("/content/hindi.mp3", "rb")
transcript = openai.Audio.translate(model="whisper-1", file=audio_file, response_format="vtt")
print(transcript)
Out[10]:
WEBVTT

00:00:00.000 --> 00:00:04.800
I had to face many such moments in my life where I had to struggle.

00:00:06.200 --> 00:00:11.000
Once, I was disheartened and went to meet Pooja Babuji.

00:00:11.640 --> 00:00:15.000
I told him that Babuji, there is lot of struggle in life.

00:00:16.640 --> 00:00:18.640
He told me a very good thing.

00:00:18.640 --> 00:00:30.640
Son, as long as there is life, there is struggle.
OpenAI Whisper Open Source
The open-source version of Whisper is available on GitHub, and it offers a transcription and translation experience similar to the API version. You just need to install the Whisper package from the repository and set it up locally to start running inference with the model. However, if you do not have a GPU in your system, you may experience slowness with the larger models. As a workaround, you may use the smaller models, but at the cost of lower accuracy.
Pros
- Being open source, it is free to use.
- You have the flexibility to choose from the available models.
- You have finer control over the code and can customize it if required.
Cons
- If you don’t have a GPU, you will experience slowness with the large models.
- You will have to set up and maintain the code on your own, which may be cumbersome for many.
Recommendations
- The good thing is that since the open-source version of Whisper is free, there is no entry barrier, and it can be used by anyone who knows how to work with code – enterprises, startups, individuals, hobbyists, etc.
- If you are an enterprise or startup looking for customization, then you should use the open-source version. Even if you are not looking for customization but are already invested in GPU-enabled servers and have a good team to set up and maintain code, you should seriously consider the open-source version of Whisper.
Setup of OpenAI Whisper Open Source in Python
In this section, we will see how to set up the open-source version of Whisper in Python.
Install Whisper Python Package
The Whisper Python package can be installed by using the pip command as shown below –
In [0]:
pip install -U openai-whisper
Install FFMPEG Tool
Next, we have to install ffmpeg, which is a command-line tool for handling multimedia files like audio and video.
Depending on your operating system, you can download and install it by using the respective package manager.
In [1]:
# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on Arch Linux
sudo pacman -S ffmpeg

# on MacOS using Homebrew (https://brew.sh/)
brew install ffmpeg

# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg

# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg
Example of Whisper Open Source Command Line
Example 1 – Transcription of English Audio
In this section, we will see examples of the basic use of the Whisper command line to transcribe English audio.
Input Audio
We are going to use the below English Audio.
Transcription
To transcribe any audio, just type whisper in the command prompt along with the path of the audio file. By default, it uses the small model, which it downloads at runtime when used for the first time.
Whisper uses the first 30 seconds of the audio to detect its language and then processes the complete audio to generate the transcription.
Besides the output in the console (shown below), it creates 5 transcription files in the working directory, in the following formats – Text, JSON, SRT, VTT, and TSV. (You can restrict the generated files by passing a specific format to the --output_format parameter.)
Tip: You can transcribe/translate multiple files in one go by specifying their names, separated by spaces, on the command line (see the example below).
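For example, a command along the following lines would transcribe two files in one go and keep only the SRT output (the file names are illustrative) –

whisper "/content/english.mp3" "/content/spanish.mp3" --output_format srt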
In [2]:
whisper "/content/english.mp3"
Out[2]:
100%|███████████████████████████████████████| 461M/461M [00:07<00:00, 63.2MiB/s]
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
[00:00.000 --> 00:04.240] So I'm handed the calculus book and I opened a page in the front cover and there these
[00:04.960 --> 00:07.480] Equations and I said I will never
[00:08.100 --> 00:10.680] Learn this that's what I said. That's what I felt
[00:10.680 --> 00:16.280] Is that any worse than you coming upon a book of Mandarin and you don't know any Chinese characters?
[00:16.280 --> 00:20.120] I don't know any of this yeah except one and a half billion people in the world speak it
[00:20.120 --> 00:25.440] So it's can't be impossible to learn so you put in the time and slowly some of the characters reveal
[00:25.440 --> 00:30.680] That means a human or that's a home or that's a food and all of a sudden the characters start making sense
[00:30.680 --> 00:32.920] It's not really any different from that
Example 2 – Transcription of Audio with Specific Model
In this example, we pass the specific model “large” on the command line to transcribe the same audio file that we used in the previous example. Compared to the small model, the large model applies proper punctuation in the output but takes a bit more time to execute.
In [3]:
whisper "/content/english.mp3" --model large
Out[3]:
100%|██████████████████████████████████████| 2.87G/2.87G [00:26<00:00, 116MiB/s]
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
[00:00.000 --> 00:06.000] So I'm handed the calculus book, and I opened a page in the front cover, and there are these equations.
[00:06.000 --> 00:09.000] And I said, I will never learn this.
[00:09.000 --> 00:11.000] That's what I said, that's what I felt.
[00:11.000 --> 00:14.000] Is that any worse than you coming upon a book of Mandarin?
[00:14.000 --> 00:17.000] And you don't know any Chinese characters, you say, I don't know any of this.
[00:17.000 --> 00:22.000] Yeah, except one and a half billion people in the world speak it, so it can't be impossible to learn.
[00:22.000 --> 00:28.000] So you put in the time, and slowly some of the characters reveal, oh, that means a human, or that's a home, or that's a food.
[00:28.000 --> 00:31.000] And all of a sudden the characters start making sense.
[00:31.000 --> 00:33.000] It's not really any different from that.
Example 3 – Specifying the Language of Model
In the previous two examples, we saw that Whisper uses the first 30 seconds of the audio file to detect its language. This takes a bit of time, and you can cut it down by specifying the language upfront. Let us see this in the example below.
Input Audio
We are going to use the below Spanish Audio for this example.
Transcription
In this example, we pass the language code for Spanish, ‘es’, on the command line, and as you can see from the output, the model does not waste time detecting the language and directly transcribes the audio.
(Also, the small model is used by default here, and since it was already downloaded in the first example, it is not downloaded again.)
In [4]:
whisper '/content/spanish.mp3' --language es
Out[4]:
[00:00.000 --> 00:03.940] Ojalá nuestro viaje hubiera durado un par de días más.
[00:03.940 --> 00:08.600] Seguramente voy a atesorar los hermosos recuerdos que tuve durante este pequeño viaje.
Example 4 – Translating Audio into English Text
In this section, we will see how to translate audio of other supported languages into English text with the Whisper model.
Input Audio
We are going to use the below Hindi Audio to translate it into English for this example.
Translation
To translate audio, we have to specify the task as “translate” (the default is “transcribe”). In this example, we used the “large” model to generate the best possible translation of Hindi audio into English text.
In [5]:
whisper '/content/hindi.mp3' --task translate --model large
Out[5]:
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: Hindi
[00:00.000 --> 00:06.200] There were many such moments in my life where I had to struggle.
[00:06.200 --> 00:16.640] Once when he was alive, I went to Pooja Babuji and told him that there is a lot of struggle in life.
[00:16.640 --> 00:24.400] He told me a very good thing. He said that as long as there is life, there is struggle.
Example 5 – Using Prompt with Whisper Model
In this example, we will see how you can use prompts to persuade the Whisper model to include the filler words from the audio, which it generally tends to skip.
Input Audio
For this example, we are going to use the below English Audio that has multiple fillers “umm.. ” throughout the sentence.
Without Prompt
In this example, we did not give any prompt, and as you can see, the Whisper model has excluded all the filler words “umm..” from the resulting transcription.
In [6]:
whisper "/content/english_fillers.mp3" --model large
Out[6]:
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
[00:00.000 --> 00:09.000] So I was like, I'm very anxious in the beginning and maybe I should have been more focused with my approach, but anyways, what has happened has happened.
With Prompt
Now we use an initial prompt sentence that includes the filler word “umm”, and as you can see from the transcription, the model retained the filler words everywhere.
In [7]:
whisper "/content/english_fillers.mp3" --model large --initial_prompt "Umm, let me think like, umm... Okay, here's what I'm, like, thinking."
Out[7]:
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
[00:00.000 --> 00:09.000] So I was, like, umm, very anxious in the beginning, then, umm, maybe I should have, umm, been more focused with my approach, but, umm, anyways, what has happened, umm, has happened.
Example of Whisper Open Source in Python Code
Import Whisper Package
To begin with, we have to import the Whisper Python package as shown below. (This was already installed using pip in the setup section above.)
In [0]:
import whisper
Example 1 – Transcription with Whisper Python Package
Input Audio
In this example, we are going to use the below English Audio that has the filler word “umm.. ” in many places in the sentence.
Transcription
First, we have to initialize Whisper by loading one of the available models. In this example, we loaded the “large” model. Next, we call the transcribe function and pass the audio file path as a parameter. Finally, we print the transcribed text as output.
It should be noted that the model skipped the filler words “umm..” in the output, which we will address with a prompt in the next example.
In [1]:
model = whisper.load_model("large")
result = model.transcribe(audio="/content/english_fillers.mp3")
print(result["text"])
Out[1]:
So I was like I'm very anxious in the beginning then maybe I should have been more focused with my approach but anyways what has happened has happened.
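Beyond the plain text, the result dictionary returned by transcribe also carries segment-level timestamps, which can be iterated over; a quick sketch reusing the result from above –

# Each segment carries start/end times (in seconds) along with its text
for segment in result["segments"]:
    print(f"[{segment['start']:.2f} --> {segment['end']:.2f}] {segment['text']}")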
Example 2 – Using Prompt in Whisper Python Package
In this example, we make use of the initial_prompt parameter to pass a prompt that includes the filler word “umm..”. This tells the model not to skip the filler words like it did in the previous example. As evident from the output, this trick works, and the filler words “umm..” are preserved in the output.
In [2]:
model = whisper.load_model("large")
result = model.transcribe(audio="/content/english_fillers.mp3", initial_prompt="Umm... let me think like, umm... Okay, here's what I am like umm... thinking.")
print(result["text"])
Out[2]:
So I was like umm... very anxious in the beginning then umm... maybe I should have umm... been more focused with my approach but umm... anyways what has happened umm... has happened.
Example 3 – Translation in Whisper Python Package
Input Audio
We are going to use the below Spanish Audio for this example and translate it to English text.
Translation
To translate the audio, we just need to pass the task parameter with the value ‘translate’ to the transcribe function, as shown in the example below.
In [3]:
model = whisper.load_model("small")
result = model.transcribe(audio="/content/spanish.mp3", task="translate")
print(result["text"])
Out[3]:
I hope our trip would have lasted a couple of more days. Surely I'm going to treasure the beautiful memories I had during this little trip.
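As a closing note, the earlier command-line examples showed Whisper detecting the language from the first 30 seconds of audio. The Python package also exposes this step directly through its lower-level API; the sketch below is adapted from the snippet in the project README –

model = whisper.load_model("base")

# load audio and pad/trim it to fit 30 seconds
audio = whisper.load_audio("/content/spanish.mp3")
audio = whisper.pad_or_trim(audio)

# make a log-Mel spectrogram and move it to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# detect the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")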