Whisper Deployment Decisions: Part I — Evaluating Latency, Costs, and Performance Metrics

Published in ML6team, Jul 21, 2023 (3 min read)

Introduction to Whisper

In September 2022, OpenAI introduced Whisper, a robust speech recognition model trained on diverse audio data. Its multitasking capabilities support tasks like multilingual speech recognition, speech translation, and language identification. With 10 models of varying sizes and capabilities, Whisper offers a versatile and effective solution for speech recognition requirements.

Whisper models and their details (Source: https://github.com/openai/whisper)

While numerous articles cover training, fine-tuning, and explaining the paper behind Whisper, there is a scarcity of resources focusing on running Whisper in a production environment.

Deploying into production requires a thorough consideration of three major factors:

💪 Performance metrics

⏱️ Latency

💰 Deployment cost

In this two-part blog series, we look at the practicalities of running Whisper in a production environment. This first part covers the tradeoffs between model sizes and GPUs and which combinations work best. The second part takes a closer look at how tools and techniques like JAX, ONNX, and KernlAI affect these metrics.

Benchmarking Performance Metrics of Whisper Models

Fig. 1. Word error rate (WER) for Whisper models on librispeech_asr (test split)

Fig. 1 shows that larger models perform better, as indicated by their lower word error rate (WER). If your only objective is the best possible transcription quality, with no concerns about latency and cost, the largest model is the ideal choice.
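For readers unfamiliar with the metric, here is a minimal sketch of how a word error rate can be computed: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The article does not say which WER implementation or text normalization was used for Fig. 1, so the lowercasing and whitespace tokenization below are assumptions made for illustration only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words.

    Assumption: text is normalized only by lowercasing and splitting on
    whitespace. Real evaluations usually apply a richer normalizer.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (minus the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One of six reference words is dropped by the hypothesis.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

A WER of 0 means a perfect transcript. Because insertions count as errors, the value can exceed 1 for very poor hypotheses.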

While larger models typically offer better performance, their impact on latency and cost must be taken into account. Finding the right balance is crucial for an informed decision.

Benchmarking Inference Speed of Whisper Models

Using HuggingFace’s 🤗 Whisper implementation, we benchmarked the multilingual models across different batch sizes (1, 2, 4, 8, and 16) on CPUs and GPUs (T4, P100, and A100) to evaluate inference speed. All benchmarks were run on the test split of the HuggingFace 🤗 dataset librispeech_asr.

Note: Missing bars/points mean that we were not able to run inference for that configuration because of an out-of-memory error.
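The benchmarking code itself is not part of this post. As a rough sketch of how such a measurement could be set up, the helper below times a batched transcription callable and reports the mean wall-clock time per sample. The name `transcribe_batch` is hypothetical: it stands for whatever function runs the model on a list of audio samples. The warm-up step is also an assumption, not something the benchmark above is known to have used.

```python
import time
from typing import Callable, Sequence


def mean_time_per_sample(
    transcribe_batch: Callable[[Sequence], object],
    samples: Sequence,
    batch_size: int,
    warmup_batches: int = 1,
) -> float:
    """Return the mean wall-clock seconds spent per sample.

    `transcribe_batch` is a hypothetical callable that takes a batch of
    samples and returns only once the transcripts are ready.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if len(samples) == 0:
        raise ValueError("samples must not be empty")

    # Split the samples into consecutive batches; the last one may be smaller.
    batches = [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]

    # Warm-up runs are excluded from timing so that one-off costs
    # (lazy initialization, caching) do not distort the mean.
    for batch in batches[:warmup_batches]:
        transcribe_batch(batch)

    start = time.perf_counter()
    for batch in batches:
        transcribe_batch(batch)
    elapsed = time.perf_counter() - start
    return elapsed / len(samples)


if __name__ == "__main__":
    # Stand-in workload so the sketch runs without a model or a GPU.
    def fake_transcribe(batch):
        time.sleep(0.001 * len(batch))
        return ["" for _ in batch]

    for size in (1, 2, 4, 8, 16):
        seconds = mean_time_per_sample(fake_transcribe, list(range(32)), size)
        print(f"batch size {size}: {seconds:.4f} s per sample")
```

Calling the same helper with each batch size on each device yields the kind of comparison shown in the figures. A configuration that runs out of memory would raise an error instead of returning a time.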

Fig. 2. Mean inference time comparison over 600 samples (each 30 seconds long)

Fig. 3. Mean inference time comparison over 600 samples (each 30 seconds long) with different batch sizes

The following observations can be derived from Fig. 2 and Fig. 3:

Inference time increases with model size, since larger models have more parameters.
Running Whisper on CPUs is noticeably slower than running it on GPUs.
Irrespective of the model size, inference is fastest on the A100.
For a batch size of 1, Whisper models are faster on the T4 than on the P100. However, for batch sizes > 1, only the tiny and small models are faster on the T4.
Batched inference (batch size > 1) is faster than inference with a batch size of 1, where no batching is involved.

Benchmarking Cost of Running Whisper Models

We benchmarked the cost of transcribing 5 hours of audio with different Whisper models on different GPUs. The results are presented below:

Fig. 4. Cost of running different Whisper models on 5 hours of speech data (600 samples, each 30 seconds)

Fig. 4 shows that faster GPUs are also more expensive to run.
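The post does not spell out how the cost figures were derived. One plausible way to estimate such a cost, sketched below, is to multiply the total inference time by the hourly price of the machine. Both this formula and the numbers in the usage example are assumptions for illustration, not the measured values behind the figures.

```python
def audio_hours(num_samples: int, seconds_per_clip: float) -> float:
    """Total duration of the audio in hours."""
    return num_samples * seconds_per_clip / 3600.0


def transcription_cost(
    mean_seconds_per_sample: float,
    num_samples: int,
    price_per_hour: float,
) -> float:
    """Estimated cost of a run, assuming billing proportional to compute time.

    mean_seconds_per_sample: measured mean inference time for one sample
    price_per_hour: hourly price of the VM (placeholder values below)
    """
    total_compute_hours = mean_seconds_per_sample * num_samples / 3600.0
    return total_compute_hours * price_per_hour


if __name__ == "__main__":
    # 600 clips of 30 seconds each add up to the 5 hours used in the benchmark.
    print(audio_hours(600, 30))

    # Placeholder inputs: 1.2 s per sample at a made-up price of 0.50 per hour.
    print(transcription_cost(1.2, 600, 0.50))
```

Under this assumption, a faster GPU only lowers the total cost if its speedup outweighs its higher hourly price.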

Fig. 5. Cost of running different Whisper models on 5 hours of speech data (600 samples, each 30 seconds) on A100

Similar trends were observed in the cost of batching vs. non-batching inference, regardless of the GPU used. To gain insights into cost and inference speed, Fig. 6 illustrates the tradeoff between the average inference time for a 30-second audio sample and the cost of running it on a virtual machine in Google Cloud.

Fig. 6. Inference time for a 30s audio for Whisper Models vs Cost (per hour) of GPUs

Fig. 6 shows that the Tesla T4 GPU has the lowest cost per hour, while its inference speed is only slightly slower than that of the Tesla P100 GPU. The A100 GPU is the fastest, but its price is significantly higher than that of the other two GPUs.
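The tradeoff above can be turned into a simple selection rule: among the devices that meet a latency budget, pick the one with the lowest hourly price. The sketch below illustrates that rule. The `Option` class, the function name, and all numbers are hypothetical placeholders, not the measurements from Fig. 6.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Option:
    device: str
    latency_s: float       # mean inference time for one 30-second sample
    price_per_hour: float  # hourly price of the VM


def cheapest_within_budget(
    options: Sequence[Option], max_latency_s: float
) -> Optional[Option]:
    """Return the cheapest option meeting the latency budget, or None."""
    eligible = [o for o in options if o.latency_s <= max_latency_s]
    return min(eligible, key=lambda o: o.price_per_hour, default=None)


if __name__ == "__main__":
    # Placeholder values chosen only to mirror the ordering described in
    # the text (T4 cheapest, A100 fastest); they are not measurements.
    options = [
        Option("T4", latency_s=1.5, price_per_hour=0.5),
        Option("P100", latency_s=1.3, price_per_hour=1.5),
        Option("A100", latency_s=0.8, price_per_hour=3.0),
    ]
    print(cheapest_within_budget(options, max_latency_s=2.0))
    print(cheapest_within_budget(options, max_latency_s=1.0))
```

With a loose latency budget the cheapest device wins. A stricter budget forces the choice toward the faster, more expensive GPUs.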

Conclusion

Fig. 7. Best choice of GPU/CPU based on Latency and Deployment Cost for online setting.

To sum it up: The T4 GPU emerges as the optimal choice for running any Whisper model (except Whisper large-v2) in both online (batch size = 1) and batch settings. It is more cost-effective than the P100 and A100 GPUs. Although the P100 is faster than the T4 in batch settings, its higher cost makes it a less economical choice.

What’s next? Stay tuned for the second part of this blog post series, where we will look at different optimizations and how they affect latency, deployment cost, and performance metrics.