Blazing Fast Inference with Quantized ONNX Models

Tarun Gudipati
5 min read · Oct 14, 2023


We all would like to squeeze more performance out of our code, right?

In our modern era, filled with complex machine learning algorithms that demand intense computing resources, it’s vital that we squeeze out every last bit of performance.

Traditionally, machine learning models are trained on GPUs, which support massive amounts of parallelized computation.
But when it comes to deploying our trained models for inference, it’s not practical for everything to run on machines that have access to such high-end GPUs.

Practical example

Let’s take an example to better understand the problem we are tackling.
Consider that we are building a machine learning component for Neural Machine Translation.
While you could do this with any machine learning model, let’s take Google’s T5 model.

It’s a text-to-text transformer model trained on a large corpus for multiple tasks.

Google T5 Multi Task Diagram

The sample code for performing this task would look like the snippet below.
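A minimal sketch of such a script, assuming the t5-small checkpoint and the text2text-generation pipeline task (the exact task name and file name are my assumptions, not taken from the story):

# Minimal sketch (assumed file name: t5_torch.py) using the t5-small checkpoint.
from transformers import T5ForConditionalGeneration, T5Tokenizer, pipeline

# Instantiate the tokenizer and the model.
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The pipeline helper wires tokenization, encoding, decoding and detokenization together.
pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer)

print(pipe("Translate English to French: Hi, How are you ?"))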

  1. First, we import all the necessary classes and functions from the transformers package.
  2. Then we instantiate the tokenizer and the model.
  3. We pass the tokenizer and model to the pipeline helper function.
  4. At a high level, the transformer architecture can be broken down into two halves: the encoder and the decoder.
  5. We convert our textual input sentence(s) to their token representations and feed these tokens to the encoder part of the T5 model; the encoder outputs are fed to the decoder, and finally we convert the output tokens back to text using the same tokenizer.
  6. All of the above complexity is nicely wrapped in the pipeline function we used above.

Now, let’s measure the performance of this code

We use Python’s built-in timeit module to measure the time it takes for our code to translate, and we run it 5 times to get an average estimate.

Because we are not using a GPU for inference, the numbers will certainly depend on the raw CPU power of the machine. However, we can still use them as a relative measure.
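A rough sketch of such a timing harness, using the standard timeit module (the pipeline setup mirrors the assumptions from the earlier snippet):

import timeit
from transformers import pipeline

# Build the same t5-small pipeline; passing the checkpoint name lets
# pipeline() construct the tokenizer and model itself.
pipe = pipeline("text2text-generation", model="t5-small")

def perform_inference():
    return pipe('Translate English to French: Hi, How are you ?')

# Run the translation 5 times and report the average wall-clock time per call.
total = timeit.timeit(perform_inference, number=5)
print(f"It took about {total / 5} seconds")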

(venv-meta) tarun@Taruns-MacBook-Pro ML % /Users/tarun/ML/venv-meta/bin/python /Users/tarun/ML/t5_torch.py
It took about 0.4970941249999996 seconds

It takes about 0.5 seconds

Now let’s check the memory usage
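The profiling script, t5_torch_memory_profile.py, might look something like the sketch below, using the memory_profiler package’s @profile decorator (the pipeline setup is the same assumption as before):

from memory_profiler import profile
from transformers import pipeline

# Model loading happens here and accounts for the large baseline memory usage.
pipe = pipeline("text2text-generation", model="t5-small")

@profile
def perform_inference():
    # Only this call is profiled line by line.
    return pipe('Translate English to French: Hi, How are you ?')

if __name__ == "__main__":
    perform_inference()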

Running the script, we find the following result:

(venv-meta) tarun@Taruns-MacBook-Pro ML % python -m memory_profiler t5_torch_memory_profile.py
Filename: t5_torch_memory_profile.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    10    847.9 MiB    847.9 MiB           1   @profile
    11                                         def perform_inference():
    12    859.2 MiB     11.3 MiB           1       return pipe('Translate English to French: Hi, How are you ?')

We can see that the inference call adds about 11 MB of memory on top of the model itself.

While these numbers are not bad on a beefy computer, they can be much more expensive on a low-powered mobile device. What if there were a way to deploy these models with much lower compute requirements and faster inference, while mostly retaining their accuracy?

Meet ONNX

ONNX stands for Open Neural Network Exchange. It’s an open-source format, originally developed by Microsoft and Facebook, that helps accelerate machine learning inference across frameworks and languages.

The best part is that ONNX is cross-platform and cross-language as well.
What this means is that you can export your machine learning models to the ONNX format and use them from any language you like.

Apart from this, ONNX Runtime also comes with many built-in optimizations that leverage modern hardware-level features to speed up inference.

Let’s start by exporting our model to its ONNX representation.
The official way of exporting transformers models to their ONNX Runtime equivalents is to use the optimum library.
Run the following command to perform the export:

(venv-meta) tarun@Taruns-MacBook-Pro ML % optimum-cli export onnx \
--model t5-small \
--optimize O3 \
t5_onnx_small

In the above command we use optimum-cli, which is a command-line wrapper over the optimum library.

We specify the model we want to export, the level of optimization that ONNX Runtime should apply, and the output directory.

But we are not done yet, we can do better!

Quantization

What if I say that I am 85.90123456789% sure about a certain outcome versus 85.88% sure about the same outcome, would the decimal precision here make a difference?

Probably not, and this is what quantization does to machine learning models.
When we train these models, we typically train them with higher precision, for example floating-point numbers that use up to 64 bits.

However, in practice we can lower this precision to 32 bits, 16 bits, or even 8 bits, depending on our needs.

Quantization is the process of reducing the precision of the weights, biases, and activations so that they consume less memory and run much faster!
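As a toy illustration (not what ONNX Runtime does internally, but the same basic idea), here is symmetric 8-bit quantization of a small float32 array with NumPy:

import numpy as np

weights = np.random.randn(5).astype(np.float32)

# Map the float range onto the signed 8-bit range [-127, 127].
scale = np.abs(weights).max() / 127.0
quantized = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# Dequantize to see how little precision is actually lost.
dequantized = quantized.astype(np.float32) * scale
print(weights)
print(dequantized)  # very close to the original values, at a quarter of the storage

Each value now takes 1 byte instead of 4, and integer arithmetic is typically much cheaper on CPUs.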

To quantize our newly exported model, let’s run optimum-cli again:

(venv-meta) tarun@Taruns-MacBook-Pro ML % optimum-cli onnxruntime \
quantize \
--onnx_model t5_onnx_small \
--arm64 \
--output t5_onnx_small_quantized

...

Saving quantized model at: t5_onnx_small_quantized (external data format: False)
Configuration saved in t5_onnx_small_quantized/ort_config.json
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

Now we are ready with our quantized model.
Let’s check the time consumption again.

Our logic remains the same for the most part; the only change is that instead of loading the Hugging Face transformers T5 model backed by PyTorch, we load the ONNX equivalent from optimum.onnxruntime.
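A sketch of what t5_onnx.py might look like; ORTModelForSeq2SeqLM is optimum’s ONNX Runtime wrapper for seq2seq models, and depending on your optimum version you may need to point it at the quantized file names explicitly:

from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSeq2SeqLM

# Load the quantized export produced by optimum-cli above.
# (The tokenizer is loaded from the original t5-small checkpoint for simplicity.)
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = ORTModelForSeq2SeqLM.from_pretrained("t5_onnx_small_quantized")

pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer)
print(pipe('Translate English to French: Hi, How are you ?'))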

(venv-meta) tarun@Taruns-MacBook-Pro ML % /Users/tarun/ML/venv-meta/bin/python /Users/tarun/ML/t5_onnx.py
It took about 0.09925400000000018 seconds

That’s a 5x performance improvement for very little code change.

Similarly, let’s check how we fare in terms of memory.

Again, the code looks mostly the same, other than the model initialization part.

(venv-meta) tarun@Taruns-MacBook-Pro ML % python -m memory_profiler t5_onnx_quantized_memory_profile.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    10    933.9 MiB    933.9 MiB           1   @profile
    11                                         def perform_inference():
    12    938.7 MiB      4.9 MiB           1       return pipe('Translate English to French: Hi, How are you ?')

We can clearly see that we end up consuming much less memory.
In fact, the inference call uses roughly 2.3x less memory than before (4.9 MiB versus 11.3 MiB).
As a wise programmer once said


“Optimization is the name of the game, when it comes to scaling” 😎

All the code samples for this story can be found on my GitHub page.

Thought I’d share something I found very interesting. That’s it for this story, and thanks for making it this far 😄

Connect with me on LinkedIn and X for more such interesting reads.
