Microsoft Hummingbird Library – Converts your Traditional ML Models to Deep Learning Tensors

Introduction

Deep learning has advanced rapidly, with computationally heavy neural network models trained on powerful hardware such as GPUs to accelerate the process. Deep learning frameworks like TensorFlow, PyTorch, and Keras are built to harness GPU computation, whereas traditional machine learning models built with Scikit-Learn generally cannot leverage GPUs for faster processing. Microsoft recently released a one-of-a-kind open-source library called Hummingbird that tries to address this gap.

What is Microsoft Hummingbird?

Microsoft Hummingbird is an open-source library that can be used for converting already trained traditional ML Models (that are not neural networks) into tensor-based computational models.

Tensors, which are multi-dimensional arrays, are widely used to build neural network models in deep learning frameworks like PyTorch, TensorFlow, and Keras because operations on them can be computed very quickly. Microsoft Hummingbird aims to use tensor operations to speed up inference for pre-trained traditional ML models.
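To make the idea of expressing a tree model as tensor operations more concrete, the sketch below evaluates a tiny hand-built decision tree using only matrix multiplications and comparisons in NumPy. It is a simplified illustration loosely modeled on a matrix-multiplication formulation of tree inference, not Hummingbird's actual implementation. The tree, its thresholds, and all names (A, B, C, D, E, predict_tensor, predict_branching) are made up for this example.

```python
import numpy as np

# Hypothetical tree with 2 features, 2 internal nodes and 3 leaves:
#   node0: x[0] < 0.5 ? go to node1 : go to leaf2
#   node1: x[1] < 0.3 ? go to leaf0 : go to leaf1
# Leaf classes: leaf0 -> 0, leaf1 -> 1, leaf2 -> 0

# A[f, n] = 1 if internal node n tests feature f.
A = np.array([[1, 0],
              [0, 1]], dtype=np.float32)
# B[n] = threshold of internal node n.
B = np.array([0.5, 0.3], dtype=np.float32)
# C[n, l] = 1 if leaf l lies in the left subtree of node n,
#          -1 if it lies in the right subtree, 0 if it is unrelated.
C = np.array([[1, 1, -1],
              [1, -1, 0]], dtype=np.float32)
# D[l] = number of ancestors of leaf l whose left subtree contains it.
D = np.array([2, 1, 0], dtype=np.float32)
# E[l, c] = 1 if leaf l predicts class c.
E = np.array([[1, 0],
              [0, 1],
              [1, 0]], dtype=np.float32)


def predict_tensor(X):
    """Evaluate the tree for a whole batch using only tensor operations."""
    went_left = ((X @ A) < B).astype(np.float32)         # all node tests at once
    reached = ((went_left @ C) == D).astype(np.float32)  # one-hot leaf per row
    return (reached @ E).argmax(axis=1)                  # leaf -> class


def predict_branching(x):
    """Reference implementation with ordinary if/else branching."""
    if x[0] < 0.5:
        return 0 if x[1] < 0.3 else 1
    return 0


X_demo = np.array([[0.2, 0.1], [0.2, 0.9], [0.8, 0.1], [0.8, 0.9]],
                  dtype=np.float32)
print(predict_tensor(X_demo))
print([predict_branching(x) for x in X_demo])
```

Both print statements should report the same classes. The point of the tensor version is that every row of the batch and every node of the tree is processed by the same few dense operations, which is the kind of workload GPUs handle well.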

Capabilities and Features of Hummingbird

Hummingbird comes with some unique features. Let’s look at them and understand how we can benefit from this innovation from Microsoft.

  1. Optimizing the model through neural network frameworks.
  2. Accelerating model inference with advanced hardware such as GPUs.
  3. Providing a single platform that supports both traditional and neural network models.
  4. Making it easier to re-scale models without having to re-engineer them.

Traditional and Neural Network Models supported by Hummingbird

Since Hummingbird is at an early stage, it currently supports only PyTorch as the backend, converting our traditional models into PyTorch-based models. We can expect other deep learning frameworks to be added in the near future.

Currently, we can use this open-source library to convert only tree-based classifiers and regressors, namely:

  • Decision Trees
  • Extra Trees
  • Gradient Boosting
  • HistGradient Boosting
  • Random Forest
  • LightGBM
  • XGBoost

The developers of Hummingbird are also looking to add linear models such as Linear Regression and Logistic Regression. Along with this, Feature Selectors, Matrix Decomposition Methods, Feature Pre-Processing, and Text Featurizers are also planned to be added to this library.

Syntax of Hummingbird library

This library doesn’t have many functions that we need to interact with. We only have to deal with the convert function, which is found in the hummingbird.ml.convert module.

Now let’s look at the convert function.

def convert(model,backend,test_input=None, extra_config={})

Through this function, an input traditional ML model can be converted to a tensor model. Currently, the convert function works with Sklearn, LightGBM, and XGBoost models.

Arguments

  • model: The trained input model that has to be converted.
  • backend: The name of the target backend for the output model (currently 'pytorch').
  • test_input: Sample input data, mostly used when the model execution is traced.
  • extra_config: Extra configurations used by individual operator converters, for example the number of features or the tree implementation to use.

Hands-on Example of Hummingbird Library

Let us now analyze the performance of the Hummingbird library. For this, we will build a Random Forest Classifier (a traditional ML model) using the sklearn library, perform a binary classification with it, and review its time and memory consumption. We’ll then convert this model to a PyTorch-based model (i.e. one that runs on a DNN framework) and analyze its time and memory usage as well. Finally, we’ll compare the results.

Installing Microsoft hummingbird library

In [1]:
!pip install hummingbird-ml
Collecting hummingbird-ml
  Downloading https://files.pythonhosted.org/packages/c6/15/a30aa78d60338c492ca765b4b764c3606a9768aa3fc8cbae7cfb4433a69f/hummingbird_ml-0.0.2-py2.py3-none-any.whl
Requirement already satisfied: Cython in /usr/local/lib/python3.6/dist-packages (from hummingbird-ml) (0.29.19)
Collecting scikit-learn==0.21.3
  Downloading https://files.pythonhosted.org/packages/a0/c5/d2238762d780dde84a20b8c761f563fe882b88c5a5fb03c056547c442a19/scikit_learn-0.21.3-cp36-cp36m-manylinux1_x86_64.whl (6.7MB)
     |████████████████████████████████| 6.7MB 12.3MB/s 
Requirement already satisfied: numpy>=1.15 in /usr/local/lib/python3.6/dist-packages (from hummingbird-ml) (1.18.5)
Requirement already satisfied: lightgbm>=2.2 in /usr/local/lib/python3.6/dist-packages (from hummingbird-ml) (2.2.3)
Requirement already satisfied: xgboost==0.90 in /usr/local/lib/python3.6/dist-packages (from hummingbird-ml) (0.90)
Collecting onnxconverter-common>=1.6.0
  Downloading https://files.pythonhosted.org/packages/fe/7a/7e30c643cd7d2ad87689188ef34ce93e657bd14da3605f87bcdbc19cd5b1/onnxconverter_common-1.7.0-py2.py3-none-any.whl (64kB)
     |████████████████████████████████| 71kB 11.3MB/s 
Requirement already satisfied: torch>=1.4.0 in /usr/local/lib/python3.6/dist-packages (from hummingbird-ml) (1.5.0+cu101)
Requirement already satisfied: scipy>=0.17.0 in /usr/local/lib/python3.6/dist-packages (from scikit-learn==0.21.3->hummingbird-ml) (1.4.1)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.6/dist-packages (from scikit-learn==0.21.3->hummingbird-ml) (0.15.1)
Collecting onnx
  Downloading https://files.pythonhosted.org/packages/36/ee/bc7bc88fc8449266add978627e90c363069211584b937fd867b0ccc59f09/onnx-1.7.0-cp36-cp36m-manylinux1_x86_64.whl (7.4MB)
     |████████████████████████████████| 7.4MB 31.6MB/s 
Requirement already satisfied: protobuf in /usr/local/lib/python3.6/dist-packages (from onnxconverter-common>=1.6.0->hummingbird-ml) (3.10.0)
Requirement already satisfied: future in /usr/local/lib/python3.6/dist-packages (from torch>=1.4.0->hummingbird-ml) (0.16.0)
Requirement already satisfied: typing-extensions>=3.6.2.1 in /usr/local/lib/python3.6/dist-packages (from onnx->onnxconverter-common>=1.6.0->hummingbird-ml) (3.6.6)
Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from onnx->onnxconverter-common>=1.6.0->hummingbird-ml) (1.12.0)
Requirement already satisfied: setuptools in /usr/local/lib/python3.6/dist-packages (from protobuf->onnxconverter-common>=1.6.0->hummingbird-ml) (47.1.1)
Installing collected packages: scikit-learn, onnx, onnxconverter-common, hummingbird-ml
  Found existing installation: scikit-learn 0.22.2.post1
    Uninstalling scikit-learn-0.22.2.post1:
      Successfully uninstalled scikit-learn-0.22.2.post1
Successfully installed hummingbird-ml-0.0.2 onnx-1.7.0 onnxconverter-common-1.7.0 scikit-learn-0.21.3

Importing the libraries

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import time
import warnings
from hummingbird.ml import convert
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
In [3]:
%matplotlib inline 
warnings.filterwarnings("ignore")

Function for calculating Runtime

Since we want to review the performance of the models by analyzing their time and memory consumption, we have built a decorator function that measures the runtime of any function it wraps.

In [4]:
def timeit(method):
    def timed(*args,**kw):
        ts = time.time()
        result = method(*args,**kw)
        te = time.time()
        if 'log_time' in kw:
            name = kw.get('log_name',method.__name__.upper())
            kw['log_time'][name] = int((te - ts) * 1000)
        else:
            print('%r %2.2f ms' % \
                  (method.__name__,(te - ts) * 1000))
        return result
    return timed
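The decorator has two modes that are easy to miss when reading it: it prints the elapsed time by default, but stores it in a dictionary when a log_time keyword argument is passed. The following sketch shows both modes. The decorator is reproduced so that the snippet runs on its own, and slow_sum, timings, and the 'demo' label are hypothetical names used only for illustration.

```python
import time


def timeit(method):
    """Same decorator as above, reproduced so this snippet runs on its own."""
    def timed(*args, **kw):
        ts = time.time()
        result = method(*args, **kw)
        te = time.time()
        if 'log_time' in kw:
            name = kw.get('log_name', method.__name__.upper())
            kw['log_time'][name] = int((te - ts) * 1000)
        else:
            print('%r %2.2f ms' % (method.__name__, (te - ts) * 1000))
        return result
    return timed


@timeit
def slow_sum(n, **kwargs):
    # **kwargs absorbs log_time / log_name, which the decorator forwards
    # to the wrapped function along with the other keyword arguments.
    time.sleep(0.05)
    return sum(range(n))


# Without log_time, the elapsed time is printed.
total = slow_sum(1000)

# With log_time, the elapsed time (in whole milliseconds) is stored in the dict.
timings = {}
total_logged = slow_sum(1000, log_time=timings, log_name='demo')
print(timings)
```

Note that the decorator passes log_time and log_name on to the wrapped function, so a function timed in this mode has to accept them, for instance through **kwargs as done here.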

Specifying the parameters for the Model

Here we are specifying the variables that will be used for building and training of the models.

In [5]:
num_classes = 2
num_of_samples = 10000000
num_of_features = 50

Generation of Dataset

With the help of numpy, we generate a random dataset using the parameters specified above.

In [6]:
X = np.array(np.random.rand(num_of_samples,num_of_features),dtype=np.float32)
y = np.random.randint(num_classes,size=num_of_samples)
In [7]:
X[0:2]
Out[7]:
array([[0.9762268 , 0.29319492, 0.85840464, 0.6357587 , 0.24715969,
        0.13317758, 0.46990094, 0.3366895 , 0.3003061 , 0.8288483 ,
        0.5962162 , 0.22993757, 0.35542256, 0.6761605 , 0.91740674,
        0.04850541, 0.7587646 , 0.11225111, 0.42340794, 0.7942645 ,
        0.7284668 , 0.06991154, 0.33893657, 0.5700308 , 0.46123004,
        0.9752665 , 0.06585807, 0.08365367, 0.779082  , 0.5351648 ,
        0.13717858, 0.68279546, 0.73082423, 0.32866567, 0.3001047 ,
        0.24776706, 0.4709378 , 0.8451081 , 0.00346786, 0.43015707,
        0.9275357 , 0.11031681, 0.8299589 , 0.6603755 , 0.21291417,
        0.16095254, 0.01639881, 0.9435599 , 0.43779802, 0.27848265],
       [0.11475412, 0.5034502 , 0.32898515, 0.8293197 , 0.6025453 ,
        0.8268043 , 0.711728  , 0.86084515, 0.6701547 , 0.24660422,
        0.7951682 , 0.56461084, 0.65788424, 0.03880611, 0.9256486 ,
        0.905324  , 0.03170864, 0.39293587, 0.41250595, 0.07979824,
        0.24069901, 0.41136283, 0.11159769, 0.26164684, 0.6407254 ,
        0.32729763, 0.05043138, 0.6675321 , 0.9197083 , 0.07347824,
        0.67132455, 0.5740934 , 0.7489321 , 0.42839432, 0.4001747 ,
        0.25427487, 0.0868222 , 0.1624688 , 0.3432402 , 0.01040289,
        0.15307996, 0.3117931 , 0.5804445 , 0.4709584 , 0.18058434,
        0.7919111 , 0.32333922, 0.23778768, 0.25149363, 0.2220051 ]],
      dtype=float32)
In [8]:
y[0:2]
Out[8]:
array([1, 0])
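A dataset of this size is large enough that its memory footprint matters for the measurements later on. The sketch below works out the expected array sizes from the parameters above. It also shows one possible way to draw float32 values directly with NumPy's Generator API, which avoids the temporary float64 array that np.random.rand creates before the cast. The small demo size and the names rng, X_small, and y_small are chosen only for illustration.

```python
import numpy as np

num_of_samples = 10_000_000
num_of_features = 50

# Expected footprint of the full-size arrays, computed without allocating them.
x_float32_bytes = num_of_samples * num_of_features * np.dtype(np.float32).itemsize
x_float64_bytes = num_of_samples * num_of_features * np.dtype(np.float64).itemsize
print('X as float32: %.1f GB' % (x_float32_bytes / 1e9))
print('float64 temporary from np.random.rand: %.1f GB' % (x_float64_bytes / 1e9))

# Small demo (1,000 rows instead of 10 million) of drawing float32 directly.
rng = np.random.default_rng(seed=0)
X_small = rng.random((1_000, num_of_features), dtype=np.float32)
y_small = rng.integers(0, 2, size=1_000)
print(X_small.dtype, X_small.nbytes, y_small.shape)
```

With these parameters X alone occupies about 2 GB as float32, and the intermediate float64 array is about 4 GB, which helps put the multi-gigabyte RAM figures reported later into context.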

Splitting the dataset into training and testing subsets

We will be using 25% of the dataset for testing purposes.

In [9]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=42)

Building a Random Forest Classifier Model using Sklearn library

In [10]:
sk_model = RandomForestClassifier(n_estimators=10,max_depth=9)
sk_model.fit(X_train,y_train)
Out[10]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=9, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

Function for loading the model to a GPU

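In this custom-built function, we move the model to a GPU when is_humming is 1 (that is, for the converted Hummingbird model) and then run the prediction. Because the function is decorated with @timeit, the reported time includes the transfer of the model to the GPU.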

In [11]:
@timeit
def output_prediction(inp_model,test_x,test_y,is_humming):
    if is_humming == 1:
        inp_model.to('cuda')
    return inp_model.predict(test_x)

Converting sklearn model to Pytorch model

In [12]:
model = convert(sk_model,'pytorch')

Performing prediction through Sklearn Model

We can see that the time taken to produce the result using the Random Forest Classifier built with the sklearn library is 1965.80 ms.

In [13]:
y_pred_sklearn = output_prediction(sk_model,X_test,y_test,0)
'output_prediction' 1965.80 ms

Prediction with PyTorch Model built using Hummingbird Library

When we look at the PyTorch model built using the Hummingbird library, the time taken is significantly less: 326.11 ms to be precise.

In [14]:
y_pred_humming = output_prediction(model,X_test,y_test,1)
'output_prediction' 326.11 ms
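A speedup is only useful if the converted model returns the same predictions as the original one. The notebook above does not include this check, so the following is a suggested extra step rather than part of the original run: a small helper (prediction_agreement is a hypothetical name) that reports the fraction of matching labels. The two sample arrays are made-up stand-ins for y_pred_sklearn and y_pred_humming.

```python
import numpy as np


def prediction_agreement(pred_a, pred_b):
    """Return the fraction of positions where two label arrays are equal."""
    pred_a = np.asarray(pred_a)
    pred_b = np.asarray(pred_b)
    if pred_a.shape != pred_b.shape:
        raise ValueError('prediction arrays must have the same shape')
    return float(np.mean(pred_a == pred_b))


# Made-up stand-ins for y_pred_sklearn and y_pred_humming.
sample_sklearn = np.array([1, 0, 0, 1, 1])
sample_humming = np.array([1, 0, 0, 1, 0])
print(prediction_agreement(sample_sklearn, sample_humming))
```

In the notebook this would be called as prediction_agreement(y_pred_sklearn, y_pred_humming), where a value of 1.0 means that both models produced identical labels for every test sample.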

Reviewing the Results

Memory Usage

As we can see, the Random Forest Classifier used 5.90 GB of RAM (on CPU).

Random Forest Memory Usage

When we look at the memory usage of the PyTorch-based model, it has used 6.15 GB of RAM and 1.91 GB of GPU memory.

PyTorch Model from Microsoft Hummingbird - Memory Usage

Runtime

We can see that sklearn’s Random Forest Classifier completes its execution in 1965.80 ms, whereas the PyTorch-based model built through the Hummingbird library takes only 326.11 ms.
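The two timings can be turned into a speedup factor and a throughput figure with a little arithmetic. The snippet below uses only the numbers reported in this article; the variable names are illustrative. Keep in mind that each figure comes from a single run, and that the Hummingbird timing also includes moving the model to the GPU, because that step happens inside the timed function.

```python
# Timings reported above, in milliseconds.
sklearn_ms = 1965.80
hummingbird_ms = 326.11

# Test set size: 25% of the 10,000,000 generated samples.
num_test_samples = int(10_000_000 * 0.25)

speedup = sklearn_ms / hummingbird_ms
sklearn_throughput = num_test_samples / (sklearn_ms / 1000)
hummingbird_throughput = num_test_samples / (hummingbird_ms / 1000)

print('speedup: %.2fx' % speedup)
print('sklearn: %.2f million predictions/s' % (sklearn_throughput / 1e6))
print('hummingbird: %.2f million predictions/s' % (hummingbird_throughput / 1e6))
```

This works out to a speedup of roughly 6x for this particular model, dataset, and hardware.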

Random Forest Runtime

Runtime of PyTorch Model from Microsoft Hummingbird

Conclusion

We hope this article gave you a good insight into Hummingbird, the new open-source library released by Microsoft. We looked at the features of this library and its main functionality, and covered an example where we converted a traditional ML model into a PyTorch model using the Hummingbird library and compared their performance.

You can find more details and the latest update on the official GitHub page of Microsoft Hummingbird Library.
