With the growing trend towards deep learning techniques in AI, there are many investments in accelerating neural network models using GPUs and other specialized hardware. However, many models used in production are still based on traditional machine learning libraries or sometimes a combination of traditional machine learning (ML) and DNNs. We’ve previously shared the performance gains that ONNX Runtime provides for popular DNN models such as BERT, quantized GPT-2, and other Hugging Face Transformer models. Now, by utilizing Hummingbird with ONNX Runtime, you can also capture the benefits of GPU acceleration for traditional ML models.
This capability is enabled through the recently added integration of Hummingbird with the LightGBM converter in ONNXMLTools, an open source library that can convert models to the interoperable ONNX format. LightGBM is a gradient boosting framework that uses tree-based learning algorithms, designed for fast training speed and low memory usage. By simply setting a flag, you can feed a LightGBM model to the converter to produce an ONNX model that uses neural network operators rather than traditional ML. This Hummingbird integration allows users of LightGBM to take advantage of the GPU accelerations typically only available for neural networks.
Hummingbird is a library for converting traditional ML operators to tensors, with the goal of accelerating inference (scoring/prediction) for traditional machine learning models. You can learn more about Hummingbird in our introductory blog post, but we’ll present a short summary here.
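For context, Hummingbird can also be used on its own, independent of the ONNXMLTools integration described below. Below is a minimal sketch of that direct usage, assuming a scikit-learn style model already trained as clf and a NumPy feature array X (both placeholders, not from the walkthrough that follows):

# Hedged sketch of Hummingbird's direct API; clf and X are illustrative placeholders.
from hummingbird.ml import convert

hb_model = convert(clf, "pytorch")   # compile the trees into tensor operations
hb_model.to("cuda")                  # move the compiled model to GPU, if one is available
preds = hb_model.predict(X)          # inference now runs as tensor computations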
The integration of Hummingbird with ONNXMLTools allows users to take advantage of the flexibility and performance benefits of ONNX Runtime. ONNX Runtime provides a consistent API across platforms and architectures with APIs in Python, C++, C#, Java, and more. This allows models trained in Python to be used in a variety of production environments. ONNX Runtime also provides an abstraction layer for hardware accelerators, such as Nvidia CUDA and TensorRT, Intel OpenVINO, Windows DirectML, and others. This gives users the flexibility to deploy on their hardware of choice with minimal changes to the runtime integration and no changes in the converted model.
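To illustrate that abstraction layer, the execution provider can typically be selected when the inference session is created. A hedged sketch (the model path is a placeholder, and which providers are actually available depends on the onnxruntime package you have installed):

import onnxruntime as ort

# Prefer CUDA when available and fall back to CPU; these are ONNX Runtime's standard provider names.
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)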
While ONNX Runtime does natively support both DNNs and traditional ML models, the Hummingbird integration provides performance improvements by using the neural network form of LightGBM models for inferencing. This may be particularly useful for those already utilizing GPUs for the acceleration of other DNNs. Let’s take a look at this in action.
import numpy as np
import lightgbm as lgb
import timeit
import onnxruntime as ort
from onnxmltools.convert import convert_lightgbm
from onnxconverter_common.data_types import FloatTensorType
max_depth = 8
num_classes = 2
n_estimators = 1000
n_features = 30
n_fit = 1000
n_pred = 10000
X = np.random.rand(n_fit, n_features)
X = np.array(X, dtype=np.float32)
y = np.random.randint(num_classes, size=n_fit)
test_data = np.random.rand(n_pred, n_features).astype('float32')
model = lgb.LGBMClassifier(n_estimators=n_estimators, max_depth=max_depth, pred_early_stop=False)
model.fit(X, y)
input_types = [("input", FloatTensorType([n_pred, n_features))] # Define the inputs for the ONNX
onnx_ml_model = convert_lightgbm(model, initial_types=input_types)
lgbm_time = timeit.timeit("model.predict_proba(test_data)", number=7,
                          setup="from __main__ import model, test_data")
print("LightGBM (CPU): {}".format(lgbm_time))
sessionml = ort.InferenceSession(onnx_ml_model.SerializeToString())
onnxml_time = timeit.timeit("sessionml.run([sessionml.get_outputs()[1].name], "
                            "{sessionml.get_inputs()[0].name: test_data})",
                            number=7, setup="from __main__ import sessionml, test_data")
print("LGBM->ONNXML (CPU): {}".format(onnxml_time))
The result is the following:
LightGBM (CPU): 1.1157575770048425
LGBM->ONNXML (CPU): 1.0180995319969952
Not bad! Now let’s see Hummingbird in action. The only change to the conversion code above is the addition of without_onnx_ml=True:
input_types = [("input", FloatTensorType([n_pred, n_features))] # Define the inputs for the ONNX
onnx_model = convert_lightgbm(model, initial_types=input_types, without_onnx_ml=True)
We can now pip install onnxruntime-gpu and run the prediction over the onnx_model:
sess_options = ort.SessionOptions()
session = ort.InferenceSession(onnx_model.SerializeToString(), sess_options)
onnx_time = timeit.timeit("session.run([session.get_outputs()[1].name], "
                          "{session.get_inputs()[0].name: test_data})",
                          number=7, setup="from __main__ import session, test_data")
print("LGBM->ONNX (GPU): {}".format(onnx_time))
And we get:
LGBM->ONNXML->ONNX (GPU): 0.2364534509833902
There is an approximately 5x improvement over the CPU implementation. Additionally, the ONNX model can take advantage of any further optimizations in future releases of ONNX Runtime (ORT), and it can run on any hardware accelerator that ONNX Runtime supports.
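As a quick sanity check, you can confirm which execution provider the session actually resolved to. A hedged sketch, reusing onnx_model from the walkthrough above; note that, depending on the onnxruntime-gpu version installed, you may need to list the providers explicitly when creating the session:

# Sketch only: newer onnxruntime-gpu builds expect the provider list to be passed explicitly.
session = ort.InferenceSession(
    onnx_model.SerializeToString(),
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())  # e.g. ['CUDAExecutionProvider', 'CPUExecutionProvider']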
Hummingbird currently supports converters for ONNX, scikit-learn, XGBoost, and LightGBM. In the future, we plan to bring similar integrations to other converters in the ONNXMLTools family, such as those for XGBoost and scikit-learn. If there are additional operators or integrations you would like to see, please file an issue. We would love to hear how Hummingbird can help speed up your workloads, and we look forward to adding more features!