This week's OctoML platform release contains some exciting new features!
Automated Batch Size Exploration

This feature identifies the optimal batch size to maximize throughput and reduce cost per inference while still meeting your latency constraints. To access it, select a Model located in a Project and click on the Settings tab. There you will be able to click on the “Enable batching” button.
Continue by indicating which dimension of each input corresponds to batch size. For example, the sample BERT Transformer model downloaded from HuggingFace in our tutorial has three inputs: input_ids, attention_mask, and token_type_ids. All three inputs have the shape [batch_size, maximum_sequence_length], so the batch size dimension for each input is 0. After you’ve enabled batching for the model, click on the Package button at the top right.
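To make the batch dimension concrete, here is a small sketch of what those three BERT inputs look like as tensors. The batch size and sequence length values are illustrative, not taken from the tutorial:

```python
import numpy as np

# Hypothetical values for illustration: a batch of 8 sequences,
# each padded to a maximum length of 128 tokens.
batch_size = 8
maximum_sequence_length = 128

# All three BERT inputs share the shape [batch_size, maximum_sequence_length].
inputs = {
    "input_ids": np.zeros((batch_size, maximum_sequence_length), dtype=np.int64),
    "attention_mask": np.ones((batch_size, maximum_sequence_length), dtype=np.int64),
    "token_type_ids": np.zeros((batch_size, maximum_sequence_length), dtype=np.int64),
}

# Dimension 0 of every input is the batch dimension, so 0 is the value
# you would enter in the platform for each input.
for name, tensor in inputs.items():
    print(name, tensor.shape, "-> batch dimension is 0, size", tensor.shape[0])
```

In general, the batch dimension is whichever axis grows as you add more samples to a single inference call; for most HuggingFace Transformer exports that is axis 0.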
An “Explore batch size” section is available in the middle of the page. Here, you can indicate the different batch sizes you’d like to explore (typically, batch sizes come in powers of two). If exploring multiple batch sizes is not of interest, simply provide a single value. Once you’re done filling out all the required fields on this page (hardware and dynamic input shapes), click Package and the platform will kick off the batch size exploration for you.
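As a quick sketch of the “powers of two” convention, here is how you might generate the candidate batch sizes to enter in that field (the maximum of 16 is an arbitrary choice for illustration):

```python
# Generate power-of-two batch size candidates up to a chosen maximum.
max_batch_size = 16  # hypothetical upper bound for this example

candidate_batch_sizes = []
size = 1
while size <= max_batch_size:
    candidate_batch_sizes.append(size)
    size *= 2

print(candidate_batch_sizes)  # [1, 2, 4, 8, 16]
```

Exploring a geometric range like this covers a wide span of throughput/latency trade-offs with only a handful of packaging runs.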
Once the workflows are complete, you can view the best performance achieved for each hardware target in the Packages tab. Now, click on the “My Project” gray breadcrumb to navigate back to the project overview.
Select models in the left panel to see the best performance achieved after exploration. In this example, notice how the “transformers.onnx” model outperforms the other model “test_hf_2.onnx” in terms of both throughput and latency: throughput is higher at 120 RPS, latency is lower at 120.66 ms, and the optimal batch size among those explored is 4.
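The selection criterion described above (maximize throughput while meeting a latency constraint) can be sketched as a few lines of Python. The exploration results and latency budget below are made-up numbers for illustration, not platform output:

```python
# Hypothetical exploration results: batch size -> (throughput in RPS, latency in ms).
results = {
    1: (60.0, 16.0),
    2: (90.0, 22.0),
    4: (120.0, 33.0),
    8: (130.0, 61.0),
}

latency_budget_ms = 50.0  # assumed latency constraint for this example

# Keep only the batch sizes that meet the latency constraint, then pick
# the one with the highest throughput.
feasible = {b: (tp, lat) for b, (tp, lat) in results.items() if lat <= latency_budget_ms}
best_batch = max(feasible, key=lambda b: feasible[b][0])
print(best_batch)  # 4
```

Note how batch size 8 has the highest raw throughput but is excluded for violating the latency budget, which is why the smaller batch size wins.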
Python Wheels for GPU
Python wheel packages are now available for GPUs. Select a Model located in a Project and click on the Packages tab. Simply click the Download button, and you’ll see both the wheel and the existing Docker tarball package options for each hardware target explored.
Model coverage and performance have been improved with the latest version upgrades.
All models built with PyTorch 1.12 or earlier and TensorFlow/TFLite 2.8.0 or earlier that can be converted to ONNX-RT should now be covered.
We have also upgraded our versions of ONNX-RT to 1.10 and CUDA to 11.6, along with the corresponding cuDNN and TensorRT versions.
Azure Ampere ARM64 CPUs
Azure Ampere Altra ARM64 CPUs are now available (Dpsv5 and Epsv5 instances) in the OctoML platform. These devices can deliver up to 50% better cost per inference than comparable x86 VMs for scale-out workloads.
We are suspending support for 32-bit Raspberry Pi, Jetson, and AMD V2000 edge devices until 2023. If you need access to these devices, please contact your Customer Success representative.