We've just released new improvements!
OctoML regularly improves TVM performance on FP32 and INT8 models. This week, we’re releasing a 10%-15% performance improvement on FP32 MobileNet and similar model architectures.
Further Large Model Support
We are releasing improvements to TVM’s memory planning that reduce peak memory usage on large models (e.g. BERT) by up to 10x. As a result, OctoML will be able to support TVM autotuning, benchmarking, and packaging for models up to 128 GB.
Versioning and Dependencies
OctoML regularly upgrades dependency libraries to get you best-in-class performance on both TVM and ONNX. We have upgraded to CUDA 11.1 and TensorRT 7.2 (previously CUDA 10.2 and TensorRT 7.0) on all hardware targets except the Jetson family. We will upgrade the Jetson family once JetPack enables support for CUDA 11 at the end of Q1. We expect to upgrade further to CUDA 11.6 and TensorRT 8.2 in the coming weeks.
We have also upgraded ONNX from 1.8.0 to 1.9.0. ONNX-RT remains at 1.8.0 alongside ONNX 1.9.0.
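If you want to reproduce results locally, it can help to confirm that your own environment matches the versions listed above. The following is a minimal sketch, not part of the OctoML platform: it assumes the libraries are installed under their usual PyPI names (`onnx` and `onnxruntime`), and the helper names `installed_version` and `check_versions` are hypothetical, introduced only for illustration.

```python
from importlib import metadata

# Versions named in these release notes. The package names are assumed
# to be the usual PyPI distribution names.
EXPECTED = {"onnx": "1.9.0", "onnxruntime": "1.8.0"}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_versions(expected=EXPECTED):
    """Map each package name to a tuple (installed, expected, matches)."""
    report = {}
    for name, want in expected.items():
        have = installed_version(name)
        report[name] = (have, want, have == want)
    return report


if __name__ == "__main__":
    for name, (have, want, ok) in check_versions().items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: installed={have} expected={want} [{status}]")
```

A package that is not installed is reported with `installed=None` rather than raising, so the check can run in a partially configured environment.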
We have added ONNX-RT benchmarking for all models and TFLite benchmarking for TFLite models on 32-bit ARM Cortex-A72 CPUs.