We've just released new improvements!
We have upgraded our Compare UX to make it easier for you to compare latencies across different hardware targets. By default, we show you the best possible latency for each hardware target after exploring multiple optimization strategies (TVM, ONNX-RT, TensorRT, and others). If you want to see the results of the next-best optimization strategies we explored, simply toggle the “All packages” button.
We have upgraded the Accelerate UX so that you no longer have to learn about and choose between different engine options (TVM, ONNX-RT, TensorFlow, TFLite, and PyTorch). Simply select the hardware target(s) you’d like to deploy your model to and decide whether you’d like to use Express Mode or Normal Mode. We’ll automate the rest for you. Express Mode returns a deployable package as soon as possible and does not explore all possible optimizations. Normal Mode can take a few hours to complete, but it explores all possible optimizations to give you the best possible latency.
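The new flow boils down to two inputs: your hardware targets and a mode. Here is a minimal sketch of what such a request could look like; the function name, field names, and mode strings are illustrative assumptions, not the actual SDK interface.

```python
# Hypothetical sketch of the new Accelerate flow: pick hardware targets
# and a mode, and the platform handles engine selection. All names here
# are assumptions for illustration, not the real SDK.
def accelerate_request(model_id, hardware_targets, express=False):
    """Build an acceleration request from a model ID, targets, and mode."""
    return {
        "model_id": model_id,
        "hardware_targets": list(hardware_targets),
        # Express Mode returns a deployable package as soon as possible;
        # Normal Mode explores all optimizations for the best latency
        # but can take a few hours.
        "mode": "express" if express else "normal",
    }

request = accelerate_request("my-model", ["c5.4xlarge", "c6g.4xlarge"])
```

Note that engine choice (TVM, ONNX-RT, and so on) no longer appears anywhere in the request: it is decided by the platform.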
Our documentation site has a new look! Check out our tutorials on how to generate deployable packages with minimal latency for your ML models and read our SDK/API documentation for details on each endpoint.
We have published a new version of our Python SDK. Past versions referred to our key result from the ModelVariant.benchmark() command, inference latency, as “runtime_ms_mean” and “runtime_ms_std.” Because the word “runtime” also refers to engines like ONNX Runtime, we have renamed these fields to “latency_mean_ms” and “latency_std_ms” instead.
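If your code reads these fields by name, a small fallback keeps it working across SDK versions. The dict-shaped result below is an assumption for illustration; only the field names come from the release note.

```python
# Read latency stats under the new names, falling back to the pre-rename
# keys for results produced by older SDK versions. The dict shape of the
# benchmark result is an assumption for this sketch.
def get_latency(result):
    """Return (mean_ms, std_ms) from a benchmark result dict."""
    mean = result.get("latency_mean_ms", result.get("runtime_ms_mean"))
    std = result.get("latency_std_ms", result.get("runtime_ms_std"))
    return mean, std

# New-style result:
mean_ms, std_ms = get_latency({"latency_mean_ms": 4.2, "latency_std_ms": 0.3})
```

The same call handles an old-style result containing “runtime_ms_mean” and “runtime_ms_std” without changes.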
New AWS Instances Available
ARM-based Graviton2 AWS instances are now available in the platform! Kick off a model acceleration request to see how these instances perform on your model compared to x86-based Intel and AMD instances. If you’d like to see additional instances in the Graviton family on our platform, please contact your Customer Success representative to submit a feature request.