The latest OctoML platform release contains powerful new features:

Hardware Sweep

Upload your model and let OctoML suggest a broad sweep of hardware targets and compiler choices, performing dozens of benchmarking experiments with only a few clicks.

At the end of this sweep, which takes approximately 30 minutes, you will receive a map of how your model performs on a cost vs. latency basis for each hardware target in the sweep:

Once you find the hardware target that offers the best latency and cost per million inferences, click its dot to view a performance summary and download a hardware-optimized package ready for deployment:

FP16 Optimization for GPUs

When optimizing models for GPU targets, the OctoML Platform now automatically tests downcasting models to FP16. This reduces the computational intensity of models and allows for higher throughput and lower latency on NVIDIA hardware, which is built to handle FP16 operations efficiently. FP16 can reduce cost by as much as 50% and improve customer experiences for your model applications.

You can toggle FP16 on and off in the “Explore UX” below:

FP16 downcasting is generally considered accuracy-preserving, but if your use case is especially sensitive to small deviations in model output, we recommend toggling this feature off.
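To get a feel for the size of these deviations, you can measure the rounding error of an FP16 round trip yourself. The sketch below uses NumPy with a randomly generated tensor standing in for real model weights (the array and its values are illustrative assumptions, not anything produced by the OctoML Platform):

```python
import numpy as np

# Hypothetical weight tensor for illustration -- not real model weights.
rng = np.random.default_rng(0)
weights_fp32 = rng.standard_normal(10_000).astype(np.float32)

# Downcast to FP16, then cast back to FP32 to measure the rounding error
# introduced by the lower-precision representation.
weights_fp16 = weights_fp32.astype(np.float16)
roundtrip = weights_fp16.astype(np.float32)

max_abs_err = float(np.max(np.abs(weights_fp32 - roundtrip)))
print(f"max absolute rounding error: {max_abs_err:.2e}")
```

FP16 carries roughly three decimal digits of precision, so for weights of typical magnitude the per-element error stays on the order of 1e-3 or smaller, which is why downcasting is usually safe but can matter for especially sensitive workloads.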
