I understand the motivation behind strong typing, but the TensorRT 11 FP16 migration raises a few concerns.
1. Loss of hardware-aware precision optimization
In TensorRT 10, enabling FP16 was an optimization opportunity, not a hard requirement. TensorRT could choose FP16 or FP32 implementations based on the target GPU, available tactics, and measured performance.
With TensorRT 11, the recommended workflow is to transform the ONNX graph beforehand (e.g. with ModelOpt AutoCast), making FP32/FP16 decisions before TensorRT sees the model.
This seems to move precision assignment away from the component that actually knows the target hardware and may therefore reduce optimization opportunities that previously existed in TensorRT.
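To make the concern concrete, here is a minimal toy sketch of the difference between choosing precision per layer from timings measured on the target device and fixing precision before the target is known. It is an illustration only, not how TensorRT selects tactics internally: the `LayerTimings` class, the helper functions, and all timing numbers are hypothetical, and the model ignores real-world factors such as reformatting costs between layers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerTimings:
    """Hypothetical measured latency of one layer in each precision."""
    name: str
    fp32_ms: float
    fp16_ms: float


def pick_at_build_time(layers):
    """Per layer, choose whichever precision measured faster on this device."""
    return {l.name: ("fp16" if l.fp16_ms < l.fp32_ms else "fp32") for l in layers}


def plan_latency(layers, assignment):
    """Total latency of the network under a given precision assignment."""
    return sum(
        l.fp16_ms if assignment[l.name] == "fp16" else l.fp32_ms for l in layers
    )


# Made-up timings for the same three layers on two different GPUs.
gpu_a = [
    LayerTimings("conv1", 1.0, 0.5),
    LayerTimings("softmax", 0.25, 0.5),
    LayerTimings("matmul", 2.0, 1.0),
]
gpu_b = [
    LayerTimings("conv1", 1.0, 1.5),
    LayerTimings("softmax", 0.25, 0.25),
    LayerTimings("matmul", 2.0, 1.0),
]

# Precision decided ahead of time in the graph, before the target is known.
fixed = {"conv1": "fp16", "softmax": "fp32", "matmul": "fp16"}

for label, timings in (("gpu_a", gpu_a), ("gpu_b", gpu_b)):
    tuned = pick_at_build_time(timings)
    print(label, "fixed:", plan_latency(timings, fixed),
          "tuned:", plan_latency(timings, tuned))
```

In this toy setup the fixed assignment happens to match the tuned one on `gpu_a` but is slower on `gpu_b`, which is the kind of lost opportunity I am worried about.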
2. Reduced ONNX portability
Previously, a single FP32 ONNX model could be optimized differently depending on the deployment target.
Now, precision policy is embedded into the ONNX graph itself through explicit FP16 types and casts. This feels at odds with the idea of ONNX as a hardware-agnostic model representation, especially when the eventual deployment target may not be known at export time (e.g. the same model may also need to run on a CPU).
3. Significantly increased dependency footprint
In TensorRT 10, mixed-precision optimization was built into TensorRT.
In TensorRT 11, the recommended replacement requires installing ModelOpt, which in turn brings in substantial dependencies, including onnxruntime, onnxruntime-gpu, and torch.
As a result, users who only need inference must now install a full deep learning framework and an additional inference stack just to regain functionality that previously existed inside TensorRT itself.
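For anyone who wants to check the footprint in their own environment, a rough sketch using only the standard library is below. It assumes ModelOpt is installed under the distribution name `nvidia-modelopt` (an assumption on my part; adjust if your install differs). The helper names and the list of "heavy" packages are mine, and the output depends entirely on the installed version, so no expected output is shown.

```python
import re
from importlib import metadata

# Matches the leading distribution name of a requirement string.
_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*")


def declared_requirements(dist_name):
    """Requirement strings a distribution declares, or [] if it is not installed."""
    try:
        return metadata.requires(dist_name) or []
    except metadata.PackageNotFoundError:
        return []


def requirement_name(req):
    """Extract and normalize the package name from one requirement string."""
    m = _NAME.match(req.strip())
    return re.sub(r"[-_.]+", "-", m.group(0)).lower() if m else ""


def heavy_requirements(reqs, heavy=("torch", "onnxruntime", "onnxruntime-gpu")):
    """Names from `reqs` that appear in the (hypothetical) `heavy` list."""
    names = [requirement_name(r) for r in reqs]
    return [n for n in names if n in heavy]


# Assumed distribution name; prints an empty list if it is not installed.
reqs = declared_requirements("nvidia-modelopt")
print(heavy_requirements(reqs))
```

Note that this lists declared requirements, including ones gated behind optional extras (entries carrying an `extra == "..."` marker), so not everything it reports is necessarily pulled in by a default install.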
Is there any plan to provide a lightweight TensorRT-native alternative for mixed-precision graph transformation, or to restore some form of hardware-aware precision optimization during engine building?