This article provides suggested workflows for converting MobileNet SSD models into DeepViewRT models optimized for inference with VisionPack. As a convenience when converting COCO-trained SSD models, we provide a reference DeepViewRT model output along with labels.txt and coco_people samples.
For TensorFlow 1.x pre-trained models, you will need the original SavedModel (.pb) to perform model conversion. The provided TFLite models deliver lower performance because the box decoding portion of the model, when quantized, cannot be embedded into the DeepViewRT model due to some missing parameters. When converting from the full SavedModel, the DeepView converter is able to retrieve all required parameters and generate a fully quantized model with an embedded box decoder.
The following command has been tested with the MobileNet V2 SSD from the TensorFlow 1.x Object Detection Model Zoo. The labels.txt and coco_people samples are attached to this article for convenience.
deepview-converter --input_names Preprocessor/sub --output_names Postprocessor/convert_scores,Postprocessor/ExpandDims_1 --input_shape 1,300,300,3 --quantize --quant_normalization signed --samples coco_people --input_type uint8 --output_type int8 --quant-tensor --labels labels.txt mobilenet_ssd_v2.pb mobilenet_ssd_v2.rtm
When quantizing, it is strongly recommended to use sample images from your own dataset to improve the quantization parameters. We find that 10 to 100 images is ideal: fewer than 10 causes a loss in accuracy, while more than 100 yields strongly diminishing returns.
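The sample-count guidance above can be applied with a small helper script. The following is a minimal sketch, not part of VisionPack: the function name pick_samples, the directory names, and the default of 50 images are hypothetical, and it assumes that --samples accepts a directory of images, as the coco_people argument in the command suggests.

```python
import random
import shutil
from pathlib import Path

# File extensions treated as images (an assumption; extend as needed).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def pick_samples(dataset_dir, samples_dir, count=50, seed=0):
    """Copy a random subset of images from dataset_dir into samples_dir.

    count defaults to 50, inside the 10 to 100 range suggested above.
    Returns the list of copied file paths.
    """
    images = sorted(
        p for p in Path(dataset_dir).rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    if not images:
        raise ValueError(f"no images found under {dataset_dir}")

    # A fixed seed keeps the selection reproducible between conversions.
    chosen = random.Random(seed).sample(images, min(count, len(images)))

    out_dir = Path(samples_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for index, src in enumerate(chosen):
        # Prefix with an index so files with equal names in different
        # sub-directories do not overwrite each other.
        dst = out_dir / f"{index:04d}_{src.name}"
        shutil.copy2(src, dst)
        copied.append(dst)
    return copied
```

Calling pick_samples("my_dataset", "my_samples") and then passing --samples my_samples in place of --samples coco_people would be the intended use. Both directory names are placeholders.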
When quantizing, per-tensor quantization (--quant-tensor) is suggested for optimal NPU performance. It can cause a greater loss in accuracy than per-channel quantization when the model is not fully trained. Please verify your results, as the impact varies by use case.
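To see why per-tensor quantization can cost accuracy, consider the toy comparison below. It is a generic illustration of symmetric int8 quantization, not the scheme the DeepView converter actually implements; the weight values and all function names are made up for the example. When two channels have very different value ranges, a single shared scale leaves few quantization steps for the small-valued channel.

```python
def symmetric_scale(values):
    """Scale that maps the largest magnitude in values onto the int8 limit 127."""
    peak = max(abs(v) for v in values)
    return peak / 127.0 if peak > 0 else 1.0


def roundtrip(values, scale):
    """Quantize to int8 with the given scale, then convert back to float."""
    return [max(-127, min(127, round(v / scale))) * scale for v in values]


def mean_abs_error(channels, per_channel):
    """Mean absolute reconstruction error over all weights."""
    shared = symmetric_scale([v for ch in channels for v in ch])
    errors = []
    for ch in channels:
        # Per-channel: each channel gets its own scale.
        # Per-tensor: every channel uses the one shared scale.
        scale = symmetric_scale(ch) if per_channel else shared
        errors.extend(abs(v - r) for v, r in zip(ch, roundtrip(ch, scale)))
    return sum(errors) / len(errors)


# Hypothetical weights: one wide-range channel, one narrow-range channel.
channels = [[1.0, -0.5, 0.25], [0.01, -0.004, 0.002]]
per_tensor_error = mean_abs_error(channels, per_channel=False)
per_channel_error = mean_abs_error(channels, per_channel=True)
print(f"per-tensor  mean abs error: {per_tensor_error:.6f}")
print(f"per-channel mean abs error: {per_channel_error:.6f}")
```

In this example the per-channel error comes out lower. How much this matters for a real model depends on its weight distribution, which is why results should be verified on your own data.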
The ideal input type when working with images is uint8, as it provides the most efficient image input pipeline. The ideal output type is int8: it is the native internal representation of DeepViewRT for quantized models and therefore requires the least output processing to read results.
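If an application needs real-valued numbers from int8 outputs, the usual affine quantization relation real = scale * (q - zero_point) applies. The sketch below assumes the output tensor follows that common convention; the scale and zero_point values are hypothetical and would have to be read from the converted model, which is not shown here.

```python
def dequantize(q_values, scale, zero_point):
    """Convert int8 quantized values to floats: real = scale * (q - zero_point)."""
    return [scale * (q - zero_point) for q in q_values]


# Hypothetical parameters mapping the int8 range [-128, 127] onto [0, 1).
scores = dequantize([-128, 0, 127], scale=1 / 256, zero_point=-128)
print(scores)  # [0.0, 0.5, 0.99609375]
```

For simple thresholding it is often cheaper to quantize the threshold once and compare it against the raw int8 values instead of dequantizing every score.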
The DeepView converter allows the model to be trimmed at conversion time, which is helpful if portions of the model cannot be converted. For MobileNet SSD models it was previously suggested to trim the decoding, but improvements to the converter now allow the full decoder to be embedded, which provides better end-to-end performance. The command above shows the correct output names to embed the decoder; they may need to be adapted depending on your specific model implementation.
Using these conversion instructions, the resulting model has been benchmarked on the NXP i.MX 8M Plus EVK with the OV5640 camera configured for 1920x1080. load_frame takes about 2 ms, inference takes under 12 ms, and matrix NMS takes 4-8 ms depending on the number of candidate objects.
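The benchmark figures above can be combined into a rough per-frame budget. This is a back-of-the-envelope sketch that assumes the three stages run one after another with no overlap and uses 12 ms as the upper bound for inference; the function name frame_budget is hypothetical.

```python
def frame_budget(load_ms, inference_ms, nms_ms):
    """Return (total milliseconds per frame, frames per second) for serial stages."""
    total_ms = load_ms + inference_ms + nms_ms
    return total_ms, 1000.0 / total_ms


# Few candidate objects (NMS at 4 ms) and many candidate objects (NMS at 8 ms).
best_ms, best_fps = frame_budget(2, 12, 4)
worst_ms, worst_fps = frame_budget(2, 12, 8)
print(f"best case:  {best_ms} ms per frame, about {best_fps:.1f} fps")
print(f"worst case: {worst_ms} ms per frame, about {worst_fps:.1f} fps")
```

Under these assumptions the pipeline takes roughly 18 to 22 ms per frame, or about 45 to 55 frames per second.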
The model can be tested with the VisionPack Detection Demo using the following parameters. The -T parameter is optional and disables threading, since this model processes frames much faster than the camera delivers them.
./detectgl -T mobilenet_ssd_v2.rtm