Note: This page contains legacy documentation for a previous version of ModelPack. Please refer to ModelPack 2: Overview for the latest information.
Introduction
ModelPack is a highly optimized package capable of solving Semantic Segmentation and Object Detection problems on the NPU. The package is exposed through eIQ Portal and through docker containers. When used from eIQ Portal, it is limited to Object Detection; the docker interface exposes both segmentation and detection. To start working with ModelPack you need to request a license. Licenses can be requested for working with eIQ Portal or with docker containers (License Requesting for docker containers).
While eIQ Portal is limited to Object Detection problems and dataset formats, our docker containers expose a more general SDK capable of working with custom datasets in two different formats: JSON and TFRecord. So far, there is one docker container capable of exporting a dataset from a SageMaker manifest file to TFRecord or JSON (see Convert SageMaker Input Manifest file to TFRecord).
In this article we will cover the main functionality needed to train ModelPack for solving Object Detection problems at the edge.
Training ModelPack
To train ModelPack we must configure our environment first:
1. Request a license from Au-Zone Technologies (Docker License Request)
2. Prepare the dataset in the right format (SageMaker Export Guide)
3. Configure augmentation techniques and loss weighting (Advanced Model Tuning)
Points 1 and 2 are mandatory, while point 3 is optional since the default options are already set to provide the best performance.
The docker container in charge of training ModelPack is distributed under the name:
deepview/modelpack:latest
ModelPack Regular Command Line Parameters
In this section we describe the main parameters needed to train ModelPack from the command line. Note that advanced command line arguments (augmentation and losses) are described in point 3 (Advanced Model Tuning).
- --task [-t]: Defines the task we intend to solve. It can be either detection or segmentation. For the purposes of this tutorial, use --task=detection or -t detection.
- --shape [-i]: Sets the resolution of the model: height, width, and channels of the input images.
- e.g. --shape=320,320,3 or -i 270,480,3.
- --dataset [-d]: Path to the dataset info file. For training ModelPack, it is recommended to use either TFRecord or JSON formats. The dataset is described in the dataset_info.yaml file, and this parameter defines the path to it. The path should be relative to the docker container, since the container does not have access to the host PC file system.
- e.g. --dataset=/data_out/dataset_info.yaml or -d /data_out/dataset_info.yaml
- --checkpoints [-c]: While training the model, after each epoch is completed, the trainer saves a checkpoint with the current weights of the model (last.h5). The trainer also evaluates the model on the validation dataset to check for improvements in the given --metric over the previous epoch. If an improvement was made, the trainer saves the current weights to the best.h5 file. At the end of the training process, the checkpoint folder will contain two files: last.h5 and best.h5.
- e.g. --checkpoints=/checkpoints or -c /checkpoints.
- --logs [-l]: Path for TensorBoard logs. The trainer integrates with TensorBoard and will log data to a new timestamped folder.
- e.g. --logs=/logs or -l /logs. To inspect the training logs, use the ModelPack Logging docker container.
- --epochs [-e]: Number of epochs for training the model.
- e.g. --epochs=100 or -e 100
- --batch-size [-b]: Number of images per batch. Note that the batch size is highly constrained by GPU memory and the input dimensions.
- e.g. --batch-size=10 or -b 10.
- --initial_lr: The base learning rate, e.g. 0.001, that the optimizer uses to update the model weights. The optimal optimizer for this model and loss function is Adam.
- e.g. --initial_lr=0.001
- --warmup_lr: Initial learning rate when using the warmup strategy. Usually this is the smallest learning rate you want your optimizer to use.
- e.g. --warmup_lr=0.000001
- --warmup-epochs [-w]: Number of epochs used to warm up the optimizer.
- e.g. --warmup-epochs=3.
- --metric: When training the model, after each epoch the model is evaluated on the validation set. The metric parameter defines how the performance of the model is measured. Allowed values:
- acc - integrates true positives and false positives, penalizing false detections as a negative ratio. It is a more reliable metric and should be preferred over map (Mean Average Precision).
- map - focuses only on positive detection rates.
- recall - focuses on the number of missed detections.
- e.g. --metric=acc, --metric=recall or --metric map. The best checkpoint will be stored according to the selected metric. Default is acc.
- --iou_threshold: At evaluation time, some post-processing parameters must be set. This parameter sets the Intersection over Union (IoU) threshold.
- e.g. --iou_threshold=0.5
- value between 0 and 1. Default is 0.5.
- --score_threshold: At evaluation time, some post-processing parameters must be set. This parameter sets the score threshold, which is used to filter out low-scored predictions.
- e.g. --score_threshold=0.45
- value between 0 and 1. Default is 0.5.
- --weights: Used to restore weights from a previously trained session. The parameter accepts two different options:
- coco - initializes the model with weights trained on the COCO dataset.
- a path to a Keras weights/model file.
- e.g. --weights coco or --weights=/data_out/best.h5
- --display: Determines how many images will be shown in TensorBoard during the logging process. By default this parameter is 0, which means no images will be shown in the logs. Otherwise, TensorBoard will display the minimum of the --display value and the number of samples in the validation set.
- --exponential_decay: If this parameter is used, the warmup strategy will not be taken into consideration. The optimizer starts with the initial_lr value and applies exponential decay after the number of epochs specified by this parameter. The parameter expects an integer larger than 0.
- --object_size: This parameter can be used to make the model pay more attention to the dominant object size within the dataset. In other words, if we have a dataset where the objects tend to be small, we can use the small value. The medium value tends to be more conservative, since it takes the larger anchors from the smaller objects, the medium anchors, and the smaller anchors from the larger objects. Finally, the large value can be used to make the model focus on large objects. The parameter expects a string value that can be either small, medium, or large. By default it is set to large.
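The roles of --score_threshold and --iou_threshold during evaluation can be sketched as follows. This is only an illustrative approximation, not ModelPack's actual evaluation code; the box format ([x1, y1, x2, y2]) and helper names are assumptions made for the example:

```python
def iou(box_a, box_b):
    """Intersection over Union of two [x1, y1, x2, y2] boxes."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def filter_detections(detections, score_threshold=0.5):
    """Drop predictions scored below --score_threshold."""
    return [d for d in detections if d["score"] >= score_threshold]

def matches_ground_truth(pred_box, gt_box, iou_threshold=0.5):
    """A prediction counts as a match when its IoU with the
    ground-truth box reaches --iou_threshold."""
    return iou(pred_box, gt_box) >= iou_threshold
```

In short, --score_threshold discards low-confidence boxes before metrics are computed, and --iou_threshold decides how much overlap with a ground-truth box is required for a prediction to count as correct.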
Warmup Scheduler
The warmup strategy is a very good technique for starting to train a model. Notice that at the beginning of the training process the model does not have any knowledge about the dataset. If at this point we use a large learning rate, the model will apply large updates to the weights, and these large updates harm the training process during the initial steps. In other words, the model is a bit disoriented and will start jumping from one solution to another.
The idea of the warmup strategy is to start with a tiny learning rate (near 0) and increase it over a number of steps/epochs. By doing this, the optimizer applies smaller updates at the beginning and more aggressive updates after obtaining more knowledge about the data. After finishing the warmup steps/epochs, the learning rate stabilizes near the initial learning rate value (--initial_lr). Once the optimizer reaches this learning rate, we use a cosine function to automatically decrease the learning rate down to a given value (--warmup_lr).
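The schedule described above can be sketched roughly as a linear ramp from --warmup_lr up to --initial_lr over the warmup epochs, followed by a cosine decay back down toward --warmup_lr. This is an illustrative sketch, not ModelPack's internal implementation; the function name and default values are assumptions:

```python
import math

def learning_rate(epoch, total_epochs, warmup_epochs=3,
                  warmup_lr=1e-6, initial_lr=1e-3):
    """Illustrative warmup + cosine-decay schedule (assumed shape).

    Linearly ramps from warmup_lr to initial_lr during the warmup
    epochs, then follows a cosine curve from initial_lr back down
    toward warmup_lr over the remaining epochs.
    """
    if epoch < warmup_epochs:
        # Linear warmup: tiny updates while the model knows little.
        frac = (epoch + 1) / warmup_epochs
        return warmup_lr + frac * (initial_lr - warmup_lr)
    # Cosine decay after warmup: from initial_lr down to warmup_lr.
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))
    return warmup_lr + cosine * (initial_lr - warmup_lr)
```

With --warmup-epochs=3 and --epochs=100, the rate climbs for the first three epochs, peaks at --initial_lr, then decays smoothly toward --warmup_lr by the final epoch.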
Training Command
In this section we will use a sample command to guide the reader through the command line options and values from the docker perspective.
docker run -it --gpus=all
(1) --mac-address 02:42:ac:**:**:**
(2) -v /data/au-zone/licenses/modelpack:/licenses
(3) -v /data/datasets/tfrecord/:/data_out
(4) -v /data/outputs:/training
(5) deepview/modelpack:latest
(6) --task=detection
(7) --shape=270,480,3
(8) --batch-size=10
(9) --epochs=100
(10) --checkpoints=/training/checkpoints
(11) --logs=/training/tensorboard
(12) --dataset=/data_out/dataset_info.yaml
- MAC address definition. It should be activated in the Au-Zone license servers
- The modelpack.lic file should be in the /data/au-zone/licenses/modelpack directory
- The dataset should be stored in the /data/datasets/tfrecord directory. It should contain the dataset_info.yaml file as well as the tfrecord_train and tfrecord_val folders
- The /data/outputs directory will be mounted as /training within the docker container to store checkpoints and TensorBoard logs there
- The container image name
- Starts a detection task model training
- the model will be trained using the input resolution [height=270, width=480, channels=3]
- each batch will contain 10 images
- training will be running for 100 epochs
- checkpoints will be stored in /training/checkpoints/<time-stamp>
- TensorBoard logs will be stored in /training/tensorboard/<time-stamp>
- dataset is expected to be in /data_out/dataset_info.yaml.
Once the trainer starts running it will print the following information:
- training batch samples: 40 # number of batches per-epoch
- Saving best iteration checkpoint at : /training/checkpoints/2022-12-20_21-02-46.226630/best.h5
- Saving last iteration checkpoint at : /training/checkpoints/2022-12-20_21-02-46.226630/last.h5
Notice that 2022-12-20_21-02-46.226630 is the time stamp of the current training session. On the host computer, checkpoints will be under /data/outputs/checkpoints/2022-12-20_21-02-46.226630/*.h5, while TensorBoard logs will remain under /data/outputs/tensorboard/2022-12-20_21-02-46.226630/*
Conclusions
In this article we have shown how to train ModelPack from the docker container. We have also explained the intent of the warmup strategy as well as the basic training parameters. For advanced training parameters, read Advanced Model Tuning. Finally, we provided a sample training command integrating all the basic training features, tested on a custom dataset. Remember that the checkpoints folder will store the weights with the best evaluation metric (best.h5) as well as those from the last trained epoch (last.h5).