Introduction
In this article we show how to convert a dataset based on a SageMaker input manifest file into the TFRecord format [1]. A SageMaker input manifest file stores multiple JSON objects (one per line) that define image references and annotations. Each annotation line contains a source-ref key that references the image file, which can be in any format (PNG, JPEG, etc.). In this tutorial we showcase the Playing Cards (v7) dataset, which consists of multiple JPG image files and one manifest file (output.manifest). The local representation of the dataset on the computer looks like the picture below (multiple images and a single manifest file that contains all the annotations).
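To make the manifest format concrete, here is a minimal Python sketch of how such a file can be read. The label attribute key (bounding-box below) is an assumption; its actual name depends on the labeling job that produced the manifest.

import json

# A SageMaker manifest stores one JSON object per line.
with open("output.manifest") as f:
    for line in f:
        entry = json.loads(line)
        image = entry["source-ref"]         # reference to the image file
        labels = entry.get("bounding-box")  # hypothetical label attribute key
        print(image, labels)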
The resulting TFRecord dataset follows the schema defined in TFRecord Format for ModelPack.
Requirements
- Successfully installed ModelPack with Docker
- Registered the ModelPack License File
- Downloaded the Playing Cards Dataset in SageMaker Format
Dataset Export
To export the dataset into TFRecord format we have developed a new Docker container named deepview/dataset-converter. The main entry point of this container reads the SageMaker manifest file, parses it, and creates TFRecord representations of the dataset [2] within the output location (given as a parameter). The resulting dataset is structured as follows:
- dataset_info.yaml: This file contains the paths to the folders where the training and validation samples are stored, as well as the class names.
- tfrecord_train: This folder contains all the TFRecord files corresponding to the training set.
- tfrecord_val: This folder contains all the TFRecord files corresponding to the validation set.
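After a successful export, the output folder therefore looks roughly like this (a sketch; the TFRecord file names inside each folder may differ):

/data_out
├── dataset_info.yaml
├── tfrecord_train/
└── tfrecord_val/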
This container has all the utilities needed to go from the SageMaker annotation format to a TFRecord dataset compatible with ModelPack training pipelines. Using the command line interface, the dataset can be configured in two different ways:
- Splits: How many samples will be assigned to the validation and training sets.
- Partitions: How many images, with their annotations, will be stored in each TFRecord file. Each partition stores multiple images and their respective annotations [4] in order to optimize the training pipeline.
Splits
To define dataset splits we use the command line option --split or -s followed by a float number between 0 and 1, where 0 means no validation set and 1 means the whole dataset is used as the validation set. The purpose of splitting the dataset into training/validation samples is to measure model performance after each epoch.
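The following Python sketch illustrates the idea behind the split; the converter's exact shuffling and rounding may differ.

import random

# Sketch of how --split 0.2 partitions a shuffled dataset. `entries`
# stands in for the 1328 manifest entries of the playing cards dataset.
entries = list(range(1328))
random.shuffle(entries)          # the dataset is shuffled before splitting

n_val = int(len(entries) * 0.2)  # 265 samples go to validation
val_set = entries[:n_val]
train_set = entries[n_val:]      # the remaining 1063 go to training
print(len(train_set), len(val_set))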
Partitions
The partition parameter defines the maximum size of each TFRecord file and, therefore, how many samples are stored within each file. Notice that the optimal size [4] of the resulting TFRecord files is around 100 MB, to take advantage of parallelism and data sharing buffers. The number of samples per TFRecord file depends on two factors: the number of annotations per image and the image resolution.
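As a back-of-envelope illustration of that trade-off (the average encoded sample size below is an assumption; measure it on your own data):

# Rough sketch: estimate how many samples fit in a ~100 MB shard.
target_mb = 100
avg_sample_kb = 350  # hypothetical: encoded image + its annotations
samples_per_shard = (target_mb * 1024) // avg_sample_kb
print(samples_per_shard)  # ~292 samples under this assumption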
How to run the container
This section explains in detail the command line options for the exporter container, as well as some Docker features that will help you work with the container and successfully convert the dataset.
Exporter Parameters
- -a [--annotations]: Path to the SageMaker manifest file.
- -i [--images]: Path to the folder that contains the images. Usually this is the same directory as the manifest file.
- -o [--output]: Path to the output folder. The TFRecord dataset for ModelPack will be built in this folder.
- -f [--format]: Either tfrecord or json. By default this parameter is tfrecord. The JSON format is a special format that supports semantic segmentation as well as object detection.
- -s [--split]: A value between 0 and 1, taken as a percentage of the whole dataset size. The entire dataset is shuffled before the split, so this process is random and creates new training and validation splits every time. Be careful when training/validating on the sets, because results could deviate if validation samples are used during training.
- -p [--partition]: The maximum size of each TFRecord file (in MB); by default it is 100 MB.
Note: It is recommended to use the same training and validation datasets across training runs. Thus, do not re-run the split between training runs, as doing so may move pictures from the training set to the validation set and distort the training results.
Calling the Docker container
Before running the Docker container, a few requirements must be met. The first requirement is related to the input and output pipeline:
- We have to use the -v option to mount the local directory containing the images into /data_in (docker run -it -v D:/data/images:/data_in). By default, the container's internal reference to the folder containing the images points to /data_in.
- We have to use the -v option to mount the local directory where we want to store the final dataset (TFRecord format) into /data_out (docker run -it -v D:/data/tfrecord:/data_out). By default, the container's internal reference to the folder for the resulting dataset points to /data_out.
docker run -it \
-v /data/datasets/sagemaker/playing_cards_train_v7:/data_in \
-v /data/datasets/tfrecord:/data_out \
deepview/dataset-converter \
--annotations=/data_in/output.manifest \
--split 0.2 \
--partition=100
By running the above command, the TFRecord dataset will be stored in the /data/datasets/tfrecord folder.
Notice that the paths in dataset_info.yaml are relative to the Docker container.
classes:
- ace
- three
- two
- jack
- five
- ten
- king
- seven
- queen
- eight
- four
- six
- nine
train:
  annotations: /data_out/tfrecord_train
  images: /data_out/tfrecord_train
  samples: 1063
validation:
  annotations: /data_out/tfrecord_val
  images: /data_out/tfrecord_val
  samples: 265
During training, ModelPack will read the TFRecord files from the images path and the class names from the configuration file (dataset_info.yaml).
Note: The samples key is a counter referring to the number of instances in the dataset, NOT the number of files within the directories.
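As a quick sanity check, the generated description can be inspected from Python. This is a minimal sketch; the PyYAML dependency and the .tfrecord file extension are assumptions, and the /data_out paths must be mapped back to your local mount point.

import glob
import yaml  # pip install pyyaml

# Load the generated dataset description.
with open("dataset_info.yaml") as f:
    info = yaml.safe_load(f)

print(info["classes"])           # ['ace', 'three', 'two', ...]
print(info["train"]["samples"])  # 1063 instances in the training set

# Count the training shards (".tfrecord" extension is an assumption).
shards = glob.glob(info["train"]["images"] + "/*.tfrecord")
print(len(shards), "training TFRecord files")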
Conclusions
In this article we have shown how to convert a dataset from a SageMaker manifest file into a TFRecord dataset ready to be used with ModelPack. We have also shown the command line options and documented concepts such as splits and partitions, both useful for speeding up the training process.
Bibliography
1. Use an Input Manifest File
2. TFRecord and tf.train.Example
3. TFRecord Format for ModelPack
4. TFRecord Optimal File Size