This article describes how to convert a dataset in SageMaker or LabelBox format into an output dataset that our trainer can use, such as TFRecord or Darknet. The conversion is performed by a new Docker container named deepview/dataset-converter.
The conversion container accepts two input formats: SageMaker and LabelBox. The output formats include TFRecord, Darknet, and our proprietary DeepView Project format.
The SageMaker input manifest file stores multiple JSON objects (one per line) that define image references and annotations. Each line contains a source-ref key that references the image, which can be in any format (PNG, JPG, etc.). This tutorial uses the Playing Cards (v7) dataset, which consists of multiple JPG image files and one manifest file (output.manifest in the example at the bottom of the page). The local copy of the dataset looks like the picture below: multiple images and a single manifest file that contains all the annotations.
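As a minimal sketch of how such a manifest can be read, the Python snippet below parses one JSON object per line and collects the source-ref values. The function name `read_manifest`, the bucket name, and the sample lines are illustrative assumptions and not part of the converter; real manifests also carry annotation keys whose names depend on the labeling job.

```python
import json


def read_manifest(text):
    """Parse a JSON Lines manifest: one JSON object per non-empty line."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        records.append(json.loads(line))
    return records


# Hypothetical sample; a real manifest line also holds annotation keys.
sample = (
    '{"source-ref": "s3://example-bucket/cards/img_0001.jpg"}\n'
    '{"source-ref": "s3://example-bucket/cards/img_0002.jpg"}\n'
)
records = read_manifest(sample)
image_refs = [record["source-ref"] for record in records]
print(image_refs)
```

Reading the file line by line, rather than as one JSON document, matches the one-object-per-line layout described above.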
LabelBox is a company that aids in dataset curation, labeling, and model building. They host annotations as part of their cloud services for images located on their own or other cloud services. These annotations and image locations can be exported as a JSON file. Unfortunately, we do not have an example LabelBox dataset to import, but our converter has been used successfully with LabelBox datasets from our users.
Output Dataset Considerations
Both SageMaker and LabelBox formats only describe how ground truth annotations are assigned to images (in these cases, for detection). For training, the dataset needs to be split into training and validation subsets, and the dataset converter does this. By default, 20% of the entire dataset is set aside for validation, though this is configurable. Before splitting, the dataset is shuffled to randomize which images are set aside for the validation subset.
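The shuffle-then-split step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the converter's actual code: the function name `split_dataset` and the fixed `seed` are hypothetical, and the seed is only there to make the example repeatable.

```python
import random


def split_dataset(items, validation_fraction=0.2, seed=0):
    """Shuffle a copy of items, then reserve a fraction for validation."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # assumed fixed seed: repeatable split
    n_val = int(round(len(shuffled) * validation_fraction))
    # First n_val shuffled items become validation, the rest training.
    return shuffled[n_val:], shuffled[:n_val]


images = [f"img_{i:04d}.jpg" for i in range(100)]
train, val = split_dataset(images)
print(len(train), len(val))  # 80 20
```

With the default fraction of 0.2, 100 images yield 80 training and 20 validation samples, and no image appears in both subsets.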
Users may notice slight performance differences between datasets in TFRecord and Darknet formats. Generally, TFRecord datasets are expected to be faster, assuming the training machine can load the entire dataset into RAM. Darknet is slower because it loads each image individually, but it has smaller RAM requirements than TFRecord. Both outputs are provided so that users can determine which better suits their needs.
The resulting output TFRecord dataset follows the schema defined in TFRecord Format for ModelPack. To export a dataset into TFRecord format, the main entry point of the converter container reads the input dataset, parses it, and creates TFRecord representations of the dataset in the output location (given as a parameter). The resulting dataset is laid out as follows:
- dataset_info.yaml: This file contains the paths to the folders where the training and validation samples are stored. The names of the classes are also included.
- tfrecord_train: This folder contains all the TFRecord files corresponding to the training set.
- tfrecord_val: This folder contains all the TFRecord files corresponding to the validation set.
This container has all the utilities needed to produce a TFRecord dataset compatible with ModelPack training pipelines.
The TFRecord format usually has a partition parameter that defines how many samples are stored in each TFRecord file. The optimal size of each resulting TFRecord file is around 100 MB, which takes advantage of parallelism and data sharing buffers. The number of samples per TFRecord file therefore depends on two factors: the number of annotations per image and the image resolution. Currently, the dataset converter locks this value at 100 MB and it is not configurable by the user.
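To illustrate why the number of samples per file varies with sample size, the sketch below groups samples into shards that stay under a 100 MB target. The converter's real partitioning logic is not documented here, so `plan_shards` and the sample sizes are hypothetical.

```python
TARGET_SHARD_BYTES = 100 * 1024 * 1024  # ~100 MB, the size cited above


def plan_shards(sample_sizes, target_bytes=TARGET_SHARD_BYTES):
    """Group consecutive sample sizes (in bytes) into shards under target_bytes."""
    shards, current, current_bytes = [], [], 0
    for index, size in enumerate(sample_sizes):
        # Start a new shard when adding this sample would exceed the target.
        if current and current_bytes + size > target_bytes:
            shards.append(current)
            current, current_bytes = [], 0
        current.append(index)
        current_bytes += size
    if current:
        shards.append(current)
    return shards


# 1000 hypothetical samples of 500 KiB each (image plus annotations).
sizes = [500 * 1024] * 1000
shards = plan_shards(sizes)
print(len(shards), len(shards[0]))  # 5 204
```

Larger images or more annotations per image increase the bytes per sample, so fewer samples fit in each file.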
The Darknet dataset format follows the schema defined in Darknet Ground Truth Annotations Schema. Once converted, the resulting dataset will be represented as follows:
- dataset_info.yaml: This file contains the path to the folders where the images and labels are kept. The names of the classes are also contained here. This is not a standard part of the Darknet format.
- labels.txt: A text file containing the labels of the dataset.
- darknet/labels: This folder contains all the labels, separated into training and validation sub-folders.
- darknet/images: This folder contains all the images, separated into training and validation sub-folders.
This container has all the utilities needed to produce a Darknet dataset compatible with ModelPack training pipelines.
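For orientation, the sketch below converts a pixel bounding box into a label line, assuming the conventional Darknet layout of `class x_center y_center width height` with coordinates normalized to the image size. The schema referenced above remains the authoritative definition; `to_darknet_line` is a hypothetical helper and not part of the converter.

```python
def to_darknet_line(class_id, left, top, width, height, image_w, image_h):
    """Convert a pixel box (left, top, width, height) to a Darknet-style label line."""
    x_center = (left + width / 2) / image_w
    y_center = (top + height / 2) / image_h
    norm_w = width / image_w
    norm_h = height / image_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {norm_w:.6f} {norm_h:.6f}"


line = to_darknet_line(3, left=100, top=50, width=200, height=100, image_w=400, image_h=200)
print(line)  # 3 0.500000 0.500000 0.500000 0.500000
```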
How to run the container
This section explains in detail the command line options of the converter container, as well as some Docker features that will help you work with the container and convert the dataset successfully.
- -s [--source]: absolute or relative path and filename to either the SageMaker manifest file or the LabelBox JSON file source dataset.
- --source_format: the input dataset format; tested arguments are sagemaker and labelbox.
- -d [--destination]: absolute or relative path to the destination directory of the converted output dataset
- --dest_format: the output dataset format; tested arguments are tfrecord and darknet.
- --validation_set_size: the size of the validation subset as a fraction of the entire dataset. It should be a value between 0 and 1. The default value is 0.2, which reserves 20% of the entire dataset for validation only.
Note: It is recommended to use the same training and validation subsets across training runs. Do not rerun the split between training runs, because it may move images from the training subset to the validation subset, which will cause errors in the training results.
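The options above can be mirrored in a small Python wrapper that validates arguments before the container is launched. This is a hypothetical sketch, not the converter's own parser: marking options as required and restricting the formats to the tested values are assumptions made for illustration.

```python
import argparse


def fraction(value):
    """argparse type: a float strictly between 0 and 1."""
    number = float(value)
    if not 0.0 < number < 1.0:
        raise argparse.ArgumentTypeError("must be between 0 and 1")
    return number


def build_parser():
    # Hypothetical mirror of the documented options, for pre-flight checks only.
    parser = argparse.ArgumentParser(description="Validate dataset converter options")
    parser.add_argument("-s", "--source", required=True)
    parser.add_argument("--source_format", choices=["sagemaker", "labelbox"], required=True)
    parser.add_argument("-d", "--destination", required=True)
    parser.add_argument("--dest_format", choices=["tfrecord", "darknet"], required=True)
    parser.add_argument("--validation_set_size", type=fraction, default=0.2)
    return parser


args = build_parser().parse_args([
    "-s", "/data_in/output.manifest",
    "--source_format", "sagemaker",
    "-d", "/data_out",
    "--dest_format", "tfrecord",
])
print(args.validation_set_size)  # 0.2
```

Checking the fraction up front catches values such as 20 (meant as a percentage) before a long conversion starts.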
Calling the Docker container
Before running the Docker container, a few requirements must be met. The first requirement relates to the input and output pipeline.
- Use the -v option to mount the local directory that contains the images into /data_in (docker run -it -v D:/data/images:/data_in). By default, the container's internal reference to the image folder points to /data_in.
- Use the -v option to mount the local directory where the final dataset (TFRecord format) will be stored into /data_out (docker run -it -v D:/data/tfrecord:/data_out). By default, the container's internal reference to the output folder points to /data_out.
docker run -it \
-v /data/datasets/sagemaker/playing_cards_train_v7:/data_in \
-v /data/datasets/tfrecord:/data_out \
deepview/dataset-converter \
--source /data_in/output.manifest \
--source_format sagemaker \
--destination /data_out \
--dest_format tfrecord
Running the above command stores the TFRecord dataset in the /data/datasets/tfrecord folder.
Note that the paths in the Dataset Information YAML are relative to the Docker container.
Note: The samples key is a counter that refers to the number of instances in the dataset, NOT the number of files within the directories.
During training, ModelPack reads the TFRecord files from the images path and the class names from the configuration file.
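Because the paths written to the Dataset Information YAML refer to the container's file system, it can help to translate them back to host paths when inspecting the output. The sketch below does this with the mount pairs from the docker run command; `container_to_host` is a hypothetical helper, and the tfrecord_train folder name is taken from the output layout described earlier.

```python
from pathlib import PurePosixPath


def container_to_host(path, mounts):
    """Translate a container path to its host path using -v style mounts."""
    target = PurePosixPath(path)
    for host_dir, container_dir in mounts.items():
        try:
            relative = target.relative_to(container_dir)
        except ValueError:
            continue  # path is not under this mount point
        return str(PurePosixPath(host_dir) / relative)
    raise ValueError(f"{path} is not under any mounted directory")


# Host directory -> container mount point, as passed to docker run -v.
mounts = {"/data/datasets/tfrecord": "/data_out"}
host_path = container_to_host("/data_out/tfrecord_train", mounts)
print(host_path)  # /data/datasets/tfrecord/tfrecord_train
```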
A Dataset Information YAML is also generated with Darknet outputs, which will look as follows:
This works exactly the same way as the one generated for TFRecord outputs.
In this article we have shown how to convert a dataset from SageMaker and LabelBox inputs into a TFRecord or Darknet dataset ready to be used with ModelPack. We have also shown the command line options and documented some concepts, such as splits and partitions, that are useful for speeding up the training process.
- Use an Input Manifest File
- TFRecord and tf.train.Example
- TFRecord Format for ModelPack
- TFRecord Optimal File Size