2019 February 18

The game of labeling

Given the data-hungry models we have, it is no wonder labeling is a significant portion of the process. Working on the restaurant automation project for one and a half years, I constantly had to add more data to the dataset. My workflow primarily dealt with image data collected from different sources: web scraping, visits to restaurants, the prototype live environment, etc. Me being me, I wound up implementing the labeling pipeline using OpenCV and Python. Initially it was quite simple, but with time it grew more and more complex. At first I only needed to add labels; as the project progressed I needed to annotate the data with counts, bounding boxes, and segments. The script kept growing, and we all know how much fun it is to maintain large Python scripts. So I took a break and restructured the whole labeling pipeline, which I have iterated upon since.

At one point, labeling roughly 1000 images with the label and the count for each item in the image took me about 4 hours, which is not fun (forget the sore fingers; looking at food that long drives me nuts). My current record is labeling 5136 images with labels and counts in 2 hours and 22 minutes, which is about 36 images per minute. Boom! Here I outline the process I follow to collect and label data, along with the metadata framework I use in ML problems.

Collecting Images

The current setup I have has four cameras that I use to capture the images. I name the images in a manner that allows the names to be sorted in the order they were captured. The advantage of this is that, since we are capturing images from the same scene, the images of a scene are bound to have similar labels. This allows me to label these images faster by saying that an image's labels are the same as those of the image labeled before it. In addition, I've made it a habit to follow a pattern when capturing images. For example, if I start by capturing images that have rolls and cupcakes, next I will capture images with rolls, cupcakes, and pastries. Since the images are named to sort in the order they were captured, similar scenes end up close to each other, which again lets me label faster by saying that a scene has labels similar to the scene labeled before it. The naming convention I follow for the names of the images is as follows:

	<time-stamp>_<scene-number>_<camera-number>.jpg

The time stamp is the time at which the particular image collection session started, followed by the scene number and the camera number. This naming convention not only allows for the optimizations I mentioned above, but also lets me identify where an image came from. One example of this is when I had to separate the dataset into two based on when it was captured. Due to some mistakes on my part, the file metadata had changed; the fact that I had the time-stamp of the session in which the images were captured saved the day.

Another minor detail that gave me a headache: if the numbers have no leading zeros when they are turned into strings, sorting by name results in an order like 199, 20, 200, 21, …. It is possible to just extract the number from the name and sort based on that (which I implemented just for the kicks of it), but having leading zeros is a lot more convenient. I generally add leading zeros such that the scene numbers are 4 characters and the camera numbers 2 characters long.
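
As a minimal sketch of this convention (the helper name and format string are my illustration here, not the actual script), Python's format specifiers handle the padding:

	import time

	def image_name(session_ts, scene, camera):
	    # Zero-pad so that sorting names lexicographically matches capture
	    # order: scene numbers are 4 characters, camera numbers 2.
	    return f"{session_ts}_{scene:04d}_{camera:02d}.jpg"

	# The time-stamp is taken once, when the collection session starts.
	session_ts = int(time.time())
	print(image_name(session_ts, scene=27, camera=3))
	# e.g. 1550448000_0027_03.jpg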

The labeling pipeline

With the initial labeling pipeline I had, I would label each image for count, label, crop, and then bounding box, all at once. That is a lot less efficient; a better way is to break down the tasks and focus on one task at a time. Each task, handled by its own script, produces an intermediate output which is used by the next stage. The steps taken are as follows:

  1. Class labels: The script walks through all the image files pointed to it and loads each file. The user is prompted to provide the label(s) for each image. The approach I took is to have a separate file with all the possible labels and a number assigned to each of them; the input provided at the prompt is these number(s). In addition, I have a shortcut key to say that the label for the current image is the same as that of the previous image, which exploits the process used to collect the images to speed up the labeling (a minimal sketch of such a loop is given after this list). The output is a file which contains the name of each file and the label(s) related to it.
  2. Cleaning labels: It's very common, at least for me, to ignore/drop some of the data for various reasons, like a blurry image. I also use this step to address other discrepancies in the dataset: renaming labels which are misspelled, getting the number of images for each class to determine the bias that needs to be addressed in the dataset, etc. (see the cleaning sketch after this list). The output of this step is a file with the same format as above, with the irrelevant entries removed and the labels and names processed accordingly.
  3. Count/amount/quantity annotations: In my current project, I need to annotate the dataset with the count/amount/quantity of the item(s) in each image. The output of the previous step provides the label(s) and the location of each image file, which is used to give the user the appropriate prompts. Since the item(s) in an image are already known, the script prompts the user to enter the count/amount/quantity for each item. As in the labeling script, here too I have a shortcut to reuse the input given for the previous image. This step produces an output which contains each image's label(s) and the respective count/amount/quantity.
  4. Bounding box annotations (only if I need them): The next step is to annotate each image with bounding boxes. I still have not figured out a quick way to annotate bounding boxes; it's a tedious task no matter how it's approached. Nonetheless, the output of this step is, for each image, the bounding box annotations in addition to the annotations from the previous steps. Even though it might seem faster to annotate bounding boxes directly instead of going through the previous steps, having the earlier annotations already in place saves a few extra keystrokes and clicks per image, which saved me a significant amount of time.
  5. Process data: Now that all the relevant data has its metadata, I generate the data and metadata that will be used to train models. There are several reasons I make sure I keep the original data and its metadata untouched and generate the training data and metadata separately. First, the original data might be useful for a different task or might require different pre-processing steps; having the original data saves a lot of the pain I would otherwise have to deal with when something like that happens. Even though it seems trivial, it's easy to overlook this step. Another reason is that the data used in training could require a different format or metadata scheme. I've implemented the data generation scripts so that how the data is generated can be easily modified. In addition, the generation scripts can work with the intermediate annotations produced by any of the above steps. Sometimes I only need the labels and count/amount/quantity of the images in a dataset; in that case I use the output of step 3 to process the data (the last sketch after this list shows this step).
  6. (optional) Crop: On some occasions there might be a need to crop images before they are used for training. For this, I have a script to add the cropping coordinates. The reason I don't directly crop the data is the same as the reasons I point out in step 5. The cropping coordinates are stored as another field in the metadata/annotation files. I generally do this between steps 2 and 3, or steps 3 and 4. If crop data is provided, step 5 can use it to crop the output data when needed.
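
To make step 1 concrete, here is a minimal sketch of such a labeling loop with OpenCV. The key bindings, the hard-coded label table, and the CSV output format are assumptions for illustration, not my actual scripts:

	import csv
	import cv2

	# Hypothetical label table; in practice this comes from the separate
	# labels file, with a number assigned to each possible label.
	LABELS = {ord(str(i)): name
	          for i, name in enumerate(["roll", "cupcake", "pastry"], start=1)}

	def label_images(image_paths, out_csv):
	    previous = []  # labels of the previously labeled image
	    with open(out_csv, "w", newline="") as f:
	        writer = csv.writer(f)
	        for path in image_paths:
	            img = cv2.imread(path)
	            current = []
	            while True:
	                cv2.imshow("label", img)
	                key = cv2.waitKey(0) & 0xFF
	                if key in LABELS:        # digit keys pick a label
	                    current.append(LABELS[key])
	                elif key == ord("s"):    # shortcut: same as previous image
	                    current = list(previous)
	                elif key == 13:          # Enter commits this image
	                    break
	            writer.writerow([path] + current)
	            previous = current
	    cv2.destroyAllWindows()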
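Step 2 can then be a single pass over that file. The rename table and the drop list below are hypothetical placeholders for whatever discrepancies turn up:

	import csv
	from collections import Counter

	RENAMES = {"cupcaek": "cupcake"}   # hypothetical misspelling fixes
	DROPPED = {"blurry_scene.jpg"}     # hypothetical images to ignore

	def clean_labels(in_csv, out_csv):
	    counts = Counter()
	    with open(in_csv) as fin, open(out_csv, "w", newline="") as fout:
	        writer = csv.writer(fout)
	        for path, *labels in csv.reader(fin):
	            if path in DROPPED:
	                continue
	            labels = [RENAMES.get(l, l) for l in labels]
	            counts.update(labels)
	            writer.writerow([path] + labels)
	    # Per-class counts expose the bias that needs addressing.
	    print(counts.most_common())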
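Finally, a sketch of how step 5 might consume the intermediate annotations and apply the optional crop from step 6. The row layout (path, labels/counts, then optional x, y, w, h) is an assumption for the sketch:

	import csv
	import os
	import cv2

	def generate_dataset(annotations_csv, out_dir):
	    # The originals stay untouched; processed copies go to out_dir.
	    with open(annotations_csv) as f:
	        for row in csv.reader(f):
	            path, rest = row[0], row[1:]
	            img = cv2.imread(path)
	            # Crop only if the last four fields look like coordinates
	            # (a heuristic that is good enough for this sketch).
	            if len(rest) >= 4 and all(v.isdigit() for v in rest[-4:]):
	                x, y, w, h = map(int, rest[-4:])
	                img = img[y:y + h, x:x + w]
	            cv2.imwrite(os.path.join(out_dir, os.path.basename(path)), img)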

Those are the practices I follow. I will update this as I learn and improve the process further. Note that even though the process outlined here is focused on images, it can be adapted to other forms of data.
