An image run through our multiclass demo
We recently replaced our pedestrian detection demo with a new multi-class detector. In this article, I give an overview of the technology behind the new demo.
The architecture we outlined in our pedestrian detector post is appealingly simple, fast to train, and works fairly well for pedestrians. However, when we tried it on a multi-class dataset, its performance fell short. On PASCAL VOC 2007 (a very popular benchmark since the test set is publicly available), it reached a mere 29% mAP (mean average precision over all classes), far from the state of the art. SSD Multibox and YOLO9000 both report mAPs of around 80%!
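For readers unfamiliar with the metric: VOC 2007 reports, for each class, an 11-point interpolated average precision, and mAP is the mean of these over all classes. A minimal sketch of the per-class computation (the function name and simplified inputs are ours; the official devkit additionally handles "difficult" objects and duplicate detections):

```python
def voc07_ap(scores, is_tp, num_gt):
    """11-point interpolated average precision as used by PASCAL VOC 2007.
    scores: detection confidences; is_tp: 1 if that detection matched a
    ground-truth box, else 0; num_gt: total number of ground-truth boxes."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    precisions, recalls = [], []
    for i in order:  # sweep detections from most to least confident
        tp += is_tp[i]
        fp += 1 - is_tp[i]
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    # Average the best achievable precision at recall 0.0, 0.1, ..., 1.0.
    ap = 0.0
    for r in [i / 10 for i in range(11)]:
        p = max((prec for prec, rec in zip(precisions, recalls) if rec >= r),
                default=0.0)
        ap += p / 11
    return ap
```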
At that point, we knew we had two choices: either try to improve our model, or implement one of the leading ones in the literature. Even though there were many things we could have tried to improve our model, we chose the latter option. The main reason was that while it was uncertain how much we could get out of our architecture, we knew that we should be able to match the papers’ reported accuracies.
The best detectors around
So what are the best current detectors? There are a few benchmarks out there. One of them is PASCAL VOC 2012, which hosts a leaderboard of the best detectors. Although there appears to be a lot of variety, the first six are in fact all based on a model called Faster R-CNN. The best network not based on Faster R-CNN is probably SSD512, coming in at number 7 with a mAP of 82.2%. The recently released YOLOv2 scores a respectable 73.4%, and the original YOLO comes in at 57.9%. Some unusual architectures, such as HFM_VGG16, which involves an SVM, also do well.
Lack of detectors for Tensorflow
Our deep learning framework of choice is Google’s Tensorflow. Unfortunately, we have had a hard time finding good public models for object detection. Many papers provide pre-trained models in caffe format, but Tensorflow models appear to be few and far between. There are some implementations of Faster R-CNN, but they involve custom layers written in C++ — something we would rather avoid. The original YOLO has a Tensorflow port, but it does not support training. Single Shot Multibox (SSD512 from the leaderboard) has a Tensorflow port, too, but provides only a visual evaluation which looks somewhat underwhelming. In the words of the author, “the results are okay but not good enough.” So, instead of adapting a public model, we decided to implement a detector from scratch.
Choosing a detector
Which one to choose? Given the benchmarks, Faster R-CNN might look like the natural choice. Until recently, however, it was hard to make Faster R-CNN’s ROI Pooling layer work in Tensorflow — hence the custom C++ layers in the github repositories. In a talk from October 2016, Google hinted that new Tensorflow ops may have fixed this, but we only discovered this recently.
From the VOC2012 benchmark, the next-best model seems to be SSD512 — Single Shot Multibox. Although slightly less accurate than Faster R-CNN, it has some advantages, too. SSD512 runs at 22fps on an NVIDIA Titan X, compared to just 7fps for Faster R-CNN. An only slightly less accurate variant, SSD300, runs at 59fps.
In the end, we chose SSD because of its conceptual simplicity, because it is fast, and because it is competitive with the best detectors available.
How does it work?
In the following, I will give an overview of how SSD works. For the full details, please have a look at the paper.
The basic idea
The image above, taken from the paper, illustrates how SSD works. SSD matches objects with default boxes of different aspect ratios (shown as dashed rectangles in the image). Each element of the feature map has a number of default boxes associated with it. Any default box with an IOU of 0.5 or greater with a ground truth box is considered a match. Two of the 8×8 boxes are matched with the cat (shown in blue), and one of the 4×4 boxes is matched with the dog (shown in red). It is important to note that the boxes in the 8×8 feature map are smaller than those in the 4×4 feature map: SSD has six feature maps in total, each responsible for a different scale of objects, allowing it to identify objects across a large range of scales.
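The matching step can be sketched as follows (corner-format boxes and the function names are illustrative, not our actual implementation):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def match_default_boxes(default_boxes, ground_truth, threshold=0.5):
    """Return (default_index, gt_index) pairs whose IOU meets the threshold."""
    return [(i, j)
            for i, d in enumerate(default_boxes)
            for j, g in enumerate(ground_truth)
            if iou(d, g) >= threshold]
```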
For each default box in each cell, the network outputs:
- A probability vector of length c, where c is the number of classes, representing the probabilities of the box containing an object of each class (including a background class indicating that there is no object in the box).
- An offset vector with 4 entries containing the predicted offsets required to make the default box match the underlying object’s bounding box. They are given in the format (cx, cy, w, h) – centre x, centre y, and width & height offsets, and are only meaningful if there actually is an object contained in the default box.
In the case of the image above, all probability labels would indicate the background class with exception of the three matched boxes (two for the cat, one for the dog).
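The offset encoding relative to a default box, and its inverse applied at prediction time, can be sketched like this (boxes in (cx, cy, w, h) format; the parameterisation is the usual SSD/Faster R-CNN one, but the helper names are ours):

```python
import math

def encode_offsets(default_box, gt_box):
    """Encode a ground-truth box relative to a default box, both (cx, cy, w, h).
    Centre offsets are normalised by the default box size; width and height
    are encoded as log ratios."""
    dcx, dcy, dw, dh = default_box
    gcx, gcy, gw, gh = gt_box
    return ((gcx - dcx) / dw,
            (gcy - dcy) / dh,
            math.log(gw / dw),
            math.log(gh / dh))

def decode_offsets(default_box, offsets):
    """Invert encode_offsets to recover a predicted box as (cx, cy, w, h)."""
    dcx, dcy, dw, dh = default_box
    tx, ty, tw, th = offsets
    return (dcx + tx * dw,
            dcy + ty * dh,
            dw * math.exp(tw),
            dh * math.exp(th))
```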
Although the SSD architecture can in principle be used with any deep network base model, the one used in the paper is VGG16. There are two versions of SSD: SSD300 and SSD512, SSD300 taking images of size 300×300, and SSD512 images of size 512×512. We chose to implement SSD300. The full architecture of SSD300 looks like this (again, the graphic is taken from the paper):
The main thing to note is that each image is first fed through VGG-16, after which several additional convolutional layers are added, producing feature maps of different sizes: 19×19, 10×10, 5×5, 3×3, and 1×1. These, together with the 38×38 feature map produced by VGG’s Conv4_3 layer, are the feature maps which will be used to predict bounding boxes as described in the previous section.
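Summing over these feature maps gives the total number of default boxes per image. With the per-cell box counts from the paper's SSD300 configuration, the count works out as follows:

```python
# Feature map sizes and default boxes per cell for SSD300, per the paper.
feature_maps = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]

total = sum(size * size * boxes for size, boxes in feature_maps)
print(total)  # 8732 default boxes in total
```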
As mentioned before, each layer specialises in detecting objects of a certain scale. The fine 38×38 grid produced by the Conv4_3 layer, whose grid cells are very close together, is responsible for the smallest objects, those taking up around one tenth of the size of the image. At the other extreme, the single 1×1 grid produced by Conv11_2 reacts to objects which take up essentially the entire image. The other layers cover the sizes in between.
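In the paper, the default box scales grow linearly from 0.2 to 0.9 of the image size across the feature maps (Conv4_3's scale of 0.1 is set separately). A sketch of that schedule, with the function name our own:

```python
def default_box_scales(num_maps=6, s_min=0.2, s_max=0.9):
    """Linearly spaced scales s_k = s_min + (s_max - s_min) * (k - 1) / (m - 1)
    for k = 1..m, as described in the SSD paper."""
    m = num_maps
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]

print(default_box_scales())  # six scales, from 0.2 up to 0.9
```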
Building SSD in Tensorflow
Implementing the network in Tensorflow was quite challenging, and we are not quite there yet. The first difficulty is that while the standard version of VGG is readily available (from here for example), the paper uses a variant which replaces the fully-connected layers with convolutional layers. This model, as far as we could tell, is currently only available in caffe. Instead of attempting to convert this model to Tensorflow, we chose to try using standard VGG first. This means that we are training the Conv6 and Conv7 layers from scratch, which could hurt performance.
Initially, we had a lot of trouble getting the network to converge, despite doing our best to copy the initialisations recommended by the paper. Our first attempts would achieve mAPs of around 10% – very underwhelming! Adding batch normalisation to the new layers helped a lot, and it is our only major change to the architecture. It also made the model train much more quickly. With this fix, we achieved a mAP of 45%. This was already considerably better than our previous model, but still far worse than the paper: we are training using PASCAL VOC 2007 + 2012 trainval and the 300×300 SSD variant, which should be able to achieve 74.3% according to the paper.
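For reference, the operation we inserted after the new convolutional layers normalises each batch of activations to zero mean and unit variance before scaling and shifting. A minimal numpy sketch of training-time batch normalisation (gamma and beta stand in for the learned parameters; real implementations also track running statistics for inference):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise activations across the batch dimension, then scale and shift.
    x: array of shape (batch, features)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta
```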
Our current model and next steps
Implementing most of the data augmentation techniques described in the paper has brought our score up to 58%. At this point, even though the score is still not at the level of the paper, the detector performs quite well, so we decided to update the demo. We are hopeful that we will be able to close the gap with the paper soon; the most promising idea currently is to replace the base model with the fully-convolutional VGG used by the authors, by converting it from caffe.
We are happy to have succeeded in building and training a working version of Single Shot Multibox in Tensorflow and to make the demo available to everyone. With a mAP of 58%, it performs quite well (please try it for yourself!), although there is still room for improvement.