We recently updated our pedestrian detection demo to do multiclass detection using Single Shot MultiBox. We also received quite a few questions about our previous detector, which we described in an earlier article. In this blog post, we answer those questions.
I saw your website on Pedestrian Detection using Tensorflow and Inception and I really liked the work. I was trying to replicate it myself, but I am confused on how to train the model. Do you train the classifier by itself and then tack on the 1×1 convolution or did you train the entire model with the 1×1 convolution using a dataset that includes the location of people in the image such as INRIA or Caltech?
This is a very good question! As mentioned in the pedestrian detection article, we replace the final average pooling layer in Google’s Inception v3 with a 1×1 convolution. There are two ways we could go about doing this:
- Retrain Inception’s final layer, just as in Google’s tutorial, on images of pedestrians. Then use the weights of the retrained final layer in the 1×1 convolution.
- Use the bounding boxes provided in datasets such as INRIA and Caltech to label each grid cell as either 1 or 0 (containing pedestrian or not containing pedestrian), and use these to train the 1×1 convolution.
Jonathan’s question is about which route we took. We chose the second option for our model. The first option could be interesting to try, however, especially because it would in theory allow the model to be trained without bounding box annotations.
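To illustrate why the first option is plausible: a retrained final layer is a dense layer acting on a single 2048-dimensional vector, and a 1×1 convolution applies that same dense layer independently at every grid cell. The following is a minimal NumPy sketch, not the actual TensorFlow implementation. Random numbers stand in for real features and weights, and the names `features`, `W` and `b` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for real data: a 13x18 grid of 2048-d Inception features,
# and the weights/bias of a retrained 2-class final layer (hypothetical values).
features = rng.normal(size=(13, 18, 2048))
W = rng.normal(size=(2048, 2)) * 0.01
b = rng.normal(size=(2,)) * 0.01

# Classifier path: average-pool over the grid, then apply the dense layer.
pooled = features.mean(axis=(0, 1))      # shape (2048,)
image_logits = pooled @ W + b            # shape (2,)

# Detector path: reuse the same weights as a 1x1 convolution,
# i.e. apply the dense layer at every grid cell.
cell_logits = features @ W + b           # shape (13, 18, 2)

# Because the layer is linear, the mean of the per-cell logits
# equals the logits computed from the pooled feature vector.
assert np.allclose(cell_logits.mean(axis=(0, 1)), image_logits)
```

The final check only shows that the two paths are consistent before the softmax. Whether weights trained on whole-image labels localise pedestrians well per cell is an open question that this sketch does not answer.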
I do have one question, though: for re-training on pedestrians, did you end up using a global pooling layer before the classification layer, or did you convert each input image into, say, 13 * 18 (for the 640×480 image) separate examples?
Dan’s question is quite similar to Jonathan’s. We did end up breaking the input image into 13×18 separate grid cells, and then labelled each cell as containing or not containing a pedestrian, as discussed above.
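As a rough illustration of the labelling step, the sketch below turns pedestrian bounding boxes into a 13×18 grid of 0/1 labels. The exact rule for deciding when a cell "contains" a pedestrian is not spelled out here, so the code assumes one: a cell is positive when its centre falls inside a bounding box. It also assumes the grid divides the image evenly, which is only an approximation of how Inception's feature cells map back to pixels. The function name `label_grid` and the example box are hypothetical.

```python
import numpy as np

def label_grid(boxes, image_w=640, image_h=480, rows=13, cols=18):
    """Return a (rows, cols) array of 0/1 labels.

    boxes: iterable of (x_min, y_min, x_max, y_max) in pixels.
    Assumed rule (not taken from the article): a cell is positive when
    its centre lies inside at least one pedestrian bounding box.
    """
    cell_w = image_w / cols
    cell_h = image_h / rows
    # Pixel coordinates of every cell centre.
    cx = (np.arange(cols) + 0.5) * cell_w    # shape (cols,)
    cy = (np.arange(rows) + 0.5) * cell_h    # shape (rows,)
    labels = np.zeros((rows, cols), dtype=np.int64)
    for x_min, y_min, x_max, y_max in boxes:
        inside_x = (cx >= x_min) & (cx <= x_max)
        inside_y = (cy >= y_min) & (cy <= y_max)
        # Mark cells whose centre is inside the box in both directions.
        labels[inside_y[:, None] & inside_x[None, :]] = 1
    return labels

# One hypothetical pedestrian box in a 640x480 image.
labels = label_grid([(100, 120, 180, 400)])
```

Other rules, such as thresholding the overlap between each cell and the box, would fit the same interface and may work better for small or partially visible pedestrians.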
As you pointed out, the Inception model renders a gridded feature vector for the image prior to pooling. I have been able to access it, but I was wondering how you performed the next step: how were you able to get that tensor and forward it to the directly connected layer without doing the pooling?
Antonio, if you feed an image through Inception, you will end up with the gridded feature vector just as you say. For a 640×480 input, this feature grid will be 13×18×2048. To do object localisation, we then use a 1×1 convolution, which maps it to a 13×18×2 grid. By labelling the grid cells as containing or not containing a pedestrian, we train the weights of this 1×1 convolution. So we don’t actually use the old fully connected layer; instead, we replace it with this new 1×1 convolution, which handles the grid just fine.
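The shapes and the training signal can be sketched in a few lines. This is a minimal NumPy illustration under stated assumptions, not the actual TensorFlow code: random numbers stand in for the Inception features and the grid labels, the loss is assumed to be a per-cell softmax cross-entropy, and a single plain gradient-descent step replaces a real optimiser. All names (`forward`, `loss_and_grads`, the learning rate) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: Inception features for one 640x480 image and its grid labels.
features = rng.normal(size=(13, 18, 2048))
labels = rng.integers(0, 2, size=(13, 18))   # 1 = pedestrian, 0 = none

# Parameters of the 1x1 convolution: 2048 input channels, 2 output classes.
W = np.zeros((2048, 2))
b = np.zeros(2)

def forward(features, W, b):
    """1x1 convolution followed by a per-cell softmax."""
    logits = features @ W + b                      # (13, 18, 2)
    logits -= logits.max(axis=-1, keepdims=True)   # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)

def loss_and_grads(features, labels, W, b):
    """Mean cross-entropy over grid cells, with gradients for W and b."""
    probs = forward(features, W, b)
    one_hot = np.eye(2)[labels]                    # (13, 18, 2)
    loss = -np.mean(np.sum(one_hot * np.log(probs + 1e-12), axis=-1))
    d_logits = (probs - one_hot) / labels.size     # (13, 18, 2)
    # Sum over the two grid axes to get a (2048, 2) weight gradient.
    grad_W = np.tensordot(features, d_logits, axes=([0, 1], [0, 1]))
    grad_b = d_logits.sum(axis=(0, 1))
    return loss, grad_W, grad_b

loss_before, grad_W, grad_b = loss_and_grads(features, labels, W, b)
W -= 0.01 * grad_W   # one gradient-descent step (assumed learning rate)
b -= 0.01 * grad_b
loss_after, _, _ = loss_and_grads(features, labels, W, b)
```

Note that only `W` and `b` are updated here; the Inception features are treated as fixed inputs, which matches the description of training the weights of the 1×1 convolution.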
We hope this answers your questions – please get in touch with any further ones and we will be happy to answer them!