If we compare the time spent by the region proposal algorithm with the time spent by the network itself, we see that in Fast R-CNN most of the time goes into finding regions to propose.
A solution to make the region proposal phase faster is to insert a Region Proposal Network (RPN) inside the CNN, which takes the whole image as input and outputs a set of rectangular object proposals, each with an objectness score. This is the Faster R-CNN model.
To do that, for each pixel we consider an anchor box, which is a bounding box of fixed size centered on that pixel. For each anchor box we compute an objectness score, which is the likelihood that the anchor box contains an object.
To capture objects at different scales and aspect ratios, for each pixel we use several scales and several aspect ratios (for example $1:1$, $1:2$, $2:1$), for a total of $k$ anchor boxes per pixel.
This means that if the image is of size $W \times H$, there will be a total of $W \times H \times k$ anchor boxes.
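As a rough illustration of how the anchor count grows, the sketch below builds one anchor per (scale, aspect ratio) pair at every position of a grid. The function name `generate_anchors`, the default scales and ratios, and the `(x_center, y_center, width, height)` box layout are assumptions made for this example, not values taken from the text.

```python
import numpy as np

def generate_anchors(height, width, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Return an array of shape (height * width * k, 4) holding boxes as
    (x_center, y_center, box_width, box_height), with k = len(scales) * len(ratios).
    The default scales and ratios are illustrative assumptions."""
    # One (w, h) shape per (scale, ratio) pair; the ratio is read as h / w
    # and the box area is kept at scale ** 2.
    shapes = np.array([(s / np.sqrt(r), s * np.sqrt(r))
                       for s in scales for r in ratios])           # (k, 2)
    # Centre of every position on the grid.
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    centers = np.stack([xs.ravel(), ys.ravel()], axis=1) + 0.5     # (H*W, 2)
    k = len(shapes)
    # Pair every centre with every shape.
    centers = np.repeat(centers, k, axis=0)                        # (H*W*k, 2)
    shapes = np.tile(shapes, (height * width, 1))                  # (H*W*k, 2)
    return np.concatenate([centers, shapes], axis=1)

anchors = generate_anchors(height=4, width=6)
print(anchors.shape)  # (216, 4): 4 * 6 positions, 9 anchors each
```

With 3 scales and 3 ratios, every position gets 9 anchors, so the total is the number of positions multiplied by 9.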
For boxes that contain an object, the model also uses regression to predict a box that fits the object better, using the ground truth box as the target.
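One common way to set up this regression, used in the Faster R-CNN paper, is to predict offsets relative to the anchor: a shift of the centre scaled by the anchor size, and a log-scale change of width and height. The sketch below assumes boxes stored as `(x_center, y_center, width, height)`; the helper names `encode_boxes` and `decode_boxes` are illustrative.

```python
import numpy as np

def encode_boxes(anchors, gt_boxes):
    """Regression targets that move each anchor onto its ground truth box.
    Boxes are rows of (x_center, y_center, width, height)."""
    tx = (gt_boxes[:, 0] - anchors[:, 0]) / anchors[:, 2]
    ty = (gt_boxes[:, 1] - anchors[:, 1]) / anchors[:, 3]
    tw = np.log(gt_boxes[:, 2] / anchors[:, 2])
    th = np.log(gt_boxes[:, 3] / anchors[:, 3])
    return np.stack([tx, ty, tw, th], axis=1)

def decode_boxes(anchors, deltas):
    """Apply predicted offsets to the anchors to obtain refined boxes."""
    x = anchors[:, 0] + deltas[:, 0] * anchors[:, 2]
    y = anchors[:, 1] + deltas[:, 1] * anchors[:, 3]
    w = anchors[:, 2] * np.exp(deltas[:, 2])
    h = anchors[:, 3] * np.exp(deltas[:, 3])
    return np.stack([x, y, w, h], axis=1)

anchors = np.array([[50.0, 50.0, 32.0, 32.0]])
gt = np.array([[54.0, 46.0, 40.0, 24.0]])
targets = encode_boxes(anchors, gt)        # what the regressor is trained to output
refined = decode_boxes(anchors, targets)   # a perfect prediction recovers the ground truth
```

Expressing the targets relative to the anchor keeps them in a similar numeric range for small and large boxes alike.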
After having identified the regions, the model sorts them by objectness score and keeps only the top $N$ as proposals.
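The selection step can be sketched as a sort followed by a cut. The function name `select_top_proposals` and the parameter `top_n` are hypothetical, and the number of proposals to keep is a free choice in this example, not a value from the text.

```python
import numpy as np

def select_top_proposals(boxes, scores, top_n):
    """Keep the top_n boxes with the highest objectness score."""
    order = np.argsort(-scores)[:top_n]   # indices sorted by descending score
    return boxes[order], scores[order]

boxes = np.array([[10, 10, 20, 20],
                  [30, 30, 16, 32],
                  [5, 40, 32, 16],
                  [60, 60, 8, 8]])
scores = np.array([0.20, 0.95, 0.70, 0.05])
top_boxes, top_scores = select_top_proposals(boxes, scores, top_n=2)
print(top_scores.tolist())  # [0.95, 0.7]
```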
The major advantage of this method is that the anchor boxes are evaluated with a sliding window, so the same parameters are shared across all positions of the image.
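To make the parameter sharing concrete, the sketch below scores anchors with a single weight matrix applied at every position, which is equivalent to a 1x1 convolution. This is a simplification rather than the actual RPN head, and the sizes `channels` and `k` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
channels, k = 16, 9                        # illustrative sizes, not from the text
weights = rng.normal(size=(channels, k))   # one weight matrix for the whole image
bias = np.zeros(k)

def objectness(features):
    """features: (H, W, channels) -> scores in (0, 1) of shape (H, W, k)."""
    logits = features @ weights + bias     # same weights applied at every position
    return 1.0 / (1.0 + np.exp(-logits))   # sigmoid

small = objectness(rng.normal(size=(4, 6, channels)))
large = objectness(rng.normal(size=(40, 60, channels)))
print(small.shape, large.shape, weights.size)  # (4, 6, 9) (40, 60, 9) 144
```

The number of parameters stays at 144 whatever the input size, because the window reuses the same weights at each position.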
We train the Faster R-CNN using a mixture of 4 losses:
- RPN classification (object or not object)
- RPN box coordinate regression (4 values for each anchor box)
- Final classification score (object class)
- Final box coordinates
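A minimal sketch of how the four terms might be combined is given below. It assumes binary cross-entropy for the RPN classification, cross-entropy for the final classification, smooth L1 for both box regressions, and equal weights for all terms. The function names and the equal weighting are assumptions for illustration, and details such as restricting the final box loss to non-background regions are left out.

```python
import numpy as np

def binary_cross_entropy(p, y):
    """Mean binary cross-entropy between probabilities p and 0/1 labels y."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def cross_entropy(probs, labels):
    """Mean cross-entropy; probs has shape (n, classes), labels holds class indices."""
    picked = probs[np.arange(len(labels)), labels]
    return -np.mean(np.log(np.clip(picked, 1e-7, 1.0)))

def smooth_l1(pred, target):
    """Quadratic for small errors, linear for large ones, averaged over all entries."""
    diff = np.abs(pred - target)
    return np.mean(np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5))

def faster_rcnn_loss(rpn_obj, rpn_labels, rpn_deltas, rpn_targets,
                     cls_probs, cls_labels, box_deltas, box_targets):
    pos = rpn_labels == 1   # only anchors that contain an object are regressed
    losses = {
        "rpn_cls": binary_cross_entropy(rpn_obj, rpn_labels),
        "rpn_box": smooth_l1(rpn_deltas[pos], rpn_targets[pos]) if pos.any() else 0.0,
        "final_cls": cross_entropy(cls_probs, cls_labels),
        "final_box": smooth_l1(box_deltas, box_targets),
    }
    losses["total"] = sum(losses.values())   # equal weights: an assumption
    return losses

losses = faster_rcnn_loss(
    rpn_obj=np.array([0.9, 0.1]), rpn_labels=np.array([1, 0]),
    rpn_deltas=np.zeros((2, 4)), rpn_targets=np.zeros((2, 4)),
    cls_probs=np.array([[0.1, 0.9]]), cls_labels=np.array([1]),
    box_deltas=np.zeros((1, 4)), box_targets=np.zeros((1, 4)),
)
```

In practice the terms are often weighted differently, so the plain sum here should be read as a placeholder.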