Mask R-CNN is Faster R-CNN extended with a mask prediction branch so that it can perform instance segmentation, that is, segmenting each instance of each object in the image separately (3 dogs, 3 different masks). This is done by adding a single small mask network that operates on each ROI and predicts a binary mask.
Since there are multiple classes, the network actually predicts one binary mask per class for each ROI, and the mask of the predicted class is the one that is kept.
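To make the "one mask per class for each ROI" idea concrete, here is a minimal NumPy sketch of how the final instance masks could be selected from the mask branch output. The function name `select_instance_masks`, the 28x28 mask resolution, the 0.5 threshold, and the random logits are all assumptions for illustration, not the reference implementation.

```python
import numpy as np

def select_instance_masks(mask_logits, class_ids, threshold=0.5):
    """Pick, for each ROI, the mask of its predicted class and binarize it.

    mask_logits: float array, shape (num_rois, num_classes, H, W)
    class_ids:   int array, shape (num_rois,), predicted class per ROI
    """
    num_rois = mask_logits.shape[0]
    # Keep only the channel that matches each ROI's predicted class.
    chosen = mask_logits[np.arange(num_rois), class_ids]  # (num_rois, H, W)
    # Per-pixel sigmoid: each class mask is an independent binary prediction.
    probs = 1.0 / (1.0 + np.exp(-chosen))
    return probs >= threshold

# Toy example (assumed shapes): 3 ROIs (e.g. 3 dogs), 4 classes, 28x28 masks.
rng = np.random.default_rng(0)
mask_logits = rng.normal(size=(3, 4, 28, 28))
class_ids = np.array([2, 2, 2])  # all three ROIs classified as class 2
masks = select_instance_masks(mask_logits, class_ids)
print(masks.shape, masks.dtype)  # (3, 28, 28) bool
```

Each ROI ends up with its own binary mask even when several ROIs share the same class, which is what separates instance segmentation from plain semantic segmentation.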
Mask R-CNN works very well at instance segmentation. However, cluttered scenes, small objects, and occluded objects remain problematic, as they do for most detectors.
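We can also use Mask R-CNN to do pose estimation, by having the mask branch predict one mask per keypoint type (for example, left shoulder or right elbow) instead of one mask per class.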