I think it's that the model has less of a concept of an 'object' in an image; what it learns is correlations from one pixel to the next. Because fingers and other objects can be in so many positions and show up in so many contexts, it's difficult for the model to 'segment' them out the way it can a torso or facial features, which are more stable across images.