This is a talk given by Geoffrey Hinton at MIT in December 2014.
It discusses four arguments against pooling or, more generally, against CNNs.
It is a bad fit to the psychology of shape perception: It does not explain why we assign intrinsic coordinate frames to objects and why they have such huge effects.
For instance, people tend to impose rectangular frames on objects, and most of the time we describe an object's orientation in terms of its long side and short side. Another conclusion is that we use a bunch of neurons, rather than a single one, to capture the pose of an object. CNNs cannot explain this.
It solves the wrong problem: We want equivariance, not invariance. Disentangling rather than discarding.
In most classification tasks, input images that share a category get exactly the same label, even though they may differ by transformations such as a change in viewpoint or illumination. Discarding that information in the prediction achieves invariance, which is fine for classification but not for really teaching a machine to perceive the world. Hinton proposed that equivariance is a better goal to pursue: (1) place-coded equivariance means that different neurons are activated when the change is significant, while (2) rate-coded equivariance means that the same neurons stay active but their pose/activation values change when the change is small.
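To make the distinction concrete, here is a minimal NumPy sketch (a toy example, not from the talk). It assumes a 1-D signal, a circular cross-correlation standing in for a convolutional layer, and global max pooling standing in for a pooling layer; `circular_conv` and `global_max_pool` are hypothetical helper names. Shifting the input shifts the feature map by the same amount (equivariance, place-coded here because a different position becomes the peak), while the pooled value does not change at all (invariance), so the shift can no longer be recovered from it.

```python
import numpy as np

def circular_conv(x, k):
    # Circular cross-correlation: out[i] = sum_j k[j] * x[(i + j) % n]
    n = len(x)
    return np.array([sum(k[j] * x[(i + j) % n] for j in range(len(k)))
                     for i in range(n)])

def global_max_pool(f):
    # Keeps only the strongest response and throws away where it occurred
    return f.max()

x = np.array([0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0])  # toy 1-D "image"
k = np.array([1.0, 2.0, 1.0])                           # toy filter
shift = 2
x_shifted = np.roll(x, shift)                           # translated input

f = circular_conv(x, k)
f_shifted = circular_conv(x_shifted, k)

# Equivariance: the feature map moves together with the input
equivariant = np.allclose(f_shifted, np.roll(f, shift))
# Invariance: the pooled value is identical for both inputs
invariant = np.isclose(global_max_pool(f), global_max_pool(f_shifted))
# Place coding: the index of the most active unit changed by the shift
peak_moved = int(np.argmax(f_shifted) - np.argmax(f))

print(equivariant, invariant, peak_moved)
```

The feature map still carries the shift (it can be read off from `peak_moved`), whereas the pooled scalar is the same for both inputs.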
It fails to use the underlying linear structure: It does not make use of the natural linear manifold that perfectly handles the largest source of variance in images.
Current neural networks try to cope with viewpoint variance by training on a large variety of images. However, this requires too much data and fails to capture the built-in relationship between different viewpoints. A better way is to transform the image into a space in which the manifold is globally linear. In computer graphics, a change of viewpoint is a linear operation. Hinton proposed an approach called “inverse graphics”, which maps the 2D image back into the desired space so that we can learn from a small amount of data and manipulate it linearly in that space.
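As a small illustration of what "linear" means here (a sketch under assumed settings, not Hinton's model), poses can be written as 2-D rigid transforms in homogeneous coordinates. The helper `pose` and all numeric values are made up for the example. A viewpoint change is then a single matrix multiplication applied to every pose, and the part-to-whole relationship recovered from the new poses is exactly the one we started with.

```python
import numpy as np

def pose(theta, tx, ty):
    # 2-D rigid transform (rotation + translation) as a 3x3 homogeneous matrix
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, tx],
                     [s,  c, ty],
                     [0.0, 0.0, 1.0]])

# Hypothetical poses: the object in camera coordinates, a part in object coordinates
object_in_camera = pose(0.3, 2.0, 1.0)
part_in_object = pose(-0.5, 0.4, 0.0)          # fixed part-whole relationship
part_in_camera = object_in_camera @ part_in_object

# A change of viewpoint acts on every pose by the same matrix multiplication
view_change = pose(1.1, -3.0, 0.5)
new_object = view_change @ object_in_camera
new_part = view_change @ part_in_camera

# The part-whole relationship recovered after the viewpoint change
recovered = np.linalg.inv(new_object) @ new_part

print(np.allclose(recovered, part_in_object))
```

The individual poses change with the viewpoint, but `recovered` matches `part_in_object`, which is the kind of built-in relationship between viewpoints that a pixel-level model has to learn from many examples.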
Pooling is a poor way to do dynamic routing: We need to route each part of the input to the neurons that know how to deal with it. Finding the best routing is equivalent to parsing the image.
Routing means directing the flow of information. We want to send information to the neuron that can best use it. In his model, routing is based on agreement, that is, on how tightly the predictions cluster. Agreement or clustering in a high-dimensional space is very unlikely to happen by accident. Hence a mapped activation that lies near the cluster center is more useful, and more likely to be processed further, than one that is an outlier.
A special score is designed to quantify that:
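The score from the talk is not reproduced in these notes. Purely as an illustration of the agreement idea, and not as Hinton's formula, the sketch below weights each vote by how close it lies to the weighted mean of all votes and repeats this a few times. The function `agreement_weights`, the `temperature` parameter, and the vote values are all assumptions made for the example.

```python
import numpy as np

def agreement_weights(votes, n_iters=3, temperature=1.0):
    """Weight each vote by its closeness to the weighted mean of all votes."""
    n = len(votes)
    weights = np.full(n, 1.0 / n)            # start with uniform routing
    for _ in range(n_iters):
        center = weights @ votes             # current cluster-center estimate
        sq_dist = ((votes - center) ** 2).sum(axis=1)
        logits = -sq_dist / temperature      # closer votes get larger logits
        logits = logits - logits.max()       # numerical stability
        weights = np.exp(logits)
        weights = weights / weights.sum()    # softmax over votes
    return weights, center

# Hypothetical 4-D votes: four that agree and two outliers
votes = np.array([
    [1.0, 2.0, 0.5, -1.0],
    [1.1, 1.9, 0.5, -1.0],
    [0.9, 2.0, 0.6, -0.9],
    [1.0, 2.1, 0.4, -1.1],
    [-3.0, 0.0, 4.0, 2.0],    # outlier
    [5.0, -2.0, -1.0, 3.0],   # outlier
])

weights, center = agreement_weights(votes)
print(np.round(weights, 3), np.round(center, 2))
```

With these made-up votes, almost all of the weight ends up on the four agreeing votes and the estimated center sits near their mean, so the outliers contribute almost nothing to what is passed on.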
I skip the part where Hinton introduces the new model. The results support what he said, but the model is too shallow and the dataset (MNIST) too simple to be conclusive. Please watch the video if you find it interesting.
[DR016]: Although it was given in 2014, it is still an inspiring talk. The concept of a “capsule” might be confusing at the beginning, but everything became clear after going through the whole lecture. The proposed model is too idealized from my perspective, but the insights are reasonable.
- Rotate the viewpoint with multiple models. When tackling person re-identification (Re-ID), one of the difficulties is that people look different in different poses (e.g., in the VIPeR dataset). We tried to learn one model that can align images from many angles but failed. As Hinton said, the transformation should be handled by more than one neuron/model. For instance, we can split the 360 degrees into multiple segments, with each model in charge of the transformation within a small range. We can apply this model in two ways: (1) align all images to one direction to simplify the matching; (2) disentangle pose information from appearance.
- Additional penalty for equivariance. If we have extra information beyond the label, we can impose an additional penalty on the feature layer to encourage equivariance. Furthermore, we can design a pooling method that satisfies equivariance by cropping different patches from the image.
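
One possible reading of the penalty idea, assuming the extra information is the known transformation between two views of the same input: penalize the gap between the features of the transformed input and the transformed features of the original input. Everything in this sketch is hypothetical (the toy point set, the two feature functions, and the name `equivariance_penalty`); it only shows that the penalty is near zero for an equivariant feature and large for an invariant one.

```python
import numpy as np

def rotation(theta):
    # 2x2 rotation matrix
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def equivariance_penalty(feat, feat_transformed, expected_map):
    """Mean squared gap between f(T x) and M f(x); zero when f is equivariant."""
    return float(np.mean((feat_transformed - expected_map @ feat) ** 2))

def centroid_feature(points):
    # Equivariant toy feature: the centroid rotates together with the points
    return points.mean(axis=0)

def radius_feature(points):
    # Invariant toy feature: the mean distance to the origin ignores rotation
    r = np.linalg.norm(points, axis=1).mean()
    return np.array([r, r])

points = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 1.0]])  # toy "image"
R = rotation(np.pi / 2)              # known transformation (the extra information)
rotated_points = points @ R.T        # the transformed view of the same input

penalty_equivariant = equivariance_penalty(
    centroid_feature(points), centroid_feature(rotated_points), R)
penalty_invariant = equivariance_penalty(
    radius_feature(points), radius_feature(rotated_points), R)

print(penalty_equivariant, penalty_invariant)
```

In training, a term like this would presumably be added to the classification loss with some weight, so that the feature layer is pushed to keep the pose information instead of discarding it.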