[DR018] iLab-20M: A large-scale controlled object dataset to investigate deep learning

This work studies how tolerant CNN features are to image variations. To that end, the authors introduced a new dataset in which every object instance was captured under controlled settings: a turntable with 8 rotation angles, 11 cameras on a semicircular arch, 4 lighting sources (generating 5 lighting conditions), 3 focus values, and random backgrounds (overall 8×11×5×3 = 1320 images for each instance per background). Extensive experiments are conducted to answer:
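As a quick check of the image count above, the capture settings can be enumerated as a grid. This is a minimal sketch: the counts come from the description, while the integer ranges and the function name `capture_settings` are placeholders, not the dataset's actual labels.

```python
from itertools import product

# Counts taken from the description above; the integer ranges are
# placeholder labels, not the dataset's real parameter values.
ROTATIONS = range(8)    # turntable rotation angles
CAMERAS = range(11)     # cameras on the semicircular arch
LIGHTINGS = range(5)    # lighting conditions produced by 4 light sources
FOCUSES = range(3)      # focus values


def capture_settings():
    """Return every (rotation, camera, lighting, focus) combination."""
    return list(product(ROTATIONS, CAMERAS, LIGHTINGS, FOCUSES))


settings = capture_settings()
print(len(settings))  # 1320 images per instance per background
```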

Can a pre-trained CNN model predict the setting parameters such as lighting source, the degree of azimuthal rotation, the degree of camera elevation, etc?
Can it transfer the learned knowledge from one object category to another?
Which parameters are more important in the transfer?
How much knowledge can a model transfer from iLab-20M to the ImageNet?
Which one is a better strategy to make an object dataset: random or systematic image harvesting?
How does the order of learning parameter invariance influence overall network parameter tolerance and accuracy?

The dataset is quite comprehensive compared to previous efforts.

If you are also interested in understanding how well CNNs generalize across different capture settings, you should check this dataset and this paper for related work. I am going to skip the details of iLab-20M and briefly go through the experiments. In these experiments, AlexNet was used to extract features, and two key layers, pool5 and fc7, are mentioned frequently.

1> Selectivity and invariance

In this experiment, they used the features from pool5 and fc7 separately to predict both the object category and the setting parameters, with an SVM as the classifier. Figure 2 shows that both sets of features are good at classification, but the pool5 features are clearly more useful for parameter prediction.

As they stated,

This is consistent with the work by Bakry et al. [2] where they analytically found that fully connected layers make effort to collapse the low-dimensional intrinsic parameter manifolds to achieve invariant representations. … These figures suggest that camera view (considering the normalized-to-chance accuracy) has the most complex structure for parameter prediction whereas the lighting is simpler. This is somewhat sensible since changing camera view leads to geometric shape variations, and ports the prediction task into a much more difficult problem to address. In contrast, lighting variations do not alter the shape of the object, and are thus easier to capture.

2> Knowledge transfer

In this experiment, they wanted to evaluate how well the CNN generalizes from seen categories to unseen categories. They trained the model on four object classes and tested it separately on the same classes and on different classes.

The result is reasonable: lighting is independent of shape and object, so no degradation is observed during transfer. Rotation angle and camera view, on the other hand, depend on the shape of the object, so performance degrades to some degree.

3> Systematic and random sampling

This experiment was set up to answer why randomly collecting images from the wild can yield a dataset good enough to train a complex neural network. They proposed two sampling strategies to validate the effectiveness of randomness:

Random strategy where n samples (across all parameters and instances) are chosen randomly and are used to train an SVM to predict the object category.
Systematic (or exhaustive) strategy, in which an object instance is chosen randomly and then other images from that object are added to our training set, by scanning all parameters, until n samples are reached.

Figure 5 shows that random sampling outperforms systematic sampling not only on classification but also on parameter prediction. This is quite reasonable because systematic sampling may not cover enough distinct instances, especially when the number of samples is small.
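The two strategies can be sketched on a toy image pool to show why systematic sampling covers few instances when n is small. This is an illustration under assumptions, not the paper's code: the image records, the pool of 10 instances, and the rule of moving on to another random instance once one is exhausted are all hypothetical.

```python
import random


def random_sampling(images, n, rng):
    """Pick n images at random across all instances and parameters."""
    return rng.sample(images, n)


def systematic_sampling(images, n, rng):
    """Pick a random instance and scan all of its images, then continue
    with another random instance, until n samples are collected.
    (Moving on to the next instance is an assumption of this sketch.)"""
    by_instance = {}
    for img in images:
        by_instance.setdefault(img["instance"], []).append(img)
    order = list(by_instance)
    rng.shuffle(order)
    chosen = []
    for inst in order:
        for img in by_instance[inst]:
            if len(chosen) == n:
                return chosen
            chosen.append(img)
    return chosen


# Toy pool: 10 hypothetical instances with 1320 capture settings each.
images = [{"instance": i, "setting": s} for i in range(10) for s in range(1320)]
rng = random.Random(0)
n = 500
rand = random_sampling(images, n, rng)
syst = systematic_sampling(images, n, rng)
print(len({img["instance"] for img in rand}))  # many distinct instances
print(len({img["instance"] for img in syst}))  # 1, because 500 < 1320
```

With n smaller than the number of images per instance, the systematic set never leaves its first instance, which matches the intuition given above.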

4> Domain adaptation

Augmenting data along with all parameters:

This experiment tests the adaptation between iLab-20M and ImageNet. Two scenarios were considered: i) a binary classification problem, boat vs. tank, and ii) a 4-class problem including boat, tank, bus, and train.

Table 4 shows that performance is low when the test set comes from a different dataset than the training set. They concluded that this is mainly because objects in the two datasets have different textures and statistics, which demand more sophisticated ways of domain adaptation. It is counter-intuitive that the pretrained model performs worse than the baseline.

Augmenting data along a single parameter:

This experiment is quite complicated, so I quote the original setup here to avoid confusion:

Here, we investigate which parameter is more effective in domain adaptation (from synthetic to natural images.). Two categories, existing in both datasets, are considered: boat and tank. To form a training set, we vary only one parameter at a time while keeping all others fixed. Then, fc7 features are computed for the training set and a linear SVM is trained. The same features are computed for natural images and the learned model on synthetic samples is tested on them. For each parameter, we had 275 synthetic images for training and a fixed set of 3,000 images from ImageNet for testing. In a complementary experiment, all parameters were allowed to vary except one (opposite of the above). A set of 2,000 samples were randomly selected (complying with the conditions) and a linear SVM was trained on them (using fc7). The parameter whose absence drops the accuracy more is considered to be more dominant. 5-fold cross validation accuracies are reported in Fig. 7.
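The two selection rules in the quoted setup ("vary only one parameter" and "vary all parameters except one") can be sketched as filters over the settings grid. This shows only the selection logic under assumptions: the parameter ranges are placeholders, the fixed value 0 is arbitrary, and the quoted passage does not say how the 275 images per parameter were assembled, so that number is not reproduced here.

```python
from itertools import product

# Placeholder value ranges; the counts follow the dataset description.
PARAMS = {
    "rotation": range(8),
    "camera": range(11),
    "lighting": range(5),
    "focus": range(3),
}


def vary_only(name, fixed_value=0):
    """Settings where only `name` changes and all other parameters are fixed."""
    axes = [PARAMS[p] if p == name else [fixed_value] for p in PARAMS]
    return [dict(zip(PARAMS, combo)) for combo in product(*axes)]


def vary_all_except(name, fixed_value=0):
    """Settings where every parameter changes except `name`, which is fixed."""
    axes = [[fixed_value] if p == name else PARAMS[p] for p in PARAMS]
    return [dict(zip(PARAMS, combo)) for combo in product(*axes)]


print(len(vary_only("camera")))        # 11
print(len(vary_all_except("camera")))  # 120 (8 x 5 x 3)
```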

From Fig. 7, the camera view is the most important parameter because it leads to the highest accuracy. Although the margin gained by varying the camera view is small, they argued that this is reasonable since real-world objects are often viewed from different elevation angles, and speculated that camera view might be the dominant varying parameter in natural scenes.

5> Analysis of parameter learning order

This is an interesting experiment that I had never thought of. They analyzed how the order in which parameters are learned affects training. The images are labeled with the rotation label and the camera-view label separately to generate two labeled datasets. They then trained on one first and fine-tuned on the other. They denoted the two orders as rotation-camera and camera-rotation.

From table 5, we can see that the order of data delivery is important. As they concluded:

… camera view variation is a more ill-structured parameter to predict. When the network sees the camera labels in the second stage, the adapted weights are more biased towards learning this parameter. This bias does also try to keep the pre-seen knowledge for rotation unchanged. We thus conclude that when there is the option for stage-wise training, it would be better to learn parameters following a simple to complex order. This way, the last steps are devoted to manage the difficulties in complex parameters, while imposing less damage to weights adapted for simpler parameters (thus maintaining the structure).

[DR018]: There are many things to discuss about this article:

In the first experiment (Selectivity and invariance), they only showed the accuracy. However, I am more interested in the confusion matrix. Are there some rotation angles that are harder to tell apart than others? For instance, it is easy to tell the left side from the right side of a car based on the shape, but it is relatively harder to tell the front from the rear because there is less detail to distinguish them. The same kind of analysis can be done on camera view or lighting.
In the second experiment (Knowledge transfer), I have a question about the rotation experiment. The rotation angle is relative rather than absolute. Predicting the rotation of an unseen object without defining its front is questionable. Thinking outside the box, maybe we should predict the rotation angle between two images rather than judging from one image. There are a few advantages to doing that:
1. Circular label. If we output the rotation angle from one image by regression, so that the output is a real number, we lose the adjacency between 0 degrees and 359 degrees. If we output the rotation angle by classification, we cut off the relationship between every pair of neighboring angles: you could shuffle all the angle bins and the classifier would perform the same. However, if a relative angle system is introduced, we can alleviate this problem by taking two images as input.
2. Generalizing rotation. The way an object's appearance changes under rotation depends on the starting angle. Teaching the network rotations from various starting points may generalize better than recognizing the appearance of the object at different angles.
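The circular-label issue can be made concrete with a small sketch. The helper names `relative_rotation` and `circular_distance` are hypothetical and only illustrate the idea of measuring the rotation between two views on the circle. They are not part of the paper.

```python
def relative_rotation(angle_a, angle_b):
    """Signed rotation in degrees from angle_a to angle_b, in [-180, 180)."""
    return (angle_b - angle_a + 180) % 360 - 180


def circular_distance(angle_a, angle_b):
    """Smallest absolute angle in degrees between two orientations."""
    return abs(relative_rotation(angle_a, angle_b))


print(abs(0 - 359))                 # 359: error seen by naive regression
print(circular_distance(0, 359))    # 1: actual distance on the circle
print(relative_rotation(350, 10))   # 20: rotation between two views
```

A target defined this way depends only on the pair of views, so it does not require agreeing on where the front of an unseen object is.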
In the third experiment (Systematic and random sampling), I think uniform sampling should also be compared. Uniform sampling takes samples at evenly spaced parameter values. For instance, if only 4 samples are taken, we should take images at 0, 90, 180, and 270 degrees, while the systematic strategy would take images at 0, 1, 2, and 3 degrees. I believe uniform sampling might perform at least as well as the random strategy since it also covers the whole range of situations.
In the fourth experiment (Domain adaptation), I think the drop in performance after pretraining is unacceptable. Classical adaptation methods, such as MMD (maximum mean discrepancy), should be applied to reduce the domain shift.
In the fifth experiment (Analysis of parameter learning order), note that in common settings the same samples are seen multiple times over the course of training. They should also report results for a multi-task-like setting, where the two kinds of data are mixed together during training. If the performance is as good as in the rotation-camera setting, then their conclusion might be fragile. There is actually a sub-field that studies the accumulation of knowledge, called “Incremental Learning”. Previous knowledge should be preserved when learning new knowledge, which is at odds with their fine-tuning method.
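For the domain adaptation point above, MMD measures how far apart two feature distributions are, and adaptation methods built on it try to make that quantity small. Below is a minimal sketch of the biased squared-MMD estimate with a Gaussian kernel. The toy 2-D vectors and the `gamma` value are made up for illustration and stand in for features such as fc7 activations.

```python
import math


def rbf_kernel(x, y, gamma):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mean_kernel(xs, ys, gamma):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(source, target, gamma=1.0):
    """Biased estimate of squared MMD between two sets of feature vectors."""
    return (mean_kernel(source, source, gamma)
            + mean_kernel(target, target, gamma)
            - 2 * mean_kernel(source, target, gamma))


# Toy 2-D "features"; the values are invented for this example.
synthetic = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3)]
natural = [(1.0, 1.2), (0.9, 1.1), (1.3, 1.0)]
print(mmd_squared(synthetic, synthetic))    # 0.0 for identical sets
print(mmd_squared(synthetic, natural) > 0)  # True: the two sets differ
```

In an adaptation method this quantity would typically be added to the training loss so that synthetic and natural features are pulled closer together. That training loop is outside the scope of this sketch.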

All in all, this is an innovative and comprehensive attempt to study how a CNN handles different kinds of variation. More research can certainly build on this work. Off the top of my head, I want to propose a model that might be described as “continuous cognition”. I don’t know whether it is biologically plausible, but it might sound interesting: rather than using a single model to align the rotated object, we might use multiple models that work on different stages. For instance, when rotating a car, the 2D projection changes most rapidly as we move from the flank to the front (or back). However, the projection of a rotated car remains relatively consistent as long as we are facing the flank or the front. That is to say, the change of information is not uniform during the rotation, and we should allocate our resources more wisely. Using a single model to align objects from all angles may not be the optimal solution.