[DR005] Depth Map Prediction from a Single Image using a Multi-Scale Deep Network

Posted on June 24, 2017

In [DR003] Domain Separation Networks, they used a scale-invariant error as a reconstruction error rather than the classical L2 loss:

Today we are going to talk about the paper which first proposed this formula. 


Depth Map Prediction from a Single Image using a Multi-Scale Deep Network 


The title is self-explanatory: the authors aim to recover a depth map from a single image with a multi-scale deep network. Two networks are designed, one predicting at a coarse scale and the other at a fine scale:

They state that the global (blue) network can summarize the information from the whole picture through its pooling and fully connected layers. This coarse output is then passed to the local (orange) network, which refines the prediction.

This global-local structure has been quite popular in recent years. Similar networks such as U-Net or Hourglass networks are also applied in tasks where the output is an image (e.g. segmentation, depth prediction, reconstruction, generation). What stands out in this paper is the Scale-Invariant Error they propose. The definition is:
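For reference, the definition can be read off from the first two lines of the pairwise expansion later in this post. Written out in LaTeX, with y the predicted depth map, y^* the ground truth and n the number of pixels (this is the equation referred to as (1)):

\begin{align*} D(y,y^*) &= \frac{1}{2n} \sum_{i=1}^{n}(\log y_i - \log y_i^*+\alpha(y, y^*))^2, \\ \alpha(y,y^*) &= \frac{1}{n}\sum_{i=1}^n (\log y_i^* - \log y_i) \end{align*}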

I don’t fully understand how they came up with the strange term \alpha(y,y^*). But without that term, equation (1) becomes:

\begin{align*} D(y,y^*) &= \frac{1}{2n} \sum_{i=1}^{n}(\log y_i - \log y_i^*)^2 \end{align*}

which changes whenever y is rescaled. To ease the confusion, the paper provides two other ways to interpret this loss. The first is to expand it into a pixel-pairwise loss:

\begin{align*} D(y,y^*) &= \frac{1}{2n} \sum_{i=1}^{n}(\log y_i - \log y_i^*+\alpha(y, y^*))^2 \\ &= \frac{1}{2n} \sum_{i=1}^{n}(\log y_i - \log y_i^*+\frac{1}{n}\sum_{j=1}^n (\log y_j^* - \log y_j))^2 \\ &= \frac{1}{2n^2} \sum_{i=1}^n \sum_{j=1}^n ((\log y_i - \log y_j) - (\log y_i^* - \log y_j^*))^2 \\ &= \frac{1}{2n^2} \sum_{i=1}^n \sum_{j=1}^n (\log \frac{y_i}{y_j} - \log \frac{y_i^*}{y_j^*})^2 \\ \end{align*}

That is to say, the scale relationship (the log ratio) between each pair of pixels in the generated image should remain the same as in the original image. Clearly, this loss is invariant to the scale of the generated image.
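The scale-invariance claim is easy to check numerically. The following is only a minimal NumPy sketch, not code from the paper: the function names and the random test data are made up for illustration. It compares the plain log-space L2 error, the form with the \alpha term, and the pixel-pairwise form when the prediction is multiplied by a constant.

```python
import numpy as np

def log_l2(y, y_star):
    """Plain L2 error in log space: 1/(2n) * sum_i d_i^2."""
    d = np.log(y) - np.log(y_star)
    return np.sum(d ** 2) / (2 * d.size)

def scale_invariant(y, y_star):
    """Form with the alpha term: 1/(2n) * sum_i (d_i + alpha)^2."""
    d = np.log(y) - np.log(y_star)
    alpha = np.mean(-d)  # alpha(y, y*) = mean(log y* - log y)
    return np.sum((d + alpha) ** 2) / (2 * d.size)

def pairwise(y, y_star):
    """Pairwise form: 1/(2n^2) * sum_ij (log(y_i/y_j) - log(y*_i/y*_j))^2."""
    ly, ls = np.log(y), np.log(y_star)
    diff = (ly[:, None] - ly[None, :]) - (ls[:, None] - ls[None, :])
    return np.sum(diff ** 2) / (2 * ly.size ** 2)

rng = np.random.default_rng(0)
y_star = rng.uniform(1.0, 10.0, size=50)      # made-up "ground truth" depths
y = y_star * rng.uniform(0.8, 1.2, size=50)   # made-up noisy prediction

# Rescaling the prediction changes the plain log-L2 error ...
plain_changes = not np.isclose(log_l2(y, y_star), log_l2(3.0 * y, y_star))
# ... but leaves both scale-invariant forms unchanged.
si_unchanged = np.isclose(scale_invariant(y, y_star), scale_invariant(3.0 * y, y_star))
pw_unchanged = np.isclose(pairwise(y, y_star), pairwise(3.0 * y, y_star))
print(plain_changes, si_unchanged, pw_unchanged)

# Ratio between the pairwise form and the alpha form.
print(pairwise(y, y_star) / scale_invariant(y, y_star))
```

All three flags should come out true. As a side observation, the ratio printed at the end is 2 rather than 1, so with the constants exactly as written, the \frac{1}{2n} form and the \frac{1}{2n^2} pairwise form appear to differ by a constant factor of 2. This does not affect scale invariance or the minimizer, but it is worth keeping in mind when comparing absolute loss values.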

The second interpretation is:

\begin{align*} D(y,y^*) &= \frac{1}{2n^2} \sum_{i=1}^n \sum_{j=1}^n ((\log y_i - \log y_i^*) - (\log y_j - \log y_j^*))^2 \\ &= \frac{1}{2n^2} \sum_{i=1}^n \sum_{j=1}^n (d_i-d_j)^2 \\ &= \frac{1}{2n^2} \sum_{i=1}^n \sum_{j=1}^n (d_i^2-2d_id_j+d_j^2) \\ &= \frac{1}{n} \sum_{i=1}^n d_i^2 - \frac{1}{n^2} \sum_{i=1}^n\sum_{j=1}^n d_id_j \\ &= \frac{1}{n} \sum_{i=1}^n d_i^2 - \frac{1}{n^2} (\sum_{i=1}^n d_i )^2\\ \end{align*}

, where d_i = \log y_i - \log y_i^*. The first term is a squared L2 loss in log space, while the second is a term “that credits mistakes if they are in the same direction and penalizes them if they oppose. Thus, an imperfect prediction will have lower error when its mistakes are consistent with one another.” Nevertheless, I don’t understand what the “direction” actually means. I believe that this interpretation simply happens to contain an L2 term and that the additional term is just a compensation. It is hard to disentangle the second term from the equation, because if we ignore the first term, minimizing D(y, y^*) is equivalent to maximizing |\sum_i d_i| = |\log (\prod_i \frac{y_i}{y_i^*})| = |\log ( \frac{\prod_i y_i}{\prod_i y_i^*})|, which is neither scale-invariant nor order-sensitive.
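One possible reading of “direction” (my assumption, not something spelled out above) is simply the sign of d_i: whether a pixel is over- or under-estimated in log space. Under that reading, a small sketch with made-up error vectors shows the effect of the second term. Both vectors have the same L2 term, but the one whose errors all share a sign gets a lower value.

```python
import numpy as np

def variance_form(d):
    """1/n * sum_i d_i^2 - 1/n^2 * (sum_i d_i)^2, with d_i = log y_i - log y_i^*."""
    d = np.asarray(d, dtype=float)
    n = d.size
    return np.sum(d ** 2) / n - np.sum(d) ** 2 / n ** 2

# Hypothetical log-space errors with identical magnitudes.
same_sign = [0.1, 0.1, 0.1, 0.1]      # every pixel over-estimated by the same factor
mixed_sign = [0.1, -0.1, 0.1, -0.1]   # errors alternate in sign

print(variance_form(same_sign))    # close to 0: a pure global scale error is not penalized
print(variance_form(mixed_sign))   # close to 0.01: nothing is credited back
```

The first case is exactly a prediction that is wrong by one global scale factor, which ties this interpretation back to the scale invariance of the pairwise form.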


The final loss is an average of the L2 loss and the scale-invariant loss:

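This formula might be a little confusing because \lambda is multiplied by the second term directly. The relationship between the final loss and the two individual losses can be expressed as:

L(y,y^*) = \lambda \cdot D(y, y^*) + (1-\lambda) \cdot D_{MSE}(y, y^*),

where D_{MSE} is the L2 loss in log space, \frac{1}{n}\sum_i d_i^2, and D is taken in the form of the second interpretation.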
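This relationship can be checked numerically. The sketch below assumes that the final loss has the form \frac{1}{n}\sum_i d_i^2 - \frac{\lambda}{n^2}(\sum_i d_i)^2, which is what “\lambda is multiplied by the second term” suggests; the function names and the random d are hypothetical.

```python
import numpy as np

def d_mse(d):
    """L2 loss in log space: 1/n * sum_i d_i^2."""
    return np.mean(d ** 2)

def d_scale_invariant(d):
    """Scale-invariant loss in the form of the second interpretation."""
    n = d.size
    return np.mean(d ** 2) - np.sum(d) ** 2 / n ** 2

def final_loss(d, lam):
    """Assumed form of the final loss: lambda multiplies the second term only."""
    n = d.size
    return np.mean(d ** 2) - lam * np.sum(d) ** 2 / n ** 2

rng = np.random.default_rng(1)
d = rng.normal(size=100)  # made-up log-space errors

# Largest absolute gap between the two expressions over a few values of lambda.
max_gap = max(
    abs(final_loss(d, lam) - (lam * d_scale_invariant(d) + (1 - lam) * d_mse(d)))
    for lam in (0.0, 0.25, 0.5, 1.0)
)
print(max_gap)
```

The gap should be zero up to floating-point error. With \lambda = 0 the loss reduces to the plain L2 loss, with \lambda = 1 to the scale-invariant loss, and \lambda = 0.5 gives the plain average of the two.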

Although the experiments show the superiority of their method:

the depth predictions are still devoid of fine details:


[DR005] The scale-invariant loss is more practical and rational in many scenarios. A scale-invariant metric can sometimes overcome the distortion brought by illumination. Although three interpretations (including the original formula) are given, only the second one, the pixel-pairwise form, makes sense to me.

It also makes me wonder how a neural network can recognize things at different scales in color space. If the number of neurons (and the amount of data) is large enough, the network can normalize the input by enumerating all the scales in the first layer. This paper uses an unusual data augmentation method that multiplies the colors by a random RGB value c \in [0.8, 1.2]^3. A human can tell that an image looks weird when its colors are over-tinted. Can a machine do that?
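For concreteness, the color augmentation described above could look roughly like the following NumPy sketch. The function name, the H x W x 3 float image layout and the random test image are my assumptions; the paper's actual pipeline may differ in details such as value range or clipping.

```python
import numpy as np

def random_color_scale(image, rng, low=0.8, high=1.2):
    """Multiply each RGB channel of an H x W x 3 image by one random factor."""
    c = rng.uniform(low, high, size=3)  # one factor per channel, c in [low, high]^3
    # Broadcasting applies c along the last (channel) axis of the image.
    return image * c, c

rng = np.random.default_rng(0)
image = rng.uniform(0.0, 1.0, size=(4, 5, 3))  # made-up float RGB image
augmented, c = random_color_scale(image, rng)
print(c)                # the three per-channel factors
print(augmented.shape)  # same shape as the input
```

Because the three factors are drawn independently, this changes the color balance as well as the overall brightness, which is what makes the augmented images look tinted.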