[DR014] Dual Path Networks - Shaofan Lai's Blog

This paper unifies ResNet and DenseNet through the lens of a higher order recurrent neural network (HORNN) and proposes a new type of network, the Dual Path Network (DPN), which the authors report performs better with less computation.

Basically, ResNet and DenseNet differ in their “wiring”. ResNet provides a path through which a layer can access both the output and the input of the immediately preceding layer. DenseNet provides paths through which a layer can access the outputs of multiple previous layers.
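To make the wiring difference concrete, here is a minimal pure-Python sketch. It is an illustration under assumptions, not code from the paper: `make_layer` is a hypothetical stand-in for a convolutional block (a random linear map followed by ReLU), and features are plain lists of floats rather than image tensors.

```python
import random

def make_layer(in_dim, out_dim, rng):
    """Hypothetical stand-in for a conv block: random linear map + ReLU."""
    w = [[rng.gauss(0.0, 0.1) for _ in range(out_dim)] for _ in range(in_dim)]

    def layer(x):
        return [max(0.0, sum(x[i] * w[i][j] for i in range(in_dim)))
                for j in range(out_dim)]

    return layer

def resnet_forward(x, depth, rng):
    # Residual wiring: a layer's output is ADDED to its input,
    # so the feature width never changes.
    for _ in range(depth):
        out = make_layer(len(x), len(x), rng)(x)
        x = [a + b for a, b in zip(x, out)]
    return x

def densenet_forward(x, depth, growth, rng):
    # Dense wiring: a layer reads the CONCATENATION of the input and all
    # earlier outputs, so the width grows by `growth` per layer.
    features = list(x)
    for _ in range(depth):
        out = make_layer(len(features), growth, rng)(features)
        features = features + out  # list concatenation
    return features

rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(8)]
res_out = resnet_forward(x, depth=3, rng=rng)
dense_out = densenet_forward(x, depth=3, growth=4, rng=rng)
print(len(res_out), len(dense_out))  # 8 20
```

In the dense version the original input survives untouched at the front of `dense_out`, while in the residual version it is folded into a running sum.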

If we share the weights across layers, we can fold the network into a higher order recurrent neural network (HORNN), in which an input is processed over multiple steps to produce the output.

In the following discussion, $f_t^k(\cdot)$ denotes the feature extraction function and $g^k(\cdot)$ denotes a transformation function that takes information from multiple places and outputs the current state.
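For reference, one step of this general form can be written roughly as below. This is a sketch as I read the paper: $h^t$ (the hidden state at step $t$, with $h^0$ the input) is a symbol assumed here, and the paper's exact index conventions may differ.

$$
h^k = g^k\left[\sum_{t=0}^{k-1} f_t^k\left(h^t\right)\right]
$$

DenseNet fits this form directly when each step $k$ has its own $f_t^k(\cdot)$ for every earlier state $h^t$.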

As for ResNet, the situation is a little bit different.
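The difference can be sketched as follows, assuming the general form $h^k = g^k\left[\sum_{t=0}^{k-1} f_t^k(h^t)\right]$ with $h^t$ the hidden state at step $t$ (notation assumed here). If $f_t^k$ does not depend on $k$, write it as $f_t$, and the sum becomes a running state $r^k$ (also a symbol introduced for this sketch):

$$
r^k = \sum_{t=0}^{k-1} f_t\left(h^t\right) = r^{k-1} + f_{k-1}\left(h^{k-1}\right), \qquad h^k = g^k\left(r^k\right)
$$

Substituting $h^{k-1} = g^{k-1}(r^{k-1})$ gives $r^k = r^{k-1} + f_{k-1}\left(g^{k-1}(r^{k-1})\right)$, which is the familiar residual update: the previous state plus a function of it.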

In the formulation of ResNet, the feature extraction function $f_t^k(\cdot)$ is shared across all steps (that is, it does not depend on $k$), which lets features be reused repeatedly. In contrast, DenseNet employs a different $f_t^k(\cdot)$ at each step, which encourages the network to explore new feature representations. The paper gives a very comprehensive analysis of these two networks.

To balance these two approaches, the authors propose a new model called the Dual Path Network (DPN).
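In the notation of this post, the dual path formulation can be sketched as below. This is my transcription under assumed notation: $h^t$ is the hidden state at step $t$, $f_t^k$ is a per-step (DenseNet-like) extractor, and $v_t$ is a shared (ResNet-like) extractor; the index ranges are an assumption.

$$
\begin{aligned}
x^k &= \sum_{t=0}^{k-1} f_t^k\left(h^t\right) \\
y^k &= \sum_{t=0}^{k-1} v_t\left(h^t\right) = y^{k-1} + v_{k-1}\left(h^{k-1}\right) \\
r^k &= x^k + y^k \\
h^k &= g^k\left(r^k\right)
\end{aligned}
$$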

In short, the hidden state is decomposed into two components: $x^k$, which resembles DenseNet, and $y^k$, which resembles ResNet. In practice, they use ResNet as the backbone and attach a DenseNet branch to it. More specifically, concatenation is used instead of addition when merging the two kinds of features. The two paths are entangled, and the tradeoff can be adjusted by changing the dimensions of $x^k$ and $y^k$. Adding a DenseNet branch to ResNet brings in some degree of exploration, while limiting the dimension of that branch avoids unnecessary computation.
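A rough pure-Python sketch of one way such a block could be wired is shown below. It is an illustration under assumptions rather than the authors' implementation: `make_layer` is a hypothetical stand-in for a convolutional block, `res` plays the role of the ResNet-like state, `dense` the DenseNet-like state, and splitting the block output into an "add" part and a "concatenate" part is my reading of the design.

```python
import random

def make_layer(in_dim, out_dim, rng):
    """Hypothetical stand-in for a conv block: random linear map + ReLU."""
    w = [[rng.gauss(0.0, 0.1) for _ in range(out_dim)] for _ in range(in_dim)]

    def layer(x):
        return [max(0.0, sum(x[i] * w[i][j] for i in range(in_dim)))
                for j in range(out_dim)]

    return layer

def dual_path_forward(res, dense, depth, growth, rng):
    """res: fixed-width residual state; dense: growing dense state."""
    for _ in range(depth):
        inp = res + dense  # list concatenation: the block reads both paths
        out = make_layer(len(inp), len(res) + growth, rng)(inp)
        to_res, to_dense = out[:len(res)], out[len(res):]
        res = [a + b for a, b in zip(res, to_res)]  # residual path: add
        dense = dense + to_dense                    # dense path: concatenate
    return res, dense

rng = random.Random(0)
res0 = [rng.gauss(0.0, 1.0) for _ in range(8)]
dense0 = [rng.gauss(0.0, 1.0) for _ in range(2)]
res, dense = dual_path_forward(res0, dense0, depth=3, growth=4, rng=rng)
print(len(res), len(dense))  # 8 14
```

Here the width of `res` and the value of `growth` stand in for the dimensions of $y^k$ and $x^k$: a small `growth` keeps the dense branch cheap, while a larger one gives more room for new features.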

[DR014]: This article tries to unify DenseNet and ResNet through the lens of HORNN. Finding a proper balance between two models can often boost the final model. Even a simple method like ensembling the decisions of multiple models has proved effective. This paper skillfully mixes the two kinds of models in the middle of every block, which allows the neural network to learn how to best utilize both.

Although both DenseNet and ResNet can be explained by the HORNN, the authors give a relatively intuitive conclusion that ResNet re-uses features while DenseNet explores new ones. However, with multiple stacked ResBlocks, ResNet can actually explore new features as well.

My explanation for why DenseNet might outperform ResNet is that each DenseNet layer has access to multiple previous layers. Many models fall into a similar class, which uses multiple paths to process the raw input and learns to combine them:

Multi-scale models in computer vision, which combine features from different scales.
U-Net, a special type of autoencoder in which the decoder takes as input both the immediately preceding feature and the feature from the corresponding encoder layer.
DenseNet, in which each layer takes multiple previous features as input.

Wiring across layers lets information flow to later layers without being altered by the layers immediately following. The network's DAG, however, becomes more complicated. Maybe a measure could be designed to quantify this complexity.