Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials

It’s 2020 now! I am going to pick up paper reading just for fun (better than playing video games and watching TV all night lol)!

This is a paper from 2012 and my friend recommended it to me. It’s a great example of how to bridge different mathematical skills (including MAP, CRF, optimization, sampling theorem) to make a great model. Although I still cannot fully understand all the details, reading this paper can serve as a good training of how to extract important takeaways and plan the next reading list from a sophisticated paper. From another point of view, of course, the authors also make it clear enough for someone without a comprehensive mathematical background to read.

The paper focusing on pixel-wise image segmentation. For Conditional Random Field (CRF) methods, we want to build a network of many components. And we can calculate an “Energy” value of the network and by increasing/decreasing the energy, we can extract the result from the converged result. However, building a pixel-paired network is almost impossible to calculate because the number of connections (O(#pixel^2)) is so large.

Previous works tried to use some un-supervised method or hierarchy structure to cluster pixels into regions to reduce the number of components in the CRF. However, this paper proposes a way that can reduce the calculation from (O(#pixel^2) to O(#pixel)). As a result, this method can achieve a much better and detailed segmentation than previous methods.

I will be as abstract as possible about the takeaways because I don’t fully understand the details for now:

Optimize the original energy P(X) can be approximated by computing another distribution Q(X) that minimize the KL-divergence D(Q||P) such that Q(X) can be expressed by a product of independent marginals. [10]
Calculating that Q(X) can be done in an iterative way. In one of the steps, for each pixel, we need to iterate all other pixels, which takes O(#pixel^2)
We can reformulate this step as a convolution (and a minus operation) in the feature space. The convolution performs a low-pass filter. By sampling theorem, “this function can be reconstructed from a set of samples whose spacing is proportional to the standard deviation of the filter.” Therefore, we can downsample, convolute and then upsample in constant time. [16]
With the help of permutohedral lattice, high dimension (d-dim) features can be also applied to this method to achieve O(#pixel*d) time complexity rather than O(#pixel^d) with a cost of feature whitening. [1]

So based on the takeaways, if you want to develop your own pair-wise dense model and limit its linear computation complexity, you should read [16] and find an appropriate kernel for your problem as long as it can be reformulated into a special convolution. Hope I can read [16] and share more with you soon.