[023] Deep Variational Information Bottleneck

Posted on March 28, 2018

I came across this paper in the references of Machine Theory of Mind, which uses a Deep Variational Information Bottleneck to compress its representation of an agent's mental state in order to get an expressive representation and a better visualization. Deep Variational Information Bottleneck belongs to a line of work that uses information theory to regularize an embedding. The VAE is one of the classic examples in this field.

Suppose we embed the input source X into an encoding Z, usually defined by a parametric model/encoder p(z|x;\theta). In a supervised task, we usually want to maximize the mutual information between the encoding and the label Y, so that the encoding is informative enough to classify with. The target can be expressed as

\max_{\theta} I(Z, Y; \theta)

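Asking only for an informative encoding is not enough in many cases: the identity encoding Z = X is always maximally informative, yet it is not a useful representation. We can add a restriction that limits the complexity of Z. This paper uses the mutual information between the encoding and the original data as a regularization term, weighted by a coefficient \beta \ge 0. The final optimization (maximization) objective is

R_{IB}(\theta) = I(Z, Y; \theta) - \beta I(Z, X; \theta)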

Intuitively, the first term encourages Z to be predictive of Y; the second term encourages Z to “forget” X. Essentially, it forces Z to act like a minimal sufficient statistic of X for predicting Y.
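To make the trade-off concrete, here is a minimal sketch on a toy discrete problem. The setup (X uniform on {0, 1, 2, 3}, label Y = X // 2, deterministic encoders, \beta = 0.5) and all function names are assumptions made for illustration; none of this comes from the paper.

```python
import math
from collections import defaultdict

def mutual_information(joint):
    """I(A;B) in nats, given a dict {(a, b): probability}."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return sum(p * math.log(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

def ib_objective(encoder, beta):
    """I(Z;Y) - beta * I(Z;X) for a deterministic encoder z = encoder(x)."""
    joint_zy, joint_zx = defaultdict(float), defaultdict(float)
    for x in range(4):      # toy assumption: X uniform on {0, 1, 2, 3}
        y = x // 2          # toy assumption: the label is the high bit of x
        z = encoder(x)
        joint_zy[(z, y)] += 0.25
        joint_zx[(z, x)] += 0.25
    return mutual_information(joint_zy) - beta * mutual_information(joint_zx)

# Identity encoding: keeps every detail of X.
keep_everything = ib_objective(lambda x: x, beta=0.5)
# Encoding that keeps only what is needed to predict Y.
keep_label_only = ib_objective(lambda x: x // 2, beta=0.5)
```

Both encoders reach the same I(Z;Y) = ln 2, but the identity encoding pays I(Z;X) = ln 4 while the compressed one pays only ln 2, so the compressed encoder scores higher (0.5 ln 2 versus 0).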

To optimize this objective, a few tricks are applied:

  • [1] Since p(y|z) and p(z) are intractable, variational approximations q(y|z) and r(z) are introduced for them respectively. As a result, the final optimization target is a lower bound of the original one.
  • [2] The encoder p(z|x) can take a stochastic form such as the Gaussian distribution \mathcal{N}(z|f_e^{\mu}(x), f_e^{\Sigma}(x)). The authors use the reparameterization trick from Auto-Encoding Variational Bayes to replace p(z|x)dz with p(\epsilon)d\epsilon, where z = f(x, \epsilon).
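The reparameterization in [2] can be sketched as follows for a diagonal Gaussian encoder. This is only an illustration: `sample_encoding` is a hypothetical helper, and `mu` and `sigma` stand in for the encoder outputs f_e^{\mu}(x) and the square root of the diagonal of f_e^{\Sigma}(x).

```python
import random

def sample_encoding(mu, sigma, rng):
    """Draw z ~ N(mu, diag(sigma^2)) as z = mu + sigma * eps, eps ~ N(0, I)."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    z = [m + s * e for m, s, e in zip(mu, sigma, eps)]
    return z, eps

rng = random.Random(0)
mu, sigma = [1.0, -2.0], [0.5, 0.1]   # stand-ins for the encoder outputs
z, eps = sample_encoding(mu, sigma, rng)
# Given eps, z is a deterministic function of (mu, sigma):
# dz_i/dmu_i = 1 and dz_i/dsigma_i = eps_i, so gradients can flow
# through the sample back into the encoder parameters.
```

All the randomness lives in `eps`, which does not depend on the encoder parameters; that is what makes the sample differentiable with respect to them.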

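Find a lower bound with [1] (the label entropy H(Y) does not depend on the model and is dropped):

I(Z, Y) - \beta I(Z, X) \ge \int dx\, dy\, dz\, p(x) p(y|x) p(z|x) \log q(y|z) - \beta \int dx\, dz\, p(x) p(z|x) \log \frac{p(z|x)}{r(z)} = L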
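As a sketch of the intermediate steps, each mutual-information term can be bounded separately using the non-negativity of a KL divergence: KL[p(y|z), q(y|z)] \ge 0 for the first term and KL[p(z), r(z)] \ge 0 for the second. The block assumes the Markov chain Y - X - Z, so that p(y, z) = \int dx\, p(x) p(y|x) p(z|x).

```latex
\begin{align*}
I(Z, Y) &= \int dy\, dz\, p(y, z) \log \frac{p(y|z)}{p(y)}
  \ge \int dy\, dz\, p(y, z) \log q(y|z) + H(Y) \\
I(Z, X) &= \int dx\, dz\, p(x, z) \log \frac{p(z|x)}{p(z)}
  \le \int dx\, dz\, p(x) p(z|x) \log \frac{p(z|x)}{r(z)}
\end{align*}
```

The first line lower-bounds the term we maximize and the second upper-bounds the term we penalize, so combining them with weight \beta \ge 0 yields a lower bound on the full objective.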

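Approximating p(x, y) with the empirical distribution of the N training pairs (x_n, y_n) gives the empirical estimate:

L \approx \frac{1}{N} \sum_{n=1}^{N} \left[ \int dz\, p(z|x_n) \log q(y_n|z) - \beta\, p(z|x_n) \log \frac{p(z|x_n)}{r(z)} \right] \qquad (16)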

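Applying [2] to (16) lets the gradient back-propagate through a single sample of \epsilon. Flipping the sign gives the objective to minimize:

J_{IB} = \frac{1}{N} \sum_{n=1}^{N} \left( E_{\epsilon \sim p(\epsilon)} \left[ -\log q(y_n | f(x_n, \epsilon)) \right] + \beta\, KL\left[ p(Z|x_n), r(Z) \right] \right)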
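A minimal sketch of the resulting per-example loss, under assumptions not stated in this post: a diagonal Gaussian encoder, r(z) = N(0, I) so that the KL term has a closed form, and a linear softmax decoder for q(y|z). The names `vib_loss` and `decoder_weights` are hypothetical.

```python
import math
import random

def kl_to_standard_normal(mu, sigma):
    """KL[N(mu, diag(sigma^2)) || N(0, I)] in closed form."""
    return 0.5 * sum(s * s + m * m - 1.0 - 2.0 * math.log(s)
                     for m, s in zip(mu, sigma))

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(v - top) for v in logits))
    return [v - log_norm for v in logits]

def vib_loss(mu, sigma, label, decoder_weights, beta, rng):
    """Single-sample estimate of -log q(y|z) + beta * KL[p(z|x), r(z)]."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    z = [m + s * e for m, s, e in zip(mu, sigma, eps)]   # z = f(x, eps)
    # Assumed decoder: one weight row per class, no bias.
    logits = [sum(w * zi for w, zi in zip(row, z)) for row in decoder_weights]
    nll = -log_softmax(logits)[label]
    return nll + beta * kl_to_standard_normal(mu, sigma)

rng = random.Random(0)
decoder_weights = [[0.3, -0.2], [0.1, 0.4], [-0.5, 0.2]]   # 3 classes, 2-d z
loss = vib_loss(mu=[0.5, -1.0], sigma=[0.8, 1.2], label=1,
                decoder_weights=decoder_weights, beta=1e-3, rng=rng)
```

In a real model `mu`, `sigma`, and the decoder weights would come from trainable networks, and the loss would be averaged over a minibatch and differentiated with an autodiff library.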


[DR023] Takeaways:

  • Models that embed an input as a distribution are more robust to adversarial attacks than deterministic models.
  • When some distributions are intractable, approximate them with variational distributions and use the non-negativity of the KL divergence (KL \ge 0) to obtain a bound.
  • The reparameterization trick lets us use Monte Carlo sampling to get an unbiased estimate of the gradient.
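The last point can be checked numerically on a toy case. The example below is an assumption chosen for illustration, not something from the paper: for z = mu + sigma * eps with eps ~ N(0, 1), we have E[z^2] = mu^2 + sigma^2, so the exact gradient with respect to mu is 2 * mu.

```python
import random

def grad_mu_estimate(mu, sigma, num_samples, rng):
    """Monte Carlo estimate of d/dmu E[z^2] with z = mu + sigma * eps."""
    total = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)
        z = mu + sigma * eps
        total += 2.0 * z   # d(z^2)/dmu = 2 * z * dz/dmu, and dz/dmu = 1
    return total / num_samples

estimate = grad_mu_estimate(mu=1.5, sigma=0.8, num_samples=50000,
                            rng=random.Random(0))
exact = 2.0 * 1.5   # gradient of mu^2 + sigma^2 with respect to mu
```

Averaging the per-sample gradients converges to the exact value as the number of samples grows, which is what unbiasedness means here.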