From MLE/MAP to L2-Loss Regression

Posted on July 20, 2018

Today we are going to derive the objective of regression from Maximum likelihood estimation (MLE) and Maximum a posteriori estimation(MAP). We are going to prove that, given certain assumptions, optimizing MLE/MAP is equivalent to optimizing the L2 regression objective without/with the regularization term respectively.
Both MLE and MAP sounds intimidating to me in the first place :P. But once you go through the entire proof, I ensure you will remember that for life because it’s quite simple and straightforward. I will assume you already have a basic understanding of the Bayes’ theorem.

Problem definition

To build a regression model, we can define the dataset as  D = \{x_i, y_i\}_{i=1}^N and the parameters in the model as \theta = \{\theta_i\}_{i=1}^{M}. Here we simply assume that we have N pairs of data and both the independent and dependent variables are scalars. And that the prediction \hat{y}_i = f(x_i, \theta) can be in any forms, namely, f(\cdot) can be either linear or non-linear.
In a non-bayes-setting, we can define the loss function as the L2 distance:

\large \mathcal{L(\theta)} = \sum_{i=1}^N\| y_i - f(x_i, \theta) \|^2

To prevent overfitting, we might want to add a L2 regularization on the parameter set:

\large \mathcal{L}_r(\theta) = \sum_{i=1}^N\| y_i - f(x_i, \theta) \|^2 + \alpha\sum_{i=1}^{N}\| \theta_i \|^2

But how are those objectives connecting with MLE and MAP?

Bayes' theorem

For most machine learning beginners, their understandings of Bayes’ theorem remain at “Bayes’s theorem is a transformation of the definition of the conditional probability”. At least it looks in that way:

\large P(A|B)P(B) = P(A, B) = P(B|A)P(A) \\ \Rightarrow P(A|B) = \frac{P(A)P(B|A)}{P(B)}

Bayes’ theorem can be applied in many scenarios. Here we can replace the variable A and variable B with the ones in our problem:

\large P(\theta|D) = \frac{P(\theta)P(D|\theta)}{P(D)}.

And different components in this formula have different names:

\large {\color{Red} \text{posterior}\rightarrow} P(\theta|D) = \frac{{\color{Red} \text{prior}\rightarrow} P(\theta){\color{Red} \text{likelihood}\rightarrow}P(D|\theta)}{{\color{Red} \text{marginal likelihood}\rightarrow}P(D)}.

(PS: Usually for a given dataset, we can safely assume that the marginal likelihood P(D) is a fixed number.)

So the objective of Maximum likelihood estimation is to find an optimal parameter set \theta^{MLE*}  that can maximize the likelihood. And the one of Maximum a posteriori estimation is to find an optimal parameter set \theta^{MAP*} that can maximize a posteriori. Or in a more formulated way:

\large \left\{\begin{matrix} \theta^{MLE*} = \arg\max_{\theta} P(D|\theta) \\ \theta^{MAP*} = \arg\max_{\theta} P(\theta|D) \end{matrix}\right.

In fact, that’s the whole story. Is it elegant? But to bridge MLE/MAP with the L2-loss-regression, we have to make some assumptions on both the data distribution and the parameter distribution.

Maximum likelihood estimation (MLE)

For a regression model, we can write the relationship between different components as:

\large y_i = \hat{y_i} + \epsilon = f(x_i, \theta)+ \epsilon,

where \large \epsilon is a random variable representing the noise in the real world. Based on the Central limit theorem, we can assume that the noise tends toward a normal distribution. Which means:

\large y_i - f(x_i, \theta) = \epsilon \sim \mathcal{N}(\mu_1, \sigma_1^2)

For most cases, we have a bias term in \large f(\cdot) that can learn/counter the mean \large \mu_1. In that case, we can further simplify the assumption to:

\large y_i - f(x_i, \theta) = \epsilon \sim \mathcal{N}(0, \sigma_1^2)

Remember that we treat \large \theta and \large \epsilon  as random variables here. We have to provide the value of \large \theta = \theta_{val}  before we can calculate the actual prediction/noise. Additionally, we can consider the sampling of each data pair is independent to each other. Thus, given a set of parameter \large \theta_{val} , we can calculate the (log) likelihood as:

\large \begin{align*} P(D|\theta=\theta_{val}) &= \prod_{i=1}^{N}P(\epsilon=\epsilon_i|\theta=\theta_{val}) \\ P(D|\theta) &= \prod_{i=1}^{N} P(y_i-f(x_i, \theta)) \\ \log{P(D|\theta)} &= \sum_{i=1}^{N} \log{P(y_i-f(x_i, \theta))} \\ \log{P(D|\theta)} &= \sum_{i=1}^{N} \log{P(y_i-f(x_i, \theta))} \\ \log{P(D|\theta)} &= \sum_{i=1}^{N} \log{\frac{1}{\sqrt{2\pi}\sigma_1} e^{-\frac{(y_i-f(x_i, \theta))^2}{2\sigma_1^2}} } \\ \log{P(D|\theta)} &= \sum_{i=1}^{N} \log{\frac{1}{\sqrt{2\pi}\sigma_1} + \sum_{i=1}^{N} \log{e^{-\frac{(y_i-f(x_i, \theta))^2}{2\sigma_1^2}} } }\\ \log{P(D|\theta)} &= C_1 - \frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i-f(x_i, \theta))^2 \\ \end{align*}

where \large C_1 is a constant that can be ignored when optimizing \large \theta
This probability means “if the parameter is \large \theta_{val}, how possible is this dataset \large D will be sampled given our assumption of noise.” And notice that I didn’t write \large \theta_{val} since the second row but used \large \theta to simplify the writing, which is somehow confusing but also very common in many ML materials.  Just remember that the symbol \large \theta represents a value when we want to optimize or to calculate it, but is a random variable in the Bayes’ theorem. 

So let’s plug in the simplified log likelihood back to the objective of MLE. We have:

\large \begin{align*} \theta^{MLE*} &= \arg\max_{\theta} P(D|\theta) \\ &= \arg\max_{\theta} \log{P(D|\theta)} \\ &= \arg\min_{\theta} -\log{P(D|\theta)} \\ &= \arg\min_{\theta} \frac{1}{2\sigma^2}\sum_{i=1}^N(y_i-f(x_i, \theta))^2 - C_1\\ &= \arg\min_{\theta} \sum_{i=1}^N(y_i-f(x_i, \theta))^2 \\ &= \arg\min_{\theta} \mathcal{L}(\theta) \\ \end{align*}

This the same objective as in the L2 regression.


Maximum a posteriori estimation(MAP)

We know that the posterior is related to both the prior and the likelihood:

\large P(\theta|D) \propto P(\theta)P(D|\theta)

Therefore we can assume that the value of each parameter also has a normal distribution with a mean of \large 0 and a variance of \large \sigma_2^2 independently:

\large \theta_i \sim \mathcal{N}(0, \sigma_2^2)

In this case, we have:

\large \begin{align*} P(\theta) &= \prod_{i=1}^MP(\theta_i) \\ \log { P(\theta) } &= \sum_{i=1}^M \log{P(\theta_i)} \\ \log { P(\theta) } &= \sum_{i=1}^M \log{\frac{1}{\sqrt{2\pi} \sigma_2 }} - \frac{1}{2\sigma_2^2}\sum_{i=1}^M \|\theta_i\|^2 \\ \log { P(\theta) } &= C_2 - \frac{1}{2\sigma_2^2}\sum_{i=1}^M \|\theta_i\|^2 \\ \end{align*}


\LARGE \large \begin{align*} \theta^{MAP*} &= \arg\max_{\theta} P(\theta|D) \\ &= \arg\max_{\theta} P(\theta) P(D|\theta) \\ &= \arg\max_{\theta} P(\theta) P(D|\theta) \\ &= \arg\max_{\theta} (\log{P(\theta)} + \log{P(D|\theta)}) \\ &= \arg\min_{\theta} (\frac{1}{2\sigma_2^2}\sum_{i=1}^M \|\theta_i\|^2 + \frac{1}{2\sigma_1^2} \sum_{i=1}^N\| y_i - f(x_i, \theta) \|^2 ) \\ &= \arg\min_{\theta} (\frac{\sigma_1^2}{\sigma_2^2} \sum_{i=1}^M \|\theta_i\|^2 + \sum_{i=1}^N\| y_i - f(x_i, \theta) \|^2 ) \\ \end{align*}

where we can use only one hyperparameter \large \alpha = \frac{\sigma_1}{\sigma_2} instead of using two. 


This is how MLE and MAP links with the L2-loss-regression. I think the key components are:

  • Treating both the noise and parameters as a random variable.
  • Assuming noise and parameters are in a certain distribution, whose arguments (i.e. mean and variance) are hyper-parameters
  • Plugging those into Bayes’ Theorem.

Hence, with different kinds of assumption, we can derive different optimization objective of MLE and MAP. And when we assume that both the noise and the parameters are having a normal distribution, the MLE results to L2-loss regression and MAP results to L2-loss regression with L2-regularization term.

There are a lot of LaTeX in this blog and feel welcome to discuss or to point out my mistakes in the comments :)