# From MLE/MAP to L2-Loss Regression

Posted on July 20, 2018

Today we are going to derive the objective of regression from maximum likelihood estimation (MLE) and maximum a posteriori estimation (MAP). We are going to prove that, given certain assumptions, optimizing the MLE/MAP objective is equivalent to optimizing the L2 regression objective without/with the regularization term, respectively.
Both MLE and MAP sounded intimidating to me at first :P. But once you go through the entire proof, I assure you that you will remember it for life, because it’s quite simple and straightforward. I will assume you already have a basic understanding of Bayes’ theorem.

### Problem definition

To build a regression model, we can define the dataset as $D = \{(x_i, y_i)\}_{i=1}^{N}$ and the parameters of the model as $\theta$. Here we simply assume that we have $N$ pairs of data and that both the independent and dependent variables are scalars. The prediction $\hat{y}_i = f(x_i; \theta)$ can take any form; namely, $f$ can be either linear or non-linear.
In a non-Bayesian setting, we can define the loss function as the L2 distance:

$$L(\theta) = \sum_{i=1}^{N} \left(y_i - f(x_i; \theta)\right)^2$$

To prevent overfitting, we might want to add an L2 regularization term on the parameter set:

$$L(\theta) = \sum_{i=1}^{N} \left(y_i - f(x_i; \theta)\right)^2 + \lambda \sum_{j} \theta_j^2$$

But how are those objectives connected with MLE and MAP?

### Bayes' theorem

For most machine learning beginners, the understanding of Bayes’ theorem stops at “Bayes’ theorem is a rearrangement of the definition of conditional probability”. At least it looks that way:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

Bayes’ theorem can be applied in many scenarios. Here we can replace the variables $A$ and $B$ with the ones in our problem:

$$P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{P(D)}$$

And the different components in this formula have different names:

$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{marginal likelihood}}$$

(PS: For a given dataset, the marginal likelihood $P(D)$ does not depend on $\theta$, so we can safely treat it as a fixed number.)

So the objective of maximum likelihood estimation is to find an optimal parameter set $\theta_{MLE}$ that maximizes the likelihood, and the objective of maximum a posteriori estimation is to find an optimal parameter set $\theta_{MAP}$ that maximizes the posterior. Or, more formally:

$$\theta_{MLE} = \arg\max_{\theta} P(D \mid \theta)$$

$$\theta_{MAP} = \arg\max_{\theta} P(\theta \mid D) = \arg\max_{\theta} P(D \mid \theta)\,P(\theta)$$

In fact, that’s the whole story. Isn’t it elegant? But to bridge MLE/MAP with L2-loss regression, we have to make some assumptions about both the data distribution and the parameter distribution.
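To make the two definitions concrete, here is a minimal sketch with a single scalar parameter evaluated on a grid. The likelihood and prior curves are made-up illustrative shapes (not derived from any dataset), and all variable names are my own. It shows that dividing by the marginal likelihood does not move the maximizer, and that the MLE and MAP estimates differ once a prior is involved.

```python
import numpy as np

# Hypothetical toy setup: one scalar parameter theta evaluated on a grid.
theta_grid = np.linspace(-3.0, 3.0, 601)
dtheta = theta_grid[1] - theta_grid[0]

# Illustrative (assumed) curves; any non-negative functions would do.
likelihood = np.exp(-0.5 * ((theta_grid - 1.5) / 0.5) ** 2)  # P(D | theta), peaks at 1.5
prior = np.exp(-0.5 * (theta_grid / 1.0) ** 2)               # P(theta), peaks at 0

# Bayes' theorem: posterior = likelihood * prior / marginal likelihood.
unnormalized = likelihood * prior
evidence = unnormalized.sum() * dtheta  # Riemann-sum approximation of P(D)
posterior = unnormalized / evidence

theta_mle = theta_grid[np.argmax(likelihood)]
theta_map = theta_grid[np.argmax(posterior)]
theta_map_unnormalized = theta_grid[np.argmax(unnormalized)]

print("MLE:", theta_mle)
print("MAP:", theta_map)
print("MAP without dividing by P(D):", theta_map_unnormalized)
```

The prior pulls the MAP estimate away from the MLE estimate and toward the prior's peak, while the normalization by $P(D)$ has no effect on where the maximum sits.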

### Maximum likelihood estimation (MLE)

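For a regression model, we can write the relationship between the different components as:

$$y = f(x; \theta) + \epsilon$$

where $\epsilon$ is a random variable representing the noise in the real world. Based on the central limit theorem, if the noise is the sum of many small independent effects, we can assume that it is approximately normally distributed. That is:

$$\epsilon \sim \mathcal{N}(\mu, \sigma^2)$$

In most cases, we have a bias term in $f$ that can learn/counter the mean $\mu$. In that case, we can further simplify the assumption to:

$$\epsilon \sim \mathcal{N}(0, \sigma^2)$$

Remember that we treat $\epsilon$ and $\theta$ as random variables here. We have to provide the value of $\theta$ before we can calculate the actual prediction/noise. Additionally, we can assume that the data pairs are sampled independently of each other. Thus, given a set of parameters $\theta$ (and treating the inputs $x_i$ as given), we can calculate the (log) likelihood as:

$$
\begin{aligned}
\log P(D \mid \theta) &= \log \prod_{i=1}^{N} P(y_i \mid x_i, \theta) \\
&= \sum_{i=1}^{N} \log \left[ \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{\left(y_i - f(x_i; \theta)\right)^2}{2\sigma^2} \right) \right] \\
&= -\frac{1}{2\sigma^2} \sum_{i=1}^{N} \left(y_i - f(x_i; \theta)\right)^2 + C
\end{aligned}
$$

where $C = -N \log\left(\sqrt{2\pi}\,\sigma\right)$ is a constant that can be ignored when optimizing.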
This probability answers the question “if the parameter is $\theta$, how likely is it that this dataset $D$ would be sampled, given our assumption about the noise?” And notice that I did not write $P(D \mid \Theta = \theta)$ but used $P(D \mid \theta)$ to simplify the writing, which is somewhat confusing but also very common in ML materials. Just remember that the symbol $\theta$ represents a value when we want to optimize or calculate with it, but a random variable in Bayes’ theorem.

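So let’s plug the simplified log-likelihood back into the objective of MLE. Since $\sigma^2 > 0$ and constant terms do not affect the optimum, we have:

$$\theta_{MLE} = \arg\max_{\theta} \log P(D \mid \theta) = \arg\max_{\theta} \left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{N} \left(y_i - f(x_i; \theta)\right)^2 \right] = \arg\min_{\theta} \sum_{i=1}^{N} \left(y_i - f(x_i; \theta)\right)^2$$

This is the same objective as in L2 regression.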
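If you want to check this equivalence numerically, here is a small sketch. It assumes a linear model $f(x; \theta) = wx + b$ and synthetic data with Gaussian noise; the true coefficients, the noise level, and all function names are hypothetical choices of mine. The least-squares solution should also be the point with the highest Gaussian log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from an assumed linear model y = w * x + b + noise.
n = 200
sigma = 0.5  # noise standard deviation (assumed known here)
x = rng.uniform(-2.0, 2.0, size=n)
y = 1.7 * x - 0.4 + rng.normal(0.0, sigma, size=n)

X = np.column_stack([x, np.ones(n)])  # design matrix with a bias column


def sum_squared_error(theta):
    """L2 loss: sum_i (y_i - f(x_i; theta))^2."""
    residual = y - X @ theta
    return float(residual @ residual)


def log_likelihood(theta):
    """Gaussian log-likelihood: -SSE / (2 sigma^2) + C."""
    constant = -n * np.log(np.sqrt(2.0 * np.pi) * sigma)
    return -sum_squared_error(theta) / (2.0 * sigma**2) + constant


# Minimizer of the L2 loss.
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Randomly perturbed parameters should never beat theta_ls in log-likelihood.
perturbations = rng.normal(0.0, 0.1, size=(1000, 2))
ll_best = log_likelihood(theta_ls)
ll_perturbed = np.array([log_likelihood(theta_ls + d) for d in perturbations])
all_lower = bool(np.all(ll_perturbed <= ll_best))

print("least-squares estimate:", theta_ls)
print("no perturbation has a higher log-likelihood:", all_lower)
```

The random perturbation check is only a sanity test, not a proof; the proof is the derivation above, where the log-likelihood is a negatively scaled copy of the L2 loss plus a constant.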

### Maximum a posteriori estimation (MAP)

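We know that the posterior is related to both the prior and the likelihood:

$$P(\theta \mid D) \propto P(D \mid \theta)\,P(\theta)$$

So we also need an assumption about the prior. We can assume that each parameter $\theta_j$ independently follows a normal distribution with a mean of $0$ and a variance of $\sigma_\theta^2$:

$$\theta_j \sim \mathcal{N}(0, \sigma_\theta^2)$$

In this case, we have:

$$\log P(\theta) = \sum_{j} \log \mathcal{N}(\theta_j; 0, \sigma_\theta^2) = -\frac{1}{2\sigma_\theta^2} \sum_{j} \theta_j^2 + C'$$

and:

$$
\begin{aligned}
\theta_{MAP} &= \arg\max_{\theta} \left[ \log P(D \mid \theta) + \log P(\theta) \right] \\
&= \arg\max_{\theta} \left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{N} \left(y_i - f(x_i; \theta)\right)^2 - \frac{1}{2\sigma_\theta^2} \sum_{j} \theta_j^2 \right] \\
&= \arg\min_{\theta} \left[ \sum_{i=1}^{N} \left(y_i - f(x_i; \theta)\right)^2 + \lambda \sum_{j} \theta_j^2 \right]
\end{aligned}
$$

where we can use only one hyperparameter $\lambda = \sigma^2 / \sigma_\theta^2$ instead of two.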
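Here is a similar sanity check for the MAP case. It is a sketch under the same hypothetical linear model and synthetic data as before, with a zero-mean Gaussian prior on every parameter (bias included) and made-up values for the noise and prior standard deviations. With $\lambda = \sigma^2 / \sigma_\theta^2$, the well-known closed-form ridge solution for a linear model should also maximize the log posterior.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data from an assumed linear model y = w * x + b + noise.
n = 50
sigma = 0.5        # noise standard deviation (assumed)
sigma_theta = 1.0  # prior standard deviation of each parameter (assumed)
x = rng.uniform(-2.0, 2.0, size=n)
y = 1.7 * x - 0.4 + rng.normal(0.0, sigma, size=n)
X = np.column_stack([x, np.ones(n)])

lam = sigma**2 / sigma_theta**2  # the single hyperparameter from the derivation


def log_posterior_unnormalized(theta):
    """log P(D | theta) + log P(theta), dropping theta-independent constants."""
    residual = y - X @ theta
    log_lik = -(residual @ residual) / (2.0 * sigma**2)
    log_prior = -(theta @ theta) / (2.0 * sigma_theta**2)
    return float(log_lik + log_prior)


def ridge_objective(theta):
    """L2 loss plus L2 regularization with weight lam."""
    residual = y - X @ theta
    return float(residual @ residual + lam * (theta @ theta))


# Closed-form minimizer of the ridge objective for a linear model.
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

# Perturbed parameters should never have a higher log posterior.
perturbations = rng.normal(0.0, 0.1, size=(1000, 2))
lp_best = log_posterior_unnormalized(theta_ridge)
lp_perturbed = np.array(
    [log_posterior_unnormalized(theta_ridge + d) for d in perturbations]
)
all_lower = bool(np.all(lp_perturbed <= lp_best))

print("ridge / MAP estimate:", theta_ridge)
print("no perturbation has a higher log posterior:", all_lower)
```

Note that `log_posterior_unnormalized(theta)` equals `-ridge_objective(theta) / (2 * sigma**2)` for every `theta`, which is exactly the rescaling step in the derivation.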

This is how MLE and MAP link with L2-loss regression. I think the key components are:

• Treating both the noise and the parameters as random variables.
• Assuming that the noise and the parameters follow certain distributions, whose arguments (i.e. mean and variance) are hyperparameters.
• Plugging those into Bayes’ theorem.

Hence, with different assumptions, we can derive different optimization objectives from MLE and MAP. And when we assume that both the noise and the parameters are normally distributed, MLE results in L2-loss regression and MAP results in L2-loss regression with an L2 regularization term.

There is a lot of LaTeX in this post, so feel free to discuss it or point out my mistakes in the comments :)
