Problem
One of the most commonly used algorithms in machine learning is linear
regression. The loss function used in linear regression is the squared error
given by
\begin{align}
\mathcal{L}(X,y) = \frac{1}{2} ||X^T \theta - y||^2 = \frac{1}{2} \sum_{i=1}^n (\theta^T x_i - y_i)^2.
\end{align}
where $n$ is the number of examples, $x_i$ are the columns of the matrix $X$,
and $y_i$ are the elements of the vector $y$.
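To make the notation concrete, here is a minimal NumPy sketch (the dimensions and data are illustrative, not from the original) checking that the vectorized form $\frac{1}{2} ||X^T \theta - y||^2$ equals the per-example sum:
```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 5                      # feature dimension, number of examples
X = rng.normal(size=(d, n))      # columns x_i are the examples
theta = rng.normal(size=d)
y = rng.normal(size=n)

loss_vec = 0.5 * np.linalg.norm(X.T @ theta - y) ** 2
loss_sum = 0.5 * sum((theta @ X[:, i] - y[i]) ** 2 for i in range(n))
assert np.isclose(loss_vec, loss_sum)
```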
Assume the targets are generated via a linear transformation of the inputs
plus some zero-mean Gaussian noise. In other words,
\begin{align}
y_i = \theta^T x_i + \epsilon_i
\end{align}
where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$.
Show that maximizing the likelihood of the parameters is equivalent
to minimizing the squared error above.
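Before deriving this, it may help to see the assumed generative model concretely. The sketch below (dimensions and noise level are illustrative assumptions) samples data according to it:
```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma = 3, 100, 0.5
X = rng.normal(size=(d, n))            # columns x_i are the inputs
theta_true = rng.normal(size=d)        # ground-truth parameters
eps = rng.normal(0.0, sigma, size=n)   # epsilon_i ~ N(0, sigma^2)
y = X.T @ theta_true + eps             # y_i = theta^T x_i + epsilon_i
```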
Solution
First note that $y_i - \theta^T x_i = \epsilon_i \sim \mathcal{N}(0, \sigma^2)$. This says that the
difference between the ground truth $y_i$ and the prediction $\theta^T x_i$
follows a zero-mean Gaussian distribution. Alternatively, $y_i \sim \mathcal{N}(\theta^T x_i, \sigma^2)$.
The likelihood of the parameters is thus given by
\begin{align}
P_{\theta}(y | X) &= \prod_{i=1}^n \mathcal{N}(y_i; \theta^T x_i, \sigma^2) \\
&= \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\sigma} \exp\left( -\frac{(y_i - \theta^T x_i)^2}{2\sigma^2} \right)
\end{align}
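As a sanity check, the sketch below (again with illustrative data) evaluates this likelihood as a product of Gaussian densities using `scipy.stats.norm` and confirms it matches the exponential of the summed log-densities:
```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
d, n, sigma = 3, 20, 0.5
X = rng.normal(size=(d, n))
theta = rng.normal(size=d)
y = X.T @ theta + rng.normal(0.0, sigma, size=n)

mu = X.T @ theta                                    # means theta^T x_i
likelihood = np.prod(norm.pdf(y, loc=mu, scale=sigma))
log_lik = np.sum(norm.logpdf(y, loc=mu, scale=sigma))
assert np.isclose(likelihood, np.exp(log_lik))
```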
Rather than maximizing the likelihood, we can equivalently minimize the negative
log-likelihood. Recall that $\log$ is monotonic and increasing, so taking the log does not
change the value of $\theta$ which optimizes the function, and negating simply
turns the maximization into a minimization. Thus we have
\begin{align}
-\log P_{\theta}(y | X) &= - \sum_{i=1}^n \log \frac{1}{\sqrt{2\pi}\sigma} \exp\left( -\frac{(y_i - \theta^T x_i)^2}{2\sigma^2} \right) \\
&= - n \log \frac{1}{\sqrt{2\pi}\sigma} - \sum_{i=1}^n \left( -\frac{(y_i - \theta^T x_i)^2}{2\sigma^2} \right) \\
&= - n \log \frac{1}{\sqrt{2\pi}\sigma} + \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \theta^T x_i)^2 \\
&= a + \frac{b}{2} \sum_{i=1}^n (y_i - \theta^T x_i)^2
\end{align}
where $a = -n \log \frac{1}{\sqrt{2\pi}\sigma}$ and $b = \frac{1}{\sigma^2}$.
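A quick numerical check (with illustrative data) confirms that this simplified form agrees with the negative log-likelihood computed directly from the Gaussian densities:
```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
d, n, sigma = 3, 20, 0.5
X = rng.normal(size=(d, n))
theta = rng.normal(size=d)
y = X.T @ theta + rng.normal(0.0, sigma, size=n)

a = -n * np.log(1.0 / (np.sqrt(2.0 * np.pi) * sigma))
b = 1.0 / sigma**2
nll_closed = a + (b / 2.0) * np.sum((y - X.T @ theta) ** 2)
nll_direct = -np.sum(norm.logpdf(y, loc=X.T @ theta, scale=sigma))
assert np.isclose(nll_closed, nll_direct)
```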
The constants $a$ and $b$ also do not affect the value of $\theta$ which
optimizes the function: adding $a$ shifts every value by the same amount, and
scaling by $b > 0$ preserves the ordering. Thus we are left with the equivalent
problem of minimizing
\begin{align}
\frac{1}{2} \sum_{i=1}^n (y_i - \theta^T x_i)^2
\end{align}
which is precisely the squared error objective function.
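To see the equivalence end to end, the sketch below (a hypothetical setup, with illustrative dimensions and noise level) compares the closed-form least-squares solution from `np.linalg.lstsq` with the parameters found by numerically minimizing the negative log-likelihood; they should coincide up to optimizer tolerance:
```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
d, n, sigma = 3, 200, 0.5
X = rng.normal(size=(d, n))
theta_true = rng.normal(size=d)
y = X.T @ theta_true + rng.normal(0.0, sigma, size=n)

def nll(theta):
    # Negative log-likelihood under the Gaussian noise model.
    r = X.T @ theta - y
    return n * np.log(np.sqrt(2.0 * np.pi) * sigma) + r @ r / (2.0 * sigma**2)

theta_ls, *_ = np.linalg.lstsq(X.T, y, rcond=None)  # minimizes squared error
theta_ml = minimize(nll, np.zeros(d)).x             # maximizes likelihood
assert np.allclose(theta_ls, theta_ml, atol=1e-4)
```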