Introduction

Machine learning is based on the idea of using observations to approximate a function that maps inputs to outputs. Every machine learning algorithm tries to approximate such a function by solving an optimization problem with an objective function $f(\theta)$ and, possibly, some constraints $g(\theta)$. Studying machine learning is not about memorizing the step-by-step algorithms that perform this approximation. Instead, it should be about understanding the motivation behind the algorithms and what makes them different from each other.

Have you ever wondered why the residual sum of squares is the objective function of linear regression but not of logistic regression? Or why logistic regression uses the binary cross-entropy loss instead of the residual sum of squares? In this post, I will show you the motivation behind the objective functions of these popular machine learning algorithms.

Point Estimation Methods

In many machine learning algorithms, such as linear regression, logistic regression, the naive Bayes classifier, and neural networks, the process of estimating the optimal parameters of a model is based on the idea of approximating the probability density function (PDF) of the distribution that generated the observed data. These methods are motivated by the theory of point estimation in inferential statistics.

1. Maximum Likelihood Estimation (MLE)

1.1. Negative Log-Likelihood

In a standard machine learning setting, we are given a set of $N$ independent and identically distributed (i.i.d.) observations $\mathcal{D} = \{(x_1, y_1), \dots, (x_N, y_N)\}$ consisting of $N$ inputs $x_i \in \mathbb{R}^D$ and $N$ outputs $y_i \in \mathbb{R}$. The maximum likelihood estimate of the parameters $\theta$ of a statistical model is the point in the parameter space that maximizes the likelihood of the observations:

$$\hat{\theta} = \arg \max_{\theta} \mathcal{L}(\theta)$$

$\mathcal{L}(\theta)$ is the likelihood function of the parameters $\theta$, which is equal to the conditional PDF of the response vector $\mathcal{Y} := \{y_1, \dots, y_N\}$ given the data matrix $\mathcal{X} := \{x_1, \dots, x_N\}$ and the parameters $\theta$:

$$\mathcal{L}(\theta) = \mathcal{P}(\mathcal{Y} | \mathcal{X}, \theta)$$

Since the observations are assumed to be i.i.d., the likelihood function can be represented as a product of the probabilities of each observation:

$$\mathcal{L}(\theta) = \prod_{i=1}^N \mathcal{P}(y_i | x_i, \theta)$$

Although the likelihood function is equal to the conditional PDF of the response vector $\mathcal{Y}$ given the data matrix $\mathcal{X}$ and the parameters $\theta$, the two should not be confused with each other. The likelihood function fixes the observed response vector $\mathcal{Y}$ and data matrix $\mathcal{X}$ and treats the expression as a function of the parameters $\theta$ alone. It does not represent a probability distribution over the parameters: its integral over the parameter space is not necessarily 1.

Intuitively speaking, MLE looks for a point in the parameter space such that the observed data is most likely to be generated by the model:

$$\hat{\theta} = \arg \max_{\theta} \prod_{i=1}^N \mathcal{P}(y_i | x_i, \theta)$$

In order to maximize or minimize a function with respect to the parameters $\theta$, we need to compute its derivative with respect to $\theta$. In calculus, the derivative is a linear operator, so $(f+g)' = f' + g'$, but there is no analogous rule of the form $(fg)' = f' g'$. Hence, it is much easier to compute the derivative if we apply the natural logarithm to turn the likelihood function, a product of many terms, into the log-likelihood function, a sum of many terms. Since the logarithm is a monotonic transformation, maximizing the log-likelihood is equivalent to maximizing the likelihood. In optimization, the convention is to minimize an objective function $f(\theta)$ subject to some constraints $g(\theta)$. Hence, instead of finding the parameter vector $\theta$ that maximizes the likelihood function, we find the one that minimizes the negative log-likelihood function:

$$\hat{\theta} = \arg \min_{\theta} - \sum_{i=1}^N \ln\left(\mathcal{P}(y_i | x_i, \theta)\right)$$
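
To make this concrete before moving to specific models, here is a minimal sketch of MLE in action (assuming NumPy and SciPy; the data and starting values are made up for illustration): we estimate the mean and standard deviation of a Gaussian by numerically minimizing the negative log-likelihood.

```python
import numpy as np
from scipy.optimize import minimize

# Toy data: 1,000 samples drawn from a Gaussian with mean 2.0 and std 0.5.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=0.5, size=1000)

def negative_log_likelihood(params, x):
    """Negative log-likelihood of i.i.d. Gaussian observations."""
    mu, log_sigma = params            # optimize log(sigma) so that sigma stays positive
    sigma = np.exp(log_sigma)
    # -sum_i ln N(x_i | mu, sigma^2)
    return np.sum(np.log(sigma) + 0.5 * np.log(2 * np.pi)
                  + 0.5 * ((x - mu) / sigma) ** 2)

result = minimize(negative_log_likelihood, x0=[0.0, 0.0], args=(data,))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)              # close to 2.0 and 0.5
```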

1.2. Linear Regression

Now that we have a method to estimate the parameters of a PDF, let's see how it works in linear regression. Recall that in linear regression, for every single observation, we assume a functional relationship between the target $y \in \mathbb{R}$ and the input $x \in \mathbb{R}^D$ (we implicitly set the last dimension of the input to 1 to absorb the intercept term):

$$y = f(x) + \epsilon = x^T \theta + \epsilon$$

where $\theta \in \mathbb{R}^D$ are the parameters we seek, and $\epsilon \sim \mathcal{N}(0, \sigma^2)$ is i.i.d. Gaussian noise with mean 0 and variance $\sigma^2$.

The conditional PDF of $y$ given $x$ and $\theta$ is then given by:

$$\mathcal{P}(y | x, \theta) = \mathcal{N}(y | x^T \theta, \sigma^2) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{1}{2}\left(\frac{y - x^T \theta}{\sigma}\right)^2\right)$$

Given a dataset with many observations, the likelihood function of the parameters is then given by:

$$\mathcal{L}(\theta) = \prod_{i=1}^N \mathcal{P}(y_i | x_i, \theta) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{1}{2}\left(\frac{y_i - x_i^T \theta}{\sigma}\right)^2\right)$$

Applying the natural logarithm to the likelihood function, we get:

$$\ln\mathcal{L}(\theta) = \sum_{i=1}^N \left\{\ln\frac{1}{\sqrt{2\pi}\sigma} - \frac{1}{2}\left(\frac{y_i - x_i^T \theta}{\sigma}\right)^2\right\}$$

We can ignore the constant term $\ln\frac{1}{\sqrt{2\pi}\sigma}$ and the scaling factor $\frac{1}{2\sigma^2}$ in the log-likelihood function and minimize the negative log-likelihood function with respect to the parameters $\theta$:

$$\hat{\theta} = \arg \min_{\theta} \sum_{i=1}^N \left( y_i - x_i^T \theta \right)^2$$

From the above equation, we can clearly see the relationship between MLE and the residual sum of squares objective function of linear regression.
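
To illustrate this equivalence, here is a minimal sketch (again assuming NumPy and SciPy, with synthetic data invented for the example) that minimizes the residual sum of squares numerically and compares the result with the closed-form least-squares solution.

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic data: y = 3*x - 1 + Gaussian noise; the last column of X is all ones (intercept).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(200, 1))
X = np.hstack([x, np.ones((200, 1))])
y = X @ np.array([3.0, -1.0]) + rng.normal(0.0, 0.1, size=200)

def rss(theta, X, y):
    """Residual sum of squares, i.e. the Gaussian negative log-likelihood up to constants."""
    residuals = y - X @ theta
    return np.sum(residuals ** 2)

theta_mle = minimize(rss, x0=np.zeros(2), args=(X, y)).x
theta_ols = np.linalg.lstsq(X, y, rcond=None)[0]   # closed-form least-squares solution
print(theta_mle)   # both estimates are close to [3.0, -1.0]
print(theta_ols)
```

The numerical minimizer of the negative log-likelihood and the least-squares solution coincide, which is exactly what the derivation above predicts.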

1.3. Logistic Regression

Now you may be wondering how MLE works for logistic regression. In logistic regression, the response $y$ is a Bernoulli random variable that takes the value 0 or 1. Hence, we can directly model the conditional probability of the response $y \in \{0, 1\}$ given the input $x \in \mathbb{R}^D$ (we implicitly set the last dimension of the input to 1 to absorb the intercept term):

$$\mathcal{P}(y | x, \theta) = \sigma(x^T \theta)^{y} \left(1 - \sigma(x^T \theta)\right)^{1 - y}$$

where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function, which maps $x^T \theta$ to a probability in $(0, 1)$.

For a dataset with many observations, the likelihood function of logistic regression is then given by:

$$\mathcal{L}(\theta) = \prod_{i=1}^N \mathcal{P}(y_i | x_i, \theta) = \prod_{i=1}^N \sigma(x_i^T \theta)^{y_i} \left(1 - \sigma(x_i^T \theta)\right)^{1 - y_i}$$

Applying the natural logarithm to the likelihood function, we get:

$$\ln\mathcal{L}(\theta) = \sum_{i=1}^N \left\{y_i \ln\sigma(x_i^T \theta) + (1 - y_i) \ln\left(1 - \sigma(x_i^T \theta)\right)\right\}$$

Then we minimize the negative log-likelihood function with respect to the parameter vector $\theta$:

$$\hat{\theta} = \arg \min_{\theta} - \sum_{i=1}^N \left\{y_i \ln\left(\sigma(x_i^T \theta)\right) + (1 - y_i) \ln\left(1 - \sigma(x_i^T \theta)\right)\right\}$$

From the above equation, we can clearly see the relationship between MLE and the binary cross-entropy objective function of logistic regression.
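
As a sanity check, here is a minimal sketch (assuming NumPy and SciPy, with synthetic labels generated from a made-up parameter vector) that fits logistic regression by numerically minimizing the binary cross-entropy, i.e. the negative log-likelihood derived above.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(theta, X, y, eps=1e-12):
    """Negative log-likelihood of the Bernoulli model, i.e. the binary cross-entropy loss."""
    p = np.clip(sigmoid(X @ theta), eps, 1.0 - eps)   # clip to avoid log(0)
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# Synthetic binary labels from a known parameter vector; the last input dimension is 1 (intercept).
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(500, 2)), np.ones((500, 1))])
theta_true = np.array([2.0, -1.0, 0.5])
y = (rng.uniform(size=500) < sigmoid(X @ theta_true)).astype(float)

theta_hat = minimize(binary_cross_entropy, x0=np.zeros(3), args=(X, y)).x
print(theta_hat)   # roughly recovers theta_true
```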

2. Maximum A Posteriori Estimation (MAP)

As the complexity of the model increases, as in the case of polynomial regression or neural networks, MLE can be prone to overfitting. Maximum a posteriori estimation (MAP), on the other hand, places a prior on the parameters and uses the posterior distribution to estimate them. The posterior over the parameters $\theta$, given the observations $\mathcal{X}$ and the corresponding responses $\mathcal{Y}$, is obtained by applying Bayes' theorem:

$$\mathcal{P}(\theta | \mathcal{X}, \mathcal{Y}) = \frac{\mathcal{P}(\mathcal{Y} | \mathcal{X}, \theta)\, \mathcal{P}(\theta)}{\mathcal{P}(\mathcal{Y} | \mathcal{X})}$$

The parameters $\theta$ are then estimated by maximizing the posterior probability:

$$\hat{\theta} = \arg \max_{\theta} \mathcal{P}(\theta | \mathcal{X}, \mathcal{Y})$$

Since the denominator of the posterior does not depend on the parameters $\theta$, we can simply ignore it and maximize the numerator:

$$\hat{\theta} = \arg \max_{\theta} \mathcal{P}(\mathcal{Y} | \mathcal{X}, \theta)\, \mathcal{P}(\theta)$$

Following the same procedure as in MLE, instead of maximizing this product we minimize its negative logarithm, which is the sum of the negative log-likelihood and the negative log-prior:

$$\hat{\theta} = \arg \min_{\theta} \left\{- \ln{\mathcal{P}(\mathcal{Y} | \mathcal{X}, \theta)} - \ln{\mathcal{P}(\theta)} \right\}$$

The prior plays the role of a regularization term in MAP. For instance, assume the prior over $\theta \in \mathbb{R}^D$ is a Gaussian distribution with mean $0$ and covariance matrix $\sigma^2 I$ (here $\sigma^2$ denotes the prior variance, not the noise variance from the previous section):

$$\mathcal{P}(\theta) = \mathcal{N}(\theta \mid 0, \sigma^2 I) = \frac{1}{(2\pi\sigma^2)^{D/2}} \exp\left(-\frac{1}{2\sigma^2}\Vert\theta\Vert^2_2\right)$$

Taking the negative logarithm of the prior, we obtain:

$$-\ln\mathcal{P}(\theta) = \frac{D}{2}\ln(2\pi\sigma^2) + \frac{1}{2\sigma^2}\Vert\theta\Vert_2^2$$

We can ignore the constant term $\frac{D}{2}\ln(2\pi\sigma^2)$ and keep only the term that depends on the parameters $\theta$. Writing $\frac{1}{2\sigma^2}$ as $\lambda$ for short, the MAP estimate of the parameters $\theta$ is then given by:

$$\hat{\theta} = \arg \min_{\theta} \left\{- \ln{\mathcal{P}(\mathcal{Y} | \mathcal{X}, \theta)} + \lambda \Vert\theta\Vert_2^2 \right\}$$

The hyperparameter $\lambda$ controls the strength of the prior: the larger $\lambda$ is, the more strongly the posterior is regularized and the more the estimated parameters $\theta$ are shrunk toward zero. In this case, the Gaussian prior $\mathcal{N}\left(0, \frac{1}{2\lambda}I\right)$ gives rise to the L2 regularization term of Ridge regression. We can also use other regularization terms, such as L1 (Lasso regression) or a combination of L1 and L2 (Elastic Net), each of which corresponds to a different prior on the parameters.
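
To see the Gaussian prior acting as L2 regularization, here is a minimal sketch (assuming NumPy and SciPy; the data and the value of $\lambda$ are made up for illustration) that minimizes the penalized residual sum of squares and compares the result with the closed-form Ridge solution. For simplicity, the intercept is penalized along with the other parameters, which is usually avoided in practice.

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic regression data; the last column of X is all ones (intercept).
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(100, 5)), np.ones((100, 1))])
theta_true = rng.normal(size=6)
y = X @ theta_true + rng.normal(0.0, 0.3, size=100)

lam = 1.0   # hypothetical regularization strength, lambda = 1 / (2 * prior variance)

def map_objective(theta, X, y, lam):
    """Residual sum of squares plus the L2 penalty induced by the Gaussian prior."""
    return np.sum((y - X @ theta) ** 2) + lam * np.sum(theta ** 2)

theta_map = minimize(map_objective, x0=np.zeros(6), args=(X, y, lam)).x

# Ridge regression closed form: (X^T X + lam * I)^{-1} X^T y
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(6), X.T @ y)
print(theta_map)     # the two solutions agree closely
print(theta_ridge)
```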

From the definition of MAP, we can see that MLE is just a special case of MAP in which the prior over the parameters is uniform.

What’s Next?

In this post, I have introduced the point-estimation perspective on machine learning. From now on, when someone asks you “why do we have to use binary cross-entropy as the objective function for logistic regression?” or “why do we have to use the residual sum of squares as the objective function for linear regression?”, convexity is not the only answer you can give them.

There are further topics in the theory of point estimation, such as the interpretation of the estimates, the choice of the prior, and Bayesian regression, that can’t be covered in a single blog post. There are also other interesting points of view on the motivation behind machine learning algorithms, such as information theory, Bayesian inference, and analytic geometry, that I will try to cover in future posts.

Finally, please don’t hesitate to leave your comments and suggestions so that I can improve the quality of the post. We all have our own WHY questions during the learning process, so I hope the answers to my WHYs can somehow cover yours.