Probability and Maximum Likelihood in Disguise

Liger
4 min read · Oct 8, 2021


“Are you watching closely?” “You are looking, but what you’re really doing is filtering, interpreting. Searching for meaning.”

This was me every time I looked at cost functions. It felt like magic: a continuous, convex function suddenly pulled out of the magician’s hat (as obvious as it now seems) that could be used to solve any ML problem, be it regression or classification.

Recently I peeped into that magician’s hat, and what I saw was indeed a revelation. It was all probability and maximum likelihood estimation. I knew probability and statistics were important, but I never thought they would be hidden in plain sight, or that all those optional statistics sections in ML courses were this important. As always, the devil is in the details. So let’s dive into the details.

Cross entropy cost function

Let’s start with the simplest and most obvious one: the cross entropy cost function. Cross entropy is one of the most widely used cost functions for logistic regression. Anyone who has had a taste of Data Science/Machine Learning will be familiar with the equation below, and it is quite intuitive to explain.

Cross Entropy cost function — for ‘m’ samples
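Written out, with yᵢ as the actual label and ŷᵢ as the predicted probability of the i-th sample, the standard form is:

J(y, ŷ) = −(1/m) · Σᵢ₌₁ᵐ [ yᵢ · log(ŷᵢ) + (1 − yᵢ) · log(1 − ŷᵢ) ]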

Intuitively, we can see that the above cost function is obtained by averaging the individual loss function (given below) across the ‘m’ samples.

Loss function for individual sample
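For a single sample (dropping the index), this is:

L(ŷ, y) = −[ y · log(ŷ) + (1 − y) · log(1 − ŷ) ]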

The loss is minimal when the predicted value (ŷ) is 1 while the actual value (y) is 1, and when ŷ is 0 while y is 0. (This can be verified by evaluating the loss for accurate predictions, i.e. the (ŷ, y) combinations (1,1) and (0,0), and for the incorrect predictions (1,0) and (0,1); the short sketch below makes this concrete.)
Note: the value of log(x) tends to −∞ as x tends to 0
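A minimal numpy sketch of that check (the function name, test values and the clipping constant are my own; clipping only keeps log(0) from returning −∞):

import numpy as np

def sample_loss(y_hat, y, eps=1e-12):
    # Per-sample loss: -(y*log(ŷ) + (1-y)*log(1-ŷ)), with ŷ clipped away from 0 and 1
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

for y_hat, y in [(1, 1), (0, 0), (1, 0), (0, 1)]:
    print((y_hat, y), sample_loss(y_hat, y))

# Correct predictions (1,1) and (0,0) give a loss of ~0;
# wrong predictions (1,0) and (0,1) blow up (towards ∞ without the clipping).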

Now comes the real question. How did we reach here? Is it pure intuition or are we missing something?

Let us look at the binary classification problem from a statistical point of view. The statistical approach revolves around identifying a known probability distribution to which the sample (whose class needs to be predicted) belongs, and determining the parameters of that probability distribution.

Hence, from a statistical perspective, we can consider a binary classification problem (where we try to classify each trial into one of 2 classes) to be one in which each of the ‘m’ samples follows a Bernoulli distribution.

To make that assumption, the problem should satisfy the 3 conditions of Bernoulli trials (which almost all binary classification problems do):
1. Each trial has two possible outcomes, in the language of reliability called success and failure.
2. The trials are independent. Intuitively, the outcome of one trial has no influence over the outcome of another trial.
3. On each trial, the probability of success is ‘p’ and the probability of failure is ‘1-p’ where p∈[0,1] is the success parameter of the process.

Equation for Bernoulli distribution
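Its probability mass function, for a single trial, is:

P(X = k) = p^k · (1 − p)^(1 − k),  where k ∈ {0, 1}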

Note: ‘k’ is the possible outcome and ‘p’ is the success probability (these correspond to ‘y’ and ‘ŷ’ respectively in the binary classification problem)

The above equation can also be represented as
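Using that correspondence (substituting y for k and ŷ for p, which is how I read the note above), the probability of a single sample’s label can be written as:

P(y) = ŷ^y · (1 − ŷ)^(1 − y)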

Once the assumption is made, the next step in the statistical approach is to find the parameters of the assumed probability distribution (each probability distribution has its own set of parameters) using a method called Maximum Likelihood Estimation. In the case of the Bernoulli distribution, the parameter is ‘p’ (or ŷ in our case), and the likelihood function is

Maximum Likelihood Estimation (for ‘m’ samples)
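For ‘m’ independent samples, the likelihood of the observed labels is the product of the individual Bernoulli probabilities:

L(p) = Πᵢ₌₁ᵐ p^yᵢ · (1 − p)^(1 − yᵢ)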

Thus the objective becomes finding the value of ‘p’ such that the above likelihood is maximized. Rewriting the equation with the analogous variables of the binary classification problem, the objective becomes finding the ŷ that maximizes the expression below.
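With ŷᵢ as the predicted probability for the i-th sample, that expression is:

Πᵢ₌₁ᵐ ŷᵢ^yᵢ · (1 − ŷᵢ)^(1 − yᵢ)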

Taking logarithms converts the product into a summation, and negating it turns the maximization into a minimization problem, which is easier to work with. The objective becomes minimizing:
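−Σᵢ₌₁ᵐ [ yᵢ · log(ŷᵢ) + (1 − yᵢ) · log(1 − ŷᵢ) ]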

To remove the effect of the sample size and make the result comparable across datasets, we divide the above function by the sample size ‘m’ (averaging over the ‘m’ samples):
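−(1/m) · Σᵢ₌₁ᵐ [ yᵢ · log(ŷᵢ) + (1 − yᵢ) · log(1 − ŷᵢ) ]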

Voila! We have derived the cross entropy cost function. The cross entropy cost function thus boils down to maximum likelihood estimation for the Bernoulli distribution, and minimizing the cost function is the same as maximizing the likelihood.
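As a quick numerical sanity check of that equivalence, here is a small sketch (the variable names and sample values are made up, and I am assuming scikit-learn is available only so its log_loss can serve as a reference):

import numpy as np
from sklearn.metrics import log_loss  # assumed available, used only as a reference

y = np.array([1, 0, 1, 1, 0])                 # actual labels
y_hat = np.array([0.9, 0.2, 0.7, 0.6, 0.1])   # predicted probabilities

# Cross entropy cost averaged over the m samples
cross_entropy = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Negative mean log-likelihood of a Bernoulli distribution with parameter ŷ
neg_log_likelihood = -np.mean(np.log(y_hat ** y * (1 - y_hat) ** (1 - y)))

print(cross_entropy, neg_log_likelihood, log_loss(y, y_hat))  # all three agree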

That ends the first part of ‘Probability and Maximum Likelihood in Disguise’. As always, it is okay to skip this side of cost functions, but it is quite satisfying to finally know how the magic happens :)


Liger

ML Engineer in the making. Have been a part of the data domain for the past 6 years.