
Optimization Techniques in Machine Learning

Optimization is one of the most important concepts used across AI, from machine learning to deep learning. It is important to understand the most basic optimization method, the gradient descent algorithm.

I am using my favorite example: house price vs. number of rooms. I have plotted various observations and a line that represents the trend in the observations. Which trend line best matches the given observations? We can agree that the line in the third image matches the trend in the observed values well.

Note: Please assume that the number of rooms can be fractional, because the continuous data is generated randomly.


But how can we say that the line in the third chart matches the observations best? A very good trick adopted by statisticians is to calculate an area: if the total area between the line and the observed points is the smallest, then the line is the best fit.

We can understand this based on the following image –


We can see that for the observation at 3.58 the price is around 1002.25. For the same observation the value predicted by first line is 1001.85 and the second line is 1002.00. The difference between Observed value and predicted value is called error.

                                     Error = Observed Value – Predicted Value

Notice that the error is larger for the first line, so the square built on that error is larger. The error is smaller for the second line, so the area of its square is smaller. Observations can fall on either side of the line, so the error can be positive or negative, but after squaring, the area is always positive. Now, sum up the areas of the squares at all the observations.

Based on this, we can conclude that the line that best fits the observations has the minimum total area. In other words, we have to minimize the sum of squares of errors to find the best-fit line.

We can put this in mathematical terms. The equation of the line we have considered so far is Y = mX + c. For an observation (X’, Y’), the value predicted by the line would be Ya = mX’ + c, the respective error would be (Y’ – Ya), and the square of the error would be (Y’ – Ya)^2. The sum of squares of errors for n observations is given by the following:

                                  Sum of squares of errors = Σ (i = 1 … n) ( Y’i – Yai )^2
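To make the sum concrete, here is a minimal numpy sketch that compares two candidate lines on the same observations. The data points and the function name `sum_of_squared_errors` are made up for illustration; they are not the values behind the charts.

```python
import numpy as np

def sum_of_squared_errors(x, y, m, c):
    """Sum of squared differences between observed y and the line m*x + c."""
    predicted = m * x + c      # value on the line for each observation
    errors = y - predicted     # observed value minus predicted value
    return float(np.sum(errors ** 2))

# Made-up observations (rooms, price), for illustration only.
rooms = np.array([1.0, 2.0, 3.0, 4.0])
price = np.array([3.1, 4.9, 7.2, 8.8])

sse_poor = sum_of_squared_errors(rooms, price, m=1.0, c=1.0)  # about 29.5
sse_good = sum_of_squared_errors(rooms, price, m=2.0, c=1.0)  # about 0.10
```

The second line has the much smaller sum, so under this criterion it is the better fit of the two.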

The sum of squares is a quadratic function of the parameters, so it has the same bowl shape as f(x) = x^2. If we plot a quadratic function, we get the following chart. The bottommost point gives the minimum value of x^2.


So, to find the best-fit line we have to find the m and c that minimize the sum of (Y’ – Ya)^2 over all the observations. The most popular method for finding the minimum of a function is gradient descent. It starts from a random point on the curve (here x^2) and iterates, moving downhill at each step, until it reaches a point where the function takes its minimum value.
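As a small illustration of the idea on the curve x^2 itself, the sketch below repeatedly steps against the derivative 2x. The function name, starting point, step size, and number of steps are all assumptions chosen for the example.

```python
def minimize_square(start, rate=0.1, steps=100):
    """Gradient descent on f(x) = x^2, whose derivative is 2x."""
    x = start
    for _ in range(steps):
        gradient = 2 * x          # slope of x^2 at the current point
        x = x - rate * gradient   # move a small step downhill
    return x

x_min = minimize_square(start=5.0)  # ends up very close to 0
```

Each step multiplies x by 0.8 here, so the iterate shrinks steadily toward the bottom of the curve at x = 0.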

A numpy implementation of gradient descent for linear regression and logistic regression is available at the link.

As said earlier, to find the best-fitting line we have to minimize the error, or loss. The first step is to define the loss function, which in our case is the sum of squares of errors. The next step is to find the parameters at which the loss function has its minimum value.

In the example, we start with slope (m) as 0.8 and intercept (c) as 0.1 and calculate the respective error (sum of squares) in iteration 1.

                     m = 0.8
                     c = 0.1

To continue with the next iteration, we have to find new values of the slope (m) and intercept (c). The new values are obtained from the partial derivatives of the error function.

The error function is $f(m, c) = \frac{1}{2N}\sum_{i=1}^{N} (y_i - (mx_i + c))^2$, so we have to find its partial derivatives with respect to m and c. The factor $\frac{1}{2N}$ only rescales the sum of squares; it does not move the minimum.

$$\begin{split}f'(m, c) =    \begin{bmatrix}      \frac{\partial f}{\partial m}\\      \frac{\partial f}{\partial c}\\     \end{bmatrix} =    \begin{bmatrix}      \frac{1}{N} \sum -x_i(y_i - (mx_i + c)) \\        \frac{1}{N} \sum -(y_i - (mx_i + c)) \\     \end{bmatrix}\end{split}$$
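One way to gain confidence in the two partial derivatives is to compare them with a numerical estimate. The sketch below assumes the loss is the sum of squared errors divided by 2N, which is the form the gradient above corresponds to; the data and the helper names `loss` and `gradient` are made up for illustration.

```python
import numpy as np

def loss(x, y, m, c):
    """Sum of squared errors divided by 2N (assumed form of the loss)."""
    return np.sum((y - (m * x + c)) ** 2) / (2 * len(x))

def gradient(x, y, m, c):
    """Analytic partial derivatives of the loss with respect to m and c."""
    residual = y - (m * x + c)
    delta_m = np.mean(-x * residual)
    delta_c = np.mean(-residual)
    return delta_m, delta_c

# Made-up observations, evaluated at the starting point m = 0.8, c = 0.1.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])
m, c = 0.8, 0.1

delta_m, delta_c = gradient(x, y, m, c)

# Central finite differences as an independent check.
eps = 1e-6
numeric_m = (loss(x, y, m + eps, c) - loss(x, y, m - eps, c)) / (2 * eps)
numeric_c = (loss(x, y, m, c + eps) - loss(x, y, m, c - eps)) / (2 * eps)
```

If the analytic and numeric values agree to several decimal places, the derivative formulas and the code are consistent with each other.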

In every iteration, we find the partial derivatives delta_m and delta_c and then use them to compute the new m and c. The following lines in the code obtain the new values of m and c, where self.r is the learning rate (step size).

self.m = self.m - self.r * delta_m
self.c = self.c - self.r * delta_c

If we plot slope and intercept against error, we would get a chart similar to the following. The idea is to find the m and c where the error is minimum.
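Putting the update rule into a loop gives a complete, if minimal, training routine. The class below is a hypothetical sketch, not the linked implementation: only the attribute names m, c and r follow the update lines quoted in the text, while the class name, the learning rate of 0.05, the iteration count, and the data are assumptions.

```python
import numpy as np

class LinearRegressionGD:
    """Hypothetical sketch; only m, c and r mirror the names used in the text."""

    def __init__(self, m=0.8, c=0.1, r=0.05):
        self.m = m   # slope, starting value from the example
        self.c = c   # intercept, starting value from the example
        self.r = r   # learning rate (assumed value)

    def fit(self, x, y, iterations=5000):
        for _ in range(iterations):
            residual = y - (self.m * x + self.c)
            delta_m = np.mean(-x * residual)   # partial derivative w.r.t. m
            delta_c = np.mean(-residual)       # partial derivative w.r.t. c
            self.m = self.m - self.r * delta_m
            self.c = self.c - self.r * delta_c
        return self

# Made-up data lying exactly on the line y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0
model = LinearRegressionGD().fit(x, y)
```

On this noise-free data the fitted m and c should end up very close to 2 and 1. A learning rate that is too large makes the updates diverge, so r usually needs some tuning for real data.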


Understanding classification using Naive Bayes Classifier.

Classification is a supervised machine learning technique in which objects are categorized into buckets. The most common example given is the classification of fruits in a given set. It can be a set of images of fruits, real fruits in a basket, or a lot of fruits on an assembly line.

The most intuitive way to classify objects is to identify their properties and say that objects with similar properties belong to the same class. The same principle is used to classify objects in statistics and machine learning, but in a more formal way. The properties of the objects are converted into numerical values and given as input to a function that produces the class as output.

                                      Y = f(X1, X2 … Xn)

If the function is a linear equation, it is said to be a linear classifier; otherwise it is said to be a non-linear classifier. Linear models (equations) are easy to interpret and mathematically less complex, and they use relatively few computational resources when working on large data sets.

Selecting the properties (predictors) and identifying the function (model) is the hardest part of machine learning. A number of methodologies have been developed for feature selection and model building. In the case of regression (where the output is numeric), linear regression is the simplest model. In the case of classification (where the output is categorical), logistic regression and Bayesian classifiers are the simplest.
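As a minimal sketch of what a linear classifier looks like in code, the function below takes a weighted sum of the features and assigns class 1 when the sum is positive. The weights, bias, and sample points are made-up values for illustration, not the result of any training.

```python
import numpy as np

def linear_classifier(x, weights, bias):
    """Return class 1 if the linear score is positive, otherwise class 0."""
    score = np.dot(weights, x) + bias
    return int(score > 0)

# Hypothetical two-feature example with hand-picked parameters.
weights = np.array([1.0, -1.0])
bias = 0.0

class_a = linear_classifier(np.array([3.0, 1.0]), weights, bias)  # score 2  -> 1
class_b = linear_classifier(np.array([1.0, 3.0]), weights, bias)  # score -2 -> 0
```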

Bayesian Classification

The fundamental principle of Bayesian classification is Bayes' theorem. In the 18th century, the English mathematician Thomas Bayes showed that the probability of an event given some evidence is proportional to the product of the likelihood of that evidence and the prior probability of the event. The theorem is built on conditional probability.

                                                    P(A|B) = P(B|A).P(A) / P(B)

Bayes' theorem is a little hard to grasp at first, but it can be understood with a little effort. You can refer to this link to understand more. P(A|B) is called the posterior probability, P(B|A) is called the likelihood, P(A) is the prior probability, and P(B) is the evidence. P(B) is the same for every class being compared, so it can be treated as a constant. The relationship can therefore be rewritten in terms of proportionality (∝).

                             posterior probability ∝  (likelihood) . (prior probability)

One real-life example of this relation is the chance of rain on a given day, let's say some day in November. The prior probability of rain would be low, because we already know from experience that it rarely rains in November. Features like atmospheric pressure, temperature, and the location of the city (near the seashore) make up the likelihood. If we have a high likelihood, the chance of rain would be high, even in the off season.

This proportionality relation can be harnessed to classify objects based on their properties, and it can be used in a variety of use cases. The most famous use cases are email spam identification, OCR (optical character recognition), image classification, fraud detection in the insurance and banking industries, text classification, and customer segmentation.

Here we will look at handwritten digit identification in detail and understand how Bayes' theorem is used to identify written digits.
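The rain example can be worked through numerically. The sketch below multiplies likelihood by prior for each outcome and then divides by their sum, which plays the role of the evidence P(B). All the probabilities are made-up values for illustration, and the function name `posterior` is an assumption.

```python
def posterior(likelihoods, priors):
    """Posterior for each class: likelihood * prior, normalized by the evidence."""
    unnormalized = [l * p for l, p in zip(likelihoods, priors)]
    evidence = sum(unnormalized)   # P(B), identical for every class
    return [u / evidence for u in unnormalized]

# Classes: [rain, no rain]. Made-up numbers for a day in November.
priors = [0.1, 0.9]         # rain is rare in November
likelihoods = [0.8, 0.05]   # today's pressure and temperature strongly suggest rain

post = posterior(likelihoods, priors)   # approximately [0.64, 0.36]
```

Even with a prior of only 0.1, the high likelihood pushes the posterior probability of rain to about 0.64.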

Feature creation

The first step of any machine learning algorithm is feature selection, creation, or extraction. We will do the same for the handwritten digits. A handwritten digit can be captured as an image and scaled to a specific dimension, say 28×28 pixels. Below are 4 sample images of the digit 5. These digits are taken from the MNIST computer vision data set of handwritten digits.


The 28×28 pixels can be stored in a vector of 784 elements. If we have 100 sample (training) images, we can store them in a 100×784 2D array. Each element of the vector is a float with a value ranging from 0 to 255. The value of a pixel is equivalent to its intensity. In the case of the digit 5, the values of pixels 0 and 783 would be 0, and the pixels at 50-75 would have some value greater than 0.

To identify a given digit, we calculate the posterior probability of each of the digits 0 to 9. The given digit belongs to the class of whichever digit has the highest posterior probability.
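The flattening step can be sketched with numpy's `reshape`. The random array below is only a stand-in for real MNIST images, so the pixel values carry no meaning; the variable names are assumptions.

```python
import numpy as np

# Stand-in for 100 grayscale images of 28x28 pixels with values 0 to 255.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(100, 28, 28)).astype(float)

# Flatten each 28x28 image into one row of 784 values.
X = images.reshape(100, 784)

# The pixel at row 5, column 7 of image 3 lands at index 28 * 5 + 7.
same_pixel = X[3, 28 * 5 + 7] == images[3, 5, 7]
```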

Calculate Likelihood of the Digit

In Bayes' theorem, P(B|A) is called the likelihood, or the probability of B when A has already occurred. We can calculate the likelihood of each pixel in the sample data set for each digit, i.e. P(Xi | Digit = 0), P(Xi | Digit = 1) … P(Xi | Digit = 9). In total we would have 784×10 likelihoods, 784 for each of the digits (0-9). If the pixels are independent of each other (this is the naive assumption, one of the important assumptions of naive Bayes classification; refer to the wiki for details), we can rewrite the likelihood of the whole image X for the digit 1 as the product of the probabilities of each pixel.

        P(X | Digit = 1 ) = P(X0 | Digit = 1) * P(X1 | Digit = 1) *….* P(X783 | Digit = 1)

The right-hand side values P(Xi | Digit = 0…9) can be calculated in a variety of ways. One simple method is to count how often a value occurs in a pixel and divide by the number of samples of that digit. The value of a pixel could be anything in the range 0 to 255, so counting over whole images would mean up to 256^784 possible combinations. With thousands of training samples, the counting method is computationally expensive.

Another way to calculate P(Xi | Digit = 0…9) is to use a PDF (probability density function). The formal definition can be looked up on the wiki. In simple terms, the PDF of a continuous variable gives the relative likelihood of each value, and we use that density in place of the probability of the value occurring. Note that, unlike a probability, a density is not limited to the range 0 to 1.

For a given digit y (0 to 9), we can calculate the mean ($\mu_y$) and variance ($\sigma^2_y$) of each pixel. The PDF of pixel $x_i$ of the test digit can then be given in terms of that pixel's mean and variance using the following Gaussian formula.

$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma^2_y}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma^2_y}\right)$$
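The Gaussian density of a single pixel value can be written as a short function. This is a sketch using only the standard library; the function name `gaussian_pdf` is an assumption.

```python
import math

def gaussian_pdf(x, mean, variance):
    """Gaussian density of x for the given mean and variance (variance > 0)."""
    exponent = -((x - mean) ** 2) / (2 * variance)
    return math.exp(exponent) / math.sqrt(2 * math.pi * variance)

peak = gaussian_pdf(0.0, mean=0.0, variance=1.0)     # about 0.3989
narrow = gaussian_pdf(0.0, mean=0.0, variance=0.01)  # about 3.989
```

The second call shows that a density can be larger than 1 when the variance is small, which is why it is treated as a relative score and not as a probability in the strict sense.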


Pixel 0 is 0 in every sample of the digit 5, so its mean and variance are both 0, and the PDF P(X0 | Digit = 5) results in a divide-by-zero error. This is a classic error in naive Bayes classification. We can overcome it by ignoring this pixel or by assigning P(X0 | Digit = 5) = 1. Assigning 1 does not change the overall likelihood, because the likelihood is a product. In this way, we can calculate the likelihood of the digit 5.

       P(X | Digit = 5 ) = P(X0 | Digit = 5) * P(X1 | Digit = 5) *….* P(X783 | Digit = 5)

Similarly, we can calculate the likelihood of all the digits( 0 – 9).
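The zero-variance workaround can be sketched as follows: pixels whose variance is 0 are left out of the product, which is the same as giving them a factor of 1. The three-pixel vectors are made-up values, and the function name `likelihood` is an assumption.

```python
import numpy as np

def likelihood(x, mean, variance):
    """Product of per-pixel Gaussian densities; zero-variance pixels count as 1."""
    usable = variance > 0                  # skip pixels that would divide by zero
    diff = x[usable] - mean[usable]
    var = variance[usable]
    densities = np.exp(-diff ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return float(np.prod(densities))

# Three made-up "pixels"; the first has zero variance, like pixel 0 of digit 5.
x = np.array([0.0, 1.0, 2.0])
mean = np.array([0.0, 1.0, 2.0])
variance = np.array([0.0, 1.0, 1.0])

value = likelihood(x, mean, variance)   # two factors of 1/sqrt(2*pi)
```

With 784 pixels, a product of many small densities can underflow to 0 in floating point, so implementations commonly sum the logarithms of the densities instead of multiplying them.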

Prior probability of the Digits

P(A) is called the prior probability. For all the digits we can calculate P(Digit = 0), P(Digit = 1) … P(Digit = 9). In the case of 100 samples, this would be equivalent to

                                                     P(Digit = 1) = No of Digit ‘1’ / 100
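Computing the priors amounts to counting labels. The sketch below uses `numpy.bincount` on a made-up list of ten labels, so the divisor is 10 where the text uses 100.

```python
import numpy as np

# Made-up training labels; a real data set would have one label per image.
labels = np.array([0, 1, 1, 2, 1, 0, 2, 2, 2, 1])

# Count how often each digit 0..9 occurs, then divide by the number of samples.
counts = np.bincount(labels, minlength=10)
priors = counts / len(labels)
```

Digits that never appear in the training labels get a prior of 0 under this scheme.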

Predicting the Digit based on features

So far we know how to calculate the likelihood and prior probability of all the digits. Now, suppose a new image of a test digit is given. This test digit can be scaled to a 28×28 pixel image and stored in a 784-element vector. We perform the following operation.

      For each Digit in 0 to 9
          For each Pixel in 0 to 783
              Calculate mean and variance of the pixel for the digit.
          Calculate prior probability of the Digit.

This operation gives 10 prior probabilities and (10 * 784) means and variances. We can then calculate the posterior probability of the test digit for each of the digits 0 to 9 and compare them to identify the highest. The test digit is assigned the label of the digit with the highest posterior probability.
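The whole procedure can be sketched end to end in numpy. This is a minimal illustration under stated assumptions, not the implementation on the GitHub page: the function names `fit` and `predict` are hypothetical, the data is synthetic with two classes and four features instead of ten digits and 784 pixels, and the score is computed as a sum of logarithms instead of a product to avoid floating-point underflow.

```python
import numpy as np

def fit(X, labels, n_classes):
    """Per-class mean and variance of every feature, plus class priors."""
    n_features = X.shape[1]
    means = np.zeros((n_classes, n_features))
    variances = np.zeros((n_classes, n_features))
    priors = np.zeros(n_classes)
    for digit in range(n_classes):
        rows = X[labels == digit]
        means[digit] = rows.mean(axis=0)
        variances[digit] = rows.var(axis=0)
        priors[digit] = len(rows) / len(X)
    return means, variances, priors

def predict(x, means, variances, priors):
    """Class with the highest log(prior) + sum of log Gaussian densities."""
    scores = np.zeros(len(priors))
    for digit in range(len(priors)):
        usable = variances[digit] > 0   # zero-variance features contribute log(1) = 0
        var = variances[digit][usable]
        diff = x[usable] - means[digit][usable]
        log_pdf = -0.5 * np.log(2 * np.pi * var) - diff ** 2 / (2 * var)
        scores[digit] = np.log(priors[digit]) + log_pdf.sum()
    return int(np.argmax(scores))

# Synthetic data: class 0 centered at 0, class 1 centered at 10.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 4)),
               rng.normal(10.0, 1.0, size=(50, 4))])
labels = np.array([0] * 50 + [1] * 50)

means, variances, priors = fit(X, labels, n_classes=2)
label_near_0 = predict(np.zeros(4), means, variances, priors)
label_near_10 = predict(np.full(4, 10.0), means, variances, priors)
```

Because the logarithm is monotonic, picking the highest log score selects the same class as picking the highest posterior probability.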

Bayesian classification is also called a probabilistic or generative classification model. It is a simple yet powerful model. Even with the strong naive assumption, the model can reach good accuracy with a small training data set. This is especially useful in the field of natural language processing.

The implementation of the algorithm using Python and numpy is available on the GitHub page here.