Classification is a supervised machine learning technique in which objects are sorted into categories (classes). The most common example is classifying the fruits in a given set, which could be a set of fruit images, real fruits in a basket, or fruits moving along an assembly line.
The most intuitive way to classify objects is to identify their properties and say that objects with similar properties belong to the same class. Statistics and machine learning use the same principle, but more formally: the properties of an object are converted into numerical values and given as input to a function, which produces the class as output.
Y = f(X1, X2 … Xn)
If the function is a linear equation, it is called a linear classifier; otherwise it is a non-linear classifier. Linear models are easy to interpret, mathematically less complex, and use relatively few computational resources on large data sets.
Selecting the properties (predictors) and identifying the function (model) is the hardest part of machine learning. A number of methodologies have been developed for feature selection and model building. For regression (where the output is numeric), linear regression is the simplest model. For classification (where the output is categorical), logistic regression and Bayesian classifiers are among the simplest.
The fundamental principle of Bayesian classification is Bayes' theorem. In the 18th century, the English mathematician Thomas Bayes showed how the probability of an event A, given that another event B has been observed, can be computed from the likelihood of B given A and the prior probability of A. The quantity P(A|B) is a conditional probability.
P(A|B) = P(B|A).P(A) / P(B)
Bayes' theorem is a little hard to grasp at first, but it can be understood with a little effort. P(A|B) is called the posterior probability, P(B|A) the likelihood, P(A) the prior probability, and P(B) the evidence. When we compare classes for the same observation, P(B) is the same for every class, so it can be treated as a constant and the relationship can be rewritten as a proportionality (∝).
posterior probability ∝ (likelihood) . (prior probability)
A real-life example of this relation is the chance of rain on a given day, say a day in November. The prior probability of rain is low, because we know from experience that it rarely rains in November. Features such as atmospheric pressure, temperature, and the location of the city (for example, near the seashore) determine the likelihood. If the likelihood is high, the chance of rain is high, even in the off season.
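The rain example can be made concrete with a few numbers. The sketch below is only an illustration: the probabilities and the variable names are hypothetical and are not taken from any real weather data. It computes the evidence P(B) with the law of total probability and then applies Bayes' theorem.

```python
def posterior(prior, likelihood, evidence):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence

# Hypothetical numbers, chosen only for illustration.
p_rain = 0.05               # prior: P(rain) on a November day
p_signs_given_rain = 0.90   # likelihood: P(low pressure | rain)
p_signs_given_dry = 0.10    # P(low pressure | no rain)

# Evidence P(B): total probability of observing low pressure.
p_signs = p_signs_given_rain * p_rain + p_signs_given_dry * (1 - p_rain)

p_rain_given_signs = posterior(p_rain, p_signs_given_rain, p_signs)
print(round(p_rain_given_signs, 3))  # 0.321
```

With these assumed numbers, a strong likelihood lifts the chance of rain from 5% to about 32%, even though the prior is low.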
This proportionality can be harnessed to classify objects based on their properties, and it applies to a variety of use cases. Well-known ones include email spam identification, optical character recognition (OCR), image classification, fraud detection in the insurance and banking industries, text classification, and customer segmentation.
Here we will look at handwritten digit identification in detail and see how Bayes' theorem is used to identify written digits.
The first step of any machine learning algorithm is to obtain features, whether by selection, creation, or extraction. We will do the same for handwritten digits. A handwritten digit can be captured as an image and scaled to a fixed size, say 28×28 pixels. Below are 4 sample images of the digit 5. These digits are taken from the MNIST computer vision data set of handwritten digits.
A 28×28 pixel image can be stored as a vector of 784 elements, and 100 training samples can be stored in a 100×784 2D array. Each element of the vector is stored as a float with a value from 0 to 255, which represents the intensity of the pixel. For the digit 5, pixels 0 and 783 would be 0, while pixels 50 to 75 would have some value greater than 0.
To identify a given digit, we calculate the posterior probability of every digit from 0 to 9. The given digit is assigned to the class with the highest posterior probability.
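The storage layout described above can be sketched with NumPy. The images here are random stand-ins rather than real MNIST digits, and the names `images` and `X` are chosen for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for 100 grayscale images of 28x28 pixels
# (random intensities, NOT real MNIST data).
images = rng.integers(0, 256, size=(100, 28, 28))

# Flatten each 28x28 image into a 784-element row vector of floats.
X = images.reshape(len(images), -1).astype(np.float64)

print(X.shape)  # (100, 784)
```

Row `k` of `X` holds the pixels of image `k` in row-major order, so pixel 0 is the top-left corner and pixel 783 the bottom-right corner.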
Calculate Likelihood of the Digit
In Bayes' theorem, P(B|A) is called the likelihood: the probability of B given that A has already occurred. We can calculate the likelihood of each pixel from the sample data set, i.e. P(Xi | Digit = 0), P(Xi | Digit = 1) … P(Xi | Digit = 9). That gives 784×10 per-pixel likelihoods in total, 784 for each of the digits 0 to 9. If we assume that the pixels are independent of each other given the digit (the "naive" assumption, which is the defining assumption of naive Bayes classification; see Wikipedia for details), we can write the likelihood for the digit 1 as the product of the per-pixel probabilities.
P(X0, X1, …, X783 | Digit = 1) = P(X0 | Digit = 1) * P(X1 | Digit = 1) * … * P(X783 | Digit = 1)
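One practical detail about this product: multiplying 784 numbers that are each smaller than 1 usually underflows to zero in floating-point arithmetic. A common workaround, which is not part of the description above, is to add the logarithms instead. The sketch below uses made-up per-pixel probabilities to show the effect.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-pixel probabilities P(Xi | Digit = 1) for one image.
pixel_probs = rng.uniform(0.01, 0.5, size=784)

# Direct product: too small to represent as a float, so it becomes 0.0.
likelihood = np.prod(pixel_probs)

# Same quantity on the log scale: a finite negative number.
log_likelihood = np.sum(np.log(pixel_probs))

print(likelihood, log_likelihood)
```

Because the logarithm is monotonic, comparing log-likelihoods across digits picks the same winner as comparing the products would.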
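The per-pixel probabilities P(Xi | Digit = 0…9) on the right-hand side can be calculated in a variety of ways. One simple method is counting: for each digit, count how often each pixel takes each value and divide by the number of samples of that digit. A pixel can take any of 256 values (0 to 255), so even under the independence assumption this means 256 × 784 counts per digit (without it, there would be 256^784 possible images to count). Most of these counts would be zero unless the training set is very large, and with thousands of training samples the counting method becomes computationally expensive.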
Another way to calculate P(Xi | Digit = 0…9) is to use a probability density function (PDF). The formal definition can be found on Wikipedia. In simple terms, the PDF of a continuous variable gives the relative likelihood of each value: values where the density is higher are more probable. Strictly speaking, a density is not a probability (it can exceed 1), but it can be used in place of the per-pixel probability when comparing digits.
For a given digit y (0 to 9), we can calculate the mean (μy) and variance (σy²) of each pixel over the training samples of that digit. The PDF of pixel xi of the test digit can then be written in terms of that pixel's mean and variance using the Gaussian formula:
P(xi | y) = 1 / sqrt(2π σy²) * exp( −(xi − μy)² / (2 σy²) )
For the digit 5, pixel 0 is 0 in every sample, so its mean and variance are both 0 and the PDF P(X0 | Digit = 5) results in a divide-by-zero error. This is a classic error in naive Bayes classification. We can overcome it by ignoring this pixel or by assigning P(X0 | Digit = 5) = 1. Assigning 1 does not change the overall likelihood, because the likelihood is a product. In this way, we can calculate the likelihood of the digit 5.
P(X0, X1, …, X783 | Digit = 5) = P(X0 | Digit = 5) * P(X1 | Digit = 5) * … * P(X783 | Digit = 5)
Similarly, we can calculate the likelihood of all the digits (0 to 9).
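The per-pixel Gaussian density and the zero-variance workaround described above can be sketched as follows. The function name `gaussian_pixel_likelihoods` and the three-pixel example values are hypothetical; the function assigns 1 to any pixel whose variance is 0, as suggested in the text.

```python
import numpy as np

def gaussian_pixel_likelihoods(x, mean, var):
    """Gaussian density of each pixel value, given per-pixel mean and variance.

    Pixels with zero variance get the value 1, so they leave the product unchanged.
    """
    x, mean, var = (np.asarray(a, dtype=np.float64) for a in (x, mean, var))
    out = np.ones_like(x)
    ok = var > 0  # only evaluate the formula where it is defined
    out[ok] = (np.exp(-(x[ok] - mean[ok]) ** 2 / (2 * var[ok]))
               / np.sqrt(2 * np.pi * var[ok]))
    return out

# Tiny made-up example with 3 "pixels"; the first one never varies.
x = [0.0, 100.0, 110.0]
mean = [0.0, 100.0, 100.0]
var = [0.0, 400.0, 400.0]

probs = gaussian_pixel_likelihoods(x, mean, var)
likelihood = probs.prod()  # product over pixels
print(probs[0])  # 1.0
```

The third pixel lies further from its mean than the second, so its density is lower and it pulls the overall likelihood down.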
Prior probability of the Digits
P(A) is called the prior probability. For the digits, we calculate P(Digit = 0), P(Digit = 1) … P(Digit = 9). With 100 samples, this is
P(Digit = 1) = (number of samples of digit 1) / 100
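Counting the labels gives all ten priors at once. This is a minimal sketch that uses randomly generated labels in place of a real training set; `y`, `counts`, and `priors` are names chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical labels (digits 0 to 9) for 100 training samples.
y = rng.integers(0, 10, size=100)

# counts[d] = number of samples labelled d; minlength covers absent digits.
counts = np.bincount(y, minlength=10)

# Prior of each digit = its share of the training set.
priors = counts / len(y)

print(priors[1])  # P(Digit = 1)
```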
Predicting the Digit based on features
So far we know how to calculate the likelihood and prior probability of all the digits. Now suppose a new image of a test digit is given. This test digit can be scaled to a 28×28 pixel image and stored as a 784-element vector. We first perform the following operation on the training data.
For each Digit in 0 to 9
    For each Pixel in 0 to 783
        Calculate mean and variance of the pixel for the digit.
    Calculate prior probability of the Digit.
This operation gives 10 prior probabilities and 10 × 784 means and variances. Using them, we can calculate the posterior probability of each digit from 0 to 9 for the test image and pick the highest. The test digit is assigned the label of the digit with the highest posterior probability.
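The whole procedure can be put together in a short NumPy sketch. This is not the implementation referred to in this article: the function names `fit` and `predict` and the synthetic data are assumptions made for illustration, and the code works with log-probabilities (a common practice that avoids numerical underflow) instead of the raw product. It also assumes that every digit appears at least once in the training set.

```python
import numpy as np

def fit(X, y, n_classes=10):
    """Per-digit, per-pixel mean and variance, plus the prior of each digit."""
    means = np.zeros((n_classes, X.shape[1]))
    variances = np.zeros((n_classes, X.shape[1]))
    priors = np.zeros(n_classes)
    for d in range(n_classes):
        Xd = X[y == d]                  # training samples of digit d
        means[d] = Xd.mean(axis=0)
        variances[d] = Xd.var(axis=0)
        priors[d] = len(Xd) / len(X)
    return means, variances, priors

def predict(x, means, variances, priors):
    """Return the digit with the highest log(prior) + sum of log densities."""
    scores = np.empty(len(priors))
    for d in range(len(priors)):
        ok = variances[d] > 0           # skip zero-variance pixels (same as P = 1)
        diff = x[ok] - means[d, ok]
        log_pdf = (-0.5 * np.log(2 * np.pi * variances[d, ok])
                   - diff ** 2 / (2 * variances[d, ok]))
        scores[d] = np.log(priors[d]) + log_pdf.sum()
    return int(np.argmax(scores))

# Synthetic stand-in data (NOT MNIST): one random template per digit plus noise.
rng = np.random.default_rng(0)
templates = rng.integers(0, 256, size=(10, 784)).astype(np.float64)
y = np.repeat(np.arange(10), 20)        # 20 samples per digit
X = np.clip(templates[y] + rng.normal(0, 10, size=(200, 784)), 0, 255)
X[:, 0] = 0                             # corner pixel is always blank

means, variances, priors = fit(X, y)

# A new noisy image generated from the template of digit 7.
x_new = np.clip(templates[7] + rng.normal(0, 10, size=784), 0, 255)
x_new[0] = 0
print(predict(x_new, means, variances, priors))  # 7
```

On real handwritten digits the classes overlap far more than in this synthetic data, so accuracy will be lower than this toy example suggests.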
Bayesian classification is a probabilistic, generative classification model. It is one of the simplest models, yet a powerful one. Even with the strong naive assumption, it can reach striking accuracy with a small training data set. It is especially useful in the field of natural language processing.
An implementation of the algorithm using Python and NumPy is available on the GitHub page here.