Optimization is the single most important concept used across AI, machine learning and deep learning. It is important to understand the most basic optimization method, which is the gradient descent algorithm.
I am using my favorite example: house price vs. number of rooms. I have plotted various observations, along with a line that represents the trend in the observations. Which trend line best matches the given observations? We can agree that the line in the third image matches the trend in the observed values well.
Note: Please assume that the number of rooms can be fractional, because the continuous data is generated randomly.
But how can we say that the line in the third chart matches the observations best? A very good trick adopted by statisticians is to calculate an area: if the total area between the line and the observed points is small, the line fits well, and the line with the smallest total area is the best fit.
We can understand this based on the following image –
We can see that for the observation at 3.58 the price is around 1002.25. For the same observation, the value predicted by the first line is 1001.85 and the value predicted by the second line is 1002.00. The difference between the observed value and the predicted value is called the error.
Error = Observed Value – Predicted Value
Notice that the error is larger for the first line (1002.25 - 1001.85 = 0.40), so the square built on that error is large. The error is smaller for the second line (1002.25 - 1002.00 = 0.25), so the area of its square is small. Observations can fall on either side of the line, so the error can be positive or negative, but once it is squared the area is always positive. Now sum the areas of the squares at all the observations.
Based on this we can conclude that the line that best fits the observations has the minimum total area. In other words, we have to minimize the sum of squared errors in order to find the best-fit line.
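The comparison of the two squares can be checked with a few lines of Python. This is only a small sketch using the three values quoted above for the observation at 3.58; the function name squared_error is made up for illustration.

```python
# Values quoted in the text for the observation at 3.58 rooms.
observed = 1002.25
predicted_first_line = 1001.85
predicted_second_line = 1002.00


def squared_error(observed_value, predicted_value):
    """Area of the square whose side is the error (observed - predicted)."""
    error = observed_value - predicted_value
    return error ** 2


area_first = squared_error(observed, predicted_first_line)
area_second = squared_error(observed, predicted_second_line)

print(f"first line:  {area_first:.4f}")
print(f"second line: {area_second:.4f}")
```

The square for the first line comes out at about 0.16 and the square for the second line at about 0.0625, so the second line is closer to this observation.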
We can put this into mathematical terms. The equation of the line we have considered so far is Y = mX + c. For an observation (X’, Y’), the value predicted by the line would be Ya = mX’ + c, the respective error would be (Y’ - Ya) and the square of the error would be (Y’ - Ya)^2. The sum of squared errors for n observations is given by the following:
Sum of squares of errors = Σ (i = 1 to n) (Y’i - Yai)^2, where Yai = mX’i + c
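As a rough sketch, the sum above can be written as a short Python function. The function name and the four observations are made up for illustration (the points are chosen to lie exactly on Y = 2X + 1); only the starting values m = 0.8 and c = 0.1 are taken from the example later in this post.

```python
def sum_of_squared_errors(m, c, xs, ys):
    """Sum over all observations of (observed - predicted)^2 for the line m*x + c."""
    total = 0.0
    for x, y in zip(xs, ys):
        predicted = m * x + c
        total += (y - predicted) ** 2
    return total


# Hypothetical observations, not the data behind the charts in this post.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # these points lie exactly on Y = 2X + 1

sse_exact = sum_of_squared_errors(2.0, 1.0, xs, ys)  # line through every point
sse_off = sum_of_squared_errors(0.8, 0.1, xs, ys)    # a line that fits poorly

print(sse_exact, sse_off)
```

The line that passes through every point has a sum of zero, while the poorly fitting line has a much larger sum, which is the quantity we want to drive down.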
The sum of squares is a quadratic function of m and c, and it behaves like the simple quadratic f(x) = x^2. If we plot a quadratic function we get the following chart. The bottom-most point gives the minimum value of x^2.
So, to find the best-fit line we have to find the minimum value of the sum of (Y’ - Ya)^2 over all the observations. A popular method for finding the minimum of a function is “Gradient Descent”. It starts from a random point on the curve (here x^2) and iterates, moving a small step downhill each time, until it reaches a point where the function attains its minimum value.
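To make the idea concrete, here is a minimal sketch of gradient descent on f(x) = x^2. The function names, the learning rate of 0.1 and the 100 iterations are assumptions chosen for illustration, and the starting point is fixed at 5.0 instead of being random so that the run is repeatable.

```python
def f(x):
    """The quadratic curve from the text."""
    return x ** 2


def f_prime(x):
    """Derivative (slope) of x^2."""
    return 2 * x


def gradient_descent_1d(start, learning_rate=0.1, iterations=100):
    """Repeatedly step against the slope, starting from `start`."""
    x = start
    for _ in range(iterations):
        x = x - learning_rate * f_prime(x)
    return x


x_min = gradient_descent_1d(start=5.0)
print(x_min, f(x_min))
```

Each step multiplies x by 0.8 here, so x shrinks toward 0, which is where x^2 has its minimum.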
The numpy implementation of Gradient descent for Linear Regression and Logistic Regression is at the link.
As said earlier, in order to find the best-fitting line we have to minimize the error, or loss. The first step is to choose the loss function, which in our case is the sum of squared errors. The next step is to find the parameters at which the loss function has its minimum value.
In the example we started with slope (m) as 0.8 and intercept (c) as 0.1 and calculated the respective error (sum of squares) in iteration 1.
m = 0.8, c = 0.1
To continue with the next iteration we have to find new values of the slope (m) and intercept (c). The new values are obtained using the partial derivatives of the error function.
The error function is the sum of squared errors, E = Σ (i = 1 to n) (Y’i - (mX’i + c))^2, so we have to find its partial derivatives with respect to m and c.
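Assuming the error function is the sum of squared errors defined earlier, the two partial derivatives can be worked out with the chain rule as shown below. The symbol E is introduced here for illustration. The linked implementation may scale the error by a constant such as 1/n, which changes the derivatives only by that same constant.

```latex
\begin{align*}
E(m, c) &= \sum_{i=1}^{n} \left( Y'_i - (m X'_i + c) \right)^2 \\
\frac{\partial E}{\partial m} &= -2 \sum_{i=1}^{n} X'_i \left( Y'_i - (m X'_i + c) \right) \\
\frac{\partial E}{\partial c} &= -2 \sum_{i=1}^{n} \left( Y'_i - (m X'_i + c) \right)
\end{align*}
```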
In every iteration we compute the partial derivatives delta_m and delta_c and then use them to update m and c. The following lines in the code compute the new values of m and c, where self.r is the learning rate (the step size):
    self.m = self.m - self.r * delta_m
    self.c = self.c - self.r * delta_c
If we plot the slope and intercept against the error, we get a chart similar to the following. The idea is to find the m and c where the error is minimum.
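The update rule above can be put into a small class to see the whole loop run. This is a sketch under stated assumptions and not the linked implementation: the class name, the method names, the data, the learning rate of 0.01 and the 2000 iterations are all hypothetical. Only self.m, self.c, self.r, the two update lines and the starting values 0.8 and 0.1 come from the text.

```python
class LinearRegressionGD:
    """Hypothetical class wrapping the update rule for m and c."""

    def __init__(self, m=0.8, c=0.1, r=0.01):
        self.m = m  # slope, starting value from the text
        self.c = c  # intercept, starting value from the text
        self.r = r  # learning rate (assumed value)

    def error(self, xs, ys):
        """Sum of squared errors for the current m and c."""
        return sum((y - (self.m * x + self.c)) ** 2 for x, y in zip(xs, ys))

    def step(self, xs, ys):
        """One iteration: compute both partial derivatives, then update."""
        residuals = [y - (self.m * x + self.c) for x, y in zip(xs, ys)]
        delta_m = -2.0 * sum(x * e for x, e in zip(xs, residuals))
        delta_c = -2.0 * sum(residuals)
        self.m = self.m - self.r * delta_m
        self.c = self.c - self.r * delta_c

    def fit(self, xs, ys, iterations=2000):
        for _ in range(iterations):
            self.step(xs, ys)
        return self


# Hypothetical observations lying exactly on Y = 2X + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

model = LinearRegressionGD()
error_before = model.error(xs, ys)
model.fit(xs, ys)
error_after = model.error(xs, ys)

print(model.m, model.c, error_before, error_after)
```

On this data the slope and intercept move from 0.8 and 0.1 toward 2 and 1, and the error falls toward zero. A learning rate that is too large makes the updates overshoot and diverge, so the value of r usually needs some tuning for real data.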