This post introduces linear regression to engineers. The technique is well understood and widely used by data scientists and statisticians, but with the arrival of big data technologies and the rise of data science, it has become important for engineers to understand it too.
Regression is a supervised machine learning technique used to predict or forecast a dependent entity that takes continuous values. Here I will use the pandas, scikit-learn and statsmodels libraries to walk through basic regression analysis.
Basic Terminology and Loading Data in a DataFrame
A DataFrame is an in-memory, two-dimensional, size-mutable and potentially heterogeneous tabular data structure with labeled axes. You can find more about data frame here.
First I would like to explain the terminology. The following terms are the most important before we dive in.
- Index Column
In a two-dimensional array of data, rows are called observations and columns are called features. The feature being predicted is called the target. The other features, which are used to predict the target, are called predictors.
For linear regression to work, the primary condition is that the number of target values equals the number of observations (rows) of the predictors.
Shape is the dimensionality, i.e. the number of rows and columns. The shape of the data shown above is (5,4).
The index column is the pointer used to identify an observation. It can be numeric or alphanumeric, but generally it is numeric and starts at 0.
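To make these terms concrete, here is a minimal sketch on a toy table. The column names and values are made up for illustration and are not the data used in this post.

```python
import pandas as pd

# Hypothetical toy data: 5 observations (rows) and 4 features (columns).
toy = pd.DataFrame({
    "rooms": [3, 4, 2, 5, 4],
    "age": [30, 12, 45, 5, 20],
    "distance": [4.1, 2.3, 6.0, 1.2, 3.5],
    "price": [21.0, 30.5, 15.2, 41.0, 28.3],
})

target = toy["price"]                    # the feature being predicted
predictors = toy.drop(columns="price")   # the features used to predict it

print(toy.shape)        # (number of rows, number of columns)
print(list(toy.index))  # default numeric index, starting at 0

# One target value for every observation of the predictors.
assert len(target) == len(predictors)
```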
Now we can look at the actual data. Here we will consider a sample dataset available in the scikit-learn library. The following code loads the data into the Python object boston.
Let us convert the boston object to a pandas DataFrame for easy navigation, slicing and dicing.
- First import pandas as pd.
- Call the DataFrame constructor, passing boston.data as the data and boston.feature_names as the column names.
- Print a part of the DataFrame.
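The steps above can be sketched as follows. The Boston housing loader (load_boston) was removed in scikit-learn 1.2, so this sketch uses a stand-in object that only mimics the data, feature_names and target attributes. The numbers are random placeholders, not the real dataset, and only five of the column names are used.

```python
from types import SimpleNamespace

import numpy as np
import pandas as pd

# Stand-in for the loaded dataset object (an assumption for illustration).
# The values are random placeholders, not real housing data.
rng = np.random.default_rng(0)
boston = SimpleNamespace(
    data=rng.random((10, 5)),
    feature_names=["CRIM", "ZN", "RM", "AGE", "DIS"],
    target=rng.random(10),
)

# Build the DataFrame from the data array and the feature names.
df = pd.DataFrame(boston.data, columns=boston.feature_names)

# Print the first five observations.
print(df.head())
```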
df.head() shows the top observations. Another way to select observations is the [] operator.
df.index evaluates to the index of the DataFrame, and df.index<6 evaluates to an array of True and False values, one per row. df[df.index<6] is a very popular way of selecting certain observations.
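A small sketch of this kind of selection, using a placeholder DataFrame with made-up values:

```python
import pandas as pd

# Placeholder data with the default index 0..7.
df = pd.DataFrame({"RM": [6.5, 6.4, 7.1, 6.9, 7.1, 6.4, 6.0, 6.1]})

mask = df.index < 6   # boolean array with one entry per row
subset = df[mask]     # keeps only the rows where the mask is True

print(mask)
print(subset)
```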
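- iloc[index]: selects by integer position. We can pass the following:
A single position as a number. An array of positions using the [] operator. True/False values produced by functions or operators.
- loc[index]: selects by label. We can pass the following:
A single label. An array of labels using the [] operator. True/False values produced by functions or operators.
- ix[index]: accepts either numbers or labels. Note that ix was deprecated in pandas 0.20 and removed in pandas 1.0, so on current versions use loc or iloc instead.
df.ix[[1,3,5],['CRIM','ZN']] selects the rows labeled 1, 3 and 5 (the 2nd, 4th and 6th rows with the default zero-based index) and the columns CRIM and ZN.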
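Since ix is no longer available in current pandas, here is a sketch of equivalent selections with iloc and loc. The DataFrame holds made-up values and only borrows the column names from the text.

```python
import pandas as pd

# Placeholder values; only the column names come from the text.
df = pd.DataFrame({
    "CRIM": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "ZN": [18.0, 0.0, 0.0, 12.5, 0.0, 25.0],
    "RM": [6.5, 6.4, 7.1, 6.9, 7.1, 6.4],
})

# iloc: rows and columns by integer position.
by_position = df.iloc[[1, 3, 5], [0, 1]]

# loc: rows and columns by label.
by_label = df.loc[[1, 3, 5], ["CRIM", "ZN"]]

# loc with a boolean mask produced by a comparison operator.
by_mask = df.loc[df["RM"] > 6.45, ["RM"]]

print(by_label)
print(by_mask)
```

With the default index, positions and labels coincide, so by_position and by_label hold the same rows here. They would differ for an index that is not 0, 1, 2, and so on.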
We created the DataFrame df from boston.data, so it does not contain the target.
Now let us add boston.target as a column using df['PRICE'] = boston.target. This adds the target as the last column of df.
The DataFrame df is now ready with the Boston data for regression analysis. The following cell prints part of the DataFrame using ix notation.
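A minimal sketch of adding the target column and printing part of the result. The values are placeholders, and loc is used in place of ix, which current pandas no longer provides.

```python
import numpy as np
import pandas as pd

# Placeholder predictors and target values (not the real dataset).
df = pd.DataFrame({"RM": [6.5, 6.4, 7.1], "AGE": [65.2, 78.9, 61.1]})
target = np.array([24.0, 21.6, 34.7])

# Assigning to a new column name appends it as the last column.
df["PRICE"] = target

# Print a slice of rows and a subset of columns by label.
print(df.loc[0:2, ["RM", "PRICE"]])
```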
Basics of Linear equation
In the data set loaded in the previous step, PRICE is a continuous dependent entity, and we are trying to find its relationship with the other features in the dataset.
The most intuitive way to understand the relationship between entities is a scatter plot, so we will plot each predictor against PRICE to observe their relationship.
The selection of predictors is one of the important steps in regression analysis. The analyst should select the predictors that contribute to the target variable. Some predictors do not contribute to the relationship; those should be identified and left out of the regression equation. One obvious kind of non-contributing predictor is a constant column. Here the predictor CHAS takes only the values 0 or 1 and, judging by the scatter plot, does not appear to influence the price of the house, so I have not used it in the regression.
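One possible way to screen for constant columns is to count the distinct values per column. This is only a sketch on made-up data, and the FLAG column is hypothetical. A two-valued column such as CHAS would not be flagged by this check and still needs to be judged from the plots.

```python
import pandas as pd

# Made-up data; FLAG is a hypothetical constant column.
df = pd.DataFrame({
    "RM": [6.5, 6.4, 7.1, 6.9],
    "FLAG": [1, 1, 1, 1],
    "PRICE": [24.0, 21.6, 34.7, 33.4],
})

# Columns with a single distinct value carry no information for a regression.
n_unique = df.nunique()
constant_cols = n_unique[n_unique <= 1].index.tolist()

# Keep the remaining columns, excluding the target, as predictors.
predictors = df.drop(columns=constant_cols + ["PRICE"])

print(constant_cols)
print(list(predictors.columns))
```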
I have selected RM, AGE and DIS as my predictors, based on what I observed in the scatter plots below.
We can try to find the equation (function) relating the number of rooms to the price. The following cell plots the line of best fit over the scatter plot. The red line is the line of best fit, and it can be used to predict the house price from the number of rooms. The equation of the line is given in the chart.
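A line of best fit can be computed with a least-squares fit of degree 1. The sketch below uses numpy.polyfit on made-up room counts and prices. It is not the original notebook cell, and plotting is left out.

```python
import numpy as np

# Made-up sample: number of rooms and house price.
rooms = np.array([4.0, 5.0, 6.0, 7.0, 8.0])
price = np.array([10.0, 16.0, 19.0, 27.0, 33.0])

# Least-squares straight line: price ~ slope * rooms + intercept.
slope, intercept = np.polyfit(rooms, price, deg=1)

# Prices predicted by the fitted line for the observed room counts.
predicted = slope * rooms + intercept

print(f"price = {slope:.2f} * rooms + {intercept:.2f}")
```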
One of the most important properties is the Pearson product-moment correlation coefficient (PPMCC), or simply the correlation coefficient.
It gives the direction and strength of the linear correlation between two variables X and Y. The value lies between -1 and +1. A value close to +1, e.g. 0.95, suggests a very strong positive correlation. A value close to -1 suggests a strong negative correlation, meaning the dependent variable decreases as the independent variable increases. A value of 0 suggests that there is no linear correlation between the variables. You can find more about this here.
Mathematically, r is given by the formula below.
r = Covariance(X, Y) / (Standard Deviation of X * Standard Deviation of Y)
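As a quick check of the formula, the sketch below computes r from the covariance and the standard deviations and compares it with numpy.corrcoef. The data is made up, and the sample versions (ddof=1) are used consistently for both the covariance and the standard deviations.

```python
import numpy as np

# Made-up sample data.
x = np.array([4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([10.0, 16.0, 19.0, 27.0, 33.0])

# Sample covariance of x and y (off-diagonal entry of the covariance matrix).
cov_xy = np.cov(x, y, ddof=1)[0, 1]

# r = Cov(X, Y) / (std(X) * std(Y)), with matching ddof.
r = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

# Reference value from numpy's built-in correlation matrix.
r_numpy = np.corrcoef(x, y)[0, 1]

print(r, r_numpy)
```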
Some important properties related to the regression line are:
R-squared, adjusted R-squared, F statistic, Prob (F statistic), standard error, t ratio and p value.
- R-squared is the coefficient of determination. It signifies the strength of the relationship between the variables, expressed as a percentage: it is the proportion of the variance in the dependent variable that can be explained by the independent variables. A higher R-squared is generally considered better, but not always, because R-squared never decreases when a predictor is added, so non-contributing predictors can inflate it.
- The adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model. It increases only if a new term improves the model more than would be expected by chance, and it decreases when a predictor improves the model by less than expected by chance. The adjusted R-squared can be negative, though it usually is not. It is always less than or equal to R-squared.
- The 'F statistic' or 'F value' measures the overall significance of the regression model. It is one of the most important statistics to look at when reading the regression output.
- If the F value is greater than the F critical value, it suggests that there is at least one significant predictor in the model. The 'F critical value' is the value obtained from the F table for a given significance level (α).
- The F value, F critical value, alpha (α) and p value are read together to understand the overall significance of the regression model.
- A p value less than α suggests that the model as a whole is significant, i.e. at least one predictor has a non-zero coefficient. It does not by itself show that every predictor is significant.
- Mathematically, the F value is the ratio of the mean regression sum of squares to the mean error sum of squares. It ranges from zero to an arbitrarily large number. Under the null hypothesis it is expected to be close to 1, so a value far above 1 suggests a strong model.
- The value of Prob(F statistic) is the probability of obtaining an F value at least as large as the observed one if the null hypothesis for the full model were true (i.e. if all of the regression coefficients other than the intercept were zero).
- Basically, the F-test compares the fitted model with a model with zero predictor variables (the intercept-only model) and decides whether the added coefficients improve the model. A significant result suggests that the predictors included in the model, taken together, improve the fit.
- Standard error is a measure of the accuracy of predictions. If the predictions made by the model (equation) are close to the actual values, i.e. the sample points in the scatter plot lie very close to the line of best fit, the model is considered more accurate.
- Mathematically, the standard error (σest) is given by
σest = Sqrt( SUM( Sqr(Yi - Y′i) ) / N ), where Yi is the actual value, Y′i is the value predicted by the model and N is the number of observations.
- The t statistic measures the significance of an individual predictor. It is the estimated coefficient divided by its standard error, so it indicates how many standard errors the coefficient is away from zero.
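The quantities in the list above can be computed by hand for a single-predictor model. The following is a rough sketch with numpy on made-up data, using the standard textbook formulas rather than the statsmodels implementation. The p values are left out because they need an F or t distribution table or library.

```python
import numpy as np

# Made-up sample data with one predictor.
x = np.array([4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([10.0, 16.0, 19.0, 27.0, 33.0])
n, k = len(y), 1   # number of observations, number of predictors

# Fit the line and compute the predicted values.
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

ss_res = np.sum((y - y_hat) ** 2)      # error (residual) sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)   # total sum of squares
ss_reg = ss_tot - ss_res               # regression sum of squares

r_squared = 1 - ss_res / ss_tot
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# F value: mean regression sum of squares over mean error sum of squares.
f_value = (ss_reg / k) / (ss_res / (n - k - 1))

# Standard error of the estimate, dividing by N as in the formula in the text.
std_err_n = np.sqrt(ss_res / n)
# Variant that divides by the residual degrees of freedom (n - k - 1).
std_err_df = np.sqrt(ss_res / (n - k - 1))

# t statistic of the slope: coefficient divided by its standard error.
slope_se = std_err_df / np.sqrt(np.sum((x - x.mean()) ** 2))
t_slope = slope / slope_se

print(r_squared, adj_r_squared, f_value, std_err_n, t_slope)
```

With a single predictor, the square of the slope's t statistic equals the F value. The same relation is visible in the summary output in this post, where 21.722 squared is approximately 471.8.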
The following cell uses the Python library statsmodels.api to show the summary output of the OLS (Ordinary Least Squares) method. The explanations given in the cell can be used to interpret the result.
                            OLS Regression Results
==============================================================================
Dep. Variable:                  PRICE   R-squared:                       0.484
Model:                            OLS   Adj. R-squared:                  0.483
Method:                 Least Squares   F-statistic:                     471.8
Date:                Mon, 19 Mar 2018   Prob (F-statistic):           2.49e-74
Time:                        12:05:20   Log-Likelihood:                -1673.1
No. Observations:                 506   AIC:                             3350.
Df Residuals:                     504   BIC:                             3359.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -34.6706      2.650    -13.084      0.000     -39.877     -29.465
RM             9.1021      0.419     21.722      0.000       8.279       9.925
==============================================================================
Omnibus:                      102.585   Durbin-Watson:                   0.684
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              612.449
Skew:                           0.726   Prob(JB):                    1.02e-133
Kurtosis:                       8.190   Cond. No.                         58.4
==============================================================================
Regression is a vast topic that takes whole books to cover. I found one such text at https://www.stat.berkeley.edu/~brill/Stat131a/29_multi.pdf, and it looks like a nice read.
The Python notebook for this tutorial can be found on my GitHub page here.