Apache NiFi Data Analytics Docker

Apache NiFi docker

This is a guide to starting Apache NiFi in Docker containers.

We can use the aktechthoughts/nifi-docker repository to start a NiFi cluster.

STEP I : Clone the repository

STEP II: Execute the script

The script pulls the images and runs ZooKeeper and NiFi on the local machine.

NiFi is exposed on port 8888 and can be opened in a browser.

This is a blank canvas. We can add process groups and use the service.

The template SEND_REMOTE_DATA is imported and used.

This is an insecure cluster. We can create a secure cluster using the repository

AWS Continuous Integration Data Analytics Docker Linux Networking

Creating a docker instance on AWS using Ansible.

There are many advantages of infrastructure automation. We can implement any software project without worrying about the system configuration.

This project uses Ansible to create an EC2 instance dynamically and then creates a Docker instance on the remote EC2 instance.

You can simply clone the repository to your local machine and create an EC2 instance just by changing the parameters.


ansible 2.5
ubuntu 18.04 LTS

STEP 1: Add AWS login details in ‘vault.yml’.

AWS login credentials are required to create the EC2 instance. They can be stored in the ~/.aws directory or added to the Ansible vault using the ‘ansible-vault’ command.

Execute the command below and add the AWS access key and secret key to the file it creates.

$ sudo ansible-vault create vars/vault.yml
aws_access_key = XXXXXXXXXXX
aws_secret_key = XXXXXXXXXXXXXX

STEP 2: Initialize AWS variables for the configuration of ec2 instance.

vars/vars.yml contains variables such as the AWS region id, VPC id and subnet id, which are essential for creating the EC2 instance. These should be changed to match your AWS account.

aws_region_id : eu-central-1
aws_vpc_id : vpc-07db38c12d26359ac
aws_sub_id : subnet-042581e3da666d445
aws_ami_id : ami-050a22b7e0cf85dd0

aws_instance_type : t2.micro

aws_sec_key : abhishek-vpc

IdFile : /home/abhishek/.aws/abhishek-vpc.pem

default_image : jenkins

public_ip : ''
instance_id : ''

terminate : true

public_ip and instance_id are variables that are initialized at runtime.
The variable ‘terminate’ can be set to true or false, depending on the desired final state of the instance.

The variable default_image can be changed based on the required image.
This image is pulled from

STEP 3: Execute the ansible-playbook to create docker on ec2.

$ sudo ansible-playbook create_ec2.yml -i hosts.ini --ask-vault-pass

The command asks for the vault password created in STEP 1. It creates a Docker instance on EC2 based on the configuration in create_ec2.yml.




Continuous Integration Docker Jenkins Uncategorized

Dockerizing Jenkins Pipeline – Simplilearn project

This document explains the project done for simplilearn-devops-certification. It creates a Docker image from a Dockerfile and publishes it to Docker Hub using a Jenkins pipeline. The code can be found on GitHub.

STEP 1 Set up a VS code workspace and the Github repository.

  • Open VS Code.
  • Create a directory “simplilearn-devops-certification” in the terminal and change into it.
  • Run “git init” to initialize the repository.
  • Create a repository “simplilearn-devops-certification” in the
  • Create a file named “” and add details of the project to it.
  • Execute the following steps to make the initial commit. This will add the project to the GitHub master branch.
      1. git remote add origin
      2. git add .
      3. git commit -m "Initial Commit"
      4. git push --set-upstream origin master

STEP 2 Set up a Jenkins server and a docker-machine.

“java -jar D:\Softwares\jenkins.war --httpPort=8080”

  • The previous command runs the Jenkins server on port 8080. It can be accessed in the browser at http://localhost:8080/
  • Select “Install Selected packages” and wait for the installation to finish.
  • Create a new user after the installation is finished.
  • Install Docker on the same machine where Jenkins is installed.

STEP 3  Setup Jenkinsfile in the repository

  1. Create Jenkinsfile in the project root directory.
  2. Add below content in the file.


  • The script has four stages.
    1. Building the image.
    2. Deploying the image to the Docker Hub repository.
    3. Removing the image from the Jenkins node.
    4. Executing the image from Docker Hub.

STEP 4 Register and open with your own login.

  • Create a new Docker repository named ‘simplilearn-devops-certification’.
  • Create a file ‘Dockerfile’ in the project created in STEP 1.
  • Add the following content to the file.



Create a pipeline in jenkins and execute the pipeline to publish the image to the docker hub.


Data Analytics Data base Linux Networking

Creating Exasol Docker Instance


Exasol is an analytics database management system. It is an in-memory, column-oriented, relational database management system.

It supports the SQL 2003 standard and can be integrated via standard interfaces like ODBC, JDBC or ADO.NET.

Creating Exasol Docker instance

An Exasol Docker instance can be created with the help of the GitHub page. This requires Docker to be installed on the system.

host-machine:~$ sudo docker run --name exasoldb -p 8899:8888 --detach --privileged --stop-timeout 120 -v exa_volume:/exa exasol/docker-db

The command above creates the exasoldb container. It exposes the container port 8888 on the host machine at port 8899, which means the Exasol database is available at the connection string host-system:8899.

It also configures the Exasol database in persistent mode, i.e. the objects created remain stored on the system.
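The port mapping and volume flags are easier to see when the command is assembled piece by piece. The following Python sketch only builds the same `docker run` argument list. The helper name `build_exasol_run_command` and its parameters are hypothetical and are not part of Docker or Exasol tooling.

```python
def build_exasol_run_command(name="exasoldb", host_port=8899, container_port=8888,
                             volume="exa_volume", image="exasol/docker-db"):
    """Return the docker run argument list used in this post (hypothetical helper)."""
    return [
        "docker", "run",
        "--name", name,
        # -p HOST:CONTAINER publishes the container port on the host
        "-p", "{}:{}".format(host_port, container_port),
        "--detach",
        "--privileged",
        "--stop-timeout", "120",
        # a named volume mounted at /exa keeps the data between restarts
        "-v", "{}:/exa".format(volume),
        image,
    ]


command = build_exasol_run_command()
print(" ".join(command))
```

The list can be passed to `subprocess.run(command, check=True)` on a machine where Docker is installed. Changing `host_port` is enough to expose the database on a different port of the host.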

Starting the exasol instance

The command in the previous section creates an Exasol instance in a Docker container. We can retrieve the container id using the command below.

host-machine:~$ sudo docker container ls -a
d4f553b20771 exasol/docker-db “/usr/opt/EXASuite-6…” 6 days ago Up 53 minutes>80/tcp,>8888/tcp exasoldb

d4f553b20771 is the container id. It is a reference to the container which is running Exasol. Use this id to start the container as shown below.

host-machine:~$ sudo docker container start d4f553b20771 

Stopping the exasol database gracefully 

The Exasol database needs to be stopped carefully. If the host system is shut down without stopping the database, the tables created may be lost or the system may not start again.

host-machine:~$ docker exec -ti exasoldb dwad_client stop-wait DB1

The command above stops the database. The host machine can then be shut down or rebooted.

Connecting using sql client.

A JDBC driver can be downloaded from the official Exasol download page. I used DBeaver to connect. The default user name is sys and the password is exasol.

Note – The Docker image is not officially supported by Exasol, but it can still be used for simple use cases.




Arduino Artificial Intelligence Classification Image Classification Machine Learning Object Tracking OpenCV

Getting Started with OpenCV library, Loading Images, and videos.

OpenCV is a cross-platform library for real-time computer vision. It was initially developed by Intel and later released as Open Source product.

I will try to explain the library and create a simple application. Later, I will try to extend the application into an object tracking device.

First things first: I am using Python, specifically Python 2.7, which is already installed on my Windows laptop. OpenCV requires the numpy and (optionally) matplotlib libraries to be present in the Python environment. Please make them available.

I downloaded opencv-3.4.3-vc14_vc15.exe from the official website. Executing the exe file extracts the files to a directory called opencv at the same location. I moved the opencv directory to the root of the C drive.

The folder C:\opencv\build\python\2.7\x86 contains the file cv2.pyd. Please copy it to the python27\Lib\site-packages folder under the Python root. With this step the installation of OpenCV for Python is complete. It can be verified by running import cv2 at the Python prompt.

The OpenCV documentation has quite an extensive explanation of how to handle images. I will try one or two of the examples before diving deeper into the library.

Loading and Saving of Image

One pretty neat example is loading an image. There are two versions of this example, one without matplotlib and the other with matplotlib.

Example 1 – This example reads an image and shows it. waitKey and destroyAllWindows are used to close the image window when the Esc (27) key is pressed. The image path must use forward slashes (/) or double backslashes (\\), otherwise it may throw an error.
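A minimal sketch of what Example 1 describes, not the original listing. The helper names `to_opencv_path` and `show_image` and the sample path are hypothetical. `cv2` is imported inside the function only so that the path helper can be used on a machine without OpenCV.

```python
def to_opencv_path(path):
    """Replace backslashes with forward slashes before handing the path to OpenCV."""
    return path.replace("\\", "/")


def show_image(path):
    """Read an image and show it until the Esc key (27) is pressed."""
    import cv2 as cv  # imported here so to_opencv_path works without OpenCV installed

    img = cv.imread(to_opencv_path(path))
    if img is None:
        # imread returns None instead of raising when the file cannot be read
        raise IOError("Could not read image: " + path)
    cv.imshow("image", img)
    while True:
        # waitKey(0) blocks until a key is pressed; 27 is the Esc key
        if cv.waitKey(0) & 0xFF == 27:
            break
    cv.destroyAllWindows()


print(to_opencv_path("C:\\opencv\\images\\sample.jpg"))
# show_image("C:\\opencv\\images\\sample.jpg")  # hypothetical path, needs a display
```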


Example 2 – This uses matplotlib to show the image. matplotlib gives greater control over the image.


The Output of the two examples is the following.


Capture, Load and Save Video from Webcam

Capturing a live stream from a video camera is the most basic operation in real-time computer vision applications.

cap = cv2.VideoCapture(0) encapsulates the camera attached to the computer. 0 is the default camera; it can be -1 or 1 depending on the secondary camera. The cap.read() method returns a captured frame, which can be saved to a file or processed further, for example by filtering or transforming the image.

########################################################## - Show Video from a webcam.
import numpy as np
import cv2 as cv
cap = cv.VideoCapture(0)
while True:
    # Capture frame-by-frame
    ret, frame = cap.read()
    if not ret:
        break
    # Our operations on the frame come here
    gray = cv.cvtColor(frame, cv.COLOR_BGR2GRAY)
    # Display the resulting frame
    cv.imshow('frame', gray)
    if cv.waitKey(1) & 0xFF == ord('q'):
        break
# When everything done, release the capture
cap.release()
cv.destroyAllWindows()
########################################################## - Play A video from a file.
import numpy as np
import cv2 as cv
cap = cv.VideoCapture('C:/opencv/videos/output.avi')
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    gray = cv.cvtColor(frame, cv.COLOR_BGR2GRAY)
    cv.imshow('frame', gray)
    if cv.waitKey(1) & 0xFF == ord('q'):
        break
cap.release()
cv.destroyAllWindows()
########################################################## - Save the Video from Webcam.
import numpy as np
import cv2 as cv
cap = cv.VideoCapture(0)
# Define the codec and create VideoWriter object
fourcc = cv.VideoWriter_fourcc('M','J','P','G')
#fourcc = cv.VideoWriter_fourcc(*'XVID')
out = cv.VideoWriter('C:/opencv/videos/output.avi',fourcc, 20.0, (640,480))
while cap.isOpened():
    ret, frame = cap.read()
    if ret==True:
        # Flip the frame
        # frame = cv.flip(frame,0)
        # write the frame
        out.write(frame)
        cv.imshow('frame', frame)
        if cv.waitKey(1) & 0xFF == ord('q'):
            break
    else:
        break
# Release everything if job is finished
cap.release()
out.release()
cv.destroyAllWindows()

There are multiple applications of capturing and loading images using OpenCV. We can use the captured image or video for face detection, face tracking, identification of abnormalities in medical scans, marketing campaigns etc.

Classification Data Analytics Gradient Descent Machine Learning Pandas Probability Python Regression

Optimization Techniques in Machine Learning

Optimization is the single most important concept used across AI, machine learning and deep learning. It is important to understand the basic optimization algorithm, which is gradient descent.

I am considering my favorite example: house price vs number of rooms. I have plotted various observations and a line which represents the trend in the observations. Which trend line best matches the given observations? We agree that the line in the third image matches the trend in the observed values well.

Note : Please assume that the number of rooms can be fractional, because the continuous data is generated randomly.


But how can we say that the line in the third chart matches the observations best? A very good trick adopted by statisticians is calculating the area: if the total area between the line and the observed points is small, then the line is a good fit.

We can understand this based on the following image –


We can see that for the observation at 3.58 the price is around 1002.25. For the same observation the value predicted by first line is 1001.85 and the second line is 1002.00. The difference between Observed value and predicted value is called error.

                                     Error = Observed Value – Predicted Value

Notice that the error is high in the case of the first line, so the square created by the error would be large. The error is small in the case of the second line, so the area of the square would be small. The observations can fall on either side of the line and the error can be positive or negative, but after squaring the area is always positive. Now, sum up all the areas created by the squares at the observations.

Based on this we can conclude that the line which fits the observations best has the minimum total area. In other words, we have to minimize the sum of squares of errors in order to find the best fit line.

We can put this into mathematical terms. The equation of the line we have considered so far is Y = mX + c. For an observation (X’, Y’), the value predicted by the line is Ya = mX’ + c, the error is (Y’ – Ya) and the square of the error is (Y’ – Ya)^2. The sum of squares of errors for n observations is given by

                                  Sum of squares of Errors =  Σ i=1…n ( Y’i – Yai )^2
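The comparison of two candidate lines can be sketched in a few lines of Python. The observations below are made up for illustration and are not the data plotted in this post. `sum_of_squared_errors` is a hypothetical helper name.

```python
def sum_of_squared_errors(m, c, xs, ys):
    """Sum of (observed - predicted)^2 for the line y = m*x + c."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))


# Made-up observations (number of rooms, price), for illustration only
rooms = [1.0, 2.0, 3.0, 4.0]
price = [3.0, 5.0, 7.0, 9.0]

good_fit = sum_of_squared_errors(2.0, 1.0, rooms, price)   # line through every point
poor_fit = sum_of_squared_errors(1.5, 1.0, rooms, price)   # flatter line
print(good_fit, poor_fit)
```

The line with the smaller sum is the better fit, which is exactly the quantity that gradient descent tries to minimize.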

The sum of squares is a quadratic function of the error and has the same shape as f(x) = x^2. If we plot a quadratic function we get the following chart. The bottom-most point gives the minimum value of x^2.


So, to find the best fit line we have to find the minimum value of the sum of ( Y’ – Ya)^2 over all the observations. The most popular method to find the minimum of a function is “Gradient Descent”. It starts from a random point on the curve (here x^2) and iterates to find a point where the function reaches its minimum value.

The numpy implementation of gradient descent for Linear Regression and Logistic Regression is at the link.

As said earlier, in order to find the best fitting line we have to minimize the error, or loss. The first step is to define the loss function, which in our case is the sum of squares of errors. The next step is to find the parameters where the loss function has its minimum value.

In the example we have started with slope(m) as 0.8 and intercept (c) as 0.1 and calculated the respective error (sum of squares) in iteration 1.

                     m = 0.8
                     c = 0.1

To continue with the next iteration we have to find new values of the slope (m) and intercept (c). The new values are obtained from the partial derivatives of the error function.

The error function is $f(m, c) = \frac{1}{2N}\sum_{i=1}^{N} (y_i - (mx_i + c))^2$, so we have to find its partial derivatives with respect to m and c.

$$\begin{split}\nabla f(m, c) =    \begin{bmatrix}      \frac{{\partial}f}{{\partial}m}\\      \frac{{\partial}f}{{\partial}c}\\     \end{bmatrix} =    \begin{bmatrix}      \frac{1}{N} \sum -x_i(y_i - (mx_i + c)) \\        \frac{1}{N} \sum -(y_i - (mx_i + c)) \\     \end{bmatrix}\end{split}$$

If we plot the slope and intercept against the error, the idea is to find the m and c where the error is minimum.

In every iteration we compute the partial derivatives delta_m and delta_c and then update m and c. The following lines in the code obtain the new values of m and c, where self.r is the step size (learning rate).

self.m = self.m - self.r * delta_m
self.c = self.c - self.r * delta_c
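The whole loop can be sketched in plain Python. This is not the numpy implementation linked above. It assumes the error function is the mean of squared errors scaled by 1/2, so that its derivatives match the gradient expressions in this post. The starting values 0.8 and 0.1 come from the text, while the data, the step size `r = 0.05` and the iteration count are made up for illustration.

```python
def gradient_descent(xs, ys, m=0.8, c=0.1, r=0.05, iterations=5000):
    """Fit y = m*x + c by repeatedly stepping against the gradient."""
    n = len(xs)
    for _ in range(iterations):
        # Partial derivatives of f(m, c) = 1/(2N) * sum((y - (m*x + c))^2)
        delta_m = sum(-x * (y - (m * x + c)) for x, y in zip(xs, ys)) / n
        delta_c = sum(-(y - (m * x + c)) for x, y in zip(xs, ys)) / n
        # Same update rule as the two lines quoted from the code
        m = m - r * delta_m
        c = c - r * delta_c
    return m, c


# Made-up observations generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
m, c = gradient_descent(xs, ys)
print(round(m, 4), round(c, 4))
```

A step size that is too large makes the updates overshoot and diverge, while one that is too small needs many more iterations, so `r` usually has to be tuned for the data at hand.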

Classification Data Analytics Digit Recognition Generative Models Machine Learning Probability Python

Understanding classification using Naive Bayes Classifier.

Classification is a supervised machine learning technique in which objects are categorized into buckets. The most common example is the classification of fruits in a given set. It can be a set of images of fruits, real fruits in a basket or a lot of fruits on an assembly line.

The most intuitive method to classify objects is to identify their properties and say that objects with similar properties belong to the same class. The same principle is used to classify objects in statistics and machine learning, but in a more formal way. The properties of the objects are converted into numerical values and given as input to a function which produces the class as output.

                                      Y = f(X1, X2 … Xn)

If the function is a linear equation it is said to be a linear classifier, otherwise it is said to be a non-linear classifier. Linear models (equations) are easy to interpret and mathematically less complex, and they use relatively few computational resources when working on large data sets.

Selection of properties (predictors) and identification of the function (model) is the hardest part of machine learning. A number of methodologies have been developed for feature selection and model building. In the case of regression (where the output is numeric), linear regression is the simplest model. In the case of classification (where the output is categorical), logistic regression and Bayesian classifiers are the simplest.

Bayesian Classification

The fundamental principle of Bayesian classification is Bayes Theorem. In the 18th century the English mathematician Thomas Bayes showed that the probability of an event given some evidence is proportional to the product of the likelihood of the evidence and the prior probability of the event. This is a statement about conditional probability.

                                                    P(A|B) = P(B|A).P(A) / P(B)

Bayes Theorem is a little hard to grasp at first, but it can be understood with a little effort. You can refer to this link to understand more. P(A|B) is called the posterior probability, P(B|A) the likelihood, P(A) the prior probability and P(B) the evidence. P(B) is the same for every class being compared, so the relationship can be rewritten in terms of proportionality (∝).

                             posterior probability ∝  (likelihood) . (prior probability)

One real-life example of this relation is the chance of rain on a given day, let's say some day in November. The prior probability of rain would be low, because we already know from experience that it rarely rains in November. Features like atmospheric pressure, temperature and the location of the city (near the seashore) make up the likelihood. If we have a high likelihood, the chance of rain would be high, even in the odd seasons.
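A small worked example makes the relation concrete. All the numbers below are made up for illustration and are not measured weather statistics. `posterior` is a hypothetical helper name.

```python
def posterior(likelihood, prior, evidence):
    """Bayes theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence


# Made-up numbers for the rain example
p_rain = 0.1                      # prior: it rarely rains in November
p_low_pressure_given_rain = 0.9   # likelihood: low pressure is common on rainy days
p_low_pressure = 0.2              # evidence: how often low pressure is seen at all

p_rain_given_low_pressure = posterior(p_low_pressure_given_rain, p_rain, p_low_pressure)
print(round(p_rain_given_low_pressure, 2))
```

With these numbers, observing low pressure raises the chance of rain well above the prior of 0.1, which is the effect described in the paragraph above.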

This proportionality relation can be harnessed to classify objects based on their properties, and it can be used in a variety of use cases. The most famous use cases are email spam identification, OCR (optical character recognition), image classification, fraud detection in the insurance and banking industries, text classification, customer segmentation etc.

Here we will look at handwritten digit identification in detail and understand how Bayes theorem is used to identify written digits.

Feature creation

The first step of any machine learning algorithm is feature selection, creation or extraction. We will do the same for the handwritten digits. A handwritten digit can be captured as an image and compressed to a specific dimension, say 28×28 pixels. Below are 4 sample images for the digit 5. These digits are taken from the MNIST computer vision data set of handwritten digits.


The 28×28 pixels can be stored in a vector of 784 elements. If we have 100 sample (training) images, we can store them in a 100×784 2D array. Each element of the vector is a float with a value ranging from 0 to 255, equal to the intensity of the pixel. In the case of digit 5, the value of pixels 0 and 783 would be 0 and the pixels at 50-75 would have some value greater than 0.

To identify a given digit, we calculate the posterior probability of all the digits 0 to 9. The given digit belongs to the class with the highest posterior probability.

Calculate Likelihood of the Digit

In Bayes theorem, P(B|A) is called the likelihood, the probability of B when A has already occurred. We can calculate the likelihood of each pixel in the sample data set, i.e. P(Xi | Digit = 0), P(Xi | Digit = 1) … P(Xi | Digit = 9). This gives 784×10 probabilities, 784 for each of the digits 0-9. If the pixels are assumed to be independent of each other (the naive assumption, one of the important assumptions of Naive Bayes classification; refer to the wiki for details), we can rewrite the likelihood for the digit 1 as the product of the probabilities of each pixel.

        P(X | Digit = 1 ) = P(X0 | Digit = 1) * P(X1 | Digit = 1) *….* P(X783 | Digit = 1)

The right-hand side values P(Xi | Digit = 0…9) can be calculated in a variety of ways. One simple method would be counting how often each pixel value occurs and dividing by the number of samples of the digit. The value of a pixel could be anything in the range 0 to 255, so without the independence assumption there are 256^784 possible combinations to count. With thousands of training samples, the counting method is computationally expensive.

Another way to calculate P(Xi | Digit = 0…9) is the PDF (Probability Density Function). The formal definition can be looked up on the wiki. In simple terms, for a continuous variable the PDF gives the relative likelihood of a value occurring, and here it is used in place of the probability.

For a given digit y (0 – 9), we can calculate the mean (μ with subscript y) and variance (σ^2 with subscript y) for each of the pixels. The PDF of the pixel xi of the test digit can be given in terms of the mean and variance using the following Gaussian formula.
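The Gaussian density referred to here is the standard normal-distribution formula. It is written out below with the per-digit, per-pixel mean and variance. The double subscript (digit y, pixel i) is a notation chosen for this sketch and may differ from the missing figure.

$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_{y,i}^2}} \exp\left(-\frac{(x_i - \mu_{y,i})^2}{2\sigma_{y,i}^2}\right)$$

The variance appears in the denominator twice, which is why a pixel whose variance is 0 for a digit cannot be evaluated with this formula as it stands.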


Pixel 0 for the digit 5 is always 0, so its mean and variance are both 0 and the PDF P(X0 | Digit = 5) results in a divide-by-zero error. This is a classic problem in naive Bayes classification. We can overcome it by ignoring this pixel or by assigning P(X0 | Digit = 5) = 1. Assigning 1 does not change the overall likelihood, because the likelihood is a product. In this way, we can calculate the likelihood of the digit 5.

       P(X | Digit = 5 ) = P(X0 | Digit = 5) * P(X1 | Digit = 5) *….* P(X783 | Digit = 5)

Similarly, we can calculate the likelihood of all the digits (0 – 9).

Prior probability of the Digits

P(A) is called the prior probability. For all the digits we can calculate P(Digit = 0), P(Digit = 1) … P(Digit = 9). In the case of 100 samples this would be equivalent to

                                                     P(Digit = 1) = No of Digit ‘1’ / 100

Predicting the Digit based on features

So far we know how to calculate the likelihood and prior probability of all the digits. Now, suppose a new image of a test digit is given. This test digit can be compressed into a 28×28 pixel image and stored in a 784-element vector. We perform the following operation.

      For each Digit in 0 to 9
          For each Pixel in 0 to 783
              Calculate mean and variance of the pixel for the digit.
          Calculate prior probability of the Digit.

This operation gives 10 prior probabilities and (10 * 784) means and variances. We can calculate the posterior probability for the digits 0 to 9 and compare them to identify the highest. The test digit is assigned the label of the digit with the highest posterior probability.
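The training and prediction steps can be sketched in plain Python on a toy data set. This is not the numpy implementation from the GitHub page. Two things differ from the description in this post and are assumptions of the sketch: logarithms of probabilities are summed instead of multiplying the probabilities, to avoid numerical underflow, and a small constant `eps` is added to every variance instead of skipping zero-variance pixels. The three-pixel "images" are made up.

```python
import math
from collections import defaultdict


def fit(samples, labels, eps=1e-3):
    """Return {label: (prior, means, variances)} computed per pixel."""
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    model = {}
    for y, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))               # one tuple of values per pixel
        means = [sum(col) / n for col in columns]
        # eps keeps the variance above zero for pixels that never change
        variances = [sum((v - mu) ** 2 for v in col) / n + eps
                     for col, mu in zip(columns, means)]
        model[y] = (n / len(samples), means, variances)
    return model


def log_gaussian(x, mu, var):
    """Logarithm of the Gaussian density."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)


def predict(model, x):
    """Return the label with the highest log posterior (up to a constant)."""
    best_label, best_score = None, -math.inf
    for y, (prior, means, variances) in model.items():
        score = math.log(prior) + sum(
            log_gaussian(v, mu, var) for v, mu, var in zip(x, means, variances))
        if score > best_score:
            best_label, best_score = y, score
    return best_label


# Toy "images" with 3 pixels each; pixel 0 is always 0, as in the digit 5 example
train = [[0, 200, 0], [0, 220, 10], [0, 10, 210], [0, 0, 190]]
labels = [0, 0, 1, 1]
model = fit(train, labels)
print(predict(model, [0, 210, 5]), predict(model, [0, 5, 200]))
```

For real MNIST data the same functions would be called with 784-element vectors and the labels 0 to 9.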

Bayesian classification is also called a probabilistic or generative classification model. It is a simple yet powerful model. Even with the strong naive assumption, the model returns striking accuracy with a small training data set. It is especially useful in the field of natural language processing.

The implementation of the algorithm using python numpy is available on the github page here.

Data Analytics Machine Learning Pandas Python Regression

Introduction to Linear regression using python

This blog is an attempt to introduce the concept of linear regression to engineers. It is well understood and used in the community of data scientists and statisticians, but with the arrival of big data technologies and the advent of data science, it is now important for engineers to understand it.

Regression is one of the supervised machine learning techniques, used for prediction or forecasting of a dependent entity which has a continuous value. Here I will use the pandas, scikit-learn and statsmodels libraries to understand basic regression analysis.

Basic Terminology and Loading data in a DataFrame

A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes. You can find more about data frames here.

First of all I would like to explain the terminology. The following terms are the most important before we dive in.

  • Observations
  • Features
  • Predictors
  • Target
  • Shape
  • Index Column

In a two-dimensional array of data, rows are called observations and columns are called features. The feature which is being predicted is called the target. The other features, which are used to predict the target, are called predictors.

For linear regression to work, the primary condition is that the number of target values equals the number of observations of the predictors.


Shape is the dimensionality, i.e. the number of rows and columns. The shape of the data shown above is (5,4).

The index column is the pointer used to identify an observation. It can be numeric or alphanumeric, but generally it is numeric, starting with 0.

Now we can look at the actual data. Here we will consider a sample dataset available in the scikit-learn library. The following code loads the data into the python object boston.


This dataset has 4 keys: data, feature_names, DESCR and target. data is a numpy 2-d array, while feature_names and target are one-dimensional arrays. The DESCR key explains the features available in the dataset.

Let's convert the boston object to a pandas dataframe for easy navigation, slicing and dicing.

  • First import pandas as pd.
  • Call the function DataFrame and pass the data and boston.feature_names keys.
  • Print a part of the dataframe.

df.head shows the top observations. Another way to select observations is the [] operator.


df.index evaluates to the index of the dataframe and “df.index<6” evaluates to an array of True and False values. df[df.index<6] is a very popular way of selecting certain observations.

There are three ways of slicing a pandas dataframe: loc, iloc and ix. Note that ix was deprecated in pandas 0.20 and removed in pandas 1.0, so newer code should use loc or iloc.
  • iloc[index] : We can pass the following elements.
       An index as a number.
       An array of indexes using the [] operator.
       True/False values produced by functions or operators.
  • loc[index] : We can pass the following elements.
       An index as a label.
       An array of labels using the [] operator.
       True/False values produced by functions or operators.
  • ix[index] : We can pass either numbers or labels to ix.
       df.ix[[1,3,5],['CRIM','ZN']]  This selects the rows with index 1, 3 and 5 and the columns CRIM and ZN.
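The selections described above can be tried on a small made-up table. The column names are borrowed from the Boston data for illustration, but the values are invented. The sketch uses loc and iloc only, since ix is not available in current pandas releases.

```python
import pandas as pd

# Small made-up table, not the real Boston data
df = pd.DataFrame({
    "CRIM": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "ZN": [18.0, 0.0, 0.0, 12.5, 0.0, 25.0],
    "RM": [6.5, 6.4, 7.1, 6.9, 5.8, 6.0],
})

first_rows = df[df.index < 3]                   # boolean mask built from the index
by_label = df.loc[[1, 3, 5], ["CRIM", "ZN"]]    # rows by label, columns by name
by_position = df.iloc[[1, 3, 5], [0, 1]]        # same cells selected by position
print(by_label)
```

With the default integer index the labels and positions coincide, so both selections return the same cells. They differ as soon as the index is something else, for example dates or strings.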


We have created the dataframe df, but it doesn’t have the target.

Now let's add the target as a column in the dataframe using “df[‘PRICE’] =”. This adds the target as the last column of the dataframe df. Print it using ix notation.


The dataframe df is now ready with the boston data for regression analysis. The following cell prints part of the dataframe using ix notation.

Basics of Linear equation

In the data set loaded in the previous step, PRICE is a continuous dependent entity, and we are trying to find the relationship of PRICE with the other features in the dataset.

The most intuitive way to understand the relationship between entities is a scatter plot. So we will plot all the predictors against PRICE to observe their relationship.

The selection of predictors is one of the important steps in regression analysis. The analyst should select the predictors which contribute to the target variable. Some predictors don’t contribute to the relationship; those should be identified and not used in the regression equation. One obvious kind of non-contributing predictor is a constant. Here the predictor CHAS has only the values 0 or 1; it doesn’t influence the price of the house, so it should not be used in the regression.

I have selected RM, AGE and DIS as my predictors. I have taken this decision based on the observations in the scatter plot below.


We can observe a linear pattern in the plot. The price of a house seems to increase with the number of rooms. It decreases with distance from the business center, and it decreases with age.

We can try to find the equation (function) between the number of rooms and the price. The following cell plots the best fit line over the scatter plot. The red line is the line of best fit and it can predict the house price based on the number of rooms. The equation of the line is given in the chart.


There are a number of properties associated with the best fit line.

One of the most important properties is the Pearson product-moment correlation coefficient (PPMCC), or simply the correlation coefficient.

It gives the strength and direction of the linear correlation between two variables X and Y. The value lies between -1 and +1. A value close to +1, e.g. 0.95, suggests a very strong positive correlation. A value close to -1 suggests a strong negative correlation, meaning the dependent variable decreases as the independent variable increases. A value of 0 suggests that there is no linear correlation between the variables. You can find more about this here.

Mathematically r is given by the formula below.

 r = Covariance of (X,Y) / (Standard Deviation of X * Standard Deviation of Y)
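The formula can be checked with a few lines of plain Python. `pearson_r` is a hypothetical helper name and the data is made up. Population formulas (division by n) are used for both the covariance and the standard deviations, so the factors cancel in the ratio.

```python
import math


def pearson_r(xs, ys):
    """Covariance of (X, Y) divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (std_x * std_y)


rooms = [1.0, 2.0, 3.0, 4.0]
r_up = pearson_r(rooms, [2.0, 4.0, 6.0, 8.0])     # price rises with rooms
r_down = pearson_r(rooms, [8.0, 6.0, 4.0, 2.0])   # price falls with rooms
print(round(r_up, 3), round(r_down, 3))
```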

Some of Important properties related to regression line are

     Adjusted. R-squared
     F Statistic
     Prob ( F Statistic)
     Standard Error
     t Ratio
  • R-Squared is said to be the Coefficient of determination, it signify the strength of the relationship between variables in terms of percentage. This is actually the proportion of the variance in the dependent variable that can be explained by independent variable. The higher value of R-Squared is considered to be good. But this is not always true, sometimes non-contributing predictors inflate the R-Squared.
  • The adjusted R-squared is a modified version of R-Squared that has been adjusted for the number of predictors in the model. The adjusted R-squared increases only if new term improves the model more than would be expected by chance. It decreases when predictor improves the model by less than expected by chance. The adjusted R-square can be negative, but usually not. It is always less than equal to R-squared.
  • ‘F Statistic’ or ‘F Value’ is the measure of the overall significance of the regression model. This is the most important statistics which is looked at to understand the regression output.
  • If F value is greater than F Critical value, it suggests that there is some significance predictor in the model. ‘F critical value’ is the value obtained from F table for a given significance level (α).
  • F value, F Critical Value , Alpha (α) and p value are looked together to understand the overall significance of the regression model.
  • p value less then α suggests all the predictors are significant.
  • Mathematically F value is the ratio of the mean regression sum of squares divided by the mean error sum of squares. Its value will range from zero to an arbitrarily large number. The value far away from 0 suggests a very strong model.
  • The value of Prob(F Statistic) is the probability that the null hypothesis for the full model is true (i.e., that all of the regression coefficients are zero).
  • Basically, the f-test compares the model with zero predictor variables (the intercept only model), and decides whether the added coefficients improves the model. If we get a significant result, then whatever coefficients is included in the model is considered to be fit for the model.
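  • Standard error is the measure of the accuracy of predictions. If the predictions made by the model (equation) are close to the actual values, i.e. the sample points in the scatter plot lie very close to the line of best fit, the model is considered to be more accurate.
  • Mathematically, the standard error of the estimate (σest) is given by
     σest = Sqrt( SUM (Sqr(Yi - Y′)) / N )
    where Yi is an actual value, Y′ is the corresponding predicted value and N is the number of observations. When the error is estimated from a sample, the denominator is N - 2 for a single predictor (the residual degrees of freedom) rather than N.
  • The t statistic is the measure of significance of an individual predictor. It is the estimated coefficient divided by its standard error, i.e. it indicates how many standard errors the coefficient lies away from zero.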
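
The quantities described above can be computed by hand for a model with a single predictor. The following is a minimal sketch in plain Python, not the code used for this tutorial; the function name simple_ols_stats and the sample data in the usage note are made up for illustration.

```python
import math


def simple_ols_stats(x, y):
    """Fit y = intercept + slope * x by least squares and return summary statistics."""
    n = len(x)
    k = 1  # number of predictors (excluding the intercept)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    fitted = [intercept + slope * xi for xi in x]
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # error sum of squares
    sst = sum((yi - mean_y) ** 2 for yi in y)               # total sum of squares
    ssr = sst - sse                                         # regression sum of squares

    r2 = 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    f_value = (ssr / k) / (sse / (n - k - 1))   # mean regression SS / mean error SS
    std_err_est = math.sqrt(sse / (n - k - 1))  # sample version, denominator N - 2
    slope_se = std_err_est / math.sqrt(sxx)     # standard error of the slope
    t_value = slope / slope_se                  # coefficient / its standard error

    return {
        "slope": slope, "intercept": intercept,
        "r2": r2, "adj_r2": adj_r2,
        "f": f_value, "std_err_est": std_err_est, "t": t_value,
    }
```

For example, simple_ols_stats([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1]) returns a dictionary with all of these values. With a single predictor, the F value equals the square of the slope's t statistic, which is a handy sanity check.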

The following cell uses the Python library statsmodels.api to show the summary output of the OLS (Ordinary Least Squares) method. The explanations given above can be used to interpret the result.


                            OLS Regression Results                       
Dep. Variable:                  PRICE   R-squared:                  0.484
Model:                            OLS   Adj. R-squared:             0.483
Method:                 Least Squares   F-statistic:                471.8
Date:                Mon, 19 Mar 2018   Prob (F-statistic):      2.49e-74
Time:                        12:05:20   Log-Likelihood:           -1673.1
No. Observations:                 506   AIC:                        3350.
Df Residuals:                     504   BIC:                        3359.
Df Model:                           1                                    
Covariance Type:            nonrobust                                      
                 coef    std err          t      P>|t| [0.025      0.975]
const        -34.6706      2.650    -13.084      0.000 -39.877    -29.465
RM             9.1021      0.419     21.722      0.000  8.279       9.925
Omnibus:                      102.585   Durbin-Watson:              0.684
Prob(Omnibus):                  0.000   Jarque-Bera (JB):         612.449
Skew:                           0.726   Prob(JB):               1.02e-133
Kurtosis:                       8.190   Cond. No.                    58.4
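
As a quick cross-check, several numbers in this summary can be reproduced from the others. The sketch below only uses values copied from the table above; the variable names are mine, and small differences come from the rounding in the printed output.

```python
# Values copied from the OLS summary above.
n_obs = 506          # No. Observations
k = 1                # Df Model (number of predictors)
r_squared = 0.484    # R-squared
coef_rm = 9.1021     # coefficient of RM
std_err_rm = 0.419   # standard error of the RM coefficient

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - k - 1)

# The t statistic is the coefficient divided by its standard error.
t_rm = coef_rm / std_err_rm

# With a single predictor, the F statistic equals the square of its t statistic.
f_from_t = t_rm ** 2

print(f"Adj. R-squared: {adj_r_squared:.3f}")
print(f"t (RM):         {t_rm:.3f}")
print(f"F from t^2:     {f_from_t:.1f}")
```

The results agree with the reported Adj. R-squared (0.483), t (21.722) and F-statistic (471.8) up to rounding.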

Regression is a vast topic that can only be covered properly in books. I have found a book at the link. This looks to be a nice read.

The Python notebook for this tutorial can be found on my GitHub page here.


Booting PC over the Network

Basics of Network Booting

Network boot is an industry-standard method for loading an operating system over the network. It was designed to boot diskless devices when the cost of disks was high. It is still useful in data centers, where servers are made available on demand.

The core concept is the same as booting with the BIOS (Basic Input/Output System): the system loads a small piece of software, which in turn loads the operating system from disk and finally brings the computer up.

PXE (Preboot eXecution Environment) is the most widely used network boot software. It was developed by Intel Corporation. There is also an open source network boot implementation called iPXE, which has more features.

The BIOS is stored in firmware on the motherboard, while PXE is stored in a chip on the network card. Every computer is configured to load the BIOS first; the BIOS then selects a device and transfers control to it. The BIOS maintains a list of devices (hard disk, USB, CD-ROM, network) called the boot order. The devices on the list are tried one by one, and the first one that is available receives control from the BIOS.

To boot from the network, we need to change the BIOS boot order and make the network the first boot device. This can be done in the Boot menu of the BIOS settings. You can usually enter the BIOS settings by pressing the DEL or F2 key repeatedly while the PC is booting. Some newer motherboards may use a different key.

When the PXE firmware receives control from the BIOS, it is loaded into memory and starts running. It contains driver software for the network card and a DHCP client. Since the intention is to load the system over the network, it first configures a network connection: it sends a DHCP (Dynamic Host Configuration Protocol) request over the wired network and receives an IP address configuration from the DHCP server. A DHCP server is a program that handles clients' requests for IP addresses. Most networks have a DHCP server that hands out IP addresses to clients. Using the details in the DHCP reply, the firmware then downloads the network boot program (NBP) and passes control to it.
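
To make the DHCP request more concrete, the sketch below assembles the bytes of a simplified DHCPDISCOVER message, following the BOOTP/DHCP message layout from RFC 2131. It only builds the packet and does not send anything. The function name, the MAC address and the transaction id are made up for illustration, and real PXE clients send a longer vendor class string and more options than shown here.

```python
import struct

DHCP_MAGIC_COOKIE = b"\x63\x82\x53\x63"  # marks the start of the DHCP options


def build_dhcp_discover(mac: bytes, xid: int) -> bytes:
    """Return a simplified DHCPDISCOVER message for the given 6-byte MAC address."""
    header = struct.pack(
        "!BBBBIHH4s4s4s4s16s64s128s",
        1,            # op: 1 = BOOTREQUEST (client to server)
        1,            # htype: 1 = Ethernet
        6,            # hlen: length of an Ethernet MAC address
        0,            # hops
        xid,          # transaction id chosen by the client
        0,            # secs
        0x8000,       # flags: ask the server to broadcast its reply
        b"\x00" * 4,  # ciaddr: the client has no IP address yet
        b"\x00" * 4,  # yiaddr: "your" address, filled in by the server
        b"\x00" * 4,  # siaddr: next server, filled in by the server
        b"\x00" * 4,  # giaddr: relay agent address
        mac,          # chaddr: struct pads this to 16 bytes with zeros
        b"",          # sname: server host name (unused here)
        b"",          # file: boot file name, filled in by the server
    )
    vendor_class = b"PXEClient"  # simplified; real clients append architecture details
    options = (
        bytes([53, 1, 1])                      # option 53: message type = DISCOVER
        + bytes([60, len(vendor_class)]) + vendor_class  # option 60: vendor class
        + bytes([255])                         # option 255: end of options
    )
    return header + DHCP_MAGIC_COOKIE + options
```

A call such as build_dhcp_discover(bytes.fromhex("001122334455"), 0x12345678) returns the message as bytes. The server's reply uses the same layout, with the offered address in the yiaddr field and the TFTP server in the siaddr field.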

Configure DHCP server

Below is a typical DHCP exchange in which a client requests an IP address. The DHCP server can assign an IP address as well as a next-server to clients. ‘next-server’ is the address or DNS name of the TFTP server.

The DHCP server has a configuration file called “dhcpd.conf” in the /etc/dhcp/ directory. It holds settings for IP addresses, lease times and other DHCP parameters.

subnet SUBNET_ADDRESS netmask SUBNET_MASK {
          option routers ROUTER_ADDRESS;
          option broadcast-address BROADCAST_ADDRESS;
          default-lease-time 3600;
          max-lease-time 7200;
          next-server TFTP_server_address;
          filename "/tftpboot/pxelinux.0";
}

The two parameters ‘next-server’ and ‘filename’ are important for network booting. A TFTP (Trivial File Transfer Protocol) server listens on UDP port 69; any client can request the hosted files, and the files are served without authentication.

‘filename’ is the NBP binary hosted in the root directory of the TFTP server. There are many open source TFTP servers available. You can follow this to install a TFTP server on Ubuntu machines.
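
The lack of authentication is visible in the protocol itself. As a rough illustration, this sketch builds the read request (RRQ) that a client sends to the TFTP server to ask for a file, following the packet layout in RFC 1350. The function name is mine and nothing is sent over the network.

```python
import struct

TFTP_PORT = 69    # well-known UDP port for TFTP
OPCODE_RRQ = 1    # read request


def build_rrq(filename: str, mode: str = "octet") -> bytes:
    """Return a TFTP read request: opcode, file name and transfer mode."""
    return (
        struct.pack("!H", OPCODE_RRQ)       # 2-byte opcode in network byte order
        + filename.encode("ascii") + b"\x00"  # zero-terminated file name
        + mode.encode("ascii") + b"\x00"      # zero-terminated mode ("octet" = binary)
    )
```

The request carries only the file name and the transfer mode; there is no field for a user name or password. For example, build_rrq("pxelinux.0") produces the request for the boot loader named in the DHCP configuration.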


Configure PXE/iPXE Boot Menu

The structure of the tftp root directory (tftpboot) is shown below. It hosts the file pxelinux.0, which is PXELINUX, the lightweight network boot loader from the Syslinux project. pxelinux.0 also needs a few other files: vesamenu.c32, ldlinux.c32, libcom32.c32 and libutil.c32.

All these files can be extracted from the Syslinux package, available on the syslinux website here.

pi@raspberrypi:~/d_drive/tftpboot $ ls -lrt
total 477
drwxrwxr-x 1 pi pi      0 Oct 27 20:02 syslinux
-rwxrwxr-x 1 pi pi  42143 Oct 27 20:33 pxelinux.0
-rwxrwxr-x 1 pi pi  26692 Oct 27 20:33 vesamenu.c32
-rwxrwxr-x 1 pi pi 116556 Oct 27 20:33 ldlinux.c32
-rwxrwxr-x 1 pi pi 181952 Oct 27 20:33 libcom32.c32
-rwxrwxr-x 1 pi pi  23636 Oct 27 20:33 libutil.c32
drwxrwxr-x 1 pi pi      0 Oct 27 21:11 img
-rwxrwxr-x 1 pi pi    216 Oct 27 21:29
drwxrwxr-x 1 pi pi    4096 Nov 5 21:53 pxelinux.cfg
-rwxrwxr-x 1 pi pi   67227 Nov 6 19:30 undionly.kpxe
drwxrwxr-x 1 pi pi   4096 Nov 15 13:49 iso
drwxrwxr-x 1 pi pi   4096 Nov 18 12:08 scripts
drwxrwxr-x 1 pi pi   4096 Jan 20 14:05 cgi-bin
pi@raspberrypi:~/d_drive/tftpboot $ cd pxelinux.cfg/
pi@raspberrypi:~/d_drive/tftpboot/pxelinux.cfg $ cat default
default vesamenu.c32
prompt 0
timeout 100
ONTIMEOUT Minimal_Linux_Live

LABEL Minimal_Linux_Live
MENU LABEL Minimal_Linux_Live (7M)
KERNEL syslinux/memdisk/memdisk
INITRD iso/minimal_linux_live_20-Jan-2017_32-bit.iso iso

LABEL lubuntu-14.10
MENU LABEL Lubuntu-14.10 Over HTTP (705M)
kernel iso raw

pi@raspberrypi:~/d_drive/tftpboot/pxelinux.cfg $

The tftp root also hosts a directory called ‘pxelinux.cfg’. It contains a ‘default’ configuration file, which holds the boot menu configuration along with the KERNEL and INITRD commands.

The boot menu provides a list of the OSs hosted on the tftp server in the iso and img directories. The KERNEL command loads memdisk, a Syslinux module that emulates a disk in memory, and INITRD then loads the image of the actual OS, which memdisk boots as if it were a physical disk.
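
If the menu grows, the ‘default’ file can be generated instead of edited by hand. Here is a minimal sketch that only emits the directives seen in the listing above; the function name and the dictionary keys are my own choices and not part of Syslinux.

```python
def render_menu(entries, timeout=100):
    """Render a pxelinux.cfg/default file from a list of menu entries.

    Each entry is a dict with the keys 'label', 'title' and 'kernel',
    plus an optional 'initrd'. The first entry is booted on timeout.
    """
    lines = [
        "default vesamenu.c32",
        "prompt 0",
        f"timeout {timeout}",  # Syslinux counts timeouts in tenths of a second
    ]
    if entries:
        lines.append(f"ONTIMEOUT {entries[0]['label']}")
    for entry in entries:
        lines.append("")
        lines.append(f"LABEL {entry['label']}")
        lines.append(f"MENU LABEL {entry['title']}")
        lines.append(f"KERNEL {entry['kernel']}")
        if entry.get("initrd"):
            lines.append(f"INITRD {entry['initrd']}")
    return "\n".join(lines) + "\n"


menu_text = render_menu([
    {
        "label": "Minimal_Linux_Live",
        "title": "Minimal_Linux_Live (7M)",
        "kernel": "syslinux/memdisk/memdisk",
        "initrd": "iso/minimal_linux_live_20-Jan-2017_32-bit.iso",
    },
])
print(menu_text)
```

The printed text can be written to pxelinux.cfg/default in the tftp root. Any extra options an entry needs would have to be added by hand.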

For images served over HTTP, Python's SimpleHTTPServer module (http.server in Python 3) can be used as the HTTP server.
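
As a sketch of how this could look in Python 3, the helper below serves a directory (for example the iso directory) over HTTP using only the standard library. The function name make_server and the default port 8000 are assumptions for illustration.

```python
from functools import partial
from http.server import HTTPServer, SimpleHTTPRequestHandler


def make_server(directory, port=8000, host=""):
    """Create an HTTP server that serves the files under `directory`."""
    # SimpleHTTPRequestHandler accepts a `directory` argument since Python 3.7.
    handler = partial(SimpleHTTPRequestHandler, directory=directory)
    return HTTPServer((host, port), handler)
```

Calling make_server("iso").serve_forever() starts serving until interrupted. From a shell, running python3 -m http.server 8000 inside the directory does much the same.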

Laptop booting over LAN

An example of a working PXE boot is available at the location below.

The project is also available on github here.


Arduino – Open Source Prototyping Board

Arduino is an open source hardware platform that lets you control devices using simple, easy-to-understand programs. The interface is so simple that anyone with a little knowledge of electronics and programming can connect sensors, motors and servos to create programmable devices.

The hardware is a microcontroller sitting on a prototyping board with a built-in power supply, a USB interface for connecting to modern computers, and a set of input/output pins for connecting other devices. The microcontroller is a low-cost, low-power 8-bit ATmega328P chip from Atmel Corporation. The datasheet for the microcontroller can be found here.

There are many versions of the device available; the entry-level Arduino UNO is very popular (below). It has all the components (USB, power supply unit, microcontroller) integrated on a single board.


The original device can be bought from various online stores. There are also many clones on the market, and all of them work in much the same way. Most of the time, the devices are shipped with a boot loader already burnt into them. The boot loader is a program that makes the device programmable, i.e. you can upload your own code alongside it, and the board executes that code as soon as you power it.

The heart of the Arduino is its IDE (Integrated Development Environment), a Java-based application backed by hundreds of easy-to-use libraries, well-documented online guides and a large, helpful community of users. It makes working with Arduino a breeze.

Setting up the IDE

You can follow the guide available on the Arduino website to install the IDE and the USB driver. The IDE looks like the screenshot below. It is a typical GUI environment with multiple menus and an editor in which to write code.


Once the device is plugged in over USB, it is shown as a COM device in the Windows Device Manager (image below). This is a virtual COM port, i.e. the device is physically connected over USB, but the computer sees it as a COM device. There is no magic here: the UNO board has a USB-to-serial converter chip (shown in the image above), which takes care of the USB-to-serial conversion. The driver software installed during setup lets the PC communicate with this chip over the USB interface. Sometimes the driver is not installed automatically, and you have to find the driver for the chip used on your UNO board and install it manually.


Running the first Example (Blink)

Step 1: Select the device

After plugging in the board and opening the IDE, we have to select the port. The port is the channel over which the IDE and the device communicate. In my case it is COM7, as seen in Device Manager. Most of the time it appears automatically in the port menu. It can be anything from COM0 to COMXX, where XX is a number. Select the one shown in your setup.

Step 2: Select the board

Arduino comes in many versions, so we have to select the board we are using. In my case it is an Arduino UNO, so I selected Tools -> Board -> Arduino/Genuino UNO.

Arduino Steps

Step 3: Select an example (Blink)

The IDE has hundreds of built-in examples, and the easiest of these is Blink. The code blinks the built-in LED. A typical Arduino example has two parts: the function setup() and the function loop().

The setup function runs once when the reset button is pressed or the board is powered on. It is used to initialize the variables, pins and objects used in the subsequent code.

The loop function runs continuously for as long as the board is powered on. It performs its operations and then repeats itself. Below is the Blink example, which is self-explanatory. You can also look at the Arduino website here to understand more about the Blink example.

// the setup function runs once when you press reset or power the board
void setup() {
  // initialize digital pin LED_BUILTIN as an output.
  pinMode(LED_BUILTIN, OUTPUT);
}

// the loop function runs over and over again forever
void loop() {
  digitalWrite(LED_BUILTIN, HIGH);   // turn the LED on (HIGH is the voltage level)
  delay(1000);                       // wait for a second
  digitalWrite(LED_BUILTIN, LOW);    // turn the LED off by making the voltage LOW
  delay(1000);                       // wait for a second
}

Step 4: Upload the Code

Once the code is written or copied into the IDE editor, you can choose Sketch -> Upload (Ctrl + U) to load the code onto the Arduino board. When the upload finishes, the bottom of the IDE shows “Done uploading”. The Blink example is now loaded onto the board.


Time to Blink 🙂

Once the code is uploaded to the board, it starts running. The image below shows the built-in LED blinking.


There are many examples and projects that can be built using Arduino as the microcontroller. Keep exploring, and ask questions if you have any.