In the last article, we learned about the types of machine learning algorithms.
Let us now explore one of those techniques: simple linear regression.
In simple linear regression, we predict the value of one dependent variable on the basis of a single independent variable.
Throughout this blog, we will use the following notation:
y – Dependent variable or output
x – Independent variable or input
The relation between x and y in simple linear regression is given as
y = a + b * x
Does this equation look familiar? That’s because we all learnt it back in school.
It is the equation of a sloped straight line drawn on the X-Y plane. For example, if x is the number of likes a person gets on a photo on Facebook, y could be that person's popularity index. So we can say that y depends on x or, in simple words, if x is the independent variable then y is the dependent variable.
Simple linear regression involves only one independent variable. The dependent variable may or may not depend on that independent variable directly; the model only describes how the two move together.
Let me explain this with an example.
In general, salary increases with experience: the more the experience, the higher the salary. So salary is dependent on experience.
Therefore, in terms of simple linear regression, the relationship between salary and experience can be defined as below:
Salary = a + b * Experience          (1)
I will explain a and b in a few minutes.
Consider the ideal case where salary increases by 2 lacs every year, as shown in the graph below.
Why does the line start from salary = 3 instead of the origin (0, 0)? Because freshers with 0 years of experience still earn some salary. This can happen in any linear regression: the dependent variable may have some value even when the independent variable is 0. This value is called the CONSTANT (also known as the intercept). The term ‘a’ in equation (1) corresponds to the CONSTANT.
One more important point is that salary increases by 2 lacs every year. Salary and experience do not change by the same amount: for every unit change in experience (the independent variable), salary (the dependent variable) increases by 2 units. The term that represents this change in the dependent variable per unit change in the independent variable is called the COEFFICIENT (also known as the slope). The term ‘b’ in equation (1) corresponds to the COEFFICIENT.
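To make the CONSTANT and the COEFFICIENT concrete, here is a minimal Python sketch of equation (1) for the ideal case described above, with a = 3 and b = 2 taken from the text. The function name `predict_salary` is just a name chosen for this illustration.

```python
# Ideal case from the text: a fresher starts at 3 lacs and gains 2 lacs per year.
CONSTANT = 3.0      # 'a': salary (in lacs) at 0 years of experience
COEFFICIENT = 2.0   # 'b': increase in salary (in lacs) per year of experience


def predict_salary(experience_years, a=CONSTANT, b=COEFFICIENT):
    """Return the salary in lacs given by Salary = a + b * Experience."""
    return a + b * experience_years


# Print the ideal salary for 0 to 5 years of experience.
for years in range(0, 6):
    print(years, "years ->", predict_salary(years), "lacs")
```

With 0 years of experience the function returns the CONSTANT itself, and each extra year adds exactly one COEFFICIENT to the result.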
Best fit line – This is the modelled line from which the distance of our observation points is as small as possible.
That was all about the ideal case. But as you all know, salary never increases by the same amount every year. Actual salaries may deviate from the values that the ideal line predicts. Our objective is to minimize these deviations, and for this we will draw the best fit line.
Just look at the dataset below.
Experience (in years) | Salary (in lacs)
We get the graph below on plotting the above points.
Here, the colored dots are our observation points, and their distance from the red line is their deviation from ideal behavior.
As you can see, some of the observation points lie above the red line and some lie below it. Therefore, the distance calculated between an observation point and the red line (the actual value minus the value on the line) may sometimes be negative. To avoid dealing with negative values, which would cancel out the positive ones when added up, simple linear regression considers the square of the distance between each point and the red line.
The distance between the points and the red line denotes the error in drawing the modelled line. We should choose this red line in such a way that this error is minimum.
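Here is a small Python sketch of why the distances are squared. The observations below are hypothetical numbers invented for this illustration (they are not the dataset from this article), and the red line is assumed to be the ideal line Salary = 3 + 2 * Experience.

```python
# Hypothetical observations, invented for illustration (not the article's dataset).
experience = [1, 2, 3, 4, 5]            # years
salary = [5.5, 6.5, 9.5, 10.5, 13.5]    # lacs


def sum_of_squared_errors(a, b, xs, ys):
    """Add up the squared distances between each point and the line a + b * x."""
    total = 0.0
    for x, y in zip(xs, ys):
        deviation = y - (a + b * x)   # positive above the line, negative below it
        total += deviation ** 2       # squaring removes the sign
    return total


# Signed distances from the assumed red line Salary = 3 + 2 * Experience.
deviations = [y - (3 + 2 * x) for x, y in zip(experience, salary)]

print("Deviations:", deviations)
print("Plain sum of deviations:", sum(deviations))
print("Sum of squared deviations:", sum_of_squared_errors(3, 2, experience, salary))
```

For these made-up points the deviations alternate between +0.5 and -0.5, so their plain sum is only 0.5 even though every point is off the line. The sum of squares, 1.25, counts every miss.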
Conceptually, simple linear regression draws many such candidate lines, and for every line it calculates the sum of squares of the distances between the actual data and the line drawn. It records each sum and finds the minimum among the recorded values. The line for which the sum is minimum becomes the best fit line. (In practice, the values of a and b that give this minimum are computed directly from a formula rather than by trying lines one by one.)
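The search for the best fit line can be sketched in Python as below. This is only an illustration under stated assumptions: the observations are hypothetical numbers invented for this example, and the grid of candidate lines (its range and its 0.1 step) is an arbitrary choice. For comparison, the sketch also computes a and b with the standard least squares formulas, which is a swapped-in shortcut and not something this article derives.

```python
# Hypothetical observations, invented for illustration (not the article's dataset).
experience = [1, 2, 3, 4, 5]            # years
salary = [5.5, 6.5, 9.5, 10.5, 13.5]    # lacs


def sum_of_squares(a, b):
    """Sum of squared distances between the data and the line a + b * x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(experience, salary))


# 1) Trial and error: try every candidate line on a coarse grid, keep the best.
#    The grid bounds and the 0.1 step are arbitrary choices for this sketch.
best_a, best_b, best_sum = None, None, float("inf")
for i in range(0, 61):          # a = 0.0, 0.1, ..., 6.0
    for j in range(0, 41):      # b = 0.0, 0.1, ..., 4.0
        a, b = i / 10, j / 10
        current = sum_of_squares(a, b)
        if current < best_sum:  # a smaller sum means a better fitting line
            best_a, best_b, best_sum = a, b, current

# 2) Direct route: the standard least squares formulas for one input variable.
n = len(experience)
mean_x = sum(experience) / n
mean_y = sum(salary) / n
covariance_sum = sum((x - mean_x) * (y - mean_y) for x, y in zip(experience, salary))
variance_sum = sum((x - mean_x) ** 2 for x in experience)
b_formula = covariance_sum / variance_sum
a_formula = mean_y - b_formula * mean_x

print("Grid search:", best_a, best_b, best_sum)
print("Formula:    ", a_formula, b_formula)
```

On these made-up numbers both routes agree on roughly Salary = 3.1 + 2.0 * Experience. The grid search only works here because the true answer happens to lie on the grid; the formula needs no such luck.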
That’s how simple linear regression works. Pretty simple, right?