In the last post, you became familiar with the Simple Linear Regression model in Machine Learning. Now it’s time to move on to another Machine Learning model: Multiple Linear Regression.
I am assuming that you are clear about the use of the word ‘Simple’ in the ‘Simple Linear Regression’ model; if not, you can always go back to this link for a refresher.
Just to recap, in SLR there was only one independent variable to predict the values of the dependent variable. This is the only reason we called it sooooo ‘SIMPLE’.
Let’s make this simple thing a bit more interesting by adding some more independent variables. So, in MLR, we are going to predict the dependent variable by using ‘MULTIPLE’ independent variables.
Multiple Linear Regression is a statistical technique to predict the value of a dependent (output) variable from two or more independent variables.
Here, the dependent variable is called the criterion and the independent variables are called predictors.
Let’s understand multiple linear regression with the help of a use case.
We have a dataset which contains information about various parameters considered while calculating the value of a Toyota Corolla car. These parameters are listed below:
- Age of car
- KM it has covered
- Fuel Type
- HorsePower of car (HP)
- Metcolor
- Automatic
- CC
- Doors
- Weight
- Price
Let’s say we have to predict the price of a car on the basis of the above parameters. In this case, price becomes the dependent variable and the remaining parameters are the independent variables.
First, we need to write an equation similar to the sample equation (1) for this case. The equation would be
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6 + \beta_7 x_7 + \beta_8 x_8 + \beta_9 x_9 + e_i$$
where
- $x_1$: Age of car
- $x_2$: KM covered by car
- $x_3$: Fuel Type
- $x_4$: HP
- $x_5$: Metcolor
- $x_6$: Automatic
- $x_7$: CC
- $x_8$: Doors
- $x_9$: Weight
In regression models, all the values used for prediction must be numeric. But in this case we have fuel type, which takes values like Diesel, Petrol or CNG and hence is not numeric. So the question is: how can we use words in a numeric equation? Fuel type is one of the important parameters, so we can’t ignore it in our case.
Don’t worry, there is a solution for this problem too.
In any dataset, there are two types of variables on the basis of their values.
- Numeric Variables: values are of numeric type.
- Categorical Variables: values are of character type.
If our dataset has some categorical independent variables, we need to introduce dummy variables.
Let’s understand what Dummy variables are and why they are called so.
Dummy variables are the ones which help us to convert the values in string format into numeric format.
What is the categorical variable in the above dataset?
I am sure you must have guessed it right: the answer is ‘Fueltype’. Since ‘Fueltype’ is of string type, we can’t use it directly in our calculations. So, we will create dummy variables for this column.
First, we need to check the distinct values present in the ‘Fueltype’ column. The distinct values are ‘Diesel’, ‘Petrol’ and ‘CNG’.
Now, for each of these distinct values, we need to add a new column to our dataset.
Next, we determine the values for these new columns. This is the step where string data is converted to numeric data. We are going to use only two numbers: 0 and 1.
For each record, only the new column matching its ‘Fueltype’ value is set to 1, and the other two are set to 0, as shown below.
Now, we can discard the original ‘Fueltype’ column since its value can be determined from the three newly added columns.
The new dataset will be as given below:
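As a rough illustration of the steps above, here is a minimal pandas sketch. The three rows are made up for demonstration and are not taken from the actual Toyota Corolla dataset; the variable names `cars`, `dummies` and `cars_encoded` are assumptions as well.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real dataset.
cars = pd.DataFrame({
    "Age": [23, 30, 41],
    "KM": [46986, 38500, 72937],
    "Fueltype": ["Diesel", "Petrol", "CNG"],
})

# One 0/1 column per distinct fuel type (columns come out in sorted order).
dummies = pd.get_dummies(cars["Fueltype"], dtype=int)

# Replace the original string column with the three dummy columns.
cars_encoded = pd.concat([cars.drop(columns="Fueltype"), dummies], axis=1)
print(cars_encoded)
```

Each row has exactly one 1 among the `CNG`, `Diesel` and `Petrol` columns, so the original fuel type can always be read back from them.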
Since there are three new columns, equation (1) would be changed a bit as given below.
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6 + \beta_7 x_7 + \beta_8 x_8 + \beta_9 x_9 + \beta_{10} x_{10} + \beta_{11} x_{11} + \beta_{12} x_{12} + e_i$$
Please notice that we have removed the coefficient as well as the variable for ‘FuelType’ ($\beta_3 x_3$), and in its place three new terms, $\beta_{10} x_{10}$ (CNG), $\beta_{11} x_{11}$ (Diesel) and $\beta_{12} x_{12}$ (Petrol), have been added corresponding to the new columns.
The three new columns ‘Diesel’, ‘Petrol’ and ‘CNG’ are called ‘DUMMY VARIABLES’.
Now a question for you all, about the three dummy columns only.
If I tell you the values of the ‘Diesel’ and ‘Petrol’ columns, will you be able to determine the value of the ‘CNG’ column?
Please click on the link below to verify your answer with an explanation. You will also get familiar with the terms ‘Multicollinearity’ and ‘Dummy variable trap’ there. Please go through it once.
http://www.data-stats.com/multicollinearity/
Dummy Variable Trap
In the last few lines, we talked about dummy variables and multicollinearity. Don’t you think that dummy variables are correlated with each other?
Yesss, dummy variables are correlated in the sense that we can determine the value of any one dummy variable by looking at the values of the rest.
Let’s understand this with an example.
If the values of ‘Diesel’ and ‘Petrol’ for one record are ‘1’ and ‘0’ respectively, the value of the ‘CNG’ column must be ‘0’, because only one of these dummy variables can be ‘1’ at a time. It means there is some redundancy if we keep all three dummy variable columns.
Since MLR assumes that there is no multicollinearity, we need to deal with the correlation among dummy variables. The simplest solution is to remove one of the dummy variable columns.
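The redundancy can be checked numerically. The sketch below uses made-up dummy values for five cars (an assumption, not real data). In a model with an intercept $\beta_0$, the three dummy columns always add up to the all-ones intercept column, so the matrix of inputs has fewer independent columns than it appears to have.

```python
import numpy as np

# Hypothetical dummy columns for five cars (exactly one fuel type per car).
diesel = np.array([1, 0, 0, 1, 0])
petrol = np.array([0, 1, 0, 0, 1])
cng = np.array([0, 0, 1, 0, 0])

# 'CNG' is fully determined by the other two columns.
cng_recovered = 1 - diesel - petrol
print(np.array_equal(cng, cng_recovered))

# Input matrices with an intercept column of ones.
intercept = np.ones(5)
X_all = np.column_stack([intercept, diesel, petrol, cng])
X_dropped = np.column_stack([intercept, diesel, petrol])

# Rank counts the linearly independent columns.
rank_all = np.linalg.matrix_rank(X_all)          # 4 columns, but one is redundant
rank_dropped = np.linalg.matrix_rank(X_dropped)  # every column carries information
print(rank_all, rank_dropped)
```

Both matrices have the same rank, which means the fourth column adds no new information once the other three are present.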
So, it can be inferred that we don’t need all the dummy variables at once. We can omit one of them for sure.
In this case, let’s drop the ‘CNG’ column, and the new dataset will be as shown below.
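In pandas, one way to sketch this step is the `drop_first=True` option of `get_dummies`, which omits the first category in sorted order. With the three fuel types here, that first category happens to be ‘CNG’. The rows below are made up for illustration and the variable names are assumptions.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real dataset.
cars = pd.DataFrame({
    "Age": [23, 30, 41],
    "Fueltype": ["Diesel", "Petrol", "CNG"],
})

# drop_first=True omits the first category in sorted order, which is 'CNG' here.
dummies = pd.get_dummies(cars["Fueltype"], drop_first=True, dtype=int)
cars_reduced = pd.concat([cars.drop(columns="Fueltype"), dummies], axis=1)
print(cars_reduced)
```

A CNG car is now the row where both `Diesel` and `Petrol` are 0, so no information is lost.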
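Before dropping ‘CNG’, the equation with all three dummy terms was
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6 + \beta_7 x_7 + \beta_8 x_8 + \beta_9 x_9 + \beta_{10} x_{10} + \beta_{11} x_{11} + \beta_{12} x_{12} + e_i$$
After dropping the CNG term $\beta_{10} x_{10}$, the new equation is
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6 + \beta_7 x_7 + \beta_8 x_8 + \beta_9 x_9 + \beta_{11} x_{11} + \beta_{12} x_{12} + e_i \qquad (3)$$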
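To see how equation (3) turns inputs into a predicted price, here is a small sketch that evaluates it for one car. All coefficient values and the car’s values are hypothetical, chosen only to show the arithmetic; they are not fitted from any data, and the helper `predict_price` is an assumed name.

```python
import numpy as np

# Order of the inputs, matching the terms kept in equation (3).
feature_names = ["Age", "KM", "HP", "MetColor", "Automatic",
                 "CC", "Doors", "Weight", "Diesel", "Petrol"]

# Hypothetical coefficients, chosen only to show the arithmetic.
beta0 = 5000.0
betas = np.array([-120.0, -0.02, 30.0, 50.0, 300.0,
                  -1.0, 10.0, 15.0, 400.0, 900.0])

# One made-up Diesel car (Diesel = 1, Petrol = 0).
x = np.array([24, 40000, 90, 1, 0, 2000, 3, 1165, 1, 0], dtype=float)


def predict_price(beta0, betas, x):
    """Evaluate y = beta0 + sum(beta_i * x_i) for one record (error term omitted)."""
    return beta0 + float(np.dot(betas, x))


price = predict_price(beta0, betas, x)
print(price)
```

A CNG car would be passed in with both `Diesel` and `Petrol` set to 0, so its fuel type contributes nothing beyond the intercept.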
Before going ahead, it’s time to introduce the term ‘P-value’.
P-value: the probability of seeing results at least as extreme as the ones observed, assuming the null hypothesis is true. A small p-value is evidence against the null hypothesis.
Null hypothesis: the hypothesis that there is no relationship between the variables being studied. For an independent variable in regression, it says that the variable’s coefficient is zero, i.e. the variable has no effect on the dependent variable. The p-value of an independent variable therefore tells us whether it should be considered while calculating the value of the dependent variable.
Sometimes, not all the fields in the dataset are required to determine the value of the dependent variable. Therefore, we can eliminate those variables that don’t have much impact on the dependent variable. For this, we calculate the p-value of each independent variable, and if it comes out greater than the significance level of 0.05, that independent variable can be removed from our calculations.
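The p-value check can be sketched by hand with NumPy and SciPy. The data below is synthetic (an assumption for illustration): `y` is built to depend on `x1` but not on `x2`. The p-values come from the usual t-test on each least-squares coefficient; libraries such as statsmodels report the same quantity directly.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200

# Synthetic data (for illustration only): y depends on x1 but not on x2.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 3.0 + 2.0 * x1 + rng.normal(size=n)

# Input matrix with an intercept column of ones.
X = np.column_stack([np.ones(n), x1, x2])

# Ordinary least squares estimate of the coefficients.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Standard errors, t-statistics and two-sided p-values.
residuals = y - X @ beta
dof = n - X.shape[1]
sigma2 = residuals @ residuals / dof
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t_stats = beta / std_err
p_values = 2 * stats.t.sf(np.abs(t_stats), dof)

# Compare each p-value with the 0.05 significance level.
for name, p in zip(["intercept", "x1", "x2"], p_values):
    verdict = "keep" if p <= 0.05 else "candidate for removal"
    print(f"{name}: p = {p:.4f} -> {verdict}")
```

In practice the same comparison would be made for each independent variable of the car dataset, removing those whose p-value stays above 0.05.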