Multicollinearity in Linear Regression: Why is it a problem?
When creating a multiple linear regression model, it is always important to check for correlations between your features. If two features are correlated with each other, your coefficients will most likely be difficult to interpret. This condition is called multicollinearity.
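One quick way to run that check is to look at the pairwise correlation matrix of the features. Here is a minimal sketch in pandas, using made-up feature names:

```python
import pandas as pd

# Hypothetical features; x2 is an exact multiple of x1.
features = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 4, 6, 8, 10],
    "x3": [5, 3, 6, 2, 7],
})

# Large absolute off-diagonal values flag potential multicollinearity.
print(features.corr())
```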
That’s about as deep into the topic as most of us go. One thing not typically discussed is the reasoning behind this concept. Why does having two features that relate to each other skew your model?
In trying to understand this better, I took a simple dataset from Kaggle that you can find here. This is a simple but extreme example, and hopefully we can use it to illustrate the point. Say I am trying to model the effect that the GRE score and the university rating have on the chance of admission. Here is my initial OLS model:
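For reproducibility, here is roughly how such a model can be fit with statsmodels. This is a hedged sketch: the file name and column names are assumptions about the Kaggle CSV, not something pinned down in the post.

```python
import pandas as pd
import statsmodels.api as sm

# Assumed file and column names from the Kaggle admissions dataset.
df = pd.read_csv("Admission_Predict.csv")
df.columns = [c.strip().replace(" ", "_") for c in df.columns]  # e.g. GRE_Score

X = sm.add_constant(df[["GRE_Score", "University_Rating"]])  # add an intercept
y = df["Chance_of_Admit"]

model = sm.OLS(y, X).fit()
print(model.summary())
```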

As you can see, the R² score isn’t bad. If you look at the coefficients, you will see that for every one-point increase in GRE_Score, the chance of getting accepted increases by 0.0074, and for every one-unit increase in University_Rating, it increases by 0.0393.
Now take a look at the model summary below, which is identical to the one above except for the duplicated university rating column that I created.
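The duplication can be reproduced roughly like this, reusing the hypothetical names from the sketch above:

```python
# Add an exact copy of the rating column and refit the model.
X2 = df[["GRE_Score", "University_Rating"]].copy()
X2["University_Rating_dup"] = X2["University_Rating"]  # perfect duplicate
X2 = sm.add_constant(X2)

model2 = sm.OLS(y, X2).fit()
print(model2.summary())  # the default pinv solver splits the rating effect
```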

The coefficients for the university rating aren’t accurate anymore. The reason is that the two columns are perfectly correlated with each other. Each column is credited with half the effect on the target variable, because the model believes they are both contributing to the target variable’s increase together when, in fact, each one alone does the job exactly as well as both together.
If that doesn’t make sense, here is an intuitive real-life example that I found on Stack Overflow: imagine you have two people pushing a boulder up a hill. It is very difficult to figure out how much each one pushed because they are working at the same time. If they each push it separately, it is possible to figure out how much each one is doing. To take this back to our example, the model cannot figure out how each column independently affects the target variable, because both university rating columns affect the model in the same way. Therefore, the model assumes that each one is doing half the work: that’s why you see a coefficient of 0.0393 in the first model, while the second model splits the coefficient in two, giving each duplicate a coefficient of 0.0197.
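You can see this non-identifiability directly: with a duplicated column, any pair of coefficients that sums to 0.0393 produces exactly the same fitted values, so the data cannot prefer one split over another. A tiny sketch, reusing the hypothetical df from above:

```python
import numpy as np

rating = df["University_Rating"].to_numpy()

# Any split (a, b) with a + b = 0.0393 yields identical predictions.
for a in (0.0393, 0.0197, 0.0):
    b = 0.0393 - a
    print(a, b, np.allclose(a * rating + b * rating, 0.0393 * rating))  # True
```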