Interpreting the Root Mean Squared Error of a Linear Regression Model
The first time I ever built a Linear Regression model, I thought two things:
- Wow! I built something that can actually predict housing prices!
- Ok, but how good are these predictions?
I had learned to check all of the assumptions of a Linear Regression model (residuals should have a normal distribution, features are linearly related to the target, there’s no multi-collinearity, etc.). I learned to scale and sometimes even log-transform my features and target. I even learned about mean squared error (MSE) and root mean squared error (RMSE) to interpret the residuals of the model. The problem was, once I scaled my target and calculated the MSE, that number was scaled too! I wasn’t sure how to read it accurately.
As a refresher, Mean Squared Error is the average of the squared differences between each predicted value and the actual value. We square each difference to get rid of negative numbers before taking the average, which means the MSE is expressed in squared units of the target (squared dollars, for example) and can look very large. Therefore, we often use Root Mean Squared Error to bring that number back to the same units as the target (RMSE is the square root of MSE).
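To make the definitions concrete, here is a minimal sketch of both metrics in NumPy. The function names `mse` and `rmse` and the three prices are made up for illustration; they are not taken from the original project.

```python
import numpy as np

def mse(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean((actual - predicted) ** 2)

def rmse(actual, predicted):
    """Square root of the MSE, expressed in the same units as the target."""
    return np.sqrt(mse(actual, predicted))

# Made-up sale prices in USD (illustrative only)
actual = [250_000, 310_000, 420_000]
predicted = [260_000, 300_000, 400_000]

print(mse(actual, predicted))   # in squared dollars
print(rmse(actual, predicted))  # in dollars
```

The errors here are $10,000, $10,000 and $20,000, so the MSE lands in the hundreds of millions (squared dollars) while the RMSE stays on the same scale as the errors themselves.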
For example, if the target variable is in USD (e.g. Sale Price of a home), what does a root mean squared error of 0.3 really tell you? Is that big or small? Maybe if you scaled everything from 0 to 1 you could argue that it’s small, but what if you used a different scaling technique? Moreover, one of the most important skills as a Data Scientist is being able to connect these metrics back to real-world applications and interpret them through that lens.
Let’s say I created an app that could predict your home price. As a homeowner, you might see this predicted home price and feel inclined to sell. Now, what if I predicted that your home was going to sell for $350,000? Maybe you get excited and decide to sell your home. When you go to sell, you realize that it will actually only sell for $250,000, which is less than you paid for the home. My model’s prediction was $100,000 off. For home prices, that is a lot, but how would you know that if the RMSE was 0.3?
This is why it is so important to interpret your Mean Squared Error and Root Mean Squared Error correctly. Therefore, in my first Linear Regression project, I reversed my log and normalization scaling so that my MSE and RMSE were back in USD. In this way, I was able to correctly interpret the RMSE. Below I will show you how I did this in Python.
Here are the libraries you will need (this does not include the libraries needed to actually build a linear regression model).
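The original import list is not preserved here, so the following is an assumed, minimal set that is enough for the transformations described in this article. The commented-out scikit-learn import is optional and only matters if you prefer its `mean_squared_error` helper over computing the metric by hand.

```python
import numpy as np   # log10, powers, square roots, and array math
import pandas as pd  # data frames holding the features and the target

# Optional, assuming scikit-learn is installed:
# from sklearn.metrics import mean_squared_error
```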
Log Transformation & Normalization
In this example, I am building a Linear Regression model to predict housing prices. The target feature here is housing prices, which are typically in USD (or whatever currency you’re working with). In the process of building this model, I decided to log transform price (the target).
Why? In order to make the distribution of this feature more normal, and thus easier to work with. In this example, I will take the log base 10 of all of the continuous variables.
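A minimal sketch of that log transform might look like the following. The data frame `df_cont` and its column names are hypothetical stand-ins for the real housing data, and the approach assumes every value is strictly positive, since the log of zero or a negative number is undefined.

```python
import numpy as np
import pandas as pd

# Hypothetical continuous columns (made-up values, not the article's data)
df_cont = pd.DataFrame({
    "price": [100_000.0, 250_000.0, 1_000_000.0],
    "sqft_living": [1_000.0, 2_000.0, 4_000.0],
})

# Take log base 10 of every continuous column at once
df_log = np.log10(df_cont)

print(df_log)
```

Using base 10 matters later: undoing this step means raising 10 to the power of each value, not calling `np.exp`.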
Next, I will scale the target and the other continuous variables to a standard normal scale.
Why? If we do this to the target and our other continuous features, it puts the beta coefficients of our model on a comparable scale, so we can interpret which features are having a greater impact on our target. I will create a function to standardize a column (subtract its mean, then divide by its standard deviation) so that we can apply it to our entire log data frame.
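Here is one way such a function could look. This is a sketch under assumptions: `standardize`, `df_log`, and the column names are hypothetical, and pandas' default sample standard deviation (`ddof=1`) is used, which may differ from what the original project did.

```python
import numpy as np
import pandas as pd

def standardize(column):
    """Center a column at 0 and scale it to unit standard deviation."""
    return (column - column.mean()) / column.std()  # pandas std uses ddof=1

# Hypothetical log-transformed continuous data (made-up values)
df_log = np.log10(pd.DataFrame({
    "price": [100_000.0, 250_000.0, 400_000.0, 1_000_000.0],
    "sqft_living": [1_000.0, 1_800.0, 2_500.0, 4_000.0],
}))

# Keep the target's mean and standard deviation: they are needed later
# to map scaled values back to log10(USD).
price_log_mean = df_log["price"].mean()
price_log_std = df_log["price"].std()

# Apply the function column by column
df_scaled = df_log.apply(standardize)
```

Saving `price_log_mean` and `price_log_std` at this point is the design choice that makes the later reversal possible; without them, the scaled target cannot be converted back.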
Awesome! Now we have a data frame of log-transformed, standard normalized continuous data. I can concatenate this with our categorical variables and run the model based on this. Next, I will fast-forward to the interpretation stage of the model to see how to change the target feature back to USD.
Transforming Mean Squared Error Back to USD
After building the model, I can validate it with a train-test split or k-fold cross-validation. Model validation is important to see if the model can predict a target using new data, instead of just the data it was trained on. We can analyze whether the model is overfitting (i.e. it predicts the training data very well but cannot generalize to new data) or underfitting (i.e. it is too simple to capture the pattern, so its predictions are far off even on the training data). During model validation, we often look at Mean Squared Error (MSE) or Root Mean Squared Error (RMSE), which summarize the typical size of the gap between the model’s predicted target value and the actual target value (squared to get rid of negative numbers). In this example, I can use RMSE to see how far off the model’s predicted price generally is from the actual home price.
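As a rough illustration of comparing training and test RMSE, here is a self-contained sketch on synthetic data. It uses a plain NumPy least-squares fit as a stand-in for whatever modeling library the original project used; the data, the 80/20 split, and the helper names are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: one feature, a linear signal plus noise (hypothetical)
n = 200
X = rng.normal(size=(n, 1))
y = 3.0 * X[:, 0] + 1.0 + rng.normal(scale=0.5, size=n)

# 80/20 train-test split on shuffled row indices
idx = rng.permutation(n)
train_idx, test_idx = idx[:160], idx[160:]

def add_intercept(features):
    """Prepend a column of ones so the fit includes an intercept term."""
    return np.column_stack([np.ones(len(features)), features])

def rmse(actual, predicted):
    return np.sqrt(np.mean((actual - predicted) ** 2))

# Fit ordinary least squares on the training rows only
coef, *_ = np.linalg.lstsq(add_intercept(X[train_idx]), y[train_idx], rcond=None)

train_rmse = rmse(y[train_idx], add_intercept(X[train_idx]) @ coef)
test_rmse = rmse(y[test_idx], add_intercept(X[test_idx]) @ coef)

print(f"train RMSE: {train_rmse:.3f}, test RMSE: {test_rmse:.3f}")
```

A test RMSE that is far larger than the training RMSE would point toward overfitting; two similar values suggest the model generalizes about as well as it fits.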
Remember that the RMSE will still reflect a price that has been log-transformed and standard-normal scaled. Here’s how I changed it back to USD.
Step 1: Build a function to undo the scale.
Step 2: Invert the scale using this function.
Step 3: Invert the log.
Step 4: Find the mean squared error and root mean squared error.
(I did steps 2–4 in just a few lines of code. To break it down: first invert the normalization scale on both the predicted and actual values, then invert the log by raising 10 to the power of each resulting array, then compute the MSE and RMSE from those USD values.)
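Steps 1 through 4 can be sketched roughly as follows. This is not the original code: the prices are made up, the "predictions" are the scaled actual values with small hand-picked offsets standing in for a real model's output, and `undo_standardize` is a hypothetical helper name.

```python
import numpy as np

# Step 1: a function that undoes standardization
def undo_standardize(scaled, mean, std):
    """Map standardized values back to the scale they had before scaling."""
    return scaled * std + mean

# Made-up sale prices in USD, transformed the same way as in the article
prices = np.array([200_000.0, 350_000.0, 500_000.0, 800_000.0])
log_prices = np.log10(prices)
log_mean, log_std = log_prices.mean(), log_prices.std(ddof=1)
y_scaled = (log_prices - log_mean) / log_std

# Stand-in predictions on the scaled log scale (not from a fitted model)
y_pred_scaled = y_scaled + np.array([0.1, -0.2, 0.05, -0.1])

# RMSE on the transformed scale: hard to interpret
rmse_scaled = np.sqrt(np.mean((y_scaled - y_pred_scaled) ** 2))

# Steps 2 and 3: undo the scaling, then undo the log10
y_usd = 10 ** undo_standardize(y_scaled, log_mean, log_std)
y_pred_usd = 10 ** undo_standardize(y_pred_scaled, log_mean, log_std)

# Step 4: MSE and RMSE in squared dollars and dollars
mse_usd = np.mean((y_usd - y_pred_usd) ** 2)
rmse_usd = np.sqrt(mse_usd)

print(f"RMSE (scaled): {rmse_scaled:.3f}")
print(f"RMSE (USD):    {rmse_usd:,.0f}")
```

Note that the back-transformation is applied to the predicted and actual values before the error is computed. Because the log is nonlinear, applying the inverse transformations to the scaled RMSE itself would not give the RMSE in dollars.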
Step 5: Interpret.
As you can see, the final RMSE for the training set was $130,614. For the testing set, the RMSE was $131,683. This means that the model’s predicted sale price is typically about $130,000 off. If I hadn’t changed that RMSE back to USD, it would have been 0.38 and 0.39 for the training and test RMSEs respectively, which is much harder to interpret. One thing to note is that since the training and test RMSE are very close, there is no sign that this model is overfitting.
Now, is this a good model? Probably not… If I told you your house was going to sell for $330,000 and then it actually only sold for $200,000, you might be pretty mad. $130,000 is a big difference when you’re buying/selling a home. But what if I told you that I could predict GDP (Gross Domestic Product) of the US with only an average error of $130,000? Well, since the GDP is over $20 trillion, that difference wouldn’t be so big.
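One quick way to put this comparison into numbers is to divide the error by the size of the thing being predicted. The sketch below uses the rounded figures from the text; the variable names are illustrative, and a single example price is only a rough stand-in for a typical value.

```python
rmse_usd = 130_000            # rounded RMSE from the model above
example_home_price = 330_000  # the example sale price used in the text
us_gdp = 20e12                # roughly $20 trillion, as stated in the text

# Error expressed as a fraction of the quantity being predicted
home_ratio = rmse_usd / example_home_price
gdp_ratio = rmse_usd / us_gdp

print(f"Error relative to the home price: {home_ratio:.1%}")
print(f"Error relative to US GDP: {gdp_ratio:.1e}")
```

The same dollar error is a large share of a home price and a vanishingly small share of GDP, which is exactly why the number means little without its context.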
If I could give you one takeaway from this article, it’s this:
Understanding the context of the real-world application of the data you’re working with is essential in order to properly interpret the model’s metrics (like MSE and RMSE).
Without knowing about housing prices, how would you know if your error is big or small? It is so important to think about how your model will be used in practice. Once you understand that, you can transform the target back to its original unit, and truly understand your model’s errors/residuals.