To prepare for some of my job interviews, I’ve been reviewing essential concepts from my Data Science Immersive course at the Flatiron School. Lately, I’ve been revisiting ideas from early in the program, when we focused on Statistical Theory, Probability, and Linear Regression.
Linear Regression models are not just used for prediction; they can also help you understand the relationships between different attributes of your data. One of my instructors at the Flatiron School would build an OLS model as part of his exploratory data analysis to see how much of his dependent variable could be explained by the data at hand.
One concept essential to understanding this relationship is R², or the Coefficient of Determination. R² is a way of understanding a model’s goodness of fit, or how much of the variance in the dependent variable is explained by the independent variable. To save you a Google search, the dependent variable is also known as the target, or what you’re trying to predict based on the other parts of your data. For example, if you were trying to predict annual salary based on age, annual salary would be your dependent variable while age would be your independent variable.
R-Squared compares the performance of your linear regression model against the performance of a baseline model. The baseline model is simply the mean of the observed values, irrespective of the values of X. So if the mean annual salary was 50,000, this baseline model, or mean model, would always predict 50,000, whether a person was 16 or 1600.
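To make the mean model concrete, here is a minimal sketch using made-up salary numbers (the values are hypothetical, chosen so the mean works out to 50,000):

```python
# Hypothetical annual salaries (the dependent variable)
salaries = [42_000, 55_000, 48_000, 61_000, 44_000]

# The baseline "mean model" ignores age entirely and always
# predicts the mean of the observed values
baseline_prediction = sum(salaries) / len(salaries)
print(baseline_prediction)  # 50000.0
```

Whatever age you feed this "model," it returns 50,000 — which is exactly why it makes a useful floor to compare a real regression against.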
Calculating R-Squared involves dividing the sum of squared errors from your linear regression model (SSres) by the sum of squared errors from this simple baseline model (SStot), then subtracting that quotient from 1. The closer the quotient is to 1 (i.e., the more similar these two sums of squares are to each other), the lower your R-Squared value will be, and thus the less of the variance in the relationship between these two variables your linear regression model is able to describe.
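The calculation can be sketched in a few lines of plain Python. The observed and predicted values below are hypothetical, standing in for a fitted salary-vs-age model:

```python
# Hypothetical observed salaries and predictions from a fitted model
y_true = [42_000, 55_000, 48_000, 61_000, 44_000]
y_pred = [43_000, 53_000, 49_000, 60_000, 45_000]

mean_y = sum(y_true) / len(y_true)  # the baseline (mean) model

# SSres: squared errors of the regression model
ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
# SStot: squared errors of the baseline model
ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)

r_squared = 1 - ss_res / ss_tot
print(r_squared)  # 0.968
```

Since SSres (8,000,000) is tiny compared to SStot (250,000,000), the quotient is close to 0 and R² comes out high.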
A way to express an R-Squared of 0.85 would be: “85% of the variance in the dependent variable is accounted for by this linear regression.”
Though one might initially want a high R-Squared value, one should be wary of R-Squared values higher than 0.9, and frightened of R-Squared values of 1. Often this means there was some kind of data leakage, or your dependent variable somehow snuck into your independent variables. During a project predicting the length of time a dog would stay in the Austin Animal Shelter, one of our models had a suspiciously good R-Squared value; it turned out the model had the dog’s age upon leaving the shelter as one of its features. Fortunately, we caught this immediately and continued to engineer other features.
Depending on your project’s purpose, there are situations in which a high R-Squared might not even be relevant. For example, in a study on the relationship between religiosity and health, one would expect a low R-Squared because there are clearly many other factors at play, but an R-Squared of 0.10 to 0.15 can still let you know that there is some kind of relationship. This blog post goes into more detail: https://www.theanalysisfactor.com/assessing-the-fit-of-regression-models/
You might be wondering, as I did, how similar RMSE is to R-Squared. R-Squared is a relative measure of fit, judged against a baseline model, while RMSE is an absolute measure: the square root of the average squared difference between the observed and predicted values in your test set, given in the same units as the dependent variable. While R-Squared measures the amount of variance your linear regression model is able to explain, RMSE captures the size of the errors in your test-set predictions and is considered the most important metric if your aim is a predictive model.
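A quick sketch of RMSE, using the same hypothetical salary values as before, shows the contrast with R-Squared: rather than a unitless ratio, you get an error in dollars.

```python
from math import sqrt

# Hypothetical test-set salaries (observed) and model predictions
y_true = [42_000, 55_000, 48_000, 61_000, 44_000]
y_pred = [43_000, 53_000, 49_000, 60_000, 45_000]

# Mean of the squared prediction errors
mse = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / len(y_true)

# RMSE: back in the units of the target (dollars of salary)
rmse = sqrt(mse)
print(rmse)  # ~1264.91, i.e. predictions are off by about $1,265 on average
```

An RMSE of roughly 1,265 is immediately interpretable in salary terms, whereas the corresponding R-Squared only tells you how the model compares to always guessing the mean.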
That’s all for this week! I’ll be returning to more PySpark information next week.