Simple Linear Regression in R for Dummies

What is it?

Simply put, simple linear regression is a method for modeling the relationship between two variables. When there is more than one independent variable, the method is called multiple linear regression. It is an important concept in data analysis and can tell us a lot about a dataset.

When working with data that has two variables, an independent variable and a dependent variable, we sometimes wish to see the relationship between the two. We want to know whether a change in one variable (the independent variable) has an impact on the value of the other (the dependent variable).

In Practice

One way to see this relationship is to plot the data in a scatter plot and then find the line of best fit, which is a line of the form y = mx + b that best fits the data points in the scatter plot. There is quite a lot of math behind finding the values m and b, but R makes this process very simple.
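To give a feel for the math that R handles for us, here is a minimal sketch of the standard least-squares formulas for m and b, written out by hand. The x and y values are made up for illustration and are not the dataset used in this post.

```r
# Made-up illustrative values (an assumption, not the article's data)
x <- c(150, 160, 170, 180, 190)  # independent variable
y <- c(50, 58, 66, 72, 80)       # dependent variable

# Slope: how x and y vary together, divided by how much x varies
m <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: the least-squares line passes through (mean(x), mean(y))
b <- mean(y) - m * mean(x)

print(c(intercept = b, slope = m))
```

These are the same two numbers that lm() would report for this toy data.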

Take the following data set for example: 


In this dataset, we want to determine the line of best fit, which can be done in R by finding the values of b and m. These values can also be denoted by β0 (intercept) and β1 (slope) respectively. The intercept represents the average weight when the height is 0, and the slope represents the average change in weight when we increase height by 1 unit. R uses linear regression to find these values.


To begin the task, set your working directory and load in the data set using the read.csv function in RStudio:




You will notice that I also called head(data), which shows a preview of our data in the Console (bottom left section). I also created weight and height variables, which represent the weight and height columns of our data. I also multiplied each entry in the height vector by 100 so that our heights are in cm rather than m.
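A minimal sketch of this loading step might look like the following. The file name, the column names Height and Weight, and the values are all assumptions made for illustration. The sketch writes a small stand-in CSV to a temporary folder so that it runs on its own; with a real file you would skip that part and point read.csv() at your own file.

```r
# Hypothetical stand-in for the CSV file: the file name, column names,
# and values are assumptions, not the article's data.
csv_path <- file.path(tempdir(), "data.csv")
write.csv(
  data.frame(Height = c(1.50, 1.60, 1.70, 1.80, 1.90),
             Weight = c(50, 58, 66, 72, 80)),
  csv_path,
  row.names = FALSE
)

# setwd("path/to/your/folder")  # set the working directory if needed
data <- read.csv(csv_path)
head(data)                       # preview the first rows in the Console

weight <- data$Weight
height <- data$Height * 100      # convert metres to centimetres
```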

Next, we plot this data using the plot() function to get some insight into what we are working with. We also set the title of the plot as well as the axis labels.
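As a rough sketch, the plotting call could look like this. The height and weight values are made up so the snippet runs on its own, and the title and axis labels are only examples.

```r
# Made-up illustrative values (an assumption, not the article's data)
height <- c(150, 160, 170, 180, 190)  # in cm
weight <- c(50, 58, 66, 72, 80)

# Scatter plot with a title and axis labels
plot(height, weight,
     main = "Weight vs. Height",
     xlab = "Height (cm)",
     ylab = "Weight")
```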




In the bottom right tab (plots), we see the scatter plot we just generated: 





Next, we use the lm() function to create a linear regression model, which will compute the values for the intercept and slope as well as other values:



You will notice I used the summary() function to summarize the model we just created. We can find its output in the console.
Calling summary(model) shows us that the estimated values are β0 = -39.06196 and β1 = 0.61272.
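The model-fitting step described above can be sketched as follows. The values are made up for illustration, so the estimates printed here will differ from the ones reported in this post.

```r
# Made-up illustrative values (an assumption, not the article's data)
height <- c(150, 160, 170, 180, 190)  # in cm
weight <- c(50, 58, 66, 72, 80)

# Fit weight as a linear function of height
model <- lm(weight ~ height)

# Full summary: estimates, standard errors, R-squared, and more
summary(model)

# Just the two estimates: the intercept and the slope
coef(model)
```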


What does this mean?

An intercept of -39.06196 seems pretty unrealistic, as neither height nor weight can be negative, but it starts to make sense when you realize that it represents the average weight when height is equal to 0. A height of 0 cannot occur and lies far outside the range of our data, so the intercept only tells us where the fitted line crosses the axis, and its negative value is not a problem. When the height is increased by 1 cm, the weight increases on average by 0.61272 lbs (slope).
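To make the interpretation concrete, here is a small worked example that plugs the two estimates reported above into the line equation. The height of 170 cm is a hypothetical value chosen for illustration.

```r
# Estimates reported by summary(model) in this post
intercept <- -39.06196
slope <- 0.61272

# Hypothetical height in cm, chosen only for illustration
new_height <- 170

# Predicted average weight: intercept + slope * height
predicted_weight <- intercept + slope * new_height
print(predicted_weight)  # about 65.1, in the dataset's weight units
```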

Next, we use these values to add the line of best fit to our plot:

As we can see above, we created two values named intercept and slope, which we accessed through the summary() function. We then used these values in the abline() function to plot the line.
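One possible way to write this step is sketched below. The values are made up for illustration, and the exact indexing into the summary is an assumption about how the values were pulled out.

```r
# Made-up illustrative values (an assumption, not the article's data)
height <- c(150, 160, 170, 180, 190)  # in cm
weight <- c(50, 58, 66, 72, 80)

model <- lm(weight ~ height)

# The coefficient table has one row per term; column 1 holds the estimates
coefs <- summary(model)$coefficients
intercept <- coefs[1, 1]  # "(Intercept)" row
slope <- coefs[2, 1]      # "height" row

# Draw the scatter plot first, then add the fitted line on top of it
plot(height, weight,
     main = "Weight vs. Height",
     xlab = "Height (cm)",
     ylab = "Weight")
abline(a = intercept, b = slope)
```

Note that abline() only adds to an existing plot, so plot() has to be called first.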


The resulting plot is:
We can see that the line fits pretty well, and it also shows that there is a positive association between the two variables. We are visually able to see that as height increases, so does weight.


Final Words


What we have done above is just a very simple, shallow introduction to the world of linear regression. There are many other concepts that we can implement using R.

These few lines of R code allowed us to gain insight into a set of data. We were able to find the best estimates needed to graph the line of best fit, as well as compute additional information about these estimates.

The concept of linear regression is a powerful tool, and R is an even more powerful tool. We can extend this simple linear regression to multiple linear regression, which involves data sets with more than 2 variables (we can add gender as a variable). We can also extend this concept to datasets with thousands of data points, rather than the small number we worked with. Linear regression is a complicated concept that involves very heavy math, but tools such as R simplify the process and allow even beginners to conduct simple data analysis using linear regression without needing to know the math behind the computations.
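As a rough sketch of what that extension could look like, the snippet below adds a gender column as a second predictor. The data frame, its column names, and its values are all made up for illustration.

```r
# Made-up illustrative data (an assumption, not the article's dataset)
people <- data.frame(
  height = c(150, 160, 170, 165, 175, 185),  # in cm
  weight = c(50, 57, 63, 62, 70, 79),
  gender = factor(c("F", "F", "F", "M", "M", "M"))
)

# Multiple linear regression: weight explained by height and gender
model2 <- lm(weight ~ height + gender, data = people)
coef(model2)
```

The only change from the simple case is the extra term on the right side of the formula.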
