Pearson Correlation vs Simple Linear Regression
V. Cave & C. Supakorn
Both Pearson correlation and simple linear regression can be used to determine whether two numeric variables are linearly related. However, there are important differences between the two methods. Pearson correlation is a measure of the strength and direction of the linear association between two numeric variables that makes no assumption of causality.
Simple linear regression describes the linear relationship between a response variable (denoted by y) and an explanatory variable (denoted by x) using a statistical model, and this model can be used to make predictions.
The following table summarizes the key similarities and differences between the Pearson correlation and simple linear regression.
Pearson correlation is a number ranging from -1 to 1 that represents the strength of the linear relationship between two numeric variables. A value of 1 corresponds to a perfect positive linear relationship, a value of 0 to no linear relationship, and a value of -1 to a perfect negative linear relationship.
The Pearson correlation (r) between variables “x” and “y” is calculated using the formula:

r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[Σ(xᵢ − x̄)² × Σ(yᵢ − ȳ)²]

where x̄ and ȳ are the sample means of “x” and “y”.
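The calculation can be sketched in a few lines of Python (this is an illustrative implementation of the standard formula, not the article's Genstat code; the function name `pearson_r` is ours):

```python
import math

def pearson_r(x, y):
    """Pearson correlation: the sum of cross-products of deviations
    from the means, divided by the square root of the product of
    the two sums of squared deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# A perfectly linear increasing sequence gives r = 1.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
# A perfectly linear decreasing sequence gives r = -1.
print(pearson_r([1, 2, 3], [3, 2, 1]))  # -1.0
```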
Simple linear regression
If we are interested in the effect of an “x” variate (i.e. a numeric explanatory or independent variable) on a “y” variate (i.e. a numeric response or dependent variable), regression analysis is appropriate. Simple linear regression describes the response variable “y” by the model:

y = a + bx
where the coefficients “a” and “b” are the intercept and slope of the regression line, respectively. The intercept “a” is the value of “y” when “x” is zero. The slope “b” is the change in “y” for every one-unit increase in “x”. When the correlation is positive, the slope (“b”) of the regression line will be positive, and vice versa.
The above model is theoretical, and in practice the observations will not fall exactly on the line. The statistical model is given by:

y = a + bx + ε

where ε is the residual error term, i.e. the deviation of each observation from the fitted line.
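The coefficients “a” and “b” are estimated by ordinary least squares. A minimal Python sketch of those textbook formulas (illustrative only; the function name and the toy data are ours, not the article's):

```python
def fit_simple_linear(x, y):
    """Ordinary least squares for y = a + b*x:
    b = Sxy / Sxx, then a = y_bar - b * x_bar."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sxy / sxx
    a = ybar - b * xbar
    return a, b

# Hypothetical data lying exactly on y = 1 + 2x, so the fit
# recovers a = 1 and b = 2 with zero residual error.
a, b = fit_simple_linear([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)  # 1.0 2.0
```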
Example using Genstat
Example: A concrete manufacturer wants to know if the hardness of their concrete depends on the amount of cement used to make it. They collected data from thirty batches of concrete:
The scatterplot of the data suggests that the two variables are linearly related:
Let’s first assess whether there is evidence of a significant Pearson correlation between the hardness of the concrete and the amount of cement used to make it. Our null hypothesis is that the true correlation equals zero. Using Genstat, we can see that the correlation estimated from the data is 0.82, with a p-value of <0.001. That is, there is strong statistical evidence of a linear relationship between the two variables.
Here we see the Correlations menu settings and the output this produces in Genstat:
Note, the validity of our hypothesis test depends on several assumptions, including that x and y are continuous, jointly normally distributed (i.e. bivariate Normal), random variables. If the scatterplot indicates a non-linear relationship between x and y, the bivariate Normal assumption is violated. You should also check whether both the x and y variables appear to be normally distributed. This can be done graphically (e.g. by inspecting a boxplot, histogram, or Q-Q plot) or with a hypothesis test (e.g. the Shapiro-Wilk test). For both variables in our dataset, neither the boxplots nor the Shapiro-Wilk tests indicate a lack of normality.
As the hardness of the concrete is assumed to represent a response to changes in the amount of cement, it is more informative to model the data using a simple linear regression.
Here we see the Linear Regression and Linear Regression Options menu settings and the output this produces in Genstat:
Using Genstat, we obtain the following regression line:
Predicted hardness of concrete = 15.91 + 2.297 x amount of cement
That is, the model predicts that for every one unit increase in the amount of cement used, the hardness of the concrete produced increases by 2.297 units. The predicted hardness with 20 units of cement is
15.91 + (2.297 x 20) = 61.85 units.
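This prediction can be checked with a couple of lines of Python (the intercept and slope are the ones quoted from the Genstat output above; the function name `predict_hardness` is ours):

```python
# Coefficients quoted from the fitted regression line in the article.
intercept, slope = 15.91, 2.297

def predict_hardness(cement):
    """Predicted hardness for a given amount of cement."""
    return intercept + slope * cement

print(round(predict_hardness(20), 2))  # 61.85
```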
For a simple linear regression, we are also interested in whether there is evidence of a linear relationship with the explanatory variable. This can be assessed using the variance ratio (v.r.), which, under the null hypothesis of no linear relationship, follows an F distribution. The p-value from this test is <0.001, providing strong statistical evidence of a relationship. The percentage variance accounted for summarizes how much variability in the data is explained by the regression model, in this example 65.7%.
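These summaries come from the regression ANOVA: the variance ratio is the regression mean square over the residual mean square, and (on our understanding of Genstat's output) the percentage variance accounted for is 100 times the adjusted R². A rough Python sketch, using small hypothetical data rather than the article's thirty batches:

```python
def anova_simple_linear(x, y):
    """Variance ratio (F), R^2, and adjusted R^2 for a simple
    linear regression, from the usual sums of squares."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sst = sum((yi - ybar) ** 2 for yi in y)  # total sum of squares
    ssr = sxy ** 2 / sxx                     # regression sum of squares
    sse = sst - ssr                          # residual sum of squares
    f = (ssr / 1) / (sse / (n - 2))          # variance ratio, df = 1 and n-2
    r2 = ssr / sst
    adj_r2 = 1 - (sse / (n - 2)) / (sst / (n - 1))
    return f, r2, adj_r2

# Hypothetical, nearly linear data (not the article's dataset).
f, r2, adj_r2 = anova_simple_linear([1, 2, 3, 4, 5],
                                    [2.1, 3.9, 6.2, 7.8, 10.1])
print(round(r2, 3))  # 0.997
```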
The residuals (ε) from the regression model are assumed to be independent and normally distributed with constant variance. The residual diagnostic plot (see above) is useful for helping check these assumptions. The histogram (top left), Normal plot (bottom left), and half-Normal plot (bottom right) are used to assess normality; the histogram should be reasonably symmetric and bell-shaped, both the Normal plot and half-Normal plot should form roughly a straight line. The scatterplot of residuals against fitted values (top right) is used to assess the constant variance assumption; the spread of the residuals should be equal over the range of fitted values. It can also reveal violations of the independence assumption or a lack of fit; the points should be randomly scattered without any pattern. For our example, the residual diagnostic plot looks adequate.
This article describes two standard methods for investigating the relationship between two numeric variables and introduces Genstat as a tool for calculating correlation, performing linear regression, and hypothesis testing. We hope that it is helpful for educators who are interested in learning more about Pearson correlation and simple linear regression.
Check out the Pearson Correlation vs Simple Linear Regression video