Correlation

Correlation is any statistical relationship between two random variables, regardless of whether the relationship is causal (one variable causes the other) or not. Although correlation technically refers to any statistical association, the term is typically used to describe how closely two variables are linearly related.

Even though correlation cannot be used to prove a causal relationship between two variables, it can be used to make predictions. For example, given two variables that are highly correlated, we can predict the value of one from the other with reasonable accuracy.
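One way to see how correlation supports prediction is to fit a straight line to paired data and read a predicted value off that line. The sketch below is only an illustration: the height and weight values are invented for the example, and fit_line is a hypothetical helper name, not something defined elsewhere on this page.

```python
# Hypothetical paired data, invented for illustration:
# heights in centimeters and weights in kilograms.
heights = [150, 160, 165, 170, 175, 180, 190]
weights = [52, 58, 63, 66, 72, 77, 85]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared deviations in x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(heights, weights)

# Predict the weight for a height that is not in the data set.
predicted = slope * 172 + intercept
print(f"predicted weight at 172 cm: {predicted:.1f} kg")
```

The prediction is only as good as the correlation is strong: the more tightly the data follow a line, the closer predictions like this tend to be to the actual values.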

Correlation between two random variables is typically presented graphically using a scatter plot, or numerically using a correlation coefficient.

Scatter plots

A scatter plot is a graph in which each dot represents one pair of values from a data set. For example, a plot of weight vs. height will typically show a positive correlation: as height increases, weight also tends to increase.

Scatter plots are constructed by plotting one variable along the horizontal (x) axis and the other along the vertical (y) axis. Below are examples of scatter plots showing a positive correlation, a negative correlation, and little or no correlation. Note that the more closely the cluster of dots resembles a straight line, the stronger the correlation.

Positive correlation - The two random variables increase together. There is a positive correlation between height and weight: weight increases as height increases.

Negative correlation - One of the random variables increases as the other decreases. There is a negative correlation between speed and the amount of time it takes to get somewhere: as speed increases, it takes less time to reach a destination.

No correlation - There is no linear relationship between the two random variables. There is no correlation between being able to write in cursive and the number of fish in the ocean.

Correlation coefficient

A correlation coefficient is a numerical representation of the relationship between a pair of random variables. There are several different correlation coefficients, the most commonly used of which is the Pearson correlation coefficient.

The Pearson correlation coefficient (r), also referred to as Pearson's r, is a value between -1 and +1 that describes the linear relationship between two random variables. The closer r is to -1 or +1, the stronger the linear relationship between the variables. An r of 0 means that there is no linear correlation between the variables at all:

r = 1: perfect positive correlation
r = -1: perfect negative correlation
r = 0: no linear correlation
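The three cases above can be checked numerically. The following is a minimal sketch of the sample Pearson coefficient written from its standard definition; pearson_r is a hypothetical helper name chosen for this example, and the data sets are made up so that each one illustrates a single case.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)


xs = [1, 2, 3, 4, 5]

# Points on an increasing line: perfect positive correlation.
r_positive = pearson_r(xs, [2 * x + 1 for x in xs])

# Points on a decreasing line: perfect negative correlation.
r_negative = pearson_r(xs, [10 - 3 * x for x in xs])

# Points on a symmetric parabola (y = x^2): no linear correlation,
# even though y is completely determined by x.
r_zero = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

print(r_positive, r_negative, r_zero)
```

The parabola case shows why r = 0 should be read as "no linear correlation" rather than "no relationship": the two variables are clearly related, just not along a straight line.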

The Pearson correlation coefficient is calculated using one of two equations, depending on whether you are working with a sample (r) or a population (ρ). In most cases, a sample is used since it is typically too expensive or difficult to procure data for an entire population. The difference between the two formulas is that the sample formula uses estimates of the covariance and variances, since the true population values are not known. Refer to the correlation coefficient page for more information.
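For reference, the two formulas are commonly written as follows. The notation here is an assumption rather than something taken from this page: x_i and y_i are the n paired sample values, \bar{x} and \bar{y} are the sample means, and \sigma_X and \sigma_Y are the population standard deviations.

```latex
% Sample correlation coefficient
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}

% Population correlation coefficient
\rho = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}
```

In the sample formula, the factors of 1/(n - 1) from the estimated covariance and variances cancel, which is why they do not appear.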