Correlation is any statistical relationship between two random variables, regardless of whether the relationship is causal (one variable causes the other) or not. Although correlation technically refers to any statistical association, the term is typically used to describe how linearly related two variables are.
Even though correlation cannot be used to prove a causal relationship between two variables, it can be used to make predictions. For example, given two variables that are highly correlated, we can predict the value of one from the other with reasonable accuracy.
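One common way to turn a strong linear correlation into a prediction is to fit a least-squares line, whose slope equals r multiplied by the ratio of the two sample standard deviations. The sketch below is only an illustration: the height and weight values are made up, and the function names pearson_r and fit_line are hypothetical, not taken from any library.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def fit_line(xs, ys):
    """Least-squares line y = slope * x + intercept, built from r."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Sample standard deviations (n - 1 in the denominator).
    sd_x = sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sd_y = sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    slope = pearson_r(xs, ys) * sd_y / sd_x
    intercept = my - slope * mx
    return slope, intercept

# Made-up illustrative data: heights in cm and weights in kg.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 77, 84]

r = pearson_r(heights_cm, weights_kg)
slope, intercept = fit_line(heights_cm, weights_kg)

# Predict the weight of a person who is 175 cm tall.
predicted_weight = slope * 175 + intercept
print(f"r = {r:.3f}, predicted weight at 175 cm = {predicted_weight:.2f} kg")
```

Because r is close to +1 for this data, the fitted line tracks the points closely and the prediction is correspondingly tight. With a weaker correlation the same line could still be computed, but its predictions would be far less reliable.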
Correlation between two random variables is typically presented graphically using a scatter plot, or numerically using a correlation coefficient.
Scatter plots are graphs in which each dot represents one pair of data values from an experiment. For example, a plot of weight vs. height will typically show a positive correlation: as height increases, weight also tends to increase.
Scatter plots are constructed by plotting one variable along the horizontal (x) axis and the other along the vertical (y) axis. Below are examples of scatter plots showing a positive correlation, a negative correlation, and little or no correlation. Note that the more closely the cluster of dots resembles a straight line, the stronger the correlation.
A correlation coefficient is a numerical representation of the relationship between a pair of random variables. There are several different correlation coefficients, the most commonly used of which is the Pearson correlation coefficient.
The Pearson correlation coefficient (r), also referred to as Pearson's r, is a value from -1 to +1, inclusive, that describes the linear relationship between two random variables. The sign of r gives the direction of the relationship (positive or negative), and the closer r is to -1 or +1, the stronger the linear relationship between the variables. An r of 0 means that there is no linear correlation between the variables at all, although a nonlinear relationship may still exist.
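The extreme values of r can be checked with a short calculation. The following is a minimal sketch using only small made-up datasets; the function name pearson_r is hypothetical, and the function implements the standard sample formula.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]

# y rises by a fixed amount for each step in x: perfect positive correlation.
r_positive = pearson_r(xs, [2, 4, 6, 8, 10])

# y falls by a fixed amount for each step in x: perfect negative correlation.
r_negative = pearson_r(xs, [10, 8, 6, 4, 2])

# y = x^2 on values symmetric about 0: y depends entirely on x,
# but the relationship is not linear, so r comes out as 0.
sym_xs = [-2, -1, 0, 1, 2]
r_nonlinear = pearson_r(sym_xs, [x ** 2 for x in sym_xs])

print(r_positive, r_negative, r_nonlinear)
```

The third case shows why r should be read as a measure of linear association only: the y values are completely determined by x, yet the coefficient is 0.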
The Pearson correlation coefficient is calculated using one of two equations, depending on whether you are working with a sample (r) or a population (ρ). In most cases, a sample is used, since it is typically too expensive or difficult to procure data for an entire population. The difference between the two formulas is that the sample formula uses estimates of the covariance and the variances, since the true population values are not known. Refer to the correlation coefficient page for more information.
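For reference, the two formulas are commonly written as follows. The notation is an assumption, since it does not appear in the text above: μ and σ denote the population means and standard deviations, x̄ and ȳ denote the sample means, and n is the sample size.

```latex
% Population correlation coefficient
\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}
           = \frac{E\left[(X - \mu_X)(Y - \mu_Y)\right]}{\sigma_X \, \sigma_Y}

% Sample correlation coefficient
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

In the sample formula, the factors of 1/(n - 1) that would appear in the estimated covariance and the estimated variances cancel, which is why they are not shown.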