A correlation coefficient is a measure of the statistical relationship (correlation) between two variables. It is a dimensionless value that ranges between -1 and +1, where ±1 indicates the strongest possible correlation between a pair of variables and 0 indicates no correlation.
There are a number of different types of correlation coefficients. One of the most commonly used measures the strength of a linear relationship between two variables. It is known as the Pearson correlation coefficient and is denoted r (hence it is also called Pearson's r). Generally, whenever the term "correlation coefficient" is used without specifying the type, this is the coefficient being referenced. It is calculated using different formulas depending on whether the collected data represents a population or a sample.
For a population:

ρ = cov(X, Y) / (σX σY)

where ρ is the Pearson correlation coefficient for a population, σX is the standard deviation of X, σY is the standard deviation of Y, and cov(X, Y) is the covariance of X and Y: cov(X, Y) = E[(X - μX)(Y - μY)]. Therefore, the formula above can also be written as:

ρ = E[(X - μX)(Y - μY)] / (σX σY)

where E is the expectation, and μX and μY are the means of X and Y.
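As a quick sketch of the population formula, the code below computes ρ directly from the covariance and standard deviations, using hypothetical data treated as a complete population (the data values are made up for illustration):

```python
import numpy as np

# Hypothetical paired data, treated here as an entire population.
X = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
Y = np.array([1.0, 3.0, 7.0, 9.0, 15.0])

# Population covariance: E[(X - muX)(Y - muY)].
mu_x, mu_y = X.mean(), Y.mean()
cov_xy = np.mean((X - mu_x) * (Y - mu_y))

# Population standard deviations (ddof=0, i.e. divide by n).
sigma_x = X.std()
sigma_y = Y.std()

rho = cov_xy / (sigma_x * sigma_y)
print(rho)
```

Note that because the 1/n factors in the covariance and the standard deviations cancel, the result matches what `np.corrcoef` returns for the same data.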
For a sample:

r = Σ(xi - x̄)(yi - ȳ) / √( Σ(xi - x̄)² · Σ(yi - ȳ)² )

where the sums run over i = 1 to n, n is the sample size, xi and yi are the individual sample points indexed by i, and x̄ and ȳ are the sample means of x and y.
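The sample formula translates directly into code. The following sketch implements it from scratch; the height and weight values are hypothetical, chosen only to illustrate the calculation:

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation from the summation formula."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator: sum of products of deviations from the means.
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: square root of the product of sums of squared deviations.
    den = math.sqrt(sum((x - x_bar) ** 2 for x in xs) *
                    sum((y - y_bar) ** 2 for y in ys))
    return num / den

heights = [150, 160, 165, 170, 180]   # hypothetical sample data
weights = [52, 60, 63, 70, 80]
print(pearson_r(heights, weights))
```

A sanity check on the implementation: perfectly linear data with a positive slope gives r = 1, and perfectly linear data with a negative slope gives r = -1.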
Practical meaning of r
It is important to understand what the value of the correlation coefficient really tells us, and what it doesn't tell us. For example, correlation and causation are not the same thing. While a correlation between two variables might mean that one of the variables causes the other, no matter how strong the correlation, a correlation coefficient alone cannot prove that one of the variables directly affects the other. All a strong correlation between two variables tells us is that they tend to vary together: as one increases, the other tends to increase (positive r) or decrease (negative r).
The value of r also does not represent some kind of proportion or percentage of a perfect relationship. Given that r = 0.8 for a set of height and weight data, the data cannot be interpreted as representing 80% of a perfect relationship.
Experimentation, rather than correlation alone, is what can determine whether a strong correlation indicates a cause-and-effect relationship. For example, before the effects of smoking were better known, we could not have said that smoking causes lung cancer if we were only given that there was a strong correlation between the two. Further experimentation needed to be done to confirm that smoking does indeed cause lung cancer.
Outliers are extreme values that can have a potentially misleading impact on a summary of data. Consider a scatter plot in which all of the pairs trend closely around a line of best fit, except for one pair that differs significantly from the rest: that pair is an outlier. There are mathematical methods for determining outliers, but in cases like this one, the outlier can be identified simply by inspecting the plot.
There are a number of ways to account for outliers, one of which is simply having more data: the more data there is, the less likely it is that an outlier will skew the results to any significant degree. Unless there is good reason to discard an outlier, however (such as realizing that a mistake was made when collecting data for those points), the r value should be reported both with and without the outlier(s).
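The effect described above can be demonstrated numerically. The sketch below uses hypothetical data with a strong linear trend, then adds a single outlier pair and reports r both with and without it (all values are made up for illustration):

```python
import numpy as np

# Hypothetical data following a clear linear trend.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.0, 16.2])

# The same data with one outlier pair appended.
x_out = np.append(x, 2.0)
y_out = np.append(y, 30.0)

r_clean = np.corrcoef(x, y)[0, 1]
r_with = np.corrcoef(x_out, y_out)[0, 1]

# Report r both with and without the outlier.
print("without outlier:", r_clean)
print("with outlier:   ", r_with)
```

Even though only one of nine pairs is extreme, it pulls r down substantially, which is exactly why reporting both values gives a more honest picture of the data.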