Residual

In statistics, models are often constructed based on experimental data in order to analyze and make predictions about the data. A residual is the difference between the observed value of a quantity and its predicted value, which helps determine how close the model is relative to the real world quantity being studied. The smaller the residual, the more accurate the model, while a large residual may indicate that the model is not appropriate. The figure below shows residuals for a simple linear regression:



The line of best fit, shown in blue, is a model of the heights of a sample of boys of different ages. The residuals are represented by the dotted red lines between each value and the line of best fit. Points that lie above the line of best fit have positive residuals; points that lie below have negative residuals; points that lie on the line have residuals of 0. Note that the sum of all the residuals should, by definition, be 0.

Residual for a simple linear regression

A simple linear regression model is represented by the equation

where x is the independent variable, is the dependent variable, is the y-intercept, and is the slope of the line.

Given that n values are collected for an experiment, the residual, e, is

Example

Let be the line of best fit for the following data:


x y
5 3
8 5
11 8
12 12
16 14

Find the residuals using the equation:

.

can be found by plugging each x value into :

Once is computed for each x value, the residual is computed by subtracting the y value from . The results are shown in the following table.


2.609 -0.391
5.825 0.825
9.041 1.041
10.113 -1.870
14.401 0.401

The regression line for the data is shown in the figure below.



The residuals are relatively small in magnitude, so the observed values are relatively close to the regression line, as shown in the figure. Thus, the model appears to be a good fit for the data. At the very least, the data seems to have a linear relationship.

Residual plots

A residual plot is a type of scatter plot that is used to determine whether a model is a good fit for the data. The horizontal axis of a residual plot represents the independent variable while the vertical axis represents the residual values.

The table below contains a set of data points and their respective residuals given by the regression line .


x y
2 4 5.466 1.466
5 9 8.613 -0.387
7 12 10.711 -1.289
12 15 15.956 0.956
13 19 17.005 -1.995
18 21 22.250 1.250

The resulting residual plot is shown in the figure below:



Generally, a residual plot that shows points that are evenly and randomly scattered above and below the line means that the model is likely a good representation of the data. On the other hand, if there is some sort of noticeable trend, such as the points trending downward or upwards, a non-linear model may better represent the data. Also, if the residuals are scattered relatively far away from the x-axis, this is also an indication that the model may not be a good fit.

In the residual plot, the points are evenly distributed above and below the x-axis with no real discernible trends. They are also close to the x-axis relative to the magnitudes of the dependent variable, so a linear model seems to be good fit for the data.