Histogram

A histogram is a type of chart used to represent the frequency distribution of a set of data. The width of each bar in a histogram represents what is referred to as a "bin" or "bucket," while the height of the bar tells us how many values in the data set fall within that bin.

A bin is an interval into which a given set of data is divided. Given a set of data with values that range from 1-100, we could create bins in intervals of 20 such that the first bin would contain values from 1-20, the second 21-40, the third 41-60, and so on through 100. Bins are usually equal in size, but in some cases, it can be beneficial to vary bin widths.
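The bin assignment described above can be sketched in Python. This is a minimal illustration, assuming integer values from 1 to 100 and equal bins of width 20; the function name `bin_label` and its parameters are hypothetical, not part of any library.

```python
def bin_label(value, low=1, width=20):
    """Return the label of the integer bin that contains `value`."""
    index = (value - low) // width      # 0 for the first bin, 1 for the second, ...
    start = low + index * width         # smallest value in that bin
    return f"{start}-{start + width - 1}"


for v in (1, 20, 21, 60, 100):
    print(v, "falls in bin", bin_label(v))
```

Changing `width` gives a different set of equal-sized bins over the same range.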

A histogram displays the number of values in a given data set that fall within each bin, providing a representation of the distribution of the data.

Example

The histogram below is a representation of the table on its left. The bins are in intervals of 20 from 1 to 100. The count, or frequency, is how many numbers in our assumed data set fall into each bin. From the histogram, we can see that most of our values fall within the 41-60 bin and that the counts taper off on either side of it. The shape of the histogram here is referred to as a symmetric, unimodal distribution.

Bin         Count
1 to 20     12
21 to 40    30
41 to 60    65
61 to 80    35
81 to 100   15
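
As a rough text rendering of these counts, the sketch below draws one `#` for roughly every five values in a bin. The function name `text_histogram` and the `scale` parameter are illustrative assumptions, not a standard API.

```python
# Bin counts copied from the table above.
counts = {
    "1 to 20": 12,
    "21 to 40": 30,
    "41 to 60": 65,
    "61 to 80": 35,
    "81 to 100": 15,
}


def text_histogram(counts, scale=5):
    """Return one text line per bin, with one '#' per `scale` values (rounded)."""
    lines = []
    for label, count in counts.items():
        bar = "#" * round(count / scale)
        lines.append(f"{label:>9} | {bar} {count}")
    return lines


for line in text_histogram(counts):
    print(line)
```

The longest row belongs to the 41 to 60 bin, with shorter rows on either side of it.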
 

Histograms are useful for showing general characteristics of a distribution of data. For example, it is possible to determine at a glance whether a distribution is symmetric or skewed, where the peaks are located, and whether there are any outliers.

Histogram distributions

Histograms can be used to determine what type of distribution a given set of data has, as shown in the figure below. The terms unimodal, bimodal, and multimodal refer to the number of modes in the distribution. In a histogram, the modes are the peaks, which are also referred to as local maxima.


[Figure: six example histograms, labeled symmetric unimodal distribution, skewed right distribution, skewed left distribution, bimodal distribution, multimodal distribution, and symmetric distribution]
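
The idea of modes as local maxima can be sketched in code. The example below is a simplified illustration: it counts only strict peaks (a bin taller than both neighbors), so it does not handle flat-topped peaks, and noisy data may produce small peaks that a reader would not consider real modes. The function names and the second and third lists of counts are hypothetical.

```python
def count_modes(counts):
    """Count bins that are strictly taller than both of their neighbors."""
    padded = [0] + list(counts) + [0]   # treat the area outside the histogram as empty
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i - 1] < padded[i] > padded[i + 1]
    )


def describe_modality(counts):
    """Name the distribution by its number of strict peaks."""
    n = count_modes(counts)
    if n == 0:
        return "no strict peak"
    if n == 1:
        return "unimodal"
    if n == 2:
        return "bimodal"
    return "multimodal"


print(describe_modality([12, 30, 65, 35, 15]))        # counts from the first example
print(describe_modality([5, 20, 8, 3, 9, 22, 6]))     # hypothetical counts
print(describe_modality([3, 9, 2, 8, 1, 7, 2]))       # hypothetical counts
```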

In some cases, it can be beneficial to experiment with various bin widths when constructing a histogram. Smaller bins may reveal features that larger bins would mask, while larger bins can help reduce noise in certain portions of the distribution.
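
To illustrate the effect of bin width, the sketch below tallies the same values twice, once with bins of width 10 and once with bins of width 5. The data set is made up for this illustration, and `bin_counts` is a hypothetical helper that assumes non-negative integers and half-open bins starting at 0.

```python
def bin_counts(values, width, start=0):
    """Count values in half-open bins [start, start + width), [start + width, ...)."""
    n_bins = (max(values) - start) // width + 1
    counts = [0] * n_bins
    for v in values:
        counts[(v - start) // width] += 1
    return counts


# Hypothetical data with two clusters, one near 3 and one near 13.
values = [2, 3, 3, 4, 4, 4, 5, 9, 12, 13, 13, 14, 14, 14, 15]

print("width 10:", bin_counts(values, 10))
print("width 5: ", bin_counts(values, 5))
```

With the wider bins, the two counts are nearly equal and the distribution looks flat; with the narrower bins, two separate peaks appear.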

How to construct a histogram

A large part of constructing a histogram involves selecting appropriate bin sizes. There are no universally accepted rules for selecting bin sizes since the most appropriate bin size is highly dependent upon the data, and different bin sizes can reveal different characteristics of the data. For example, using a wider bin size in lower density portions of the distribution can reduce noise, while using narrower bins in higher density portions provides greater precision, meaning that it can be beneficial to vary bin width throughout a histogram.
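
A tally with unequal bin widths can be sketched with Python's standard `bisect` module. The bin edges and values below are hypothetical: wide bins are used at the sparse ends and narrow bins in the dense middle. The helper name `count_in_bins` is an assumption for this example.

```python
from bisect import bisect_right


def count_in_bins(values, edges):
    """Tally values into half-open bins [edges[i], edges[i + 1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        if edges[0] <= v < edges[-1]:                    # ignore values outside all bins
            counts[bisect_right(edges, v) - 1] += 1      # index of the bin containing v
    return counts


# Hypothetical edges: width 20 at the ends, width 5 in the middle.
edges = [0, 20, 40, 45, 50, 55, 60, 80, 100]
values = [5, 31, 41, 42, 44, 46, 47, 48, 49, 51, 52, 53, 56, 58, 63, 92]

for lo, hi, c in zip(edges, edges[1:], count_in_bins(values, edges)):
    print(f"[{lo}, {hi}): {c}")
```

Because the bins differ in width, the raw count of a wide bin is not directly comparable to that of a narrow bin.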

There are a number of formulas for determining the "optimal" number of bins, but depending on the distribution of the data and the goals of the analysis, they are not always appropriate. As such, it is best to experiment with different numbers of bins and bin widths for a given set of data and select those that are most appropriate for the situation. Once the number of bins has been selected, the following steps can be used as a guideline for constructing a histogram.

Example

Construct a histogram given the following data set:

{1, 5, 6, 8, 10, 11, 12, 12, 13, 14, 14, 15, 16, 18, 18, 19, 20, 22, 23, 23, 24, 25, 25, 27, 27, 27, 28, 29, 30, 30, 31, 31, 31, 32, 32, 34, 34, 34, 35, 35, 35, 41, 42, 43, 45, 45, 45, 49}

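For this data set, we will use 10 bins. The width of each bin is the range of the data (the largest value minus the smallest) divided by the number of bins:

$$\text{bin width} = \frac{\max - \min}{\text{number of bins}} = \frac{49 - 1}{10} = 4.8$$

The ceiling of 4.8 is 5, so each bin has a width of 5. The following table shows the 10 bins and the number of values within each bin.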


Bin     Frequency
1-6     2
6-11    3
11-16   7
16-21   5
21-26   6
26-31   7
31-36   11
36-41   0
41-46   6
46-51   1

Note that the bin intervals are denoted as [1, 6), [6, 11), [11, 16), and so on; the values at the end of each interval are not tallied twice. Thus, 6 is tallied in the 6-11 bin but not the 1-6 bin. Given the above, the resulting histogram is:
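
The tally for this data set can be sketched in Python, assuming the half-open intervals described above and a bin width equal to the ceiling of the range divided by the number of bins. The function name `frequency_table` is illustrative, and the row of `#` marks printed for each bin is only a rough text rendering of the histogram.

```python
import math

data = [1, 5, 6, 8, 10, 11, 12, 12, 13, 14, 14, 15, 16, 18, 18, 19, 20,
        22, 23, 23, 24, 25, 25, 27, 27, 27, 28, 29, 30, 30, 31, 31, 31,
        32, 32, 34, 34, 34, 35, 35, 35, 41, 42, 43, 45, 45, 45, 49]


def frequency_table(values, n_bins):
    """Return the bin width and a list of (start, end, count) rows."""
    low, high = min(values), max(values)
    width = math.ceil((high - low) / n_bins)
    counts = [0] * n_bins
    for v in values:
        # Each bin is [start, end); if the maximum lands exactly on the
        # final edge, it is counted in the last bin.
        index = min((v - low) // width, n_bins - 1)
        counts[index] += 1
    rows = [(low + i * width, low + (i + 1) * width, c)
            for i, c in enumerate(counts)]
    return width, rows


width, rows = frequency_table(data, 10)
print("bin width:", width)
for start, end, count in rows:
    print(f"{start:>2}-{end:<2} | {'#' * count} {count}")
```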


Histogram vs. bar graph

Histograms are sometimes mistaken for bar graphs since they can look very similar. The figure below shows an example of a histogram (left) and a bar graph (right).

Although a histogram looks like a bar graph, the two are not the same and convey different information. A histogram is used for continuous data, while a bar graph compares categorical data (data that takes on one of a limited number of possible values). Furthermore, the rectangles in a histogram are always adjacent (this may not be apparent if a bin is empty). Although bar graphs are often drawn in the same way, it can be helpful to leave spaces between the bars of a bar graph to make it clear that it is not a histogram.

Also, unlike a bar graph, where the width of a bar does not have any special meaning, the width of a bar in a histogram corresponds to the size of its bin. In the example above, the range of weights from 0 pounds to 165 pounds is divided into equally sized bins of 10 pounds each.
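
The difference also shows up in how the bar heights are computed. In the sketch below, which uses made-up data, the bar graph counts occurrences of each category, while the histogram first groups numeric values into intervals of a chosen width. The helper `histogram_counts` and both data sets are hypothetical.

```python
from collections import Counter

# Hypothetical categorical data: one bar per distinct category.
pets = ["cat", "dog", "dog", "fish", "cat", "dog"]
bar_heights = Counter(pets)

# Hypothetical continuous data: weights in pounds.
weights = [112.5, 118.0, 121.3, 129.9, 130.0, 134.2, 147.8]


def histogram_counts(values, start, width):
    """Count values in half-open bins [start, start + width), and so on."""
    counts = Counter(int((v - start) // width) for v in values)
    return {
        (start + i * width, start + (i + 1) * width): counts.get(i, 0)
        for i in range(max(counts) + 1)
    }


print(dict(bar_heights))
print(histogram_counts(weights, start=110, width=10))
```

Reordering the categories in the bar graph changes nothing about its meaning, whereas the bins of the histogram have a fixed numeric order and a width that is part of the data.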