Sampling

Sampling is a technique used in statistics to select a subset (sample) of individuals from a statistical population in order to make inferences about the characteristics of the population as a whole.

Sampling is beneficial because in many cases, it is not possible or practical to study an entire population, since it may be too difficult or too expensive to do so. For example, if a university has 40,000 students, it would not be practical to compile height data on every single student in order to draw conclusions about the height characteristics of the student population. Instead, a researcher may collect data from a random sample of students to estimate characteristics of the student body.

In order for any inferences or estimates made about the population to be valid, it is important for the sample to be representative of the population. Various sampling methods can be used, some of which yield samples that represent a population more effectively than others. Sampling methods can generally be categorized as probability sampling and nonprobability sampling.

Probability sampling

Probability sampling is a sampling method in which each element of the population has a determinable non-zero chance of being selected in a random sample; this makes it possible to weight characteristics of selected elements based on their probability of being selected. There are several types of probability sampling, all of which involve random selection from a population whose elements have determinable non-zero probabilities.
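As a rough illustration of the weighting idea, the sketch below applies inverse-probability weighting (often called the Horvitz-Thompson estimator) to estimate a population total: each sampled value is divided by its probability of selection. The function name and the example numbers are hypothetical and are not taken from the text above.

```python
def weighted_total(values, selection_probs):
    """Estimate a population total from a probability sample.

    Each observed value is divided by its probability of selection, so
    elements that were less likely to be chosen count for more.
    """
    if len(values) != len(selection_probs):
        raise ValueError("values and selection_probs must have the same length")
    if any(p <= 0 or p > 1 for p in selection_probs):
        raise ValueError("selection probabilities must be in (0, 1]")
    return sum(v / p for v, p in zip(values, selection_probs))


if __name__ == "__main__":
    # Hypothetical data: 10 people sampled, each with selection probability
    # 0.1; a value of 1 means the person plays a sport, 0 means they do not.
    plays_sport = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
    probs = [0.1] * len(plays_sport)
    print(weighted_total(plays_sport, probs))  # estimated number in the population
```

The estimate is only meaningful when every selection probability is known and non-zero, which is the defining property of probability sampling.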

Simple random sampling

Simple random sampling (SRS) is a sampling method in which each element of a population has an equal probability of being selected. It is an unbiased sampling method that can also be used as part of other more complex sampling methods.

Advantages of SRS:

SRS is relatively simple, and is often cheaper to implement than other more complex sampling methods.
SRS minimizes bias since each element has an equal probability of selection.

Disadvantages of SRS:

To use SRS, a list of the entire population is necessary, and it must be possible to determine the probabilities of selection of each element in the population. Depending on the size of the population, this can be tedious or impractical.
SRS can result in sampling error because the randomness of the process can cause some parts of the population to be over- or under-represented in the sample.

With regard to the second point, consider the example of a bag of fruit that contains 25 apples and 25 oranges. A random sample of the bag of fruit should, on average, yield an equal number of apples and oranges. However, any given sample may include many more apples than oranges, or vice versa, skewing the representation of the population.
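The fruit example can be simulated with a short sketch. It assumes samples of 10 fruit drawn without replacement using Python's standard random module; the sample size, the number of repetitions, and the function name are illustrative choices rather than part of the example above.

```python
import random


def apple_counts(num_samples, sample_size, seed=0):
    """Draw repeated simple random samples from a bag of 25 apples and
    25 oranges and return the number of apples in each sample."""
    rng = random.Random(seed)
    bag = ["apple"] * 25 + ["orange"] * 25
    # rng.sample draws without replacement, so every fruit in the bag has
    # the same chance of ending up in a given sample.
    return [rng.sample(bag, sample_size).count("apple")
            for _ in range(num_samples)]


if __name__ == "__main__":
    counts = apple_counts(num_samples=1000, sample_size=10)
    print("average apples per sample:", sum(counts) / len(counts))
    print("fewest apples in a sample:", min(counts))
    print("most apples in a sample:", max(counts))
```

The average should land close to 5 apples per sample, while individual samples can stray well away from an even split, which is the sampling error described above.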

Stratified sampling

Stratified sampling is a sampling method in which a population is divided into distinct categories, or "strata." Each stratum can then be sampled as a subpopulation (for example, using SRS) in proportion to the subpopulation's representation within the population as a whole. For example, if a movie theater found that 40% of the population of moviegoers on a given day were male and 60% were female, this would need to be reflected in a sample of the population: if 250 people went to the movies, then 100 were male and 150 were female. A proportionate random sample of 30 moviegoers would therefore include 12 males and 18 females, preserving the 40/60 split.
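A minimal sketch of proportionate stratified sampling for the movie theater example follows. It assumes SRS within each stratum and a total sample of 30 people, which matches the 12 and 18 in the example; the function name and the placeholder member labels are hypothetical.

```python
import random


def proportional_stratified_sample(strata, total_sample_size, seed=None):
    """Sample each stratum with SRS, in proportion to its share of the
    population. `strata` maps a stratum name to a list of its members."""
    rng = random.Random(seed)
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Rounding may make the allocations differ slightly from
        # total_sample_size when the shares are not whole numbers.
        k = round(total_sample_size * len(members) / population_size)
        sample[name] = rng.sample(members, k)
    return sample


if __name__ == "__main__":
    moviegoers = {
        "male": [f"M{i}" for i in range(100)],
        "female": [f"F{i}" for i in range(150)],
    }
    chosen = proportional_stratified_sample(moviegoers, 30, seed=1)
    print({name: len(members) for name, members in chosen.items()})
```

With 100 males and 150 females, the allocation works out to 30 × 100/250 = 12 males and 30 × 150/250 = 18 females.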

Advantages of stratified sampling:

Because the entire population is stratified prior to sampling, each subpopulation will have proper representation; this is unlike methods such as SRS, where certain groups may be over- or under-represented due to the randomness of the selection process.
It allows researchers to stratify populations in ways that are relevant to the study.
Different sampling methods can be used for different subpopulations.

Disadvantages of stratified sampling:

Every element of a population must be categorized into a single stratum for stratified sampling to be used. This is not always possible.
It can be far more complicated than simple random sampling, and can be significantly more costly.

Systematic sampling

Systematic sampling (also referred to as interval sampling) involves creating an ordered list of every element in the population, randomly selecting a starting element, then selecting every nth element after it. The fixed gap n between selections is referred to as the sampling interval. For example, if the study population is the set of integers from 1 to 100, the randomly selected starting element is 22, and the sampling interval is 5, the following set represents the sample acquired through systematic sampling:

{22, 27, 32, 37, 42, 47, ...}

If the end of the list is reached but there are not enough values for the desired sample size, the count loops back to the beginning. For example, the element that would be selected after 97 in the example above is 2.
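The selection rule, including the wrap-around at the end of the list, can be sketched as follows. The function name and its parameters are illustrative; the starting position is passed as a zero-based index, so the element 22 sits at index 21.

```python
import random


def systematic_sample(population, interval, sample_size,
                      start_index=None, seed=None):
    """Select every `interval`-th element of an ordered list, looping back
    to the beginning when the end of the list is passed."""
    n = len(population)
    if start_index is None:
        # Random start, as the method requires.
        start_index = random.Random(seed).randrange(n)
    # The modulo implements the wrap-around. Note that elements start to
    # repeat once the positions cycle, for example after 20 selections
    # when n = 100 and interval = 5.
    return [population[(start_index + i * interval) % n]
            for i in range(sample_size)]


if __name__ == "__main__":
    numbers = list(range(1, 101))
    # Start at element 22 (index 21) with a sampling interval of 5.
    print(systematic_sample(numbers, interval=5, sample_size=17,
                            start_index=21))
```

With these inputs the sixteenth selected element is 97 and the seventeenth is 2, matching the wrap-around described above.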

Advantages of systematic sampling:

It is simple and convenient because, aside from the random selection of the first element, the selection process is predetermined.
It provides a sample with a structured distribution of elements. This gives the sample a good chance of being representative of the population unless some random characteristic disproportionately exists within every nth element, which is unlikely.
Eliminates the possibility of the sample having many elements that are very close together, which can misrepresent the population.

Disadvantages of systematic sampling:

The size of the population must be known or it must be possible to reasonably approximate the population, which may not always be possible.
Systematic sampling is highly susceptible to bias when the population exhibits periodic characteristics; if the sampling interval coincides with the period, the sample could over-emphasize those characteristics, resulting in a misrepresentation of the population.
There is a greater risk of data manipulation since researchers could construct the experiment to yield a desired outcome; this manipulation may not always be apparent to outside eyes.

Cluster sampling

Cluster sampling is a sampling method that is used when a population can be divided into groups (referred to as clusters) that are relatively homogeneous when compared with one another, but heterogeneous internally, so that each cluster resembles a small-scale version of the population. It is often cheaper or more practical to use cluster sampling than it is to use other sampling methods.

For example, if a researcher wanted to study the percentage of 15-18 year olds who participate in sports in the city of Chicago, it would be tremendously expensive and impractical, if not impossible, to collect data for every 15-18 year old in the city. In this case, researchers may use high schools in Chicago to represent clusters of 15-18 year olds. They could then acquire a random sample by randomly selecting high schools in Chicago (clusters), then collecting data from the selected clusters.

The above is an example of single-stage cluster sampling because data from every element within selected clusters is collected. Multi-stage cluster sampling further reduces the size of the sampling units; rather than collecting data from every 15-18 year old in each selected school, a few classes in the school may be selected as clusters; data would then be collected from just these classes. This further reduces the amount of data that needs to be collected, but each time a sample is divided into clusters, the risk of the data being less representative of the overall population increases.
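The difference between the two approaches can be sketched as below. The nested dictionary of schools, classes, and students is made-up illustrative data, and the function names are hypothetical; both stages use simple random selection.

```python
import random


def single_stage_cluster_sample(schools, num_schools, rng):
    """Randomly select schools, then keep every student in them."""
    chosen = rng.sample(sorted(schools), num_schools)
    return [student
            for school in chosen
            for students in schools[school].values()
            for student in students]


def two_stage_cluster_sample(schools, num_schools, classes_per_school, rng):
    """Randomly select schools, then randomly select classes within each
    selected school, and keep every student in those classes."""
    chosen = rng.sample(sorted(schools), num_schools)
    sample = []
    for school in chosen:
        classes = rng.sample(sorted(schools[school]), classes_per_school)
        for class_name in classes:
            sample.extend(schools[school][class_name])
    return sample


if __name__ == "__main__":
    # Made-up population: 10 schools, 4 classes each, 5 students per class.
    schools = {
        f"school{s}": {
            f"class{c}": [f"s{s}c{c}n{i}" for i in range(5)]
            for c in range(4)
        }
        for s in range(10)
    }
    rng = random.Random(0)
    print(len(single_stage_cluster_sample(schools, 3, rng)))  # 3 * 4 * 5 students
    print(len(two_stage_cluster_sample(schools, 3, 2, rng)))  # 3 * 2 * 5 students
```

The second stage halves the amount of data collected in this made-up setup, at the cost of relying on fewer classes to represent each school.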

Advantages of cluster sampling:

Can be cheaper than other sampling methods since it can result in less data needing to be collected. It can also reduce transport costs due to data only needing to be collected from selected clusters.
Can be used for larger populations as long as the populations can be effectively divided into clusters.

Disadvantages of cluster sampling:

Tends to have higher sampling error than other methods. Cluster sampling is ideal when clusters are similar to one another but internally heterogeneous; when that does not hold, results can be skewed. For example, when schools are being used as clusters, an all-girls or all-boys school may skew results when the study is interested in results from both sexes.
Cluster sampling is more complex to perform than other sampling methods like simple random sampling. It requires significantly more planning, and the data is typically more difficult to analyze.

Cluster sampling vs. stratified sampling

Cluster sampling and stratified sampling are similar in that they divide populations into groups, but are otherwise quite different.

Cluster sampling:

The groups are referred to as clusters.
Typically involves natural groupings such as school districts, city blocks, voting districts, and so on, each of which can contain very heterogeneous populations.
The sampling units are the actual clusters, so data will only be collected from sampled clusters, not from each cluster.

Stratified sampling:

The groups are referred to as strata.
Involves groups that are divided based on some shared characteristic of the individuals in the population.
The sampling units are the elements within each stratum, so data must be collected from every stratum.

Nonprobability sampling

Nonprobability sampling is a type of sampling that is commonly used for qualitative research. Unlike probability sampling, it is not possible to determine the probability of acquiring a given sample. Thus, it is generally not possible to make statistically valid inferences about the population from a nonprobability sample, and it is typically not useful for quantitative statistical research. There are a number of nonprobability sampling methods, including convenience sampling, judgmental sampling, quota sampling, and snowball sampling.

Convenience sampling

Convenience sampling (also called grab sampling, accidental sampling, or opportunity sampling) draws a sample from a part of the population based on characteristics that make individuals convenient study participants. For example, individuals who make convenient participants may be located close to the researchers (they could be people known to the researchers) or simply have ample time. The only criterion for a convenience sample is that the individual is able and willing to participate in the study. As a result, convenience sampling can often result in extreme bias.

Judgmental sampling

Judgmental sampling, also referred to as selective sampling, is a type of sampling in which the researcher targets particular members in the population based on certain criteria they believe will result in a sample that is appropriate for the study. Judgmental sampling can result in heavy bias based on the judgment or goals of the researcher.

Quota sampling

Quota sampling involves first dividing the population into mutually exclusive groups, then selecting individuals from each group based on some specified proportion. For example, a researcher may decide to select a sample of 50 men and 75 women between the ages of 25 and 60 from a population of gym goers. This selection is not random, so researchers may select specific men and women based on whether they seem approachable, or based on some other characteristics. This results in bias because people who do not meet the researcher's implicit criteria may not have a chance of selection.

Snowball sampling

Snowball sampling is a sampling method in which existing participants are asked to recruit new participants to the study. This method is useful when populations may be difficult for researchers to find or access, such as underage smokers, drug users, etc. Thus, when individuals who meet the criteria are found, researchers rely on these individuals to recruit others who meet the study criteria.