To understand how spread out a set of data points can be, we need to measure the variation. The simplest measure of variation is the range. The range is simply the difference between the largest and smallest values in a set of observations. In other words, you take the maximum value and subtract the minimum value from it.
For example, let’s consider our two small datasets again:
1. The first dataset: 5, 5, 5, 5, 5
2. The second dataset: 1, 3, 5, 8, 8
To calculate the range for each dataset:
First dataset: The maximum value is 5, and the minimum value is also 5. So, the range is 5 – 5 = 0.
Second dataset: The maximum value is 8, and the minimum value is 1. So, the range is 8 – 1 = 7.
From these calculations, we see that the first dataset has a range of 0, indicating no variability since all values are the same. The second dataset has a range of 7, showing more variability because the values are more spread out.
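The same calculation can be sketched in a few lines of Python. The helper name `value_range` is our own choice for illustration and does not come from any library.

```python
def value_range(values):
    """Return the range (maximum minus minimum) of a non-empty sequence."""
    return max(values) - min(values)


first = [5, 5, 5, 5, 5]
second = [1, 3, 5, 8, 8]

print(value_range(first))   # 0
print(value_range(second))  # 7
```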
Range of Income
Let’s consider an example of income distribution among people in India based on their age. In this dataset, the median income is ₹23,000, while the mean income is ₹37,000.
Additionally, we have the minimum and maximum values of the dataset, which are ₹0 and ₹1.25 lakh (₹125,000), respectively. When we calculate the range by subtracting the minimum value from the maximum value, we get:
• Range: ₹1.25 lakh – ₹0 = ₹1.25 lakh
A range of ₹1.25 lakh indicates considerable variability in the income data: it is more than three times the mean income of ₹37,000 and more than five times the median income of ₹23,000.
It highlights the wide gap between the lowest and highest incomes in the dataset.
There appears to be a lot of variability in this dataset, but the problem is that the range may be determined by a single person. This is because the range is sensitive to outliers.
For instance, suppose just one person in our dataset earns ₹1.25 lakh, while everyone else earns close to the mean, median, or mode. The range would then overstate the spread, because it is not robust: it is sensitive to extreme values. So while the range gives us an idea of the overall spread, it is important to be aware of its limitations and of the potential impact of outliers on this measure.
Disadvantage of Range
One issue with using the range as a measure of variability is that it only considers the two endpoints of the dataset—the maximum and minimum values. This means it neglects the data within the range, potentially giving a misleading sense of the dataset’s overall spread.
For example, let’s take two datasets:
1. Dataset A: 1, 2, 2, 5
2. Dataset B: 1, 5, 5, 5
For both datasets, the range is calculated as follows:
1. Range for Dataset A: 5 – 1 = 4
2. Range for Dataset B: 5 – 1 = 4
Despite having the same range, these two datasets are distributed very differently between their endpoints. In Dataset A, three of the four values (1, 2, 2) sit near the minimum, and only the 5 lies far from them. In Dataset B, three of the four values equal the maximum of 5, and only the 1 lies far from them.
The range fails to capture this difference because it only considers the maximum and minimum values and ignores the distribution of the data in between. Thus, while the range can provide a quick sense of the spread, it doesn’t give a complete picture of the dataset’s variability.
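As a rough check, the sketch below computes the range of both datasets and, for contrast, their medians. The median is used here only as a convenient summary of where the interior values sit; the variable names are illustrative.

```python
from statistics import median

dataset_a = [1, 2, 2, 5]
dataset_b = [1, 5, 5, 5]

# The range looks only at the two endpoints, so it is identical for both.
range_a = max(dataset_a) - min(dataset_a)
range_b = max(dataset_b) - min(dataset_b)
print(range_a, range_b)  # 4 4

# The medians differ, showing that the values in between are arranged differently.
median_a = median(dataset_a)
median_b = median(dataset_b)
print(median_a, median_b)  # 2.0 5.0
```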
Another issue with using the range as a measure of variability is its sensitivity to outliers or extreme values. For example, let’s consider a dataset: 1, 2, 3, 3, 3. Here, the range is calculated as the difference between the maximum and minimum values:
• Range: 3 – 1 = 2
Now, let’s modify one of the values to an extreme outlier: 1, 2, 3, 3, 95
In this modified dataset, the range becomes:
• Range: 95 – 1 = 94
By changing just one observation from 3 to 95, the range jumps dramatically from 2 to 94. This substantial change in the range results from just one extreme value, illustrating how sensitive the range is to outliers. Even though the majority of the data points remain the same, the presence of an outlier skews the range significantly.
Thus, the range, while easy to calculate and understand, can be misleading when there are outliers in the dataset. It does not provide a reliable measure of variability because it is overly influenced by extreme values.
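The jump can be reproduced with a short Python sketch using the two five-value datasets from this example. The variable names are our own.

```python
original = [1, 2, 3, 3, 3]
with_outlier = [1, 2, 3, 3, 95]  # one value changed from 3 to 95

range_original = max(original) - min(original)
range_with_outlier = max(with_outlier) - min(with_outlier)

print(range_original)      # 2
print(range_with_outlier)  # 94
```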
Interquartile Range
We need a measure that is more robust, that is, less sensitive to outliers. One such measure of variation is the interquartile range (IQR). The IQR is simply the difference between the third quartile (Q3) and the first quartile (Q1).
To calculate the IQR, you follow these steps:
1. Organize your data points from lowest to highest.
2. Calculate the percentiles.
3. Identify the 25th percentile (Q1) and the 75th percentile (Q3).
4. Subtract Q1 from Q3.
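The steps above can be sketched in Python with the standard library's `statistics.quantiles`. Note that there are several conventions for computing quartiles, and different tools can give slightly different values for Q1 and Q3. The choice of `method="inclusive"` here is an assumption made for illustration, and the function name `interquartile_range` is our own.

```python
from statistics import quantiles


def interquartile_range(values):
    """Return Q3 - Q1 using the 'inclusive' quartile convention (an assumption)."""
    ordered = sorted(values)                                  # step 1: sort the data
    q1, _q2, q3 = quantiles(ordered, n=4, method="inclusive")  # steps 2-3: quartiles
    return q3 - q1                                            # step 4: subtract Q1 from Q3


data = [1, 3, 5, 8, 8]
print(interquartile_range(data))  # 5.0
```

With this convention, Q1 is 3 and Q3 is 8 for the sample data, so the IQR is 5.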
The IQR covers the middle 50% of the data: it excludes the top 25% and the bottom 25%, and with them many extreme or outlying values. This makes the IQR a more reliable measure of variation, as it focuses on the central portion of the dataset and is largely unaffected by the extreme values that can make the range misleading.
By using the IQR, we get a clearer picture of the true variability in the data, allowing us to understand better how spread out the middle 50% of our observations are.
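To illustrate this robustness, the sketch below compares the range and the IQR on the dataset 1, 2, 3, 3, 3 and on the same dataset with one value changed to 95. The quartile convention (`method="inclusive"`) and the helper names are assumptions for illustration, and other conventions may give slightly different quartiles.

```python
from statistics import quantiles


def value_range(values):
    """Range: maximum minus minimum."""
    return max(values) - min(values)


def interquartile_range(values):
    """IQR: Q3 minus Q1, using the 'inclusive' quartile convention."""
    q1, _q2, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


original = [1, 2, 3, 3, 3]
with_outlier = [1, 2, 3, 3, 95]

print(value_range(original), value_range(with_outlier))                  # 2 94
print(interquartile_range(original), interquartile_range(with_outlier))  # 1.0 1.0
```

Under these assumptions, the single extreme value changes the range from 2 to 94 but leaves the IQR at 1.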
Let’s consider an example using the income data. If we look at our summary statistics, the third quartile (Q3) is around ₹48,000, and the first quartile (Q1) is ₹8,000. The interquartile range (IQR) is calculated by subtracting Q1 from Q3:
• IQR: ₹48,000 – ₹8,000 = ₹40,000
As a measure of variation, the IQR of ₹40,000 is much smaller than the range, which was ₹1.25 lakh (₹125,000). This indicates that when we focus on the middle 50% of the data, the variation is much smaller.
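The comparison can be checked with simple arithmetic. The sketch below uses only the summary statistics quoted in the text; the variable names are illustrative.

```python
q1 = 8_000         # first quartile, in rupees
q3 = 48_000        # third quartile, in rupees
minimum = 0        # lowest income, in rupees
maximum = 125_000  # highest income (1.25 lakh), in rupees

income_range = maximum - minimum
iqr = q3 - q1
ratio = iqr / income_range

print(income_range)  # 125000
print(iqr)           # 40000
print(ratio)         # 0.32
```

On these figures, the IQR is about a third of the range.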
The IQR gives us a more accurate picture of the variability in the dataset by ignoring extreme values. It focuses on the central portion of the data, making it a more effective measure of spread or variation, especially when we suspect that our dataset has outliers. By using the IQR, we get a clearer understanding of the true variability among the majority of the observations, without the distortion caused by outliers.