Demo Example
Author: Campusπ


One of the ways to measure variability is to calculate the difference of each observation from the mean. We then square each of these distances, sum them, and divide by the sample size, which gives the variance. Since we square each deviation, it can be represented by a square box (a shape you may recall from elementary school geometry), as we can see in the figure. To give a sense of the amount of variability, we can look at a few examples of deviations with different variances.

Key Takeaways

  • For a dataset with less variability, the data points are not very different from each other, resulting in a smaller total area occupied by these squares. This means that the sum of the squared deviations or distances from the mean will be smaller.
  • A dataset with greater variability occupies more space when we sum up the areas of the squares. In simpler terms, for a dataset with more variability, the sum of the squared distances from the mean is higher.
  • The units of variance are in squared units, meaning if the original dataset uses miles as units, the variance would be in square miles; if it’s in inches, the variance would be in square inches. This reflects how we calculate variance by squaring the deviations from the mean.
  • Variance can be challenging to interpret directly in terms of the original unit of measurement. For instance, if our dataset is in miles, it’s more intuitive to understand variability in miles rather than miles squared.

For a dataset with less variability, the data points are not very different from each other, resulting in a smaller total area occupied by these squares. This means that the sum of the squared deviations or distances from the mean will be smaller. In the above example, the sums of the squared deviations from the mean are 9 plus 1 plus 0 plus 1 plus 36, which is 47, and 1 plus 1 plus 0 plus 1 plus 1, which equals 4. Visually, you can see that the area covered by the squares is smaller for the chart on the right side than on the left side.

Variance

When you examine a dataset with greater variability, it occupies more space when you sum up the areas of the squares. In simpler terms, for a dataset with more variability, the sum of the squared distances from the mean is higher. This sum, which represents the area covered by the squares after squaring the distances from the mean, encapsulates the extent of variation within the dataset. Conversely, a dataset with less variability would have a smaller sum of squared distances.

To quantify this variation with a single measure, we use the variance, which is calculated by dividing the sum of the squared distances from the mean by the sample size. Importantly, variance indicates the variability present in our dataset.

If we examine the dataset with less overall variability, we see that the total area covered by the boxes is smaller. In this case, the typical box representing the variance has an area of 0.8. In simpler terms, when the total variability is reduced, the variance also decreases. For this example, the total variation, calculated as the sum of squared deviations from the mean (1 + 1 + 0 + 1 + 1) divided by five, equals 0.8.

So, that’s how variance is determined for a dataset. A dataset with higher variability results in a larger typical box, as seen in this case where the variance is 9.4. The typical box representing the variance for this dataset has an area of 9.4. This calculation remains consistent: total variability divided by the number of observations gives us this value of 9.4.

The variance effectively serves as the average of the sum of squared deviations from the mean. It accurately reflects the level of variability within a dataset: a higher variance value indicates greater variability among observations, while a lower variance value indicates less variability.
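
To make this concrete, here is a minimal Python sketch of the calculation (the datasets here are purely illustrative, not the ones from the figures above):

    def variance(data):
        """Average of the squared deviations from the mean."""
        mean = sum(data) / len(data)
        squared_deviations = [(x - mean) ** 2 for x in data]   # area of each square
        return sum(squared_deviations) / len(data)             # total area / number of observations

    print(variance([2, 4, 3, 4, 2]))   # 0.8 -> little spread around the mean of 3
    print(variance([1, 3, 5, 8, 8]))   # 7.6 -> much more spread around the mean of 5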

Unit of Variance

The units of variance are in squared units, meaning if the original dataset uses miles as units, the variance would be in square miles; if it’s in inches, the variance would be in square inches. This reflects how we calculate variance by squaring the deviations from the mean.

However, this is a drawback of the variance: because it is expressed in squared units, it can be challenging to interpret directly in terms of the original unit of measurement. For instance, if our dataset is in miles, it’s more intuitive to understand variability in miles rather than miles squared.

This highlights a key issue with variance: its squared unit of measure makes interpretation less straightforward. A more practical approach is to use a measure of variability that directly reflects the original unit of measurement. For example, instead of inches squared, we would prefer a measure based on inches; similarly, instead of miles squared, we would prefer a measure based on miles. A solution is to take the square root of the variance, which is known as the standard deviation.
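
As a brief sketch (with made-up distances in miles), taking the square root brings the measure back to the original units:

    import math

    distances = [10, 12, 9, 14, 10]                # hypothetical distances in miles
    mean = sum(distances) / len(distances)         # 11.0 miles
    variance = sum((x - mean) ** 2 for x in distances) / len(distances)   # 3.2 square miles
    std_dev = math.sqrt(variance)                  # about 1.79 miles, same units as the data

    print(variance, std_dev)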

One issue with both the range and the interquartile range (IQR) is that they don’t include information from all the data points; the range only considers the two endpoints, and the IQR only considers the first and third quartiles.

We want a measure of variation that incorporates all the information from the data set. One way to achieve this is by considering the average distance of each observation from the mean. This method gives us a sense of how much our data points deviate from the mean on average.

To do this, we take each observation, subtract the mean to find the distance from the mean, sum all these distances or deviations, and then divide by the total sample size. This method provides a comprehensive measure of variability because it considers every data point, giving us a more complete picture of the data’s spread.

For example, consider a set of observations: 1, 2, 3, 4, and 5. The sample size is 5. The mean x̅ is calculated by summing all the values and dividing by the sample size, which gives us 15 ÷ 5 = 3. So, the mean is 3. To measure variation in terms of distances from the mean, we subtract the mean from each observation and then calculate the average of those distances. For the five observations 1, 2, 3, 4, and 5, subtracting the mean of 3 gives us -2, -1, 0, 1, and 2.

Instead of having five different distances from the mean, we might want an average distance from the mean to make more sense. This would provide a single number representing how far off an observation is from the mean. One way to do this is to take the average of these distances. However, there’s a problem: when you calculate the average of these distances, you end up with a value of zero.

A value of zero is meaningless as a measure of variation or spread in a data set. Clearly, there is some variation, as some points are higher, some are lower, and one point is exactly at the sample mean. Thus, it doesn’t make sense to claim there’s no variation. This simple average of the distances from the mean fails to reflect the variation in the data set.

The issue is that if you take the average of the distances from the mean, it will always equal zero. This approach is flawed because the negative distances below the mean cancel out the positive distances above the mean, and the value at the mean is zero. Essentially, these distances cancel each other out, resulting in an average of zero.
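
A quick Python check (using the 1-to-5 example above) shows the cancellation:

    data = [1, 2, 3, 4, 5]
    mean = sum(data) / len(data)              # 3.0
    deviations = [x - mean for x in data]     # [-2.0, -1.0, 0.0, 1.0, 2.0]

    # The negative and positive deviations cancel, so the average is always zero
    print(sum(deviations) / len(data))        # 0.0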

Square the Distance

We aim to measure variability in a way that treats negative distances from the mean as positive or at least non-negative. One effective approach is to square these distances. When you square a number, even if it’s negative, it becomes positive. For instance, squaring -1 results in 1 (since -1 × -1 = 1).

By squaring these deviations from the mean, we transform negative deviations into positive ones. This method ensures that when we sum up these squared values and calculate their average, we obtain a meaningful measure of variability that isn’t confined to zero. This measure is known as variance. It involves squaring each deviation from the mean, summing them, and then averaging them out to provide insight into the spread of data.

In our example, the first deviation is -2, meaning the observation lies two units below the sample mean in negative territory. When we square this value, -2 squared equals 4, which conceptually creates a square or a box with an area of 4. Instead of representing the deviation as -2, we now consider it as a positive square distance of 4 from the mean.

Moving to the next deviation, which is one unit below the mean, squaring -1 gives us an area of 1. The observation at the mean itself, 3, has a squared distance of 0, since 0 squared is still 0. The observation 4 is one unit above the mean, so squaring 1 results in an area of 1. Finally, for the observation of 5, which is two units above the mean, squaring 2 gives us an area of 4.

Thus, after squaring these deviations, we get values of 4, 1, 0, 1, and 4. By squaring these deviations, we transform negative distances into positive values and maintain positive values as they are. This method allows us to calculate an average of these squared distances, which provides a meaningful measure of variability in the data set.

We can consider the overall variability in this dataset as the sum of these areas of the squares. Adding up the areas of the squares—4 plus 1 plus 0 plus 1 plus 4—gives you a total of 10. This sum represents a snapshot measure of the amount of variation present in this dataset.

Variance of other data

When we average the sum of the squared distances from the mean, we obtain a value of 2. This calculation involves adding up the areas represented by these squared deviations and then dividing by the total number of observations, which is 5. In this context, 10 represents the total variability in our dataset, and 5 is the number of observations. Therefore, as a measure of overall variation, we derive a variance value of 2. Variance is expressed in squared units because we square the distances from the mean during the calculation process.
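
As a short Python sketch, the same arithmetic mirrors these steps:

    data = [1, 2, 3, 4, 5]
    mean = sum(data) / len(data)                           # 3.0

    squared_deviations = [(x - mean) ** 2 for x in data]   # [4.0, 1.0, 0.0, 1.0, 4.0]
    total_variation = sum(squared_deviations)              # 10.0 -> total area of the squares
    variance = total_variation / len(data)                 # 10.0 / 5 = 2.0

    print(squared_deviations, total_variation, variance)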

 

 

To understand how spread out a set of data points can be, we need to measure the variation. The simplest measure of variation is the range. The range is simply the difference between the largest and smallest values in a set of observations. In other words, you take the maximum value and subtract the minimum value from it.

For example, let’s consider our two small datasets again:
1. The first dataset: 5, 5, 5, 5, 5
2. The second dataset: 1, 3, 5, 8, 8

To calculate the range for each dataset:
First dataset: The maximum value is 5, and the minimum value is also 5. So, the range is 5 – 5 = 0.
Second dataset: The maximum value is 8, and the minimum value is 1. So, the range is 8 – 1 = 7.

From these calculations, we see that the first dataset has a range of 0, indicating no variability since all values are the same. The second dataset has a range of 7, showing more variability because the values are more spread out.
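
A tiny Python sketch (the function name is just illustrative) captures this calculation:

    def data_range(values):
        """Range: largest observation minus smallest observation."""
        return max(values) - min(values)

    print(data_range([5, 5, 5, 5, 5]))   # 0 -> no variability at all
    print(data_range([1, 3, 5, 8, 8]))   # 7 -> values are more spread out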

Range of Income
Let’s consider an example of income distribution among people in India based on their age. In this dataset, the median income is ₹23,000, while the mean income is ₹37,000.

Additionally, we have the minimum and maximum values of the dataset, which are ₹0 and ₹1.25 lakh (₹125,000), respectively. When we calculate the range by subtracting the minimum value from the maximum value, we get:
• Range: ₹1.25 lakh – ₹0 = ₹1.25 lakh

This range of ₹1.25 lakh indicates a significant amount of variability in the income data. To put this into perspective, the median income is ₹23,000, and the mean income is ₹37,000.

A range of ₹1.25 lakh suggests a substantial spread and dispersion in the dataset. It highlights the wide gap between the lowest and highest incomes, reflecting a high level of variability.

There’s a lot of variability in this dataset, but the problem is that this measure of variation, the range, might be based on only one person. This is because the range is sensitive to outliers.

For instance, suppose in our dataset there’s really just one person making over ₹1.25 lakh, while everyone else is around the mean, median, or mode. This measure of variation or spread can be misleading because the range is not robust—it is sensitive to extreme values. Therefore, while the range gives us an idea of the overall spread, it’s important to be aware of its limitations and the potential impact of outliers on this measure.

Disadvantage of Range
One issue with using the range as a measure of variability is that it only considers the two endpoints of the dataset—the maximum and minimum values. This means it neglects the data within the range, potentially giving a misleading sense of the dataset’s overall spread.

For example, let’s take two datasets:
1. Dataset A: 1, 2, 2, 5
2. Dataset B: 1, 5, 5, 5

For both datasets, the range is calculated as follows:
1. Range for Dataset A: 5 – 1 = 4
2. Range for Dataset B: 5 – 1 = 4

Despite having the same range, these two datasets differ significantly in their variability. Dataset A has values of 1, 2, 2, and 5, showing more variability within the data points. On the other hand, Dataset B consists of 1 and three values of 5, showing less variability within the data points.

The range fails to capture this difference because it only considers the maximum and minimum values and ignores the distribution of the data in between. Thus, while the range can provide a quick sense of the spread, it doesn’t give a complete picture of the dataset’s variability.

Another issue with using the range as a measure of variability is its sensitivity to outliers or extreme values. For example, let’s consider a dataset: 1, 2, 3, 3, 3. Here, the range is calculated as the difference between the maximum and minimum values:
• Range: 3 – 1 = 2
Now, let’s modify one of the values to an extreme outlier: 1, 2, 3, 3, 95

In this modified dataset, the range becomes:
• Range: 95 – 1 = 94

By changing just one observation from 3 to 95, the range jumps dramatically from 2 to 94. This substantial change in the range results from just one extreme value, illustrating how sensitive the range is to outliers. Even though the majority of the data points remain the same, the presence of an outlier skews the range significantly.

Thus, the range, while easy to calculate and understand, can be misleading when there are outliers in the dataset. It does not provide a reliable measure of variability because it is overly influenced by extreme values.
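
Reusing the same kind of helper as before, the effect of a single outlier is easy to see:

    def data_range(values):
        return max(values) - min(values)

    print(data_range([1, 2, 3, 3, 3]))    # 2
    print(data_range([1, 2, 3, 3, 95]))   # 94 -> one extreme value dominates the measure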

Interquartile Range
We need a measure that is more robust and insensitive to outlier points. This measure of variation is called the interquartile range (IQR). The IQR is simply the difference between the third quartile (Q3) and the first quartile (Q1).

To calculate the IQR, you follow these steps:
1. Organize your data points from lowest to highest.
2. Calculate the percentiles.
3. Identify the 25th percentile (Q1) and the 75th percentile (Q3).
4. Subtract Q1 from Q3.

The middle 50% of the data, represented by the IQR, excludes the top 25% and the bottom 25% of the data, thereby eliminating many extreme or outlying values. This makes the IQR a more reliable measure of variation, as it focuses on the central portion of the dataset and is not influenced by extreme values that could make the range misleading.

By using the IQR, we get a clearer picture of the true variability in the data, allowing us to understand better how spread out the middle 50% of our observations are.
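
As a small sketch using NumPy (one common tool for this; the data here are made up), the IQR comes directly from the 25th and 75th percentiles:

    import numpy as np

    data = [1, 2, 3, 3, 3, 4, 5, 6, 7, 95]   # hypothetical data with one extreme value

    q1 = np.percentile(data, 25)   # first quartile
    q3 = np.percentile(data, 75)   # third quartile
    iqr = q3 - q1                  # spread of the middle 50% of the data

    print(q1, q3, iqr)             # the outlier 95 barely affects the IQR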

Let’s consider an example using income data. If we look at our summary statistics, the third quartile (Q3) is around ₹48,000, and the first quartile (Q1) is ₹8,000. The interquartile range (IQR) is calculated by subtracting Q1 from Q3:
IQR = ₹48,000 − ₹8,000 = ₹40,000

As a measure of variation, the IQR is significantly smaller than the range, which was ₹1.25 lakh. This indicates that when we focus on the middle 50% of the data, the variation is much smaller.

The IQR gives us a more accurate picture of the variability in the dataset by ignoring extreme values. It focuses on the central portion of the data, making it a more effective measure of spread or variation, especially when we suspect that our dataset has outliers. By using the IQR, we get a clearer understanding of the true variability among the majority of the observations, without the distortion caused by outliers.

Imagine you’re a teacher assessing student performance in a math exam where scores range from 0 to 100. You’ve collected a dataset with each student’s name and their respective marks.

Using statistical software, we calculated the mean and median scores for the dataset. The mean score, averaging around 75, suggests the typical mark obtained across all students. However, it’s essential to note that the median score, approximately 68, is slightly lower than the mean.

We have already looked at the measures of central tendency, which give us a clear idea of where the median, mean, and mode lie within this set of observations. Additionally, we’ve examined measures of location or position, which tell us about the marks obtained by the bottom and top percentiles of this data set using percentiles and quartiles.

However, to get a complete picture, we also need to understand how varied or spread out these marks are. When analyzing marks, it’s not only useful to know the central tendency or location measures but also to grasp the extent of variation among students’ marks. This brings us to the concept of variation or variability in a set of observations.

Measures of variation, also known as measures of spread, dispersion, or variability, all refer to the same idea: assessing how spread out a set of data points is. While the mean, median, and mode provide information about a typical set of observations, they don’t reveal much about the dispersion or spread of the data.

Understanding the spread of marks can highlight how much scores deviate from the average, indicating whether most students have similar scores or if there’s a wide range of performance levels. This is crucial for educators to identify gaps and areas needing attention.
Variation is measured through concepts like the range, variance, and standard deviation, which complement our understanding of central tendency by providing insights into the spread of data points within the marks dataset.

Let’s consider two small datasets to understand the concept of variability better. The first dataset consists of the number five repeated five times. The second dataset consists of the values one, three, five, eight, and eight.

Looking at these two sets of data, you might ask, “Which set of observations has greater variability? Which one is more spread out, and which one has more variation?” Most people would say that the second set of observations has more variability. The first set of observations has no variation because all the numbers are the same—it’s five repeated five times.

However, if we only look at measures of central tendency, we get a limited picture. For both of these datasets, the mean is five. The median, or the 50th percentile, is also five for both datasets. While the mode differs slightly, it still doesn’t provide much information about how spread out these two sets of observations are.

In essence, measures of central tendency like the mean, median, and mode tell us about the typical value in a dataset, but they don’t reveal the spread or variability of the data. To understand how varied or dispersed these sets are, we need to look at measures of variation or spread.
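
A quick check with Python’s statistics module makes the point: the two datasets share the same mean and median despite their very different spreads.

    from statistics import mean, median

    a = [5, 5, 5, 5, 5]
    b = [1, 3, 5, 8, 8]

    print(mean(a), median(a))   # 5 5
    print(mean(b), median(b))   # 5 5 -> identical centers, very different spread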

Having covered the major measures of center and location—mean, median, and mode—we now turn to other important measures: percentiles and quartiles. These measures divide a dataset into equal parts, offering deeper insights into the distribution and spread of data. Percentiles, such as the 25th percentile or the median (50th percentile), indicate the value below which a certain percentage of observations fall. Quartiles divide the dataset into four equal parts, each containing 25% of the data.

Let’s delve into the concept of percentiles, which are crucial in understanding the distribution of data and measuring relative positions within a dataset. A percentile simply indicates the value below which a certain percentage of data points fall.

For instance, if we talk about the 30th percentile of a dataset, it signifies the value below which 30% of the observations lie. Similarly, the 50th percentile, known as the median, indicates the point below which half of the dataset’s values are located.

Percentiles are widely used in various contexts, such as academic exams. For example, in CAT exams, if someone achieves a 98th percentile, it means that they have performed better than 98% of all test-takers. This percentile score provides a clear indication of their relative performance compared to others.

Calculating percentiles manually can sometimes be complex, especially with large datasets or non-standard distributions. Hence, in practice, most analysts and researchers rely on statistical software to compute percentiles accurately and efficiently.
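
For instance, a minimal NumPy sketch (with made-up scores) shows how such a percentile is typically computed in practice:

    import numpy as np

    scores = [55, 60, 62, 68, 70, 75, 80, 85, 90, 100]   # hypothetical exam scores

    print(np.percentile(scores, 30))   # value below which roughly 30% of the scores fall
    print(np.percentile(scores, 50))   # the median (50th percentile)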

Understanding percentiles allows us to grasp the distribution of data points and their relative positions within a dataset. Whether analyzing exam scores, income levels, or other metrics, percentiles provide valuable insights into where an individual or observation stands in relation to the broader dataset.

Quartiles

Quartiles are another important measure; they divide a dataset into four equal parts, each containing 25% of the data. Quartiles are a specific type of percentile. For example, the 25th percentile corresponds to the first quartile. Dividing a dataset into four equal parts gives us three quartiles: the first quartile (25th percentile), the second quartile (50th percentile or median), and the third quartile (75th percentile).

Percentiles, such as the 50th percentile (median), indicate the point below which a certain percentage of data values lie. Quartiles are simply special percentiles that occur at specific data points within this distribution.

Understanding quartiles provides valuable insights into the distribution of data. The third quartile (75th percentile), for instance, indicates that 75% of the observations lie below this value. It helps us gauge the spread and variability within the dataset, providing a clearer picture of where values are concentrated.

Sometimes you’ll also see references to quartile zero and the fourth quartile; these are simply the 0th percentile and the 100th percentile, respectively. So quartile zero is the 0th percentile (the minimum value), and quartile four is the 100th percentile (the maximum value).

It’s important to note that terms like median, 50th percentile, and second quartile all refer to the same value in a dataset. They represent the middle point where half of the data points fall below and half above, offering a central measure of the dataset’s distribution.

Understanding quartiles allows us to analyze data in larger segments, offering a clearer picture of the dataset’s distribution and variability. They are particularly useful when assessing the spread of data and identifying outliers or extremes within a dataset.
In statistical analysis, quartiles provide a structured approach to interpreting data, complementing percentiles by offering broader insights into how data is dispersed across its range. Whether examining exam scores, income levels, or other metrics, quartiles help researchers and analysts better understand the central tendencies and variability within the data.

Example: Quartile

Let’s delve into the marks data to understand its distribution using quartiles and specific percentiles. We’ve already explored the mean and the median (50th percentile), now let’s examine the quartiles to gain deeper insights.

The first quartile, also known as the 25th percentile, is 65 marks. This tells us that 25% of students scored below 65 marks. Moving to the second quartile, which aligns with the median, we find it’s at 75 marks. This means half of the students scored below 75 marks.

Next, the third quartile, or the 75th percentile, is at 90 marks. Here, 75% of students scored below 90 marks. These quartiles provide a structured view of how marks are distributed across the dataset, giving us a sense of the spread and central tendencies.

Now, let’s focus on extremes: the top 1% of students achieved 100 marks, while the bottom 1% scored 55 marks. This highlights the range and outliers within the dataset, showcasing the highest and lowest scores.

To gain a more detailed perspective, we can calculate specific percentiles such as the 10th, 20th, 30th, up to the 90th percentile. These percentiles allow us to zoom in on smaller chunks of the data, providing a nuanced understanding of how marks are distributed across different segments of the student population.
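
As a sketch (with a hypothetical marks array, since the original dataset is not shown), the quartiles and these finer percentiles can be computed in a single call:

    import numpy as np

    marks = [55, 60, 65, 68, 72, 75, 80, 85, 90, 95, 100]   # hypothetical marks

    # Quartiles: 25th, 50th (median) and 75th percentiles
    print(np.percentile(marks, [25, 50, 75]))

    # Finer view: deciles from the 10th to the 90th percentile
    print(np.percentile(marks, [10, 20, 30, 40, 50, 60, 70, 80, 90]))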

As we’ve explored percentiles and quartiles, we’re gaining a clear snapshot of how marks are distributed among students. To be in the top one percent, you’d need close to 100 marks, while the bottom one percent is defined by having 55 marks. It’s important to note that while these percentiles mark extreme values, they represent the same number of individuals—just as many students are in the top one percent as in the bottom one percent; the values simply differ.

Imagine you’re a teacher assessing student performance in a math exam where scores range from 0 to 100. You’ve collected a dataset with each student’s name and their respective marks. Now, let’s delve into analyzing these scores using measures of central tendency.
Using statistical software, we calculated the mean and median scores for the dataset. The mean score, averaging around 75, suggests the typical mark obtained across all students. However, it’s essential to note that the median score, approximately 68, is slightly lower than the mean.

This difference between the mean and median indicates that some students achieved markedly high scores, pulling the mean upward relative to the median.

In real-world contexts, such as reporting income statistics, the median is often favored over the mean. This preference arises because the mean can be skewed by extreme values—like the incomes of a small number of very wealthy individuals—resulting in a misleading representation of the typical income level. Instead, the median income provides a more accurate snapshot of the center of the income distribution, reflecting what most people typically earn.
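
A tiny sketch (with made-up incomes) illustrates this: one very high earner pulls the mean up sharply while the median barely moves.

    from statistics import mean, median

    incomes = [20_000, 22_000, 23_000, 25_000, 30_000]
    print(mean(incomes), median(incomes))             # 24000 23000

    incomes_with_outlier = incomes + [1_250_000]      # add one very wealthy individual
    print(mean(incomes_with_outlier), median(incomes_with_outlier))   # mean jumps, median stays near 24000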

Similarly, in our exam scores dataset, the median score gives us a clearer picture of the central tendency of student performance. It represents the score at which half of the students scored below and half scored above, without being heavily influenced by extreme scores at either end.

When deciding which measure of central tendency best captures the typical observation in a numerical dataset, we often consider the characteristics of the data and our analytical goals.

The mean is widely used and provides a straightforward average of all values. However, it can be heavily influenced by extreme values or outliers, making it less suitable for skewed datasets. In such cases, the median offers a robust alternative. It represents the middle value when all observations are ordered, making it less sensitive to extreme values and thus preferred when the dataset contains outliers.
On the other hand, the mode identifies the most frequently occurring value and is particularly useful for understanding the most common category or value in categorical or discrete datasets. It’s especially handy when exploring initial properties of a dataset or when dealing with variables that have a limited number of distinct values.