Author: Campusπ

Histograms are incredibly useful when you have a large dataset and want to visualize the distribution of a numerical variable. Unlike stem-and-leaf plots or dot plots, which can become cluttered with too many data points, histograms group data into intervals, or bins. Each bar in a histogram represents an interval, and its height corresponds to the number of observations within that interval.

Histogram

What makes histograms versatile is their ability to adapt to datasets of any size. Whether you’re working with a small sample in psychology or a massive dataset in epidemiology, histograms can effectively summarize the data while maintaining clarity. By adjusting the number of bins, you can fine-tune the level of detail in your visualization, capturing broader trends or focusing on specific nuances in the data distribution.

Key Takeaway 

  • Histograms can effectively summarize huge datasets while maintaining clarity. The histogram is also one of the most widely used charts for checking the normality assumption in a dataset.
  • Histograms are helpful in spotting outliers.
  • The representation of data heavily depends on the number of intervals, or bins, chosen.

Histograms are a go-to tool in data analysis because they provide a clear, visual snapshot of how data points are spread across different ranges, making them indispensable for exploring patterns and outliers in numerical data.

Here’s a histogram of the Air Quality Index (AQI) dataset. Similar to the Dot Plot and stem-and-leaf plot, the histogram reveals insights into the distribution of AQI values.

We can see that most cities have AQI values clustered in the moderate range, indicated by the taller bars in the 60-80 AQI range. However, there’s variability across the dataset, particularly noticeable in the lower and higher ends of the AQI spectrum.

The smallest bar on the left side of the histogram represents a few cities with exceptionally good air quality, possibly with AQI values below 50. Conversely, there might be a smaller bar on the right side indicating cities with higher AQI values, potentially between 90 and 100.

This visual representation effectively highlights any outliers or extreme values in AQI across different cities, offering a quick overview of how air quality is distributed within our dataset.
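As a minimal sketch of how such a chart can be produced, here is a histogram drawn with Python’s matplotlib; the AQI values below are hypothetical stand-ins, not the article’s actual dataset:

```python
import matplotlib.pyplot as plt

# Hypothetical AQI values for a set of cities (stand-ins for the real dataset)
aqi = [42, 48, 55, 58, 61, 63, 65, 66, 68, 70, 72, 74, 75, 78, 82, 88, 93, 97]

plt.hist(aqi, bins=6, edgecolor="black")  # each bar counts the cities whose AQI falls in one bin
plt.xlabel("Air Quality Index (AQI)")
plt.ylabel("Number of cities")
plt.title("Distribution of AQI across cities")
plt.show()
```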

Histogram with bins

When creating a histogram of the Air Quality Index (AQI) in our dataset, the representation of data heavily depends on the number of intervals, or bins, chosen. For instance, comparing a histogram with three bins to one with ten bins can drastically alter how the data distribution appears.

With fewer bins, such as three, the histogram provides a broader overview, showing general trends and broad ranges of AQI values. In contrast, using ten bins offers a more detailed, granular view, revealing nuances in AQI distribution across narrower ranges of values.

However, histograms can be misleading if not carefully constructed. If the number of bins is too low, important details like multiple peaks or variability in AQI values might be obscured. Conversely, too many bins can create a noisy or cluttered histogram that complicates interpretation.

In summary, selecting the right number of bins in a histogram is crucial for effectively visualizing and understanding the distribution of AQI data. It strikes a balance between capturing meaningful patterns and avoiding the oversimplification or overcomplication of data insights.
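To see the effect of bin count directly, the following sketch draws the same hypothetical AQI values with 3 bins and with 10 bins side by side:

```python
import matplotlib.pyplot as plt

# Same hypothetical AQI values as before
aqi = [42, 48, 55, 58, 61, 63, 65, 66, 68, 70, 72, 74, 75, 78, 82, 88, 93, 97]

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, n_bins in zip(axes, [3, 10]):
    ax.hist(aqi, bins=n_bins, edgecolor="black")
    ax.set_title(f"{n_bins} bins")  # few bins: broad overview; many bins: finer detail
    ax.set_xlabel("AQI")
axes[0].set_ylabel("Number of cities")
plt.show()
```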

A stem-and-leaf plot is used to display a numerical variable. It’s essentially a way to list our data in an organized manner. In a stem-and-leaf plot, we have “stems” and “leaves.” The numbers to the left of a vertical bar are called stems, while the digits to the right are called leaves.

All stems that start with the same digit have their corresponding leaves written beside them. This setup allows us to reconstruct each observation in the dataset by combining the stem and the leaf.

Key Takeaway

  • A stem-and-leaf plot is used to display a numerical variable.
  • The numbers to the left of a vertical bar are called stems, while the digits to the right are called leaves. All stems that start with the same digit have their corresponding leaves written beside them.
  • We get an instant snapshot of the minimum and maximum values.
  • This plot helps us understand the distribution of values, identify the range, and see the frequency of specific values at a glance.

Let’s create a stem-and-leaf plot for the Air Quality Index (AQI) of our cities. Each observation is split into a stem and a leaf, with a vertical bar separating the two.

For example, in our stem-and-leaf plot:

  • The numbers to the left of the vertical bar are the stems.
  • The numbers to the right of the vertical bar are the leaves.

Typically, a stem-and-leaf plot will specify what differentiates a stem from a leaf. In this case, the stem represents the leading (tens) digit of our AQI number, and the leaf represents the digit in the ones place.

Stem and Leaf Plot

Here’s a simple example for clarity:

Stem | Leaf
5 | 1 3 7
6 | 0 4 8
7 | 2 5

In this example:

  • The stem “5” with leaves “1, 3, 7” represents AQI values of 51, 53, and 57.
  • The stem “6” with leaves “0, 4, 8” represents AQI values of 60, 64, and 68.
  • The stem “7” with leaves “2, 5” represents AQI values of 72 and 75.

This plot helps us see the distribution of AQI values and allows us to easily reconstruct the actual data points from the stems and leaves.

In our dataset, we have integer values for the Air Quality Index (AQI) such as 54, 56, 78, and 80, without any decimal places. Using a stem-and-leaf plot, we can quickly visualize this data.

For instance:

  • The lowest AQI in our dataset is 54, represented by a stem of 5 and a leaf of 4.
  • The highest AQI in our dataset is 80, represented by a stem of 8 and a leaf of 0.

Here’s what our stem-and-leaf plot might look like. We can also see it in the above figure.

Stem | Leaf
5 | 4 6
6 | 0 2 5
7 | 1 3 8
8 | 0

In this plot:

  • The stem “5” with leaves “4, 6” represents AQI values of 54 and 56.
  • The stem “6” with leaves “0, 2, 5” represents AQI values of 60, 62, and 65.
  • The stem “7” with leaves “1, 3, 8” represents AQI values of 71, 73, and 78.
  • The stem “8” with leaf “0” represents an AQI value of 80.

From this stem-and-leaf plot, we get an instant snapshot of the minimum and maximum AQI values. We can also see how many cities fall into specific AQI ranges. For example, if the plot showed a stem of 7 with five leaves of 5, it would indicate that five cities have an AQI of 75.

This plot helps us understand the distribution of AQI values, identify the range, and see the frequency of specific values at a glance.
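There is no stem-and-leaf function in Python’s standard plotting tools, but a minimal sketch is easy to write by grouping each value’s tens digit (stem) with its ones digit (leaf); the AQI values below mirror the small example above:

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Print a stem-and-leaf display: stem = tens digit, leaf = ones digit."""
    groups = defaultdict(list)
    for v in sorted(values):
        groups[v // 10].append(v % 10)
    for stem in sorted(groups):
        leaves = " ".join(str(leaf) for leaf in groups[stem])
        print(f"{stem} | {leaves}")

# Hypothetical AQI values matching the example above
aqi = [54, 56, 60, 62, 65, 71, 73, 78, 80]
stem_and_leaf(aqi)
# 5 | 4 6
# 6 | 0 2 5
# 7 | 1 3 8
# 8 | 0
```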

Problem with the Stem-and-Leaf Plot

One thing to note about a stem-and-leaf plot, similar to a dot plot, is that it works best for small datasets, since all observations are represented individually. For our dataset of cities, the number of observations is manageable, making the stem-and-leaf plot quite useful.

However, for larger datasets—like those with a million people, ten thousand observations, or even 500 people, which is common in survey research in social sciences and humanities—a stem-and-leaf plot can become messy and difficult to interpret. The visual clutter from so many data points can overwhelm the plot.

In such cases, it’s more practical to use other methods or to subset your sample to a smaller set of observations. For example, in psychology, datasets are often smaller, making a stem-and-leaf plot more feasible and helpful.

So, while stem-and-leaf plots are great for small datasets, their utility diminishes as the dataset grows larger. For larger datasets, consider other visualization techniques or focus on smaller subsets to keep the plot clear and informative.

We can get the numerical summary through measures of central tendency, location, and variability for numerical data. But numbers alone don’t always tell the full story. To truly understand data, especially when considering measures of spread, it’s crucial to visualize it. Seeing the data can reveal patterns and insights that raw numbers might obscure. By creating visual representations, we can get a clearer picture of the distribution and observe how spread out our observations really are. Let’s explore how visual tools can bring our data to life and enhance our analysis!

Key Takeaway

  • Each dot represents a different observation, indicating a different data point.
  • A dot plot is quite useful for understanding the central part of the distribution, the spread, and spotting outliers.
  • Dot plots become very messy and difficult to interpret with larger datasets because of the sheer number of data points.

Example

Imagine you’re an environmental scientist keen on understanding air quality across various cities. Why do some cities enjoy fresher air while others struggle with pollution? How widespread is the variation in Air Quality Index (AQI) across these cities? What’s the typical AQI if we consider all these cities together?

To tackle these questions, your first step would be to visualize the central tendency of AQI—essentially, where most AQI values cluster—and its distribution across different cities. This will give a snapshot of typical air quality. Following this, we would delve deeper into examining the spread of AQI values, exploring how far apart the best and worst air qualities are.

Visualizing these measures of central tendency and spread is crucial. It transforms raw data into a more intuitive format, making it easier to grasp the overall picture of air quality and identify any patterns or outliers that might need further investigation.

Let’s consider an example for our analysis. Here, the columns represent the following:

  1. City: The name of each city.
  2. Population Density (people per square km): This variable shows how crowded a city is. Some cities have a high population density, while others are more spread out.
  3. Air Quality Index (AQI): This variable indicates the quality of air in each city, with lower values representing better air quality.
  4. Average Income per Capita: This is a measure of the average income for individuals in each city. In environmental studies, income per capita can be a key factor in understanding variations in air quality and other environmental outcomes.

By analyzing these variables, we can uncover important insights about the factors influencing air quality in different cities.

Dot Plot


Let’s visualize this data to see how these factors interact and impact air quality across various urban areas! Here’s a Dot Plot of the Air Quality Index (AQI) in our dataset. In a Dot Plot, we typically use one dot for each observation. In this case, we have a set of cities, and each dot represents a different city, indicating a different data point or observation.

On the y-axis (the vertical axis), we have the frequency, which is the number of observations or data points for each AQI value. We can see that many cities have similar AQI values, resulting in some clustering. However, there is also variation in AQI across different cities.

We can see that most cities cluster around AQI values in the 50s, 60s, and 70s, indicating moderate air quality. This clustering shows the typical air quality in our set of observations. However, we do have an outlier: City X, to the extreme right, has a significantly higher AQI, indicating much poorer air quality than the other cities.

In fact, City X is the only city in our dataset with an AQI above 100. This Dot Plot is quite useful for understanding the central part of the distribution, the spread, and whether there are any outliers. In this case, City X stands out as an outlier with much worse air quality compared to most other cities in our dataset.
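As a rough sketch, a dot plot can be built in matplotlib by stacking one marker per repeated observation; the AQI values below are hypothetical, with one deliberate outlier standing in for City X:

```python
import matplotlib.pyplot as plt
from collections import Counter

# Hypothetical AQI values; 105 stands in for the outlier "City X"
aqi = [55, 58, 58, 62, 65, 65, 65, 71, 74, 78, 105]

x, y = [], []
for value, count in Counter(aqi).items():
    for level in range(1, count + 1):  # stack one dot per repeated observation
        x.append(value)
        y.append(level)

plt.scatter(x, y)
plt.xlabel("Air Quality Index (AQI)")
plt.ylabel("Frequency")
plt.title("Dot plot of AQI by city")
plt.show()
```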

Drawback

One issue with a Dot Plot is that it can become messy with large datasets. Since we are plotting each observation, if we have a large number of data points, say a dataset of a million cities, a Dot Plot will be very messy and difficult to interpret because of the sheer number of data points.

Variance represents the average of the squared deviations from the mean. Because we square the deviations, the variance is always a non-negative value, meaning it’s either positive or zero. The method for calculating variance varies slightly depending on whether we are dealing with a sample or a population.

Let’s consider a hypothetical example of income. When we analyze income data, we discover that the sample variance is a large figure. Since this is a sample from a population, the sample variance is measured in thousands of rupees squared. This unit, rupees squared, can be challenging to interpret directly. Hence, we prefer using the standard deviation, because it is expressed in rupees, which provides a more intuitive understanding of the variability in the income data.

Key Takeaway 

  • The standard deviation represents the typical distance of observations from the mean.
  • When the standard deviation is low, it suggests that most values cluster closely around the mean, indicating less variability.
  • Conversely, a larger standard deviation implies greater variability, with values more spread out from the mean.

Standard Deviation: Population and Sample

When we discuss standard deviation, whether from a population or a sample, the formulas are essentially the same. We simply take the square root of the variance. In the case of population standard deviation, it’s the square root of the population variance, and for sample standard deviation, it’s the square root of the sample variance.
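In standard notation, with σ² the population variance and s² the sample variance:

\[
\sigma = \sqrt{\sigma^2}, \qquad s = \sqrt{s^2}
\]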

For our income data, calculating the sample standard deviation is straightforward using statistical software. Taking the square root of our sample variance gives us a standard deviation of approximately 55,000 rupees. This measurement is in the original units of rupees, unlike the sample variance, which is in rupees squared.

Conceptually, the standard deviation provides an indication of the typical distance of our income observations from the sample mean. On average, we can interpret this standard deviation to mean that incomes vary by about 55,000 rupees.

Standard Deviation Interpretation w.r.t Mean


To better understand the standard deviation, especially in relation to income or any dataset, it’s helpful to consider its interpretation alongside the mean. The standard deviation represents the typical distance of observations from the mean.

For instance, if we have a dataset with a mean income of INR 100 and a sample standard deviation of INR 0, what does this tell us about the spread of the data? With a standard deviation of zero, all values in the dataset are exactly 100. Therefore, there is no spread in the data in this case.

This example illustrates how the mean can aid in interpreting the standard deviation. When the standard deviation is low, it suggests that most values cluster closely around the mean, indicating less variability. Conversely, a larger standard deviation implies greater variability, with values more spread out from the mean. This comparison helps provide a more intuitive understanding of what the standard deviation measures in a dataset.

If we have a sample standard deviation of one rupee and a sample mean of 100, it indicates that the data points are relatively close to the mean. The standard deviation being much smaller than the mean suggests that most observations cluster tightly around the mean value of 100. Therefore, the amount of variability or spread in the data is minimal in this case. A sample mean of 100 with a standard deviation of 5 is still pretty small; there will not be very much variation in the dataset.

Suppose our sample mean is 100 and the sample standard deviation is 75. In this case, the sample standard deviation is pretty close to the sample mean, so we can say there is a medium amount of variability: there are quite a few observations around the sample mean, but there is some degree of spread. Finally, consider a mean of 100 and a standard deviation of ten thousand; then the overall spread of the data is going to be large.

We can interpret the sample standard deviation by comparing it with the sample mean. In each scenario we’ve discussed, we’ve kept the sample mean consistent while varying the sample standard deviation.

When the sample standard deviation is significantly large compared to the sample mean, it indicates that the data points are widely spread around the mean. Conversely, if the sample standard deviation is relatively small compared to the sample mean, it suggests there is little variation, meaning most observations are closely clustered around the sample mean. Thus, comparing the standard deviation to the mean helps us gauge the degree of variability in the dataset.

Standard Deviation of Income

Let’s examine our income data to understand what the standard deviation is telling us. The sample standard deviation is 55,000, while the mean income is around 37,000. Since the sample standard deviation is larger than the sample mean, this indicates a relatively large spread around the mean. In other words, these values suggest a significant degree of variability in incomes based on this sample data.

In the previous article, we discussed variance and standard deviation assuming we had a population of data, even though we referred to it as a sample. However, in practice, the formulas used to calculate variance and standard deviation differ slightly depending on whether the dataset is a sample drawn from a larger population or represents the entire population itself.

Key Takeaway 

  • We compute population variance by summing the squared deviations of each observation from the population mean and then dividing by the population size N.
  • In the case of sample variance, however, we divide by n−1, where n is the number of observations in the sample.
  • We divide by n−1 because of degrees of freedom; this adjustment provides an unbiased estimate of the population variance when working with a sample.
  • Degrees of freedom refer to the number of independent pieces of information or values that can vary freely in a dataset or calculation.
  • In essence, n−1 accounts for the adjustment needed to ensure that our sample statistic, such as variance, provides an unbiased estimate of the population parameter, considering the limitations imposed by using sample data rather than the entire population dataset.

Variance of a Population

So, when discussing the variance for a population, we use the Greek letter sigma squared. This notation is a convention used for population parameters, where Greek letters are typically employed. To calculate the population variance, we subtract the population mean from each observation, square the result, sum all these squared deviations, and finally divide by the population size N.
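Written out, with N the population size and μ the population mean:

\[
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2
\]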

Population Vs Sample Variance

Variance of a Sample 

When calculating the variance for a sample, we follow a slightly different approach and use Latin or Roman letters to distinguish it from population variance. This distinction is important in statistics to indicate whether we are working with a sample or the entire population dataset.

Conceptually, the calculation for sample variance is similar to that for population variance: we compute the squared deviation of each observation from the sample mean and add them together. However, a key difference is that instead of dividing by n (the total sample size), we divide by n−1. This adjustment accounts for the fact that using n−1 degrees of freedom provides an unbiased estimate of the population variance when working with a sample.
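In symbols, with x̄ the sample mean and n the sample size:

\[
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2
\]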

Why we divide by n-1?

The reason we divide by n−1 instead of n when calculating sample statistics like variance relates to the concept of degrees of freedom. Degrees of freedom refer to the number of independent pieces of information or values that can vary freely in a dataset or calculation.

When we gather a sample from a population, we use sample statistics to estimate population parameters. For instance, when calculating sample variance, we rely on the sample mean as part of the calculation. The degrees of freedom in this context, n−1, represents the number of independent observations that can vary freely after using one sample statistic (the mean) to estimate variance.

In essence, n−1 accounts for the adjustment needed to ensure that our sample statistic, such as variance, provides an unbiased estimate of the population parameter, considering the limitations imposed by using sample data rather than the entire population dataset.

The sample mean is computed by summing all the observations in our sample and dividing by the sample size n. No other sample statistics are used in this calculation, so the degrees of freedom in this context is n: the sample size with zero adjustments. This reflects the number of independent observations available in our sample.

When we calculate the sample variance, we use the sample mean X bar as part of the calculation. Because we rely on one sample statistic (the mean) to estimate another (the variance), the degrees of freedom become n−1. This adjustment ensures that the sample variance provides an unbiased estimate of the population variance, considering the constraints of using a sample rather than the entire population dataset.
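As a small illustration, Python’s standard statistics module implements both conventions, so the two divisors can be compared directly; the sample values below are arbitrary:

```python
import statistics

data = [54, 56, 60, 62, 65, 71, 73, 78, 80]  # arbitrary sample values

# Population convention: divide the sum of squared deviations by N
print(statistics.pvariance(data))

# Sample convention: divide by n - 1 (the degrees-of-freedom correction),
# which always gives a slightly larger value than pvariance
print(statistics.variance(data))
```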

Degrees of freedom

Degrees of freedom have to do with constraints: specifically, constraints on the possible values of a set of observations. Suppose we have a small dataset that consists of just three numbers, which we abstractly call x1, x2, and x3.

If there are no constraints, then each number could be anything: the first number could be one, the second could be five, the next could be a million. With no constraints, these values are free to vary, so there are three independent pieces of information.

Degrees of Freedom

Imagine we have a dataset where the sample mean is fixed at three. This means there’s a constraint because the average of all values in the dataset must be exactly three. Let’s consider a small dataset with three numbers: suppose the first number is 1 and the second number is 3. Given that the sample mean is three, the third number must be 5. This is because the sum of 1, 3, and 5 divided by 3 equals the sample mean of three.

The issue here is that knowing the sample mean restricts the possible values of the remaining observations. In statistical terms, this constraint on the variability is reflected in the degrees of freedom associated with the sample variance. Degrees of freedom are reduced by one (n – 1) because the sample mean acts as a constraint on how freely the values can vary.

A straightforward way to deal with the units of variance, which are squared units, is simply to calculate the square root of the variance. Since variance is derived from squaring the deviations or distances from the mean, taking its square root reverses this process. Therefore, the square root of the variance gives us a measure of variation known as the standard deviation.

Standard Deviation

Key Takeaway 

  • Standard Deviation can be seen as the average distance from the mean for a set of observations, providing a measure of how spread out the data points are.
  • The square root of the variance gives us a measure of variation known as the standard deviation.
  • The significant advantage of the standard deviation is that it returns the units in an interpretable way: for example, rather than miles squared, as with variance, just miles.
  • If we examine a dataset with less variability, both the variance and the standard deviation will be smaller.
  • It helps convey the degree of variability in a dataset: the higher the standard deviation, the greater the variability of the observations around the mean.

Let’s examine our dataset of one, two, three, four, and five, and consider the significance of the standard deviation. As mentioned earlier, the variance represents the typical size of the boxes that symbolize the squared deviations from the mean. To find the standard deviation, we simply take the square root of the variance. For a dataset where the variance is two, the standard deviation would be approximately 1.41.

The significant advantage of the standard deviation is that it returns us to our original units of measurement. For instance, if our dataset is in miles, the variance would be in square miles, whereas the standard deviation would be in miles, which is more intuitive for interpretation.
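A quick check of this arithmetic in Python (using the population convention of dividing by n, as the example above does):

```python
import math
import statistics

data = [1, 2, 3, 4, 5]

variance = statistics.pvariance(data)  # (4 + 1 + 0 + 1 + 4) / 5 = 2
sd = math.sqrt(variance)               # back in the original units

print(variance, round(sd, 2))          # 2.0 1.41
```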

What does the standard deviation signify?

Standard Deviation

While the variance indicates the typical squared deviation from the mean, the standard deviation represents the side length of a typical box in our dataset. In this example, with a variance of two, the standard deviation of 1.41 can be seen as the length of one of these boxes. Squaring the standard deviation gives us the variance, demonstrating their mathematical equivalence. However, the preference for the standard deviation lies in its direct association with the original units of measurement, unlike the variance which is in squared units.

If we examine a dataset with less variability, both the variance and the standard deviation will be smaller. For example, in this dataset where the variance is 0.8, the square root of that variance, which is the standard deviation, is approximately 0.9. In other words, the side length of our typical box in this dataset is about 0.9. Therefore, the standard deviation of 0.9 reflects the smaller degree of variability present in these observations.

If we examine a dataset with greater variability, the variance will be larger. For instance, if the variance is 9.4, the typical squared deviation from the mean is larger. Taking the square root of the variance gives us the standard deviation, which is approximately 3.1.

Degree of variability

The standard deviation is simply the square root of the variance. A larger standard deviation, like one exceeding three, indicates a greater degree of variability in our dataset compared to a standard deviation below one. Thus, the total variability, represented by the variance and standard deviation, changes according to how spread out the data points are from the mean.

Let’s go back to the dataset consisting of the values one, two, three, four, and five. These values have distances both above and below the mean. These distances are in their original form and haven’t been squared.

The standard deviation represents the side length of our typical box, whose area is the typical squared deviation from the mean. When we calculate the standard deviation for this dataset, we get a value of 1.41 units. This length reflects the measure of variability in our dataset.

The standard deviation, when compared to the actual distances or deviations in our dataset, appears quite typical—it’s neither the smallest nor the largest distance from the mean. It can be seen as the average distance from the mean for a set of observations, providing a measure of how spread out the data points are.

In statistical terms, the standard deviation is a clear indicator—it represents a typical deviation or distance from the mean. It helps convey the degree of variability in a dataset: the higher the standard deviation, the greater the variability of the observations around the mean. Unlike simpler measures like the range or interquartile range, both the standard deviation and variance take into account all observations in the dataset.

One of the strengths of the standard deviation is its ability to condense all this information about the observations into a single number, providing a concise summary of the variability or dispersion in the dataset.

One way to measure variability is to calculate the difference of each observation from the mean. We then square each of these distances, sum them, and divide by the sample size, which gives the variance. Since we square each deviation, it can be represented by a square box (as you may remember from elementary school geometry), as we can see in the figure. To give a sense of the amount of variability, we can look at a few example datasets with different variances.

Key Takeaway 

  • For a dataset with less variability, the data points are not very different from each other, resulting in a smaller total area occupied by these squares. This means that the sum of the squared deviations or distances from the mean will be smaller.
  • Dataset with greater variability, occupies more space when we sum up the areas of the squares. In simpler terms, for a dataset with more variability, the sum of the squared distances from the mean is higher.
  • The units of variance are in squared units, meaning if the original dataset uses miles as units, the variance would be in square miles; if it’s in inches, the variance would be in square inches. This reflects how we calculate variance by squaring the deviations from the mean.
  • Variance can be challenging to interpret directly in terms of the original unit of measurement. For instance, if our dataset is in miles, it’s more intuitive to understand variability in miles rather than miles squared.

For a dataset with less variability, the data points are not very different from each other, resulting in a smaller total area occupied by these squares. This means that the sum of the squared deviations or distances from the mean will be smaller. In the above example, the sum of the squared deviations from the mean is 9 + 1 + 0 + 1 + 36 = 47 for one dataset and 1 + 1 + 0 + 1 + 1 = 4 for the other. Visually, you can see that the area covered by the squares is smaller for the chart on the right side than for the one on the left.

Variance

When you examine a dataset with greater variability, it occupies more space when you sum up the areas of the squares. In simpler terms, for a dataset with more variability, the sum of the squared distances from the mean is higher. This sum, which represents the area covered by the squares after squaring the distances from the mean, encapsulates the extent of variation within the dataset. Conversely, a dataset with less variability would have a smaller sum of squared distances.

To quantify this variation with a single measure, we use the variance, which is calculated by dividing the sum of the squared distances from the mean by the sample size. Importantly, variance indicates the variability present in our dataset.

If we examine the dataset with less overall variability, we see that the total area covered by the boxes is smaller. In this case, the typical box representing the variance has an area of 0.8. In simpler terms, when the total variability is reduced, the variance also decreases. For this example, the total variation, calculated as the sum of squared deviations from the mean (1 + 1 + 0 + 1 + 1) divided by five, equals 0.8.

So, that’s how variance is determined for a dataset. A dataset with higher variability results in a larger box, as seen in this case where the variance is 9.4. The typical box representing the variance for this dataset is 9.4 in area. This calculation remains consistent: total variability divided by the number of observations gives us this value of 9.4.

The variance effectively serves as the average of the sum of squared deviations from the mean. It accurately reflects the level of variability within a dataset: a higher variance value indicates greater variability among observations, while a lower variance value indicates less variability.

Unit of Variance

The units of variance are in squared units, meaning if the original dataset uses miles as units, the variance would be in square miles; if it’s in inches, the variance would be in square inches. This reflects how we calculate variance by squaring the deviations from the mean.

However, a drawback of the variance is that, because it is presented in squared units, it can be challenging to interpret directly in terms of the original unit of measurement. For instance, if our dataset is in miles, it’s more intuitive to understand variability in miles rather than miles squared.

This highlights a key issue with variance: its squared unit of measure makes interpretation less straightforward. Instead, a more practical approach would be to use a measure of variability that directly reflects the original unit of measurement. For example, instead of inches squared, we would prefer a measure based on inches; similarly, for miles squared, we would prefer a measure based on miles. A solution to this is to take the square root of the variance, which is also known as the standard deviation.

One issue with both the range and the interquartile range (IQR) is that they don’t include information from all the data points: the range only considers the two endpoints, and the IQR only considers the first and third quartiles.

We want a measure of variation that incorporates all the information from the dataset. One way to achieve this is by considering the average distance of each observation from the mean. This method gives us a sense of how much our data points deviate from the mean on average.

To do this, we take each observation, subtract the mean to find the distance from the mean, sum all these distances or deviations, and then divide by the total sample size. This method provides a comprehensive measure of variability because it considers every data point, giving us a more complete picture of the data’s spread.

For example, consider a set of observations: 1, 2, 3, 4, and 5. The sample size is 5. The mean x̅ is calculated by summing all the values and dividing by the sample size, which gives us 15 ÷ 5 = 3. So, the mean is 3. To measure variation in terms of distances from the mean, we subtract the mean from each observation and then calculate the average of those distances. We have the five observations 1, 2, 3, 4, and 5, and after subtracting the mean 3 we get −2, −1, 0, 1, 2.

Instead of having five different distances from the mean, we might want an average distance from the mean to make more sense. This would provide a single number representing how far off an observation is from the mean. One way to do this is to take the average of these distances. However, there’s a problem: when you calculate the average of these distances, you end up with a value of zero.

A value of zero is meaningless as a measure of variation or spread in a data set. Clearly, there is some variation, as some points are higher, some are lower, and one point is exactly at the sample mean. Thus, it doesn’t make sense to claim there’s no variation. This simple average of the distances from the mean fails to reflect the variation in the data set.

The issue is that if you take the average of the distances from the mean, it will always equal zero. This approach is flawed because the negative distances below the mean cancel out the positive distances above the mean, and the value at the mean is zero. Essentially, these distances cancel each other out, resulting in an average of zero.

Square the Distance

We aim to measure variability in a way that treats negative distances from the mean as positive or at least non-negative. One effective approach is to square these distances. When you square a number, even if it’s negative, it becomes positive. For instance, squaring -1 results in 1 (since -1 × -1 = 1).

By squaring these deviations from the mean, we transform negative deviations into positive ones. This method ensures that when we sum up these squared values and calculate their average, we obtain a meaningful measure of variability that isn’t confined to zero. This measure is known as variance. It involves squaring each deviation from the mean, summing them, and then averaging them out to provide insight into the spread of data.

In our example, the first deviation is -2, meaning the observation lies two units below the sample mean in negative territory. When we square this value, -2 squared equals 4, which conceptually creates a square or a box with an area of 4. Instead of representing the deviation as -2, we now consider it as a positive square distance of 4 from the mean.

Moving to the next deviation, which is one unit below the mean, squaring −1 gives us an area of 1. The observation at the mean itself, 3, has a squared distance of 0, since 0 squared is still 0. The next observation, 4, is one unit above the mean, so squaring 1 results in an area of 1. Finally, for the observation of 5, which is two units above the mean, squaring 2 gives us an area of 4.

Thus, after squaring these deviations, we get values of 4, 1, 0, 1, and 4. By squaring these deviations, we transform negative distances into positive values and maintain positive values as they are. This method allows us to calculate an average of these squared distances, which provides a meaningful measure of variability in the data set.

We can consider the overall variability in this dataset as the sum of these areas of the squares. Adding up the areas of the squares—4 plus 1 plus 0 plus 1 plus 4—gives you a total of 10. This sum represents a snapshot measure of the amount of variation present in this dataset.

Variance of other data

When we average the sum of the squared distances from the mean, we obtain a value of 2. This calculation involves adding up the areas represented by these squared deviations and then dividing by the total number of observations, which is 5. In this context, 10 represents the total variability in our dataset, and 5 is the number of observations. Therefore, as a measure of overall variation, we derive a variance value of 2. Variance is expressed in squared units because we square the distances from the mean during the calculation process.
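The whole derivation can be traced in a few lines of Python; this minimal sketch reproduces the numbers used above:

```python
data = [1, 2, 3, 4, 5]
n = len(data)

mean = sum(data) / n                    # 15 / 5 = 3
deviations = [x - mean for x in data]   # [-2, -1, 0, 1, 2]
print(sum(deviations))                  # 0.0 (raw deviations always cancel out)

squared = [d ** 2 for d in deviations]  # [4, 1, 0, 1, 4]
variance = sum(squared) / n             # 10 / 5 = 2
print(variance)                         # 2.0
```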


To understand how spread out a set of data points can be, we need to measure the variation. The simplest measure of variation is the range. The range is simply the difference between the largest and smallest values in a set of observations. In other words, you take the maximum value and subtract the minimum value from it.

For example, let’s consider our two small datasets again:
1. The first dataset: 5, 5, 5, 5, 5
2. The second dataset: 1, 3, 5, 8, 8

To calculate the range for each dataset:
First dataset: The maximum value is 5, and the minimum value is also 5. So, the range is 5 – 5 = 0.
Second dataset: The maximum value is 8, and the minimum value is 1. So, the range is 8 – 1 = 7.

From these calculations, we see that the first dataset has a range of 0, indicating no variability since all values are the same. The second dataset has a range of 7, showing more variability because the values are more spread out.
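In code, the range calculation is a one-liner per dataset:

```python
data1 = [5, 5, 5, 5, 5]
data2 = [1, 3, 5, 8, 8]

print(max(data1) - min(data1))  # 0 (no variability at all)
print(max(data2) - min(data2))  # 7 (values are more spread out)
```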

Range of Income
Let’s consider an example of income distribution among people in India based on their age. In this dataset, the median income is ₹23,000, while the mean income is ₹37,000.

Additionally, we have the minimum and maximum values of the dataset, which are ₹0 and ₹1.25 lakh (₹125,000), respectively. When we calculate the range by subtracting the minimum value from the maximum value, we get:
• Range: ₹1.25 lakh – ₹0 = ₹1.25 lakh

This range of ₹1.25 lakh indicates a significant amount of variability in the income data. To put this into perspective, the median income is ₹23,000, and the mean income is ₹37,000.

The range of ₹1.25 lakh suggests a substantial spread and dispersion in the dataset. It highlights the wide gap between the lowest and highest incomes, reflecting a high level of variability.

There’s a lot of variability in this dataset, but the problem is that this measure of variation, the range, might be based on only one person. This is because the range is sensitive to outliers.

For instance, suppose in our dataset there’s really just one person making ₹1.25 lakh, while everyone else is around the mean, median, or mode. This measure of variation or spread can be misleading because the range is not robust; it is sensitive to extreme values. Therefore, while the range gives us an idea of the overall spread, it’s important to be aware of its limitations and the potential impact of outliers on this measure.

Disadvantage of Range
One issue with using the range as a measure of variability is that it only considers the two endpoints of the dataset—the maximum and minimum values. This means it neglects the data within the range, potentially giving a misleading sense of the dataset’s overall spread.

For example, let’s take two datasets:
1. Dataset A: 1, 2, 2, 5
2. Dataset B: 1, 5, 5, 5

For both datasets, the range is calculated as follows:
1. Range for Dataset A: 5 – 1 = 4
2. Range for Dataset B: 5 – 1 = 4

Despite having the same range, these two datasets differ significantly in their variability. Dataset A has values of 1, 2, 2, and 5, showing more variability within the data points. On the other hand, Dataset B consists of 1 and three values of 5, showing less variability within the data points.

The range fails to capture this difference because it only considers the maximum and minimum values and ignores the distribution of the data in between. Thus, while the range can provide a quick sense of the spread, it doesn’t give a complete picture of the dataset’s variability.

Another issue with using the range as a measure of variability is its sensitivity to outliers or extreme values. For example, let’s consider a dataset: 1, 2, 3, 3, 3. Here, the range is calculated as the difference between the maximum and minimum values:
• Range: 3 – 1 = 2
Now, let’s modify one of the values to an extreme outlier: 1, 2, 3, 3, 95

In this modified dataset, the range becomes:
• Range: 95 – 1 = 94

By changing just one observation from 3 to 95, the range jumps dramatically from 2 to 94. This substantial change in the range results from just one extreme value, illustrating how sensitive the range is to outliers. Even though the majority of the data points remain the same, the presence of an outlier skews the range significantly.

Thus, the range, while easy to calculate and understand, can be misleading when there are outliers in the dataset. It does not provide a reliable measure of variability because it is overly influenced by extreme values.
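The same comparison in Python shows how a single extreme value dominates the range:

```python
data = [1, 2, 3, 3, 3]
print(max(data) - min(data))  # 2

data_with_outlier = [1, 2, 3, 3, 95]
print(max(data_with_outlier) - min(data_with_outlier))  # 94 (one extreme value dominates)
```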

Inter Quartile Range
We need a measure that is more robust and insensitive to outlier points. This measure of variation is called the interquartile range (IQR). The IQR is simply the difference between the third quartile (Q3) and the first quartile (Q1).

To calculate the IQR, you follow these steps:
1. Organize your data points from lowest to highest.
2. Calculate the percentiles.
3. Identify the 25th percentile (Q1) and the 75th percentile (Q3).
4. Subtract Q1 from Q3.

The middle 50% of the data, represented by the IQR, excludes the top 25% and the bottom 25% of the data, thereby eliminating many extreme or outlying values. This makes the IQR a more reliable measure of variation, as it focuses on the central portion of the dataset and is not influenced by extreme values that could make the range misleading.

By using the IQR, we get a clearer picture of the true variability in the data, allowing us to understand better how spread out the middle 50% of our observations are.
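As a sketch, the quartiles and the IQR can be computed with numpy; note that numpy’s default interpolation for percentiles may give slightly different quartile values than the hand methods taught in textbooks. The data below are hypothetical:

```python
import numpy as np

data = [1, 2, 3, 3, 3, 5, 8, 8, 95]  # hypothetical data with one outlier

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

print(q1, q3, iqr)  # 3.0 8.0 5.0 (the outlier 95 barely affects the IQR)
```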

Let’s consider an example using income data. If we look at our summary statistics, the third quartile (Q3) is around 48,000, and the first quartile (Q1) is 8,000. The interquartile range (IQR) is calculated by subtracting Q1 from Q3:
IQR = 48,000 − 8,000 = 40,000

As a measure of variation, the IQR is significantly smaller than the range, which was 1.25 lakh. This indicates that when we focus on the middle 50% of the data, the variation is much smaller.

The IQR gives us a more accurate picture of the variability in the dataset by ignoring extreme values. It focuses on the central portion of the data, making it a more effective measure of spread or variation, especially when we suspect that our dataset has outliers. By using the IQR, we get a clearer understanding of the true variability among the majority of the observations, without the distortion caused by outliers.

Imagine you’re a teacher assessing student performance in a math exam where scores range from 0 to 100. You’ve collected a dataset with each student’s name and their respective marks.

Using statistical software, we calculated the mean and median scores for the dataset. The mean score, averaging around 75, suggests the typical mark obtained across all students. However, it’s essential to note that the median score, approximately 68, is slightly lower than the mean.

We’ve examined measures of central tendency, giving us a clear idea of where the median, mean, and mode lie within this set of observations. Additionally, we’ve examined measures of location or position, which tell us about the marks obtained by the bottom and top percentiles of this dataset using percentiles and quartiles.

However, to get a complete picture, we also need to understand how varied or spread out these marks are. When analyzing marks, it’s not only useful to know the central tendency or location measures but also to grasp the extent of variation among students’ marks. This brings us to the concept of variation or variability in a set of observations.

Measures of variation, also known as measures of spread, dispersion, or variability, all refer to the same idea: assessing how spread out a set of data points is. While the mean, median, and mode provide information about a typical set of observations, they don’t reveal much about the dispersion or spread of the data.

Understanding the spread of marks can highlight how much scores deviate from the average, indicating whether most students have similar scores or if there’s a wide range of performance levels. This is crucial for educators to identify gaps and areas needing attention.
Variation is measured through concepts like the range, variance, and standard deviation, which complement our understanding of central tendency by providing insights into the spread of data points within the marks dataset.

Let’s consider two small datasets to understand the concept of variability better. The first dataset consists of the number five repeated five times. The second dataset consists of the values one, three, five, eight, and eight.

Looking at these two sets of data, you might ask, “Which set of observations has greater variability? Which one is more spread out, and which one has more variation?” Most people would say that the second set of observations has more variability. The first set of observations has no variation because all the numbers are the same—it’s five repeated five times.

However, if we only look at measures of central tendency, we get a limited picture. For both of these datasets, the mean is five. The median, or the 50th percentile, is also five for both datasets. While the mode differs slightly, it still doesn’t provide much information about how spread out these two sets of observations are.

In essence, measures of central tendency like the mean, median, and mode tell us about the typical value in a dataset, but they don’t reveal the spread or variability of the data. To understand how varied or dispersed these sets are, we need to look at measures of variation or spread.
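A quick check with Python’s statistics module confirms that the two datasets share the same mean and median while differing only in mode:

```python
import statistics

data1 = [5, 5, 5, 5, 5]
data2 = [1, 3, 5, 8, 8]

print(statistics.mean(data1), statistics.mean(data2))      # 5 5 (identical means)
print(statistics.median(data1), statistics.median(data2))  # 5 5 (identical medians)
print(statistics.mode(data1), statistics.mode(data2))      # 5 8 (only the mode differs)
```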