Demo Example
Author

Campusπ

When analyzing distributions, we often discuss their modality: the number of peaks they exhibit. A distribution with a single prominent peak is termed unimodal. If there are two distinct peaks, it's called bimodal, and with three or more peaks, it's referred to as multimodal. Modality describes the visual appearance of peaks in a distribution, highlighting where data points cluster most prominently.

However, identifying the number of modes in a distribution isn’t the same as determining its numerical mode, which represents the most frequently occurring value. A dataset may visually show multiple peaks (bimodal or multimodal), yet numerically, it might have only one mode.

For instance, imagine a dataset graphed as having two peaks. While numerically there might be one mode—the value with the highest frequency—visually, it’s evident that the data can be separated into two distinct distributions overlaid on each other. This visual distinction underscores the presence of two modes when considering the underlying distributions independently.
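To make the distinction concrete, here is a small sketch in Python (the numbers are made up for illustration): the values cluster visually around two peaks, yet the numerical mode is a single value.

```python
import statistics
from collections import Counter

# Made-up data with two visual clusters (around 7 and around 15),
# yet only one numerical mode.
data = [5, 6, 6, 7, 7, 7, 8, 14, 15, 15, 16, 16]

numerical_mode = statistics.mode(data)  # most frequently occurring value
counts = Counter(data)

print(numerical_mode)         # 7 occurs three times, so it is the sole mode
print(sorted(counts.items())) # frequencies show the two separate clusters
```

A histogram or density plot of `data` would show two peaks even though `statistics.mode` reports only one value.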

Understanding modalities in distributions is crucial for interpreting data patterns accurately. It helps analysts recognize when data may represent multiple underlying trends or phenomena, despite numerical summaries suggesting otherwise. This distinction ensures that insights drawn from data reflect its true complexity and structure, enhancing the reliability of statistical analyses and conclusions.

We can take an example of Air Quality Index. The AQI distribution will be right-skewed because of a few cities with extremely high pollution levels. These outliers skew the mean AQI upwards, making it appear higher than the median or 50th percentile. This skewing effect demonstrates why the mean can be misleading as a summary measure of typical air quality.

Despite the skew, the AQI distribution will be unimodal. This means that most cities’ AQI values cluster around a single peak, which lies in the lower to mid-range of the AQI spectrum. Most cities enjoy moderate air quality, but those few outliers with extremely high AQI values pull the overall mean upwards. This visualization highlights the importance of considering both the central tendency and the spread of data to get a true picture of air quality across different cities.

Let's consider another example, where we surveyed schools to get the age of every individual. When we plot the density curve, we get a bimodal distribution. The reason is that the ages of the students and the teachers are mixed, which creates two modes in the distribution. So the modality of a distribution often provides good insight when interpreting a statistical analysis.

The Sharpe Ratio is the most extensively used metric in the finance world to evaluate the performance of an investment. It measures the return of an investment relative to its risk, offering a clearer perspective on the efficiency of a portfolio. The Sharpe Ratio is defined as the ratio of expected excess return (return above the risk-free rate) to volatility. By incorporating both return and volatility, the Sharpe Ratio helps investors make more informed decisions about their investments. By adjusting portfolios to maximize the Sharpe Ratio, investors can achieve a better balance between risk and return.

Key Summary

  • The Sharpe Ratio is the ratio of expected excess return to volatility. By incorporating both return and volatility, the Sharpe Ratio helps investors make more informed decisions about their investments.
  • A higher Sharpe Ratio suggests a better risk-adjusted return, while a lower value indicates that the investment may not be adequately compensating for the level of risk taken.
  • When judging a particular strategy, if the backtesting results show a Sharpe Ratio of 2, that builds confidence that it is a good strategy.
  • With two negatively correlated stocks, diversifying by combining them into a single portfolio can raise the Sharpe Ratio above what either stock achieves on its own.
  • Sharpe Ratio treats all volatility as undesirable, failing to distinguish between upside and downside risk.

A majority of investors consider returns one of the major parameters for judging a portfolio, but investing is not only about return. The Sharpe Ratio essentially indicates how much excess return an investor earns per unit of risk. A higher Sharpe Ratio suggests a better risk-adjusted return, while a lower value indicates that the investment may not be adequately compensating for the level of risk taken.

Stocks with same return

Let's consider an example of two investments, stock A and stock B, having the same return. The question naturally arises: which one is better? To answer that, we can't look at returns alone, since both stocks have the same return.

Sharpe Ratio

The figure shows the path each stock took to reach the same return. We can see that stock A took a nearly linear path, while stock B followed a much more volatile one.

It seems intuitive that stock A is better: stock B is painful to hold in a downtrend and cheerful in an uptrend, while stock A follows a steady path. There is also high uncertainty about whether stock B will recover the investment when it dips into negative territory.

Another reason stock B is worse relates to liquidity. At some point we may need our money; what happens if we need it while the stock is down 10%? We may have to sell at a loss, whereas stock A grows steadily at every point with a consistent slope. By this logic, stock A is the better investment, even though both stocks have the same returns.

To quantify this intuition, we use the Sharpe Ratio to judge which stock performs better. The Sharpe Ratio is a risk-adjusted measure of performance: the average daily return divided by the volatility. The volatility in the denominator penalizes the returns, so the higher the volatility, the lower the ratio.
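A minimal sketch of this calculation, using made-up daily return series (stock A steady, stock B volatile) and taking the risk-free rate as zero for simplicity:

```python
import statistics

def sharpe_ratio(returns, risk_free_rate=0.0):
    """Risk-adjusted performance: mean excess return divided by volatility."""
    excess = [r - risk_free_rate for r in returns]
    mean_excess = statistics.mean(excess)
    volatility = statistics.stdev(excess)  # sample standard deviation
    return mean_excess / volatility

# Hypothetical daily returns: stock A is steady, stock B swings widely.
stock_a = [0.010, 0.012, 0.009, 0.011, 0.010, 0.012, 0.009, 0.011]
stock_b = [0.050, -0.030, 0.060, -0.040, 0.050, -0.020, 0.040, -0.030]

print(round(sharpe_ratio(stock_a), 2))
print(round(sharpe_ratio(stock_b), 2))
```

Both series average roughly one percent per day, yet stock B's volatility drags its ratio far below stock A's, matching the intuition above.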

Sharpe Ratio

The Sharpe Ratios of the two stocks in our example differ, giving us a healthy metric for judging which investment is better. Stock A has a Sharpe Ratio of 2 while stock B has a Sharpe Ratio of 0.5, which validates our intuition that stock A is better. A higher Sharpe Ratio generally signifies that risk and returns have been managed properly.

Interpreting the Sharpe Ratio

A Sharpe Ratio above 1 is generally considered good, as it implies that the investment provides sufficient returns for the risk involved. A ratio above 2 is viewed as very good, while a value exceeding 3 is deemed excellent. Conversely, a ratio below 1 suggests that the risk taken may not be justified by the returns, making the investment less attractive.

Negative correlation and diversification  

Let's consider another example with two stocks A and B having the same Sharpe Ratio, which simply means both have the same return and volatility. How do we identify which one is better? As the graph shows, the two stocks are negatively correlated, so a combination of stock A and stock B gives an optimal result. Each stock has a Sharpe Ratio of 2, while the combination has a Sharpe Ratio of 5. This is why we diversify across negatively correlated stocks.
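This diversification effect can be sketched with made-up return series (not the figure's data, and again ignoring the risk-free rate): when two streams move mostly in opposite directions, averaging them cancels much of the volatility while keeping the return.

```python
import statistics

def sharpe(returns):
    """Simplified Sharpe Ratio with the risk-free rate taken as zero."""
    return statistics.mean(returns) / statistics.stdev(returns)

# Hypothetical, mostly opposite-moving return streams.
a = [0.04, -0.01, 0.05, -0.02, 0.04, -0.01]
b = [-0.01, 0.04, -0.02, 0.05, 0.00, 0.03]

# Equal-weight portfolio: average the two returns in each period.
portfolio = [(x + y) / 2 for x, y in zip(a, b)]

print(round(sharpe(a), 2), round(sharpe(b), 2), round(sharpe(portfolio), 2))
```

The portfolio's mean return equals the average of the two stocks' means, but its volatility is far smaller, so its Sharpe Ratio is several times higher than either stock's alone.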

Sharpe Ratio

Limitations and Alternatives

Sharpe Ratio assumes that returns are normally distributed, which may not always be the case in real-world financial markets. Additionally, it does not account for the potential impact of extreme market events, known as tail risks.

Sharpe Ratio treats all volatility as undesirable, failing to distinguish between upside and downside risk. Some investors prefer alternatives such as the Sortino Ratio, which focuses only on downside deviation, or the Treynor Ratio, which considers systematic risk instead of total risk.

Creating good graphics is a very important part of data visualization. Data visualization is not a science; it is an art. It is imperative to know the pitfalls: biased labelling, misleading scales, excessive visualization, and a low data-to-ink ratio. Data visualization principles should support the statistical analysis, giving visual appeal while removing bias and complexity from the data.

Key Summary

  • Biased labelling – Avoid loaded or pejorative words or phrases that might obscure the information.
  • Misleading scales, no scales or labels – Avoid misleading scales, truncated scales, or omitting scales and labels altogether.
  • Excessive visualization – Avoid excessive decoration or visual clutter.
  • Low data-to-ink ratio – The general idea behind a high data-to-ink ratio is that any graph we create for statistical purposes should convey as much information about the data as possible. A graph with a low data-to-ink ratio says little about the data while spilling a lot of ink on redundant information.
  • Unequal areas – The area occupied by any part of a graph should correspond to the magnitude of the value it represents.

Biased labelling

Biased labelling is one of the things to avoid when creating a graph for statistical purposes. Avoid loaded words or phrases that might obscure the information. In the figure, a bar is labelled "poor people" instead of "lower class". That label is pejorative; we should be as neutral as possible in our labels.

creating good graphics

Misleading scales and no scales

Another thing to avoid is misleading or truncated scales. In the example above, the scale is shortened to artificially inflate the effect size.

The frequency scale (the number of counts) has been truncated, visible where the scale jumps from zero to 300. This truncation makes it look as though the lower-class count is vastly greater than the middle- and upper-class counts, visually inflating the number of counts in the lower class.

Another problem is having no scale or labels at all. Make sure to include all appropriate scales and labels needed to convey the findings effectively.

Excessive Visualizations and Low Data to ink ratio

We should avoid excessive decoration or visual clutter, which produces a low data-to-ink ratio; we should aim for a high data-to-ink ratio. The general idea is that any graph we create for statistical purposes should convey as much information about the data as possible.

A graph with a low data-to-ink ratio says little about the data while spilling a lot of ink on redundant information. In the example, the colorings are not informative; they are unrelated to the data, and excessive visualization can overwhelm the reader.

creating good graphics

As we can see in the figure above, the lower, middle, and upper classes and the axis are labelled with different colorings. The problem is that this is totally redundant: it is excessive decoration we don't need, and it can actually be misleading, because the reader might think the colors mean something. The colors of the different bars are also unnecessary, because the bars already have labels and the reader already knows these categories belong to the same categorical variable.

When we remove the color, we get the bar graph on the left-hand side. That graph is better because it has a high data-to-ink ratio, meaning we say more about the data with less ink on the paper. It is useful to compare the two graphs: one has a relatively low data-to-ink ratio, the other a high one.

Unequal areas

Another problem that can occur in a statistical graph is unequal areas, addressed by the area principle: the area occupied by any part of a graph should correspond to the magnitude of the value it represents.

creating good graphics

We should avoid 3D visualizations, because they can easily violate the area principle. A 3D presentation can imply that certain wedges or bars are much larger than others purely by virtue of the perspective in which they are drawn.

The graph above shows a 2D example of violating the area principle. It is the same bar graph of counts for the categories of the passenger-class variable: lower, middle, and upper classes plotted against the number of counts.

However, the lower-class bar is much wider, and the width is meaningless; nothing about the width is informative. It implies many more observations in the lower-class category than actually exist. This is a violation of the area principle, because the size of the bar is not in proportion to the magnitude of the value it represents.

Stock traders look at various signals, such as GDP, the inflation rate, and interest rates, to predict future economic conditions. One indicator used by investors comes from the bond market: the yield curve. The yield curve's power to predict economic conditions has made it a crucial metric for investors and a fairly accurate early warning of upcoming market conditions such as a slowdown or recession.

Key Summary

  • The yield curve is a graph showing the relation between short-term and long-term interest rates on Treasury notes.
  • The relationship between the short-term and long-term interest rates signals market conditions. If the long-term rate is higher than the short-term rate, it signals economic expansion; the opposite signals a slowdown or recession.
  • If both rates are the same we have a flat curve; if the long-term rate is lower than the short-term rate we have an inverted yield curve.
  • An inverted yield curve generally acts as a signal of an upcoming recession.

The yield curve is a graph showing the relation between short-term and long-term interest rates on Treasury notes. A Treasury note is issued by the government and pays the public a fixed interest rate over a period of time. Usually, the short-term interest rate is lower than the long-term interest rate. A steep yield curve (when long-term rates rise significantly) suggests economic expansion, and vice versa.

Long-term and Short-term interest rate

For a longer-term Treasury note the interest rate is higher because you are lending the cash for a longer period of time and thereby taking on more risk. Long-term rates are usually higher also because, as the economy grows, investors demand more for the money they lend; the problem arises when the long-term rate fails to stay above the short-term rate. As a predictor of future economic conditions, a long-term rate above the short-term rate signals a healthy market.

Sometimes the relationship between the short-term and long-term interest rates changes. If both rates are the same we have a flat curve; if the long-term rate drops below the short-term rate we have an inverted yield curve. This is when the market gets concerned: an inverted yield curve generally acts as a signal of an upcoming recession.

Recession Time Frame

A recession does not happen immediately after the inversion, but it means one could arrive within roughly a year; the conditions the yield curve signals may take a year or two to materialize. A quick curve inversion lasting a day, a week, or even a month does not give a strong signal and is nothing to worry about, but a continuous yield curve inversion lasting months does.

To predict upcoming market conditions, we take the difference between the ten-year and three-month interest rates on Treasury notes; when the value is zero or below, we say we have a problem, since a value below zero represents an inverted yield curve. In finance it is also common to use the difference between the two-year and ten-year rates as a good predictor. Since the 1950s, every recession has been preceded by a yield curve inversion, making it one of the leading indicators of the economy.
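A minimal sketch of this rule, using hypothetical yields rather than real market data:

```python
def yield_spread(ten_year, three_month):
    """10-year minus 3-month Treasury yield; negative means inverted."""
    return ten_year - three_month

def curve_signal(spread):
    """Classify the shape of the curve from the spread, in percent."""
    if spread < 0:
        return "inverted: possible recession warning"
    if spread == 0:
        return "flat"
    return "normal: long rates above short rates"

# Hypothetical yields, in percent.
print(curve_signal(yield_spread(4.2, 3.5)))  # long rate above short rate
print(curve_signal(yield_spread(3.8, 4.5)))  # short rate above long rate
```

As the text notes, a single negative reading matters little; in practice one would watch whether the spread stays negative for months.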

In data analysis, identifying outliers with the box plot rule helps pinpoint data points that significantly deviate from the majority. Outliers represent extreme values that can skew numerical summaries like the mean, standard deviation, or variance, potentially misleading interpretations.

Skewness, closely related to outliers, refers to the asymmetry or lack of symmetry in a distribution. A skewed distribution tends to have a tail that extends more in one direction than the other. This directional emphasis indicates where outliers or extreme values are concentrated within the dataset.

Key Summary

  • Skewness in distributions reflects where outliers or extreme values tend to cluster.
  • When a distribution is symmetric, it means data points are evenly distributed around the mean, resulting in a balanced shape in histograms or density plots
  • A skewed distribution tends to have a tail that extends more in one direction than the other.
  • Left-skewed distributions have a longer tail on the left side, while right-skewed distributions have a longer tail on the right.

When a distribution is symmetric, it means data points are evenly distributed around the mean, resulting in a balanced shape in histograms or density plots. In contrast, left-skewed distributions have a longer tail on the left side, while right-skewed distributions have a longer tail on the right. These skewness patterns influence how we interpret data’s central tendency and spread, underscoring the importance of understanding distribution shapes to draw accurate conclusions from data analysis.

Skewness

Skewness in distributions reflects where outliers or extreme values tend to cluster. A symmetric distribution shows a balanced spread of values around the mean, with no prominent skewness towards either end. Here, the mean, median, and mode are typically close in value.

In contrast, a left-skewed distribution features a longer tail on the left side, where fewer but extremely low values pull the mean towards them, away from the median and mode. This skewness suggests the data is concentrated towards higher values, with outliers predominantly on the lower end.

Skewness

Conversely, a right-skewed distribution displays a longer tail on the right side, indicating more extreme high values that pull the mean in their direction, away from the median and mode. This skewness implies a dataset where outliers are mostly clustered at the higher end of the distribution. Understanding skewness helps analysts interpret data trends accurately, ensuring that summary statistics reflect the true central tendency and spread of the dataset.

In a box plot, the “box” represents the interquartile range (IQR) of the data. Inside the box, you find the median (50th percentile) of the dataset. The “whiskers” extend from the box to show the range of the data excluding outliers, often defined as 1.5 times the IQR beyond the quartiles.

A box plot, also known as a box-and-whisker plot, is an alternative to a density plot or histogram for understanding the distribution of a numerical variable. This type of plot is widely used in research to visualize data distributions.

Boxplot

In a box plot, the central box represents the interquartile range (IQR), which is a measure of statistical dispersion. This box is defined by the first quartile (25th percentile) and the third quartile (75th percentile) of the dataset. Inside the box, you’ll find a dot or line representing the median (50th percentile) of the data. This median indicates where half of the observations fall above and half below.

The whiskers extending from the box show the range of typical values in the dataset. The lower whisker extends to the smallest observation within 1.5 times the IQR below the first quartile. Similarly, the upper whisker extends to the largest observation within 1.5 times the IQR above the third quartile.
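The 1.5 × IQR fences can be computed directly. The sketch below uses made-up AQI values and the standard library's default quantile method, which may differ slightly from other software's quartile conventions:

```python
import statistics

def iqr_fences(data):
    """Return (lower fence, upper fence) using the 1.5 * IQR rule."""
    q1, _, q3 = statistics.quantiles(data, n=4)  # 25th, 50th, 75th percentiles
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical AQI values for a handful of cities; 160 is the suspect point.
aqi = [55, 60, 62, 65, 68, 70, 72, 75, 78, 80, 160]
low, high = iqr_fences(aqi)
outliers = [x for x in aqi if x < low or x > high]
print(outliers)  # only the extreme city falls outside the fences
```

Everything inside the fences would be covered by the whiskers; points outside them are drawn individually as outliers.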

Box plots are valuable because they provide a concise summary of the data distribution, emphasizing key metrics like the median and quartiles while also identifying potential outliers beyond the whiskers.

In a box plot, the central box spans the interquartile range (IQR), containing the middle fifty percent of the data. At the center of this box lies the median, marked by a dot or line, indicating where half of the observations fall above and half below.

Extending from the box are whiskers that signify the typical range of values in the dataset. The lower whisker reaches to the smallest observation that isn’t considered an outlier, typically within 1.5 times the IQR below the first quartile. Conversely, the upper whisker stretches to the largest non-outlier observation within 1.5 times the IQR above the third quartile.

Box plots are effective for quickly grasping the spread and central tendency of data, while also identifying outliers—those data points that lie beyond the whiskers. In our Air Quality Index (AQI) example, a box plot might reveal a central box indicating where the majority of cities’ AQI values cluster, with whiskers extending to the typical range of AQI levels. If there’s an outlier, such as a city with extremely high AQI values, it would be visibly marked beyond the whiskers, highlighting it as an exceptional case.

Identifying Outliers

Boxplot

In the realm of data analysis, identifying outliers is crucial because they can signal potential errors or anomalies in the dataset. Outliers are data points that significantly deviate from the majority of observations, and they play a vital role in ensuring the accuracy and reliability of our analyses.

Imagine you’re examining a dataset of Air Quality Index (AQI) values across different cities. While most cities might show AQI levels within a certain range, an outlier could represent a city with an exceptionally high AQI value. This outlier might indicate a unique environmental condition or, potentially, an error in data recording.

The reason we pay close attention to outliers is their impact on summary statistics such as the mean, standard deviation, and variance. These measures are sensitive to extreme values—if an outlier isn’t identified and appropriately handled, it could skew our understanding of the dataset’s central tendency and variability.

To manage outliers effectively, one commonly used approach is the box plot rule, which helps in visually identifying data points that lie outside the typical range of observations. By systematically detecting and addressing outliers, researchers ensure their analyses are based on accurate, representative data, thereby avoiding misleading conclusions.

In a dataset, an alternative plot apart from the histogram is a density plot. Similar to a histogram, a density plot provides a visual representation of data distribution, but it does so by smoothing out the concept of bins.

Imagine the density plot as a histogram with incredibly small bins—so small that they blend together into a smooth curve or line that fits over the data points. Instead of discrete bars, the density plot uses this continuous curve to show where values are concentrated.

Density Plot

For our dataset of Air Quality Index (AQI), a density plot would display a smooth line that indicates where most cities’ AQI values cluster. It offers a clearer view of the distribution’s shape and can reveal subtle peaks or patterns that might be missed in a histogram with fewer bins.

The height of the curve in a density plot indicates the density of observations at different AQI levels, making it easier to interpret data concentration and variability across the dataset. This approach provides a nuanced understanding of how AQI values are distributed without the potential visual limitations of traditional histograms.
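One way to picture how this smoothing works is to average a small Gaussian "bump" centered on each observation. This is only a sketch of the idea; the AQI values and the bandwidth below are made up, and real tools choose the bandwidth automatically:

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function estimating density as an average of Gaussian bumps."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2)
                          for xi in data)
    return density

# Hypothetical AQI values clustering around 70.
aqi = [55, 60, 62, 65, 68, 70, 72, 75, 78, 80]
density = gaussian_kde(aqi, bandwidth=5.0)
print(density(70) > density(100))  # the curve is taller near the cluster
```

Evaluating `density` over a grid of AQI values and plotting the results would reproduce the smooth curve described above.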

In a dataset, an alternative to the histogram is a density plot. It’s akin to a histogram with incredibly tiny bins, resulting in a smooth curve that represents data distribution. However, this smoothness can sometimes be misleading.

For the Air Quality Index (AQI) dataset, imagine a density plot showing a continuous curve that illustrates where AQI values concentrate across cities. To enhance understanding, we often include a "rug" of data points (shown in the figure above). These points are randomly jittered along a vertical axis to prevent them from overlapping, giving a clear indication of where observations lie.

The rug in an AQI density plot would reveal a cluster of data points around the central peak, indicating where most cities’ AQI values fall. However, it might also show sparse points in the tail ends of the plot. For instance, the sparsest part might represent just one observation—say, a city with exceptionally high AQI, standing alone.

The purpose of the rug is to caution interpreters not to draw conclusions hastily from sparse data points. While the central part of the density plot might be densely populated with observations, like the peak indicating moderate AQI levels across many cities, sparse areas should be approached with caution. In our example, the sparse tail might only include one outlier city with extremely high AQI values, reminding us that interpretations should account for the distribution’s varying data density.

Introduction

Let's take a hypothetical example and consider a company as a property. The value of a property can be evaluated in many ways: by its location, area, future revenue, or whether the property is residential, agricultural, or commercial. For a residential property, one parameter for gauging its value is how much rental income it can generate in the future. Valuing a property by its future rental income is similar to valuing a company by discounted cash flow, i.e. by how much cash the company can generate in the future.

Key Takeaway

  • Discounted cash flow analysis helps in valuing a company or its equity based on the company's future cash flows.
  • It relies on assumptions about the terminal value, the risk-free rate, and the free cash flow projection.

To value a company, let's imagine it has a five-year life. Say the company will generate 100 million each year for five years and wind up after that. Will the value of the company be 500 million?

Discounted Cash Flow

The answer is no, because inflation erodes the value of money; 100 million today is not worth the same as 100 million in one or two years. The actual value of the company will be less than 500 million. We can't just add up the 100 million for each year, because 100 million today is not worth the same as in year one, year two, and so on: the value of the hundred million erodes as time passes.

A hundred million received in year one is worth a little more than 100 million received in year two; 100 million received in year two is worth a little more than 100 million received in year three, and so on. The reason is that if you get 100 million today, you could invest it in a fixed deposit or buy a property, and the money would grow. Money has a time value: the sooner you get it, the more it is worth.

Let's add another assumption: the risk-free rate someone could earn on 100 million invested, say 10%. If we can get a 10% return each year, then money received in year one and beyond must be reduced to today's money, a process called discounting. We apply the formula PV = FV / (1 + r)^n to get the present value of money received in the future (given in the figure), where r is the interest rate and n is the number of periods; in our case r is 10% and n runs up to the terminal year, five.

To calculate the present value, we apply the formula: each year's 100 million is multiplied by 1/(1+r)^n. In year one, 100 million is worth about 90.9 million today; in year two about 82.6 million; in year three about 75.1 million; in year four about 68.3 million; and in year five about 62.1 million. Summing the present values of 100 million over five years gives about 379 million, not 500 million. So a hundred million a year for five years is worth about 379 million today, not 500 million.

Please note that the higher the interest rate, the lower the present value of 500 million over five years will be. If the company issues 100 million shares, the value per share is about 3.79, which is the fair share price of the company. Using this method, we can estimate the value of a share today based on future projections. We can also decide whether the current share price reflects fair value, or is higher or lower than the value of the share.
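The whole walkthrough fits in a few lines; carried to full precision, the formula gives roughly 379 million for the company and 3.79 per share:

```python
def discounted_value(cash_flows, rate):
    """Sum each year's cash flow discounted back to today's money."""
    return sum(cf / (1 + rate) ** year
               for year, cf in enumerate(cash_flows, start=1))

# 100 million per year for five years, discounted at a 10% risk-free rate.
flows = [100] * 5
value = discounted_value(flows, 0.10)
print(round(value, 1))  # about 379, well below the naive sum of 500

shares = 100  # million shares outstanding, per the example
print(round(value / shares, 2))  # implied fair price per share
```

Raising `rate` in this sketch immediately lowers `value`, which is the sensitivity to the risk-free rate described in the assumptions below.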

Assumptions

  • To clarify the first assumption: companies do not stop after five years. We can plug in an n value that gives a reliable estimate of the company's future value; the terminal value rests on the assumption that beyond n years there is no point forecasting the cash flow.
  • The second assumption is the forecast itself: projecting a cash flow of 100 million each year is hard, and if the forecast is wrong, the company's value changes.
  • The third assumption is the risk-free rate: the higher it is, the lower the company's value. The 10% figure reflects a number of things; the riskier the business, the higher the discount rate is likely to go.

Conclusion

Valuing a company is not easy; there is no foolproof method. We can only estimate it on the basis of assumptions, which can cut like a double-edged sword. It requires quite a few assumptions, it requires forecasting, and it requires coming up with the right interest rate, but it is recognized as a good technique for valuing a company or individual shares.

Histograms are incredibly useful when you have a large dataset and want to visualize the distribution of a numerical variable. Unlike stem-and-leaf plots or dot plots, which can become cluttered with too many data points, histograms group data into intervals, or bins. Each bar in a histogram represents an interval, and its height corresponds to the number of observations within that interval.

Histogram

What makes histograms versatile is their ability to adapt to datasets of any size. Whether you're working with a small sample in psychology or a massive dataset in epidemiology, histograms can effectively summarize the data while maintaining clarity. By adjusting the number of bins, you can fine-tune the level of detail in your visualization, capturing broader trends or focusing on specific nuances in the data distribution.

Key Takeaway 

  • Histograms can effectively summarize huge datasets while maintaining clarity. The histogram is also the most widely used chart for checking the normality assumption in a dataset.
  • It is helpful in spotting outliers.
  • The representation of data heavily depends on the number of intervals, or bins, chosen.

Histograms are a go-to tool in data analysis because they provide a clear, visual snapshot of how data points are spread across different ranges, making them indispensable for exploring patterns and outliers in numerical data.

Here’s a histogram of the Air Quality Index (AQI) dataset. Similar to the Dot Plot and stem-and-leaf plot, the histogram reveals insights into the distribution of AQI values.

We can see that most cities have AQI values clustered in the moderate range, indicated by the taller bars in the 60-80 AQI range. However, there’s variability across the dataset, particularly noticeable in the lower and higher ends of the AQI spectrum.

The smallest bar on the left side of the histogram represents a few cities with exceptionally good air quality, possibly with AQI values below 50. Conversely, there might be a smaller bar on the right side indicating cities with higher AQI values, potentially between 90 and 100.

This visual representation effectively highlights any outliers or extreme values in AQI across different cities, offering a quick overview of how air quality is distributed within our dataset.

Histogram with bins

When creating a histogram of the Air Quality Index (AQI) in our dataset, the representation of data heavily depends on the number of intervals, or bins, chosen. For instance, comparing a histogram with three bins to one with ten bins can drastically alter how the data distribution appears.

With fewer bins, such as three, the histogram provides a broader overview, showing general trends and broad ranges of AQI values. In contrast, using ten bins offers a more detailed, granular view, revealing nuances in AQI distribution across narrower ranges of values.

However, histograms can be misleading if not carefully constructed. If the number of bins is too low, important details like multiple peaks or variability in AQI values might be obscured. Conversely, too many bins can create a noisy or cluttered histogram that complicates interpretation.

In summary, selecting the right number of bins in a histogram is crucial for effectively visualizing and understanding the distribution of AQI data. It strikes a balance between capturing meaningful patterns and avoiding the oversimplification or overcomplication of data insights.
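As a sketch of how the bin count changes the picture, the snippet below tallies the same kind of hypothetical AQI sample into 3 equal-width bins and then into 10. The data and the function name `equal_width_counts` are illustrative assumptions, not the article's dataset.

```python
# Hypothetical AQI sample with one very clean and one very polluted city.
aqi = [42, 54, 56, 60, 62, 65, 66, 68, 71, 73, 78, 80, 95]

def equal_width_counts(data, n_bins):
    """Tally data into n_bins equal-width bins spanning the data's range."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in data:
        # Clamp the maximum value into the last bin.
        i = min(int((x - lo) / width), n_bins - 1)
        counts[i] += 1
    return counts

print(equal_width_counts(aqi, 3))   # broad overview: one dominant middle bin
print(equal_width_counts(aqi, 10))  # granular view: gaps and outliers appear
```

With 3 bins the two extreme cities blend into the outer bars, while with 10 bins they show up as isolated bars separated by empty bins, which is exactly the trade-off described above.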

A stem-and-leaf plot is used to display a numerical variable. It’s essentially a way to list our data in an organized manner. In a stem-and-leaf plot, we have “stems” and “leaves.” The numbers to the left of a vertical bar are called stems, while the digits to the right are called leaves.

All observations that share the same stem have their leaves written beside that stem. This setup allows us to reconstruct each observation in the dataset by combining the stem and the leaf.

Key Takeaway

  • A stem-and-leaf plot is used to display a numerical variable.
  • The numbers to the left of a vertical bar are called stems, while the digits to the right are called leaves. All observations that share the same stem have their leaves written beside that stem.
  • We get an instant snapshot of the minimum and maximum values.
  • This plot helps us understand the distribution of values, identify the range, and see the frequency of specific values at a glance.

Let’s create a stem-and-leaf plot for the Air Quality Index (AQI) of our cities. Each observation is split into a stem and a leaf, with a vertical bar separating the two.

For example, in our stem-and-leaf plot:

  • The numbers to the left of the vertical bar are the stems.
  • The numbers to the right of the vertical bar are the leaves.

Typically, a stem-and-leaf plot will specify what differentiates a stem from a leaf. In this case, the stem represents the tens digit of our AQI number, and the leaf represents the digit in the ones place.

Stem and Leaf Plot

Here’s a simple example for clarity:

Stem | Leaf

5  | 1 3 7

6  | 0 4 8

7  | 2 5

In this example:

  • The stem “5” with leaves “1, 3, 7” represents AQI values of 51, 53, and 57.
  • The stem “6” with leaves “0, 4, 8” represents AQI values of 60, 64, and 68.
  • The stem “7” with leaves “2, 5” represents AQI values of 72 and 75.

This plot helps us see the distribution of AQI values and allows us to easily reconstruct the actual data points from the stems and leaves.
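To make the reconstruction concrete, here is a minimal Python sketch. The dictionary mirrors the small example above (stems 5 through 7), and each observation is rebuilt as stem × 10 + leaf:

```python
# Stems mapped to their leaves, as read off the example plot.
plot = {5: [1, 3, 7], 6: [0, 4, 8], 7: [2, 5]}

# Rebuild every observation: the stem is the tens digit, the leaf the ones digit.
values = [stem * 10 + leaf
          for stem, leaves in sorted(plot.items())
          for leaf in leaves]

print(values)  # [51, 53, 57, 60, 64, 68, 72, 75]
```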

In our dataset, we have integer values for the Air Quality Index (AQI) such as 54, 56, 78, and 80, without any decimal places. Using a stem-and-leaf plot, we can quickly visualize this data.

For instance:

  • The lowest AQI in our dataset is 54, represented by a stem of 5 and a leaf of 4.
  • The highest AQI in our dataset is 80, represented by a stem of 8 and a leaf of 0.

Here’s what our stem-and-leaf plot might look like. We can also see it in the above figure.

Stem | Leaf

5  | 4 6

6  | 0 2 5

7  | 1 3 8

8  | 0

In this plot:

  • The stem “5” with leaves “4, 6” represents AQI values of 54 and 56.
  • The stem “6” with leaves “0, 2, 5” represents AQI values of 60, 62, and 65.
  • The stem “7” with leaves “1, 3, 8” represents AQI values of 71, 73, and 78.
  • The stem “8” with leaf “0” represents an AQI value of 80.

From this stem-and-leaf plot, we get an instant snapshot of the minimum and maximum AQI values. We can also see how many cities fall into specific AQI ranges. For example, if the stem 7 had five leaves of 5 written beside it, it would indicate that five cities have an AQI of 75.

This plot helps us understand the distribution of AQI values, identify the range, and see the frequency of specific values at a glance.
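The construction can also run in the other direction. Here is a short Python sketch, using the same AQI values as the plot above, that derives the stems and leaves from the raw data (the function name `stem_and_leaf` is ours):

```python
from collections import defaultdict

# The AQI values behind the plot shown above.
aqi = [54, 56, 60, 62, 65, 71, 73, 78, 80]

def stem_and_leaf(data):
    """Return the rows of a stem-and-leaf plot as strings, e.g. '5 | 4 6'."""
    stems = defaultdict(list)
    for x in sorted(data):
        stems[x // 10].append(x % 10)  # tens digit is the stem, ones digit the leaf
    return ["%d | %s" % (stem, " ".join(map(str, leaves)))
            for stem, leaves in sorted(stems.items())]

for row in stem_and_leaf(aqi):
    print(row)
```

Sorting the data first ensures the leaves appear in increasing order beside each stem, which is the conventional presentation.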

Problem with Stem and Leaf plot

One thing to note about a stem-and-leaf plot, similar to a dot plot, is that it works best for small datasets, since every observation is represented individually. For our dataset of cities, the number of observations is manageable, making the stem-and-leaf plot quite useful.

However, for larger datasets, such as a million observations, ten thousand, or even the 500 or so respondents common in survey research in the social sciences and humanities, a stem-and-leaf plot can become messy and difficult to interpret. The visual clutter from so many data points overwhelms the plot.

In such cases, it’s more practical to use other methods or to subset your sample to a smaller set of observations. For example, in psychology, datasets are often smaller, making a stem-and-leaf plot more feasible and helpful.

So, while stem-and-leaf plots are great for small datasets, their utility diminishes as the dataset grows larger. For larger datasets, consider other visualization techniques or focus on smaller subsets to keep the plot clear and informative.