Demo Example
Author

Campusπ


Creating good graphics is an important part of data visualization. Data visualization is not a science; it is an art. It is imperative to know the common pitfalls: biased labelling, misleading scales, excessive visualization, and a low data-to-ink ratio. Good visualization principles should work together with the statistical analysis: they add visual appeal while removing bias and unnecessary complexity from the presentation of the data.

Key Summary

  • Biased labelling – Avoid loaded or pejorative words and phrases that might obscure the information.
  • Misleading scales, no scale or labels – Avoid misleading scales, truncated scales, and missing scales or labels.
  • Excessive visualizations – Avoid excessive decoration or visual clutter.
  • Low data-to-ink ratio – A graph created for statistical purposes should convey as much information about the data as possible for the ink it uses. A graph with a low data-to-ink ratio says little about the data while spending a lot of ink on redundant information.
  • Unequal areas – The area occupied by any part of a graph should correspond to the magnitude of the value it represents.

Biased labelling

Biased labelling is one of the things to avoid when creating a graph for statistical purposes. Loaded words or phrases that might obscure the information should be avoided. As we can see in the figure, a bar is labelled “poor people” instead of the neutral “lower class” category. That label is pejorative; labels should be as neutral as possible.

[Figure: creating good graphics]

Misleading scales and no scales

Another thing to avoid is misleading or truncated scales. In the example above, the scale is truncated, which artificially inflates the effect size.

Here the frequency (count) axis has been truncated: the scale jumps from zero to 300. This truncation makes the lower-class count look vastly larger than the middle- and upper-class counts. It visually inflates the count for the lower class.
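The effect of a truncated axis can be put into numbers. The sketch below uses hypothetical counts (they are not taken from the figure) and compares the true ratio between two bars with the ratio a reader sees when the axis starts at 300 instead of zero.

```python
# Hypothetical counts per passenger class (illustrative only, not the article's data).
counts = {"lower": 700, "middle": 400, "upper": 350}

def apparent_ratio(a, b, axis_start=0):
    """Ratio of the drawn bar heights when the axis begins at axis_start."""
    return (a - axis_start) / (b - axis_start)

# With the axis starting at zero, drawn heights match the data.
true_ratio = apparent_ratio(counts["lower"], counts["middle"])
# With the axis starting at 300, only the part above 300 is drawn.
truncated_ratio = apparent_ratio(counts["lower"], counts["middle"], axis_start=300)

print(f"True ratio lower/middle:        {true_ratio:.2f}")
print(f"Apparent ratio, axis from 300:  {truncated_ratio:.2f}")
```

With these assumed counts, the lower-class bar is 1.75 times the middle-class bar in the data, but it is drawn 4 times as tall once the axis is truncated.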

Another problem is having no scale or labels at all. Make sure to include all appropriate scales and labels so that the graph conveys the findings effectively.

Excessive Visualizations and Low Data to ink ratio

We should avoid excessive decoration or visual clutter, which results in a low data-to-ink ratio. We should aim for a high data-to-ink ratio instead: any graph created for statistical purposes should convey as much information about the data as possible with as little ink as necessary.

A graph with a low data-to-ink ratio doesn’t say much about the data; it spends a lot of ink on redundant information. In the example, the coloring is not informative. It is unrelated to the data, and this kind of excessive visualization can overwhelm the reader.

[Figure: creating good graphics]

As we can see in the figure above, the lower, middle, and upper classes are each drawn in a different color, and the axis labels are colored as well. This is totally redundant: it is excessive decoration that we don’t need, and it can even be misleading, because the reader might think the colors encode something else. The different bar colors are also unnecessary because the bars already have different labels; the reader already knows that these categories belong to the same categorical variable.

When we remove the color, we get the bar graph on the left-hand side. That graph is better because it has a higher data-to-ink ratio: it says more about the data with less ink on the page. It is useful to compare the two graphs: one has a relatively low data-to-ink ratio, the other a high one.

Unequal areas

Another problem that can occur when creating a statistical graph is unequal areas, which violates what is called the area principle. The principle is that the area occupied by any part of a graph should correspond to the magnitude of the value it represents.

[Figure: creating good graphics]

We should also avoid 3D visualizations, because they can easily violate the area principle. A 3D rendering can imply that certain wedges or bars are much larger than others simply because of the perspective in which they are drawn.

The graph above shows a 2D example of violating the area principle. It is the same bar graph of counts for the categories of the passenger class variable: lower, middle, and upper classes on one axis and the number of counts on the other.

However, the bar for the lower class is much wider than the others, which is meaningless: nothing about the width is informative. It implies many more observations in the lower category than actually exist. This violates the area principle because the size of the bar is not in proportion to the magnitude of the value it represents.
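A small sketch of why the extra width misleads. The counts and bar widths below are hypothetical; the point is only that readers perceive area (height times width), so an uninformative width changes the share a bar appears to represent.

```python
# Hypothetical counts and bar widths (illustrative only).
counts = {"lower": 700, "middle": 400, "upper": 350}
distorted_widths = {"lower": 3.0, "middle": 1.0, "upper": 1.0}

def bar_areas(heights, widths):
    """Area of each bar: height (the data value) times drawn width."""
    return {name: heights[name] * widths[name] for name in heights}

def shares(values):
    """Fraction of the total that each entry accounts for."""
    total = sum(values.values())
    return {name: value / total for name, value in values.items()}

true_share = shares(counts)                                  # what the data says
drawn_share = shares(bar_areas(counts, distorted_widths))    # what the eye sees

for name in counts:
    print(f"{name:>6}: data {true_share[name]:.0%}, drawn area {drawn_share[name]:.0%}")
```

Under these assumptions the lower class holds about 48% of the observations but about 74% of the drawn area.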

Stock traders watch various signals, such as GDP, the inflation rate, and interest rates, to predict future economic conditions. One such indicator comes from the bond market: the yield curve. Its power to predict economic conditions has made it a crucial metric for investors. It has historically been a reliable early warning of upcoming market conditions such as a slowdown or recession.

Key Summary

  • The yield curve is a graph showing the relationship between the short-term and long-term interest rates of Treasury notes.
  • The relationship between the short-term and long-term interest rates indicates market conditions. If the long-term rate is higher than the short-term rate, it signals economic expansion; the opposite signals a slowdown or recession.
  • If both rates are the same, we have a flat curve; if the long-term rate is lower than the short-term rate, we have an inverted yield curve.
  • An inverted yield curve is generally taken as a signal of an upcoming recession.

The yield curve is a graph showing the relationship between the short-term and long-term interest rates of Treasury notes. A Treasury note is a security issued by the government that pays the public a fixed interest rate over a set period of time. Usually, the short-term interest rate is lower than the long-term interest rate. A steep yield curve (long-term rates well above short-term rates) suggests economic expansion, and a flattening curve suggests the opposite.

Long-term and Short-term interest rate

For a longer-term Treasury note the interest rate is higher because you are lending your cash for a longer period of time and therefore taking more risk. Long-term rates are also usually higher because, when the economy is expected to grow, investors demand a higher return on the money they lend. The problem arises when the long-term outlook no longer supports a higher interest rate. As a predictor of future economic conditions, a long-term rate that is higher than the short-term rate signals a healthy market.

Sometimes the relationship between the short-term and long-term interest rates changes. If both are the same, we have a flat curve; if the long-term rate is lower than the short-term rate, we have an inverted yield curve. This is when the market becomes concerned: an inverted yield curve is generally taken as a signal of an upcoming recession.

Recession Time Frame

A recession does not follow immediately after an inversion; rather, the inversion suggests that one could arrive within roughly the next year or two. A brief inversion lasting a day, a week, or even a month is not a strong signal and is nothing to worry about, but an inversion that persists for months is.

To gauge upcoming market conditions, we take the difference between the 10-year and three-month interest rates of Treasury notes (long-term minus short-term). If the value is zero or less than zero, we have a problem: a value below zero represents an inverted yield curve. In finance it is also common to use the difference between the 10-year and two-year rates as a predictor. Since 1950, yield curve inversions have preceded nearly every recession, which is why the yield curve is regarded as one of the leading indicators of the economy.
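The spread rule described above can be sketched in a few lines. The rates and the monthly spread series are made-up numbers for illustration, and the function names are hypothetical; the second helper reflects the point that only a sustained inversion is a strong signal.

```python
def classify_spread(long_rate, short_rate, tolerance=0.0):
    """Classify the curve from the long-minus-short spread (in percentage points)."""
    spread = long_rate - short_rate
    if spread < -tolerance:
        return "inverted"
    if spread <= tolerance:
        return "flat"
    return "normal"

def longest_inversion_run(spreads):
    """Length of the longest run of consecutive negative spreads."""
    longest = current = 0
    for spread in spreads:
        current = current + 1 if spread < 0 else 0
        longest = max(longest, current)
    return longest

# Hypothetical 10-year minus 3-month spreads, one value per month.
monthly_spreads = [1.2, 0.8, 0.3, -0.1, 0.2, -0.2, -0.4, -0.5, -0.3, 0.1]

print(classify_spread(long_rate=4.5, short_rate=3.0))   # long-term above short-term
print(classify_spread(long_rate=3.8, short_rate=4.6))   # long-term below short-term
print(longest_inversion_run(monthly_spreads), "consecutive inverted months")
```

How many consecutive months count as "sustained" is a judgment call that the article does not pin down, so the sketch only reports the run length.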

In data analysis, identifying outliers with the box plot rule helps pinpoint data points that significantly deviate from the majority. Outliers represent extreme values that can skew numerical summaries like the mean, standard deviation, or variance, potentially misleading interpretations.

Skewness, closely related to outliers, refers to the asymmetry or lack of symmetry in a distribution. A skewed distribution tends to have a tail that extends more in one direction than the other. This directional emphasis indicates where outliers or extreme values are concentrated within the dataset.

Key Summary

  • Skewness in distributions reflects where outliers or extreme values tend to cluster.
  • When a distribution is symmetric, it means data points are evenly distributed around the mean, resulting in a balanced shape in histograms or density plots.
  • A skewed distribution tends to have a tail that extends more in one direction than the other.
  • Left-skewed distributions have a longer tail on the left side, while right-skewed distributions have a longer tail on the right.

When a distribution is symmetric, it means data points are evenly distributed around the mean, resulting in a balanced shape in histograms or density plots. In contrast, left-skewed distributions have a longer tail on the left side, while right-skewed distributions have a longer tail on the right. These skewness patterns influence how we interpret data’s central tendency and spread, underscoring the importance of understanding distribution shapes to draw accurate conclusions from data analysis.

Skewness

Skewness in distributions reflects where outliers or extreme values tend to cluster. A symmetric distribution shows a balanced spread of values around the mean, with no prominent skewness towards either end. Here, the mean, median, and mode are typically close in value.

In contrast, a left-skewed distribution features a longer tail on the left side, where fewer but extremely low values pull the mean towards them, away from the median and mode. This skewness suggests the data is concentrated towards higher values, with outliers predominantly on the lower end.


Conversely, a right-skewed distribution displays a longer tail on the right side, indicating more extreme high values that pull the mean in their direction, away from the median and mode. This skewness implies a dataset where outliers are mostly clustered at the higher end of the distribution. Understanding skewness helps analysts interpret data trends accurately, ensuring that summary statistics reflect the true central tendency and spread of the dataset.
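The pull of the tail on the mean can be checked numerically. The sketch below uses small made-up samples and a hand-written moment coefficient of skewness (the mean of cubed standardized deviations); this is one common definition, chosen here for illustration rather than taken from the article.

```python
import statistics

def sample_skewness(values):
    """Moment coefficient of skewness: mean of cubed z-scores (population form)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((x - mean) / sd) ** 3 for x in values) / len(values)

# Made-up samples for illustration.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]   # one extreme high value
left_skewed = [-x for x in right_skewed]         # mirror image: one extreme low value
symmetric = [1, 2, 3, 4, 5, 6, 7]

for name, data in [("right", right_skewed), ("left", left_skewed), ("symmetric", symmetric)]:
    print(f"{name:>9}: mean={statistics.fmean(data):.2f} "
          f"median={statistics.median(data):.2f} "
          f"skewness={sample_skewness(data):+.2f}")
```

In the right-skewed sample the mean sits above the median, in the left-skewed sample below it, and in the symmetric sample the two coincide.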

In a box plot, the “box” represents the interquartile range (IQR) of the data. Inside the box, you find the median (50th percentile) of the dataset. The “whiskers” extend from the box to show the range of the data excluding outliers, which are often defined as points lying more than 1.5 times the IQR beyond the quartiles.

A box plot, also known as a box-and-whisker plot, is an alternative to a density plot or histogram for understanding the distribution of a numerical variable. This type of plot is widely used in research to visualize data distributions.

Boxplot

In a box plot, the central box represents the interquartile range (IQR), which is a measure of statistical dispersion. This box is defined by the first quartile (25th percentile) and the third quartile (75th percentile) of the dataset. Inside the box, you’ll find a dot or line representing the median (50th percentile) of the data. This median indicates where half of the observations fall above and half below.

The whiskers extending from the box show the range of typical values in the dataset. The lower whisker extends to the smallest observation within 1.5 times the IQR below the first quartile. Similarly, the upper whisker extends to the largest observation within 1.5 times the IQR above the third quartile.

Box plots are valuable because they provide a concise summary of the data distribution, emphasizing key metrics like the median and quartiles while also identifying potential outliers beyond the whiskers.

In a box plot, the central box spans the interquartile range (IQR), containing the middle fifty percent of the data. At the center of this box lies the median, marked by a dot or line, indicating where half of the observations fall above and half below.

Extending from the box are whiskers that signify the typical range of values in the dataset. The lower whisker reaches to the smallest observation that isn’t considered an outlier, typically within 1.5 times the IQR below the first quartile. Conversely, the upper whisker stretches to the largest non-outlier observation within 1.5 times the IQR above the third quartile.

Box plots are effective for quickly grasping the spread and central tendency of data, while also identifying outliers—those data points that lie beyond the whiskers. In our Air Quality Index (AQI) example, a box plot might reveal a central box indicating where the majority of cities’ AQI values cluster, with whiskers extending to the typical range of AQI levels. If there’s an outlier, such as a city with extremely high AQI values, it would be visibly marked beyond the whiskers, highlighting it as an exceptional case.
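The quantities a box plot draws can be computed directly. This is a minimal sketch using made-up AQI values; quartile conventions differ between tools, and the "inclusive" method of Python's statistics.quantiles is just one reasonable choice.

```python
import statistics

def box_plot_summary(values):
    """Quartiles, 1.5 * IQR fences, whisker ends, and outliers for a sample."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        # Whiskers end at the most extreme observations still inside the fences.
        "whisker_low": min(inside), "whisker_high": max(inside),
        "outliers": [v for v in values if v < lower_fence or v > upper_fence],
    }

# Hypothetical AQI readings for ten cities (illustrative only).
aqi = [42, 55, 61, 68, 72, 75, 80, 88, 95, 310]
summary = box_plot_summary(aqi)
for key, value in summary.items():
    print(f"{key:>12}: {value}")
```

With these assumed readings the whiskers end at 42 and 95, and the city at 310 falls beyond the upper fence, so it would be drawn as an individual outlier point.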

Identifying Outliers

Boxplot

In the realm of data analysis, identifying outliers is crucial because they can signal potential errors or anomalies in the dataset. Outliers are data points that significantly deviate from the majority of observations, and they play a vital role in ensuring the accuracy and reliability of our analyses.

Imagine you’re examining a dataset of Air Quality Index (AQI) values across different cities. While most cities might show AQI levels within a certain range, an outlier could represent a city with an exceptionally high AQI value. This outlier might indicate a unique environmental condition or, potentially, an error in data recording.

The reason we pay close attention to outliers is their impact on summary statistics such as the mean, standard deviation, and variance. These measures are sensitive to extreme values—if an outlier isn’t identified and appropriately handled, it could skew our understanding of the dataset’s central tendency and variability.
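To see how sensitive these summaries are, compare them with and without a single extreme value. The AQI readings below are hypothetical.

```python
import statistics

# Hypothetical AQI readings; 310 plays the role of the outlier city.
aqi = [42, 55, 61, 68, 72, 75, 80, 88, 95, 310]
without_outlier = [v for v in aqi if v != 310]

def describe(values):
    """Mean, sample standard deviation, and median of a sample."""
    return {
        "mean": statistics.fmean(values),
        "stdev": statistics.stdev(values),
        "median": statistics.median(values),
    }

with_stats = describe(aqi)
without_stats = describe(without_outlier)

for key in ("mean", "stdev", "median"):
    print(f"{key:>6}: with outlier {with_stats[key]:7.2f} | without {without_stats[key]:7.2f}")
```

One value moves the mean by more than 20 points and multiplies the standard deviation several times over, while the median barely changes.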

To manage outliers effectively, one commonly used approach is the box plot rule, which helps in visually identifying data points that lie outside the typical range of observations. By systematically detecting and addressing outliers, researchers ensure their analyses are based on accurate, representative data, thereby avoiding misleading conclusions.

In a dataset, an alternative plot apart from the histogram is a density plot. Similar to a histogram, a density plot provides a visual representation of data distribution, but it does so by smoothing out the concept of bins.

Imagine the density plot as a histogram with incredibly small bins—so small that they blend together into a smooth curve or line that fits over the data points. Instead of discrete bars, the density plot uses this continuous curve to show where values are concentrated.

Density Plot

For our dataset of Air Quality Index (AQI), a density plot would display a smooth line that indicates where most cities’ AQI values cluster. It offers a clearer view of the distribution’s shape and can reveal subtle peaks or patterns that might be missed in a histogram with fewer bins.

The height of the curve in a density plot indicates the density of observations at different AQI levels, making it easier to interpret data concentration and variability across the dataset. This approach provides a nuanced understanding of how AQI values are distributed without the potential visual limitations of traditional histograms.
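One common way to obtain such a smooth curve is a Gaussian kernel density estimate: a small bell curve is centered on every observation and the curves are averaged. The sketch below is a hand-rolled version with made-up AQI values and an arbitrarily chosen bandwidth; plotting libraries choose the bandwidth automatically and may use different kernels.

```python
import math

def gaussian_kde(values, bandwidth):
    """Return a function giving the estimated density at a point x."""
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Sum of Gaussian bumps, one centered on each observation.
        return norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)

    return density

# Hypothetical AQI readings and an assumed bandwidth of 15 AQI points.
aqi = [42, 55, 61, 68, 72, 75, 80, 88, 95, 310]
density = gaussian_kde(aqi, bandwidth=15.0)

for x in (50, 72, 150, 200, 310):
    print(f"density at AQI {x:>3}: {density(x):.5f}")

# The curve should enclose an area of about 1 (simple Riemann sum, step 1).
total_area = sum(density(x) for x in range(-100, 501))
print(f"approximate area under curve: {total_area:.4f}")
```

The curve is highest where the readings cluster and nearly flat between the main group and the lone high value, which gets only a small bump of its own.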

In a dataset, an alternative to the histogram is a density plot. It’s akin to a histogram with incredibly tiny bins, resulting in a smooth curve that represents data distribution. However, this smoothness can sometimes be misleading.

For the Air Quality Index (AQI) dataset, imagine a density plot showing a continuous curve that illustrates where AQI values concentrate across cities. To enhance understanding, we often include a “rug” of data points (shown in the above figure). These points are randomly jittered along the vertical axis to prevent them from overlapping, giving a clear indication of where observations lie.

The rug in an AQI density plot would reveal a cluster of data points around the central peak, indicating where most cities’ AQI values fall. However, it might also show sparse points in the tail ends of the plot. For instance, the sparsest part might represent just one observation—say, a city with exceptionally high AQI, standing alone.

The purpose of the rug is to caution interpreters not to draw conclusions hastily from sparse data points. While the central part of the density plot might be densely populated with observations, like the peak indicating moderate AQI levels across many cities, sparse areas should be approached with caution. In our example, the sparse tail might only include one outlier city with extremely high AQI values, reminding us that interpretations should account for the distribution’s varying data density.

Introduction

Let’s take a hypothetical example and think of a company as a property. The value of a property can be assessed in many ways: by its location, its area, its future revenue, or whether it is residential, agricultural, or commercial. For a residential property, one way to gauge its value is how much rental income it can generate in the future. Valuing a property by its future rental income is analogous to valuing a company by discounted cash flow, that is, by how much cash the company can generate in the future.

Key Takeaway

  • Discounted cash flow analysis helps in valuing a company or its equity based on the company’s future cash flows.
  • It relies on assumptions about the terminal value, the risk-free rate, and the free cash flow projection.

To value a company, let’s imagine it has a five-year life. Suppose the company generates 100 million each year for five years and winds up after that. Is the value of the company 500 million?

Discounted Cash Flow

The answer is no, because inflation erodes the value of money: 100 million today is not worth the same as 100 million in a year or two. The actual value of the company is less than 500 million. We can’t simply add up the five payments of 100 million, because the value of each payment erodes the further in the future it is received.

A hundred million received in year one is worth a little more than 100 million received in year two; 100 million received in year two is worth a little more than 100 million received in year three, and so on. The reason is that if you receive 100 million today, you can invest it, for example in a fixed deposit or in property, and its value will grow. Money has a time value: the sooner you receive it, the more it is worth.

Let’s make another assumption: the risk-free rate that someone can earn on 100 million invested. Let’s take it to be 10%. If money can earn a 10% return each year, then money received in year one and later must be reduced to its value in today’s money; this is called discounting. To get the present value of money received in the future, we apply the formula PV = CF / (1 + r)^n (given in the figure), where CF is the cash received, r is the interest rate, and n is the number of periods until it is received. In our case r is 10% and n runs from one to five, the final year of the forecast.

To calculate the present value, we apply the formula. Working in millions, 100 received in year one is worth about 90.9 today, in year two about 82.6, in year three about 75.1, in year four about 68.3, and in year five about 62.1. We get each value by multiplying the 100 million by 1/(1+r) to the power n. Summing the present values for the five years gives about 379 million, not 500 million. So 100 million a year for five years is worth about 379 million today, not 500 million.

Please note that the higher the interest rate, the lower the present value of the 500 million received over five years. If the company issues 100 million shares, the value per share is about 3.79, which is the fair share price under these assumptions. Using this method, we can estimate the value of a share today based on the future projection. We can also judge whether the current share price reflects this value or is higher or lower than it.
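The worked example can be reproduced with a short script. The function names are hypothetical; the inputs (100 million a year for five years, a 10% rate, 100 million shares) are the ones used in the text, with amounts expressed in millions.

```python
def present_value(cash_flow, rate, year):
    """Discount a single cash flow received `year` years from now."""
    return cash_flow / (1 + rate) ** year

def discounted_cash_flow(cash_flows, rate):
    """Sum of present values; the first cash flow arrives at the end of year 1."""
    return sum(present_value(cf, rate, year)
               for year, cf in enumerate(cash_flows, start=1))

cash_flows = [100.0] * 5      # millions per year, for five years
rate = 0.10                   # assumed discount rate from the example
shares_outstanding = 100.0    # millions of shares

for year, cf in enumerate(cash_flows, start=1):
    print(f"year {year}: {present_value(cf, rate, year):6.2f}")

company_value = discounted_cash_flow(cash_flows, rate)
value_per_share = company_value / shares_outstanding
print(f"company value:   {company_value:.2f} million")
print(f"value per share: {value_per_share:.2f}")
```

Rounding each discount factor to two decimals before summing gives a slightly lower total (about 378), which is why hand calculations and the script can differ by a million or so.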

Assumptions

  • To clarify the first assumption: companies do not stop after five years. We can plug in a suitable value of n to get a more reliable estimate of the value of a company. The terminal value can be based on the assumption that after n years there is no point in forecasting further cash flows.
  • The second assumption is the forecast cash flow of 100 million each year, which is hard to get right; if the forecast is wrong, the company value will change.
  • The third assumption is the risk-free rate used for discounting: the higher the rate, the lower the value of the company. The 10% reflects a number of things; the riskier the business, the higher the rate that should be applied.
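The sensitivity to the rate assumption can be illustrated by repeating the five-year example at a few different rates. The 5% and 15% rates are arbitrary comparison points chosen for this sketch, not figures from the article.

```python
def discounted_cash_flow(cash_flows, rate):
    """Sum of present values; the first cash flow arrives at the end of year 1."""
    return sum(cf / (1 + rate) ** year
               for year, cf in enumerate(cash_flows, start=1))

cash_flows = [100.0] * 5   # millions per year, as in the example above

# Value of the same cash flows under different assumed discount rates.
values_by_rate = {rate: discounted_cash_flow(cash_flows, rate)
                  for rate in (0.05, 0.10, 0.15)}

for rate, value in values_by_rate.items():
    print(f"rate {rate:.0%}: company value {value:.1f} million")
```

The same forecast is worth roughly 433 million at 5% and roughly 335 million at 15%, so the choice of rate moves the valuation substantially.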

Conclusion

Valuing a company is not easy, and there is no foolproof method. We can only estimate it on the basis of assumptions, which can work like a double-edged sword. The method requires quite a few assumptions: it requires forecasting cash flows and choosing the right interest rate. Still, it is recognized as a good technique for valuing a company or its individual shares.