Demo Example
Author: Campusπ


Unlike the mean and median, which represent average and middle values respectively, the mode identifies the most frequently occurring value in a dataset. Calculating the mode is straightforward. You first list all values in the dataset, and then identify which value appears most frequently. This value is considered the mode.

In practice, statistical software is often used to calculate the mode, especially when dealing with large datasets. Just as we rely on software for calculating the mean and median efficiently, using statistical tools ensures accuracy and saves time in identifying the mode, particularly in datasets with numerous observations.
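As a small sketch of how software computes the mode, here is an example using Python's standard `statistics` module (the data values are illustrative, not from the survey):

```python
from statistics import mode, multimode

# Illustrative ratings; the value 3 appears most often
ratings = [1, 2, 2, 3, 3, 3, 4]

print(mode(ratings))       # the single most frequent value
print(multimode(ratings))  # all modes, useful when there are ties
```

`multimode` is handy because a dataset can have more than one mode; it returns every value that ties for the highest frequency.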

The mode offers valuable insights into the typical or predominant value within a dataset. It’s particularly useful in scenarios where identifying the most common occurrence is important, such as in analyzing consumer preferences, exam scores, or product sales data.

Advantages and Disadvantages
One of the key advantages of the mode is its resilience against extreme values, similar to the median. Unlike the mean, which can be heavily influenced by outliers, the mode remains relatively unaffected. This makes it a robust measure when dealing with datasets that contain extreme values or skewed distributions.

Additionally, the mode has intuitive appeal. It represents the value that appears most frequently in a dataset, aligning with our natural expectation that the center of a distribution should be represented by its most common value.

However, the mode does have limitations. Like the median, it can be challenging to incorporate into more complex statistical models. Statistical techniques and models often rely on the mean due to its mathematical properties and widespread applicability. The mode, while intuitive, is less commonly used in advanced statistical analyses.

Another drawback of the mode arises when dealing with continuous numerical variables. In cases where variables can take on a wide range of values with small differences (like 1.1, 1.2, 1.3 up to 100.1), determining a single mode can be ambiguous or meaningless. The mode is more suited to discrete numerical variables where values are distinct and identifiable.
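We can see this drawback directly in code: when every value in a continuous-style dataset is distinct, every value is technically "a mode", so the result carries no information (values are illustrative):

```python
from statistics import multimode

# Continuous-style measurements where no value repeats
measurements = [1.1, 1.2, 1.3, 2.7, 5.4]

# Every value ties for "most frequent", so the mode is meaningless here
print(multimode(measurements))
```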

Mode of survey example

Let’s dive into our “Key to a great career in data science” dataset and determine the mode for each of the variables: grades, networking, communication skills, and projects. This example is commonly used to explain the concepts of mean, median, and mode; if you have already read the articles on the mean or median of this dataset, the setup will be familiar, or else you can read them here.

Firstly, for the grade variable, we organize the rankings from least to most common. Students rated grades as follows: 1, 2, 3, and 4. Among these, the most common rating was 1, chosen by 133 students. This makes grades the top-ranked attribute, reflecting its perceived importance. Therefore, the mode for the grade variable is 1.

Moving to networking, the mode is 4. This indicates that among the students surveyed, networking was most frequently ranked as the least important attribute for a successful career in data science.

For communication skills, the mode is 3, suggesting that it was commonly ranked in the middle among the surveyed students in terms of importance.

Lastly, projects also have a mode of 1, indicating that a significant number of students considered practical projects as highly important for a successful career in data science, similar to grades.

This table of modes provides a snapshot of how students prioritize these attributes. It highlights the attributes that are consistently perceived as crucial versus those that are less emphasized according to the surveyed population.

If we look at the mode, we can compare it with the median and the mean. By the mean, the most important quality was grades; by the median, the most important quality was either grades or projects, since both tie with a 50th percentile of 2. By the mode, the least important factor is networking, and it also confirms that grades are the number one factor in deciding a better career in data science. Across all three measures of central tendency, we arrive at a similar finding.

We’ve previously discussed the mean, or average, as a measure of central tendency; now let’s explore another important measure: the median. This measure provides a different perspective on what constitutes a typical observation within a dataset. The median is often referred to as the 50th percentile. It represents the middle number in a dataset when the observations are arranged in ascending or descending order. This middle point divides the dataset into two equal halves, with half of the observations lying below and half above the median. Calculating the median depends on whether the dataset contains an odd or even number of observations:
• If there is an odd number of observations, the median is simply the middle number.
• If there is an even number of observations, the median is the average of the two middle numbers.

For instance, if we have 7 observations, the median would be the 4th observation when arranged in order. If we have 8 observations, the median would be the average of the 4th and 5th observations. The median provides a robust measure of central tendency, particularly useful when dealing with skewed distributions or datasets containing outliers. Unlike the mean, which can be influenced by extreme values, the median remains relatively unaffected by outliers, making it a valuable tool for understanding the typical value or position within a dataset.

Let’s explore the concept of the median further with a practical example. Suppose we have a set of observations: 13, 27, 44, 44, 49, 53, 56, 99, and 100. To find the median, we first arrange these numbers in ascending order: 13, 27, 44, 44, 49, 53, 56, 99, 100. Since we have an odd number of observations (9 in total), the median is simply the middle number, which is 49. This means that half of the observations are below 49 (13, 27, 44, 44) and half are above it (53, 56, 99, 100).

Now, let’s consider an example with an even number of observations: 13, 27, 44, 44, 49, 53, 56, 99. Again, we arrange these numbers in ascending order: 13, 27, 44, 44, 49, 53, 56, 99.

Here, with 8 observations, the two middle numbers are 44 and 49. To find the median, we take the average of these two middle numbers: Median= (44+49)/2 = 46.5. In both cases, whether odd or even, calculating the median involves ordering the observations and identifying the middle value(s). This approach ensures a clear and consistent method for determining the median in any dataset.
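The two worked examples above can be checked with Python's standard `statistics` module:

```python
from statistics import median

odd_data = [13, 27, 44, 44, 49, 53, 56, 99, 100]  # 9 observations
even_data = [13, 27, 44, 44, 49, 53, 56, 99]      # 8 observations

print(median(odd_data))   # middle value of the 9 ordered observations
print(median(even_data))  # average of the two middle values, (44+49)/2
```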
When dealing with large datasets, especially those with thousands or millions of observations, statistical software is typically used to compute the median efficiently. Modern tools make it straightforward to find the median or 50th percentile, providing a robust measure of central tendency.

Advantages and Disadvantages
One of the key advantages of using the median as a measure of central tendency is its robustness against extreme values or outliers. Unlike the mean, which can be heavily influenced by outliers, the median remains relatively unaffected. This makes the median a more reliable measure when dealing with datasets that contain extreme values. Another intuitive advantage of the median is its simplicity and interpretability. The median represents the middle value in a dataset, dividing it into two equal parts where half of the observations are lower and half are higher. This straightforward interpretation makes it easy to grasp the central tendency of the data at a glance.

However, the median does have its drawbacks. One significant disadvantage is its limited utility in more complex statistical models. Unlike the mean, which is commonly used in various statistical analyses and models, the median can be more challenging to incorporate because it lacks the mean’s convenient mathematical properties and offers a less nuanced representation of the data distribution.

Let’s explore the concept of the median and its resilience against outliers using an example. Suppose we have a dataset with observations: 1, 2, 3, 4, and 5. When calculating the median, we arrange these numbers in ascending order and find the middle value. Since we have an odd number of observations (5 in total), the median is simply the middle number, which in this case is 3. This means there are two observations below 3 and two above 3.

Now, consider if we introduce an outlier into our dataset. Instead of 5, let’s change one observation to 100. So, our dataset now looks like: 1, 2, 3, 4, 100. Despite the presence of the outlier (100), the number of observations remains the same and four out of five values are identical. However, when calculating the median again, it remains unaffected by the outlier. The median still stays at 3.

This example highlights a key strength of the median: its robustness against extreme values or outliers. Unlike the mean, which incorporates information from all data points and can be heavily influenced by outliers, the median focuses solely on the middle value(s) of the dataset. It doesn’t change even if an extreme value is added or modified, as long as the position of the middle value(s) in the ordering is unchanged.
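The outlier example above is easy to verify in code:

```python
from statistics import median

original = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 100]  # 5 replaced by an extreme value

# The middle value is 3 in both cases; the outlier has no effect
print(median(original), median(with_outlier))
```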

This robustness makes the median a preferred measure of central tendency in datasets where outliers are present or when the distribution is skewed. It provides a stable and reliable indication of the central value without being skewed by extreme values.

Median explained through survey example
Let’s revisit our survey dataset (link to article 12) where we explored four ranking variables related to perceptions of a successful career in data science. Using statistical software, we’ve calculated the median for each variable without the need for manual ordering or calculations.

Firstly, for the grades variable, the median is 2. This means that, in terms of importance, grades are perceived as a central factor among college students, ranking higher relative to the other variables.

Next, for networking, the median is 3. This suggests that networking is positioned as a middle-ranking factor in terms of importance for a successful career in data science. While still significant, it may not be as pivotal as other attributes according to the surveyed students.

Interestingly, the variables with the lowest median rankings are grades and projects, both having a median rank of 2. This highlights grades and practical project experience as the most crucial attributes perceived by students for achieving success in data science careers.
These median values provide a snapshot of how students prioritize different aspects when considering their career paths. They offer insights into which factors are perceived as most essential and which ones hold lesser importance.

Mean Vs Median

In an earlier article we calculated the mean using the survey example. Let’s compare the mean and median values for our survey variables related to perceptions of a successful career in data science. Both the mean and median serve as measures of central tendency, offering insights into which attributes students deem important.

Starting with grades, both the mean and median indicate that good grades are perceived as the most crucial factor for a successful career in data science. This consistency suggests a strong consensus among students regarding the significance of academic performance.
When we look at the other variables, networking emerges as the least important according to both measures. Communication skills ranked slightly higher but still less essential compared to grades and projects.

Speaking of projects, it’s notable that the median ranks projects as equally important as grades. This implies that, on average, students view hands-on project experience as equally significant as academic achievement when considering career success in data science.
It’s important to note that while the mean and median provide similar findings, they do show slight differences in how they rank the importance of grades versus projects. The mean may show a slight distinction between these two variables, whereas the median treats them equally due to its nature as the middle value in an ordered dataset.

Overall, whether we analyze using the mean or the median, the key takeaway remains consistent: good grades are paramount for a successful career in data science, followed closely by practical project experience. This alignment in findings underscores the critical role of academic achievement and hands-on learning in shaping career aspirations in the field.

To better illustrate the concepts of mean, median and mode, let’s dive into an example survey about what makes a great career in data science, according to undergraduate students. Imagine we surveyed college students, asking them to rank four key parameters they believe are essential for aspiring data scientists. The options they had to rank were: good grades, networking, communication skills, and projects & assignments. By analyzing their responses, we can uncover valuable insights into their perceptions and priorities. Let’s see how mean can help us make sense of this data!

Take a look at this intriguing dataset featuring responses from 362 students—our entire sample size. With 362 rows, each representing an individual student’s input, we dive into various insightful details. The columns reveal a wealth of information, including:
• Age: How old are these future data scientists?
• Gender: Who’s making strides in this field?
• Year of Study: Where are they in their academic journey?
• Location: Where are these students based?

But that’s not all! We’ve also asked these students to rank four key parameters that they believe are crucial for a successful career in data science.

Column Breakdown
But before that, let’s categorize our variables. Please recall that we discussed two kinds of variables in our first class: numerical and categorical.

Age: This is a numerical discrete variable. Think of it as the non-negative counting numbers we’ve discussed in class.

Gender: Here, we have a nominal categorical variable. It captures whether a student identifies as male or female, but there’s no ranking involved—it’s purely descriptive.

Year of Study: This one’s an ordinal categorical variable. It tells us which year the students are in—1st, 2nd, 3rd, or 4th. There’s a clear order, reflecting their academic progression.

Location: Despite our earlier discussions, let’s clarify—location is a nominal categorical variable. It indicates whether a school is in a rural, urban, or suburban area, but there’s no inherent ranking among these.

The Final Four Columns: These are where things get really interesting. Each student has ranked four key parameters essential for a successful career in data science. These parameters are:
• Good Grades
• Networking
• Communication Skills
• Projects & Assignments

Each student assigns a rank to these parameters, from 1 to 4. So, the final four columns are numerical discrete variables, capturing the students’ priorities in building a stellar data science career.

Mean of survey variables: Key to great career in data science

In this dataset, many of our variables are numerical, not categorical. While we do have a few categorical variables like gender, year of study, and location, the majority of our columns, like age and the ranking variables, are numerical. Numerical data can take on many different values, making it a bit more complex to describe and understand compared to categorical data.

A natural question arises: What is the average ranking for each of these variables? Essentially, we want to understand, on average, how students prioritize these aspects. Are grades considered most important, or do networking, communication skills, or project experience take precedence?

To answer this, we can calculate the mean ranking for each variable. By computing the mean, we obtain a clear numerical summary that reflects the average perception of students regarding the importance of each factor.

Result of survey
Let’s delve into the results from our dataset regarding what students consider important for a successful career in data science. We’ve analyzed four key variables: grades, networking, communication skills, and projects, each representing how students rank these factors from most to least important.

Starting with the grade variable, we used statistical software to calculate its mean ranking, which turned out to be 2.12. This means that, on average, students rank grades among the top two factors they deem crucial for success in data science. Remember, a lower numerical ranking signifies higher perceived importance — with 1 indicating the most important and 4 the least.

Next, let’s consider networking. The average ranking for networking was found to be 3.2. Since a higher rank means lower perceived importance, this suggests that, on average, students view networking as the least important of the four factors. Moving on to communication skills, the mean ranking obtained was 2.60. This places communication skills just past the midpoint of the 1-to-4 scale, behind grades and projects in importance.

Lastly, for projects, the average ranking was calculated to be 2.14. This indicates that students generally prioritize hands-on project experience as one of the more important factors for achieving success in data science careers.

These mean rankings provide valuable insights into student perspectives and priorities within the field. They highlight the varying degrees of importance placed on grades, networking, communication skills, and projects. Such insights are crucial for educators, career advisors, and industry professionals seeking to understand and cater to the needs of aspiring data scientists.

To clarify, if we use the mean as a measure of central tendency, grades emerge as the highest-ranking factor in terms of perceived importance for a successful career in data science.
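As a small sketch, the four mean rankings reported above can be sorted to recover this ordering (remember: a lower mean rank means higher perceived importance):

```python
# Mean rankings from the survey (lower = more important)
mean_ranks = {
    "grades": 2.12,
    "networking": 3.20,
    "communication skills": 2.60,
    "projects": 2.14,
}

# Sort factors from most to least important by mean rank
ordered = sorted(mean_ranks, key=mean_ranks.get)
print(ordered)
```

Sorting by mean rank puts grades first and networking last, matching the discussion above.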

The mean is one of the most commonly used measures of central tendency. It’s a powerful tool for understanding the typical value in our data, showing us where the center of a numerical variable lies. Let’s take an example: calculating the mean, or average, weight of three students, Ryan, Pushkar, and Dravid. The mean weight is simply the total weight of all individuals divided by the number of individuals, which gives us a sense of the “average” weight. To find the mean for the three students, we add up all the weights and divide by the sample size, which is three; we get 142.6 pounds, as shown in the figure above.
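Since the figure is not reproduced here, the individual weights below are illustrative placeholders; only the mean of 142.6 pounds comes from the text:

```python
# Hypothetical weights for Ryan, Pushkar, and Dravid (in pounds);
# chosen so they sum to 427.8, matching the stated mean of 142.6
weights = [130.0, 142.8, 155.0]

mean_weight = sum(weights) / len(weights)
print(round(mean_weight, 1))  # 142.6
```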

When we talk about the mean or average, we’re referring to the same concept. In statistics, it’s a fundamental measure of central tendency, helping us understand what the typical value is within a set of observations. By using simple arithmetic—summing up the observations and dividing by the number of observations—we can easily find the mean.

Notation
It’s important to discuss notation when summarizing data. Generally, a dataset contains hundreds of data points, and it can become quite cumbersome to keep track of individual responses. This is where notation becomes incredibly handy. Instead of tracking each response by name, we use concise notation to streamline our analysis. This method is not only efficient for our dataset but also essential when dealing with datasets containing millions of observations.

When we talk about the mean or other measures, we collect different observations for a variable, which we can call X. Suppose we have three students: Ryan, Pushkar, and Dravid. Instead of repeatedly writing their names, we use subscripts to label each observation:
x_1: Observation for Ryan
x_2: Observation for Pushkar
x_3: Observation for Dravid

This concise notation allows us to easily reference specific data points. For instance, when we mention x_3, we know we’re talking about Dravid’s weight. Using this notation, we can quickly refer to any student’s data and their corresponding values. This makes our analysis much smoother. With this shorthand notation, we can easily handle massive datasets, making it possible to perform complex analyses without getting bogged down by individual details. This approach allows us to focus on the big picture and derive meaningful insights from our data.

Samples vs. Populations
In statistics, it’s crucial to distinguish between a sample mean and a population mean. Recall that a sample is a subset of a population. There is a complete article that discusses the nuance of sample and population which can be found here (link to article 3). Statisticians and data analysts often work with samples because they’re easier and cheaper to collect compared to an entire population. Conducting a census for a population is costly and time-consuming. Therefore, much of statistics involves taking a sample from a population and making inferences about the population based on that sample.

So, coming back to the mean: if the dataset of observations is a sample, we call the mean x̄ (pronounced “x bar”). It’s just an x with a line over it, by convention. If the dataset is a population, we call the mean μ, the Greek letter mu. The important point is that, in general, population parameters are typically expressed in Greek letters, while sample statistics are represented in Latin (Roman) letters.

Mean of a sample vs population
Understanding how to calculate means from samples versus populations is essential in statistics. Although the formula remains the same, the notation differs slightly, which clarifies whether we’re working with a subset or the entire dataset.

When calculating the mean of a sample, denoted as x̄, you sum up all the observations (represented as x_i, where i ranges from 1 to n, and n is the sample size), and then divide by the sample size, n. This gives x̄, which represents the average value of the sample.

On the other hand, when dealing with populations, we use a different notation. The total number of observations in a population is denoted by capital N. For example, if we’re measuring the weight of all 1.4 billion people in India, N represents the total population size.

To find the population mean, denoted as mu (μ), we follow the same process: sum up all the observations across the entire population and divide by the population size, N. This results in the population mean, μ, which represents the average value of the entire population.

This distinction in notation—using small n for sample size and capital N for population size—helps to clearly differentiate between working with a subset of data versus the entire dataset.
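A minimal sketch of the two calculations (the arithmetic is identical; only the notation and the size symbol differ; the numbers are illustrative):

```python
# A sample of size n drawn from a population of size N
sample = [60, 70, 80]               # n = 3
population = [60, 70, 80, 90, 100]  # N = 5

x_bar = sum(sample) / len(sample)        # sample mean, x-bar
mu = sum(population) / len(population)   # population mean, mu

print(x_bar, mu)
```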

Advantages & Disadvantages of the Mean
The mean, as a measure of central tendency, offers several advantages that make it widely used in statistical analysis. Primarily, it provides a single numerical summary that incorporates information from all data points within a dataset. This characteristic makes the mean highly representative of the entire dataset, offering a clear and concise average value.

Moreover, the mean holds universal applicability across various statistical and data analytic methods. Its widespread use stems from its simplicity and effectiveness in summarizing data, making it a fundamental tool in statistical inference and decision-making processes.

Drawbacks
However, despite its popularity, the mean has notable drawbacks, primarily its sensitivity to extreme values, also known as outliers. When an outlier—such as a significantly larger or smaller value compared to the rest of the dataset—is present, the mean can be heavily influenced. This sensitivity means that the mean may not accurately reflect the typical or central value of the dataset when extreme values are present.

To illustrate this sensitivity, consider a dataset with values of 1, 2, 3, 4, and 5. Calculating the mean gives a value of 3 (1+2+3+4+5 = 15; 15 / 5 = 3). However, when 5 is replaced by the outlier 100, the mean shifts dramatically to 22 (1+2+3+4+100 = 110; 110 / 5 = 22). This stark change demonstrates how the mean can be skewed by just one extreme value, affecting its reliability as a measure of central tendency in such scenarios.
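The same arithmetic in code:

```python
from statistics import mean

original = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 100]  # 5 replaced by 100

print(mean(original))      # 3
print(mean(with_outlier))  # 22
```

Contrast this with the median example earlier, where the same substitution left the result unchanged.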

In conclusion, while the mean offers a straightforward and widely applicable measure of central tendency, its susceptibility to outliers underscores the importance of considering other measures—such as the median or mode—depending on the distribution and characteristics of the dataset. Understanding these nuances is crucial for making informed decisions and interpretations in statistical analysis.

Describing Numerical Data
In previous articles we described two kinds of variables, numerical and categorical, also known as quantitative and qualitative variables. The article can be found here. In another article we saw how to describe a categorical variable through a table of counts and a contingency table, which can be found here. We also discussed how to visualize two categorical variables through stacked and dodged bar charts and through a mosaic plot, which can be found here. Now, it’s time to tackle numerical data! This article focuses on how we describe numerical data. To really get a grasp on numerical data, we use measures of center and location, along with some graphical techniques.

Imagine you’ve got a bunch of numbers from a survey or an experiment. How do you make sense of all this data? That’s where descriptive statistics come in. To describe numerical data, we measure the center and location: the mean, median, and mode. We also use percentiles and quartiles, which give us another way to understand the location of our data points. Next, measures of variation or spread tell us whether our data is scattered or tightly clustered. Finally, we’ll bring our data to life with visualizations, learning how to create and interpret histograms, box plots, and density plots.

Measures of Center: The center of our numerical data helps us identify the typical values. We do this by looking at:
• Mean: The average value, giving us an overall trend.
• Median: The middle value when data is ordered, showing us the central point.
• Mode: The most frequently occurring value, highlighting common trends.

Measures of Location: We also use other measures of location to dig deeper:
• Quartiles: These divide the data into quarters, providing insight into data distribution.
• Percentiles: These show the relative standing of a value within the dataset.

Measures of Dispersion: Understanding how spread out the data is helps us grasp its variability:
• Range: The difference between the maximum and minimum values.
• Interquartile Range (IQR): The range of the middle 50% of the data, offering a focused view of data spread.
• Variance and Standard Deviation: These are crucial for understanding how much the values deviate from the mean.

Visualizing Numerical Data: For numerical data, we use different visualization techniques:
• Histograms: To show the frequency distribution of data.
• Box Plots: To display the data distribution, highlighting the median, quartiles, and potential outliers.
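A quick sketch of these measures using Python's standard `statistics` module (the data values are illustrative; note that quartile conventions vary between tools, so the `"inclusive"` method is chosen here to match the common linear-interpolation convention):

```python
from statistics import mean, median, quantiles, variance, stdev

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Quartiles: Q1, Q2 (the median), Q3
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

print("mean:", mean(data))
print("median:", median(data))
print("range:", max(data) - min(data))      # max minus min
print("IQR:", q3 - q1)                      # middle 50% of the data
print("variance:", variance(data))          # sample variance
print("stdev:", round(stdev(data), 3))      # square root of variance
```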

In the previous article we discussed how to visualize two categorical variables through dodged and stacked bar graphs. Another way of visualizing a two-by-two table is through a mosaic plot. A mosaic plot can help you visualize the unconditional and conditional aspects of a contingency table. There are two steps to construct a mosaic plot.

First, you create a square and divide it into vertical bars. You should have the same number of vertical bars as you have levels of the column categorical variable. The width of these vertical bars is proportional to the unconditional distribution of that categorical variable. So, if the variable has two levels, you have two bars.

The next step is to split each vertical bar horizontally, proportional to the conditional distributions of the other variable. It’s really important to note that, when you create a mosaic plot, you keep track of what you’re conditioning on, as it changes the whole graph.


For better understanding let’s go through an example, so first we’re going to take a square and we’re going to divide it into vertical bars, whose widths are proportional to the unconditional distribution of the first categorical variable.

So, our first categorical variable is gender. If we look at the percentages, we can see that 63 percent of people are female and 37 percent are male.

We divide the square vertically into two columns because the gender variable has two levels, female and male. The width of each column is proportional to the unconditional distribution, so the bar for the female category is a bit larger, because 63% of respondents are female. We can see that the blue bar is a little larger in terms of its width, since 63 percent of people are female, while the orange bar for male is a little narrower.

This completes our first step: we’ve created a square and divided it into vertical bars equal in number to the levels of our first categorical variable, with the width of each bar proportional to the unconditional distribution of that variable.

Conditioning on gender

In the second and final step, we split each bar horizontally, proportional to the conditional distributions of the row categorical variable. Our row categorical variable is relationship status, with three levels. We’re going to look at the distributions of relationship status conditional on whether a respondent is female or male. The horizontal splits reflect these different conditional distributions of relationship status.

You can see here that the two vertical bars reflect the unconditional distribution of female versus male, while within the horizontal splits we have different distributions of being in a relationship, in a complicated relationship, or single, conditional on gender. A greater percentage of people who are single are female (59%). We can see this visually in the mosaic plot: it reflects the distribution of relationship status conditional on gender. The great thing about a mosaic plot is that it conveys a lot of information in a relatively concise way.

So, to conclude on the mosaic plot: the width of the bars for the column variable gives the unconditional percentage or distribution, and visually we can see that, overall, more people are female than male. The horizontal splits give the conditional distributions; for example, among those who are female, a greater percentage are single.

Conditioning on relationship status

In the previous case we were conditioning on gender; now we condition on relationship status. We can also make a mosaic plot by flipping the column and row variables. In that case, relationship status gives the unconditional distribution, and gender, as the row variable, gives the conditional distributions. So, as before, we divide the square into three columns, for in a relationship, it’s complicated, and single, since they represent the unconditional distribution. As the next step, we split the columns horizontally according to the conditional distribution of gender.

In this case we can draw conclusions as well; we’re just conditioning on a slightly different variable, i.e., relationship status. We can conclude that, overall, a greater percentage of those who are single are female. Also, among those in a relationship, the percentage of females is higher than that of males.

The row variable gives us a set of conditional distributions, so it’s really important when you create a mosaic plot to keep track of what you’re conditioning on: are you conditioning on relationship status or on gender? The distribution of gender conditional on relationship status is not the same as the distribution of relationship status conditional on gender. As posed earlier, we can ask: among those who are female, what percentage were single rather than in a complicated relationship? That reflects conditioning on gender. Another question would be: among those who were in a relationship, what percentage were male? So always ask yourself: what am I conditioning on, and what are these percentages or proportions that I’m calculating?

Bar graph – two categorical variables

In a previous article, we discussed tabulating and graphing one categorical variable through tables of counts and proportions. In another article we discussed tabulating two categorical variables through contingency tables, and we also saw how conditioning matters, through conditional and marginal distributions, using the example of relationship status and gender. In this article on categorical data, we will discuss how to visualize it: we will talk about dodged and stacked bar graphs, and we discuss the mosaic plot in another article. To visualize two categorical variables, we use both dodged and stacked bar graphs. Bar graphs can display relative frequencies, proportions, or percentages, but here we will focus on bar graphs of counts.

Referring to our contingency table of counts, we have 169 people in total. We also have the total counts for those in a relationship, in a complicated relationship, or single, and the total counts of females and males. Then we have the cells of the contingency table, which are the counts for the different combinations of levels. For example, we have 32 people who are in a relationship and are female, and 10 people who are in a relationship and are male.

So, for a contingency table of counts, we can graph the marginal or unconditional distributions using the simple bar graphs displayed. One bar graph shows the unconditional distribution of females and males. For the other categorical variable, relationship status, we have three levels: in a relationship, in a complicated relationship, or single.

Overall, the bar graph for relationship status shows 42, 19, and 108 people, and the bar graph for gender shows 107 females and 62 males. To reiterate, these are the frequencies or counts for the unconditional distributions.
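These marginal counts fall straight out of the cells. A minimal sketch, using the same reconstructed cell counts as above (only the 32 and 10 cells are quoted directly in the article):

```python
# Recovering the marginal (unconditional) counts that the simple bar
# graphs display, from the cells of the contingency table.
cells = {
    ("relationship", "female"): 32, ("relationship", "male"): 10,
    ("complicated",  "female"): 12, ("complicated",  "male"): 7,
    ("single",       "female"): 63, ("single",       "male"): 45,
}

status_totals = {}
gender_totals = {}
for (status, gender), n in cells.items():
    status_totals[status] = status_totals.get(status, 0) + n
    gender_totals[gender] = gender_totals.get(gender, 0) + n

print(status_totals)        # {'relationship': 42, 'complicated': 19, 'single': 108}
print(gender_totals)        # {'female': 107, 'male': 62}
print(sum(cells.values()))  # 169
```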

Dodged bar graph – side-by-side bar graph

Plotting the total unconditional distributions is straightforward, as we have seen; the graph above shows clearly how to plot them. The question is: how do we graph the cells in the middle, highlighted in the red box, which reflect the cross tabulation of the two categorical variables? The solution is to create bar graphs that are either dodged or stacked.

Gender with relationship status

Let’s look at a dodged bar graph. This is a bar graph in which we look at those who are female and those who are male, and then, among those who are female or male, we look at the counts for whether you’re in a relationship, in a complicated relationship, or single. We call this dodged because the bars are placed next to each other, or side by side.

We can see that all we’re doing is representing the cells: we have six cells and six bars, and we’re just grouping the bars based on whether you’re female or male. Looking at the bars, we see different heights, and most people in our dataset are in the group who are female and single; that’s the tallest bar. We can also create a dodged bar graph in a different way, instead of grouping by whether you are male or female.

Relationship status with Gender

We can also group the cells based on whether you’re in a relationship, in a complicated relationship, or single. This is the same kind of graph as before, and the heights of the bars are the same; we’re just swapping the roles of gender and relationship status. In this case we’re chunking our data into whether you’re in a relationship, in a complicated relationship, or single, so we have three groups rather than the two of the previous example, and within each of those levels the bars are split into male or female.

It’s just a slightly different way of looking at the data. All we’re doing is representing the counts or frequencies in the cells of the contingency table, and the heights of the bars still represent the counts for each of those cross tabulations.
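The two dodged bar graphs draw the same six cells, just grouped along different axes. A sketch of that regrouping (counts reconstructed from the article's figures):

```python
# Grouping the six cell counts the two ways a dodged bar graph can:
# by gender (two groups of three bars) or by status (three groups of two).
cells = {
    ("female", "relationship"): 32, ("female", "complicated"): 12, ("female", "single"): 63,
    ("male",   "relationship"): 10, ("male",   "complicated"): 7,  ("male",   "single"): 45,
}

def group_by(cells, axis):
    """Group cell counts by gender (axis=0) or by relationship status (axis=1)."""
    groups = {}
    for key, n in cells.items():
        groups.setdefault(key[axis], {})[key[1 - axis]] = n
    return groups

by_gender = group_by(cells, 0)   # two groups of three bars each
by_status = group_by(cells, 1)   # three groups of two bars each

print(by_gender["female"])  # {'relationship': 32, 'complicated': 12, 'single': 63}
print(by_status["single"])  # {'female': 63, 'male': 45}
```

Either grouping preserves every bar height; only the visual arrangement changes.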

Stacked bar graph – Gender with relationship status

Now besides dodged bar graphs, we can also stack the bars, i.e., instead of placing them side by side, we stack them on top of each other. In this case, too, we’re just representing the cells of a table of counts as different bars.

As in the previous case, we’re creating a stacked bar graph in which we chunk our data into female or male, and then, within each bar, we plot the counts for each relationship status.

You can see here that again the tallest segment is for females who are single. This reflects the same information as the dodged bar graph, but sometimes a stacked bar graph can be more informative, or simply a more concise way of visualizing the cells.

Relationship status with Gender

We could also make a stacked bar graph by chunking on relationship status, as in the previous case: we look at relationship status first, and then, within each level, we look at the counts of those who are female or male. Both the dodged and stacked graphs lead to the same conclusions. Changing the orientation of the graph also changes nothing; it’s just a different way to represent the data.

A previous article discussed the basic contingency table of proportions and percentages. People often want to ask about the relationship between the two categorical variables more explicitly. To put it another way, we want to know the percentages conditioning on particular values. So far we know the overall percentages, and we know the percentages for particular combinations, but we also want to condition on some set of values and calculate percentages. A contingency table is called a contingency table because often you want to condition, i.e., to say things that are contingent upon the values of another variable.

For example, someone might ask: among those who are female, what percentage were single versus in a complicated relationship? This is conditioning on gender. In a similar way, we might instead condition on relationship status and ask: among those who were in a relationship, what percentage are male?

So, let’s look at a contingency table in which we’re conditioning on gender. What we do is divide the cell counts by the corresponding total for each category of gender, male and female, and then multiply by 100 to get percentages. You can see that we have 107 people who are female and 62 who are male, and when we divide each cell by its respective column total, we get these percentages.

For those who are in a relationship and female, we divide the cell count of 32 by 107, the total number of females. In a similar way we find that 11 percent were in a complicated relationship, for both males and females. We’re just taking the count for a particular cell and dividing by the total number of people who are female (or male).

To answer the question we posed earlier, among those who are female, what percentage were single and what percentage were in a complicated relationship: 59% are single and 11% are in a complicated relationship, as is clearly visible in the figure above.
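The conditioning-on-gender calculation can be sketched directly. Cell counts are again reconstructed from the article's quoted figures:

```python
# Conditioning on gender: divide each cell by its gender (column) total
# and multiply by 100 to get percentages.
cells = {
    ("female", "relationship"): 32, ("female", "complicated"): 12, ("female", "single"): 63,
    ("male",   "relationship"): 10, ("male",   "complicated"): 7,  ("male",   "single"): 45,
}
gender_totals = {"female": 107, "male": 62}

pct_given_gender = {
    (g, s): round(100 * n / gender_totals[g]) for (g, s), n in cells.items()
}

# Among females: what percentage are single or in a complicated relationship?
print(pct_given_gender[("female", "single")])       # 59
print(pct_given_gender[("female", "complicated")])  # 11
```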

Two categorical Variables – Conditioning on relationship status

We can take a different approach and condition on relationship status: we divide the cell counts by the corresponding total for each level of relationship status and then multiply by 100.

Let’s go through an example. We take the cell count for the combination female and in a relationship, which is 32, and divide it by 42, the total number of people who are in a relationship. This tells us the percentage who are female among those in a relationship, since now we are conditioning on relationship status.

To answer the question, among those who were in a relationship, what percentage are male: we can see that only 24% are male. The previous table, by contrast, showed that just 16% of males are in a relationship. The numbers differ because everything depends on what we condition on, relationship status or gender.
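The same sketch with the other denominator makes the contrast concrete (cell counts reconstructed as before):

```python
# Conditioning on relationship status: divide each cell by its
# status (row) total instead of the gender total.
cells = {
    ("relationship", "female"): 32, ("relationship", "male"): 10,
    ("complicated",  "female"): 12, ("complicated",  "male"): 7,
    ("single",       "female"): 63, ("single",       "male"): 45,
}
status_totals = {"relationship": 42, "complicated": 19, "single": 108}

pct_given_status = {
    (s, g): round(100 * n / status_totals[s]) for (s, g), n in cells.items()
}

# Among those in a relationship, what percentage are male? Compare with
# the 16% of males who are in a relationship (conditioning on gender).
print(pct_given_status[("relationship", "male")])  # 24
print(round(100 * 10 / 62))                        # 16
```

Same cell count of 10; two different denominators (42 vs 62) give two different percentages.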

Conditioning on relationship status – Percentages
If you are in a relationship, there is a 76% chance that you are female. Also, if you are female, you have a 30% chance of being in a relationship, which is higher than for males, as we saw previously when conditioning on gender. The numbers differ depending on whether we condition on relationship status or on gender, but they lead to the same conclusion: in this sample, females are more likely to be in a relationship than males. There is a relationship between gender and relationship status; the two variables are not independent.

Difference in proportion

We can also calculate the difference in proportions. If you look at the previous table, 30% of females in the sample say they are in a relationship, while only 16% of males in the sample say they are in a relationship. There is a difference. The difference in proportions is a difference in the proportions for one categorical variable calculated at different levels of the other categorical variable.

Example: proportion of females in a relationship − proportion of males in a relationship = 0.30 − 0.16 = 0.14

Again, to reiterate: there is a difference between asking what proportion of females in this sample are in a relationship, and asking what proportion of people in a relationship in this sample are female.

What proportion of females in this sample are in a relationship?
Ans: 32/107

What proportion of people in a relationship in this sample are female?
Ans: 32/42

A word of caution here: the proportion of females in a relationship is NOT THE SAME AS the proportion of people in a relationship who are female!
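Both quantities, and the difference in proportions, can be checked with a few lines of arithmetic using the counts from the table:

```python
# Two different questions, two different denominators, using the article's
# counts: 32 females in a relationship, 107 females, 42 people in a
# relationship, 10 males in a relationship, 62 males.
p_rel_given_female = 32 / 107   # proportion of females who are in a relationship
p_rel_given_male   = 10 / 62    # proportion of males who are in a relationship
p_female_given_rel = 32 / 42    # proportion of those in a relationship who are female

# Difference in proportions (one variable, across levels of the other):
diff = p_rel_given_female - p_rel_given_male
print(round(diff, 2))                # 0.14
print(round(p_female_given_rel, 2))  # 0.76
```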

Distribution

Now is a good time to discuss what a distribution is. A distribution is simply the arrangement of values of a variable showing their relative frequency, meaning the proportions or percentages. We think about distributions all the time: imagine you have a group of 10 friends and you ask about their favourite ice cream flavour. The possible flavours are Vanilla, Chocolate, and Strawberry. A bar chart or frequency table can visually represent this distribution, where each bar represents the number of friends who prefer each flavour.
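The ice-cream example as a distribution, in a few lines. The article gives only the flavours and the group size of 10, so the individual answers below are illustrative:

```python
# A distribution is just the counts (or proportions) of each value.
from collections import Counter

flavours = ["Vanilla", "Chocolate", "Strawberry", "Chocolate", "Vanilla",
            "Chocolate", "Strawberry", "Vanilla", "Chocolate", "Vanilla"]

distribution = Counter(flavours)                                    # frequencies
relative = {f: n / len(flavours) for f, n in distribution.items()}  # proportions

print(distribution["Chocolate"])  # 4
print(relative["Vanilla"])        # 0.4
```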

You can talk about a distribution that is unconditional (marginal), joint, or conditional. When we look at these contingency tables, it’s useful to talk about them in a way that references these types of distribution.

Unconditional (marginal), joint and conditional distribution

The figure above depicts the unconditional or marginal distributions that we’ve highlighted. Keep in mind that they sit in the margins of the table, hence the name marginal distribution.

The joint distribution, at the left side of the plot, is given by the particular cells jointly.
At the bottom we have conditional distributions: here, the distribution of relationship status conditional on whether you are male or female. The box highlighted in red is conditioned on male. In a similar way, we can also condition on relationship status, shown in the red box at the extreme right of the table.

The whole point of these distributions is that they can be used to examine the relationship between two categorical variables. When there is a relationship between two variables, we say the two variables are not independent.

So, let’s look at another example to understand tabulating two categorical variables. In a previous article we described presenting one categorical variable. The question we might pose is: are men more likely to be in a relationship, or is it easier for females to find a life partner? One way to answer these kinds of questions is through a contingency table.

To understand this, we have two categorical variables: relationship status and gender. To tabulate two categorical variables, we draw a contingency table (or cross tabulation), which is just a table of counts, proportions, or percentages for two categorical variables. One variable runs across the columns and the other across the rows.

It’s called a contingency table because it can tell us how cases are distributed along each variable contingent (or conditional) on one or more categories of the other variable. At its core it’s just a cross tabulation: simply a table of counts, proportions, or percentages from two categorical variables.

Relationship status can be categorised into three categories: in a relationship, complicated, and single. These categories run across the rows of the table, while gender, which has the levels male and female, runs across the columns. It doesn’t matter which variable is displayed in the rows and which in the columns. As with one categorical variable, each cell in the contingency table represents the number of times a particular combination of categories occurs in the dataset.

Contingency table of counts

So, here’s a contingency table of counts that displays the total numbers of males (62) and females (107). Each cell in this contingency table represents the number of times a particular combination of categories occurs in the dataset; the rows of the table are the categories of one variable and the columns are the categories of the other. One thing to keep in mind is that the totals for each row and column are given in the margins.

The row variable is labelled relationship status, and the column variable is whether the person is male or female. In the middle we have the cell counts for each combination of categories. The row totals, the column totals, and the cell counts for all combinations each sum to the total number of cases in the dataset. For example, if you look at the row for the level in a relationship and the column female, you can see that we have 32 females in our dataset who were in a relationship.

Contingency table of Proportion

You might also encounter a contingency table of proportions or percentages. It’s useful to express a contingency table this way because people often want proportions or percentages when they look at these tables. To convert a contingency table of counts to a table of proportions, divide each cell in the table by the total number of cases. To convert a contingency table of proportions to a table of percentages, simply multiply each cell by 100.

So, here’s an example of converting from counts to proportions. All we’re doing is taking the count for the cell, 32 observations, and dividing by the total number of observations, 169, to get the proportion 0.19. To convert from proportions to percentages, we just multiply the proportions by 100. So we can say that 19% of people are in a relationship and female. Summarising the table, most of the people in the survey were single, 64 percent, and the percentage of females in the survey is higher than that of males: 63% vs 37%.
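The counts-to-proportions-to-percentages conversion is a sketch in one pass (cell counts reconstructed from the article's quoted figures):

```python
# Converting a contingency table of counts into proportions and
# percentages: divide each cell by the grand total, then multiply by 100.
cells = {
    ("relationship", "female"): 32, ("relationship", "male"): 10,
    ("complicated",  "female"): 12, ("complicated",  "male"): 7,
    ("single",       "female"): 63, ("single",       "male"): 45,
}
total = sum(cells.values())  # 169

proportions = {k: n / total for k, n in cells.items()}
percentages = {k: round(100 * p) for k, p in proportions.items()}

print(round(proportions[("relationship", "female")], 2))  # 0.19
print(percentages[("single", "female")])                  # 37
print(percentages[("single", "male")])                    # 27
```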

Another sort of conclusion we can make is about the joint combinations of the levels of the two categorical variables. We can say, for example, that 37 percent of people are female and single versus 27 percent who are male and single. To give another example, 7% are female and in a complicated relationship versus 4% who are male and in a complicated relationship.

This kind of tabulation can really guide us, or provide an impetus, in answering certain questions. For these kinds of questions, we can use tables and graphs, in particular contingency tables.

To answer the questions we posed at the beginning: are men more likely to be in a relationship? We can see that 6% of people are male and in a relationship, versus 19% who are female and in a relationship. And for the second question, are more people in a complicated relationship than single? We can see that 11% of people are in a complicated relationship versus 64% who are single.

If you have a sample or population of data and you want to understand its properties and the relationships among its variables, you need to summarise and visualise it; that’s what descriptive statistics is really all about. Categorical variables capture characteristics or traits of the data at hand and are generally represented in the columns of a dataset.

This article focuses on categorical data and the different ways to study it, both through tables and graphical displays. We start our journey by tabulating one categorical variable with frequency tables, which are just tables of counts. Later we discuss tabulating two categorical variables through contingency tables, where we look at a cross tabulation of counts or frequencies. We also discuss ways of graphing both one and two categorical variables, including mosaic plots, which are quite impressive and handy for graphing two categorical variables.

This article covers the basic concepts of describing one and two categorical variables. For one categorical variable, it discusses summary statistics like proportions, frequency tables, and relative frequency tables, followed by bar charts and pie charts. To summarise two categorical variables, it discusses two-way tables and the difference in proportions, and finally visualizes two categorical variables through side-by-side bar charts, segmented bar charts, and mosaic plots. The article is well suited for students, professionals, educators, and anyone just starting out in statistics and data science who wants a solid grasp of the key fundamentals.

Categorical Variable: Frequency Tables of counts
Categorical variables are variables whose values are different categories or qualities; sometimes they’re called factors, and the different categories are called levels. We have gone through some examples before, like gender, approval or disapproval of a social policy, and whether or not you are a regular smoker, in our previous session. If you haven’t gone through that article, here is the link.

So, let’s take an example for better understanding. A random sample of adults were surveyed about the type of car they own among five brands: Hyundai, Maruti, Tesla, Mercedes, and Ford. We can raise some questions: how many people have a Mercedes? What proportion of people have a Ford? These kinds of questions, which arise when you look at particular rows of a dataset, can really prompt investigation of various ideas.

To answer these kinds of questions, we create tables and graphs for a single categorical variable. One way to create a table of counts is through a frequency table: you take a categorical variable and calculate the count of observations for each category. Frequency is just a fancy word for count, so whenever you hear the word frequency, think counts; you’re just counting how many observations fall in each category. So if we focus on how many people have a Ford, we can look at the frequency table and say that 42 people have a Ford, 65 people have a Tesla, and so on.

Relative frequency table of proportions and percentages

Besides the table of counts, you can also create what’s called a relative frequency table; sometimes a relative frequency is more useful than a frequency. A relative frequency is simply the proportion or percentage in each category. To calculate a proportion, divide the count by the total number of cases. Usually we then multiply by 100, because people often like to express proportions as percentages. A relative frequency table is simply a table that gives the categories of a categorical variable and the proportions or percentages of observations for each category.

Here is the example of the number of people who have different cars, expressed in proportions. Please note that all the proportions in a relative frequency table sum to 1. To express proportions as percentages, simply multiply the proportions by 100.
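A minimal sketch of building the relative frequency table. Only the Ford (42) and Tesla (65) counts are quoted in this article; the other counts below are hypothetical, chosen so that Maruti is the most common brand, as in the figures:

```python
# Frequency table -> relative frequency table (proportions, then
# percentages). Hyundai, Maruti, and Mercedes counts are assumed.
counts = {"Hyundai": 30, "Maruti": 80, "Tesla": 65, "Mercedes": 25, "Ford": 42}
total = sum(counts.values())

proportions = {brand: n / total for brand, n in counts.items()}
percentages = {brand: round(100 * p, 1) for brand, p in proportions.items()}

# Proportions in a relative frequency table always sum to 1.
assert abs(sum(proportions.values()) - 1.0) < 1e-9
print(percentages["Ford"])
```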


Let’s examine our research questions in the context of these relative frequency tables. If you ask what percentage of people have each type of car, you can just create a relative frequency table of percentages. You may also ask what percentage of people have no car. You can see that these research questions are easily answered by collecting data and then doing this kind of basic analysis, creating a table of counts, proportions, or percentages.

Graphing one categorical variable

Besides tables, it’s often useful to visualize a categorical variable. To visualize a single categorical variable, it’s very common to use either a bar graph or a pie chart. The bar graph is usually preferred by statisticians; a bar graph shows bars whose heights represent the count of observations for each category of a categorical variable.

A pie chart shows how a whole is divided into categories: it shows wedges of a circle, and each wedge has an area corresponding to the proportion for each category. These are the most common charts used throughout academia and industry.

The picture above shows a bar graph of counts, i.e., a frequency bar graph, with the frequency table presented alongside for comparison. You can see that the bar graph just duplicates the findings from the frequency table, but sometimes it’s useful to visualize the differences, especially to really understand the differences between the numbers of observations in each category.

The pie chart shown above is based on the frequency table of counts, and the wedges give you an idea of the relative proportions or percentages. The pie chart replicates the bar graph of frequency counts, and you can see in the figure that the highest number of people have a Maruti.

In a bar graph you can easily compare the heights of the bars; with a pie chart it can be pretty difficult to compare the relative sizes of the wedges, especially when the categories have similar counts or proportions. I just want to emphasize that if you’re showing a table of counts or frequencies to a statistician or an expert, it is better to present a bar graph.

You can see that in the bar graph it’s quite clear that the highest number of people have a Maruti, and the next highest category is Tesla. In the pie chart, the differences between Hyundai, Ford, and Mercedes are quite difficult to distinguish once the counts are removed. In the bar graph, even when the counts are similar, you can still distinguish the three categories; you can still tell that there are more people with a Hyundai. The pie chart washes that distinction out, which is why bar graphs are typically favoured by statisticians and experts in data visualization.