Demo Example
Author

Campusπ


There are two main kinds of studies: observational studies and experimental studies. The statistical process can be viewed in the context of the general process of investigation that we talked about earlier. Its four steps are: identifying a question or problem, collecting relevant data, analysing the data, and drawing a conclusion. Study design helps us improve the collection of data, makes data collection transparent, and helps us understand what kinds of inferences or conclusions we can draw from the data at hand.

Observational studies entail some kind of observation or measurement, but there is no direct intervention: there is no attempt to alter what is going on in terms of social dynamics or the world around us. In an observational study, there can always be confounding (or lurking) variables affecting the results, which means that observational studies can almost never show causation. It is easier to adjust for confounding variables in an experiment.

In experimental studies the goal is to apply some kind of intervention. You might give some people a pill while others do not receive one, or you might try to alter the conditions in which people live or how they live. The key point is that in experimental studies the researcher deliberately intervenes, whereas in observational studies there is no intervention.

Observational Studies
In observational studies we typically cannot prove cause and effect, because we are just observing: we measure what is going on without any intervention. A survey is a typical kind of observational study. You take a sample from a population and ask them questions, for example about their approval or disapproval of certain drug use or about their religious affiliation. Surveys are very commonly used as observational studies.

Observational studies are further divided into two types. In a cross-sectional study, data are collected at one point in time on a set of individuals. So, if you collect data only in 1955, that is a cross-sectional study. A longitudinal study, also known as a panel study, is an observational study in which data are collected over time on the same individuals or objects. So, if you ask people a question in 1950 and then ask the same group of people again in 1955, 1960 and so on, that is a panel study, because you are collecting data over time on the same individuals. Panel studies are much more expensive than cross-sectional studies, but they are useful for analyzing trends in the data.
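
The difference between the two designs can be sketched in a few lines of Python. This is only an illustration: the records, the field layout and the helper names `cross_section` and `panel` are made up for this example.

```python
# Hypothetical survey records: (person_id, year, approves_policy)
records = [
    (1, 1950, True), (1, 1955, True), (1, 1960, False),
    (2, 1950, False), (2, 1955, False), (2, 1960, False),
    (3, 1950, True), (3, 1955, False), (3, 1960, False),
]


def cross_section(rows, year):
    """Cross-sectional view: everyone's answer at one point in time."""
    return {pid: answer for pid, yr, answer in rows if yr == year}


def panel(rows):
    """Panel (longitudinal) view: each person's answers over time."""
    by_person = {}
    for pid, yr, answer in sorted(rows, key=lambda r: (r[0], r[1])):
        by_person.setdefault(pid, []).append((yr, answer))
    return by_person


snapshot_1955 = cross_section(records, 1955)
histories = panel(records)
print(snapshot_1955)
print(histories[1])
```

The cross-sectional view keeps one answer per person, while the panel view keeps the whole history for each person, which is what makes trends visible.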

One key point with observational studies is that you need to be careful when interpreting the results. There can always be confounding or lurking variables affecting the results. Practically, this means that observational studies can rarely show causation, and it requires very strong assumptions about the world around us to take an observational study and say something about cause and effect.

It helps to differentiate between association and causation. Two variables are associated if values of one variable tend to be related to values of the other variable; this is what observational studies typically show. Two variables are causally associated if changing the value of the explanatory variable influences the value of the response variable, which is best established through experimental studies.

Confounding Variables
We can take the example of obesity to understand the nuance. It is very tempting to say that if you gain weight, you are causing your friends to gain weight because of social structure. But it is quite possible that a fast-food restaurant opened down the street and changed your eating habits and those of your friends. So, what appeared to be a cause-and-effect pattern due to the social structure of friends was really caused by the fast-food restaurant opening down the street. We would like to establish cause-and-effect relationships, since they give more conclusive results, and with an experiment it is much easier to prove cause and effect.

A confounding variable, in general, is a variable not included in the study design that has an effect on the variables studied. In the case of social networks and obesity, a confounding variable could be the fast-food restaurant that opened down the street. To take another example, if you study sunscreen and cancer, you might conclude that sunscreen causes cancer if you look only at sunscreen use and cancer rates.

You might gather a sample of data, observe what happens over time, and find that people who use sunscreen are more likely to have skin cancer. However, there may be a confounding or lurking variable that your study does not consider: the level of sun exposure. People who spend more time in the sun, for example at the beach, tend to use sunscreen more often, but because of that sun exposure they also tend to have higher rates of skin cancer than people who stay indoors all the time. So, sun exposure is a confounding variable that you are not adjusting for in your observational study.
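
A small simulation can make the sunscreen example concrete. The sketch below is not real data: every probability in it is an assumption chosen for illustration. In this made-up world cancer risk depends only on sun exposure, and sunscreen has no effect at all.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable


def simulate_person():
    high_exposure = rng.random() < 0.5
    # Assumption: people with high sun exposure use sunscreen more often.
    uses_sunscreen = rng.random() < (0.8 if high_exposure else 0.2)
    # Assumption: cancer risk depends only on exposure, not on sunscreen.
    cancer = rng.random() < (0.10 if high_exposure else 0.02)
    return high_exposure, uses_sunscreen, cancer


people = [simulate_person() for _ in range(200_000)]


def cancer_rate(rows):
    rows = list(rows)
    return sum(cancer for _, _, cancer in rows) / len(rows)


# Naive comparison that ignores the confounder
rate_users = cancer_rate(p for p in people if p[1])
rate_nonusers = cancer_rate(p for p in people if not p[1])

# Comparison within one level of the confounder (high exposure only)
rate_users_high = cancer_rate(p for p in people if p[0] and p[1])
rate_nonusers_high = cancer_rate(p for p in people if p[0] and not p[1])

print(f"all people:    users {rate_users:.3f} vs non-users {rate_nonusers:.3f}")
print(f"high exposure: users {rate_users_high:.3f} vs non-users {rate_nonusers_high:.3f}")
```

In the naive comparison sunscreen users show a clearly higher cancer rate, even though sunscreen does nothing in this simulation. Once we compare people with the same level of sun exposure, the gap largely disappears.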

Experimental Studies

The second type of statistical study is the experimental study. We use experimental studies to say something about cause and effect. Experimental studies entail the application of some intervention or treatment: the researcher is trying to change something about the world. The idea behind an experimental study is that subjects are randomly assigned to treatment and control groups. The treatment group receives the intervention; the control group does not. We also try to remove the effect of confounders by controlling the environmental conditions.
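
Random assignment itself is simple to sketch. The function name `randomly_assign` and the subject labels below are hypothetical, and the even split into two groups is an assumption for this example.

```python
import random


def randomly_assign(subjects, seed=None):
    """Shuffle the subjects and split them into treatment and control groups."""
    rng = random.Random(seed)
    shuffled = list(subjects)  # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


subjects = [f"subject_{i}" for i in range(1, 21)]
treatment, control = randomly_assign(subjects, seed=42)
print("treatment:", treatment)
print("control:  ", control)
```

Because chance alone decides who lands in which group, confounders such as age or lifestyle tend to be balanced across the two groups on average.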

Here is another example, involving neighborhoods and obesity. Women and children living in poor neighborhoods were randomly assigned to have an opportunity to live in wealthier neighborhoods. The experimenters were trying to alter people’s living conditions, so there is an intervention. The researchers gave people in poor neighborhoods an opportunity to live in a wealthier neighborhood and, having applied this intervention, examined obesity levels. They concluded that the opportunity to move from a neighborhood with a high level of poverty to one with a lower level of poverty was associated with modest but potentially important reductions in the prevalence of extreme obesity and diabetes.

So, they could make this claim about cause and effect, the cause being living in a poor neighborhood and the effect being obesity and diabetes levels. They could make this claim because they applied an intervention. Experiments do have some major disadvantages. First of all, they can be unethical, especially if they are conducted on vulnerable populations or animals, or if the treatment is known to be harmful.

Fundamental questions in data collection

There are two fundamental questions in data collection that determine the generalizability of a result. First, was the sample randomly selected from the population? Second, was the explanatory variable (the group membership) randomly assigned within the sample? Answering yes to both questions supports generalizability, while a no makes the study hard to generalize. Generalization simply means that the findings from the sample are applicable to the population. Recall that in statistical inference we draw a sample in order to say something about the population.

There is an important distinction between a sample of data and a population of data. Let’s take an example research question: does taking aspirin relieve pain among the elderly population in India? Another research question, about meditation, might be: does meditating 15 minutes each day improve happiness among college students in Beijing? In the first question, the population is people in India who have headaches, and the target population is all elderly people in India, above 60 years of age, who take aspirin.

Explanatory variable and response variable
A research study also involves two kinds of variables. If we are using one variable to help us understand or predict values of another variable, we call the former the explanatory variable and the latter the response variable. If we want to measure how much motivation affects getting good grades, then academic motivation at the beginning of the school year is the explanatory variable, and GPA at the end of the school year is the response variable. In the same way, if students want to use height to predict age, then the explanatory variable is height and the response variable is age.
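
To see the two roles in action, here is a minimal sketch that predicts the response from the explanatory variable with a least-squares line. The motivation scores and GPA values are invented for illustration, and fitting a straight line is just one simple choice of prediction method.

```python
from statistics import mean

# Hypothetical data: motivation score (explanatory) and end-of-year GPA (response)
motivation = [2.0, 3.5, 4.0, 5.5, 6.0, 7.5]
gpa = [2.1, 2.6, 2.9, 3.2, 3.4, 3.8]


def fit_line(x, y):
    """Least-squares slope and intercept for predicting y from x."""
    x_bar, y_bar = mean(x), mean(y)
    numerator = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    denominator = sum((xi - x_bar) ** 2 for xi in x)
    slope = numerator / denominator
    return slope, y_bar - slope * x_bar


slope, intercept = fit_line(motivation, gpa)
predicted_gpa = intercept + slope * 5.0  # prediction for a motivation score of 5
print(f"slope={slope:.3f}, intercept={intercept:.3f}, predicted GPA={predicted_gpa:.2f}")
```

Swapping the two lists would answer a different question, namely predicting motivation from GPA, which is why it matters which variable is treated as explanatory.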

Population

So, let’s talk about what a population is. A population is simply the entire collection of cases or observational units about which information is desired. Each of these questions has a target population. For example, if you are talking about the percentage of Buddhists in Japan, the population implied is all people in Japan.

Target Population

A target population is the collection of units a researcher is interested in: the group about which the researcher wishes to draw conclusions. Let’s look at some research questions and think about what the target population might be. Suppose we ask: what is the average number of words spoken daily among 10-year-old school children in Delhi? If you are a linguist asking that question, your implicit target population is all 10-year-old children who attend school in Delhi.

For another research question, suppose you are a sociologist and you want to know what people in India think about legalising the use of a certain drug, that is, whether they approve or disapprove of such a policy. The target population is all citizens of India, that is, all people who live in India.

In the same way, for the third research question, about meditation, the implied target population is all college students in Beijing. Each of these research questions implies a target population.

Launching Sports shoes for runners
We take a final example to differentiate between population and target population. Suppose a company is launching a new line of sports shoes designed specifically for runners. Here the population would be all individuals who wear shoes, regardless of whether they are runners or not. The target population (a subset of the population) for this product would be runners, particularly those who engage in activities such as jogging, sprinting, or marathons. It is useful to remember that the population encompasses a broader range of people than the target population.

Census

If a researcher wants to collect data on the entire population, that is called conducting a census. Censuses have a long history: they have existed for centuries, originally for the purpose of empire building. A state or empire collected information about all of its citizens in order to tax them, or to encourage the population to reproduce, because a growing population led to a stronger empire. So, when you conduct a census, you collect data on all people or cases. The data collected could include name, gender, age, birthplace, and the weapons kept in the household.

There are a few problems with a census. It is often difficult and expensive, because you have to collect data on everybody, including hard-to-reach groups such as undocumented workers and homeless people. There are also ethical concerns, because a census forces everybody to participate, which may violate a person’s privacy: someone may not want to participate, but a census has to collect information on every individual. Because a census is so expensive and difficult, many countries conduct one only once every 10 years.

Alternative to census: Sample

An alternative to conducting a census is to collect a sample of the population. Here we have a population and a sample, and what we are doing is collecting a subset of that population. We take samples all the time; it is not something only statisticians do. Suppose you are at a giant buffet with hundreds of different entrees. Most people will try some subset of the entrees before deciding what to eat. You might take some chicken or green beans, just a sample of the buffet, and based on that sample decide that you really liked the green beans and not the chicken, so you will have the green beans for your meal.

When you listen to music, you might scan a few radio stations and then settle on one. When making a decision or buying something, you might ask a few people before you decide. The underlying point is that sampling is quite natural: you take a sample of cases and use it to say something about a larger set of cases.

Sampling Steps

Let’s sum up the complete steps that we use to sample from a population. Let’s go back to our previous example of a company that is launching a new line of sports shoes for runners. The population for the research is all individuals who wear shoes, while the target population is runners or joggers. In practice we want to reach our target population, but we have to start from who we have access to. We could get access to runners through club memberships, directories, or Facebook groups; this list of people we can reach is called a sampling frame. We draw a sample from the sampling frame. Finally, the respondents are those who actually answered the questions asked for the research. On the left we have the complete sampling steps, from population to respondents.
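
The chain from population to respondents can be sketched as a series of filters. Everything numeric here is an assumption made up for the example: the population size, the share of runners, the share reachable through a club, and the response rate.

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical population: everyone who wears shoes
population = [
    {"id": i, "is_runner": rng.random() < 0.3, "in_running_club": False}
    for i in range(10_000)
]
for person in population:
    # Assumption: only some runners are reachable through a club directory
    if person["is_runner"]:
        person["in_running_club"] = rng.random() < 0.4

target_population = [p for p in population if p["is_runner"]]
sampling_frame = [p for p in target_population if p["in_running_club"]]
sample = rng.sample(sampling_frame, k=200)
# Assumption: each sampled person answers the survey with probability 0.6
respondents = [p for p in sample if rng.random() < 0.6]

for name, group in [
    ("population", population),
    ("target population", target_population),
    ("sampling frame", sampling_frame),
    ("sample", sample),
    ("respondents", respondents),
]:
    print(f"{name:>18}: {len(group)}")
```

Each step can only shrink the group, so conclusions about the target population are sound only to the extent that the respondents still resemble it.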

The Bigger Picture

A lot of statistics is really about going from a sample to a population: the idea is to make findings from a sample generalizable to the population. Keep in mind that the answer to a research problem is of little use if it applies only to the sample, which might typically contain 300 or 500 cases. The findings should be applicable to a wider audience, that is, the population.

Because samples are taken so frequently, we like to distinguish between descriptive and inferential statistics. Descriptive statistics is about organizing and presenting data from a sample or a population (you have population data when you have conducted a census). In descriptive statistics you just describe the data at hand. Inferential statistics is about drawing conclusions about a population based only on data from a sample. So, descriptive statistics describes the data you have, while inferential statistics says something about a population based on a sample of data.

Descriptive Statistics

Let’s talk a little bit about descriptive statistics. A lot of descriptive statistics is about presenting data, so you might present data in the form of tables or visual tools such as bar graphs. Another aspect of descriptive statistics is summarizing data, so you might compute the average height of respondents in a sample as a numerical summary of that sample. You might also look at the percentage of men in a country based on census data. The key point is that you are using numerical summaries to say something about the data at hand, whether it is a sample or a population.
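
As a small illustration, the summaries mentioned above can be computed with Python’s standard library. The heights and genders below are invented values for six respondents.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical sample of six respondents
heights_in = [64.5, 70.2, 68.0, 72.1, 66.3, 69.4]
genders = ["male", "female", "female", "male", "female", "female"]

average_height = mean(heights_in)   # numerical summary of a numerical variable
median_height = median(heights_in)
counts = Counter(genders)           # frequency table of a categorical variable
percent_male = 100 * counts["male"] / len(genders)

print(f"average height: {average_height:.2f} in")
print(f"median height:  {median_height:.2f} in")
print(f"percent male:   {percent_male:.1f}%")
```

Nothing here goes beyond the six respondents: the numbers describe the data at hand and make no claim about a wider population.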

Inferential Statistics

Inferential statistics involves estimating. For example, you might use the average height from a sample to estimate the average height in a population. Inferential statistics often also involves hypothesis testing, where you use a sample of data to test some claim. For example, you might hypothesize that the average height in a population is 72 inches, so you collect a sample and test that claim using the sample data. We define statistical inference as the process of using data from a sample to gain information about the population.
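
Both ideas, estimation and hypothesis testing, can be sketched with the standard library. The population below is simulated, with an assumed true mean of 69 inches, and the 95% interval and z-test use a normal approximation, which is one common approach rather than the only one.

```python
import random
from statistics import NormalDist, mean, stdev

rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical population of heights in inches; in practice it is unobserved
population = [rng.gauss(69.0, 3.0) for _ in range(50_000)]
sample = rng.sample(population, k=400)

n = len(sample)
x_bar = mean(sample)
standard_error = stdev(sample) / n ** 0.5

# Estimation: approximate 95% confidence interval for the population mean
z = NormalDist().inv_cdf(0.975)
ci = (x_bar - z * standard_error, x_bar + z * standard_error)

# Hypothesis test: is the population mean 72 inches?
claimed_mean = 72.0
z_stat = (x_bar - claimed_mean) / standard_error
p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))

print(f"sample mean: {x_bar:.2f}")
print(f"95% CI:      ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"z statistic: {z_stat:.1f}, p-value: {p_value:.4f}")
```

A very small p-value means a sample like this one would be very unlikely if the population mean really were 72 inches, so the claim would be rejected.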

Sample Statistics and Population Parameters: Notation for sample and Population

When we talk about samples and populations, we like to distinguish between sample statistics and population parameters. A sample statistic, or statistic for short, is a numerical summary based on a sample. For example, the average weight in a sample of hospital patients in London is a statistic. A parameter is a numerical summary of a population; sometimes we call these population parameters. For example, the average weight of all hospital patients in London is a parameter.

It is also useful to distinguish between sample size and population size, so that you keep track of whether you have a sample or a population. If your data set is a sample from a population, the number of cases (the total number of rows in your data) is labelled with a lowercase n. If your data set consists of the entire population, the number of cases is labelled with a capital N. This is just to remind ourselves whether we are dealing with a population or a sample.

Sample statistics, which are numerical summaries of a sample of data, are usually but not always expressed in Latin letters. For example, the average from a sample is by convention labelled x̄, pronounced “x bar”. Population parameters are generally expressed in Greek letters. For example, the average of a population is denoted by the Greek letter μ (mu). So, in general we distinguish between population parameters, expressed in Greek letters, and sample statistics, expressed in Latin letters.
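
The notation maps naturally onto variable names. In the sketch below the patient weights are simulated, not real, and the names `N`, `mu`, `n` and `x_bar` simply mirror the symbols described above.

```python
import random
from statistics import mean

rng = random.Random(3)  # fixed seed so the run is repeatable

# Hypothetical population: weights (kg) of all hospital patients
population_weights = [rng.uniform(50, 110) for _ in range(5_000)]
N = len(population_weights)   # population size
mu = mean(population_weights)  # population parameter

sample_weights = rng.sample(population_weights, k=100)
n = len(sample_weights)        # sample size
x_bar = mean(sample_weights)   # sample statistic

print(f"N = {N}, mu = {mu:.2f}")
print(f"n = {n}, x_bar = {x_bar:.2f}")
```

In a real study only `n` and `x_bar` would be known. `mu` is computable here only because the whole population was simulated.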

Data are a set of observations. These observations can be collected in various ways, such as field notes, surveys, or experiments. There are lots of different ways to collect data about the world around us. In the social sciences and biology we often have data on people, but in subjects like physics the observations may not be about people.

In statistics we have a very transparent way of organising data: we organise data into something called a dataset or data frame. The rows of the dataset are called cases or observational units; sometimes rows are simply referred to as observations. The columns of the dataset are called variables. The variables are the characteristics or traits of the cases. To put it another way, each row is a different person, state, or other kind of observational unit, and each column is some characteristic or trait of each of the observational units.
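
A tiny dataset can be written as a list of rows, one dictionary per case. The names and values are invented, and this plain-Python layout is only a stand-in for the data frames provided by statistical software.

```python
# Each row is one case (observational unit); each key is a variable
dataset = [
    {"name": "Asha", "age": 34, "height_in": 64.5},
    {"name": "Ben", "age": 27, "height_in": 70.2},
    {"name": "Chen", "age": 45, "height_in": 68.0},
]

n_cases = len(dataset)               # number of rows
variables = list(dataset[0].keys())  # column names
ages = [row["age"] for row in dataset]  # one column: the values of one variable

print(n_cases, variables, ages)
```

Reading across a row describes one person, while reading down a column gives all the values of a single variable.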

There are two main types of data, and two subtypes for each of these main types. We can think of variables as being either quantitative or qualitative or, to put it another way, numerical or categorical.

Numerical Variable: Continuous Variable Vs Discrete Variable

Numerical variables take values that are numbers. For a numerical variable, it usually makes sense to add, subtract, or take numerical summaries such as the average of those values. Some examples would be the age of the respondent, the miles per gallon of a car, or the ranking of different colleges.

For numerical variables, it is useful to distinguish further between continuous and discrete numerical variables. A continuous numerical variable can take any number within a range. Examples include height in inches, the percentage of children having dyslexia, and the number of miles moved away from a childhood home. Each of those variables is continuous in the sense that it can take decimal values. So, height in inches might be 70.23 inches.

In contrast, discrete numerical variables can take only non-negative counting numbers as values: 0, 1, 2, 3, 4, 5 and so on. These variables are discrete in that there are jumps between the values they can take. A discrete numerical variable cannot take the value 1.5; it does not make sense. One example is the number of votes for a politician. When you count votes, the totals are non-negative counting numbers: a politician might get 3 votes or 500 votes, but a politician cannot get 3.5 votes.

To take a different example, the rating of a restaurant on a five-point scale, where the scale consists of just 1, 2, 3, 4 and 5, is a discrete numerical variable. Another example would be the count of arrivals at an airport on a given day. That is a discrete numerical variable: you might have 500 arrivals at an airport, but you cannot have 500.5. So, the key distinction between continuous and discrete numerical variables is whether they can take all sorts of numbers or only non-negative counting numbers.

Exercise: let’s go through a few examples. The annual household income of a respondent in thousands of US dollars is an example of a continuous numerical variable, because dollar amounts can take decimal places, like $10, $101.65, or $108.77. A rating of 1 to 4 stars for a movie, assuming that the stars can take on the values 1, 2, 3 or 4, is a discrete numerical variable. To give another example, the number of murders last year in a neighborhood would be a discrete numerical variable, because that kind of variable can only take on the values 0, 1, 2, 3, 4 and so on.
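
A rough check for the distinction can be written as a one-line heuristic. The function name `looks_discrete` and the sample values are hypothetical, and the check only inspects the values it is given: a continuous variable that happens to be recorded in whole numbers would fool it, so what the variable can take in principle still has to be judged by a person.

```python
def looks_discrete(values):
    """Heuristic: True when every value is a non-negative whole number."""
    return all(v >= 0 and float(v).is_integer() for v in values)


votes = [3, 500, 0, 42]            # counts of votes
heights_in = [70.23, 64.5, 68.0]   # heights in inches
star_ratings = [1, 4, 2, 3]        # ratings on a 1 to 4 scale

print(looks_discrete(votes))         # counting numbers only
print(looks_discrete(heights_in))    # contains decimal values
print(looks_discrete(star_ratings))
```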

Categorical Variable

Besides numerical variables and their two subtypes, continuous and discrete, the other main type of variable that you will encounter is the categorical variable. Categorical variables, sometimes called qualitative variables, take on values that are categories or qualities. Here are some examples of categorical variables: the gender of a respondent, which in general could be male or female, so you have these two categories; approval or disapproval of a social policy; or whether somebody is a regular smoker or not, so yes or no. These variables are categorical in the sense that there are different categories or qualities associated with them.

Nominal Vs Ordinal
The two subtypes of categorical variables are nominal and ordinal. For nominal categorical variables there is no inherent order to the categories or levels. Examples are religious identification, political party affiliation, and the names of countries you have visited in the past year. Each of these is a nominal categorical variable because there is no inherent order: it does not make sense to say that, for religious identification, Catholic is somehow higher or lower than Jewish.

For ordinal categorical variables there is usually an expected ordering: a natural, inherent order to the different categories or levels. Examples are the highest educational degree a person has attained, or the level of approval of a country’s economic policy, from strongly disapprove to strongly approve. Each of these ordinal variables has an ordering to it. When you talk about the highest educational degree someone has obtained, a graduate degree is higher than a high school degree. When you talk about approval, strongly approve is the topmost category, below that is somewhat approve, then somewhat disapprove, and finally strongly disapprove. For ordinal categorical variables it makes sense to order the different levels. However, it is not always clear cut, because one person’s idea of the natural ordering might be culturally specific: to a different person or in a different society, there may be no obvious inherent ranking of the levels of a categorical variable. So, if you are unclear about whether a variable is nominal or ordinal, you should probably stick with nominal, because with an ordinal variable there is a certain subjective component to saying that one level is higher than another.
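
In code, the practical difference is that an ordinal variable carries an explicit ranking of its levels, while a nominal variable can only be counted. The responses and party names below are made up, and the four approval levels are an assumed scale for this sketch.

```python
from collections import Counter

# Ordinal: the levels have an agreed order, lowest to highest
APPROVAL_LEVELS = [
    "strongly disapprove",
    "somewhat disapprove",
    "somewhat approve",
    "strongly approve",
]
rank = {level: i for i, level in enumerate(APPROVAL_LEVELS)}

responses = ["somewhat approve", "strongly disapprove",
             "strongly approve", "somewhat approve"]
sorted_responses = sorted(responses, key=rank.__getitem__)
highest = max(responses, key=rank.__getitem__)

# Nominal: no order exists, so counting categories is all we can do
parties = ["Party A", "Party B", "Party A", "Party C"]
party_counts = Counter(parties)

print(sorted_responses)
print(highest)
print(party_counts)
```

Sorting the party names alphabetically would of course run, but the resulting order would carry no meaning, which is exactly what makes the variable nominal.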


Here is an example that helps to illustrate the different types of variables.

The Challenger disaster occurred on January 28, 1986, during the launch of the Space Shuttle Challenger. The Space Shuttle Challenger was a reusable spacecraft, and prior to the disaster, it had completed nine successful missions. However, on that fateful day, various factors came into play that would lead to a catastrophic event.

On the day of the launch, unusually cold temperatures were recorded at Kennedy Space Center: the outside temperature was 31°F, whereas previous shuttles had been launched at temperatures between 53°F and 81°F. The shuttle’s boosters were not designed to operate in such low temperatures. The Challenger’s solid rocket boosters were equipped with O-rings, which were intended to seal the joints between the rocket’s segments. However, these O-rings became brittle in the cold weather, increasing the risk of failure.

Statistical engineers expressed concerns about launching in such cold conditions. They presented data showing a correlation between low temperatures and O-ring failure. The statistical model showed an association between cold temperatures and O-ring failures, but the evidence was not conclusive. The scientists ignored the statistical finding.

Despite these concerns, NASA managers faced significant pressure to proceed with the launch. The decision to launch was made on the grounds that the data were incomplete and inconclusive. Just 73 seconds into the flight, the shuttle broke apart, leading to the deaths of all seven crew members. This chart illustrates the correlation between low temperatures and O-ring issues. The data clearly indicate a relationship, but correlation does not imply causation.

Statistics as a process of scientific investigation

We can think of statistics as a process of scientific investigation. The first step is to identify a question or problem, that is, the basic problem that the researcher is trying to solve. In the space shuttle example, the question is whether the shuttle can be launched successfully and, in more detail, how O-ring failure is related to temperature.

The second step is data collection, which refers to the various ways in which a researcher collects relevant data related to the question. Here the data collected are the outcomes of space shuttle launches at different temperatures. In the space shuttle case the data had a small sample size, and there were no data points below 53°F.

The third step is to analyze the data. We use a scatter plot to show the data points, with temperature on the X axis and, on the Y axis, the probability of success, where 0 indicates failure and 1 success. To analyze the data we generally use statistical software and run some tests on the data at hand. In the fourth and last step, we draw inferences or conclusions from the data and try to make sense of the tests that we ran.

We see statistics as a general process of scientific investigation. It is the study of how to collect data, analyze it, and draw conclusions from it. The goal of statistics is to take the general process of scientific investigation and improve it, especially steps 2, 3 and 4 (data collection, data analysis, and conclusion), making it more reliable, reproducible, and rigorous. The space shuttle example highlights how crucial the use of statistics can be. In day-to-day life we pose many questions, and using data to reach a conclusion can push the odds in our favor and help us make wise decisions.