Lesson 18: What’s Your Z-Score?
Objective
You will understand that a z-score measures how far an observation lies from the mean, in units of standard deviations. Z-scores usually range between -3 and +3. For simulations involving shuffling, if you compute a z-score that lies far from 0, then you might conclude that the outcome was not due to chance. If you see a z-score that lies close to 0, then you might conclude that it was due to chance.
Vocabulary
Empirical Rule, z-score, standardized score
Essential Concepts
Lesson 18 Essential Concepts
Z-scores offer us a way to measure how extreme a value is, regardless of the units of measurement. Z-scores usually range between -3 and +3, so values that lie 3 or more standard deviations from the mean, in either direction, are considered extremely rare.
Lesson
-
Answer the following question in your IDS Journal:
What do you remember about normal distributions? What are some real-life examples of variables that produce normal distributions?
-
Another characteristic of normal distributions is given in the statement, “All normal distributions are bell-shaped, but not all bell-shaped distributions are normal.” Normal distributions have special properties.
-
Some of those special properties are stated by the Empirical Rule:
• Approximately 68% of the observations in a normal distribution fall within one standard deviation of the mean
• Approximately 95% of the observations in a normal distribution fall within 2 standard deviations of the mean
• Approximately 99.7% of the observations in a normal distribution fall within 3 standard deviations of the mean
Insert image to illustrate the Empirical Rule.
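The percentages in the Empirical Rule can be checked numerically. The sketch below uses base R's pnorm(), the normal cumulative distribution function; the helper name within_k_sd is an illustrative assumption, not part of the lesson's code.

```r
# Proportion of a normal distribution within k standard deviations of the mean.
# pnorm(k) is the proportion below k standard deviations above the mean.
within_k_sd <- function(k) pnorm(k) - pnorm(-k)

empirical_rule <- sapply(1:3, within_k_sd)
names(empirical_rule) <- c("1 sd", "2 sd", "3 sd")
round(empirical_rule, 4)
# 1 sd: 0.6827, 2 sd: 0.9545, 3 sd: 0.9973
```

The rounded values 68%, 95%, and 99.7% quoted in the rule are approximations of these proportions.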
-
Open RStudio and load the cdc data. Use the code provided to do the following:
-
Create a new variable and call it height_in:
> cdc <- mutate(cdc, height_in = height * 39.3701)
-
Subset the data for the males:
> males <- filter(cdc, gender == "Male")
-
Create a histogram for the new variable, height_in:
> histogram(~height_in, data = males, nint = 30)
-
-
Answer the following questions in your IDS Journal:
-
Does the distribution of teenage male heights look approximately normal? Explain.
-
What do you estimate the mean height of the distribution to be? The standard deviation?
-
-
Use RStudio to calculate:
-
The mean:
> mean(~height_in, data = males)
-
The standard deviation:
> sd(~height_in, data = males)
Compare your approximations from 5b to the actual values.
-
-
In your IDS Journal, draw a number line with seven equally spaced tick marks and label it Teen male height in inches. Make sure you leave about 5 centimeters of space above the number line to draw a normal curve. Label the middle tick mark with the mean male height (rounded to the nearest tenth of an inch: 69 inches). See the example below:
Insert number line example.
Then answer the following questions in your IDS Journal:
-
What height is one standard deviation above the mean? To obtain your answer, add the standard deviation value to the mean value.
-
What height is one standard deviation below the mean? To obtain your answer, subtract the standard deviation value from the mean value.
Label your number line with these values.
-
-
Continue filling your number line with the corresponding heights that are 2 and 3 standard deviations from the mean.
-
According to the Empirical Rule, if the distribution of male heights is approximately normal, about 68% of males should be between 65.6 inches and 72.4 inches tall.
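As a rough check on the number line labels, the sketch below computes the heights that are 1, 2, and 3 standard deviations from the mean. It assumes the rounded values used in this lesson (mean of 69 inches, standard deviation of 3.4 inches); the variable names are illustrative.

```r
mean_height <- 69   # rounded mean male height in inches
sd_height <- 3.4    # standard deviation implied by the 65.6 to 72.4 interval

k <- -3:3           # number of standard deviations from the mean
tick_labels <- mean_height + k * sd_height
names(tick_labels) <- paste(k, "sd")
tick_labels
# 58.8 62.2 65.6 69.0 72.4 75.8 79.2
```

The values at -1 sd and +1 sd are the 65.6 and 72.4 inches used in the Empirical Rule statement above.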
-
Use RStudio to confirm whether or not the distribution of male heights is approximately normal. The following code will subset the males whose height is between 65.6 inches and 72.4 inches tall:
> one_sd_males <- filter(males, height_in > 65.6, height_in < 72.4)
There are _ males in this sample of 7749 males whose heights are within one standard deviation of the mean, so _/7749 = __. This means that around ___% of males’ heights in this sample fall within one standard deviation of the mean male height. This is close to 68%, so it seems that the distribution of male heights is approximately normal.
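The count and proportion can be computed directly rather than by hand. The sketch below is self-contained, so it uses simulated heights as a hypothetical stand-in for males$height_in; the numbers it produces are not the results for the cdc data.

```r
set.seed(18)
# Hypothetical stand-in for males$height_in (the real values come from the cdc data).
height_in <- rnorm(7749, mean = 69, sd = 3.4)

# TRUE for each height strictly between 65.6 and 72.4 inches.
within_one_sd <- height_in > 65.6 & height_in < 72.4

n_within <- sum(within_one_sd)       # count of heights within one sd
prop_within <- mean(within_one_sd)   # count divided by the total number of heights
n_within
prop_within
```

With the actual data, the same proportion could be obtained as nrow(one_sd_males) / nrow(males).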
-
Now that you have verified that a normal distribution is an appropriate model for this distribution, draw a normal curve above the number line from step 7. Below is a suggested method for obtaining a decent normal curve:
• Step 1: Draw a dot 4 centimeters above the mean height
• Step 2: Draw dots 2.4 cm above the heights that are 1 standard deviation from the mean
• Step 3: Draw dots 0.54 cm above the heights that are 2 standard deviations from the mean
• Step 4: Draw dots right above the number line for the heights that are 3 standard deviations from the mean
• Step 5: Connect the dots with a smooth curve
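For reference, dot heights like these can be derived from the normal density itself. The sketch below rescales the standard normal density so that its peak is 4 cm and reports the height at 0, 1, 2, and 3 standard deviations from the mean; the 4 cm peak comes from Step 1, and the variable names are illustrative.

```r
peak_cm <- 4   # height of the dot drawn above the mean
k <- 0:3       # number of standard deviations from the mean

# dnorm(k) / dnorm(0) is the height of the curve relative to its peak.
dot_heights_cm <- peak_cm * dnorm(k) / dnorm(0)
round(dot_heights_cm, 2)
# 4.00 2.43 0.54 0.04
```

Because the curve is symmetric, the same heights apply on both sides of the mean.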
-
You're using this normal curve as a model to represent the distribution of all teenage male heights. This will allow you to make comparisons, draw conclusions, and make predictions about male heights. Answer the following questions in your IDS Journal:
-
What proportion of teenage males are shorter than 69 inches? Explain.
-
What proportion of teenage males are between 69 and 72.4 inches tall?
-
What proportion of males are taller than 72.4 inches?
-
-
You will now investigate the distribution of teenage female heights. Use RStudio to run each of the following functions. Once you've run the second function, check if the distribution of teen female heights looks approximately normal. The approach you'll take to verify whether it is or not is to overlay the histogram with a normal curve. That can be done by running the third function.
> females <- filter(cdc, gender == "Female")
> histogram(~height_in, data = females)
> histogram(~height_in, data = females, fit = "normal")
> mean(~height_in, data = females)
> sd(~height_in, data = females)
-
Repeat steps 7, 8, and 11 with the distribution of teenage female heights.
-
Statisticians use something called a z-score to compare values. A z-score tells us how many standard deviations above or below the mean an observation is. Another name for z-score is a standardized score.
-
Observe the formula below for calculating a z-score, where $z$ represents the z-score, $x$ represents the value of an observation, $\bar{x}$ represents the mean of the observations, and $s$ represents the standard deviation.
$$z = \frac{x - \bar{x}}{s}$$
The following examples demonstrate how to find the z-score for a female height and a male height:
Insert examples how to find the z-score for a female height and a male height.
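As a minimal sketch of the calculation, a z-score is the observation minus the mean, divided by the standard deviation. The function name z_score is an illustrative assumption. The example uses the rounded male values from this lesson (mean of 69 inches, standard deviation of 3.4 inches); for a female height, substitute the mean and standard deviation computed from the females data.

```r
# z = (x - x_bar) / s
z_score <- function(x, x_bar, s) (x - x_bar) / s

# A teen male who is 70 inches tall.
z_tall <- z_score(70, x_bar = 69, s = 3.4)
round(z_tall, 2)
# 0.29

# A teen male who is 65.6 inches tall (one standard deviation below the mean).
z_short <- z_score(65.6, x_bar = 69, s = 3.4)
round(z_short, 2)
# -1
```

The first height is about 0.29 standard deviations above the mean, and the second is one standard deviation below it.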
-
Z-scores answer the question, "How typical is x?" If x is the same as the typical value (the mean), then z = 0. If x is one standard deviation away from the mean, then z = -1 if it is below, or +1 if it is above. Recall from the normal curve that as you move farther from the center (or the mean), there are fewer observations. Therefore, a z-score that is far from 0, in either direction, indicates an unusual value.
-
Calculate your z-score and record it in your IDS Journal. Then answer the following:
-
What does a negative z-score mean? A negative z-score means the x value is _ the mean. This means that the height is __ average.
-
What does a positive z-score mean?
-
What is the most negative z-score you think we will find? What is the most positive z-score?
-
-
Where do you fall within the distribution of height for your gender? Find your height (in inches) on the x-axis of the normal curve corresponding to your gender, then draw a vertical line from the x-axis until it intersects the normal curve. Shade the area under the curve to the left of the vertical line.
-
The shaded area represents your percentile in the distribution. A percentile is the value below which a given percentage of the observations in a distribution fall. For example, with regard to people’s heights, the 70th percentile is the height that is taller than 70% of the observations.
Now use RStudio to find the percentile for the height for a teen male who is 70 inches tall.
> pnorm(70, mean = 69, sd = 3.4)
The result is approximately 0.615666.
Use the following sentence frame to interpret the percentile:
This male student is at the _ percentile in the distribution of teen male heights. That means that he is taller than ___ of all teen males, but shorter than _ of all teen males.
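The percentile can also be reached through the z-score. The sketch below, which assumes the same rounded mean (69) and standard deviation (3.4), shows that pnorm() applied to the height with the mean and standard deviation supplied gives the same proportion as pnorm() applied to the z-score with its default mean of 0 and standard deviation of 1. The variable names are illustrative.

```r
height <- 70
x_bar <- 69
s <- 3.4

# Proportion of the normal model below 70 inches, on the original scale.
p_raw <- pnorm(height, mean = x_bar, sd = s)

# The same proportion, computed from the z-score on the standard normal scale.
z <- (height - x_bar) / s
p_z <- pnorm(z)

all.equal(p_raw, p_z)   # TRUE
round(100 * p_raw, 1)   # percent of the model below 70 inches: 61.6
round(100 * (1 - p_raw), 1)   # percent of the model above 70 inches: 38.4
```

This is why z-scores allow comparisons across distributions with different means and standard deviations: once a value is standardized, a single normal curve applies.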
-
Calculate and interpret your percentile in the distribution of height for your gender. You will be using RStudio during the next few days to practice using normal models.
Reflection
What are the essential learnings you are taking away from this lesson?
Next 2 Days
LAB 2I: R’s Normal Distribution Alphabet
Complete Labs 2H and 2I prior to the End of Unit Project.