Lecture 3 : Normal Distributions

Check-In Here

Look at the distribution below.

  • What’s the shape of this distribution? What does this shape tell you?

  • What would you consider to be an outlier in the distribution? Why?

d <- read.csv("../datasets/Protestant Work Ethic/data.csv", sep = "\t")

hist(d$Q1E, breaks = 100, 
     main = "RT for Answering Q1",
     xlab = "Response Time (RT) in ms",
     col = 'black', bor = 'white')

# abline(v = mean(d$Q1E, na.rm = T), col = 'red', lwd = 5)

Question: What other information would you want to know to determine who is an outlier on this question? (And why would this information be relevant?)

  • student response goes here.
  • student response goes here.
  • student response goes here.
  • student response goes here.

IN R : finding and removing outliers
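
One minimal sketch of a rule we could use (the 3-SD cutoff is an assumption on my part, not an official class rule): flag response times more than 3 standard deviations above the mean and set them to NA before re-plotting.

## one possible rule : anything > 3 SDs above the mean is an outlier
cutoff <- mean(d$Q1E, na.rm = T) + 3*sd(d$Q1E, na.rm = T)
sum(d$Q1E > cutoff, na.rm = T) # how many responses get flagged?
d$Q1E.clean <- ifelse(d$Q1E > cutoff, NA, d$Q1E) # replace flagged RTs with NA

hist(d$Q1E.clean, breaks = 100,
     main = "RT for Answering Q1 (Outliers Removed)",
     xlab = "Response Time (RT) in ms",
     col = 'black', bor = 'white')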

Announcements & Agenda

  • Brain Exam : in two weeks.

  • Labs :

    • keys posted. use the keys to learn / ask questions.

    • Quarto… a bit of a mess.

    • like seeing questions on the Discord.

  • R Script For Today

  • Goal : understand why and how the normal distribution is “normal”; understand the mean as a prediction of the sample vs. a prediction of the population; and learn one method of estimating how well the mean describes the population.

    • 9:10 - 9:30 | Check-In and Removing Outliers

    • 9:30 - 10:00 | Lab 2 Review

    • 10:00 - 10:30 | Mean and Normal Distributions

    • 10:30 - 10:40 | BREAK TIME

    • 10:40 - 11:00 | Presentation

    • 11:00 - 12:00 | Sampling Error

RECAP : Lab 2

Loading a Different Kind of Dataset

  • Professor did not give good instructions on…

    • …how to turn the 0s into NAs

    • …how to load data that are not comma separated

  • Professor Method

## Loading Dataset
selfes <- read.csv("../datasets/Self-Esteem Dataset/data.csv",
                   stringsAsFactors = T,
                   na.strings = "0", # this is how you turn the 0s into NAs
                   sep = "\t") # and how you load tab-separated data
  • Other Methods (or processes for figuring things out?)

    • student examples go here maybe

Creating a Scale [skipping for time…see key?]

  1. Organize your items; reverse-score; evaluate reliability.
## Creating the Scale
poskey.df <- selfes[,c(1:2,4,6,7)] # pos-keyed items (from the codebook)
negkey.df <- selfes[,c(3,5,8:10)] # neg-keyed items (from the codebook)
negkeyR.df <- 5-negkey.df # reverse scoring the neg-keyed items
SELFES.DF <- data.frame(poskey.df, negkeyR.df) # bringing it all 2gether.

library(psych) # loading the library
alpha(SELFES.DF) # alpha reliability.

Reliability analysis   
Call: alpha(x = SELFES.DF)

  raw_alpha std.alpha G6(smc) average_r S/N     ase mean  sd median_r
      0.91      0.91    0.92      0.52  11 0.00058  2.6 0.7     0.52

    95% confidence boundaries 
         lower alpha upper
Feldt     0.91  0.91  0.91
Duhachek  0.91  0.91  0.91

 Reliability if an item is dropped:
    raw_alpha std.alpha G6(smc) average_r  S/N alpha se  var.r med.r
Q1       0.90      0.90    0.91      0.51  9.5  0.00064 0.0089  0.51
Q2       0.91      0.91    0.91      0.52  9.7  0.00063 0.0085  0.52
Q4       0.91      0.91    0.91      0.53 10.3  0.00061 0.0081  0.53
Q6       0.90      0.90    0.90      0.50  9.2  0.00067 0.0087  0.51
Q7       0.90      0.90    0.91      0.51  9.3  0.00066 0.0089  0.51
Q3       0.90      0.90    0.91      0.51  9.3  0.00066 0.0094  0.51
Q5       0.90      0.91    0.91      0.52  9.6  0.00065 0.0098  0.51
Q8       0.91      0.91    0.92      0.54 10.7  0.00059 0.0064  0.54
Q9       0.90      0.91    0.91      0.52  9.6  0.00065 0.0085  0.52
Q10      0.90      0.90    0.90      0.51  9.3  0.00067 0.0086  0.51

 Item statistics 
        n raw.r std.r r.cor r.drop mean   sd
Q1  47876  0.76  0.77  0.75   0.70  3.0 0.87
Q2  47658  0.73  0.74  0.71   0.66  3.1 0.79
Q4  47751  0.66  0.68  0.62   0.59  2.9 0.81
Q6  47809  0.81  0.81  0.79   0.75  2.6 0.92
Q7  47758  0.79  0.79  0.77   0.74  2.4 0.93
Q3  47751  0.79  0.79  0.76   0.73  2.7 0.95
Q5  47781  0.76  0.76  0.72   0.69  2.6 0.98
Q8  47797  0.64  0.63  0.56   0.54  2.3 0.96
Q9  47728  0.76  0.75  0.73   0.69  2.2 0.99
Q10 47772  0.81  0.80  0.78   0.74  2.4 1.07

Non missing response frequency for each item
       1    2    3    4 miss
Q1  0.06 0.18 0.44 0.32 0.00
Q2  0.04 0.13 0.50 0.33 0.01
Q4  0.05 0.21 0.50 0.24 0.00
Q6  0.14 0.33 0.37 0.17 0.00
Q7  0.18 0.34 0.35 0.14 0.00
Q3  0.13 0.28 0.37 0.22 0.00
Q5  0.14 0.32 0.32 0.22 0.00
Q8  0.21 0.41 0.24 0.14 0.00
Q9  0.27 0.40 0.20 0.14 0.01
Q10 0.24 0.33 0.22 0.22 0.00
  2. Average the items into one variable; graph & describe.
selfes$SELFES <- rowMeans(SELFES.DF, na.rm = T) # creating the scale
hist(selfes$SELFES, col = 'black', bor = 'white', # the graph
     main = "Histogram of Self-Esteem", 
     xlab = "Self-Esteem Score", breaks = 15)

Mean is a Prediction (of the Sample)

lils <- selfes[sample(1:nrow(selfes), 100),] # 100 random data points
plot(lils$SELFES, 
     ylab = "Self-Esteem (100 Points)",
     xlab = "Index") 

plot(lils$SELFES,
     ylab = "Self-Esteem (100 Points)",
     xlab = "Index") 
abline(h = mean(lils$SELFES, na.rm = T), lwd = 5)

## quantifying errors (residuals)
residuals <- selfes$SELFES - mean(selfes$SELFES, na.rm = T)
SST <- sum(residuals^2, na.rm = T)
SST
[1] 23401.4
SST/length(residuals) # average of squared residuals (variance)
[1] 0.4877935
sqrt(SST/length(residuals)) # square root of the average squared residual (standard deviation)
[1] 0.6984221
sd(selfes$SELFES, na.rm = T) # R's built-in standard deviation, for comparison
[1] 0.6987572
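
Why don’t the last two numbers match exactly? Two small reasons (a sketch, counting only non-missing values): sd() divides by n - 1 rather than n, and length(residuals) counts the NA rows.

n <- sum(!is.na(residuals)) # n = the non-missing residuals only
sqrt(SST/(n - 1)) # dividing by n - 1 reproduces sd() above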

The “Normal” Distribution

When do we see a “normal” distribution?

  1. When “life is complex” (multiple influences on an outcome).
  2. When that complexity is independent.

Discussion : Why is this variable almost Normal?

  1. What are the multiple & independent influences on people’s self-esteem?
    1. cultural values
    2. social environment (community)
    3. other personality factors
    4. life experiences (bullied; or you were the bully and feel AWESOME about it)
    5. physiological factors
    6. mental health
    7. you’re rich (not broke & struggling)
    8. family
  2. What are some of the non-independent influences on people’s self-esteem (that might make this not perfectly normal)?
    1. all AMERICAN and in our CULTURE we VALUE self-esteem
    2. people with lower self-esteem may be less likely to take the survey (sampling bias)
hist(selfes$SELFES, col = 'black', bor = 'white')

mean(selfes$SELFES, na.rm = T)
[1] 2.629648

Activity : Let’s simulate a normal distribution in R

Goal : define an object called “life” that simulates what 1000 people’s lives look like, if each life is the sum of 10 random coin flips (heads = 0, tails = 1).

Code you will need :

  • coinflip <- c(0,1) # defining a coin-flip.

  • sample(x, n) # randomly sample n values from x (without replacement, by default)

  • replicate(n, expr) # to repeat an expression n times

  • for(){} # our good friend the for loop.

Professor Code Goes Here.

coinflip <- c(0,1)
coinflip
[1] 0 1
sample(1:10, 1)
[1] 3
sample(coinflip, 1)
[1] 0
life <- array()

x <- replicate(10, sample(coinflip, 1))
sum(x)
[1] 7
life[3] <- sum(x)
life
[1] NA NA  7
life <- array() # WHAT HAPPENS IF I DO THIS????
for(i in c(1:1000)){
  x <- replicate(10, sample(coinflip, 1))
  life[i] <- sum(x)
}

life
   [1]  3  7  3  6  7  6  6  4  6  4  6  5  5  6  6  6  6  5  7  4  3  7  6  6
  [25]  4  5  5  5  6  6  2  5  4  4  4  6  8  7  4  7  4  2 10  4  5  4  4  3
  [49]  2  7  6  7  5  4  2  5  6  3  5  9  5  4  5  6  3  6  6  6  3  6  4  2
  [73]  8  6  7  5  4  5  3  4  6  7  3  4  5  0  5  4  5  6  6  8  8  4  3  3
  [97]  4  5  3  4  6  6  5  4  5  6  6  6  4  4  6  3  5  5  3  7  3  2  6  5
 [121]  7  3  5  4  5  3  7  7  6  2  5  2  4  4  6  6  4  3  5  6  4  8  4  2
 [145]  5  3  5  5  3  5  4  1  5  5  3  3  5  7  5  7  6  5  6  6  5  3  3  6
 [169]  5  4  8  5  5  4  3  5  7  5  6  3  5  8  4  5  5  7  5  4  5  7  6  6
 [193]  6  4  5  4  7  4  8  6  5  2  4  6  5  3  6  4  3  2  6  6  4  5  6  3
 [217]  6  4  4  2  5  3  6  6  3  5  5  0  7  6  3  6  6  3  5  3  7  7  2  5
 [241]  5  6  4  3  7  4  4  7  7  6  3  5  5  6  5  7  4  4  2  5  7  2  5  5
 [265]  5  7  5  6  5  8  1  6  4  6  5  4  6  3  7  6  6  4  7  6  3  4  7  7
 [289]  6  6  6  6  5  4  7  4  5  4  6  7  5  5  2  8  4  3  7  5  6  7  4  5
 [313]  6  6  5  3  4  5  6  7  6  4  6  5  6  6  4  3  6  6  5  3  5  8  4  4
 [337]  6  6  3  3  5  4  6  3  5  4  7  2  7  6  6  2  3  8  5  6  5  3  2  7
 [361]  5  2  4  4  2  5  5  3  4  4  6  4  7  5  7  4  2  6  7  4  6  3  8  5
 [385]  2  5  3  6  5  3  5  5  6  5  7  4  5  3  3  4  3  4  5  2  6  5  4  7
 [409]  6  3  4  6  6  4  4  5  6  5  4  4 10  5  6  5  6  5  6  7  3  4  3  6
 [433]  4  5  5  3  6  6  8  4  4  5  6  8  6  6  5  6  4  3  4  5  3  2  9  3
 [457]  7  5  6  4  7  4  6  2  5  4  5  3  6  6  8  7  6  4  5  3  8  5  5  4
 [481]  5  6  5  5  4  4  5  5  6  6  6  4  9  2  5  5  6  4  3  5  6  5  7  7
 [505]  3  4  6  7  6  3  4  4  5  3  6  6  4  7  2  4  5  2  5  3  5  6  3  4
 [529]  4  7  4  5  5  6  7  6  6  6  5  9  4  3  3  5  8  2  5  6  2  3  3  6
 [553]  8  4  6  3  6  3  4  3  9  6  6  4  8  3  6  8  7  5  5  7  4  6  4  6
 [577]  7  2  3  7  6  7  3  5  4  6  6  5  2  6  5  6  2  4  6  9  1  6  3  5
 [601]  7  4  6 10  3  3  2  7  4  4  4  6  3  4  5  8  5  5  2  3  5  5  5  7
 [625]  2  6  4  7  5  4  5  6  4  5  6  8  7  4  7  6  6  4  6  3  5  5  6  5
 [649]  5  5  5  5  4  6  4  5  7  4  6  2  5  4  7  6  2  6  5  8  3  5  4  6
 [673]  5  5  5  5  5  7  7  2  6  6  5  6  5  7  5  3  8  6  4  6  6  3  6  7
 [697]  5  8  5  2  4  7  3  7  4  8  6  3  6  7  6  5  5  3  1  4  6  4  3  5
 [721]  5  5  5  5  5  7  7  6  3  5  5  4  4  5  4  5  5  4  7  4  5  7  3  6
 [745]  4  3  6  3  5  6  4  6  4  4  5  4  4  6  5  5  6  6  3  5  4  6  5  4
 [769]  4  7  4  7  4  5  6  7  6  7  5  6  5  5  8  5  6  6  6  2  5  7  1  9
 [793]  5  4  6  3  4  6  5  7  6  6  5  7  7  7  5  4  5  8  5  6  4  5  3  3
 [817]  5  3  8  4  5  6  5  5  6  4  2  6  4  7  4  5  5  6  3  6  5  5  5  6
 [841]  8  6  7  3  7  8  5  6  8  4  6  4  4  1  8  5  4  6  6  6  3  7  9  7
 [865]  5  6  3  7  2  2  6  5  5  4  8  5  7  3  5  6  5  5  5  6  6  5  6  5
 [889]  4  4  4  5  4  5  4  4  6  4  3  4  5  7  8  8  4  6  5  7  5  4  6  5
 [913]  7  5  1  3  5  4  3  2  4  5  4  4  7  4  6  4  7  4  5  7  2  3  5  4
 [937]  2  6  4  3  5  6  3  4  4  3  7  2  3  5  5  6  3  6  7  7  5  5  4  2
 [961]  4  1  7  3  6  5  4  5  6  6  5  4  7  6  5  6  2  4  6  5  7  4  7  7
 [985]  5  7  3  5  5  4  7  5  6  7  5  5  6  4  4  4
hist(life, xlim = c(0,10))

mean(life)
[1] 4.985
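
A quick sanity check (a sketch, using what we know about coin flips): with 10 fair flips, the expected sum is 10 * 0.5 = 5 and the expected SD is sqrt(10 * 0.5 * 0.5), about 1.58.

10 * 0.5 # the expected mean of a "life"
sqrt(10 * 0.5 * 0.5) # the expected SD of a "life", ~1.58
sd(life) # compare to our 1000 simulated lives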

What if the Coin is Biased?

Modify the code to simulate 1000 lives with a biased coin: an 80% chance of flipping one option (i.e., increase the probability of flipping either heads = 0 or tails = 1).

What type of distribution do you expect to see? Why??

Note : the sample() function can take another argument (prob) that can adjust the probability.

sample(coinflip, 1)
[1] 1
badlife <- array() # WHAT HAPPENS IF I DO THIS????
for(i in c(1:1000)){
  x <- replicate(10, sample(coinflip, 1, prob = c(.7, .3))) # 0 has a 70% chance; 1 has a 30% chance
  badlife[i] <- sum(x)
}

badlife
   [1] 1 4 1 1 1 4 3 3 4 2 2 3 1 5 1 2 1 4 4 2 1 4 3 1 3 2 3 4 4 3 0 1 2 2 4 6 3
  [38] 2 5 1 3 3 3 2 4 1 2 2 4 6 3 3 5 3 2 3 4 6 3 2 4 4 4 4 5 4 4 3 4 2 4 2 2 5
  [75] 2 3 5 2 4 2 1 5 2 3 2 6 2 2 2 6 5 4 1 4 4 7 0 5 5 3 4 1 4 4 1 1 3 3 6 2 4
 [112] 3 4 2 5 3 3 5 3 4 4 5 2 1 4 4 6 3 1 2 5 3 1 3 5 3 4 4 6 7 6 6 2 3 3 1 4 5
 [149] 4 3 0 3 1 4 3 2 3 4 0 4 1 1 2 3 4 3 4 4 7 1 1 1 4 4 1 4 3 3 4 2 4 4 4 2 1
 [186] 4 5 2 3 4 4 5 3 2 0 6 3 1 0 2 5 4 2 5 4 1 2 6 1 4 3 0 4 6 4 3 5 1 4 3 4 4
 [223] 8 5 2 3 1 1 3 2 1 5 4 2 2 4 3 3 5 5 3 2 3 4 4 3 5 3 2 3 3 4 4 3 3 4 3 4 5
 [260] 5 4 1 1 1 6 6 6 6 2 5 4 4 3 2 2 2 3 3 4 3 4 3 5 3 4 4 1 0 3 3 3 3 5 1 5 2
 [297] 1 3 5 2 5 2 3 2 3 3 2 5 4 5 1 4 3 6 5 5 1 3 3 1 5 1 2 4 3 4 3 2 3 3 2 0 2
 [334] 3 2 4 1 6 3 3 3 5 2 2 1 3 1 3 4 4 2 3 1 1 6 4 2 1 2 5 6 4 2 2 5 4 3 4 2 1
 [371] 5 2 4 5 3 3 3 4 3 3 2 1 6 2 4 2 4 6 4 2 4 5 3 1 3 4 1 4 1 2 3 0 2 2 2 3 1
 [408] 2 4 1 2 1 5 3 2 3 6 3 2 0 3 4 3 2 4 3 5 2 3 0 3 5 4 0 4 4 2 4 2 4 4 3 2 1
 [445] 3 5 3 3 6 6 3 2 5 3 4 3 2 1 1 1 5 3 3 3 4 3 4 7 1 5 2 2 1 4 3 3 3 1 2 2 4
 [482] 6 5 3 1 4 3 2 4 3 3 2 3 3 1 3 0 2 6 3 1 4 0 2 4 2 2 5 1 5 2 1 3 6 3 1 3 2
 [519] 4 4 4 2 6 5 2 1 4 1 1 2 2 0 4 3 3 2 2 3 5 2 1 5 2 3 6 2 2 3 2 6 3 1 4 3 1
 [556] 2 3 2 5 4 3 3 3 2 3 4 4 4 1 1 1 4 3 3 3 4 2 2 3 3 3 3 2 5 4 5 2 3 2 5 4 2
 [593] 3 2 1 5 4 2 4 2 4 2 2 2 5 2 1 1 2 7 3 4 1 1 5 4 4 4 3 5 3 0 4 3 0 3 5 2 3
 [630] 2 2 2 4 4 4 1 4 1 7 3 3 4 4 1 1 4 4 2 3 3 1 6 5 2 5 3 1 2 4 3 6 3 2 1 4 2
 [667] 0 5 3 2 3 3 3 5 3 1 2 3 2 4 2 2 1 3 4 3 2 4 3 2 3 4 3 6 4 3 3 2 1 2 4 4 3
 [704] 1 2 5 3 4 5 2 4 4 4 4 4 0 0 2 3 2 3 4 3 3 4 2 3 3 1 5 2 4 4 4 2 3 3 1 5 2
 [741] 2 1 4 3 5 5 1 5 3 3 3 3 5 2 4 2 1 1 4 1 5 4 3 3 2 2 0 1 4 4 2 2 2 4 6 4 4
 [778] 3 4 6 4 5 1 4 4 1 3 5 3 1 4 2 3 2 3 1 4 3 4 7 3 2 5 4 1 4 3 5 4 1 2 4 4 4
 [815] 4 5 0 2 3 1 0 3 5 3 3 1 3 5 2 3 2 4 3 3 4 2 3 3 2 5 4 3 1 2 2 3 3 5 1 2 5
 [852] 3 4 2 3 3 0 2 4 5 1 2 0 1 1 1 2 3 4 3 4 7 2 2 2 3 6 2 4 2 3 3 7 4 2 4 2 2
 [889] 1 4 3 3 3 1 3 4 4 4 3 2 2 4 4 2 4 2 4 3 3 4 4 3 5 3 2 4 3 5 3 4 4 2 2 3 4
 [926] 3 3 4 5 4 1 5 3 3 2 3 4 1 4 2 2 3 2 3 1 3 2 3 1 4 3 4 3 4 4 1 3 1 3 2 2 5
 [963] 1 1 1 3 2 4 4 2 2 1 0 2 3 2 2 3 5 2 2 4 4 5 4 3 3 3 1 2 3 2 5 5 6 3 4 1 5
[1000] 7
hist(badlife, xlim = c(0,10))
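
One check worth adding (a sketch): with prob = c(.7, .3), a 1 comes up only 30% of the time, so the expected sum is 10 * 0.3 = 3, and the distribution gets squeezed toward the floor at 0, which is why it comes out skewed rather than symmetric.

mean(badlife) # should sit near 10 * 0.3 = 3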

!!! Critical Race Theory DEI Alert!!!

Francis Galton was a super racist and the inventor of eugenics, and he influenced (or invented) many statistics that we use today. For example, he illustrated the “central limit theorem” with the Galton Board (see image on the right). Whereas before, scholars considered the “average” to be the ideal state of humanity (it is closest to all the people; the Platonic Ideal!), Galton considered the goal of humanity to be achieving better than average - something we have internalized today.

Indeed, Galton had a motivated agenda to use statistics to demonstrate there was a hierarchy to individual “eminence.” In his own words:

To conclude, the range of mental power between—I will not say the highest Caucasian and the lowest savage—but between the greatest and least of English intellects, is enormous. … I propose in this chapter to range men according to their natural abilities, putting them into classes separated by equal degrees of merit, and to show the relative number of individuals included in the several classes…..The method I shall employ for discovering all this, is an application of the very curious theoretical law of “deviation from an average.” First, I will explain the law, and then I will show that the production of natural intellectual gifts comes justly within its scope. - Galton, Hereditary Genius (1869). Linked here.

Why does it matter that a super racist invented statistics? Does it? I have a few ideas, but would like to hear your thoughts first :)

  • reasons relevant :
    • angela : important to acknowledge that the author was super racist; possible theories that we have are not the most objective…they are products of a tool that was developed for a specific purpose (ranking men).
    • kevin : good reminder to not just adopt the “best fit”; the “mean” as “the best”…outliers are also humans too, and maybe different for important reasons; maybe even more important to study these people who don’t fit our models.
  • reasons not relevant:
    • this is a great way to measure people; quantify; predict.
    • hannah : lots of things that we use in society that don’t come from the best intentions - important to remember that history - but we still need to use them and not worth throwing out the tools because they have a purpose.
  • questions / other comments :
    • namrata : can we separate the intention of creation from the practice of creation? is statistics in some way inherently biased????
    • angela : galton reveals his bias in his language (“Caucasian…savage”)

BREAK TIME : MEET BACK AT 12:50 & PRESENTATIONS

Sampling Error (Conceptual)

Scientific Method Stuff

  • Sample v. Population
    • Population : All the people relevant to your research question.
    • Sample : The people in your study.
    • KEY IDEA : Our sample will never equal the population!
      • Sampling Bias : our sample differs from the population in predictable ways.
      • Sampling Error : our sample differs from the population in random ways. (See the sketch after this list for the contrast in code.)
      • Random : Each individual in the population has an equal probability of being in our sample.
  • For Lab 3 : Find an article; is the sample representative (probably not)? How might bias influence the results?!
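
A tiny simulated contrast between bias and error (a sketch; the population here is invented for illustration):

## sampling bias vs. sampling error, in code
pop <- rnorm(100000, mean = 100, sd = 15) # a pretend population

biased <- sample(pop[pop > 100], 50) # only the top half can be chosen : BIAS
random <- sample(pop, 50) # everyone has an equal chance : only ERROR

mean(biased) # predictably overshoots 100
mean(random) # misses 100, but in no predictable direction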

Sampling Error in R

  1. Be an omnipotent higher power who can create an entire world of individuals.
# rnorm(10000000, mean = 100, sd = 20)
fakey <- rnorm(10000000, mean = 100, sd = 30)
length(fakey)
[1] 10000000
head(fakey)
[1]  66.22543  54.71000 118.79119 101.65919 110.86523 125.29143
hist(fakey)

  2. Run some stats on these data as we do.
mean(fakey, na.rm = T)
[1] 100.0124
hist(fakey)
abline(v = mean(fakey), lwd = 5)

  3. Take a random sample from this population.
?sample # our friend, the sample function
sample(1:10, 1) # are you vibing with R?
[1] 2
sample(1:10000000, 1) # are you REALLY vibing?
[1] 5118661
sample(1:length(fakey), 1) # the numbers to sample from, another way. why better?
[1] 1308036
sample(1:length(fakey), 10) # a small sample
 [1]  123677 3696652 2619442  977569 5699794 5184034  837685 2487447 3857079
[10] 3003041
fakey[10]
[1] 84.74665
fakey[sample(1:length(fakey), 10)] # ten random individuals from fakey.
 [1]  85.08747 124.39944  96.60862 121.28514 113.59255 133.89456  68.53753
 [8]  93.90437 161.30144  92.72760
fakey[10000001] # what would happen if I do this????
[1] NA
lilfakey <- fakey[sample(1:length(fakey), 10)] # ten random individuals from fakey.
lilfakey
 [1] 113.74490  49.28128 129.33400  86.83650  53.49326  75.78368 105.55479
 [8]  79.64550 119.01793 121.03340

Check in with your buddy….what’s going on??? What are we doing and why are we doing this????

  • WHAT WE ARE DOING :

    • fakey : an object : a fake dataset of 10 million values with a mean of 100 and sd of 30.

    • sample() : taking 10 random participant IDs….making sure they are valid IDs (within the possible range of fakey)

    • indexing [] : using the 10 random IDs to find 10 random people in fakey

    • result : getting ten random values from these imaginary people that we created as a population.

  • WHY ARE WE DOING THIS??? : to try and see how much the statistics from our RANDOM SAMPLE will vary from our population.

  4. Run some stats on this sample.
mean(lilfakey)
[1] 93.37252
hist(lilfakey)
abline(v = mean(lilfakey), lwd = 5)

  5. Repeat These Steps Until You Get “THE TRUTH”
lilfakey <- fakey[sample(1:length(fakey), 10)] # ten random individuals from fakey.
hist(lilfakey, xlim = c(0,200), ylim = c(0,10),
     breaks = 5,
     main = paste0("mean = ", round(mean(lilfakey), 4))) # paste0 glues text & number into one title
abline(v = mean(lilfakey), lwd = 5)

WHAT DID Y’ALL NOTICE :

  • sampling error :

    • we NEVER got the mean we were supposed to get (100).

    • we NEVER got the same mean with each sample (lack of RELIABILITY in our results)

  • our means weren’t ALL over the place…

    • they were always within our SD

    • they were all kinda around 100.

    • no pattern / bias to where the means were.

  6. Doing this 1000 times
truthbucket <- array()
for(i in c(1:1000)){
  lilfakey <- fakey[sample(1:length(fakey), 10)] # ten random individuals from fakey.
  truthbucket[i] <- mean(lilfakey)
}
length(truthbucket)
[1] 1000
hist(truthbucket)

mean(truthbucket)
[1] 99.65602
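
One more number worth pulling out of truthbucket (a sketch, not run in class): the standard deviation of those 1000 sample means estimates how far a typical sample mean lands from the true mean. For random samples of n = 10, theory says this should be near sd/sqrt(n) = 30/sqrt(10), roughly 9.5.

sd(truthbucket) # typical distance of a sample mean from 100
30/sqrt(10) # the theoretical standard error of the mean : sd/sqrt(n)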

What’s the Point, Professor??? (Sampling Error Edition)

Bootstrapping to Estimate Sampling Error

Okay, let’s work through a real example of using bootstrapping to estimate sampling error, and why this might be useful.

Remember that in the onboarding survey, we saw people rated their own skills as lower than their classmates’ skills.

d <- read.csv("../datasets/Onboarding Data/honor_onboard_FA25.csv", stringsAsFactors = T, na.strings = "")
par(mfrow = c(1,2))

hist(d$self.skills, breaks = c(0:5), 
     col = 'black', bor = 'white', main = "Computer Skills\n(Self-Perceptions)")

hist(d$class.skills, breaks = c(0:5),
     col = 'black', bor = 'white', main = "Computer Skills\n(Perceptions of Classmates)")

But would we expect to observe this same difference in a different sample of students???

Let’s use bootstrapping to test this.
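
Here is a sketch of what that for-loop might look like (adapting the truthbucket pattern from above; the object names are mine, not official lab code): resample the rows of d with replacement, compute the self-vs-classmates difference in each resample, and see how much that difference bounces around.

diffbucket <- array()
for(i in c(1:1000)){
  rows <- sample(1:nrow(d), nrow(d), replace = TRUE) # resample rows WITH replacement
  boot <- d[rows,] # one bootstrapped "class" of students
  diffbucket[i] <- mean(boot$self.skills, na.rm = T) -
    mean(boot$class.skills, na.rm = T) # the difference in this resample
}

hist(diffbucket, main = "Bootstrapped Self - Class Differences")
abline(v = 0, lwd = 5) # does "no difference" (0) look plausible?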

For Lab 3 :

  1. Keep getting practice working with datasets and interpreting variables in R.
  2. Try using Quarto.
  3. Use (and adapt) the bootstrapping for-loop to estimate how much sampling error we can expect in variables, and in the difference between variables.