Lab 5 - Mini Practice - KEY

Instructions.

Use the dataset to answer the questions below. Make sure your report a) includes the code that you used, b) shows the result / output in a clear and organized way (e.g., do not include junk output that we don't need to see!), and c) answers the questions to explain your code / the result. This is representative of the kind of work you'll do on the actual exam, though the specific questions / tasks will change depending on the dataset. The key has been (or will be) posted - try it on your own, but use it as a guide if you get stuck. Yeah!

The Dataset and Problem.

Load the "Narcissism" dataset (on Dropbox) into R. Note that these data are not 'comma' separated values, but separated by tabs; you'll need to import them using the sep argument shown in the code below. Check to make sure the data loaded correctly, and report the sample size. Look over the codebook (also posted to Dropbox).

Dr. Professor wants to see whether there’s a relationship between narcissism (the DV; variable = score in the dataset) and age (the IV; variable = age). Use the dataset to test Dr. Professor’s prediction that as people get older, they get less narcissistic.

Problem 1. Data Loading and Cleaning

Load the data and check to make sure the data loaded correctly. Report the sample size.

  • Professor Note : the codebook doesn't specify that zeros are missing data, but it's also clear that certain answers should not have a zero, and I don't get correct answers when treating zeros as data. I missed this at first, but realized this was an issue when I created the Narcissism scale and got values that didn't match the existing score variable.
n <- read.csv("~/Dropbox/!GRADSTATS/Datasets/Narcissism/data.csv", stringsAsFactors = T,
              sep = "\t", na.strings = 0) # tab-separated; treat zeros as missing
head(n) # looks good
  score Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16 Q17 Q18 Q19 Q20
1    18  2  2  2  2  1  2  1  2  2   2   1   1   2   1   1   1   2   1   1   1
2     6  2  2  2  1  2  2  1  2  1   1   2   2   2   1   2   2   1   1   2   1
3    27  1  2  2  1  2  1  2  1  2   2   2   1   1   1   1   1   2   2   1   1
4    29  1  1  2  2  2  1  2  1  1   2   1   1   1   1   1   1   2   2   1   2
5     6  1  2  1  1  1  2  1  2  1   2   2   2   2   2   1   1   1   1   1   1
6    19  1  2  2  1  2  1  1  1  2   2   1   1   1   2   1   1   1   1   1   1
  Q21 Q22 Q23 Q24 Q25 Q26 Q27 Q28 Q29 Q30 Q31 Q32 Q33 Q34 Q35 Q36 Q37 Q38 Q39
1   1   1   1   2   2   2   1   2   2   2   1   2   1   1   1   2   2   2   1
2   2   2   1   2   2   2   2   1   2   2   2   1   2   2   1   2   2   2   2
3   2   2   2   2   1   2   1   1   2   1   2   2   1   1   2   1   1   2   1
4   1   1   1   2   1   2   1   2   2   1   1   2   1   1   2   1   2   2   1
5   1   2   1   2   2   1   2   1   2   2   2   1   2   2   1   2   2   2  NA
6   1   1   1   1   2   1   1   1   2   1   1   2   1   2   1   1   2   2   2
  Q40 elapse gender age
1   2    211      1  50
2   1    149      1  40
3   2    168      1  28
4   1    230      1  37
5   1    389      1  50
6   2    361      1  27
nrow(n) # seems right
[1] 11243

The variable elapse describes how long participants took to complete the survey (time submitted - time started). Decide on a rule - justify this using a mix of logic and statistics - and use this rule to remove individuals with implausible elapse values from your dataset. (Note that you should remove the entire individual from the dataset, not just their elapse score.) Graph the variable elapse after removing the outliers, and report the number of individuals who were removed from the dataset using your rule.

  • Professor Note : It's hard to get descriptive statistics of the mean and standard deviation because there are a few MEGA outliers that are skewing even the outlier data... so I might remove those first. I'll get rid of anyone who spent more than 12 hours doing this.
n$elapse[n$elapse > 60*60*12] <- NA # 12 hours = 43,200 seconds
hist(n$elapse)

summary(n$elapse)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    8.0   204.0   257.0   378.5   341.0 35178.0      10 
sd(n$elapse, na.rm = T)
[1] 1073.626
  • Professor Note : So, there are 44 questions, and let's say it takes - at a minimum - 1 second to read and answer each question... I'll set my minimum possible time to 45 seconds. This removes just a few people.
n$elapse[n$elapse < 45] # haven't removed them yet; just looking.
 [1] NA  8 NA NA NA 33 NA 24 NA NA NA NA NA
  • Professor Note : For the maximum value, I'm thinking super high values will be people who wandered off from the task... hard to say whether they are "bad" data, but let's say if this took you longer than an hour (3600 seconds) I will consider your data bad. Note that 3600 seconds is about 3 standard deviations above the mean ((3600 - 378.5) / 1073.6 ≈ 3.0), but this estimate of the standard deviation includes some extreme outliers, which is another reason why it's kind of hard / silly to use SD as "THE" rule for removing outliers (since outliers will influence the SD). Anyway, this is somewhat arbitrary; but I'm using logic (and some stats) to justify my ideas. And in any case, if some reviewer / advisor / peer complains, I have my code, am being transparent, and can just make adjustments as needed (then re-run everything).
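(An aside, since the note above points out that outliers inflate the SD: a more robust cutoff uses the median and MAD instead. A minimal sketch - the "3" multiplier is an illustrative choice, not the rule I actually used.)

med <- median(n$elapse, na.rm = TRUE)
med + 3 * mad(n$elapse, na.rm = TRUE) # upper cutoff: median + 3 scaled MADs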
n$elapse[n$elapse < 45 | n$elapse > 3600] # finding the outliers in the variable.

n[n$elapse < 45 | n$elapse > 3600, ] # finding the outliers in the dataset
# n[n$elapse < 45 | n$elapse > 3600, ] <- NA # I get an error message when I run this! Stupid missing values. Had to comment this out because Quarto wouldn't render an error message. SIGH.

n[c(n$elapse < 45 | n$elapse > 3600) & !is.na(n$elapse), ] # same filter, excluding the missing values.

## gotta reload the data cuz I already NA'd some values while exploring, and want a total count.
n <- read.csv("~/Dropbox/!GRADSTATS/Datasets/Narcissism/data.csv", stringsAsFactors = T, 
              sep = "\t", na.strings = 0) # same import as before

toolong <- n[c(n$elapse < 45 | n$elapse > 3600) & !is.na(n$elapse), ]
nrow(toolong) # 81 folks will be removed.
  • Professor Note : Okay, now I need to remove these outliers.
n[c(n$elapse < 45 | n$elapse > 3600) & !is.na(n$elapse), ] <- NA # the !is.na() keeps rows with missing elapse out of the rule.
hist(n$elapse) # they gone.
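If you'd rather literally drop the rows than NA them out, here's a sketch (assuming it's applied right after reloading the data, before the NA assignment above; n.trim is just an illustrative name):

keep <- is.na(n$elapse) | (n$elapse >= 45 & n$elapse <= 3600) # pass the rule, or elapse is missing
n.trim <- n[keep, ] # the flagged rows are actually gone now
nrow(n) - nrow(n.trim) # count of removed folks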

Problem 2. Descriptive Statistics & Graphs.

Graph the variables needed to test Dr. Professor’s theory. Make the graphs look nice, as if ready for a publication. Below the graphs, report the relevant descriptive statistics and describe what these statistics / the graphs tell you about these variables. What did you learn?

par(mfrow = c(1,2))
hist(n$score, 
     col = 'black', bor = 'white', 
     xlab = "Narcissism Score (0 - 40)",
     main = "Histogram of Narcissistic Personality Inventory")
hist(n$age, 
     col = 'black', bor = 'white', 
     xlab = "Age",
     main = "Histogram of Age")

  • Professor Note : Gah, more outliers!!! They will need to be removed.
n$age[n$age < 18 | n$age > 100] # just looking: anyone under 18 or over 100 will be removed
  [1]  NA  16  17  17  15  16  17  15  15  15  17  16  17  17  17  17  15  17
 [19]  17  17  17  17  17  17  17  17  NA  16  17  17  15  16  17  17  NA  16
 [37]  NA  16  NA  15  17  15  14  17  16  17  NA  17  NA  17  16  15  14  16
 [55]  17  17 366  17  14  16  16  15  15  NA  17  17  NA  15  17  16  NA  17
 [73]  17  17  17  17  17  16  NA  16  17  14  17  16  17  16  NA  17  NA  NA
 [91]  17  16  16  17  16  15  14  17  NA  17  17  16  16  16  17  17  16  17
[109]  NA  NA  16  15  NA  14  16  17  NA  16  16  17  17  17  NA  NA  16  15
[127]  15  15  16  NA  17  14  17  15  17  16  17  17  17  15  16  17  16  17
[145]  15  17  16  14  16  17  NA  16  16  16  16  NA  17  17  14  14  15  15
[163]  16  17  14  14  14  14  16  17  16  NA  15  16  17  14  17  NA  16  16
[181]  14  NA  16  NA  17  NA  14  NA  17  17  16  NA  NA  15  17  17  NA  15
[199]  17  17  NA  17  NA  17  17  NA  15  15  16  15  17  15  17  NA  17  16
[217]  16  17  17  16  NA  16  16  15  16  17  17  17  15  16  17  NA  16  17
[235]  15  16  NA  14  15  NA  15  17  NA  15  16  17  NA  17  NA  NA  14  14
[253]  14  16  17  17  16  15  NA  16  16  17  16  17  16  17  17  16  16  17
[271]  17  17  16  16  17  17  17  15  NA  17  14  17  17  15  17  16  17  15
[289]  15  17  15  17  15  16  17  16  16  17  17  NA  17  16  16  15  16  17
[307]  15  16  17  17  17  16  17  17  17  17  17  16  NA  15  17  17  NA  15
[325]  16  16  17  17  17  15  17  16  17  17  17  17  17  16  15  14  15  13
[343]  16  14  14  15  17  17  NA  NA  16  14  15  16  17  17  17  17  16  17
[361]  17  16  16  14  14  16  17  17  NA  14  16  16 148  16  16  15  15  13
[379]  15  12  16  16  NA 117  13  17  14  17 509  17  17  13  15  16  15  17
[397]  13  NA  17  14  16  NA  NA  15  15  NA  NA  16  14  15  NA  17  17  14
[415]  17  17  15  NA  14  NA  15  15  16  16  13  14  15  16  17  14  15  12
[433]  16  NA  16  14  13  15  15  15  13  15  17  14  NA  16  16  15  17  17
[451]  15  15  17  16 190  16  14  16  16  17  17  17  NA  13  13  16  14  14
[469]  16  16  15  17  17  14  16  16  NA  16  13  17  NA  17  11  16  17  14
[487]  17  13  16  17  13  16  15  17  17  17  16  NA  16  17  16  16  15  15
[505]  16  16  16  15  13  15  15  16  13  17  NA  15  14  NA  14  15  17  13
[523]  16  15  17  17  16  NA  17  14  NA  17  13  NA  17  13  16  15  NA  15
[541]  17  14  15  14  17  17  15  15  16  16  16  17  NA  16  15  13  16  14
[559]  15  13  16  14  NA  17   6  16  17  14  16  14  15  NA  17  NA  NA  17
[577]  13  17  15  NA  17  16  16  17  15  17  15  NA  13  15  17  16  14  NA
[595]  17  17  NA  15  13  16  15  15  17  16  15  16  16  16  17  16  16  17
[613]  16  17  14  14  17  17  NA  NA  16  15  16  17  13  17  16  14  16  15
[631]  16  17  17  14  15  16  16  16  15  15  16  NA  17  17  16  17  14  17
[649]  NA  15  16  17  14  10  NA  17  17  16  15  17  15  17  NA  16  16  15
[667]  14  16  16  16  16  NA  17  16  16  NA  17  13  17  15  NA  15  16  17
[685]  14  13   2  NA  17  17  15  16  17  17
n[c(n$age < 18 | n$age > 100) & !is.na(n$age), ] <- NA # NA out the flagged rows; !is.na() keeps missing ages out of the rule.

par(mfrow = c(1,2))
hist(n$score, 
     col = 'black', bor = 'white', 
     xlab = "Narcissism Score (0 - 40)",
     main = "")
hist(n$age, 
     col = 'black', bor = 'white', 
     xlab = "Age",
     main = "")

  • Professor Note : these descriptive statistics may differ from yours depending on how you removed outliers in age and time elapsed!
library(psych)
describe(n$score)
   vars     n  mean   sd median trimmed mad min max range skew kurtosis   se
X1    1 10480 13.22 8.39     12   12.37 8.9   1  40    39 0.85      0.2 0.08
describe(n$age)
   vars     n  mean    sd median trimmed   mad min max range skew kurtosis   se
X1    1 10549 34.99 13.48     32   33.62 14.83  18 100    82 0.77    -0.23 0.13
  • Professor Note : Average narcissism is 13.22, with an SD of 8.39 and range of 1 to 40 (good b/c that's within the natural range of the scale). The data are right skewed; most folks are relatively lower in narcissism. I'm not an expert on narcissism (outside of witnessing it across some family dynamics...) but know this scale is not meant to measure Narcissistic Personality Disorder, so it's hard to say when someone is a "Narcissist" vs. just likes themselves in a healthy way.
  • Professor Note : Average age is 35, with a range of 18 to 100 and an SD of 13.5. Data are right skewed; fewer old-folks taking online narcissism surveys about themselves (GENERATION ME, AM I RIGHT?!?!) But seems like there’s a fairly broad distribution of data across the lifespan, and large sample size will help estimate even the less-frequent elder ages.

Problem 3. Linear Models.

Define a linear model to test Dr. Professor’s question. What is the slope, intercept, and \(R^2\) of this model (in raw and z-scored units)? Then, use bootstrapping to estimate the 95% Confidence Interval for the slope and estimate the power researchers had to detect any observed relationship. Graph the relationship between these variables, making sure to illustrate the linear model (and lines to illustrate the 95% Confidence Interval) on your graph. Below the graph, describe what these statistics tell you about the relationship between these two variables. Finally, evaluate the assumptions of this linear model.

I'll start with the graph, since it's good to see the data before interpreting the model.

library(ggplot2)

library(ggthemes) # more pretty graphs
ggplot(n, aes(y = score, x = age)) + 
  geom_point(size = .85) + 
  geom_smooth(method = "lm") + 
  ggthemes::theme_tufte() + ylab("Narcissism Score") + xlab("Age")
Warning: Removed 773 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 773 rows containing missing values or values outside the scale range
(`geom_point()`).

mod <- lm(score ~ age, data = n)
coef(mod)
(Intercept)         age 
 18.1496471  -0.1410376 
summary(mod)$r.squared
[1] 0.05120374
  • Professor Note : I see that there's a small-to-moderate negative relationship between age and narcissism. As people get older, they report less narcissism. There's a lot of error in this prediction; age only explains about 5% of the variation in narcissism (or, in other words, my linear model that uses age to make predictions of narcissism reduces error in my predictions by 5% compared to using the mean of narcissism to make predictions).
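The problem also asks for the model in z-scored units; here's a minimal sketch using scale() (not something I showed in my original run):

mod.z <- lm(scale(score) ~ scale(age), data = n) # z-score both variables
coef(mod.z) # standardized slope (equals r); the intercept should be ~0
summary(mod.z)$r.squared # R^2 is unchanged by z-scoring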

Let's do some bootstrapping to see how much sampling error I might expect in this small relationship.

bucket <- array()
for(i in c(1:1000)){
  n2 <- n[sample(1:nrow(n), nrow(n), replace = T), ] # new sample; based on the old sample
  boot.mod <- lm(score ~ age, data = n2) # new model, based on the new sample.
  bucket[i] <- coef(boot.mod)[2] # my slope, saved to bucket 
}
sd(bucket)
[1] 0.00646186
coef(mod)[2] # slope from my original sample
       age 
-0.1410376 
coef(mod)[2] + 1.96*sd(bucket) # upper limit of 95% CI for slope
       age 
-0.1283723 
coef(mod)[2] - 1.96*sd(bucket) # lower limit
       age 
-0.1537028 
  • Professor Note : Okay, so while in my original sample I found a slope of -.14, I'd expect this slope to vary between about -.15 and -.13 due to random sampling. My estimated sampling error is small (.006), because I have such a large sample.
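Two follow-ups. First, a percentile version of the same CI, straight from the bucket. Second, the problem asks about power; below is a hedged simulation sketch in which each bootstrap resample counts as a "detection" if its 95% CI for the slope excludes zero (power.sig is just an illustrative name):

quantile(bucket, c(.025, .975)) # percentile bootstrap CI for the slope

power.sig <- logical(1000)
for(i in 1:1000){
  n2 <- n[sample(1:nrow(n), nrow(n), replace = T), ] # resample rows, as above
  boot.mod <- lm(score ~ age, data = n2)
  ci <- confint(boot.mod)["age", ] # 95% CI for the slope in this resample
  power.sig[i] <- ci[1] > 0 | ci[2] < 0 # TRUE if the CI excludes zero
}
mean(power.sig) # estimated power to detect the observed relationship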

Regarding the various assumptions:

  1. Validity. Hmm, it can be hard to trust self-reports, though I think people would know the answers to these questions about themselves, and would be less motivated to lie on an online survey than in an in-person interview. Some sampling bias no doubt; would want to know more about where these people are coming from (guessing mostly the US; think that's a variable in the dataset? something to look at later :)).
  2. Reliable Measures. The data appear reliable; the NPI showed high internal consistency (alpha ≈ .91; see Problem 4). Would wonder about test-retest reliability too, but that's beyond this dataset.
  3. Independence. Yeah, the data are considered independent; this was a between-person study, so one individual doesn’t tell us anything about another.
  4. Linearity & 5. Equal Variance. The diagnostic plots look okay; there's a lot of error, but it seems mostly evenly distributed around the fitted values. I see some lumpiness around the Q-Q residual plot on the tails; these are the folks who are higher in narcissism than we'd predict, so I wonder if there's maybe a quadratic effect of age (like, the younguns are super narcissistic?) Or maybe we are missing some other variable. I do see there is one individual data point that could be exerting a lot of leverage if the dataset weren't so large (see the Cook's distance sketch below the plots); not really worried about them. And my predictions about the super old people in the model are off. Looking at the graphs, I think that I should really remove the person whose age was 100 - a) it's unlikely this is a valid age, b) even if this is a valid age, they are very different from the other old folks (whose max age is 80), and c) this person is not well predicted by the model (as evidenced by their location on the far left or far right of some of the model diagnostic plots). I'm not going to rerun the analyses with them removed, so this little mini-lecture on outliers can exist for students to read. Are you there student? It's me, professor.
par(mfrow = c(2,2))
plot(mod) # the four standard diagnostic plots
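As promised in point 4 & 5 above, a sketch for hunting down the high-leverage case(s) with Cook's distance (worst is just an illustrative name):

cd <- cooks.distance(mod) # each case's influence on the fitted model
worst <- names(sort(cd, decreasing = TRUE))[1:3] # the three most influential rows
n[worst, c("age", "score")] # inspect them - is the age-100 person here?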

  6. Normality of Errors. Hmm, the residuals aren't perfectly normal. That's not a huge deal, but makes me think that age is not helping us predict folks who are high in narcissism.
hist(mod$residuals)

Problem 4. Creating a Scale.

Dr. Professor is worried that the narcissism variable (score) was not calculated correctly. In this study, narcissism was measured by giving people 40 items with two options each and adding up the total number of "narcissism" options that people selected. At the end of the codebook, the researchers have identified which responses were used in calculating the score; however, this code will not work in R. Re-create the narcissism score from these 40 items and report the alpha reliability of this scale. Then, confirm that your calculation of narcissism is exactly the same as the one that was already calculated in the dataset (score).

Heads up. This question took professor 30 minutes to figure out... so it's okay if you struggle a little! Feel free to peek at my key (and there's definitely a simpler way to do this, but I was feeling stubborn about the approach I initially took). I think this represents the kind of weird data you might be asked to interpret. But I do like to have a small challenge problem on each exam, worth only 1/21 points, just to keep things interesting. FWIW, the Likert scale you do on the actual exam will be more straightforward.

  • Professor Note : Okay, first I arranged the narcissism items into a dataframe. This was where I realized that missing values were coded as zero and needed to be removed when importing the dataset. Tricky! I know!!
NARC.df <- n[,2:41]
head(NARC.df)
  Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16 Q17 Q18 Q19 Q20 Q21
1  2  2  2  2  1  2  1  2  2   2   1   1   2   1   1   1   2   1   1   1   1
2  2  2  2  1  2  2  1  2  1   1   2   2   2   1   2   2   1   1   2   1   2
3  1  2  2  1  2  1  2  1  2   2   2   1   1   1   1   1   2   2   1   1   2
4  1  1  2  2  2  1  2  1  1   2   1   1   1   1   1   1   2   2   1   2   1
5  1  2  1  1  1  2  1  2  1   2   2   2   2   2   1   1   1   1   1   1   1
6  1  2  2  1  2  1  1  1  2   2   1   1   1   2   1   1   1   1   1   1   1
  Q22 Q23 Q24 Q25 Q26 Q27 Q28 Q29 Q30 Q31 Q32 Q33 Q34 Q35 Q36 Q37 Q38 Q39 Q40
1   1   1   2   2   2   1   2   2   2   1   2   1   1   1   2   2   2   1   2
2   2   1   2   2   2   2   1   2   2   2   1   2   2   1   2   2   2   2   1
3   2   2   2   1   2   1   1   2   1   2   2   1   1   2   1   1   2   1   2
4   1   1   2   1   2   1   2   2   1   1   2   1   1   2   1   2   2   1   1
5   2   1   2   2   1   2   1   2   2   2   1   2   2   1   2   2   2  NA   1
6   1   1   1   2   1   1   1   2   1   1   2   1   2   1   1   2   2   2   2
summary(NARC.df)[,1:4] # just looking at the first few variables.
       Q1              Q2              Q3              Q4       
 Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
 1st Qu.:1.000   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:1.000  
 Median :1.000   Median :2.000   Median :2.000   Median :1.000  
 Mean   :1.389   Mean   :1.795   Mean   :1.837   Mean   :1.172  
 3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:1.000  
 Max.   :2.000   Max.   :2.000   Max.   :2.000   Max.   :2.000  
 NA's   :696     NA's   :702     NA's   :701     NA's   :709    
  • Professor Note : Then, I defined two variables; ifONE means that the "Is Narcissistic" response option was 1 (e.g., 1 = I will never be satisfied until I get all that I deserve = narcissistic! Whereas 2 = I take my satisfactions as they come = HEALTHIER APPROACH!) ifTWO means that the "Is Narcissistic" response option was 2. Note that I used setdiff() to find the item numbers from 1 to 40 that weren't in ifONE, so my eyes only had to glaze over when checking (and double checking) the codebook for ifONE.
ifONE <- c(1:3,6,8,11:14,16,21,24:25,27,29:31,33:34,36:39)
ifTWO <- setdiff(1:40, ifONE)
ifTWO # this looks right. eyes still glazing over.
 [1]  4  5  7  9 10 15 17 18 19 20 22 23 26 28 32 35 40
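A quick sanity check that the two keys partition all 40 items:

length(ifONE) + length(ifTWO) == 40 # together they cover every item
intersect(ifONE, ifTWO) # should be integer(0): no overlap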
  • Professor Note : Okay, here's where things got trickier and melted my brain a little and there's probably a better way. The codebook says the final narcissism score was defined by the sum of the narcissistic responses. So the narcissistic items = 1 (ifONE) would need to be subtracted from 2, so the non-narcissistic items (which were selected as = 2) will become zero (e.g., 2-2 = 0; 2-1 = 1). If the narcissistic items = 2 (ifTWO), then I need to subtract 1 from this value, so the non-narcissistic items (which were selected as = 1) will become zero (e.g., 1-1 = 0; 2-1 = 1). I'm realizing now as I type this there's a much easier way to do this using an ifelse() statement (a sketch of that route appears after the combined data frame below)... anyway. I have a cold and got this to work with my confusing method and so will move on.
head(cbind(NARC.df[ifTWO]-1, NARC.df[ifTWO])) # checking my confusing logic to see the changes that I made.
  Q4 Q5 Q7 Q9 Q10 Q15 Q17 Q18 Q19 Q20 Q22 Q23 Q26 Q28 Q32 Q35 Q40 Q4 Q5 Q7 Q9
1  1  0  0  1   1   0   1   0   0   0   0   0   1   1   1   0   1  2  1  1  2
2  0  1  0  0   0   1   0   0   1   0   1   0   1   0   0   0   0  1  2  1  1
3  0  1  1  1   1   0   1   1   0   0   1   1   1   0   1   1   1  1  2  2  2
4  1  1  1  0   1   0   1   1   0   1   0   0   1   1   1   1   0  2  2  2  1
5  0  0  0  0   1   0   0   0   0   0   1   0   0   0   0   0   0  1  1  1  1
6  0  1  0  1   1   0   0   0   0   0   0   0   0   0   1   0   1  1  2  1  2
  Q10 Q15 Q17 Q18 Q19 Q20 Q22 Q23 Q26 Q28 Q32 Q35 Q40
1   2   1   2   1   1   1   1   1   2   2   2   1   2
2   1   2   1   1   2   1   2   1   2   1   1   1   1
3   2   1   2   2   1   1   2   2   2   1   2   2   2
4   2   1   2   2   1   2   1   1   2   2   2   2   1
5   2   1   1   1   1   1   2   1   1   1   1   1   1
6   2   1   1   1   1   1   1   1   1   1   2   1   2
head(cbind(NARC.df[ifONE], 2-NARC.df[ifONE]))
  Q1 Q2 Q3 Q6 Q8 Q11 Q12 Q13 Q14 Q16 Q21 Q24 Q25 Q27 Q29 Q30 Q31 Q33 Q34 Q36
1  2  2  2  2  2   1   1   2   1   1   1   2   2   1   2   2   1   1   1   2
2  2  2  2  2  2   2   2   2   1   2   2   2   2   2   2   2   2   2   2   2
3  1  2  2  1  1   2   1   1   1   1   2   2   1   1   2   1   2   1   1   1
4  1  1  2  1  1   1   1   1   1   1   1   2   1   1   2   1   1   1   1   1
5  1  2  1  2  2   2   2   2   2   1   1   2   2   2   2   2   2   2   2   2
6  1  2  2  1  1   1   1   1   2   1   1   1   2   1   2   1   1   1   2   1
  Q37 Q38 Q39 Q1 Q2 Q3 Q6 Q8 Q11 Q12 Q13 Q14 Q16 Q21 Q24 Q25 Q27 Q29 Q30 Q31
1   2   2   1  0  0  0  0  0   1   1   0   1   1   1   0   0   1   0   0   1
2   2   2   2  0  0  0  0  0   0   0   0   1   0   0   0   0   0   0   0   0
3   1   2   1  1  0  0  1  1   0   1   1   1   1   0   0   1   1   0   1   0
4   2   2   1  1  1  0  1  1   1   1   1   1   1   1   0   1   1   0   1   1
5   2   2  NA  1  0  1  0  0   0   0   0   0   1   1   0   0   0   0   0   0
6   2   2   2  1  0  0  1  1   1   1   1   0   1   1   1   0   1   0   1   1
  Q33 Q34 Q36 Q37 Q38 Q39
1   1   1   0   0   0   1
2   0   0   0   0   0   0
3   1   1   1   1   0   1
4   1   1   1   0   0   1
5   0   0   0   0   0  NA
6   1   0   1   0   0   0
reNARC.df <- data.frame(2-NARC.df[ifONE], NARC.df[ifTWO]-1) # combining my work into one data.frame
head(reNARC.df)
  Q1 Q2 Q3 Q6 Q8 Q11 Q12 Q13 Q14 Q16 Q21 Q24 Q25 Q27 Q29 Q30 Q31 Q33 Q34 Q36
1  0  0  0  0  0   1   1   0   1   1   1   0   0   1   0   0   1   1   1   0
2  0  0  0  0  0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0
3  1  0  0  1  1   0   1   1   1   1   0   0   1   1   0   1   0   1   1   1
4  1  1  0  1  1   1   1   1   1   1   1   0   1   1   0   1   1   1   1   1
5  1  0  1  0  0   0   0   0   0   1   1   0   0   0   0   0   0   0   0   0
6  1  0  0  1  1   1   1   1   0   1   1   1   0   1   0   1   1   1   0   1
  Q37 Q38 Q39 Q4 Q5 Q7 Q9 Q10 Q15 Q17 Q18 Q19 Q20 Q22 Q23 Q26 Q28 Q32 Q35 Q40
1   0   0   1  1  0  0  1   1   0   1   0   0   0   0   0   1   1   1   0   1
2   0   0   0  0  1  0  0   0   1   0   0   1   0   1   0   1   0   0   0   0
3   1   0   1  0  1  1  1   1   0   1   1   0   0   1   1   1   0   1   1   1
4   0   0   1  1  1  1  0   1   0   1   1   0   1   0   0   1   1   1   1   0
5   0   0  NA  0  0  0  0   1   0   0   0   0   0   1   0   0   0   0   0   0
6   0   0   0  0  1  0  1   1   0   0   0   0   0   0   0   0   0   1   0   1
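And here's a sketch of the simpler ifelse() route mentioned above: score each item as 1 when the response matches that item's "narcissistic" option. (reNARC.alt is an illustrative name, and its columns stay in Q1-Q40 order rather than the reordered columns in reNARC.df.)

reNARC.alt <- NARC.df
for(j in 1:40){
  key <- ifelse(j %in% ifONE, 1, 2) # which response option counts as narcissistic
  reNARC.alt[, j] <- ifelse(NARC.df[, j] == key, 1, 0) # 1 = narcissistic response; NAs stay NA
}
head(reNARC.alt) # should match reNARC.df, column order aside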
  • Professor Note : Alright, I’m ready to estimate the reliability of this scale, create the scale, and then show that they are equivalent. I can show equivalence a few ways; the scatterplot shows a perfect positive relationship.
library(psych)
# psych:: prefix avoids ggplot2's alpha() (which is about colors); the [1]
# index shows just the total reliability instead of EVERYTHING alpha() returns.
psych::alpha(reNARC.df)[1]
$total
 raw_alpha std.alpha   G6(smc) average_r      S/N         ase      mean
 0.9069391 0.9082589 0.9179566 0.1984006 9.900236 0.001248322 0.3290839
        sd  median_r
 0.2115142 0.1933424
n$NARCSCORE <- rowSums(reNARC.df, na.rm = T) # my created scale (na.rm = T means missing items count as 0 toward the sum).

plot(NARCSCORE ~ score, data = n) # one way to show equivalence.

  • I could also create a difference score, and show these differences are equal to zero.
score.diff <- n$NARCSCORE - n$score # avoiding the name 'diff', which is a base R function
plot(score.diff) # all zeros.

  • I could also use a conditional statement to evaluate whether the two variables are the same.
test <- n$score == n$NARCSCORE
summary(test) # neat.
   Mode    TRUE    NA's 
logical   10480     763 
  • Probably other ways to do this; the point is that it's good to confirm that something given to you works the way you think it does!
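For instance, one more compact check (TRUE means no mismatches among the non-missing pairs):

all(n$score == n$NARCSCORE, na.rm = TRUE) # any FALSE anywhere would make this FALSE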