Analyze Student Exam Items Using IRT

Item Response Theory (IRT) can be used to evaluate the effectiveness of exams given to students. One distinguishing feature from other paradigms is that IRT does not assume that every question is equally difficult (or that an item's difficulty is whatever the exam's author intended). In this way, it is an empirical investigation into the effectiveness of a given exam and can help the researcher 1) eliminate bad or problematic items and 2) judge whether the test was too difficult or the students simply didn't study.

In the following tutorial, we'll use R (R Core Team, 2013) along with the psych package (Revelle, 2013) to look at a hypothetical exam.

Before we get started, remember that R is a programming language. In the examples below, I perform operations on data using functions like cor and read.csv. We can also save the output as objects using the assignment arrow, <-. It’s a bit different from a point-and-click program like SPSS, but you don’t need to know how to program to analyze exams using IRT!
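
For example, the following toy snippet (made-up numbers, just to show the syntax) calls a function and saves its result as an object using the assignment arrow:

# Compute the mean of five scores and save it as my.mean
my.mean <- mean(c(90, 85, 77, 92, 68))
# Typing the object's name prints its value
my.mean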

First, load the psych package. Then load the students' grades into R using read.csv(), which is part of base R.

# Load the psych package
library(psych)
# Read the csv file and save it as grades
grades <- read.csv(file = "~/Downloads/Grades.csv")

Notice that we are using item-level grades, where each row is a given student and each cell is the number of points received on that question. Your matrix or data frame should look like this:

head(grades)
##   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
## 1  2  2  2  2  4  2  3  2  4   2   2   3   5   3   6   2   4   4   4   3
## 2  2  0  2  2  0  0  3  0  4   2   0   1   3   0   0   2   2   1   0   0
## 3  2  2  2  2  0  2  3  2  4   0   0   3   5   0   5   2   4   2   4   2
## 4  2  1  2  2  3  0  0  2  4   2   2   3   1   1   2   0   4   1   4   3
## 5  2  2  2  2  4  2  3  2  4   2   2   3   5   3   4   2   4   4   4   3
## 6  1  0  2  2  0  2  0  2  0   2   1   2   5   0   3   2   4   2   4   3
##   V21 V22 V23 V24 V25 V26 V27 Total
## 1   2   2   4   4   6   6   6    91
## 2   0   0   0   3   3   6   6    42
## 3   2   1   2   4   6   5   6    72
## 4   0   2   4   4   0   0   0    49
## 5   2   2   4   4   5   6   6    88
## 6   0   0   4   4   6   4   3    58

Next, compute the polychoric correlations on the raw item scores (not including the Total column). Polychoric correlations treat each item as a coarse, ordered categorization of a normally distributed latent trait (here, content knowledge); Pearson correlations computed directly on polytomous items tend to underestimate the strength of these relationships.

# Perform a polychoric correlation on grades and save it as grades.poly
grades.poly <- polychoric(x = grades[, 1:27], polycor = TRUE)
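
You can see the underestimation for yourself by comparing the average Pearson correlation with the average polychoric estimate, which will typically be somewhat larger (a quick check; polychoric() returns its correlation matrix in the rho element):

# Average inter-item correlation under each method
r.pearson <- cor(grades[, 1:27])
mean(r.pearson[lower.tri(r.pearson)])
mean(grades.poly$rho[lower.tri(grades.poly$rho)])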

Now that we have the polychoric correlations, we can run irt.fa() on them to estimate the item difficulties and the information each item provides.

# Using grades.poly, perform an IRT analysis and save it as grades.irt
grades.irt <- irt.fa(x = grades.poly, plot = TRUE)
# Plot the item information curves from grades.irt
plot(grades.irt)

# Plot the test information curve
plot(grades.irt, type = "test")

Thus, we have some great items that carry a lot of information about students of average and low content knowledge (e.g., V24, V17, V18), but not enough to distinguish among the high-knowledge students. In redesigning the exam for next semester or year, we might keep the best-performing questions while rewriting the weaker ones or trying new questions. At the same time, the second plot shows the performance of the test as a whole: we have great reliability for distinguishing who didn't study (the lower end of our latent trait), but overall the test may have been too hard.
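
To decide which items to keep, you can also inspect the estimated parameters directly (a minimal sketch; irt.fa() stores its item estimates in the irt element of the result):

# Item discriminations: higher values separate students more sharply
grades.irt$irt$discrimination
# Item difficulty estimates for each response category
grades.irt$irt$difficulty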

Rescaling the test

While some students in our hypothetical dataset did very well on the exam, instructors may need to rescale their exam so that the mean grade is an 85% or 87.5%. Using the scale() function (see also rescale() in the psych package), we can preserve the rank-order distribution of the students (allowing us to distinguish those who studied well) while shifting the sample distribution to fit in with other classes in your department.

Currently, our scores are in cumulative raw points. Let's plot a histogram to see the distribution of scores; note that we divide the Total column by 91 (the total points possible) to convert the scores into grade percentages.

# Load the ggplot2 library
library(ggplot2)
# Plot a histogram of the grade percentages
qplot((Total/91)*100, data = grades, geom = "histogram", xlab = 'Raw Grades',
  ylab = '# of Students', main = 'Distribution of student grades')
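
The mean and standard deviation of these percentages can be computed directly (using the same conversion as in the plot above):

# Summary statistics of the grade percentages
mean((grades$Total/91)*100)
sd((grades$Total/91)*100)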

The distribution has a mean of 71.68 percent and a standard deviation of 15.54. Given grade inflation, it may look like your students are doing poorly when in fact the distribution is similar to other courses being taught. Next, we can rescale the grades, creating a mean of 87.5 and a standard deviation of 7.5. These numbers are arbitrary, so use your best judgment.

# scale() standardizes Total to z-scores (returned as a one-column matrix),
# which we flatten, then stretch to sd 7.5 and center at 87.5
scaled.grades <- as.numeric(scale(grades$Total)) * 7.5 + 87.5
qplot(scaled.grades, xlab = 'Scaled Grades',
  ylab = '# of Students', main = 'Distribution of student grades')
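
Because this is a linear transformation, every student keeps their rank. One way to confirm this (using the scaled.grades object from above) is a Spearman correlation with the raw totals, which should be exactly 1:

# Spearman correlation compares only the rank orders
cor(grades$Total, scaled.grades, method = "spearman")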

The second distribution may be preferred, depending on your needs. With the raw distribution, we would have had about 45% of the students receiving grades below a C- (assuming a normal distribution). After rescaling, only about 0.9% of students would fall below the 70% cutoff. Again, the mean and standard deviation chosen in the above example are arbitrary.
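
These percentages follow directly from the normal distribution; you can reproduce them with pnorm(), using the means and standard deviations above and a 70% cutoff for a C-:

# Proportion of students below a 70% cutoff under each distribution
pnorm(70, mean = 71.68, sd = 15.54)  # raw: ~0.457, i.e. about 45%
pnorm(70, mean = 87.5, sd = 7.5)     # rescaled: ~0.0098, i.e. about 0.9%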