SAPA Project Blog


Descriptive Statistics With the Psych Package in R

So, you’ve figured out how to get your data into R and you want to see some basic descriptive statistics. No sweat.

Are you using the psych package? If not, download it and install it (as described in my post on importing data). Once that’s done, the describe function (in the psych package) should give you all you need, and possibly more:

Descriptive Statistics with the psych package in R
library(psych)
describe(mydata)

Obviously, you will replace “mydata” in the command above with the name you have given your own data frame. If you’re unsure of what you called it, try typing ls() in R to list the objects in your workspace.
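If you'd like to try the command before pointing it at real data, here is a minimal sketch using a made-up data frame. The name `mydata` and its columns are hypothetical stand-ins, and the describe() call only runs if the psych package is already installed.

```r
# Hypothetical data frame standing in for your own data
mydata <- data.frame(
  gender = c(1, 2, 2, 1, 2),
  age    = c(19, 23, 31, 20, 45),
  height = c(64, 70, 67, 61, 72)
)

# List the objects in the workspace to confirm the data frame's name
ls()

# Run describe() only if the psych package is available
if (requireNamespace("psych", quietly = TRUE)) {
  print(psych::describe(mydata))
}
```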

The output will look something like this:

Output from describe(mydata)
       var   n  mean    sd median trimmed  mad min max range  skew kurtosis   se
gender   1 200  1.62  0.49      2    1.66 0.00   1   2     1 -0.51    -1.75 0.03
age      2 200 25.07 10.38     20   23.29 4.45  14  59    45  1.37     1.00 0.73
height   3 200 66.99  4.21     67   66.96 4.45  57  77    20  0.10    -0.71 0.30
q_6      4  21  5.19  1.03      6    5.35 0.00   3   6     3 -0.88    -0.60 0.22
q_22     5  27  3.30  1.38      4    3.30 1.48   1   6     5 -0.01    -1.27 0.27

Common Questions

Now, it’s not uncommon to get an error like this:

Error due to non-numeric variables
Error in describe(mydata) : non-numeric argument to 'describe'

This is because some of your variables (i.e., the columns) are stored as “character” strings. If you have some columns of data with text, then this may be appropriate. This sometimes also occurs with data that you expected to be “numeric” because of the way your data were originally entered. This can happen in Excel without being obvious. We have this issue occasionally with data being pulled out of SQL databases.
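As an illustration of how a column of numbers can end up stored as text, here is a small sketch. The CSV content is invented for the example: a single "N/A" entry is enough to make read.csv() treat the whole column as character, and declaring that code as missing at import is one way to avoid the problem.

```r
# Hypothetical CSV text: one height was entered as "N/A" instead of left blank
csv_text <- "age,height
20,67
25,N/A
31,70"

# By default read.csv() treats only the string "NA" as missing,
# so the "N/A" entry turns the whole height column into text
mydata <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(mydata$height)

# Declaring the extra missing-value code at import keeps the column numeric
mydata_clean <- read.csv(text = csv_text, na.strings = c("NA", "N/A"))
class(mydata_clean$height)
```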

It’s not hard to fix. This command will help us identify the problematic variables:

Identify non-numeric columns using sapply() and class()
sapply(mydata, class)
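In a wide data frame it can be easier to pull out just the names of the offending columns than to scan the full list. One way to do that, sketched here with a hypothetical data frame, is to test each column with is.numeric():

```r
# Hypothetical data frame in which height was read in as text
mydata <- data.frame(
  age    = c(20, 25, 31),
  height = c("67", "66", "70"),
  stringsAsFactors = FALSE
)

# Class of every column
sapply(mydata, class)

# Names of the columns that are not numeric
non_numeric <- names(mydata)[!sapply(mydata, is.numeric)]
non_numeric
```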

The output will show you the class of each variable in the data frame. Now, you have options. If you want to change the class of a variable (presumably because it is stored as “character” despite containing only numbers), the transform() function is very useful. For example:

transform() non-numeric columns
mydataTransformed <- transform(mydata, height = as.numeric(height), weight = as.numeric(weight))

Note that you only need to refer to the variables being changed when using transform(). And there may be some cases where it’s not a good idea to go about changing classes willy-nilly (R will give you a warning if NAs result; worry about this if/when it comes up).
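If NAs do turn up, it helps to see which original entries caused them. The sketch below uses a hypothetical data frame with one non-numeric height. It converts with as.numeric() so that decimals survive, and goes through as.character() first as a precaution in case the column was stored as a factor.

```r
# Hypothetical data: height was read in as text and contains a stray entry
mydata <- data.frame(
  age    = c(20, 25, 31),
  height = c("67.5", "unknown", "70"),
  stringsAsFactors = FALSE
)

# Convert only the column that needs it. as.character() guards against factors,
# whose underlying integer codes would otherwise be returned. The coercion
# warning is suppressed here because the NAs are inspected explicitly below.
mydataTransformed <- transform(
  mydata,
  height = suppressWarnings(as.numeric(as.character(height)))
)

# Rows where a non-missing original value became NA during conversion
bad_rows <- which(is.na(mydataTransformed$height) & !is.na(mydata$height))
mydata$height[bad_rows]
```

Looking at the original values in `bad_rows` usually makes it clear whether they should be recoded, corrected, or left as missing.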

If you want to leave variables as they are because the character class is appropriate, then just tell the describe() function to ignore those columns. For example, if the 3rd variable contained character strings, you could leave that column out when running describe():

describe() a subset of data
describe(mydata[,c(1,2,4:9)])
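Counting column positions by hand gets tedious in wide data frames. One alternative, sketched here with a hypothetical data frame, is to let is.numeric() pick the columns. The describe() call only runs if psych is installed.

```r
# Hypothetical data frame with one text column
mydata <- data.frame(
  age     = c(20, 25, 31),
  height  = c(67, 66, 70),
  comment = c("none", "late", "none"),
  stringsAsFactors = FALSE
)

# Logical vector marking which columns are numeric
numeric_cols <- sapply(mydata, is.numeric)

# Keep only the numeric columns; drop = FALSE keeps a data frame
# even if only one column remains
mydata_numeric <- mydata[, numeric_cols, drop = FALSE]

if (requireNamespace("psych", quietly = TRUE)) {
  print(psych::describe(mydata_numeric))
}
```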

And finally, you may only want to get a subset of the information returned from describe(). Since describe() returns a data frame, we can use the colnames() command to see which columns it contains:

Using colnames() to look into describe()
colnames(describe(mydata))

Output from colnames(describe(mydata))
 [1] "var"      "n"        "mean"     "sd"       "median"   "trimmed"
 [7] "mad"      "min"      "max"      "range"    "skew"     "kurtosis"
[13] "se"

We see that column 1 corresponds to the variable number, column 2 is the sample size (n), and so on. For example, if you only want means, standard deviations, and medians:

describe(mydata)[,c(3,4,5)]

Gives something like this:

Get only the means, standard deviations, and medians.
        mean    sd median
gender  1.62  0.49      2
age    25.07 10.38     20
height 66.99  4.21     67
q_6     5.19  1.03      6
q_22    3.30  1.38      4
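Selecting by column name is an alternative to counting positions, and it makes the intent easier to read. A minimal sketch, assuming psych is installed and using a hypothetical data frame:

```r
# Hypothetical data frame
mydata <- data.frame(
  age    = c(19, 23, 31, 20, 45),
  height = c(64, 70, 67, 61, 72)
)

# The statistics we want, referred to by column name
wanted <- c("mean", "sd", "median")

if (requireNamespace("psych", quietly = TRUE)) {
  stats_subset <- psych::describe(mydata)[, wanted]
  print(stats_subset)
}
```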

All done.