Data Types Used in R

The workhorse data types of R are the vector and the data frame. Recall that (almost) everything in R is an object, and that the basic values are vectors. Single numbers and strings are one-element vectors (that is, for a number n and a string s, length(n) == length(s) is TRUE because both have length 1). Vectors can be numeric (c(1, 2, 3)), character (c("WoW", "Good", "Bad")), or mixed (c(1, "two", 3)). A vector that mixes numbers and strings is always coerced to character.

Factors are categorical variables. If the available data doesn't include a particular label, the full set of allowed labels can be supplied as the second argument (levels) to the factor() function. Lists are collections of components, typically named vectors. In the example for this section, we define two character vectors, levels and ratings. We create a factor, f, using ratings as our values and levels as the allowed levels, and then create a list structure from our ratings vector and a new vector for critics.

You can write your own functions in R. A function can simply alias an existing R function, as in the same example, where std(x) calls the R function sd() to compute the standard deviation of a vector, or it can be arbitrarily complex. See help("function") in the online help for more details.
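The vector, factor, list, and function ideas described above can be sketched roughly as follows. This is a minimal illustration, not the original example: the variable names levels, ratings, critics, f, and std follow the text, but every value assigned to them here is a made-up assumption.

```r
# Scalars are just vectors of length 1.
n <- 42
s <- "hello"
length(n) == length(s)   # TRUE: both have length 1

# Numeric, character, and mixed vectors.
nums  <- c(1, 2, 3)
words <- c("WoW", "Good", "Bad")
mixed <- c(1, "two", 3)  # the numbers are coerced to character
class(nums)              # "numeric"
class(words)             # "character"
class(mixed)             # "character"

# A factor built from ratings, with the allowed levels passed as the
# second argument so that labels missing from the data are still known.
levels  <- c("Bad", "Good", "WoW")     # hypothetical label set
ratings <- c("Good", "Good", "WoW")    # hypothetical data; "Bad" never occurs
f <- factor(ratings, levels)
table(f)                               # "Bad" is listed with a count of 0

# A list holding the ratings vector and a new critics vector.
critics <- c("Ann", "Bob", "Cy")       # hypothetical names
reviews <- list(ratings = ratings, critics = critics)
reviews$critics

# Aliasing an existing function: std(x) simply calls sd(x).
std <- function(x) sd(x)
std(nums)                              # 1
```

Naming a vector levels, as the text does, does not break the built-in levels() function, because R skips non-function objects when it looks up the name in a call.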

R's structured types are the matrix, the table, and the data frame. The matrix is what you think it is: an N by M array, usually consisting of numeric values. Tables are our old friend the contingency table, especially useful for examining nominal or ordinal data. Finally, data frames are the real workhorse of R. These structures most directly reflect a dataset view of the world, where each row (record) contains several data fields. Rows are usually identified by number (1..n), as opposed to tables, where rows are named entities ("High", "Medium", "Low"). There are several ways to extract data from a structured type. You can select a subset of rows (dfm[1:10, ]) or a subset of columns (dfm[, 3:4]). You can assign a column to a vector, and that vector will take on the column's type (numeric, character, etc.). These "slices" can be transformed into other types by using the as.* family of functions (e.g. dfm <- as.data.frame(t)).
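A short sketch of the structured types and the subsetting operations just described. The data frame dfm and the table t reuse the names from the text, but their columns and values are invented purely for illustration.

```r
# A 2-by-3 numeric matrix (N rows by M columns).
m <- matrix(1:6, nrow = 2, ncol = 3)
dim(m)

# A small data frame; all column names and values are hypothetical.
dfm <- data.frame(
  id    = 1:12,
  name  = sprintf("subject%02d", 1:12),
  age   = c(23, 35, 41, 29, 52, 38, 27, 44, 31, 60, 36, 48),
  level = rep(c("High", "Medium", "Low"), times = 4),
  stringsAsFactors = FALSE
)

first_ten <- dfm[1:10, ]   # subset of rows
cols_3_4  <- dfm[, 3:4]    # subset of columns (age and level)
ages      <- dfm$age       # one column as a vector; it stays numeric
class(ages)                # "numeric"

# A contingency table: entries are labelled by category, not by row number.
# The name t follows the text; it is also the name of R's transpose function.
t <- table(dfm$level)
names(t)

# Convert the table into a data frame with the as.* family.
counts <- as.data.frame(t)
class(counts)              # "data.frame"
```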

Why does this matter? There are two reasons:

1. Knowing the class of an R variable (via class(v)) helps us understand where and when it can be used in a function, and whether it first needs to be converted into a different representation (foo <- (as.data.frame(t…))

2. Knowing the type of the underlying data helps us understand when data conversion is needed. Sometimes what appears to be numeric data is encoded as character strings (the string "12345" is not the number 12345, and arithmetic such as "12345" + 1 fails). Hence, in order to perform certain calculations, we may need to convert the data first (as.numeric(t$age)).
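Both points can be illustrated with a small sketch. The data frame people and its contents are hypothetical stand-ins for a dataset whose age column was read in as text.

```r
# Hypothetical data frame in which age was stored as character strings.
people <- data.frame(age = c("34", "27", "45"), stringsAsFactors = FALSE)

class(people)       # "data.frame"
class(people$age)   # "character": looks numeric, but is stored as strings

# Calculations such as mean() need numbers, so convert the column first.
people$age <- as.numeric(people$age)
class(people$age)   # "numeric"
mean(people$age)
```

Checking class() before a calculation is a cheap way to catch this kind of mismatch early.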