Review of basic data analytics methods using R

R

R is a big, complicated, messy, powerful, extensible framework for computing and graphing statistics. Written as a free, open-source implementation of the S language, its widespread availability and use have resulted in several vendors supplying R interfaces to their products.

There are five things that you should remember about R. Keeping them in mind will help you think about how to work with R and, more importantly, help you cope when R proves stubborn and insists that it doesn’t know what you’re talking about.

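The first thing to remember is that underneath it all, R is an object-oriented language: everything you work with is an object, and everything that happens is a function call. That means, for example, that the expression “x <- 3” actually invokes the assignment function with x and 3 as its arguments, i.e. `<-`(x, 3), which has the same effect as assign("x", 3).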
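As a minimal sketch of this first point, the lines below perform the same assignment three ways and then call an arithmetic operator as an ordinary function. The variable names x, y, z and s are chosen here for illustration only.

```r
# Assignment written the usual way
x <- 3

# The same kind of assignment written as an explicit function call;
# the backticks let us refer to the operator by name
`<-`(y, 3)

# assign() does the same job, with the variable name given as a string
assign("z", 3)

# Arithmetic operators are functions too: this is the same as x + y
s <- `+`(x, y)

print(c(x, y, z, s))   # prints: [1] 3 3 3 6
```

Nobody writes assignments in the backtick form in practice; it is shown only to make the function call visible.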

Second, almost everything in R is expressed as a vector or a group of vectors. Although x <- 1 looks like a scalar variable, it’s actually a vector with length 1. Similarly, v <- c(1,2,3,4,5) is a vector with five elements (length(v) is 5).

As regards data structures, each element of a vector can be addressed by a numerical index (e.g. v[3]); subscripts in R are 1-based as in Fortran or MATLAB, not 0-based as in Python, Perl or C. The function that creates a vector is c(). Its arguments can be all numbers or all character strings; if you mix the two, R converts the numbers to character strings, because every element of a vector must have the same type.
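The points about vectors can be checked at the R prompt. The following is a small sketch; the name m is introduced here for illustration, while x and v are the examples from the text.

```r
# A "scalar" is really a vector of length 1
x <- 1
is.vector(x)           # TRUE
length(x)              # 1

# c() combines its arguments into a vector
v <- c(1, 2, 3, 4, 5)
length(v)              # 5
v[1]                   # 1: the first element has index 1, not 0
v[3]                   # 3

# Mixing numbers and strings coerces every element to character
m <- c(1, "a", 2)
class(m)               # "character"
print(m)               # prints: [1] "1" "a" "2"
```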

Third, all commands in R are actually functions. Hence, you must type either quit() or q() to exit R. The name q is itself an object in R: a function. Simply typing q, without the parentheses, prints the definition of that function (the same as print(q)) instead of calling it.
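The difference between naming a function and calling it can be seen in a short sketch. The function square below is a hypothetical helper defined only for this illustration; q is left uncalled so that the session stays open.

```r
# q is an ordinary object whose type happens to be "function"
is.function(q)         # TRUE
class(q)               # "function"

# args() shows the argument list without calling the function
args(q)

# A user-defined function behaves the same way
square <- function(n) n^2

square                 # no parentheses: prints the definition
result <- square(4)    # parentheses: calls the function
print(result)          # prints: [1] 16
```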

Fourth, since R is object oriented, and since we have said various operators are implemented as functions, it’s no surprise that there are multiple commands in R that are much like virtual functions in other OO languages (R calls them generic functions). Consider the summary() function. Its behavior will differ markedly depending on the class of the object passed as an argument. For instance, summary(x) will print basic summary statistics about each column if x is a data frame, but will report the number of cases and a chi-squared test of independence if x is a table. The plot() function works the same way; for example, it draws a mosaic plot if x is a table. (We’ll see an example of that in one of our labs).

Finally, most commands in R have a large number of default arguments. For example, the lm() function (linear regression) looks like this:

lm(formula, data, subset, weights, na.action, method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, contrasts = NULL, offset, ...)

For simple regression, the usual invocation is lm(var1 ~ var2, data=). Other parameters take on a default value, or may not be needed based on the type of variables provided in the function call. Usually the simple invocation “just works” given the choice of default values. However, you may need to supply one or more of these parameters in order to change how the model is fitted or what is returned. The command help(lm) or ?lm will provide more detail in the style of *nix manual pages: the documentation describes the arguments, types, default values, etc., but it won’t explain how or when to use this particular function.
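The last two points, generic functions and default arguments, can be sketched together. The example assumes the cars data frame that ships with R's datasets package (columns speed and dist); the names fit and fit2 are chosen here for illustration.

```r
# Simple invocation: every argument other than formula and data
# is left at its default
fit <- lm(dist ~ speed, data = cars)
class(fit)                        # "lm"
print(coef(fit))                  # intercept and slope

# Overriding defaults: fit on a subset of rows and keep the model matrix
fit2 <- lm(dist ~ speed, data = cars, subset = speed > 10, x = TRUE)
"x" %in% names(fit)               # FALSE, because x = FALSE by default
"x" %in% names(fit2)              # TRUE

# summary() is generic: the result depends on the class of its argument
class(summary(cars))              # "table"      (data frame method)
class(summary(fit))               # "summary.lm" (lm method)
```

The check uses names(fit) rather than fit$x because the $ operator matches list names partially, so fit$x would silently return the xlevels component of the fitted model.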