Recall that a vector is a one-dimensional array whose elements all share a single data type (for example character, numeric, or logical). We can perform several different transforms on a vector: multiplying each value by a scalar, creating a new vector by multiplying one vector by another, and so on. We can also transform the contents of a vector by applying a function to each element. If I have a vector called d$population, I can create a new vector as radius <- sqrt(d$population)/pi.
An example of this kind of manipulation is creating a table from a factor in a larger dataset. The result is a table in which each level of the factor is paired with the number of times it appears in that dataset. We can then create another vector containing percentages using the statement pct <- t/sum(t)*100, and add it to the table as a second row via t <- rbind(t,pct).
A logical vector is created whenever a comparison expression is used as an index. In an expression such as v[v < 10000], the comparison produces a vector whose values are TRUE wherever the corresponding element of v is less than 10000. Every element of v marked TRUE is then copied into the new vector. This is useful for creating subsets of larger datasets, as we shall see later on in this module. The subset() function provides another way to create a subset of values. A specific range of indexes can be used as well, for example to create a new vector consisting of the first six values of a Fibonacci sequence.
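The vector operations described above can be sketched in a few lines of R. This is only an illustration: the data frame d, its region and population columns, and the values in them are made up for the example, as is the fib vector.

```r
# Hypothetical data frame standing in for the dataset d used in the text.
d <- data.frame(
  region     = factor(c("north", "south", "north", "east", "north", "south")),
  population = c(2500, 14400, 8100, 40000, 900, 12100)
)

# Element-wise transforms: each operation is applied to every element.
doubled <- d$population * 2               # multiply each value by a scalar
squared <- d$population * d$population    # multiply one vector by another
radius  <- sqrt(d$population) / pi        # the expression from the text

# Count how often each level of the factor appears, then add percentages.
# (The name t follows the text; it shadows base R's t() only as a variable.)
t   <- table(d$region)
pct <- t / sum(t) * 100
t   <- rbind(t, pct)                      # two rows: counts and percentages

# Logical indexing: the comparison yields a logical vector ...
v        <- d$population
is_small <- v < 10000
# ... and indexing with it keeps only the positions marked TRUE.
small    <- v[is_small]

# subset() expresses the same filter directly on the data frame.
small_rows <- subset(d, population < 10000)

# A range of indexes: the first six values of a Fibonacci sequence.
fib       <- c(1, 1, 2, 3, 5, 8, 13, 21)
first_six <- fib[1:6]
```

Printing t after the rbind() call shows the counts in the first row and the percentages in the second, with one column per factor level.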
One of the first things to do when receiving a dataset is to validate your assumptions. Is the data clean? Does it make sense? I personally use head(ds) and tail(ds) to look at the first and last rows.
The next command is summary(), which provides the minimum, maximum, median, mean, and first and third quartile values. (Compare this against the values returned by the fivenum(ds) function, which reports the minimum, lower hinge, median, upper hinge, and maximum.)
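A first look at a dataset along these lines might go as follows. The data frame ds and its columns are hypothetical stand-ins for whatever data you have received. Note that fivenum() is applied here to a single numeric column, since it expects a numeric vector.

```r
# Hypothetical dataset; ds stands in for the data you received.
ds <- data.frame(
  id    = 1:10,
  value = c(12, 15, 11, 19, 14, 13, 18, 16, 17, 20)
)

first_rows <- head(ds)         # first six rows by default
last_rows  <- tail(ds, n = 3)  # last three rows

# summary() on a numeric column: minimum, first quartile, median, mean,
# third quartile, and maximum.
s <- summary(ds$value)

# fivenum() returns Tukey's five-number summary: minimum, lower hinge,
# median, upper hinge, and maximum.
f <- fivenum(ds$value)

print(s)
print(f)
```

The hinges reported by fivenum() may differ slightly from the quartiles reported by summary(), because the two functions use different rules for locating those values.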
Other functions include sd (standard deviation), var (variance), range (the lowest and highest values), and IQR, which returns the interquartile range (the difference between the third and first quartiles). The cor() function computes the correlation between variables in the dataset or, more specifically, between the vectors supplied as its x and y arguments.
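As a minimal sketch, these functions can be tried on two small vectors. The vectors x and y below are invented for the example; y is built as a linear function of x so that the correlation is easy to interpret.

```r
# Hypothetical numeric vectors for illustration.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
y <- 2 * x + 1                 # perfectly linear in x

x_sd    <- sd(x)               # sample standard deviation
x_var   <- var(x)              # sample variance (equals sd squared)
x_range <- range(x)            # lowest and highest values
x_iqr   <- IQR(x)              # third quartile minus first quartile
xy_cor  <- cor(x, y)           # Pearson correlation by default

# IQR() agrees with the difference of the quartiles from quantile().
q <- quantile(x, c(0.25, 0.75))
iqr_by_hand <- unname(q[2] - q[1])
```

Because y is an exact linear function of x with a positive slope, cor(x, y) comes out as 1. Real data will usually give a value strictly between -1 and 1.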