It is sometimes useful to see several variables of a dataset side by side: plotting them in context can magnify differences or reveal patterns that summary statistics hide. In the graphic above, four variables (sepal length, sepal width, petal length and petal width) are plotted against one another for three species of irises (the key is not listed in the graphic). Color represents species, allowing us to compare species for any particular pair of variables. Consider the panel second from the right in the top row, where sepal length is compared with petal length. Values for petal length are encoded along the bottom of the graphic; values for sepal length are encoded on the right-hand side. We can observe that the green and blue species are well matched, although the blue species generally has longer petals. The petal length of the red species, however, stays nearly constant, and the red points appear only in the lower half of the sepal length range.
As an exercise, imagine fitting a regression line to each of these individual panels. What would you make of the relationship between sepal length and sepal width?
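One way to try the exercise for sepal length and sepal width is sketched below. It is only a sketch: it assumes an ordinary least-squares fit with lm() and uses the built-in iris data frame. The names pooled_fit, pooled_slope and species_slopes are illustrative and are not part of the original plot code. It fits one line through all the points, then one line per species, so the slopes can be compared.

```r
# Pooled fit: one line through all 150 flowers, ignoring species.
pooled_fit <- lm(Sepal.Width ~ Sepal.Length, data = iris)
pooled_slope <- coef(pooled_fit)[["Sepal.Length"]]

# Per-species fits: one line for each species' own cloud of points.
species_slopes <- sapply(split(iris, iris$Species), function(d) {
  coef(lm(Sepal.Width ~ Sepal.Length, data = d))[["Sepal.Length"]]
})

print(pooled_slope)
print(species_slopes)
```

Compare the sign of the pooled slope with the signs of the per-species slopes, then look again at the colored clusters in the corresponding panel.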
The R code for generating the plot is:
pairs(iris[1:4], main = "Anderson's Iris Data -- 3 species",
      pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)])
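and uses the iris dataset included with the standard R distribution. Here, colors encode the species: the expression c("red", "green3", "blue")[unclass(iris$Species)] indexes a vector of three color names by each flower's integer species code, yielding one fill color per point. It also proves that the spirit of APL is alive and well.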
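The color indexing in the pairs() call can be unpacked step by step, as in the minimal sketch below. The variable names species_colors, species_codes and point_bg are illustrative only. It relies on iris$Species being a factor, so unclass() exposes the underlying integer codes in level order.

```r
# The three fill colors, in the same order as the factor levels.
species_colors <- c("red", "green3", "blue")

# A factor is stored as integer codes (1, 2, 3) plus a levels attribute;
# unclass() drops the factor class and exposes those codes.
species_codes <- unclass(iris$Species)

# Indexing by the codes repeats each color once per flower of that species.
point_bg <- species_colors[species_codes]

# Cross-tabulate species against assigned color to recover the missing key.
print(table(iris$Species, point_bg))
```

The resulting table serves as the key that the graphic omits: each species row has counts under exactly one color.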