Beginner's guide to R: Painless data visualization
- 06 June, 2013 14:28
One of the most appealing things about R is its ability to create data visualizations with just a couple of lines of code.
For example, it takes just one line of code -- and a short one at that -- to plot two variables in a scatterplot. Let's use as an example the mtcars data set installed with R by default. To plot the engine displacement column disp on the x axis and mpg on y:
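One way to write that line, matching the labeled versions used later in this guide (x values first, y values second):

```r
# Scatterplot of engine displacement (x) against miles per gallon (y)
plot(mtcars$disp, mtcars$mpg)
```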
You really can't get much easier than that.
[This story is part of Computerworld's "Beginner's guide to R." To read from the beginning, check out the introduction; there are links on that page to the other pieces in the series.]
Of course that's a pretty no-frills graphic. If you'd like to label your x and y axes, use the parameters xlab and ylab. To add a main headline, such as "MPG compared with engine displacement," use the parameter main:
plot(mtcars$disp, mtcars$mpg, xlab="Engine displacement", ylab="mpg", main="MPG compared with engine displacement")
If you find having the y-axis labels rotated 90 degrees annoying (as I do), you can position them for easier reading with the las=1 argument:
plot(mtcars$disp, mtcars$mpg, xlab="Engine displacement", ylab="mpg", main="MPG vs engine displacement", las=1)
What's las and why is it 1? las refers to label style, and it's got four options. 0 is the default, with text always parallel to its axis. 1 is always horizontal, 2 is always perpendicular to the axis and 3 is always vertical. For much more on plot parameters, run the help command on par like so:
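The help page can be opened with the ? shortcut or with the equivalent help() call:

```r
# Open the documentation for graphical parameters such as las
?par

# Equivalent long form
help("par")
```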
In addition to the basic dataviz functionality included with standard R, there are numerous add-on packages to expand R's visualization capabilities. Some packages are for specific disciplines such as biostatistics or finance; others add general visualization features.
Why use an add-on package if you don't need something discipline-specific? If you're doing more complex dataviz, or want to pretty up your graphics for presentations, some packages have more robust options. Another reason: The organization and syntax of an add-on package might appeal to you more than do the R defaults.
In particular, the ggplot2 package is quite popular and worth a look for robust visualizations. ggplot2 requires a bit of time to learn its "Grammar of Graphics" approach.
But once you've got that down, you have a tool to create many different types of visualizations using the same basic structure.
If ggplot2 isn't installed on your system yet, install it with the command:
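The standard installer is install.packages(). The sketch below wraps it in a requireNamespace() check so the download is skipped when the package is already present; that guard is an addition for convenience, and a plain install.packages("ggplot2") does the same job unconditionally:

```r
# Install ggplot2 from CRAN only if it is not already available
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
```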
You only need to do this once.
To use its functions, load the ggplot2 package into your current R session -- you only need to do this once per R session -- with the library() function:
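Loading the package looks like this:

```r
# Attach ggplot2 so its functions can be called without a prefix
library(ggplot2)
```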
On to some ggplot2 examples.
ggplot2 has a "quick plot" function called qplot() that is similar to R's basic plot() function but adds some options. (Since ggplot2 3.4.0, qplot() is deprecated and prints a warning, but it still works.) The basic quick plot code:
qplot(disp, mpg, data=mtcars)
generates a scatterplot.
The qplot default starts the y axis at a value that makes sense to R. However, you might want your y axis to start at 0 so you can better see whether changes are truly meaningful (starting a graph's y axis at your first value instead of 0 can sometimes exaggerate changes).
Use the ylim argument to manually set your lower and upper y axis limits:
qplot(disp, mpg, ylim=c(0,35), data=mtcars)
Bonus intermediate tip: Sometimes on a scatterplot you may not be sure if a point represents just one observation or multiple ones, especially if you've got data points that repeat -- such as in this example that ggplot2 creator Hadley Wickham generated with the command:
qplot(cty, hwy, data=mpg)
The "jitter" geom parameter introduces just a little randomness in the point placement so you can better see multiple points:
qplot(cty, hwy, data=mpg, geom="jitter")
As you might have guessed, if there's a "quick plot" function in ggplot2 there's also a more robust, full-featured plotting function. That's called ggplot() -- yes, while the add-on package is called ggplot2, the function is ggplot() and not ggplot2().
The code structure for a basic graph with ggplot() is a bit more complicated than in either plot() or qplot(); it goes as follows:
ggplot(mtcars, aes(x=disp, y=mpg)) + geom_point()
The first argument in the ggplot() function, mtcars, is fairly easy to understand -- that's the data set you're plotting. But what's with "aes()" and "geom_point()"?
"aes" stands for aesthetics -- what are considered visual properties of the graph. Those are things like position in space, color and shape.
"geom" is the graphing geometry you're using, such as lines, bars or the shapes of your points.
Now if "line" and "bar" also seem like aesthetic properties to you, similar to shape, well, you can either accept that's how it works or do some deep reading into the fundamentals behind the Grammar of Graphics. (Personally, I just take Wickham's word for it.)
Want a line graph instead? Simply swap out geom_point() and replace it with geom_line(), as in this example that plots temperature vs pressure in R's sample pressure data set:
ggplot(pressure, aes(x=temperature, y=pressure)) + geom_line()
It may be a little confusing here since both the data set and one of its columns are called the same thing: pressure. That first "pressure" represents the name of the data frame; the second, "y=pressure," represents the column named pressure.
In these examples, I set only x and y aesthetics. But there are lots more aesthetics we could add, such as color, size and shape.
You can also set where the y axis starts in ggplot(), but not by passing ylim as an argument to ggplot() itself; add ylim() to the plot as its own component instead. If mydata is the name of your data frame, xcol is the name of the column you want on the x axis and ycol is the name of the column you want on the y axis, use ylim() like this, where NA tells ggplot2 to work out the upper limit from the data:
ggplot(mydata, aes(x=xcol, y=ycol)) + geom_line() + ylim(0, NA)
Perhaps you'd like both lines and points on that temperature vs. pressure graph?
ggplot(pressure, aes(x=temperature, y=pressure)) + geom_line() + geom_point()
The point here (pun sort of intended) is that you can start off with a simple graphic and then add all sorts of customizations: Set the size, shape and color of the points, plot multiple lines with different colors, add labels and a ton more. See Bar and line graphs (ggplot2) for a few examples, or the R Graphics Cookbook by Winston Chang for many more.
To make a bar graph from the sample BOD data frame included with R, the basic R function is barplot(). So, to plot the demand column from the BOD data set on a bar graph, you can use the command:
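A minimal call that fits that description passes the column straight to barplot():

```r
# Bar graph of the demand column in the built-in BOD data frame
barplot(BOD$demand)
```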
Add main="Graph of demand" if you want a main headline on your graph:
barplot(BOD$demand, main="Graph of demand")
To label the bars on the x axis, use the names.arg argument and set it to the column you want to use for labels:
barplot(BOD$demand, main="Graph of demand", names.arg = BOD$Time)
Sometimes you'd like to graph the counts of a particular variable but you've got just raw data, not a table of frequencies. R's table() function is a quick way to generate counts for each distinct value in your data.
The R Graphics Cookbook uses an example of a bar graph for the number of 4-, 6- and 8-cylinder vehicles in the mtcars data set. Cylinders are listed in the cyl column, which you can access in R using mtcars$cyl.
Here's code to get the count of how many entries there are by cylinder with the table() function; it stores results in a variable called cylcount:
cylcount <- table(mtcars$cyl)
That creates a table called cylcount containing:
 4  6  8
11  7 14
Now you can create a bar graph of the cylinder count:
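Since barplot() accepts a table directly, a sketch of that step is short (the table is rebuilt here so the snippet runs on its own):

```r
# Count cars by number of cylinders, then plot the counts as bars
cylcount <- table(mtcars$cyl)
barplot(cylcount)
```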
ggplot2's qplot() quick plotting function can also create bar graphs:
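A sketch of what such a call might look like, assuming ggplot2 is installed. The numeric cyl column is passed as the only variable:

```r
library(ggplot2)

# With a single numeric variable, qplot() draws counts along a continuous x axis
qplot(cyl, data = mtcars)
```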
However, this defaults to an assumption that 4, 6 and 8 are part of a variable set that could run from 4 through 8, so it shows blank entries for 5 and 7.
To treat cylinders as distinct groups -- that is, you've got a group with 4 cylinders, a group with 6 and a group with 8, not the possibility of entries anywhere between 4 and 8 -- you want cylinders to be treated as a statistical factor:
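One way to do that is to wrap the column in factor() inside the qplot() call. The explicit geom="bar" below is optional and is included only to make the intent obvious:

```r
library(ggplot2)

# factor() turns each cylinder count into its own category, so only 4, 6 and 8 appear
qplot(factor(cyl), data = mtcars, geom = "bar")
```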
To create a bar graph with the more robust ggplot() function, you can use syntax such as:
ggplot(mtcars, aes(factor(cyl))) + geom_bar()
Histograms work pretty much the same, except you want to specify how many buckets or bins you want your data to be separated into. For base R graphics, use:
hist(mydata$columnName, breaks = n)
where columnName is the name of your column in a mydata dataframe that you want to visualize, and n is the number of bins you want.
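The ggplot2 commands use binwidth, which sets the width of each bin in the units of your data rather than the number of bins. If w is the bin width you want, the quick plot version is:
qplot(columnName, data=mydata, binwidth=w)
and, for the more robust ggplot():
ggplot(mydata, aes(x=columnName)) + geom_histogram(binwidth=w)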
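For a concrete version of these templates, here is a sketch using the mpg column of mtcars. The values 10 and 2.5 are arbitrary choices for illustration:

```r
library(ggplot2)

# Base R: ask for roughly 10 bins (hist() treats the number as a suggestion)
hist(mtcars$mpg, breaks = 10)

# ggplot2: each bin spans 2.5 miles per gallon
ggplot(mtcars, aes(x = mpg)) + geom_histogram(binwidth = 2.5)
```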
You may be starting to see strong similarities in syntax for various ggplot() examples. While the ggplot() function is somewhat less intuitive, once you wrap your head around its general principles, you can do other types of graphics in a similar way.
Additional graphics options
There are many more graphics types in R than these few I've mentioned. Boxplots, a statistical staple showing minimum and maximum, first and third quartiles and median, have their own function called, intuitively, boxplot(). If you want to see a boxplot of the mpg column in the mtcars data frame it's as simple as:
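The call takes the column as its only required argument:

```r
# Boxplot of miles per gallon across all cars in mtcars
boxplot(mtcars$mpg)
```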
To see side-by-side boxplots in a single plot, such as the x, y and z measurements of all the diamonds in the diamonds sample data set included in ggplot2:
boxplot(diamonds$x, diamonds$y, diamonds$z)
Creating a heat map in R is more complex but not ridiculously so. There's an easy-to-follow tutorial on Flowing Data.
Looking at nothing but black and white graphics can get tiresome after a while. Of course, there are numerous ways of using color in R.
Colors in R have both names and numbers as well as the usual RGB hex code, HSV (hue, saturation and value) specs and others. And when I say "names," I don't mean just the usual "red," "green," "blue," "black" and "white." R has 657 named colors. The colors() or colours() function -- R does not discriminate against either American or British English -- gives you a list of all of them. If you want to see what they look like, not just their text names, you can get a full, multi-page PDF chart with color numbers, color names and swatches, sorted in various ways. Or you can find just the names and color swatches for each.
There are also R functions that automatically generate a vector of n colors using a specific color palette such as "rainbow" or "heat":
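Base R ships several of these palette functions, and each takes the number of colors as its first argument. In this sketch n is set to 4 purely as an example:

```r
n <- 4  # number of colors wanted (arbitrary example value)

rainbow(n)         # hues spread around the color wheel
heat.colors(n)     # reds through yellows
terrain.colors(n)  # greens and browns
topo.colors(n)     # blues, greens and yellows
cm.colors(n)       # cyan to magenta
```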
So, if you want five colors from the rainbow palette, use:
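That is a single call with the count as its argument:

```r
# Returns a character vector of five hex color codes
rainbow(5)
```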
For many more details, check the help command on a palette such as:
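For instance, the rainbow help page also documents the other built-in palettes:

```r
# Help page covering rainbow(), heat.colors() and related palette functions
?rainbow
```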
Now that you've got a list of colors, how do you get them in your graphic? Here's one way. Say you're drawing a 3-bar barchart using ggplot() and want to use 3 colors from the rainbow palette. You can create a 3-color vector like:
mycolors <- rainbow(3)
Or for the heat.colors palette:
mycolors <- heat.colors(3)
Now instead of using the geom_bar() function without any arguments, add fill=mycolors to geom_bar() like this:
ggplot(mtcars, aes(x=factor(cyl))) + geom_bar(fill=mycolors)
You don't need to put your list of colors in a separate variable, by the way; you can merge it all in a single line of code such as:
ggplot(mtcars, aes(x=factor(cyl))) + geom_bar(fill=rainbow(3))
But it may be easier to separate the colors out if you want to create your own list of colors instead of using one of the defaults.
The basic R plotting functions can also accept a vector of colors, such as:
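A sketch using the BOD bar graph from earlier; the col argument takes the color vector, and six colors are requested because BOD has six rows:

```r
# One color per bar: BOD has six rows, so request six colors
barplot(BOD$demand, col = rainbow(6))
```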
You can use a single color if you want all the items to be the same color (something other than the default gray), such as:
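Here is a sketch with one of R's named colors; "royalblue3" is just an example pick from the colors() list:

```r
# A single color name applies to every bar
barplot(BOD$demand, col = "royalblue3")
```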
Chances are, you'll want to use color to show certain characteristics of your data, as opposed to simply assigning random colors in a graphic. That goes a bit beyond beginning R, but to give one example, say you've got a vector of test scores:
testscores <- c(96, 71, 85, 92, 82, 78, 72, 81, 68, 61, 78, 86, 90)
You can do a simple barplot of those scores like this:
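The call needs only the vector itself (the scores are repeated here so the snippet runs on its own):

```r
testscores <- c(96, 71, 85, 92, 82, 78, 72, 81, 68, 61, 78, 86, 90)

# One bar per score, in the order the scores appear in the vector
barplot(testscores)
```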
And you can make all the bars blue like this:
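The col argument with a single color name does that:

```r
testscores <- c(96, 71, 85, 92, 82, 78, 72, 81, 68, 61, 78, 86, 90)

# The color name is a string, so it goes in quotation marks
barplot(testscores, col = "blue")
```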
But what if you want the scores 80 and above to be blue and the lower scores to be red? To do this, create a vector of colors of the same length and in the same order as your data, adding a color to the vector based on the data. In other words, since the first test score is 96, the first color in your color vector should be blue; since the second score is 71, the second color in your color vector should be red; and so on.
Of course, you don't want to create that color vector manually! Here's a statement that will do so:
testcolors <- ifelse(testscores >= 80, "blue", "red")
If you've got any programming experience, you might guess that this creates a vector that loops through the testscores data and runs the conditional statement: 'If this entry in testscores is greater than or equal to 80, add "blue" to the testcolors vector; otherwise add "red" to the testcolors vector.'
Now that you've got the list of colors properly assigned to your list of scores, just add the testcolors vector as your desired color scheme:
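Passing the vector to col is enough. The scores and colors are rebuilt here so the snippet runs on its own:

```r
testscores <- c(96, 71, 85, 92, 82, 78, 72, 81, 68, 61, 78, 86, 90)
testcolors <- ifelse(testscores >= 80, "blue", "red")

# Each bar takes the color at the matching position in testcolors
barplot(testscores, col = testcolors)
```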
Note that the name of a color must be in quotation marks, but a variable name that holds a list of colors should not be within quote marks.
Add a graph headline:
barplot(testscores, col=testcolors, main="Test scores")
And have the y axis go from 0 to 100:
barplot(testscores, col=testcolors, main="Test scores", ylim=c(0,100))
Then use las=1 to style the axis label to be horizontal and not turned 90 degrees vertical:
barplot(testscores, col=testcolors, main="Test scores", ylim=c(0,100), las=1)
And you've got a color-coded bar graph.
By the way, if you wanted the scores sorted from highest to lowest, you could have set your original testscores variable to:
testscores <- sort(c(96, 71, 85, 92, 82, 78, 72, 81, 68, 61, 78, 86, 90), decreasing = TRUE)
The sort() function defaults to ascending sort; for descending sort you need the additional argument: decreasing = TRUE.
If that code above is starting to seem unwieldy to you as a beginner, break it into two lines for easier reading, and perhaps also set a new variable for the sorted version:
testscores <- c(96, 71, 85, 92, 82, 78, 72, 81, 68, 61, 78, 86, 90)
testscores_sorted <- sort(testscores, decreasing = TRUE)
If you had scores in a data frame called results with one column of student names called students and another column of scores called testscores, you could use the ggplot2 package's ggplot() function as well:
ggplot(results, aes(x=students, y=testscores)) + geom_bar(fill=testcolors, stat = "identity")
Why stat = "identity"? That's needed here to show that the y axis represents a numerical value as opposed to an item count.
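To try this end to end, here is a sketch with a small made-up results data frame. The student names and the five scores are invented for illustration and are not from any real data set:

```r
library(ggplot2)

# Hypothetical data frame: names and scores are invented for illustration
results <- data.frame(
  students   = c("Ana", "Ben", "Cal", "Dee", "Eli"),
  testscores = c(96, 71, 85, 78, 90)
)

# Colors are assigned row by row, in the same order as the data
testcolors <- ifelse(results$testscores >= 80, "blue", "red")

# stat = "identity" plots the testscores values instead of counting rows
ggplot(results, aes(x = students, y = testscores)) +
  geom_bar(fill = testcolors, stat = "identity")
```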
ggplot2's qplot() also has easy ways to color bars by a factor, such as number of cylinders, and then automatically generate a legend. Here's an example of a graph counting the number of 4-, 6- and 8-cylinder cars in the mtcars data set:
qplot(factor(cyl), data=mtcars, geom="bar", fill=factor(cyl))
But, as I said, we're getting somewhat beyond a beginner's overview of R when coloring by factor. For a few more examples and details for many of the themes covered here, you might want to see the online tutorial Producing Simple Graphs with R. For more on graphing with color, check out a source such as the R Graphics Cookbook. The ggplot2 documentation also has a lot of examples, such as this page for bar geometry.
Exporting your graphics
You can save your R graphics to a file for use outside the R environment. RStudio has an export option in the plots tab of the bottom right window.
If you are using "plain" R in Windows, you can also right-click the graphics window to save the file.
To save a plot with R commands and not point-and-click, first create a container for your image using a function such as jpeg(), png(), svg() or pdf(). Those functions need to have a file name as one argument and optionally width and height, such as:
jpeg("myplot.jpg", width=350, height=420)
Generate the plot with whatever graphics functions it requires, such as:
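Any plotting command works at this step; this sketch reuses the scatterplot from the start of the guide:

```r
# While a file device is open, the plot is drawn into the file instead of on screen
plot(mtcars$disp, mtcars$mpg)
```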
And then issue the command:
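The function that closes a graphics device is dev.off(). Put together, the full sequence might look like this, using the example file name and dimensions from above:

```r
# 1. Open a JPEG file as the graphics device
jpeg("myplot.jpg", width = 350, height = 420)

# 2. Draw the plot; it goes to the file rather than the screen
plot(mtcars$disp, mtcars$mpg)

# 3. Close the device so the file is written to disk
dev.off()
```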
That will save your graphic to its container.
If you are using ggplot2, you can also use the function ggsave(), which defaults to saving the last plot you created using ggplot2 at the size you displayed it. Based on the filename you pass to the ggsave() function, it will save in the appropriate format -- myplot.jpg saves as a JPEG, myplot.png saves as a PNG and so on.
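A short sketch of ggsave() in use. The plot is stored in a variable named p (a name chosen here for illustration) and passed explicitly, although the plot argument can be left out to save the most recent ggplot2 plot:

```r
library(ggplot2)

# Build a plot and keep it in a variable
p <- ggplot(mtcars, aes(x = disp, y = mpg)) + geom_point()

# The .png extension selects the file format
ggsave("myplot.png", plot = p)
```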
One final note: If you're working in RStudio and would like to see a larger version of your plot, click the Zoom button and a larger window with the current graphic will open. And, also in RStudio, to see previous plots from your session, click the back arrow.