4 data wrangling tasks in R for advanced beginners

Learn how to add columns, get summaries, sort your results and reshape your data.

Bonus special case: Grouping by date range

If you've got a series of dates and associated values, there's an extremely easy way to group them by date range such as week, month, quarter or year: R's cut() function.

Here are some sample data in a vector:

vDates <- as.Date(c("2013-06-01", "2013-07-08", "2013-09-01", "2013-09-15"))

Which creates:

[1] "2013-06-01" "2013-07-08" "2013-09-01" "2013-09-15"

The as.Date() function is important here; otherwise R will treat each item as a character string rather than a date.

If you want a second vector that groups those dates by month, you can use the cut() function with this basic syntax:

vDates.bymonth <- cut(vDates, breaks = "month")

That produces:

[1] 2013-06-01 2013-07-01 2013-09-01 2013-09-01
Levels: 2013-06-01 2013-07-01 2013-08-01 2013-09-01

It might be easier to see what's happening if we combine these into a data frame:

dfDates <- data.frame(vDates, vDates.bymonth)

Which creates:

      vDates vDates.bymonth
1 2013-06-01     2013-06-01
2 2013-07-08     2013-07-01
3 2013-09-01     2013-09-01
4 2013-09-15     2013-09-01

The new column gives the starting date for each month, making it easy to then slice by month.
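As a rough sketch of that slicing step, the snippet below pairs the sample dates with some values and totals them by month using base R's tapply(). The vValues numbers are hypothetical, invented purely for illustration. Note that cut() keeps an empty level for August, so droplevels() is used here to leave out months with no data.

```r
# Sample dates from above, plus hypothetical values made up for this example
vDates <- as.Date(c("2013-06-01", "2013-07-08", "2013-09-01", "2013-09-15"))
vValues <- c(10, 20, 30, 40)

# Label each date with the first day of its month
vDates.bymonth <- cut(vDates, breaks = "month")

# Total the values within each month, dropping months that have no data
monthlyTotals <- tapply(vValues, droplevels(vDates.bymonth), sum)
print(monthlyTotals)

# The same approach works for other ranges, such as quarters
vDates.byquarter <- cut(vDates, breaks = "quarter")
print(table(vDates.byquarter))
```

Swapping "month" for "week" or "year" in the breaks argument changes the grouping without any other changes to the code.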

Sorting your results

For a simple sort by one column in base R, you can get the order you want with the order() function, such as:

companyOrder <- order(companiesData$margin)

This tells you how your rows would be reordered, returning a vector of row numbers such as:

6 1 9 2 5 3 4 7 8

Chances are, you're not interested in the new order by row number but instead want to see the data itself reordered. You can use that order to rearrange the rows in your data frame with this code:

companiesOrdered <- companiesData[companyOrder,]

where companyOrder is the order you created earlier. Or, you can do this in a single (but perhaps less human-readable) line of code:

companiesOrdered <- companiesData[order(companiesData$margin),]

If you forget the comma after the new order for your rows, you'll get an error: without the comma, R treats those numbers as column positions, and this data frame has only five columns. Once again, a comma followed by nothing defaults to "all columns," but you can also specify just certain columns, like:

companiesOrdered <- companiesData[order(companiesData$margin),c("fy", "company")]
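If you'd like to run these steps end to end, here is a minimal sketch. It assumes a hand-typed stand-in for companiesData, rebuilt from the figures printed in this article; the real data frame was created earlier and may differ in details such as column types.

```r
# Stand-in for companiesData, typed in by hand for this example
companiesData <- data.frame(
  fy      = rep(c(2010, 2011, 2012), times = 3),
  company = rep(c("Apple", "Google", "Microsoft"), each = 3),
  revenue = c(65225, 108249, 156508, 29321, 37905, 50175, 62484, 69943, 73723),
  profit  = c(14013, 25922, 41733, 8505, 9737, 10737, 18760, 23150, 16978),
  margin  = c(21.5, 23.9, 26.7, 29.0, 25.7, 21.4, 30.0, 33.1, 23.0)
)

# order() returns row positions, smallest margin first
companyOrder <- order(companiesData$margin)
print(companyOrder)

# Use those positions as the row index; the empty slot after the comma keeps all columns
companiesOrdered <- companiesData[companyOrder, ]
print(companiesOrdered)

# Or name the columns you want after the comma
companiesSubset <- companiesData[companyOrder, c("fy", "company")]
print(companiesSubset)
```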

To sort in descending order, put a minus sign before the ordering column when you create companyOrder:

companyOrder <- order(-companiesData$margin)

And then:

companiesOrdered <- companiesData[companyOrder,]
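The minus sign works because margin is numeric. As a hedged sketch of two related options, order() also accepts a decreasing = TRUE argument, and it can take more than one sort key. The data frame below is a cut-down stand-in using only the Apple and Google figures shown in this article, and the names byMinus, byDecreasing and byYearThenMargin are made up for illustration.

```r
# Cut-down stand-in for companiesData (Apple and Google rows only)
companiesData <- data.frame(
  fy      = c(2010, 2011, 2012, 2010, 2011, 2012),
  company = c("Apple", "Apple", "Apple", "Google", "Google", "Google"),
  margin  = c(21.5, 23.9, 26.7, 29.0, 25.7, 21.4)
)

# Minus sign: negating a numeric column reverses its sort order
byMinus <- order(-companiesData$margin)

# decreasing = TRUE: same result here, and it also works on character columns
byDecreasing <- order(companiesData$margin, decreasing = TRUE)

# Two keys: fiscal year ascending, then largest margin first within each year
byYearThenMargin <- order(companiesData$fy, -companiesData$margin)
print(companiesData[byYearThenMargin, ])
```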

I find dplyr's arrange() to be much more readable. It uses the format arrange(mydata, col1, col2) to arrange a data frame first by col1 and then col2, or arrange(mydata, desc(col1), col2) if you want the first column to be in descending order. (Add desc() for any column that should be sorted in descending order.)

With dplyr, sorting companiesData by margin in descending order is as easy as:

companiesOrdered <- arrange(companiesData, desc(margin))

    fy   company revenue profit margin
8 2011 Microsoft   69943  23150   33.1
7 2010 Microsoft   62484  18760   30.0
4 2010    Google   29321   8505   29.0
3 2012     Apple  156508  41733   26.7
5 2011    Google   37905   9737   25.7
2 2011     Apple  108249  25922   23.9
9 2012 Microsoft   73723  16978   23.0
1 2010     Apple   65225  14013   21.5
6 2012    Google   50175  10737   21.4
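To illustrate the two-column form of arrange(), here is a small sketch that sorts by fiscal year and then by descending margin within each year. It assumes the dplyr package is installed, and it uses a cut-down stand-in for companiesData (Apple and Google rows only); companiesByYear is a name made up for this example.

```r
library(dplyr)

# Cut-down stand-in for companiesData (Apple and Google rows only)
companiesData <- data.frame(
  fy      = c(2010, 2011, 2012, 2010, 2011, 2012),
  company = c("Apple", "Apple", "Apple", "Google", "Google", "Google"),
  margin  = c(21.5, 23.9, 26.7, 29.0, 25.7, 21.4)
)

# First key: fy ascending. Second key: margin descending within each year
companiesByYear <- arrange(companiesData, fy, desc(margin))
print(companiesByYear)
```

Each additional column you pass to arrange() only breaks ties left by the columns before it, so the order of the arguments matters.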
