4 data wrangling tasks in R for advanced beginners

Learn how to add columns, get summaries, sort your results and reshape your data.

Reshaping: Wide to long (and back)

Different analysis tools in R, including some graphing packages, require data in specific formats. One of the most common and important tasks in R data manipulation is switching between "wide" and "long" formats in order to use a desired analysis or graphics function. For example, it is usually easier to visualize data with the popular ggplot2 graphing package if the data is in long or "tidy" format, with one measurement per row. Wide means that you've got multiple measurement columns across each row, like we've got here:

    fy   company revenue profit margin
1 2010     Apple   65225  14013   21.5
2 2011     Apple  108249  25922   23.9
3 2012     Apple  156508  41733   26.7
4 2010    Google   29321   8505   29.0
5 2011    Google   37905   9737   25.7
6 2012    Google   50175  10737   21.4
7 2010 Microsoft   62484  18760   30.0
8 2011 Microsoft   69943  23150   33.1
9 2012 Microsoft   73723  16978   23.0

Each row includes a column for revenue, for profit and, after some calculations above, profit margin.
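If you want to follow along, here is a minimal sketch that rebuilds the wide table above in base R. The data frame name companies is my own placeholder, and the margin calculation (profit as a percentage of revenue, rounded to one decimal place) is an assumption that happens to reproduce the figures shown; the earlier steps in this article may have done it differently.

```r
# Hypothetical name "companies"; the figures are the ones in the wide table above
companies <- data.frame(
  fy      = rep(2010:2012, times = 3),
  company = rep(c("Apple", "Google", "Microsoft"), each = 3),
  revenue = c(65225, 108249, 156508, 29321, 37905, 50175, 62484, 69943, 73723),
  profit  = c(14013, 25922, 41733, 8505, 9737, 10737, 18760, 23150, 16978),
  stringsAsFactors = FALSE
)

# Assumed formula: profit margin as a percentage of revenue, one decimal place
companies$margin <- round(companies$profit / companies$revenue * 100, 1)

print(companies)
```

Each of the nine rows carries three measurements (revenue, profit and margin), which is what makes this layout wide.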

Long means that there's only one measurement per row, although there can be several categories of measurement stacked on top of one another, as you see below:

     fy   company variable    value
1  2010     Apple  revenue  65225.0
2  2011     Apple  revenue 108249.0
3  2012     Apple  revenue 156508.0
4  2010    Google  revenue  29321.0
5  2011    Google  revenue  37905.0
6  2012    Google  revenue  50175.0
7  2010 Microsoft  revenue  62484.0
8  2011 Microsoft  revenue  69943.0
9  2012 Microsoft  revenue  73723.0
10 2010     Apple   profit  14013.0
11 2011     Apple   profit  25922.0
12 2012     Apple   profit  41733.0
13 2010    Google   profit   8505.0
14 2011    Google   profit   9737.0
15 2012    Google   profit  10737.0
16 2010 Microsoft   profit  18760.0
17 2011 Microsoft   profit  23150.0
18 2012 Microsoft   profit  16978.0
19 2010     Apple   margin     21.5
20 2011     Apple   margin     23.9
21 2012     Apple   margin     26.7
22 2010    Google   margin     29.0
23 2011    Google   margin     25.7
24 2012    Google   margin     21.4
25 2010 Microsoft   margin     30.0
26 2011 Microsoft   margin     33.1
27 2012 Microsoft   margin     23.0
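To make the relationship between the two tables concrete, here is a rough base R sketch that goes from wide to long and back again by hand. It is not necessarily how you would do it in practice: packages such as reshape2 (melt() and dcast()) and tidyr (pivot_longer() and pivot_wider()) offer dedicated functions for this. The names companies, measure_cols, companies_long and companies_wide are placeholders of my own, and the margin formula is an assumption that reproduces the figures shown.

```r
# Rebuild the wide data (hypothetical name "companies")
companies <- data.frame(
  fy      = rep(2010:2012, times = 3),
  company = rep(c("Apple", "Google", "Microsoft"), each = 3),
  revenue = c(65225, 108249, 156508, 29321, 37905, 50175, 62484, 69943, 73723),
  profit  = c(14013, 25922, 41733, 8505, 9737, 10737, 18760, 23150, 16978),
  stringsAsFactors = FALSE
)
companies$margin <- round(companies$profit / companies$revenue * 100, 1)

# Columns that hold measurements; fy and company are the categories
measure_cols <- c("revenue", "profit", "margin")

# Wide to long: make one small data frame per measurement column, then stack them
companies_long <- do.call(rbind, lapply(measure_cols, function(col) {
  data.frame(
    fy       = companies$fy,
    company  = companies$company,
    variable = col,               # which measurement this row holds
    value    = companies[[col]],  # the measurement itself
    stringsAsFactors = FALSE
  )
}))
rownames(companies_long) <- NULL
print(companies_long)

# Long to wide: split by measurement, rename "value", then merge on the categories
by_variable <- split(companies_long, companies_long$variable)
pieces <- lapply(names(by_variable), function(v) {
  piece <- by_variable[[v]][c("fy", "company", "value")]
  names(piece)[3] <- v
  piece
})
companies_wide <- Reduce(
  function(a, b) merge(a, b, by = c("fy", "company")),
  pieces
)
print(companies_wide)
```

The 9 wide rows with 3 measurement columns become 27 long rows with a single value column. Going back, the measurement columns come out in alphabetical order and the rows are sorted by year and company, but the numbers are the same.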

Please trust me on this (I discovered it the hard way): Once you thoroughly understand the concept of wide to long, actually doing it in R becomes much easier.

If you find it confusing to figure out what's a category and what's a measurement, here's some advice: Don't pay too much attention to definitions that say long data frames should contain only one "value" in each row. Why? For people with experience programming in other languages, pretty much everything seems like a "value." If the year equals 2011 and the company equals Google, isn't 2011 your value for year and Google your value for company?

For data reshaping, though, the term "value" is being used a bit differently.

I like to think of a "long" data frame as having only one "measurement that would make sense to plot on its own" per row. In the case of these financial results, would it make sense to plot that the year changed from 2010 to 2011 to 2012? No, because the year is a category I set up in advance to decide what measurements I want to look at.

Even if I'd broken down the financial results by quarter -- and quarters 1, 2, 3 and 4 certainly look like numbers and thus "values" -- it wouldn't make sense to plot the quarter changing from 1 to 2 to 3 to 4 and back again as a "value" on its own. Quarter is a category -- perhaps a factor in R -- that you might want to group data by. However, it's not a measurement you would want to plot by itself.
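Here is a small sketch of that idea. The quarterly sales figures are invented purely for illustration (they are not part of the financial data above), and the names quarterly and sales_by_quarter are placeholders.

```r
# Hypothetical quarterly figures, made up for this example only
quarterly <- data.frame(
  quarter = c(1, 2, 3, 4, 1, 2, 3, 4),
  sales   = c(10, 12, 11, 15, 11, 13, 12, 17)
)

# Stored as plain numbers, quarter looks like a measurement
is.numeric(quarterly$quarter)

# Converting it to a factor marks it as a category to group by
quarterly$quarter <- factor(
  quarterly$quarter,
  levels = 1:4,
  labels = c("Q1", "Q2", "Q3", "Q4")
)

# Average sales per quarter: quarter defines the groups, sales is the measurement
sales_by_quarter <- tapply(quarterly$sales, quarterly$quarter, mean)
print(sales_by_quarter)
```

Once quarter is a factor, R treats it as a set of labels rather than a quantity, which matches how it is used here: to group the sales figures, not to be plotted by itself.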
