UCLA Academic Technology Services

R FAQ:
How can I "collapse" my data in R?

Users of Stata are likely familiar with the concept of collapsing data: reducing the number of observations in your data, keeping one observation for each value or combination of values from one or more variables and calculating some summary of the other variables at this level. For examples of this in Stata, see our Stata Learning Module on collapse.

This data management step can also be done in R using the summaryBy function in the doBy package. Let's consider the dataset hsb2, assumed here to be already loaded as a data frame. We can start by collapsing the data by prog and calculating the mean of socst. The default summary calculation is the mean, as indicated by the socst.mean column name in the output.

library(doBy)
summaryBy(socst ~ prog, data=hsb2)

  prog socst.mean
1    1   50.60000
2    2   56.69524
3    3   45.02000
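
For readers who do not have doBy installed, a similar one-variable collapse can be sketched with base R's aggregate function. This is an alternative technique, not what summaryBy does internally, and the small data frame toy below is hypothetical: its values are made up for illustration and are not taken from hsb2.

```r
# Hypothetical toy data standing in for hsb2 (values are made up).
toy <- data.frame(
  prog  = c(1, 1, 2, 2, 3, 3),
  socst = c(40, 50, 60, 70, 30, 50)
)

# One row per value of prog, holding the mean of socst within that group.
by_prog <- aggregate(socst ~ prog, data = toy, FUN = mean)
print(by_prog)
```

Unlike summaryBy, aggregate keeps the original column name (socst) rather than appending the function name (socst.mean).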

We can easily make our collapse more complex, creating one observation for each combination of prog, ses, and female, calculating both the mean and standard deviation of socst and math, and saving the result as a new object.

collapse1 <- summaryBy(socst + math ~ prog + ses + female, FUN=c(mean,sd), data=hsb2)
collapse1

   prog ses female socst.mean math.mean  socst.sd   math.sd
1     1   1      0   47.57143  46.71429  6.502747  8.118175
2     1   1      1   49.00000  48.22222  5.894913  7.546154
3     1   2      0   50.50000  51.10000  8.959787  9.267026
4     1   2      1   52.10000  51.10000  9.938142  6.026792
5     1   3      0   57.25000  54.00000 17.500000  4.320494
6     1   3      1   49.60000  50.40000 10.714476  7.700649
7     2   1      0   46.75000  50.00000 11.926860  6.683313
8     2   1      1   55.13333  55.00000 10.098562 10.281745
9     2   2      0   55.54545  58.13636  8.985318 10.086745
10    2   2      1   55.40909  54.50000  8.511388  7.255540
11    2   3      0   59.33333  57.42857  7.799573  7.743200
12    2   3      1   59.61905  59.42857  9.058014  8.152125
13    3   1      0   32.50000  46.75000  4.725816  5.251984
14    3   1      1   38.25000  42.25000  8.189715  3.918819
15    3   2      0   44.00000  48.20000 11.464230  9.540889
16    3   2      1   51.00000  46.06250  8.755950  7.540723
17    3   3      0   50.25000  42.25000  6.946222  3.947573
18    3   3      1   46.00000  55.66667 10.000000 10.503968
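
A multi-variable, multi-statistic collapse like the one above can also be approximated in base R. The sketch below is one possible approach, assuming a hypothetical data frame toy with made-up values and a hypothetical helper named stats. Because aggregate stores a vector-valued summary as a matrix column, the result is flattened with do.call(data.frame, ...) to get one ordinary column per statistic.

```r
# Hypothetical toy data standing in for hsb2 (values are made up).
toy <- data.frame(
  prog   = c(1, 1, 1, 1, 2, 2, 2, 2),
  female = c(0, 0, 1, 1, 0, 0, 1, 1),
  socst  = c(40, 50, 45, 55, 60, 70, 52, 58),
  math   = c(42, 48, 50, 60, 55, 65, 49, 51)
)

# Hypothetical helper: return both summaries as a named vector.
stats <- function(x) c(mean = mean(x), sd = sd(x))

# Collapse socst and math to one row per combination of prog and female.
collapsed <- aggregate(cbind(socst, math) ~ prog + female, data = toy, FUN = stats)

# Flatten the matrix columns into socst.mean, socst.sd, math.mean, math.sd.
collapsed <- do.call(data.frame, collapsed)
print(collapsed)
```

The flattened column names resemble those produced by summaryBy, although the column order and row order may differ.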

If we wish to summarize our data in a way that does not already exist as a function, we can write the function ourselves and then pass it to summaryBy. Below, we write a quick function halfmean, which returns half the mean of its argument, and then apply it to math and socst within each combination of prog, ses, and female.

halfmean <- function(x) return(mean(x)/2)
summaryBy(socst + math ~ prog + ses + female, FUN=halfmean, data=hsb2)


   prog ses female socst.halfmean math.halfmean
1     1   1      0       23.78571      23.35714
2     1   1      1       24.50000      24.11111
3     1   2      0       25.25000      25.55000
4     1   2      1       26.05000      25.55000
5     1   3      0       28.62500      27.00000
6     1   3      1       24.80000      25.20000
7     2   1      0       23.37500      25.00000
8     2   1      1       27.56667      27.50000
9     2   2      0       27.77273      29.06818
10    2   2      1       27.70455      27.25000
11    2   3      0       29.66667      28.71429
12    2   3      1       29.80952      29.71429
13    3   1      0       16.25000      23.37500
14    3   1      1       19.12500      21.12500
15    3   2      0       22.00000      24.10000
16    3   2      1       25.50000      23.03125
17    3   3      0       25.12500      21.12500
18    3   3      1       23.00000      27.83333
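
The same idea of passing a user-written function carries over to base R's aggregate. The following is a minimal sketch under the assumption of a hypothetical data frame toy with made-up values; it is an alternative to summaryBy rather than a description of how summaryBy works.

```r
# Hypothetical toy data standing in for hsb2 (values are made up).
toy <- data.frame(
  prog  = c(1, 1, 1, 1, 2, 2, 2, 2),
  socst = c(40, 50, 45, 55, 60, 70, 52, 58),
  math  = c(42, 48, 50, 60, 55, 65, 49, 51)
)

# User-written summary: half of the group mean.
halfmean <- function(x) mean(x) / 2

# Apply halfmean to socst and math within each value of prog.
halved <- aggregate(cbind(socst, math) ~ prog, data = toy, FUN = halfmean)
print(halved)
```

Any function that takes a numeric vector and returns a single number can be supplied as FUN in this way.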

These pages are Copyrighted (c) by UCLA Academic Technology Services

