
How can I subset a data set?


Subsetting is a very important component of data management and there are several ways that one can subset data in R. This page aims to give a fairly exhaustive list of the ways in which it is possible to subset a data set in R.

First we will create the data frame that will be used in all the examples. We will call this data frame x.df. It is composed of five variables (V1-V5) whose values are drawn from a normal distribution with a mean of 1 and a standard deviation of 1 (the second argument to rnorm is the mean), plus one variable (y) containing the integers from 1 to 5, with 1 repeated so that y has six values to match the six rows.

x <- matrix(rnorm(30, 1), ncol = 5)
y <- c(1, seq(5))

# combining x and y into one matrix
x <- cbind(x, y)

# converting x into a data frame called x.df
x.df <- data.frame(x)
x.df
          V1         V2          V3        V4          V5 y
1 -1.6862356  1.3950211  1.35898920 1.8492410  1.75368860 1
2  0.8610318 -0.5698281 -0.01984841 0.3570547 -0.93262483 1
3 -1.3736436  0.1280908  0.17866428 1.6930332  0.42080132 2
4  0.7557265  1.8622043 -0.29684582 1.0555782  0.09372863 3
5  0.6296957  1.7943359  2.16226397 0.1604166  0.37218504 4
6  0.4694073  1.3096533  1.90324318 1.9372227  1.43930020 5
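Because rnorm draws random numbers, running the code above again will produce values different from the ones printed on this page. A minimal sketch of one way to make the data frame reproducible is to call set.seed first. The seed value 123 is an arbitrary assumption for illustration, and it will not recreate the values shown above.

```r
# Fix the random seed so the same values are drawn on every run.
# The seed value 123 is an arbitrary choice, not the one used on this page.
set.seed(123)

x <- matrix(rnorm(30, 1), ncol = 5)
y <- c(1, seq(5))

# combine x and y, then convert to a data frame
x.df <- data.frame(cbind(x, y))

dim(x.df)    # 6 rows and 6 columns
names(x.df)  # "V1" "V2" "V3" "V4" "V5" "y"
```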

To verify which names are used for the variables in the data frame, we use the names function.

names(x.df)
[1] "V1" "V2" "V3" "V4" "V5" "y"

Subsetting rows using the subset function

The subset function with a logical statement will let you subset the data frame by observations. In the following example the x.sub data frame contains only the observations for which the values of the variable y are greater than 2.

x.sub <- subset(x.df, y > 2)
x.sub
         V1       V2         V3        V4         V5 y
4 0.7557265 1.862204 -0.2968458 1.0555782 0.09372863 3
5 0.6296957 1.794336  2.1622640 0.1604166 0.37218504 4
6 0.4694073 1.309653  1.9032432 1.9372227 1.43930020 5
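One detail worth knowing is that subset treats a missing value in the logical statement as FALSE, so rows where the condition cannot be evaluated are dropped. The sketch below uses a small hypothetical data frame d (not the x.df from this page) with one missing value in y to show this.

```r
# Hypothetical data frame with a missing value in y.
d <- data.frame(V1 = c(0.5, 1.2, -0.3, 0.8),
                y  = c(1, NA, 3, 4))

# The row where y is NA is dropped, because the condition y > 2
# evaluates to NA there and subset() treats NA as FALSE.
d.sub <- subset(d, y > 2)
d.sub
nrow(d.sub)  # 2
```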

Subsetting rows using multiple conditional statements

There is no limit to how many logical statements may be combined to achieve the desired subset. The data frame x.sub1 contains only the observations for which the values of the variable y are greater than 2 and the values of the variable V1 are greater than 0.6.

x.sub1 <- subset(x.df, y > 2 & V1 > 0.6)
x.sub1
         V1       V2         V3        V4         V5 y
4 0.7557265 1.862204 -0.2968458 1.0555782 0.09372863 3
5 0.6296957 1.794336  2.1622640 0.1604166 0.37218504 4
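Logical statements can also be combined with | (or) and negated with !. The sketch below rebuilds only the V1 and y columns, with values copied from the output printed above, so that it runs on its own. The names x.or and x.not are hypothetical and used only for illustration.

```r
# V1 and y copied from the x.df output printed on this page.
x.df <- data.frame(
  V1 = c(-1.6862356, 0.8610318, -1.3736436, 0.7557265, 0.6296957, 0.4694073),
  y  = c(1, 1, 2, 3, 4, 5)
)

# | keeps rows where at least one of the conditions holds.
x.or <- subset(x.df, y > 4 | V1 < 0)
rownames(x.or)   # "1" "3" "6"

# ! negates a condition: rows where y is NOT greater than 2.
x.not <- subset(x.df, !(y > 2))
rownames(x.not)  # "1" "2" "3"
```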

Subsetting both rows and columns

It is possible to subset both rows and columns using the subset function. The select argument lets you subset variables (columns). The data frame x.sub2 contains only the variables V1 and V4, and only the observations where the values of variable y are greater than 2 and the values of variable V2 are greater than 0.4.

x.sub2 <- subset(x.df, y > 2 & V2 > 0.4, select = c(V1, V4))
x.sub2
         V1        V4
4 0.7557265 1.0555782
5 0.6296957 0.1604166
6 0.4694073 1.9372227

The data frame x.sub3 contains only the variables V2-V5, and only the observations for which the values of variable y are greater than 3.

x.sub3 <- subset(x.df, y > 3, select = V2:V5)
x.sub3
        V2       V3        V4       V5
5 1.794336 2.162264 0.1604166 0.372185
6 1.309653 1.903243 1.9372227 1.439300
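The select argument also accepts a minus sign to drop variables rather than keep them. The sketch below rebuilds the V1, V2 and y columns from the output printed above so that it runs on its own. The names x.drop and x.drop2 are hypothetical.

```r
# V1, V2 and y copied from the x.df output printed on this page.
x.df <- data.frame(
  V1 = c(-1.6862356, 0.8610318, -1.3736436, 0.7557265, 0.6296957, 0.4694073),
  V2 = c(1.3950211, -0.5698281, 0.1280908, 1.8622043, 1.7943359, 1.3096533),
  y  = c(1, 1, 2, 3, 4, 5)
)

# Keep rows with y > 3 and every column except V1.
x.drop <- subset(x.df, y > 3, select = -V1)
names(x.drop)   # "V2" "y"

# Several columns can be dropped at once with -c(...).
x.drop2 <- subset(x.df, y > 3, select = -c(V1, V2))
names(x.drop2)  # "y"
```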

Subsetting rows using indices

Another method for subsetting data sets is bracket notation, which designates the indices of the data set. The first index is for the rows and the second for the columns. The x.sub4 data frame contains only the observations for which the values of variable y are equal to 1. Note that leaving the index for the columns blank indicates that we want x.sub4 to contain all the variables (columns) of the original data frame.

x.sub4 <- x.df[x.df$y == 1, ]
x.sub4
          V1        V2          V3        V4         V5 y
1 -1.6862356  1.395021  1.35898920 1.8492410  1.7536886 1
2  0.8610318 -0.569828 -0.01984841 0.3570547 -0.9326248 1
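Unlike subset, bracket notation does not drop rows where the condition is missing: an NA in the row index produces a row filled with NA. Wrapping the condition in which is one common way to avoid this. The sketch below uses a small hypothetical data frame d (not x.df) to show the difference.

```r
# Hypothetical data frame with a missing value in y.
d <- data.frame(V1 = c(0.5, 1.2, -0.3),
                y  = c(1, NA, 1))

# The NA in the logical index yields an extra row of NAs.
d.na <- d[d$y == 1, ]
nrow(d.na)  # 3

# which() returns only the positions where the condition is TRUE.
d.ok <- d[which(d$y == 1), ]
nrow(d.ok)  # 2
```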

Subsetting rows selecting on more than one value

We use the %in% operator when we want to subset on multiple values of y. The x.sub5 data frame contains only the observations for which the values of variable y are equal to either 1 or 4.

x.sub5 <- x.df[x.df$y %in% c(1, 4), ]
x.sub5
          V1        V2          V3        V4         V5 y
1 -1.6862356  1.395021  1.35898920 1.8492410  1.7536886 1
2  0.8610318 -0.569828 -0.01984841 0.3570547 -0.9326248 1
5  0.6296957  1.794336  2.16226397 0.1604166  0.3721850 4
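To keep every observation except those matching a set of values, the %in% test can be negated with !. The sketch below rebuilds the V1 and y columns from the output printed above. The name x.not.in is hypothetical.

```r
# V1 and y copied from the x.df output printed on this page.
x.df <- data.frame(
  V1 = c(-1.6862356, 0.8610318, -1.3736436, 0.7557265, 0.6296957, 0.4694073),
  y  = c(1, 1, 2, 3, 4, 5)
)

# Keep the rows where y is neither 1 nor 4.
x.not.in <- x.df[!(x.df$y %in% c(1, 4)), ]
x.not.in$y  # 2 3 5
```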

Subsetting columns using indices

We can also use the indices to subset the variables (columns) of the data set. The x.sub6 data frame contains only the first two variables of the x.df data frame. Note that leaving the index for the rows blank indicates that we want x.sub6 to contain all the rows of the original data frame.

x.sub6 <- x.df[, 1:2]
x.sub6
          V1         V2
1 -1.6862356  1.3950211
2  0.8610318 -0.5698281
3 -1.3736436  0.1280908
4  0.7557265  1.8622043
5  0.6296957  1.7943359
6  0.4694073  1.3096533

The x.sub7 data frame contains all the rows but only the 1st, 3rd and 5th variables (columns) of the x.df data set.

x.sub7 <- x.df[, c(1, 3, 5)]
x.sub7
          V1          V3          V5
1 -1.6862356  1.35898920  1.75368860
2  0.8610318 -0.01984841 -0.93262483
3 -1.3736436  0.17866428  0.42080132
4  0.7557265 -0.29684582  0.09372863
5  0.6296957  2.16226397  0.37218504
6  0.4694073  1.90324318  1.43930020
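Columns can also be selected by name instead of by position, which keeps working if the column order changes. Note that selecting a single column with bracket notation returns a plain vector unless drop = FALSE is given. The sketch below rebuilds the V1, V3 and y columns from the output printed above. The names x.named, v1.vector and v1.frame are hypothetical.

```r
# V1, V3 and y copied from the x.df output printed on this page.
x.df <- data.frame(
  V1 = c(-1.6862356, 0.8610318, -1.3736436, 0.7557265, 0.6296957, 0.4694073),
  V3 = c(1.35898920, -0.01984841, 0.17866428, -0.29684582, 2.16226397, 1.90324318),
  y  = c(1, 1, 2, 3, 4, 5)
)

# Select columns by name.
x.named <- x.df[, c("V1", "V3")]
names(x.named)    # "V1" "V3"

# A single column comes back as a vector by default...
v1.vector <- x.df[, "V1"]
class(v1.vector)  # "numeric"

# ...and as a one-column data frame with drop = FALSE.
v1.frame <- x.df[, "V1", drop = FALSE]
class(v1.frame)   # "data.frame"
```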

Subsetting both rows and columns using indices

The x.sub8 data frame contains the 3rd-6th variables of x.df and only observations number 1 and 3.

x.sub8 <- x.df[c(1, 3), 3:6]
x.sub8
         V3       V4        V5 y
1 1.3589892 1.849241 1.7536886 1
3 0.1786643 1.693033 0.4208013 2
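The row and column indices do not have to be numeric positions. As a minimal sketch, a logical condition can be used for the rows together with column names for the columns. The V3 and y values are copied from the output printed above, and the name x.mixed is hypothetical.

```r
# V3 and y copied from the x.df output printed on this page.
x.df <- data.frame(
  V3 = c(1.35898920, -0.01984841, 0.17866428, -0.29684582, 2.16226397, 1.90324318),
  y  = c(1, 1, 2, 3, 4, 5)
)

# Rows where y is greater than 2, columns V3 and y selected by name.
x.mixed <- x.df[x.df$y > 2, c("V3", "y")]
rownames(x.mixed)  # "4" "5" "6"
```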
