UCLA Academic Technology Services HomeServicesClassesContactJobs

R Class Notes
Managing Data


1.0 R functions used in this unit

mean calculates the mean
names lists the variable names of a data frame
table creates a frequency table
rbind combines rows of data
merge match merges two data frames
dimnames lists or assigns names of data frames
cbind combines columns of data

2.0 Keeping and Dropping Variables

Read in the hs1 data via the Internet using the read.table function.

hs1 <- read.table("http://www.ats.ucla.edu/stat/R/notes/hs1.csv", header=T, sep=",")

attach(hs1)

Keeping only the observations where the reading score is 60 or higher.

hs1.read.well <- hs1[read >= 60, ]

Comparing means of read in the original hs1 data frame and the new smaller hs1.read.well data frame. To keep from getting confused we will use the convention of using the data name, dollar sign, variable name. For example, hs1$read is the read variable from the hs1 data.

mean(hs1.read.well$read)
[1] 65.48214

mean(hs1$read)
[1] 52.23

Keeping only the variables id, female, read and write from the hs1.read.well data frame. We first look at the variables names in the data frame using the names function and then we use the column indices corresponding to these four variables to indicate that we want only these variables in the new data frame called hs1.kept. We use the names function again to verify that hs1.kept consists of only the four variables that we wanted to keep.

names(hs1.read.well)
 [1] "female"  "id"      "race"    "ses"     "schtyp"  "prgtype" "read"   
 [8] "write"   "math"    "science" "socst"   "prog"
 
hs1.kept <- hs1.read.well[ , c(1, 2, 7, 8)]
names(hs1.kept)
[1] "female" "id"     "read"   "write" 

Dropping the variables ses and prog from the hs1.read.well data frame by using the column indices corresponding to these two variables with a negative sign.

names(hs1.read.well)
 [1] "female"  "id"      "race"    "ses"     "schtyp"  "prgtype" "read"   
 [8] "write"   "math"    "science" "socst"   "prog" 
 
hs1.drop <- hs1.read.well[ , -c(4, 12)]
names(hs1.drop)
 [1] "female"  "id"      "race"    "schtyp"  "prgtype" "read"    "write"  
 [8] "math"    "science" "socst" 

3.0 Master's Thesis Example--Stacking Files

We have been given a file with the information for the males (called hsmale) and a file with the information for the females (called hsfemale). We need to combine these two files by stacking them in order to be able to analyze them together. First read in the two files using the read.table function.

detach()
hsfemale <- read.table('http://www.ats.ucla.edu/stat/R/notes/hsfemale.txt', header=T, sep=",")
hsmale <- read.table('http://www.ats.ucla.edu/stat/R/notes/hsmale.txt', header=T, sep=",")

Just to make sure that each file contains only one gender let's look at the frequency table for female for each data frame.

table(hsfemale$female)
  1 
109 
table(hsmale$female)
 0 
91

We use the rbind function when we stack data because we are essentially combining rows of data. We check that the combined data contains the information for both genders by looking at the frequency table for female for the new combined data frame called hsmasters.

hsmasters <- rbind(hsfemale, hsmale)
table(hsmasters$female)
  0   1 
 91 109

4.0 Dissertation Example--Match Merging Files

Now we are working on a dissertation. We have again been given two data files. One file contains the demographic information (called hsdem) and the other contains the test scores (called hstest). We wish to merge these files together so that they match on the id variable found in both files. First read in the hsdem and hstest data using the read.table function.

detach()
hsdem <- read.table('http://www.ats.ucla.edu/stat/R/notes/hsdem.txt', header=T, sep=",")
hstest <- read.table('http://www.ats.ucla.edu/stat/R/notes/hstest.txt', header=T, sep=",")

Let's look at the two data frames.

hsdem
   id female prog
1  11      0    2
2  20      0    2
3  12      0    3
4  16      0    3
5   7      0    2
6  15      0    3
7   9      0    3
8  18      0    3
9   5      0    2
10 14      0    2
11  3      0    2
12  8      1    2
13  1      1    3
14  4      1    2
15  2      1    3
16 19      1    1
17 17      1    2
18  6      1    2
19 13      1    3
hstest
   id read write
1  11   34    46
2  20   60    52
3  12   37    44
4  16   47    31
5   7   57    54
6  15   39    39
7   9   48    49
8  18   50    33
9   5   47    40
10 14   47    41
11  3   63    65
12  8   39    44
13  1   34    44
14  2   39    41
15 19   28    46
16  6   47    41
17 10   47    54
18 13   47    46

The merge function allows us to merge two data frames on a variable (or a list of variables).  In this case the variable in common is id which has the same name in both data sets. Specifying T in the all argument indicates that we want to keep all the observations from each data set rather than only keeping the observations that came from both data sets.

hsdiss <- merge(hstest, hsdem, by="id", all=T)
hsdiss
   id read write female prog
1   1   34    44      1    3
2   2   39    41      1    3
3   3   63    65      0    2
4   4   NA    NA      1    2
5   5   47    40      0    2
6   6   47    41      1    2
7   7   57    54      0    2
8   8   39    44      1    2
9   9   48    49      0    3
10 10   47    54     NA   NA
11 11   34    46      0    2
12 12   37    44      0    3
13 13   47    46      1    3
14 14   47    41      0    2
15 15   39    39      0    3
16 16   47    31      0    3
17 17   NA    NA      1    2
18 18   50    33      0    3
19 19   28    46      1    1
20 20   60    52      0    2

If the variable that we were merging on had different names in each data frame then we could use the by.x and by.y arguments. In the by.x argument we would list the name of the variable(s) that was in the data frame listed first in the merge function (in this case in hstest) and in the by.y argument we would name the variable(s) that was in the data frame listed second (in this case hsdem).

hsdiss.1 <- merge(hstest, hsdem, by.x="id", by.y="id", all=T)
hsdiss.1

Creating an indicator of which data set the observations came from is a little more complicated. We would first create an indicator variable called from, in each data frame to be merged. Then we merge the two data sets. Finally, we create a variable both which would indicate which data frame or both the observation came from. It is generally easier to note that when a data frame did not contribute to the observation in the combined data frame then the variables from that data frame will have missing values (NA's) for that observation.

from <- data.frame(rep(1, length(hsdem$id)))
dimnames(from)[[2]] <- "from"
hsdem.1 <- cbind(hsdem, from)
hsdem.1
   id female prog from
1  11      0    2    1
2  20      0    2    1
3  12      0    3    1
4  16      0    3    1
5   7      0    2    1
6  15      0    3    1
7   9      0    3    1
8  18      0    3    1
9   5      0    2    1
10 14      0    2    1
11  3      0    2    1
12  8      1    2    1
13  1      1    3    1
14  4      1    2    1
15  2      1    3    1
16 19      1    1    1
17 17      1    2    1
18  6      1    2    1
19 13      1    3    1

from <- data.frame(rep(1, length(hstest$id)))
dimnames(from)[[2]] <- "from"
hstest.1 <- cbind(hstest, from)
hstest.1
   id read write from
1  11   34    46    1
2  20   60    52    1
3  12   37    44    1
4  16   47    31    1
5   7   57    54    1
6  15   39    39    1
7   9   48    49    1
8  18   50    33    1
9   5   47    40    1
10 14   47    41    1
11  3   63    65    1
12  8   39    44    1
13  1   34    44    1
14  2   39    41    1
15 19   28    46    1
16  6   47    41    1
17 10   47    54    1
18 13   47    46    1

hsdiss.2 <- merge(hstest.1, hsdem.1, by.x="id", by.y="id", all=T, suffix=c("test", "dem"))

attach(hsdiss.2)

hsdiss.2$both[!is.na(fromtest) & !is.na(fromdem)] <- "both"
hsdiss.2$both[is.na(fromtest)] <- "dem" 
hsdiss.2$both[is.na(fromdem)] <- "test"
hsdiss.2
   id read write fromtest female prog fromdem both
1   1   34    44        1      1    3       1 both
2   2   39    41        1      1    3       1 both
3   3   63    65        1      0    2       1 both
4   4   NA    NA       NA      1    2       1  dem
5   5   47    40        1      0    2       1 both
6   6   47    41        1      1    2       1 both
7   7   57    54        1      0    2       1 both
8   8   39    44        1      1    2       1 both
9   9   48    49        1      0    3       1 both
10 10   47    54        1     NA   NA      NA test
11 11   34    46        1      0    2       1 both
12 12   37    44        1      0    3       1 both
13 13   47    46        1      1    3       1 both
14 14   47    41        1      0    2       1 both
15 15   39    39        1      0    3       1 both
16 16   47    31        1      0    3       1 both
17 17   NA    NA       NA      1    2       1  dem
18 18   50    33        1      0    3       1 both
19 19   28    46        1      1    1       1 both
20 20   60    52        1      0    2       1 both

Unless you are going to continue working with the hsdiss.2 data frame it is generally a good idea to detach all attached data frames.

detach()

6.0 For More Information


How to cite this page

Report an error on this page

UCLA Researchers are invited to our Statistical Consulting Services
We recommend others to our list of Other Resources for Statistical Computing Help
These pages are Copyrighted (c) by UCLA Academic Technology Services


The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.