|
|
|
||||
|
|
|||||
| mean | calculates the mean |
| names | lists the variable names of a data frame |
| table | creates a frequency table |
| rbind | combines rows of data |
| merge | match merges two data frames |
| dimnames | lists or assigns names of data frames |
| cbind | combines columns of data |
Read in the hs1 data via the Internet using the read.table function.
hs1 <- read.table("http://www.ats.ucla.edu/stat/R/notes/hs1.csv", header=T, sep=",")
attach(hs1)
Keeping only the observations where the reading score is 60 or higher.
hs1.read.well <- hs1[read >= 60, ]
Comparing means of read in the original hs1 data frame and the new smaller hs1.read.well data frame. To keep from getting confused we will use the convention of using the data name, dollar sign, variable name. For example, hs1$read is the read variable from the hs1 data.
mean(hs1.read.well$read) [1] 65.48214 mean(hs1$read) [1] 52.23
Keeping only the variables id, female, read and write from the hs1.read.well data frame. We first look at the variables names in the data frame using the names function and then we use the column indices corresponding to these four variables to indicate that we want only these variables in the new data frame called hs1.kept. We use the names function again to verify that hs1.kept consists of only the four variables that we wanted to keep.
names(hs1.read.well) [1] "female" "id" "race" "ses" "schtyp" "prgtype" "read" [8] "write" "math" "science" "socst" "prog" hs1.kept <- hs1.read.well[ , c(1, 2, 7, 8)] names(hs1.kept) [1] "female" "id" "read" "write"
Dropping the variables ses and prog from the hs1.read.well data frame by using the column indices corresponding to these two variables with a negative sign.
names(hs1.read.well) [1] "female" "id" "race" "ses" "schtyp" "prgtype" "read" [8] "write" "math" "science" "socst" "prog" hs1.drop <- hs1.read.well[ , -c(4, 12)] names(hs1.drop) [1] "female" "id" "race" "schtyp" "prgtype" "read" "write" [8] "math" "science" "socst"
We have been given a file with the information for the males (called hsmale) and a file with the information for the females (called hsfemale). We need to combine these two files by stacking them in order to be able to analyze them together. First read in the two files using the read.table function.
detach()
hsfemale <- read.table('http://www.ats.ucla.edu/stat/R/notes/hsfemale.txt', header=T, sep=",")
hsmale <- read.table('http://www.ats.ucla.edu/stat/R/notes/hsmale.txt', header=T, sep=",")
Just to make sure that each file contains only one gender let's look at the frequency table for female for each data frame.
table(hsfemale$female) 1 109 table(hsmale$female) 0 91
We use the rbind function when we stack data because we are essentially combining rows of data. We check that the combined data contains the information for both genders by looking at the frequency table for female for the new combined data frame called hsmasters.
hsmasters <- rbind(hsfemale, hsmale) table(hsmasters$female) 0 1 91 109
Now we are working on a dissertation. We have again been given two data files. One file contains the demographic information (called hsdem) and the other contains the test scores (called hstest). We wish to merge these files together so that they match on the id variable found in both files. First read in the hsdem and hstest data using the read.table function.
detach()
hsdem <- read.table('http://www.ats.ucla.edu/stat/R/notes/hsdem.txt', header=T, sep=",")
hstest <- read.table('http://www.ats.ucla.edu/stat/R/notes/hstest.txt', header=T, sep=",")
Let's look at the two data frames.
hsdem id female prog 1 11 0 2 2 20 0 2 3 12 0 3 4 16 0 3 5 7 0 2 6 15 0 3 7 9 0 3 8 18 0 3 9 5 0 2 10 14 0 2 11 3 0 2 12 8 1 2 13 1 1 3 14 4 1 2 15 2 1 3 16 19 1 1 17 17 1 2 18 6 1 2 19 13 1 3 hstest id read write 1 11 34 46 2 20 60 52 3 12 37 44 4 16 47 31 5 7 57 54 6 15 39 39 7 9 48 49 8 18 50 33 9 5 47 40 10 14 47 41 11 3 63 65 12 8 39 44 13 1 34 44 14 2 39 41 15 19 28 46 16 6 47 41 17 10 47 54 18 13 47 46
The merge function allows us to merge two data frames on a variable (or a list of variables). In this case the variable in common is id which has the same name in both data sets. Specifying T in the all argument indicates that we want to keep all the observations from each data set rather than only keeping the observations that came from both data sets.
hsdiss <- merge(hstest, hsdem, by="id", all=T) hsdiss id read write female prog 1 1 34 44 1 3 2 2 39 41 1 3 3 3 63 65 0 2 4 4 NA NA 1 2 5 5 47 40 0 2 6 6 47 41 1 2 7 7 57 54 0 2 8 8 39 44 1 2 9 9 48 49 0 3 10 10 47 54 NA NA 11 11 34 46 0 2 12 12 37 44 0 3 13 13 47 46 1 3 14 14 47 41 0 2 15 15 39 39 0 3 16 16 47 31 0 3 17 17 NA NA 1 2 18 18 50 33 0 3 19 19 28 46 1 1 20 20 60 52 0 2
If the variable that we were merging on had different names in each data frame then we could use the by.x and by.y arguments. In the by.x argument we would list the name of the variable(s) that was in the data frame listed first in the merge function (in this case in hstest) and in the by.y argument we would name the variable(s) that was in the data frame listed second (in this case hsdem).
hsdiss.1 <- merge(hstest, hsdem, by.x="id", by.y="id", all=T) hsdiss.1
Creating an indicator of which data set the observations came from is a little more complicated. We would first create an indicator variable called from, in each data frame to be merged. Then we merge the two data sets. Finally, we create a variable both which would indicate which data frame or both the observation came from. It is generally easier to note that when a data frame did not contribute to the observation in the combined data frame then the variables from that data frame will have missing values (NA's) for that observation.
from <- data.frame(rep(1, length(hsdem$id)))
dimnames(from)[[2]] <- "from"
hsdem.1 <- cbind(hsdem, from)
hsdem.1
id female prog from
1 11 0 2 1
2 20 0 2 1
3 12 0 3 1
4 16 0 3 1
5 7 0 2 1
6 15 0 3 1
7 9 0 3 1
8 18 0 3 1
9 5 0 2 1
10 14 0 2 1
11 3 0 2 1
12 8 1 2 1
13 1 1 3 1
14 4 1 2 1
15 2 1 3 1
16 19 1 1 1
17 17 1 2 1
18 6 1 2 1
19 13 1 3 1
from <- data.frame(rep(1, length(hstest$id)))
dimnames(from)[[2]] <- "from"
hstest.1 <- cbind(hstest, from)
hstest.1
id read write from
1 11 34 46 1
2 20 60 52 1
3 12 37 44 1
4 16 47 31 1
5 7 57 54 1
6 15 39 39 1
7 9 48 49 1
8 18 50 33 1
9 5 47 40 1
10 14 47 41 1
11 3 63 65 1
12 8 39 44 1
13 1 34 44 1
14 2 39 41 1
15 19 28 46 1
16 6 47 41 1
17 10 47 54 1
18 13 47 46 1
hsdiss.2 <- merge(hstest.1, hsdem.1, by.x="id", by.y="id", all=T, suffix=c("test", "dem"))
attach(hsdiss.2)
hsdiss.2$both[!is.na(fromtest) & !is.na(fromdem)] <- "both"
hsdiss.2$both[is.na(fromtest)] <- "dem"
hsdiss.2$both[is.na(fromdem)] <- "test"
hsdiss.2
id read write fromtest female prog fromdem both
1 1 34 44 1 1 3 1 both
2 2 39 41 1 1 3 1 both
3 3 63 65 1 0 2 1 both
4 4 NA NA NA 1 2 1 dem
5 5 47 40 1 0 2 1 both
6 6 47 41 1 1 2 1 both
7 7 57 54 1 0 2 1 both
8 8 39 44 1 1 2 1 both
9 9 48 49 1 0 3 1 both
10 10 47 54 1 NA NA NA test
11 11 34 46 1 0 2 1 both
12 12 37 44 1 0 3 1 both
13 13 47 46 1 1 3 1 both
14 14 47 41 1 0 2 1 both
15 15 39 39 1 0 3 1 both
16 16 47 31 1 0 3 1 both
17 17 NA NA NA 1 2 1 dem
18 18 50 33 1 0 3 1 both
19 19 28 46 1 1 1 1 both
20 20 60 52 1 0 2 1 both
Unless you are going to continue working with the hsdiss.2 data frame it is generally a good idea to detach all attached data frames.
detach()
UCLA Researchers are invited to our Statistical Consulting Services
We recommend others to our list of Other Resources for Statistical Computing Help
These pages are Copyrighted (c) by UCLA Academic Technology Services
The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.