| sapply | applies a function to elements of a list |
| factor | creates a categorical variable with value labels if desired |
| table | creates frequency table |
| bwplot | trellis boxplots |
| anova | generic function used here to extract the anova table from an lm object |
| lm | fits a linear model |
Read in the hs0 data over the Internet using the read.table function.
Note: If you have done the previous sections then the hs0 data may already be available to you.
hs0 <- read.table("http://www.ats.ucla.edu/stat/R/notes/hs0.csv", header=TRUE, sep=",")
attach(hs0)
We use the sapply function with the is.factor function to check if any of the variables in the hs0 data frame are factor variables.
sapply(hs0, is.factor)
 gender      id    race     ses  schtyp prgtype    read   write    math
  FALSE   FALSE   FALSE   FALSE   FALSE    TRUE   FALSE   FALSE   FALSE
science   socst
  FALSE   FALSE
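The same check can be tried without downloading anything. The sketch below uses a small made-up data frame (the name df and its columns are hypothetical, not taken from hs0). It sets stringsAsFactors explicitly because the default for character columns changed in R 4.0.0, so a character column such as prgtype may or may not be read as a factor depending on the R version.

```r
# Hypothetical stand-in for hs0; the columns are invented for illustration.
df <- data.frame(
  id      = 1:4,
  prgtype = c("academic", "general", "vocati", "academic"),
  read    = c(57, 68, 44, 63),
  stringsAsFactors = TRUE   # force character columns to become factors
)

# sapply() calls is.factor() on each column and returns a named logical vector.
is_fac <- sapply(df, is.factor)
print(is_fac)

# Names of the columns that are factors.
names(is_fac)[is_fac]
```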
Creating a factor (categorical) variable called schtyp.f for schtyp with value labels.
Note: Whenever we modify a data set that is attached, we must detach it and attach it again before the newly created variables become available. It is therefore generally a good idea to do all data management, manipulation, and recoding in one step, and only then detach and reattach the data. In this section we demonstrate the effect of each manipulation separately, so we detach and attach several times.
The risk in detaching and attaching repeatedly is detaching the wrong object. By default, the detach function removes the object in the second position of the search path. Packages loaded during data manipulation are also placed on the search path, so you could end up detaching a package rather than the data set you just finished working on. If you do data management in several stages, we strongly recommend calling the search function first to verify that the object you are about to detach is the intended one. If your data set is not second in the search path, you can still detach it by using the pos argument of the detach function.
schtyp.f <- factor(schtyp, levels=c(1, 2), labels=c("public", "private"))
# to check that hs0 is the second object in the search path
search()
[1] ".GlobalEnv" "hs0" "package:methods"
[4] "package:stats" "package:graphics" "package:grDevices"
[7] "package:utils" "package:datasets" "Autoloads"
[10] "package:base"
detach()
attach(hs0)
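The caution about the search path can be illustrated with a short sketch. It uses a hypothetical data frame df in place of hs0 and loads the tools package only as a convenient example of a package that is not normally attached at startup. The point is to look up the position of the data frame instead of assuming it is 2.

```r
# Hypothetical stand-in for hs0.
df <- data.frame(schtyp = c(1, 2, 1, 1))
attach(df)

# Loading a package after attach() places it ahead of df on the search path.
library(tools)
print(search())

# Look up the position of df rather than assuming it is 2.
pos <- match("df", search())
detach(pos = pos)

# df is gone from the search path; the package is still attached.
"df" %in% search()
"package:tools" %in% search()
```

Naming the object, as in detach(df), is another way to avoid relying on its position.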
Checking the factor variable schtyp.f in a frequency table.
table(schtyp.f)
schtyp.f
public private
168 32
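As a small illustration of how levels and labels pair up, the sketch below builds a factor from a made-up vector of codes (schtyp_codes and schtyp_lab are hypothetical names). One code is deliberately left out of levels to show what happens to values that have no label.

```r
# Hypothetical codes; 3 is not listed in levels on purpose.
schtyp_codes <- c(1, 2, 1, 3)

# levels gives the codes to recognise; labels gives their names, in the same order.
schtyp_lab <- factor(schtyp_codes, levels = c(1, 2), labels = c("public", "private"))
print(schtyp_lab)

# Codes not listed in levels become NA and are left out of table() by default.
table(schtyp_lab)

# useNA = "ifany" makes the unmatched code visible.
table(schtyp_lab, useNA = "ifany")
```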
Creating a factor variable called female from gender with value labels.
female <- factor(gender, levels=c(0, 1), labels=c("male", "female"))
detach()
attach(hs0)
Checking the factor variable female in a frequency table.
table(female)
female
male female
91 109
Recoding race=5 to be NA (to be missing).
table(race)
  1   2   3   4   5
 24  11  20 143   2
race[race==5] <- NA
detach()
attach(hs0)
table(race)
race
  1   2   3   4
 24  11  20 143
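A minimal sketch of the same recode applied directly to a column of a data frame, on the assumption that the recoded values should be stored in the data frame itself. The data frame df and its values are made up for illustration.

```r
# Hypothetical data; 5 is the code to be treated as missing.
df <- data.frame(race = c(1, 4, 5, 4, 2, 5))

# Replace every 5 with NA in place.
df$race[df$race == 5] <- NA

# table() drops NA by default; useNA = "ifany" shows how many values were recoded.
table(df$race, useNA = "ifany")
sum(is.na(df$race))
```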
Creating a variable called total = read + write + socst.
total <- read+write+socst
detach()
attach(hs0)
mean(total)
[1] 157.41
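A plain assignment such as total <- read+write+socst creates an object in the workspace. If the new variable should instead be stored as a column of the data frame, one option is the base R function transform, sketched here on a hypothetical data frame df with made-up scores. No attach or detach is needed for this approach.

```r
# Hypothetical scores standing in for the hs0 columns.
df <- data.frame(read = c(50, 60), write = c(55, 65), socst = c(40, 70))

# transform() evaluates the expression using the columns of df
# and returns a copy of df with the new column added.
df <- transform(df, total = read + write + socst)

print(df)
mean(df$total)
```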
Creating a variable called grade based on total.
grade <- 0
grade[total >= 80 & total < 110] <- 1
grade[total >= 110 & total < 140] <- 2
grade[total >= 140 & total < 170] <- 3
grade[total >= 170] <- 4
detach()
attach(hs0)
table(grade)
grade
 1  2  3  4
 6 48 76 70
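The same banding can also be expressed with the base R function cut, which is a different technique from the step-by-step assignments above. The sketch below uses made-up totals and hypothetical names (total_demo, grade_demo). It assumes the intended intervals are closed on the left, matching the >= and < comparisons, and that totals below 80 correspond to grade 0.

```r
# Hypothetical totals chosen to sit on and around the cutoffs.
total_demo <- c(79, 80, 109.9, 110, 140, 170, 207)

# right = FALSE makes each interval [lower, upper), matching ">= lower & < upper".
grade_demo <- cut(
  total_demo,
  breaks = c(-Inf, 80, 110, 140, 170, Inf),
  labels = c("F", "D", "C", "B", "A"),
  right  = FALSE
)

print(data.frame(total_demo, grade_demo))
table(grade_demo)
```

Because cut returns a factor with the labels already attached, the numeric codes and the letter labels are created in a single step.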
Creating a factor variable called grade.f based on grade.
grade.f <- factor(grade, levels=0:4, labels=c("F", "D", "C", "B", "A"))
detach()
attach(hs0)
is.factor(grade.f)
[1] TRUE
table(grade.f)
grade.f
F D C B A
0 6 48 76 70
Labels can be very useful in graphs.
Note: We need to load the lattice package in order to be able to use the trellis graph functions.
library(lattice)
# boxplot, no labels
bwplot(grade ~ write, hs0)
# boxplot with labels
bwplot(grade.f ~ write, hs0)
The summary of total shows why there are no F's: the minimum total is 95, and only totals below 80 would be left at grade 0 (F).
summary(total)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   95.0   139.0   159.0   157.4   177.0   207.0
Labels are also nice when looking at frequency tables.
# without labels
table(schtyp, gender)
gender
schtyp 0 1
1 77 91
2 14 18
# with labels
table(schtyp.f, female)
female
schtyp.f male female
public 77 91
private 14 18
Factor variables can also be used in statistical models.
# The variable grade is treated as a continuous variable, hence 1 degree of freedom.
anova(lm(write~gender+grade, hs0))
Analysis of Variance Table
Response: write
Df Sum Sq Mean Sq F value Pr(>F)
gender 1 1176.2 1176.2 41.089 1.052e-09 ***
grade 1 11063.3 11063.3 386.475 < 2.2e-16 ***
Residuals 197 5639.4 28.6
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
# The variable grade is treated as a categorical variable, thus 3 degrees of freedom.
anova(lm(write~female+grade.f, hs0))
Analysis of Variance Table
Response: write
Df Sum Sq Mean Sq F value Pr(>F)
female 1 1176.2 1176.2 40.754 1.234e-09 ***
grade.f 3 11074.7 3691.6 127.907 < 2.2e-16 ***
Residuals 195 5628.0 28.9
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
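The difference in degrees of freedom can be reproduced with any numeric variable that takes a few distinct values. The sketch below swaps in the built-in mtcars data set (not part of this section) and its cyl column, which takes the values 4, 6, and 8, purely as an illustration.

```r
# cyl entered as a number: one slope, so 1 degree of freedom.
fit_numeric <- lm(mpg ~ cyl, data = mtcars)
anova(fit_numeric)

# cyl entered as a factor with 3 observed levels: 3 - 1 = 2 degrees of freedom.
fit_factor <- lm(mpg ~ factor(cyl), data = mtcars)
anova(fit_factor)

# The Df column of the anova table can be extracted directly.
df_numeric <- anova(fit_numeric)$Df[1]
df_factor  <- anova(fit_factor)$Df[1]
c(numeric = df_numeric, factor = df_factor)
```

The same reasoning applies to grade.f: four of its levels are observed in the data, which gives the 3 degrees of freedom shown above.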
Unless you are going to continue working with the hs0 data frame, it is generally a good idea to detach all attached data frames.
detach()
These pages are Copyrighted (c) by UCLA Academic Technology Services