UCLA Academic Technology Services

R Class Notes
Modifying Data


1.0 R functions used in this unit

sapply applies a function to each element of a list (a data frame is a list of columns)
factor creates a categorical variable, with value labels if desired
table creates a frequency table
bwplot draws trellis box plots (lattice package)
anova generic function, used here to extract the ANOVA table from an lm object
lm fits a linear model

2.0 Factor Variables

Read in the hs0 data over the Internet using the read.table function.
Note: If you have worked through the previous sections, the hs0 data may already be available to you.

hs0 <- read.table("http://www.ats.ucla.edu/stat/R/notes/hs0.csv", header=TRUE, sep=",")

attach(hs0)

We use the sapply function with the is.factor function to check if any of the variables in the hs0 data frame are factor variables.

sapply(hs0, is.factor)
 gender      id    race     ses  schtyp prgtype    read   write    math 
  FALSE   FALSE   FALSE   FALSE   FALSE    TRUE   FALSE   FALSE   FALSE 
science   socst 
  FALSE   FALSE
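
Whether a text column such as prgtype arrives as a factor depends on the stringsAsFactors argument of read.table, whose default changed to FALSE in R 4.0.0. The sketch below shows both outcomes. It is only an illustration: the three-row data set and its values are hypothetical stand-ins for hs0, and the names csv_text, as_chr, and as_fac are ours.

```r
# Hypothetical three-row stand-in for hs0; the values are made up.
csv_text <- "id,prgtype,read
1,general,57
2,academic,68
3,vocational,44"

# Strings kept as character (the default from R 4.0.0 onwards).
as_chr <- read.table(text = csv_text, header = TRUE, sep = ",",
                     stringsAsFactors = FALSE)

# Strings converted to factors (the default in older versions of R).
as_fac <- read.table(text = csv_text, header = TRUE, sep = ",",
                     stringsAsFactors = TRUE)

sapply(as_chr, is.factor)   # FALSE for every column
sapply(as_fac, is.factor)   # TRUE for prgtype only
```

If prgtype shows FALSE on your system, passing stringsAsFactors=TRUE to read.table should reproduce the output shown above.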

Creating a factor (categorical) variable called schtyp.f from schtyp, with value labels.
Note: attach places a copy of the data frame on the search path, so every time we modify a data set that we have attached, we need to detach and attach it again before the new variables become available by name. It is therefore generally a good idea to do all data management, manipulation, and recoding in one step and only then detach and re-attach the data. In this section we demonstrate the effect of each manipulation separately, so we must detach and attach several times.
Note: When detaching repeatedly, be very sure of which object you are detaching. Called without arguments, detach removes the object in the second position of the search path. Loading a package with library places that package in the second position, so if you have loaded packages while doing data manipulation, you could end up detaching a package rather than the data set you just finished modifying. If you do data management at several stages, we strongly recommend using the search function to verify that the object you are about to detach is the intended one. If your data set is not second in the search path, you can still detach it by using the pos argument of the detach function or by naming it, as in detach(hs0).

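# store the factor in hs0 itself; a plain schtyp.f <- ... would only create a copy in the workspace
hs0$schtyp.f <- factor(schtyp, levels=c(1, 2), labels=c("public", "private"))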
# to check that hs0 is the second object in the search path
search()
 [1] ".GlobalEnv"        "hs0"               "package:methods"  
 [4] "package:stats"     "package:graphics"  "package:grDevices"
 [7] "package:utils"     "package:datasets"  "Autoloads"        
[10] "package:base" 
detach()
attach(hs0)
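
The search-path check described in the note can be sketched as follows. This is a minimal illustration under our own assumptions: toy is a hypothetical stand-in for hs0, and the list attached under the name "extra" simulates a package loaded after the data were attached.

```r
toy <- data.frame(x = 1:3)             # hypothetical stand-in for hs0
attach(toy)
attach(list(y = 0), name = "extra")    # simulates something attached later, e.g. by library()

search()[2:3]                          # "extra" "toy": toy is no longer second
pos <- match("toy", search())          # position of toy in the search path
pos                                    # 3

# A bare detach() would remove "extra" here, so we detach by name instead.
detach(toy)
"toy" %in% search()                    # FALSE

detach("extra", character.only = TRUE) # clean up the simulated entry
```

Detaching by name, or by detach(pos = pos), removes the intended object regardless of what has been attached since.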

Checking the factor variable schtyp.f in a frequency table.

table(schtyp.f)
schtyp.f
 public private 
    168      32
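
As a small illustration of how levels and labels pair up in factor, the sketch below uses a made-up vector of codes (codes and school are hypothetical names, not part of hs0). Any value that is not listed in levels becomes NA.

```r
codes <- c(1, 2, 1, 1, 2, 3)   # 3 is deliberately not among the declared levels

# levels gives the codes to recognize; labels gives their names, in the same order
school <- factor(codes, levels = c(1, 2), labels = c("public", "private"))

as.character(school)           # the code 3 has become NA
table(school, useNA = "ifany") # public 3, private 2, NA 1
```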

Creating a factor variable called female from gender with value labels.

# store the factor in hs0 so that re-attaching makes it available by name
hs0$female <- factor(gender, levels=c(0, 1), labels=c("male", "female"))
detach()
attach(hs0)

Checking the factor variable female in a frequency table.

table(female)
female
  male female 
    91    109 

3.0 Recoding

Recoding race=5 to be NA (to be missing).

table(race)
  1   2   3   4   5 
 24  11  20 143   2
 
# recode inside hs0; race[race==5] <- NA would only change a copy in the workspace
hs0$race[hs0$race==5] <- NA
detach()
attach(hs0)
table(race)
race
  1   2   3   4 
 24  11  20 143 
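
By default table leaves missing values out, which is why the two students recoded to NA no longer appear. The sketch below, on a made-up vector (race_demo is a hypothetical name), shows one way to confirm that the recoded values are still present as NA.

```r
race_demo <- c(1, 4, 5, 4, 2, 5, 3)   # made-up codes, two of them equal to 5
race_demo[race_demo == 5] <- NA       # recode 5 to missing

table(race_demo)                      # NA values are silently dropped
table(race_demo, useNA = "ifany")     # adds an <NA> column with count 2
sum(is.na(race_demo))                 # 2
```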

Creating a variable called total = read + write + socst.

# store total in hs0 so that re-attaching makes it available by name
hs0$total <- read+write+socst
detach()
attach(hs0)
mean(total)
[1] 157.41

Creating a variable called grade based on total.

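# start every student at 0; assigning to a column of hs0 recycles the 0 to all rows
hs0$grade <- 0
hs0$grade[total >= 80 & total < 110] <- 1
hs0$grade[total >= 110 & total < 140] <- 2
hs0$grade[total >= 140 & total < 170] <- 3
hs0$grade[total >= 170] <- 4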
detach()
attach(hs0)
table(grade)
grade
 1  2  3  4 
 6 48 76 70
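
The same banding can also be written with the cut function, which is not used in these notes. The sketch below is an alternative under our own assumptions: the totals are made up, and total_demo and grade_demo are hypothetical names. Setting right = FALSE makes each interval closed on the left, which matches the >= and < comparisons above.

```r
total_demo <- c(75, 95, 110, 139, 140, 169.9, 170, 207)   # made-up totals

grade_demo <- cut(total_demo,
                  breaks = c(-Inf, 80, 110, 140, 170, Inf),
                  labels = c("F", "D", "C", "B", "A"),
                  right  = FALSE)     # intervals are [80, 110), [110, 140), ...

data.frame(total_demo, grade_demo)    # F D C C B B A A
table(grade_demo)
```

Because cut returns a factor directly, the separate numeric grade and the later call to factor are combined into one step in this variant.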

Creating a factor variable called grade.f based on grade.

# store the factor in hs0 so that re-attaching makes it available by name
hs0$grade.f <- factor(grade, levels=0:4, labels=c("F", "D", "C", "B", "A"))
detach()
attach(hs0)
is.factor(grade.f)
[1] TRUE
table(grade.f)
grade.f
 F  D  C  B  A 
 0  6 48 76 70 
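
The F column with a count of 0 appears because a factor keeps every declared level, whether or not it occurs in the data. The sketch below shows this on a made-up vector (g is a hypothetical name), along with droplevels, a base R function not used in these notes, which removes unused levels if they are unwanted.

```r
g <- factor(c(1, 2, 2, 4), levels = 0:4, labels = c("F", "D", "C", "B", "A"))

table(g)               # F and B are listed with a count of 0
table(droplevels(g))   # only D, C and A remain
nlevels(g)             # 5
nlevels(droplevels(g)) # 3
```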

4.0 Examples of the use of variable labels

Labels can be very useful in graphs.
Note: We need to load the lattice package in order to be able to use the trellis graph functions.

library(lattice)

# boxplot, no labels
bwplot(grade ~ write, hs0)



# boxplot with labels
bwplot(grade.f ~ write, hs0)

The summary of total shows why there are no F's: the minimum is 95, and a grade of 0 (F) is reserved for totals below 80.

summary(total)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   95.0   139.0   159.0   157.4   177.0   207.0 

Labels are also nice when looking at frequency tables.

# without labels
table(schtyp, gender)
      gender
schtyp 0  1 
     1 77 91
     2 14 18

# with labels
table(schtyp.f, female)
         female
schtyp.f  male female
  public  77   91    
  private 14   18 

Factor variables can also be used in statistical models.

# The variable grade is treated as a continuous variable, hence 1 degree of freedom.
anova(lm(write~gender+grade, hs0))
Analysis of Variance Table

Response: write
           Df  Sum Sq Mean Sq F value    Pr(>F)    
gender      1  1176.2  1176.2  41.089 1.052e-09 ***
grade       1 11063.3 11063.3 386.475 < 2.2e-16 ***
Residuals 197  5639.4    28.6                      
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 

# The variable grade.f is treated as a categorical variable; four of its five levels occur in the data, thus 3 degrees of freedom.
anova(lm(write~female+grade.f, hs0))
Analysis of Variance Table

Response: write
           Df  Sum Sq Mean Sq F value    Pr(>F)    
female      1  1176.2  1176.2  40.754 1.234e-09 ***
grade.f     3 11074.7  3691.6 127.907 < 2.2e-16 ***
Residuals 195  5628.0    28.9                      
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 
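
The difference in degrees of freedom can be reproduced without the hs0 data. The sketch below uses the built-in mtcars data set as a stand-in (our choice, not part of these notes); its cyl variable takes the three values 4, 6, and 8.

```r
# cyl entered as a number: a single slope, hence 1 degree of freedom
fit_num <- lm(mpg ~ cyl, data = mtcars)

# cyl entered as a factor: 3 levels, hence 3 - 1 = 2 degrees of freedom
fit_fac <- lm(mpg ~ factor(cyl), data = mtcars)

anova(fit_num)$Df   # 1 30
anova(fit_fac)$Df   # 2 29
```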

Unless you are going to continue working with the hs0 data frame, it is generally a good idea to detach all attached data frames.

detach()

These pages are Copyrighted (c) by UCLA Academic Technology Services

