UCLA Academic Technology Services

R Class Notes
Modifying Data


1.0 R functions used in this unit and the syntax file

comment add comment to an object
sapply apply a function over a list or vector
is.factor check if a variable is a factor variable
factor creates a categorical variable with value labels if desired
table creates frequency table

Here is the link to the syntax file used for this section.

2.0 Commenting a data frame or a variable

It is good practice to label the data sets and variables we work with. The comment function does this by attaching a descriptive label to an object. The label is stored as an attribute and is not shown when the object is printed.

# cleaning up 
rm(list=ls())

# reading in data
hs0 <- read.table("http://www.ats.ucla.edu/stat/R/notes/hs0.csv", header=TRUE, sep=",")
# commenting the data set 
comment(hs0)<-"High school and beyond data"
# checking
comment(hs0)
# variable labels using comment
comment(hs0$write)<-"writing score"
comment(hs0$read) <-"reading score"

# more checking to make sure that our comments stay with the data frame
save(hs0,file="hs0.rda") 
rm(list=ls())
load(file="hs0.rda")
comment(hs0)
comment(hs0$write)
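
As a small self-contained illustration, the sketch below uses a made-up data frame called toy (not the hs0 data; all names and values are invented) to show that a comment is stored as an attribute named "comment" and can be retrieved with either comment() or attr().

```r
# toy data frame invented for this illustration
toy <- data.frame(id = 1:3, score = c(52, 61, 47))
comment(toy) <- "Toy data for illustrating comment()"
comment(toy$score) <- "made-up test score"

# the comment is kept as an attribute named "comment"
has_comment <- identical(attr(toy, "comment"), comment(toy))
has_comment

# a comment on a column stays with that column inside the data frame
comment(toy$score)
```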

3.0 Creating factor variables

For the rest of this section, we attach hs0 so that its variables can be referred to by name, which keeps the syntax cleaner. The search() function displays what is currently on the search path.

search()
attach(hs0)
search()

We use the sapply function with the is.factor function to check if any of the variables in the hs0 data frame are factor variables.

sapply(hs0, is.factor)
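
The result of sapply here is a named logical vector with one element per column, so it can also be used to pick out the factor columns. A minimal sketch on a made-up data frame (toy and its columns are invented for illustration):

```r
# toy data frame with one factor column; stringsAsFactors = FALSE keeps
# the character column as character in every version of R
toy <- data.frame(id = 1:3,
                  prog = factor(c("general", "academic", "academic")),
                  name = c("a", "b", "c"),
                  stringsAsFactors = FALSE)

# one TRUE/FALSE per column, named after the column
is_fac <- sapply(toy, is.factor)
is_fac

# names of the columns that are factors
names(toy)[is_fac]
```

Whether character columns read by read.table become factors depends on the stringsAsFactors setting, whose default changed to FALSE in R 4.0.0, so this check may give different answers for the same file on different versions of R.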

Creating a factor (categorical) variable called schtyp.f from schtyp and a factor variable called female from gender, both with value labels.

schtyp.f <- factor(schtyp, levels=c(1, 2), labels=c("public", "private"))
female <- factor(gender, levels=c(0, 1), labels=c("male", "female")) 
table(schtyp.f)
table(female)
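
One detail worth knowing about the levels argument: any value that is not listed in levels becomes NA in the resulting factor. The sketch below shows this on a made-up vector of codes (the values are invented, with 3 standing in for an undeclared code).

```r
# made-up school type codes; 3 is not one of the declared levels
codes <- c(1, 2, 2, 1, 3)
f <- factor(codes, levels = c(1, 2), labels = c("public", "private"))
f

# the undeclared code shows up as NA
table(f, useNA = "ifany")

# the underlying integer codes index into levels(f)
as.integer(f)
```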

4.0 Recoding variables and generating new variables

Recoding race=5 to be NA (to be missing).

table(hs0$race)
hs0$race[hs0$race==5] <-NA
table(hs0$race)

# displaying the missings as well
table(hs0$race, useNA="ifany")
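
One thing to keep in mind when recoding a data frame that has been attached, as hs0 was above: attach() places a copy of the columns on the search path, so later changes to the data frame do not show up in the attached copy. A minimal sketch with a made-up data frame (toy and its values are invented):

```r
# toy data frame invented for this illustration
toy <- data.frame(race = c(1, 2, 5, 3, 5))
attach(toy)

# recode 5 to NA in the data frame itself
toy$race[toy$race == 5] <- NA

# the data frame has the new missing values ...
n_df <- sum(is.na(toy$race))
# ... but the attached copy found by the bare name still has the 5s
n_attached <- sum(is.na(race))
c(n_df, n_attached)

detach(toy)
```

This is why the recoding here is written with hs0$race rather than the bare name race.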

Creating a variable called total = read + write + math + science.

total<-read+write+math+science
# noticing the missing values generated
summary(total)

Creating a variable called grade based on total.

# initializing grade as missing, with one element per student
grade<-rep(NA, length(total))
grade[total <=140]<-0
grade[total > 140 & total <= 180] <-1
grade[total > 180 & total <= 210] <-2
grade[total > 210 & total <= 234] <-3
grade[total > 234] <-4

grade<-factor(grade, levels=c(0, 1, 2, 3, 4), labels=c("F", "D", "C", "B", "A"))
# commenting after factor(), since factor() does not carry the comment over
comment(grade)<-"combined grades of read, write, math, science"
table(grade)
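
As an alternative sketch, the same kind of banding can be done in one step with the cut function, which by default uses intervals that are closed on the right, matching the <= comparisons above. The totals below are made up to include the boundary values and a missing value.

```r
# made-up totals covering each boundary, plus a missing value
total <- c(120, 140, 141, 180, 195, 210, 234, 235, NA)

# intervals are (-Inf,140], (140,180], (180,210], (210,234], (234,Inf]
grade <- cut(total,
             breaks = c(-Inf, 140, 180, 210, 234, Inf),
             labels = c("F", "D", "C", "B", "A"))
table(grade, useNA = "ifany")
```

cut returns a factor directly, and a missing total stays missing in grade.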

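Creating mean scores in two ways that handle missing values differently.

# m1 is NA whenever any of the four scores is missing
m1<-(read+write+math+science)/4
# rowMeans without na.rm is missing in the same rows as m1
m2<-rowMeans(cbind(read, write, math, science))
# with na.rm=TRUE, m2 is the mean of the non-missing scores
m2<-rowMeans(cbind(read, write, math, science), na.rm=TRUE)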
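
To see the difference between the two approaches, here is a small sketch with made-up scores for three students (the values are invented; the second and third students each have one missing score).

```r
# made-up scores; NA marks a missing score
read    <- c(50, 60, 55)
write   <- c(52, NA, 59)
math    <- c(48, 62, 57)
science <- c(50, 58, NA)

# arithmetic on a missing value gives a missing mean
m1 <- (read + write + math + science) / 4
# na.rm = TRUE averages whatever scores are available in each row
m2 <- rowMeans(cbind(read, write, math, science), na.rm = TRUE)

cbind(m1, m2)
```

Note that with na.rm = TRUE the means are based on different numbers of scores from row to row, which may or may not be appropriate for a given analysis.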

At this point, we might want to combine the new variables we have created with the original data set. We can use the cbind function for this.

# passing the new variables directly to cbind keeps the factors as factors
hs1<-cbind(hs0, schtyp.f, female, total, grade)
table(hs1$race)
is.data.frame(hs1)
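
When combining factors with cbind, it matters whether a data frame is among the arguments. A call to cbind on vectors alone builds a matrix, which turns factors into their integer codes, whereas a call that includes a data frame keeps each factor intact. A minimal sketch with made-up objects (toy, sex and total are invented for illustration):

```r
# made-up data frame and new variables
toy <- data.frame(id = 1:3)
sex <- factor(c(0, 1, 1), levels = c(0, 1), labels = c("male", "female"))
total <- c(201, 180, 170)

# the inner cbind builds a numeric matrix, so the labels are lost
nested <- cbind(toy, cbind(sex, total))
# with the data frame in the same call, sex stays a factor
direct <- cbind(toy, sex, total)

is.factor(nested$sex)
is.factor(direct$sex)
```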

These pages are Copyrighted (c) by UCLA Academic Technology Services