UCLA Academic Technology Services

R Class Notes
Exploring Data


1.0 R functions used in this unit and the syntax file

attach     attaches a data frame (or list) to the search path
detach     detaches a data frame (or list) from the search path
class      lists the class (type) of an object
mean       calculates the mean
median     calculates the median
range      calculates the minimum and maximum
sd         calculates the standard deviation
var        calculates the variance
summary    generic function that summarizes an object
length     calculates the count (number of elements)
by         applies a function to a data frame split by factors
tapply     applies a function to each cell of a ragged array
cbind      combines columns
hist       histogram plot
histogram  trellis histogram plot(s)
boxplot    box plot
bwplot     trellis box plot(s)
barplot    bar plot
table      frequency table
cor        calculates correlations
plot       generic plot function

Here is the link to the syntax file used for this section.

2.0 Overview of the data set

Read in the hs0 data over the internet using the read.table() function.

hs0 <- read.table("http://www.ats.ucla.edu/stat/R/notes/hs0.csv", header=TRUE, sep=",")

Listing the first 20 observations. We use bracket notation to index the data frame. The first position indicates the rows (observations), and here we ask for rows 1-20. The second position indicates the columns (variables); leaving it blank shows all the columns (variables) for those 20 observations.

hs0[1:20, ]
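
The bracket notation can be tried out without downloading any data. The following is a minimal sketch on a small made-up data frame; the name toy and its values are assumptions for illustration and are not part of hs0.

```r
# Toy data frame standing in for hs0 (values are made up for illustration)
toy <- data.frame(id    = 1:5,
                  read  = c(57, 68, 44, 63, 47),
                  write = c(52, 59, 33, 44, 52))

# rows 1-3, all columns (second position left blank)
first3 <- toy[1:3, ]
first3

# all rows (first position left blank), columns 2-3
scores <- toy[ , 2:3]
scores

# rows 1-3 of a single column named by its name; the result is a plain vector
read3 <- toy[1:3, "read"]
read3
```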

Listing the names of all the variables in the data set.

names(hs0)

Creating a data frame of the variables read through science and printing its first 10 observations using the head() function.

# shorthand way of referring to read, write, math, science 
read.sci <- hs0[ , 7:10]  
# checking the type of object 
class(read.sci)
# listing the first 10 observations
head(read.sci, n=10)

3.0 Descriptive Statistics

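Let's start with the dim function, which prints the number of rows and the number of columns. The length function returns the number of columns when applied to a data frame and the number of observations when applied to a single variable. We then use the summary function to display basic information about each variable in the data.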

# displaying the dimensions 
dim(read.sci)
length(read.sci)
length(read.sci$read)

summary(read.sci)
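
To see how dim and length differ, here is a small sketch on a made-up data frame (the name toy and its values are assumptions, not the hs0 data).

```r
# Toy data frame with 4 observations and 3 variables (made-up values)
toy <- data.frame(read  = c(57, 68, 44, 63),
                  write = c(52, 59, 33, 44),
                  math  = c(41, 53, 54, 47))

dim(toy)           # number of rows, then number of columns
length(toy)        # a data frame is a list of columns, so this counts columns
length(toy$read)   # length of one column is the number of observations
```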

Looking at the range of variables. Notice that the way to refer to a variable in a data frame is via the dollar sign "$" after the name of the data frame. The na.rm=TRUE argument for the range function is used to specify that we want to remove missing observations from the computation; without it, range returns NA for a variable that has missing values.

range(read.sci$write)
range(read.sci$science)
range(read.sci$science, na.rm=TRUE)

# the minimum and the maximum among all the variables
range(read.sci, na.rm=TRUE)
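
The effect of na.rm can be seen on a short vector with one missing value. This is only a sketch; the vector science and the data frame toy below are made up and are not the hs0 variables.

```r
# A made-up vector with one missing value
science <- c(47, 63, NA, 58, 53)

r.default <- range(science)                # the NA propagates to both ends
r.narm    <- range(science, na.rm = TRUE)  # the NA is dropped first
r.default
r.narm

# range also works across all the numeric columns of a data frame
toy <- data.frame(a = c(1, NA), b = c(5, 3))
r.all <- range(toy, na.rm = TRUE)
r.all
```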

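Listing the means and standard deviations of all the variables in the data frame. In current versions of R, mean and sd no longer work column by column on a data frame, so we use colMeans for the means and sapply to apply sd to each column. The na.rm=TRUE argument specifies that we want to remove missing observations from the computation.

colMeans(read.sci)
colMeans(read.sci, na.rm=TRUE)
sapply(read.sci, sd, na.rm=TRUE)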

Now let's look at the entire data set. We want to look at the means and standard deviations of each numeric variable in the data set by program type (prgtype). First of all, let's look at the one-way table for the variable prgtype using the table function. Then we will look at the means and standard deviations, keeping only the numeric columns because a mean cannot be computed for a character or factor variable.

table(hs0$prgtype)
# keep only the numeric columns of hs0
hs0.num <- hs0[sapply(hs0, is.numeric)]
by(hs0.num, hs0$prgtype, colMeans)
by(hs0.num, hs0$prgtype, function(d) sapply(d, sd))

This might be a good time to attach the data frame so we can save some typing by not having to put the name of the data frame and the "$" sign before the variable names.

# attaching hs0, so its variables will be searchable by R
attach(hs0)

Let's reproduce the means and make the output look a little nicer. We can change the number of digits to be displayed from the default of 7 to 2 by using the options function. Let's first use the getOption function to display the current setting. Only the numeric columns of hs0 are summarized.

getOption("digits")
options(digits=2)
by(hs0[sapply(hs0, is.numeric)], prgtype, colMeans, na.rm=TRUE)
by(hs0[sapply(hs0, is.numeric)], prgtype, function(d) sapply(d, sd, na.rm=TRUE))
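
The digits option only changes how numbers are displayed, not how they are stored. A minimal sketch, using pi as an arbitrary example value and the hypothetical names old and shown:

```r
# options() returns the previous settings, so they can be restored later
old <- options(digits = 2)

# format() follows the digits option when no digits argument is given
shown <- format(pi)
shown

# restore whatever the setting was before
options(old)
getOption("digits")
```

Saving the old setting and restoring it with options(old) avoids having to remember what the default was.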

Descriptive statistics for the variable write by prgtype in a much nicer display.

m   <- tapply(write, prgtype, mean)
v   <- tapply(write, prgtype, var)
med <- tapply(write, prgtype, median)
n   <- tapply(write, prgtype, length)
# named s rather than sd so that the sd function is not masked
s   <- tapply(write, prgtype, sd)
cbind(mean=m, var=v, std.dev=s, median=med, n=n)
# set the number of digits back to the default of 7
options(digits=7)
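
The same tapply and cbind pattern can be tried on a few made-up values. In this sketch the vectors score and prog are assumptions that stand in for write and prgtype.

```r
# Made-up scores and program types for six students
score <- c(52, 59, 33, 44, 52, 60)
prog  <- c("academic", "general", "academic", "vocational", "general", "academic")

# one statistic per program type; the groups are sorted alphabetically
m <- tapply(score, prog, mean)
n <- tapply(score, prog, length)

# bind the named vectors side by side: one row per group, one column per statistic
res <- cbind(mean = m, n = n)
res
```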

4.0 Exploring the data through graphs

R has many graphics capabilities which can be useful while exploring data as well as during data analysis. One of the great features of the graphics capabilities is how easy it is to do conditional graphs, also called trellis graphs. The first histogram of write is the old-fashioned type of graph, which is slightly clunky; the second histogram of write is a trellis graph without any conditioning; and the last graph is a histogram of write conditional on the level of gender.
Note: The trellis graph functions are in the lattice package, which must be loaded with library(lattice) before they can be used. lattice is one of the recommended packages normally installed together with R.

hist(write)
# load trellis graphics
library(lattice)  

# trellis graphs 
histogram(~write, hs0, type="count")


# histogram of write by gender
histogram(~write | gender, hs0, type="count")  

# change the number of bins to 15
hist(write, breaks=15)

# boxplot function in the graphics package
boxplot(write)

# trellis box plots of write for each level of ses
bwplot(ses ~ write, hs0)

# the same box plots conditioned on gender
bwplot(ses ~ write | gender, hs0)

Barplots are another way of visualizing data. The first graph shows ses by gender, with the levels of ses stacked on top of one another. In the second graph we add the beside argument, which draws each level of ses as a separate bar and makes the levels easier to compare, and the ylim argument, which specifies that the y-axis should range from 0 to 50. The third graph includes more options. The col argument specifies the colors to use in the plot. To distinguish the levels of ses within each level of gender more easily, the space argument indicates that within each level of gender the bars should be separated by a space equal to 1/10 of a bar, whereas the space between the levels of gender is equal to the width of one bar. The names.arg, main and args.legend arguments set the group labels, the title and the position of the legend. In all three graphs the bars are ordered according to the levels of gender, so the bar on the left (in the graphs using beside, the group of bars on the left) corresponds to gender=0 and the one on the right to gender=1.

barplot(table(ses, gender), legend.text=c("low", "medium", "high"))
barplot(table(ses, gender), beside=TRUE, legend.text=c("low", "medium", "high"), ylim=c(0, 50))

# changing the location of legend and adding a title, etc
barplot(table(ses, gender), beside=TRUE, legend.text=c("low", "medium", "high"),
        ylim=c(0, 50), space=c(.1, 1),
        col=c("lightblue", "blue", "darkblue"), names.arg=c("male", "female"),
        main="Distribution of SES by gender", args.legend=list(x=9, y=45, cex=.6))
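
One way to check what the space argument does is to look at the value barplot returns, which holds the midpoints of the bars. The sketch below uses a made-up table of counts (the matrix counts is an assumption, not the real ses by gender table).

```r
# Made-up counts: rows are levels of ses, columns are levels of gender
counts <- matrix(c(15, 20, 12, 32, 75, 46), nrow = 3,
                 dimnames = list(ses = c("low", "medium", "high"),
                                 gender = c("0", "1")))

# barplot returns the bar midpoints; with beside=TRUE it is a matrix
# with one column per group (here, per level of gender)
mids <- barplot(counts, beside = TRUE, space = c(0.1, 1), ylim = c(0, 80))
mids

# distance between neighbouring midpoints: bar width plus the gap
gaps <- diff(as.vector(mids))
gaps
```

With the default bar width of 1, neighbouring bars inside a group should be 1.1 apart (one bar plus a gap of 0.1), and the step from one group to the next should be 2 (one bar plus a gap of 1).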

5.0 Frequency Tables

One-way frequency table of ses and two-way crosstabulation of gender and ses.

table(ses)
tab1 <- table(gender, ses)
# print the two-way table
tab1

Next we compute the row and column proportions and frequencies for the two-way table.

# row proportions
prop.table(tab1,1)  
     
# column proportions
prop.table(tab1,2)  
     
# row frequencies
rowSums(tab1)  
 
# column frequencies
colSums(tab1)  
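
To check which margin is which, the proportions can be summed back up. A minimal sketch on made-up values; the vectors below are assumptions for illustration and are created inside the sketch rather than taken from hs0 (if hs0 is still attached, these new vectors take precedence over the attached variables of the same name until they are removed with rm).

```r
# Made-up gender and ses codes for eight students
gender <- c(0, 0, 1, 1, 1, 0, 1, 1)
ses    <- c(1, 2, 2, 3, 1, 2, 3, 3)
tab <- table(gender, ses)
tab

# margin 1: each row (level of gender) is divided by its row total
row.p <- prop.table(tab, 1)
rowSums(row.p)   # every row sums to 1

# margin 2: each column (level of ses) is divided by its column total
col.p <- prop.table(tab, 2)
colSums(col.p)   # every column sums to 1
```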

6.0 Correlations and scatter plots

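Correlations of write, read, math and science. By default cor returns NA for any pair of variables that contains missing values, so it is important to use the use argument to indicate how the missing values should be handled: use="complete.obs" applies listwise deletion, dropping every observation with a missing value on any of the variables, while use="pairwise.complete.obs" uses all the observations that are complete for each pair of variables.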

# correlation of a pair of variables
cor(write, math)
cor(write, science)
cor(write, science, use="complete.obs")

# correlation matrix
cor(read.sci, use="complete.obs")
cor(read.sci, use="pairwise.complete.obs")
plot(math, write)

# scatter plot matrix
plot(read.sci)
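
The difference between the two ways of handling missing values shows up clearly on a small example. This is a sketch with a made-up data frame d, in which math and science are each missing for a different observation.

```r
# Made-up scores; math is missing in row 6 and science in row 2
d <- data.frame(write   = c(52, 59, 33, 44, 52, 60),
                math    = c(41, 53, 54, 47, 57, NA),
                science = c(47, NA, 58, 53, 53, 63))

# without the use argument the missing value gives NA
r.na <- cor(d$write, d$math)
r.na

# listwise deletion: only rows complete on all three variables are used
c.list <- cor(d, use = "complete.obs")
# pairwise deletion: each pair uses every row complete for that pair
c.pair <- cor(d, use = "pairwise.complete.obs")
c.list
c.pair
```

Here the listwise correlation of write and math is based on rows 1, 3, 4 and 5, while the pairwise version also uses row 2, so the two matrices need not agree.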

Unless you are going to continue working with the hs0 data frame it is generally a good idea to detach all attached data frames.

detach(hs0)
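
What attach and detach do to the search path can be checked with the search function. A minimal sketch on a made-up data frame; the names toy and score are assumptions.

```r
# Made-up data frame with a single variable
toy <- data.frame(score = c(52, 59, 33))

attach(toy)
on.path <- "toy" %in% search()    # toy is now on the search path
avg <- mean(score)                # score is found without toy$
avg

detach(toy)
off.path <- "toy" %in% search()   # toy has been removed again
off.path
```

Naming the data frame in detach makes it explicit which entry is removed from the search path.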

These pages are Copyrighted (c) by UCLA Academic Technology Services