UCLA Academic Technology Services

R Class Notes
Exploring Data


1.0 R functions used in this unit

head        displays the first n observations
sapply      applies a function to each element of a list
colMeans    column means
colSums     column sums
rowSums     row sums
median      calculates the median
length      calculates the count
var         calculates the variance
sd          calculates the standard deviation
tapply      applies a function to each cell of a ragged array
cbind       combines columns
summary     generic function that provides a synopsis of an object
hist        histogram plot
histogram   trellis histogram plot(s)
boxplot     box plot
bwplot      trellis box plot(s)
stem        stem-and-leaf plot
barplot     bar plot
table       frequency table
cor         calculates correlations
lm          fits a linear model
plot        generic plot function
abline      adds a line to an existing plot

2.0 Overview of the data set

Read in the hs0 data using the read.table() function and attach the data frame.

hs0 <- read.table("http://www.ats.ucla.edu/stat/R/notes/hs0.csv", header=T, sep=",")

attach(hs0)

Listing the first 20 observations.
We use bracket notation to index the data frame. The first position selects the rows (observations); here we ask for rows 1-20. The second position selects the columns (variables); leaving it blank returns all the columns for those 20 rows.

hs0[1:20, ]
   gender  id race ses schtyp  prgtype read write math science socst
1       0  70    4   1      1  general   57    52   41      47    57
2       1 121    4   2      1   vocati   68    59   53      63    61
3       0  86    4   3      1  general   44    33   54      58    31
4       0 141    4   3      1   vocati   63    44   47      53    56
5       0 172    4   2      1 academic   47    52   57      53    61
6       0 113    4   2      1 academic   44    52   51      63    61
7       0  50    3   2      1  general   50    59   42      53    61
8       0  11    1   2      1 academic   34    46   45      39    36
9       0  84    4   2      1  general   63    57   54      NA    51
10      0  48    3   2      1 academic   57    55   52      50    51
11      0  75    4   2      1   vocati   60    46   51      53    61
12      0  60    5   2      1 academic   57    65   51      63    61
13      0  95    4   3      1 academic   73    60   71      61    71
14      0 104    4   3      1 academic   54    63   57      55    46
15      0  38    3   1      1 academic   45    57   50      31    56
16      0 115    4   1      1  general   42    49   43      50    56
17      0  76    4   3      1 academic   47    52   51      50    56
18      0 195    4   2      2  general   57    57   60      NA    56
19      0 114    4   3      1 academic   68    65   62      55    61
20      0  85    4   2      1  general   55    39   57      53    46
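
The same bracket notation works with row numbers, column numbers or column names. The sketch below uses a small hypothetical data frame called scores (not the hs0 data) so that it can be run without downloading anything.

```r
# Hypothetical toy data frame used only to illustrate indexing
scores <- data.frame(id    = c(70, 121, 86, 141),
                     read  = c(57, 68, 44, 63),
                     write = c(52, 59, 33, 44))

first.two  <- scores[1:2, ]                   # rows 1-2, all columns
read.write <- scores[ , c("read", "write")]   # all rows, two columns by name
one.cell   <- scores[3, "write"]              # a single value: row 3, column write

dim(first.two)
names(read.write)
one.cell
```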

Printing the first 10 observations for the variables read through science (columns 7 through 10) using the head() function.

names(hs0)
 [1] "gender"  "id"      "race"    "ses"     "schtyp"  "prgtype" "read"   
 [8] "write"   "math"    "science" "socst" 

vars <- hs0[ , 7:10]  # shorthand way of referring to read, write, math, science 

head(vars, n=10)

      read write math science
 [1,]   57    52   41      47
 [2,]   68    59   53      63
 [3,]   44    33   54      58
 [4,]   63    44   47      53
 [5,]   47    52   57      53
 [6,]   44    52   51      63
 [7,]   50    59   42      53
 [8,]   34    46   45      39
 [9,]   63    57   54      NA
[10,]   57    55   52      50

3.0 Descriptive Statistics

Listing the means of all the variables in the data frame. The na.rm=T argument is passed on to the mean function and removes missing observations from the computation of the means. The call generates a warning because it also tries to compute the mean of prgtype, which is not numeric, so NA is returned for that variable.

sapply(hs0, mean, na.rm=T)
   gender        id      race       ses    schtyp   prgtype      read 
  0.54500 100.50000   3.44000   2.05500   1.16000        NA  52.23000 
    write      math   science     socst 
 52.77500  52.64500  51.66154  52.40500 
Warning message: 
argument is not numeric or logical: returning NA in: mean.default(X[[6]], ...) 
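
One way to avoid the warning is to restrict the call to the numeric columns. This is only a sketch on a hypothetical data frame named toy; the helper vector is.num is an illustrative name, not part of the original notes.

```r
# Hypothetical toy data with one non-numeric column
toy <- data.frame(id      = 1:4,
                  prgtype = c("general", "vocati", "general", "academic"),
                  read    = c(57, 68, 44, NA))

is.num    <- sapply(toy, is.numeric)                  # TRUE for numeric columns only
num.means <- sapply(toy[ , is.num], mean, na.rm = TRUE)
num.means
```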

Obtaining other descriptive statistics such as count, medians, variances and standard deviations for the variables read, write, math and science, which are found in columns 7 through 10. Again we use the na.rm=T argument to indicate that we want to remove missing observations since science has a few missing observations.

sapply(vars, length)  # count
   read   write    math science 
    200     200     200     200 
    
# length counts missing values too, so the count for science is too high;
# we create a new variable with only the nonmissing cases of science
# and then apply the length function to it
science.good <- na.omit(science)
length(science.good)
[1] 195
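
The nonmissing count can also be obtained for every column at once by summing the result of !is.na(). A minimal sketch on hypothetical data (toy and n.valid are illustrative names):

```r
# Hypothetical toy data: science has two missing values
toy <- data.frame(math    = c(41, 53, 54, 47),
                  science = c(47, NA, 58, NA))

# is.na() flags missing values; summing the negation counts the valid ones
n.valid <- sapply(toy, function(x) sum(!is.na(x)))
n.valid
```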

sapply(vars, median, na.rm=T)  # median
   read   write    math science 
     50      54      52      53 
     
sapply(vars, var, na.rm=T)  # variance
     read     write      math   science 
105.12271  89.84359  87.76781  97.33846 

sapply(vars, sd, na.rm=T)  # standard deviation
     read     write      math   science 
10.252937  9.478586  9.368448  9.866026 

sapply(vars, min, na.rm=T)
   read   write    math science 
     28      31      33      26 
     
sapply(vars, max, na.rm=T)
   read   write    math science 
     76      67      75      74 
     
sapply(vars, fivenum, na.rm=T)  # five number summary
     read write math science
[1,]   28  31.0   33      26
[2,]   44  45.5   45      44
[3,]   50  54.0   52      53
[4,]   60  60.0   59      58
[5,]   76  67.0   75      74

We can also use the colMeans function to obtain the means. The columns can be specified by position, as with vars in the sapply calls above, or by name using cbind, as here.

colMeans(cbind(read, math, science, write), na.rm=T)
    read     math  science    write 
52.23000 52.64500 51.66154 52.77500
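
Selecting columns by position and binding them by name give the same means. The sketch below checks this on a hypothetical data frame; with() is used in place of attach() so the example leaves the search path untouched.

```r
# Hypothetical toy data
toy <- data.frame(read  = c(57, 68, 44),
                  write = c(52, 59, 33),
                  math  = c(41, 53, NA))

by.position <- colMeans(toy[ , c(1, 3)], na.rm = TRUE)            # columns 1 and 3
by.name     <- colMeans(with(toy, cbind(read, math)), na.rm = TRUE)

all.equal(by.position, by.name)
```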

Descriptive statistics can also be computed for a subset of the data frame. In this example, we are looking at the summary statistics for only those students who had a reading score of 60 or higher.

sapply(vars[read >= 60, ], mean, na.rm=T)
    read    write     math  science 
65.48214 59.53571 60.25000 59.43396 

sapply(vars[read >= 60, ], median, na.rm=T)
   read   write    math science 
   65.0    60.0    60.5    61.0 

Obtaining the means of the variables write and science broken down by prgtype. Science is the only variable with missing observations and thus we use the na.rm argument to remove the missing observations from the calculation of its mean.

tapply(write, prgtype, mean)
academic  general   vocati 
56.25714 51.33333 46.76000 

tapply(science, prgtype, mean, na.rm=T)
academic  general   vocati 
53.61765 52.18605 47.22000

Other descriptive statistics including variances, standard deviations, medians and counts for the variable write broken down by prgtype.

tapply(write, prgtype, length)  # count
academic  general   vocati 
     105       45       50
     
tapply(write, prgtype, var)  # variance
academic  general   vocati 
63.09670 88.31818 86.83918

tapply(write, prgtype, sd)  # standard deviation
academic  general   vocati 
7.943343 9.397775 9.318754 

tapply(write, prgtype, median) # median
academic  general   vocati 
      59       54       46

Descriptive statistics for the variable write by prgtype in a much nicer display. The standard deviations are stored in s rather than sd so that the name of the sd function is not masked.

m   <- tapply(write, prgtype, mean)
v   <- tapply(write, prgtype, var)
med <- tapply(write, prgtype, median)
n   <- tapply(write, prgtype, length)
s   <- tapply(write, prgtype, sd)
cbind(mean=m, var=v, std.dev=s, median=med, n=n)
             mean      var  std.dev median   n
academic 56.25714 63.09670 7.943343     59 105
general  51.33333 88.31818 9.397775     54  45
vocati   46.76000 86.83918 9.318754     46  50
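
The same kind of table can be built in one step by splitting the variable by group and applying a small helper that returns all the statistics. This is a sketch on hypothetical data; the helper name stats is illustrative.

```r
# Hypothetical toy data
toy <- data.frame(write   = c(52, 59, 33, 44, 52, 60),
                  prgtype = c("general", "vocati", "general",
                              "vocati", "academic", "academic"))

# Helper returning several statistics for one group
stats <- function(x) c(mean = mean(x), sd = sd(x), median = median(x), n = length(x))

# split() gives one vector per group; sapply() returns one column per group,
# so t() is used to put the groups in rows
by.group <- t(sapply(split(toy$write, toy$prgtype), stats))
by.group
```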

More descriptive statistics including quantiles can be obtained by using the summary function.

summary(science)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  26.00   44.00   53.00   51.66   58.00   74.00    5.00 

4.0 Exploring the data through graphs

R has many graphics capabilities which are useful while exploring data as well as during data analysis. One of the great features of these capabilities is how easy it is to produce conditional graphs, also called trellis graphs. The first histogram of write is the traditional type of graph, which is slightly clunky; the second histogram of write is a trellis graph without any conditioning; the last graph is a trellis histogram of write conditional on the level of gender.
Note: The trellis graph functions are in the lattice package, which is distributed with R but must be loaded with library() before they can be used.

library(lattice)  # load trellis graphics

hist(write)

# trellis graphs
histogram(~write, hs0, type="count")

histogram(~write | gender, hs0, type="count")  # histogram of write by gender

Note: In R it is possible to change the number of bins by using the breaks argument in the hist function.

hist(write, breaks=15)
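
When breaks is a single number, hist treats it as a suggestion and chooses "pretty" breakpoints, so the number of bins actually drawn may differ. Supplying a vector of breakpoints fixes the bins exactly. The sketch below uses simulated scores (hypothetical data) and plot = FALSE so that only the computed bins are returned.

```r
set.seed(1)
x <- round(runif(200, 31, 67))     # hypothetical scores between 31 and 67

# A single number is only a suggestion for the number of bins
h.suggest <- hist(x, breaks = 15, plot = FALSE)
length(h.suggest$counts)

# A vector of breakpoints is used exactly: 8 bins of width 5 from 30 to 70
h.exact <- hist(x, breaks = seq(30, 70, by = 5), plot = FALSE)
length(h.exact$counts)
```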

The first graph is a boxplot of the variable write; the second is a trellis graph of boxplots of write by ses; the third is a trellis graph of boxplots of write by ses for each level of gender.

boxplot(write)

# trellis graphs
bwplot(ses ~ write, hs0)

bwplot(ses ~ write | gender, hs0)

Stem-and-leaf plot of write.

stem(write, scale=1)

  The decimal point is at the |

  30 | 0000
  32 | 0000
  34 | 00
  36 | 00000
  38 | 000000
  40 | 0000000000000
  42 | 000
  44 | 0000000000000
  46 | 00000000000
  48 | 00000000000
  50 | 00
  52 | 0000000000000000
  54 | 00000000000000000000
  56 | 000000000000
  58 | 0000000000000000000000000
  60 | 00000000
  62 | 0000000000000000000000
  64 | 0000000000000000
  66 | 0000000

stem(write, scale=.5)

  The decimal point is 1 digit(s) to the right of the |

  3 | 11113333
  3 | 5566777899999
  4 | 0001111111111223444444444444
  4 | 56666666667799999999999
  5 | 00222222222222222344444444444444444
  5 | 5557777777777779999999999999999999999999
  6 | 000011112222222222222222223333
  6 | 55555555555555557777777

Barplots are another way of visualizing data. The first graph shows ses by gender, with the levels of ses stacked on top of one another. The second graph uses more options. The beside argument indicates that each level of ses should be a separate bar, which makes comparing the levels of ses easier. The ylim argument specifies that the y-axis should range from 0 to 50. Colors can be chosen with the col argument. To distinguish the levels of ses within each level of gender more easily, the space argument can set the gap between bars within a level of gender to 1/10 of a bar and the gap between the levels of gender to 1/2 of a bar.

In both graphs the bars are ordered according to the levels of gender: the left bar (in the second graph, the left group of bars) corresponds to gender=0 and the right one to gender=1.

barplot(table(ses, gender), legend=c("low", "medium", "high"))



barplot(table(ses, gender), beside=T, legend=c("low", "medium", "high"), ylim=c(0, 50))
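
The col and space arguments can be added to the grouped barplot as sketched below. The data are hypothetical, and the colors and the y-axis range are assumed values chosen only for illustration; space = c(0.1, 0.5) gives the within-group and between-group gaps described above.

```r
# Hypothetical toy data (not the hs0 data)
ses    <- c(1, 2, 3, 2, 1, 3, 2, 2, 3, 1)
gender <- c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1)
counts <- table(ses, gender)

# With beside = TRUE, space takes two numbers: the gap between bars within
# a group and the gap before each group, both in units of bar width
mids <- barplot(counts, beside = TRUE,
                legend.text = c("low", "medium", "high"),
                col = c("red", "white", "blue"),   # assumed colors
                space = c(0.1, 0.5),
                ylim = c(0, 5))
mids   # bar midpoints: one column per level of gender, left to right
```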

5.0 Frequency Tables

Frequency table of ses.

table(ses)
ses
 1  2  3 
47 95 58 

The frequency table of write illustrates that it is generally undesirable to obtain frequencies of continuous variables.

table(write)
write
31 33 35 36 37 38 39 40 41 42 43 44 45 46 47 49 50 52 53 54 55 57 59 60 61 62 
 4  4  2  2  3  1  5  3 10  2  1 12  1  9  2 11  2 15  1 17  3 12 25  4  4 18 
63 65 67 
 4 16  7 

Frequency tables of multiple variables including gender, schtyp and prgtype, which are variables 1, 5, and 6.

table.vars <- hs0[ , c(1, 5, 6)]  # shorthand way of referring to gender, schtyp and prgtype

sapply(table.vars, table)
$gender

  0   1 
 91 109 

$schtyp

  1   2 
168  32 

$prgtype

academic  general   vocati 
     105       45       50 

Crosstabulation of gender and ses.

tab1 <- table(gender, ses)
tab1
      ses
gender 1  2  3 
     0 15 47 29
     1 32 48 29

Next we compute the row and column proportions and frequencies as well as a chi-square test of independence for the two-way table.

prop.table(tab1,1)  # row proportions
      ses
gender 1         2         3        
     0 0.1648352 0.5164835 0.3186813
     1 0.2935780 0.4403670 0.2660550
     
prop.table(tab1,2)  # column proportions
      ses
gender 1         2         3        
     0 0.3191489 0.4947368 0.5000000
     1 0.6808511 0.5052632 0.5000000
     
rowSums(tab1)  # row frequencies
  0   1 
 91 109 
 
colSums(tab1)  # column frequencies
 1  2  3 
47 95 58 

summary(tab1)  # chi-square test of independence
Number of cases in table: 200 
Number of factors: 2 
Test for independence of all factors:
	Chisq = 4.577, df = 2, p-value = 0.1014
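
The table can be rebuilt from the counts printed above to check these results without the original data. In this sketch addmargins() and chisq.test() are alternatives to the rowSums/colSums and summary() calls used in these notes; chisq.test() applies no continuity correction to a table larger than 2 x 2, so it should agree with the statistic reported by summary().

```r
# Counts copied from the gender by ses crosstabulation above
tab1 <- as.table(matrix(c(15, 32, 47, 48, 29, 29), nrow = 2,
                        dimnames = list(gender = c("0", "1"),
                                        ses    = c("1", "2", "3"))))

row.prop <- prop.table(tab1, 1)    # each row of proportions sums to 1
rowSums(row.prop)

with.totals <- addmargins(tab1)    # adds a "Sum" row and column
with.totals

test <- chisq.test(tab1)           # Pearson chi-square test of independence
test$statistic
test$p.value
```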

Correlations of read, write, math and science with listwise deletion of missing values. The correlations involving a variable with missing values will not be calculated by default, so it is important to set the use argument (here use="complete.obs") to indicate how the missing values should be handled.

cor(vars, use="complete.obs")
             read     write      math   science
read    1.0000000 0.5959677 0.6492202 0.6170562
write   0.5959677 1.0000000 0.6203022 0.5671298
math    0.6492202 0.6203022 1.0000000 0.6166288
science 0.6170562 0.5671298 0.6166288 1.0000000
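
The use argument also accepts "pairwise.complete.obs", which drops missing values separately for each pair of variables instead of dropping the whole row. A sketch on hypothetical data showing how the two options differ:

```r
# Hypothetical toy data: science is missing in row 3
toy <- data.frame(read    = c(57, 68, 44, 63, 47),
                  write   = c(52, 59, 33, 44, 52),
                  science = c(47, 63, NA, 53, 53))

default.cor <- cor(toy)                                # NA wherever science is involved
listwise    <- cor(toy, use = "complete.obs")          # row 3 dropped for every pair
pairwise    <- cor(toy, use = "pairwise.complete.obs") # row 3 dropped only for science pairs

listwise["read", "write"]   # based on 4 rows
pairwise["read", "write"]   # based on all 5 rows
```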

We will end this section by showing how to obtain a simple regression model predicting write from read as well as obtaining a scatter plot with a regression line.

reg1 <- lm(write~read)
summary(reg1)

Call:
lm(formula = write ~ read)

Residuals:
     Min       1Q   Median       3Q      Max 
-20.5447  -5.1225   0.6451   6.3259  15.4553 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 23.95944    2.80574   8.539 3.55e-15 ***
read         0.55171    0.05272  10.465  < 2e-16 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 

Residual standard error: 7.625 on 198 degrees of freedom
Multiple R-Squared: 0.3561,	Adjusted R-squared: 0.3529 
F-statistic: 109.5 on 1 and 198 DF,  p-value: < 2.2e-16 

plot(read, write)  # Note: x-variable listed first in plot function

abline(reg1)
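
Once a model is fitted, coef() extracts the intercept and slope and predict() gives fitted values for new data. The sketch below uses a hypothetical data frame and passes it through the data argument of lm, which avoids relying on attach().

```r
# Hypothetical toy data
toy <- data.frame(read  = c(40, 50, 60, 70),
                  write = c(45, 52, 55, 64))

fit <- lm(write ~ read, data = toy)   # simple regression of write on read
b   <- coef(fit)                      # named vector: (Intercept), read
b

# Predicted write score for a reading score of 55
pred <- predict(fit, newdata = data.frame(read = 55))
pred
```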

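Unless you are going to continue working with the hs0 data frame it is generally a good idea to detach all attached data frames. We name the data frame explicitly because detach() with no argument removes whatever is in position 2 of the search path, which at this point is the lattice package loaded after hs0 was attached.

detach(hs0)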
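
The search path can be inspected with search() to see what is attached and in which position. In the sketch below toy is a hypothetical data frame, and the list attached under the name "later" merely stands in for a package loaded after the data frame.

```r
toy <- data.frame(x = 1:3)             # hypothetical data
attach(toy)                            # toy enters the search path at position 2
attach(list(y = 0), name = "later")    # attached afterwards, so it takes position 2

pos2 <- search()[2]                    # "later", not "toy"
pos2

detach(toy)                            # detaching by name removes the data frame
still.attached <- "toy" %in% search()
still.attached

detach("later", character.only = TRUE) # clean up the stand-in entry
```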

6.0 For More Information


These pages are Copyrighted (c) by UCLA Academic Technology Services


The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.