| head | display first n observations |
| sapply | applies a function to elements in a list |
| colMeans | column means |
| colSums | column sums |
| rowSums | row sums |
| median | calculates the median |
| length | calculates the count |
| var | calculates the variance |
| sd | calculates the standard deviation |
| tapply | applies a function to each cell of a ragged array |
| cbind | combining columns |
| summary | generic function provides a synopsis of an object |
| hist | histogram plot |
| histogram | trellis histogram plot(s) |
| boxplot | box plot |
| bwplot | trellis box plot(s) |
| stem | stem-and-leaf plot |
| barplot | bar plot |
| table | frequency table |
| cor | calculates correlations |
| lm | fits a linear model |
| plot | generic plot function |
| abline | adds a line to an existing plot |
Read in the hs0 data using the read.table() function and attach the data frame.
hs0 <- read.table("http://www.ats.ucla.edu/stat/R/notes/hs0.csv", header=T, sep=",")
attach(hs0)
Listing the first 20 observations.
We are using the bracket notation for the indexing of the data frame where the first position indicates the rows (observations) and we are specifying that we want to list rows (observations) 1-20. The second position indicates the columns (variables). By leaving this position blank we are indicating that we want to see the first 20 observations for all the columns (variables).
hs0[1:20, ]
gender id race ses schtyp prgtype read write math science socst
1 0 70 4 1 1 general 57 52 41 47 57
2 1 121 4 2 1 vocati 68 59 53 63 61
3 0 86 4 3 1 general 44 33 54 58 31
4 0 141 4 3 1 vocati 63 44 47 53 56
5 0 172 4 2 1 academic 47 52 57 53 61
6 0 113 4 2 1 academic 44 52 51 63 61
7 0 50 3 2 1 general 50 59 42 53 61
8 0 11 1 2 1 academic 34 46 45 39 36
9 0 84 4 2 1 general 63 57 54 NA 51
10 0 48 3 2 1 academic 57 55 52 50 51
11 0 75 4 2 1 vocati 60 46 51 53 61
12 0 60 5 2 1 academic 57 65 51 63 61
13 0 95 4 3 1 academic 73 60 71 61 71
14 0 104 4 3 1 academic 54 63 57 55 46
15 0 38 3 1 1 academic 45 57 50 31 56
16 0 115 4 1 1 general 42 49 43 50 56
17 0 76 4 3 1 academic 47 52 51 50 56
18 0 195 4 2 2 general 57 57 60 NA 56
19 0 114 4 3 1 academic 68 65 62 55 61
20 0 85 4 2 1 general 55 39 57 53 46
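The same bracket notation can select rows and columns together, by position or by name. A minimal sketch on a small made-up data frame (the name toy and its values are illustrative assumptions, not part of hs0):

```r
# Hypothetical toy data frame standing in for hs0
toy <- data.frame(id    = 1:5,
                  read  = c(57, 68, 44, 63, 47),
                  write = c(52, 59, 33, 44, 52))

first3   <- toy[1:3, ]                 # rows 1-3, all columns
scores   <- toy[, c("read", "write")]  # all rows, two columns selected by name
subset23 <- toy[2:3, 2:3]              # rows 2-3, columns 2-3 selected by position

print(first3)
print(dim(scores))
print(subset23)
```

Leaving a position blank selects everything along that dimension, which is why toy[1:3, ] keeps every column.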
Printing the first 10 observations for the variables read through science using the head() function.
names(hs0)
[1] "gender" "id" "race" "ses" "schtyp" "prgtype" "read"
[8] "write" "math" "science" "socst"
vars <- hs0[ , 7:10] # shorthand way of referring to read, write, math, science
head(vars, n=10)
read write math science
[1,] 57 52 41 47
[2,] 68 59 53 63
[3,] 44 33 54 58
[4,] 63 44 47 53
[5,] 47 52 57 53
[6,] 44 52 51 63
[7,] 50 59 42 53
[8,] 34 46 45 39
[9,] 63 57 54 NA
[10,] 57 55 52 50
Listing the means of all the variables in the data frame. The na.rm=T argument for the mean function is used to specify that we want to remove missing observations from the computation of the means. This function will generate a warning because we try to compute the mean of prgtype which is a character variable.
sapply(hs0, mean, na.rm=T)
gender id race ses schtyp prgtype read
0.54500 100.50000 3.44000 2.05500 1.16000 NA 52.23000
write math science socst
52.77500 52.64500 51.66154 52.40500
Warning message:
argument is not numeric or logical: returning NA in: mean.default(X[[6]], ...)
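One way to avoid the warning is to restrict sapply to the numeric columns first. This is a sketch on a made-up data frame (toy is a hypothetical stand-in for hs0, with prgtype as its only character column), not the approach used in these notes:

```r
# Hypothetical toy data frame with one character column, like prgtype in hs0
toy <- data.frame(read    = c(57, 68, 44, NA),
                  write   = c(52, 59, 33, 44),
                  prgtype = c("general", "vocati", "general", "academic"),
                  stringsAsFactors = FALSE)

is_num    <- sapply(toy, is.numeric)                  # TRUE for read and write only
num_means <- sapply(toy[is_num], mean, na.rm = TRUE)  # means of numeric columns only
print(num_means)
```

Indexing with toy[is_num] (a single bracket argument) always returns a data frame, even when only one column is numeric.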
Obtaining other descriptive statistics such as count, medians, variances and standard deviations for the variables read, write, math and science, which are found in columns 7 through 10. Again we use the na.rm=T argument to indicate that we want to remove missing observations since science has a few missing observations.
sapply(vars, length) # count
read write math science
200 200 200 200
# the count for science is wrong, we will need to create a new variable with only
# the nonmissing cases of science and then use the length function
science.good <- na.omit(science)
length(science.good)
[1] 195
sapply(vars, median, na.rm=T) # median
read write math science
50 54 52 53
sapply(vars, var, na.rm=T) # variance
read write math science
105.12271 89.84359 87.76781 97.33846
sapply(vars, sd, na.rm=T) # standard deviation
read write math science
10.252937 9.478586 9.368448 9.866026
sapply(vars, min, na.rm=T)
read write math science
28 31 33 26
sapply(vars, max, na.rm=T)
read write math science
76 67 75 74
sapply(vars, fivenum, na.rm=T) # five number summary
read write math science
[1,] 28 31.0 33 26
[2,] 44 45.5 45 44
[3,] 50 54.0 52 53
[4,] 60 60.0 59 58
[5,] 76 67.0 75 74
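Because length counts missing values too, the count for science above had to be computed separately. A possible shortcut is to count the non-missing values of every column in one call. The sketch below uses a made-up data frame (toy is an assumption for illustration):

```r
# Hypothetical toy data with missing values in science only
toy <- data.frame(math    = c(41, 53, 54, 47),
                  science = c(47, NA, 58, NA))

# is.na() flags missing values; summing the negation counts the valid ones
n_valid <- sapply(toy, function(x) sum(!is.na(x)))
print(n_valid)
```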
We can also use the colMeans function to obtain the means. We can specify the variables by their column numbers, as with sapply, or by name using cbind.
colMeans(vars, na.rm=T)
read write math science
52.23000 52.77500 52.64500 51.66154
Descriptive statistics can also be computed for a subset of the data frame. In this example, we are looking at the summary statistics for only those students who had a reading score of 60 or higher.
sapply(vars[read >= 60, ], mean, na.rm=T)
read write math science
65.48214 59.53571 60.25000 59.43396
sapply(vars[read >= 60, ], median, na.rm=T)
read write math science
65.0 60.0 60.5 61.0
Obtaining the means of the variables write and science broken down by prgtype. Science is the only variable with missing observations and thus we use the na.rm argument to remove the missing observation from the calculation of the mean.
tapply(write, prgtype, mean)
academic general vocati
56.25714 51.33333 46.76000
tapply(science, prgtype, mean, na.rm=T)
academic general vocati
53.61765 52.18605 47.22000
Other descriptive statistics including variances, standard deviations, medians and counts for the variable write broken down by prgtype.
tapply(write, prgtype, length) # count
academic general vocati
105 45 50
tapply(write, prgtype, var) # variance
academic general vocati
63.09670 88.31818 86.83918
tapply(write, prgtype, sd) # standard deviation
academic general vocati
7.943343 9.397775 9.318754
tapply(write, prgtype, median) # median
academic general vocati
59 54 46
Descriptive statistics for the variable write by prgtype in a much nicer display.
m <- tapply(write, prgtype, mean)
v <- tapply(write, prgtype, var)
med <- tapply(write, prgtype, median)
n <- tapply(write, prgtype, length)
s <- tapply(write, prgtype, sd) # named s so that the sd function is not masked
cbind(mean=m, var=v, std.dev=s, median=med, n=n)
mean var std.dev median n
academic 56.25714 63.09670 7.943343 59 105
general 51.33333 88.31818 9.397775 54 45
vocati 46.76000 86.83918 9.318754 46 50
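The same kind of display can be built in one step by splitting the variable by group and returning several statistics at once. This is only a sketch on made-up values (toy and its contents are assumptions, not the hs0 data):

```r
# Hypothetical toy data: six writing scores in two program types
toy <- data.frame(write   = c(50, 60, 40, 44, 70, 48),
                  prgtype = c("academic", "academic", "general",
                              "general", "academic", "general"))

# split() makes one vector per group; sapply() returns one column per group,
# so t() is used to put the groups in rows
stats <- t(sapply(split(toy$write, toy$prgtype),
                  function(v) c(mean = mean(v), std.dev = sd(v), n = length(v))))
print(stats)
```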
More descriptive statistics including quantiles can be obtained by using the summary function.
summary(science)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
26.00 44.00 53.00 51.66 58.00 74.00 5.00
R has many graphics capabilities which can be useful while exploring data as well as during data analysis. One of the great features of the graphics capabilities is how easy it is to do conditional graphs, also called trellis graphs. The first histogram of write is the old fashioned type of graph which is slightly clunky; the second histogram of write is a trellis graph without any conditioning and the last graph is a histogram of write conditional on the level of gender.
Note: In R we need to load the lattice package with library() in order to be able to use the trellis graph functions.
library(lattice) # load trellis graphics
hist(write)
# trellis graphs
histogram(~write, hs0, type="count")
histogram(~write | gender, hs0, type="count") # histogram of write by gender
Note: In R it is possible to change the number of bins by using the breaks argument in the hist function. A single number is treated as a suggestion, so the resulting number of bins may differ slightly.
hist(write, breaks=15)
The first graph is a boxplot of the variable write; the second is a trellis graph of write by ses; the third is a trellis graph of boxplots of write by ses for each level of gender.
boxplot(write)
# trellis graphs
bwplot(ses~ write, hs0)
bwplot(ses~ write| gender, hs0)
Stem-and-leaf plot of write.
stem(write, scale=1)
The decimal point is at the |
30 | 0000
32 | 0000
34 | 00
36 | 00000
38 | 000000
40 | 0000000000000
42 | 000
44 | 0000000000000
46 | 00000000000
48 | 00000000000
50 | 00
52 | 0000000000000000
54 | 00000000000000000000
56 | 000000000000
58 | 0000000000000000000000000
60 | 00000000
62 | 0000000000000000000000
64 | 0000000000000000
66 | 0000000
stem(write, scale=.5)
The decimal point is 1 digit(s) to the right of the |
3 | 11113333
3 | 5566777899999
4 | 0001111111111223444444444444
4 | 56666666667799999999999
5 | 00222222222222222344444444444444444
5 | 5557777777777779999999999999999999999999
6 | 000011112222222222222222223333
6 | 55555555555555557777777
Barplots are another way of visualizing data. The first graph shows ses by gender, where the levels of ses are stacked on top of one another. In the second graph we have included more options, including the beside argument, which indicates that each level of ses should be a separate bar. This makes comparing the levels of ses easier. The col argument specifies which colors to use in the plot. To more easily distinguish the levels of ses within each level of gender, the space argument can indicate that within each level of gender the bars should be separated by a space equal to 1/10 of a bar, whereas the space between the levels of gender is equal to 1/2 of a bar. The ylim argument is used to specify that the y-axis should range from 0 to 50. In both graphs the bars are ordered according to the levels of gender: in the first graph the left bar corresponds to gender=0 and the right bar to gender=1, and in the second graph the left group of three bars corresponds to gender=0 and the right group to gender=1.
barplot(table(ses, gender), legend=c("low", "medium", "high"))
barplot(table(ses, gender), beside=T, legend=c("low", "medium", "high"), ylim=c(0, 50))
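The col and space arguments described above can be sketched as follows. The counts are copied from the gender by ses crosstabulation on this page, arranged as table(ses, gender) would be; the three color names are arbitrary assumptions, and pdf(NULL) is used only so that the example also runs without a graphics window.

```r
# Counts laid out like table(ses, gender): rows = ses, columns = gender
counts <- matrix(c(15, 47, 29,    # gender = 0
                   32, 48, 29),   # gender = 1
                 nrow = 3,
                 dimnames = list(ses = c("1", "2", "3"), gender = c("0", "1")))

pdf(NULL)  # null graphics device: nothing is written to disk
mids <- barplot(counts, beside = TRUE,
                legend.text = c("low", "medium", "high"),
                col = c("lightblue", "orange", "darkgreen"),  # assumed colors
                space = c(0.1, 0.5),  # 1/10 bar within a group, 1/2 bar between groups
                ylim = c(0, 50))
invisible(dev.off())

print(mids)  # bar midpoints: one column per level of gender
```

The returned midpoints confirm the ordering: the first column (gender=0) lies to the left of the second (gender=1).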

Frequency table of ses.
table(ses)
ses
1 2 3
47 95 58
The frequency table of write illustrates that it is generally undesirable to obtain frequencies of continuous variables.
table(write)
write
31 33 35 36 37 38 39 40 41 42 43 44 45 46 47 49 50 52 53 54 55 57 59 60 61 62
4 4 2 2 3 1 5 3 10 2 1 12 1 9 2 11 2 15 1 17 3 12 25 4 4 18
63 65 67
4 16 7
Frequency tables of multiple variables including gender, schtyp and prgtype, which are variables 1, 5, and 6.
table.vars <- hs0[ , c(1, 5, 6)] # shorthand way of referring to gender, schtyp and prgtype
sapply(table.vars, table)
$gender
0 1
91 109
$schtyp
1 2
168 32
$prgtype
academic general vocati
105 45 50
Crosstabulation of gender and ses.
tab1 <- table(gender, ses)
tab1
ses
gender 1 2 3
0 15 47 29
1 32 48 29
Next we compute the row and column proportions and frequencies as well as a chi-square test of independence for the two-way table.
prop.table(tab1,1) # row proportions
ses
gender 1 2 3
0 0.1648352 0.5164835 0.3186813
1 0.2935780 0.4403670 0.2660550
prop.table(tab1,2) # column proportions
ses
gender 1 2 3
0 0.3191489 0.4947368 0.5000000
1 0.6808511 0.5052632 0.5000000
rowSums(tab1) # row frequencies
0 1
91 109
colSums(tab1) # column frequencies
1 2 3
47 95 58
summary(tab1) # chi-square test of independence
Number of cases in table: 200
Number of factors: 2
Test for independence of all factors:
Chisq = 4.577, df = 2, p-value = 0.1014
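The same test can also be requested explicitly with chisq.test, which additionally exposes the expected counts. The sketch below rebuilds the crosstabulation from the counts printed above rather than from hs0, so it runs on its own; the object names tab and res are illustrative.

```r
# Rebuild the gender by ses table from the counts shown above
tab <- matrix(c(15, 47, 29,
                32, 48, 29),
              nrow = 2, byrow = TRUE,
              dimnames = list(gender = c("0", "1"), ses = c("1", "2", "3")))

res <- chisq.test(tab)          # Pearson chi-square test of independence
print(res$statistic)            # test statistic
print(res$parameter)            # degrees of freedom
print(res$p.value)              # p-value
print(round(res$expected, 2))   # expected counts under independence
```

The continuity correction that chisq.test applies by default only affects 2 by 2 tables, so for this 2 by 3 table the statistic should agree with the summary output above.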
Correlations of write, read, math and science with listwise deletion of missing values. By default, cor returns NA for every pair that involves a variable with missing values, so it is important to use the use argument (here use="complete.obs") to indicate how the missing values should be handled.
cor(vars, use="complete.obs")
read write math science
read 1.0000000 0.5959677 0.6492202 0.6170562
write 0.5959677 1.0000000 0.6203022 0.5671298
math 0.6492202 0.6203022 1.0000000 0.6166288
science 0.6170562 0.5671298 0.6166288 1.0000000
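To illustrate how the use argument changes the result, the sketch below compares the default, listwise deletion and pairwise deletion on a small made-up data frame (toy and its values are assumptions, not the hs0 data):

```r
# Hypothetical toy data with one missing science score
toy <- data.frame(read    = c(57, 68, 44, 63, 47, 50),
                  write   = c(52, 59, 33, 44, 52, 59),
                  science = c(47, 63, 58, 53, NA, 53))

r_default  <- cor(toy)                                # NA wherever science is involved
r_complete <- cor(toy, use = "complete.obs")          # drops row 5 for every pair
r_pairwise <- cor(toy, use = "pairwise.complete.obs") # drops row 5 only for pairs with science

print(r_default)
print(r_complete)
print(r_pairwise)
```

With listwise deletion every correlation is based on the same rows, while pairwise deletion uses all rows available for each pair, so the read and write correlation can differ between the two.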
We will end this section by showing how to obtain a simple regression model predicting write from read as well as obtaining a scatter plot with a regression line.
reg1 <- lm(write~read)
summary(reg1)
Call:
lm(formula = write ~ read)
Residuals:
Min 1Q Median 3Q Max
-20.5447 -5.1225 0.6451 6.3259 15.4553
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.95944 2.80574 8.539 3.55e-15 ***
read 0.55171 0.05272 10.465 < 2e-16 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 7.625 on 198 degrees of freedom
Multiple R-Squared: 0.3561, Adjusted R-squared: 0.3529
F-statistic: 109.5 on 1 and 198 DF, p-value: < 2.2e-16
plot(read, write) # Note: x-variable listed first in plot function
abline(reg1)
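As a worked illustration of the fitted equation, the coefficients printed above give predicted write = 23.95944 + 0.55171 * read. The sketch below applies that equation to three arbitrary reading scores chosen for illustration; the names b0, b1 and read_new are not part of the original notes.

```r
# Coefficients copied from the regression output above
b0 <- 23.95944   # intercept
b1 <- 0.55171    # slope for read

read_new <- c(40, 50, 60)     # arbitrary example reading scores
pred <- b0 + b1 * read_new    # predicted writing scores
print(data.frame(read = read_new, predicted_write = pred))
```

With the fitted model object available, predict(reg1, newdata = data.frame(read = read_new)) should return the same values up to rounding of the printed coefficients.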

Unless you are going to continue working with the hs0 data frame it is generally a good idea to detach all attached data frames.
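detach(hs0) # name the data frame: lattice was loaded after attach(), so a bare detach() would remove lattice instead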
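An alternative that avoids attach() and detach() altogether is with(), which evaluates an expression inside a data frame without changing the search path. A minimal sketch on a made-up data frame (toy is a hypothetical stand-in for hs0):

```r
# Hypothetical toy data frame
toy <- data.frame(write   = c(50, 60, 40, 44),
                  prgtype = c("academic", "academic", "general", "general"))

# Columns are looked up inside toy only for the duration of this call
means <- with(toy, tapply(write, prgtype, mean))
print(means)

print("toy" %in% search())  # FALSE: nothing was attached
```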
These pages are Copyrighted (c) by UCLA Academic Technology Services