|
|
|
||||
|
|
|||||
| t.test | t-tests, including one sample, two sample and paired |
| tapply | applies a function to each cell of a ragged array |
| var | calculates the variance |
| lm | fits a linear model (regression) |
| anova | extracts the anova table from a lm object |
| summary | generic function provides a synopsis of an object |
| fitted | extracts the fitted values from a lm object |
| resid | extracts the residuals from a lm object |
| plot | a generic function which is used here to obtain default plots of a lm object as well as to generate a scatter plot between two continuous variables. |
| abline | generic function which adds a line to an existing plot |
| glm | logistic regression |
| drop1 | compares model by dropping terms one at a time |
| wilcox.test | non-parametric analyses |
| kruskal.test | non-parametric analyses |
Read in the hs1 data via the Internet using read.table function.
Note: If you have done the previous sections then the hs1 data may already be available to you.
hs1 <- read.table("http://www.ats.ucla.edu/stat/R/notes/hs1.csv", header=T, sep=",")
attach(hs1)
This is the one-sample t-test, testing whether the sample of writing scores was drawn from a population with a mean of 50.
t.test(write, mu=50) One Sample t-test data: write t = 4.1403, df = 199, p-value = 5.121e-05 alternative hypothesis: true mean is not equal to 50 95 percent confidence interval: 51.45332 54.09668 sample estimates: mean of x 52.775
This is the paired t-test, testing whether or not the mean of write equals the mean of read.
t.test(write, read, paired=TRUE)
Paired t-test
data: write and read
t = 0.8673, df = 199, p-value = 0.3868
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.6941424 1.7841424
sample estimates:
mean of the differences
0.545
This is the two-sample independent t-test. We use the tapply function to look at the variances of the variable write for each group of female. The output from the first t.test function assumes equal variances which is the default in the t.test function; the output from the second t.test function assumes unequal variances.
tapply(write, female, var)
0 1
106.19634 66.15732
t.test(write~female, var.equal=TRUE) # assuming equal variances
Two Sample t-test
data: write by female
t = 3.7341, df = 198, p-value = 0.0002463
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
2.298059 7.441835
sample estimates:
mean in group female mean in group male
54.99083 50.12088
t.test(write~female, var.equal=FALSE) # assuming unequal variances
Welch Two Sample t-test
data: write by female
t = 3.6564, df = 169.707, p-value = 0.0003409
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
2.240734 7.499159
sample estimates:
mean in group female mean in group male
54.99083 50.12088
In R you can use either the aov function or the anova function combined with the lm function. Both alternatives will give you the same results. The anova function extracts the anova table from the linear model fitted by the lm function. The aov function only fits an anova model and we use the summary function to see all the output.
anova(lm(write~factor(prog)))
Analysis of Variance Table
Response: write
Df Sum Sq Mean Sq F value Pr(>F)
as.factor(prog) 2 3175.7 1587.8 21.275 4.31e-09 ***
Residuals 197 14703.2 74.6
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
summary(aov(write~factor(prog)))
Df Sum Sq Mean Sq F value Pr(>F)
as.factor(prog) 2 3175.7 1587.8 21.275 4.31e-09 ***
Residuals 197 14703.2 74.6
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
The following is an example of a two-way factorial ANOVA.
anova(lm(write~factor(prog)*female))
Analysis of Variance Table
Response: write
Df Sum Sq Mean Sq F value Pr(>F)
as.factor(prog) 2 3175.7 1587.8 23.2511 8.876e-10 ***
female 1 1128.7 1128.7 16.5278 6.963e-05 ***
as.factor(prog):female 2 326.0 163.0 2.3865 0.09464 .
Residuals 194 13248.5 68.3
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
summary(aov(write~factor(prog)*female))
Df Sum Sq Mean Sq F value Pr(>F)
as.factor(prog) 2 3175.7 1587.8 23.2511 8.876e-10 ***
female 1 1128.7 1128.7 16.5278 6.963e-05 ***
as.factor(prog):female 2 326.0 163.0 2.3865 0.09464 .
Residuals 194 13248.5 68.3
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Here is an analysis of covariance (ANCOVA). In this example, prog is the categorical predictor and read is the continuous covariate.
anova(lm(write~factor(prog) + read))
Analysis of Variance Table
Response: write
Df Sum Sq Mean Sq F value Pr(>F)
as.factor(prog) 2 3175.7 1587.8 28.654 1.212e-11 ***
read 1 3842.0 3842.0 69.332 1.414e-14 ***
Residuals 196 10861.2 55.4
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
summary(aov(write~factor(prog) + read))
Df Sum Sq Mean Sq F value Pr(>F)
as.factor(prog) 2 3175.7 1587.8 28.654 1.212e-11 ***
read 1 3842.0 3842.0 69.332 1.414e-14 ***
Residuals 196 10861.2 55.4
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Plain old OLS regression.
summary(lm(write~female+read))
Call:
lm(formula = write ~ female + read)
Residuals:
Min 1Q Median 3Q Max
-17.5227 -5.6580 0.1680 5.0435 15.1749
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.22837 2.71376 7.454 2.80e-12 ***
female 5.48689 1.01426 5.410 1.82e-07 ***
read 0.56589 0.04938 11.459 < 2e-16 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 7.133 on 197 degrees of freedom
Multiple R-Squared: 0.4394, Adjusted R-squared: 0.4337
F-statistic: 77.21 on 2 and 197 DF, p-value: < 2.2e-16
The generic plot function will produce multiple diagnostic plots when applied to an lm object. These plots include residual versus fitted plots, qqplots of the residuals as well as scatter plots with the regression line overlaid. If you are only interested in one or a few of the plots it might be useful to use the which.plot option in the plot function.
lm2 <- lm(write~read+socst)
summary(lm2)
Call:
lm(formula = write ~ read + socst)
Residuals:
Min 1Q Median 3Q Max
-17.89830 -3.98965 -0.03932 4.76637 17.16929
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.76284 2.83366 6.269 2.26e-09 ***
read 0.33274 0.06262 5.314 2.89e-07 ***
socst 0.33648 0.05980 5.627 6.25e-08 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 7.095 on 197 degrees of freedom
Multiple R-Squared: 0.4453, Adjusted R-squared: 0.4397
F-statistic: 79.07 on 2 and 197 DF, p-value: < 2.2e-16
plot(lm2) # plotting diagnostic plots of lm2

The fitted function will extract the fitted values from the lm object and the resid function will extract the residuals.
write[1:20]
[1] 52 59 33 44 52 52 59 46 57 55 46 65 60 63 57 49 52 57 65 39
fitted(lm2)[1:20]
1 2 3 4 5 6 7 8
55.90829 60.91436 42.83426 57.56827 53.92677 52.92854 54.92500 41.18922
9 10 11 12 13 14 15 16
55.88589 53.88943 58.25242 57.25419 65.94283 51.20882 51.57890 50.58068
17 18 19 20
52.24439 55.57181 60.91436 51.54157
resid(lm2)[1:20]
1 2 3 4 5 6
-3.9082885 -1.9143586 -9.8342607 -13.5682666 -1.9267691 -0.9285420
7 8 9 10 11 12
4.0750038 4.8107827 1.1141137 1.1105678 -12.2524197 7.7458074
13 14 15 16 17 18
-5.9428308 11.7911751 5.4210958 -1.5806771 -0.2443889 1.4281876
19 20
4.0856414 -12.5415673
In order to demonstrate we will create a dichotomous variable called honcomp (honors composition). Honcomp will be equal to 1 when the logical test of write >= 60 is true and honcomp will be equal to zero when it is not true. This variable is created purely for illustrative purposes only!
honcomp <- write >= 60 honcomp[1:20] [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE [13] TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE
The glm function fits a general linear model including a logistic regression. In order to fit a logistic model we need to specify that the distribution of the dependent variable is binomial in the family argument and the default link function used will then be the logit function.
lr <- glm(honcomp~female+read, family=binomial)
summary(lr)
Call:
glm(formula = honcomp ~ female + read, family = binomial)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7862 -0.6716 -0.3991 0.5084 2.3984
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.60336 1.42641 -6.733 1.67e-11 ***
female 1.12093 0.40810 2.747 0.00602 **
read 0.14437 0.02333 6.187 6.13e-10 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 231.29 on 199 degrees of freedom
Residual deviance: 170.89 on 197 degrees of freedom
AIC: 176.89
Number of Fisher Scoring iterations: 5
exp(coef(lr)) # odds ratios
(Intercept) female read
6.750122e-05 3.067693e+00 1.155307e+00
In glm objects it is sometimes very useful to be able to compare all the hierarchical models which can be done using the drop1 function. It compares all models by dropping one predictor at a time from the model. When the model includes a categorical variable it will drop all levels of the categorical variables at the same time.
Please note that the drop1 function can also include a test argument. By specifying Chisq in the test argument R will perform a likelihood ratio test for each comparisons among the hierarchical models.
lr2 <- glm(honcomp~female+read+science+math+factor(prog), binomial, na.action=na.exclude)
drop1(lr2, test="Chisq")
Single term deletions
Model:
honcomp ~ female + read + science + math + as.factor(prog)
Df Deviance AIC LRT Pr(Chi)
141.352 155.352
female 1 149.208 161.208 7.856 0.0050651 **
read 1 146.170 158.170 4.818 0.0281635 *
science 1 144.600 156.600 3.249 0.0714824 .
math 1 153.387 165.387 12.035 0.0005221 ***
as.factor(prog) 2 141.660 151.660 0.308 0.8573245
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
The signtest is the nonparametric analog to the single-sample t-test and is obtained by using the wilcox.test function. The value that is being tested is specified by the mu argument.
wilcox.test(write, mu=50)
Wilcoxon signed rank test with continuity correction data: write V = 13177, p-value = 3.702e-05 alternative hypothesis: true location is not equal to 50
The signrank test is the nonparametric analog to the paired t-test. This test can be obtained by also using the wilcox.test function and specifying T in the paired argument.
wilcox.test(write, read, paired=T)
Wilcoxon signed rank test with continuity correction data: write and read V = 9261, p-value = 0.3666 alternative hypothesis: true location shift is not equal to 0
The ranksum test is the nonparametric analog to the independent two-sample t-test.
wilcox.test(write, female) Wilcoxon rank sum test with continuity correction data: write and female W = 40000, p-value < 2.2e-16 alternative hypothesis: true mu is not equal to 0
The kruskal wallis test is the nonparametric analog to the one-way anova.
kruskal.test(write, ses) Kruskal-Wallis rank sum test data: write and ses Kruskal-Wallis chi-squared = 11.1966, df = 2, p-value = 0.003704
Unless you are going to continue working with the hs1 data it is generally a good idea to detach all attached data frames.
detach()
UCLA Researchers are invited to our Statistical Consulting Services
We recommend others to our list of Other Resources for Statistical Computing Help
These pages are Copyrighted (c) by UCLA Academic Technology Services