UCLA Academic Technology Services HomeServicesClassesContactJobs

R Class Notes
Analyzing Data


1.0 R functions used in this unit

t.test t-tests, including one sample, two sample and paired
tapply applies a function to each cell of a ragged array
var calculates the variance
lm fits a linear model (regression)
anova extracts the anova table from a lm object
summary generic function provides a synopsis of an object
fitted extracts the fitted values from a lm object
resid extracts the residuals from a lm object
plot a generic function which is used here to obtain default plots of a lm object
as well as to generate a scatter plot between two continuous variables.
abline generic function which adds a line to an existing plot
glm logistic regression
drop1 compares model by dropping terms one at a time
wilcox.test non-parametric analyses
kruskal.test non-parametric analyses

Read in the hs1 data via the Internet using read.table function.
Note: If you have done the previous sections then the hs1 data may already be available to you.

hs1 <- read.table("http://www.ats.ucla.edu/stat/R/notes/hs1.csv", header=T, sep=",")

attach(hs1)

2.0 t-tests

This is the one-sample t-test, testing whether the sample of writing scores was drawn from a population with a mean of 50.

t.test(write, mu=50)
	One Sample t-test

data:  write 
t = 4.1403, df = 199, p-value = 5.121e-05
alternative hypothesis: true mean is not equal to 50 
95 percent confidence interval:
 51.45332 54.09668 
sample estimates:
mean of x 
   52.775

This is the paired t-test, testing whether or not the mean of write equals the mean of read.

t.test(write, read, paired=TRUE)

	Paired t-test

data:  write and read 
t = 0.8673, df = 199, p-value = 0.3868
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 -0.6941424  1.7841424 
sample estimates:
mean of the differences 
                  0.545

This is the two-sample independent t-test. We use the tapply function to look at the variances of the variable write for each group of female.  The output from the first t.test function assumes equal variances which is the default in the t.test function; the output from the second t.test function assumes unequal variances.

tapply(write, female, var)
        0         1 
106.19634  66.15732 

t.test(write~female, var.equal=TRUE)  # assuming equal variances

	Two Sample t-test

data:  write by female 
t = 3.7341, df = 198, p-value = 0.0002463
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 2.298059 7.441835 
sample estimates:
mean in group female   mean in group male 
            54.99083             50.12088 
   
t.test(write~female, var.equal=FALSE)  # assuming unequal variances

	Welch Two Sample t-test

data:  write by female 
t = 3.6564, df = 169.707, p-value = 0.0003409
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 2.240734 7.499159 
sample estimates:
mean in group female   mean in group male 
            54.99083             50.12088

3.0 Anova

In R you can use either the aov function or the anova function combined with the lm function. Both alternatives will give you the same results. The anova function extracts the anova table from the linear model fitted by the lm function. The aov function only fits an anova model and we use the summary function to see all the output.

anova(lm(write~factor(prog)))
Analysis of Variance Table

Response: write
                 Df  Sum Sq Mean Sq F value   Pr(>F)    
as.factor(prog)   2  3175.7  1587.8  21.275 4.31e-09 ***
Residuals       197 14703.2    74.6                     
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 

summary(aov(write~factor(prog)))
                 Df  Sum Sq Mean Sq F value   Pr(>F)    
as.factor(prog)   2  3175.7  1587.8  21.275 4.31e-09 ***
Residuals       197 14703.2    74.6                     
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

The following is an example of a two-way factorial ANOVA.

anova(lm(write~factor(prog)*female))
Analysis of Variance Table

Response: write
                        Df  Sum Sq Mean Sq F value    Pr(>F)    
as.factor(prog)          2  3175.7  1587.8 23.2511 8.876e-10 ***
female                   1  1128.7  1128.7 16.5278 6.963e-05 ***
as.factor(prog):female   2   326.0   163.0  2.3865   0.09464 .  
Residuals              194 13248.5    68.3                      
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

summary(aov(write~factor(prog)*female))
                        Df  Sum Sq Mean Sq F value    Pr(>F)    
as.factor(prog)          2  3175.7  1587.8 23.2511 8.876e-10 ***
female                   1  1128.7  1128.7 16.5278 6.963e-05 ***
as.factor(prog):female   2   326.0   163.0  2.3865   0.09464 .  
Residuals              194 13248.5    68.3                      
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 

Here is an analysis of covariance (ANCOVA). In this example, prog is the categorical predictor and read is the continuous covariate.

anova(lm(write~factor(prog) + read))
Analysis of Variance Table

Response: write
                 Df  Sum Sq Mean Sq F value    Pr(>F)    
as.factor(prog)   2  3175.7  1587.8  28.654 1.212e-11 ***
read              1  3842.0  3842.0  69.332 1.414e-14 ***
Residuals       196 10861.2    55.4                      
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 

summary(aov(write~factor(prog) + read))
                 Df  Sum Sq Mean Sq F value    Pr(>F)    
as.factor(prog)   2  3175.7  1587.8  28.654 1.212e-11 ***
read              1  3842.0  3842.0  69.332 1.414e-14 ***
Residuals       196 10861.2    55.4                      
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 

4.0 Regression

Plain old OLS regression.

summary(lm(write~female+read))
Call:
lm(formula = write ~ female + read)

Residuals:
     Min       1Q   Median       3Q      Max 
-17.5227  -5.6580   0.1680   5.0435  15.1749 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 20.22837    2.71376   7.454 2.80e-12 ***
female       5.48689    1.01426   5.410 1.82e-07 ***
read         0.56589    0.04938  11.459  < 2e-16 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 

Residual standard error: 7.133 on 197 degrees of freedom
Multiple R-Squared: 0.4394,	Adjusted R-squared: 0.4337 
F-statistic: 77.21 on 2 and 197 DF,  p-value: < 2.2e-16 

The generic plot function will produce multiple diagnostic plots when applied to an lm object. These plots include residual versus fitted plots, qqplots of the residuals as well as scatter plots with the regression line overlaid. If you are only interested in one or a few of the plots it might be useful to use the which.plot option in the plot function.

lm2 <- lm(write~read+socst)
summary(lm2)
Call:
lm(formula = write ~ read + socst)

Residuals:
      Min        1Q    Median        3Q       Max 
-17.89830  -3.98965  -0.03932   4.76637  17.16929 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 17.76284    2.83366   6.269 2.26e-09 ***
read         0.33274    0.06262   5.314 2.89e-07 ***
socst        0.33648    0.05980   5.627 6.25e-08 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 

Residual standard error: 7.095 on 197 degrees of freedom
Multiple R-Squared: 0.4453,	Adjusted R-squared: 0.4397 
F-statistic: 79.07 on 2 and 197 DF,  p-value: < 2.2e-16 

plot(lm2)  # plotting diagnostic plots of lm2







The fitted function will extract the fitted values from the lm object and the resid function will extract the residuals.

write[1:20]
 [1] 52 59 33 44 52 52 59 46 57 55 46 65 60 63 57 49 52 57 65 39

fitted(lm2)[1:20]
       1        2        3        4        5        6        7        8 
55.90829 60.91436 42.83426 57.56827 53.92677 52.92854 54.92500 41.18922 
       9       10       11       12       13       14       15       16 
55.88589 53.88943 58.25242 57.25419 65.94283 51.20882 51.57890 50.58068 
      17       18       19       20 
52.24439 55.57181 60.91436 51.54157 

resid(lm2)[1:20]
          1           2           3           4           5           6 
 -3.9082885  -1.9143586  -9.8342607 -13.5682666  -1.9267691  -0.9285420 
          7           8           9          10          11          12 
  4.0750038   4.8107827   1.1141137   1.1105678 -12.2524197   7.7458074 
         13          14          15          16          17          18 
 -5.9428308  11.7911751   5.4210958  -1.5806771  -0.2443889   1.4281876 
         19          20 
  4.0856414 -12.5415673 

5.0 Logistic regression

In order to demonstrate we will create a dichotomous variable called honcomp (honors composition).   Honcomp will be equal to 1 when the logical test of write >= 60 is true and honcomp will be equal to zero when it is not true.  This variable is created purely for illustrative purposes only!

honcomp <- write >= 60
honcomp[1:20]
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
[13]  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE

The glm function fits a general linear model including a logistic regression. In order to fit a logistic model we need to specify that the distribution of the dependent variable is binomial in the family argument and the default link function used will then be the logit function.

lr <- glm(honcomp~female+read, family=binomial)
summary(lr)
Call:
glm(formula = honcomp ~ female + read, family = binomial)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.7862  -0.6716  -0.3991   0.5084   2.3984  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -9.60336    1.42641  -6.733 1.67e-11 ***
female       1.12093    0.40810   2.747  0.00602 ** 
read         0.14437    0.02333   6.187 6.13e-10 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 231.29  on 199  degrees of freedom
Residual deviance: 170.89  on 197  degrees of freedom
AIC: 176.89

Number of Fisher Scoring iterations: 5

exp(coef(lr))  # odds ratios
 (Intercept)       female         read 
6.750122e-05 3.067693e+00 1.155307e+00 

In glm objects it is sometimes very useful to be able to compare all the hierarchical models which can be done using the drop1 function. It compares all models by dropping one predictor at a time from the model. When the model includes a categorical variable it will drop all levels of the categorical variables at the same time.

Please note that the drop1 function can also include a test argument. By specifying Chisq in the test argument R will perform a likelihood ratio test for each comparisons among the hierarchical models.

lr2 <- glm(honcomp~female+read+science+math+factor(prog), binomial, na.action=na.exclude)
drop1(lr2, test="Chisq")
Single term deletions

Model:
honcomp ~ female + read + science + math + as.factor(prog)
                Df Deviance     AIC     LRT   Pr(Chi)    
              141.352 155.352                      
female           1  149.208 161.208   7.856 0.0050651 ** 
read             1  146.170 158.170   4.818 0.0281635 *  
science          1  144.600 156.600   3.249 0.0714824 .  
math             1  153.387 165.387  12.035 0.0005221 ***
as.factor(prog)  2  141.660 151.660   0.308 0.8573245    
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

6.0 Non-Parametric Tests

The signtest is the nonparametric analog to the single-sample t-test and is obtained by using the wilcox.test function. The value that is being tested is specified by the mu argument.

wilcox.test(write, mu=50)
        Wilcoxon signed rank test with continuity correction

data:  write 
V = 13177, p-value = 3.702e-05
alternative hypothesis: true location is not equal to 50 

The signrank test is the nonparametric analog to the paired t-test. This test can be obtained by also using the wilcox.test function and specifying T in the paired argument.

wilcox.test(write, read, paired=T)
  Wilcoxon signed rank test with continuity correction

data:  write and read 
V = 9261, p-value = 0.3666
alternative hypothesis: true location shift is not equal to 0 

The ranksum test is the nonparametric analog to the independent two-sample t-test.

wilcox.test(write, female)
	Wilcoxon rank sum test with continuity correction

data:  write and female 
W = 40000, p-value < 2.2e-16
alternative hypothesis: true mu is not equal to 0

The kruskal wallis test is the nonparametric analog to the one-way anova.

kruskal.test(write, ses)
	Kruskal-Wallis rank sum test

data:  write and ses 
Kruskal-Wallis chi-squared = 11.1966, df = 2, p-value = 0.003704

Unless you are going to continue working with the hs1 data it is generally a good idea to detach all attached data frames.

detach()

6.0 For More Information


How to cite this page

Report an error on this page

UCLA Researchers are invited to our Statistical Consulting Services
We recommend others to our list of Other Resources for Statistical Computing Help
These pages are Copyrighted (c) by UCLA Academic Technology Services


The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.