Introducing R

Statistical Consulting Group

UCLA Institute for Digital Research & Education

Links to slides and code

Link to all slides:

www.ats.ucla.edu/stat/r/seminars/intro.htm

For just the code (right-click to download):

www.ats.ucla.edu/stat/r/seminars/intro.R

Installing R and Packages

Introduction

R is a programming environment

Base R and packages

Base R and most R packages are available for download from the Comprehensive R Archive Network (CRAN)

Downloading and Installing R
Interacting with R
You can work directly in R, but most users prefer a graphical interface.

More advanced users may prefer a good text editor with R plugins for syntax highlighting, code completion, and other conveniences.

Diagram of How R Works
Seminar Packages

For the purposes of this seminar, we will be using the following packages frequently: foreign, readxl, dplyr, reshape2, ggplot2, GGally, and vcd.

Installing Packages

To use packages in R, we must first install them using the install.packages function, which typically downloads the package from CRAN and installs it for use

#install.packages("foreign")
#install.packages("readxl")
#install.packages("dplyr")
#install.packages("reshape2")
#install.packages("ggplot2")
#install.packages("GGally")
#install.packages("vcd")
Loading Packages

If we know we will need a particular package for our current R session, we must load it into the R environment using the library or require functions

library(foreign)
library(readxl)
library(dplyr)
library(reshape2)
require(ggplot2)
require(GGally)
require(vcd)
Additional Notes on Installing R

For Windows and Mac, there are binary installers for R. You can download these and just click to install them. The defaults are sensible; unless you have a reason to do otherwise, accepting them is generally a good idea.

On Debian-based Linux distributions, you can use apt-get install r-base; other distributions provide R through their own package managers if they maintain a version of R.

For Windows, Mac, and Linux, if you have the appropriate tools installed, you can build R from source. The definitive guide to this is available at http://cran.r-project.org/doc/manuals/R-admin.html

Once R is installed, assuming you have sufficient privileges, you can install and update most R packages. Installing one is just:

install.packages("packagename")

To update packages, you can update them all at once using:

update.packages()

You can also specify a particular CRAN mirror to use. For example, the UCLA Statistics Department maintains a mirror:

install.packages("packagename", repos = "http://cran.stat.ucla.edu/")
Code to load or install then load

You can use this code to either load the package, or if that does not work, install the package and then load it.

# foreign ships with R so no need to install
require(readxl) || {install.packages("readxl"); require(readxl)}
## [1] TRUE
require(reshape2) || {install.packages("reshape2"); require(reshape2)}
## [1] TRUE
require(ggplot2) || {install.packages("ggplot2"); require(ggplot2)}
## [1] TRUE
require(GGally) || {install.packages("GGally"); require(GGally)}
## [1] TRUE
require(vcd) || {install.packages("vcd"); require(vcd)}
## [1] TRUE
Basic info on R session

To get a description of the version of R and its attached packages used in the current session, we can use the sessionInfo function

sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] lattice_0.20-33 vcd_1.4-1       GGally_1.0.1    reshape2_1.4.1 
##  [5] dplyr_0.4.3     readxl_0.1.1    foreign_0.8-66  knitr_1.12.3   
##  [9] MASS_7.3-45     ggplot2_2.1.0  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.3      magrittr_1.5     munsell_0.4.3    colorspace_1.2-6
##  [5] R6_2.1.2         stringr_1.0.0    highr_0.5.1      plyr_1.8.3      
##  [9] tools_3.3.1      parallel_3.3.1   gtable_0.2.0     nlme_3.1-128    
## [13] mgcv_1.8-12      DBI_0.3.1        digest_0.6.9     lazyeval_0.1.10 
## [17] lmtest_0.9-34    assertthat_0.1   Matrix_1.2-6     formatR_1.3     
## [21] evaluate_0.8.3   labeling_0.3     stringi_1.0-1    scales_0.4.0    
## [25] reshape_0.8.5    zoo_1.7-12
R programming

R code can be entered into the command line directly or saved to a script, which can be run inside a session using the source function

Commands are separated either by a ; or by a newline.

R is case sensitive.

The # character starts a comment; everything from the # to the end of the line is ignored and not executed.

Help files for R functions are accessed by preceding the name of the function with ? (e.g. ?require).

??keyword searches R documentation for keyword (e.g. ??logistic)
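A few lines illustrating these conventions (the script filename is hypothetical):

```r
# two commands on one line, separated by a semicolon
x <- 5; y <- x * 2

# R is case sensitive: Y is not the same name as y
y   # typing an object's name at the prompt prints its value, here 10

# everything after a # on a line is a comment and is not executed
# a saved script (hypothetical filename) can be run with: source("intro.R")

# help for a function: ?require   search documentation: ??logistic
```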

R programming

R stores both data and output from data analysis (as well as everything else) in objects

Things are assigned to and stored in objects using the <- or = operator

A list of all objects in the current session can be obtained with ls()

# assign the number 3 to object called abc
abc <- 3
# list all objects in current session
ls()
##  [1] "abc"       "buildInfo" "d"         "dall"      "dat.csv"  
##  [6] "dat.dta"   "dat.spss"  "dat.tab"   "dat.xls"   "dboth"    
## [11] "ddropped"  "dfemale"   "dmale"     "duse"      "f"        
## [16] "m"         "m2"        "m3"        "m3b"       "m4"       
## [21] "newdat"    "tab"       "tab2"      "tab3"      "testtab2"

Entering Data

Dataset files

R works most easily with datasets stored as text files. Typically, values in text files are separated, or delimited, by tabs or spaces:


gender id race ses schtyp prgtype read write math science socst
0 70 4 1 1 general 57 52 41 47 57
1 121 4 2 1 vocati 68 59 53 63 31
0 86 4 3 1 general 44 33 54 58 31
0 141 4 3 1 vocati 63 44 47 53 56

or by commas (CSV file):


gender,id,race,ses,schtyp,prgtype,read,write,math,science,socst
0,70,4,1,1,general,57,52,41,47,57
1,121,4,2,1,vocati,68,59,53,63,61
0,86,4,3,1,general,44,33,54,58,31
0,141,4,3,1,vocati,63,44,47,53,56
Reading in Data 1

Base R functions read.table and read.csv can read in data stored as text files, delimited by almost anything (notice the sep = option)

Although we are retrieving files over the internet for this seminar, these functions are typically used for files saved to disk.

Note how we are assigning the loaded data to objects.

# comma separated values
dat.csv <- read.csv("http://www.ats.ucla.edu/stat/data/hsb2.csv")
# tab separated values
dat.tab <- read.table("http://www.ats.ucla.edu/stat/data/hsb2.txt",
  header=TRUE, sep = "\t")
Reading in Data 2

We can read in datasets from other statistical analysis software using functions found in the foreign package

require(foreign)
# SPSS files
dat.spss <- read.spss("http://www.ats.ucla.edu/stat/data/hsb2.sav",
  to.data.frame=TRUE)
# Stata files
dat.dta <- read.dta("http://www.ats.ucla.edu/stat/data/hsb2.dta")
Reading in Excel Files

Datasets are often saved as Excel spreadsheets. Here we utilize the readxl package to read in the Excel file. Because read_excel reads local files, we need to download the file first.

# these steps are only needed to read Excel files from the internet
f <- tempfile(fileext = ".xls")
download.file("http://www.ats.ucla.edu/stat/data/hsb2.xls", f, mode="wb")

dat.xls <- read_excel(f)
Viewing Data

R has ways to look at the dataset at a glance or as a whole.

# first few rows
head(dat.csv)
##    id female race ses schtyp prog read write math science socst
## 1  70      0    4   1      1    1   57    52   41      47    57
## 2 121      1    4   2      1    3   68    59   53      63    61
## 3  86      0    4   3      1    1   44    33   54      58    31
## 4 141      0    4   3      1    3   63    44   47      53    56
## 5 172      0    4   2      1    2   47    52   57      53    61
## 6 113      0    4   2      1    2   44    52   51      63    61
# last few rows
tail(dat.csv)
##      id female race ses schtyp prog read write math science socst
## 195 179      1    4   2      2    2   47    65   60      50    56
## 196  31      1    2   2      2    1   55    59   52      42    56
## 197 145      1    4   2      1    3   42    46   38      36    46
## 198 187      1    4   2      2    1   57    41   57      55    52
## 199 118      1    4   2      1    1   55    62   58      58    61
## 200 137      1    4   3      1    2   63    65   65      53    61
# variable names
colnames(dat.csv)
##  [1] "id"      "female"  "race"    "ses"     "schtyp"  "prog"    "read"   
##  [8] "write"   "math"    "science" "socst"
# pop-up view of entire data set (uncomment to run)
# View(dat.csv)
Data frames

Once read in, datasets in R are typically stored as data frames, which have a rectangular structure: observations are arranged as rows, and variables, whether numerical or categorical, are arranged as columns.

Individual rows, columns, and cells in a data frame can be accessed through many methods of indexing.

We most commonly use object[row,column] notation.

# single cell value
dat.csv[2,3]
## [1] 4
# omitting row value implies all rows; here all rows in column 3
dat.csv[,3]
##   [1] 4 4 4 4 4 4 3 1 4 3 4 4 4 4 3 4 4 4 4 4 4 4 3 1 1 3 4 4 4 2 4 4 4 4 4
##  [36] 4 4 4 1 4 4 4 4 3 4 4 3 4 4 1 2 4 1 4 4 1 4 1 4 1 4 4 4 4 4 4 4 4 4 1
##  [71] 4 4 4 4 4 1 4 4 4 1 4 4 4 1 4 4 4 4 4 4 2 4 4 1 4 4 4 4 1 4 4 4 3 4 4
## [106] 4 4 4 3 4 4 1 4 4 1 4 4 4 4 3 1 4 4 4 3 4 4 2 4 3 4 2 4 4 4 4 4 3 1 3
## [141] 1 4 4 1 4 4 4 4 1 3 3 4 4 1 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 1 3 2 3
## [176] 4 4 4 4 4 4 4 4 4 2 2 4 2 4 3 4 4 4 2 4 2 4 4 4 4
# omitting column values implies all columns; here all columns in row 2
dat.csv[2,]
##    id female race ses schtyp prog read write math science socst
## 2 121      1    4   2      1    3   68    59   53      63    61
# can also use ranges - rows 2 and 3, columns 2 and 3
dat.csv[2:3, 2:3]
##   female race
## 2      1    4
## 3      0    4
More variable indexing

We can also access variables directly by using their names, either with object[,"variable"] notation or object$variable notation.

# get first 10 rows of variable female using two methods
dat.csv[1:10, "female"]
##  [1] 0 1 0 0 0 0 0 0 0 0
dat.csv$female[1:10]
##  [1] 0 1 0 0 0 0 0 0 0 0
The c function for combining values into a vector

The c function is widely used to combine values of common type together to form a vector.

For example, it can be used to access non-sequential rows and columns from a data frame.

# get column 1 for rows 1, 3 and 5
dat.csv[c(1,3,5), 1]
## [1]  70  86 172
# get row 1 values for variables female, prog and socst
dat.csv[1,c("female", "prog", "socst")]
##   female prog socst
## 1      0    1    57
Variable Names

If there were no variable names, or we wanted to change the names, we could use colnames.

colnames(dat.csv) <- c("ID", "Sex", "Ethnicity", "SES", "SchoolType",
  "Program", "Reading", "Writing", "Math", "Science", "SocialStudies")

# to change one variable name, just use indexing
colnames(dat.csv)[1] <- "ID2"
Saving Data

We can save our data in a number of formats, including text, Excel .xlsx, and in other statistical software formats like Stata .dta. Note that the code below is commented, so it will not run.

The function write.dta comes from the foreign package, while write.xlsx comes from the xlsx package.

#write.csv(dat.csv, file = "path/to/save/filename.csv")

#write.table(dat.csv, file = "path/to/save/filename.txt", sep = "\t", na=".")

#write.dta(dat.csv, file = "path/to/save/filename.dta")

#write.xlsx(dat.csv, file = "path/to/save/filename.xlsx", sheetName="hsb2")

# save to binary R format (can save multiple datasets and R objects)
#save(dat.csv, dat.dta, dat.spss, dat.tab, file = "path/to/save/filename.RData")
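As a self-contained sketch of the save-and-reload round trip (using a small made-up data frame and a temporary file in place of the seminar data):

```r
# a small stand-in data frame (hypothetical data, not hsb2)
ex <- data.frame(id = 1:3, read = c(57, 68, 44))

# write to a temporary CSV file, then read it back in
f <- tempfile(fileext = ".csv")
write.csv(ex, file = f, row.names = FALSE)
ex2 <- read.csv(f)

# the round trip preserves the data
all.equal(ex, ex2)
```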

Exploring Data


Now we're going to read some data in and store it in an object named d. We prefer short names for objects that we will use frequently.

We can now easily explore and get to know these data, which contain a number of school, test, and demographic variables for 200 students.

d <- read.csv("http://www.ats.ucla.edu/stat/data/hsb2.csv")
Description of Dataset

Using dim, we get the number of observations (rows) and variables (columns) in d.

Using str, we get the structure of d, including the class of d and the data type of all column variables (usually one of "numeric", "integer", "logical", or "character").

dim(d)
## [1] 200  11
#d is of class "data.frame"
#all of its variables are of type "integer"
str(d)
## 'data.frame':	200 obs. of  11 variables:
##  $ id     : int  70 121 86 141 172 113 50 11 84 48 ...
##  $ female : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ race   : int  4 4 4 4 4 4 3 1 4 3 ...
##  $ ses    : int  1 2 3 3 2 2 2 2 2 2 ...
##  $ schtyp : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ prog   : int  1 3 1 3 2 2 1 2 1 2 ...
##  $ read   : int  57 68 44 63 47 44 50 34 63 57 ...
##  $ write  : int  52 59 33 44 52 52 59 46 57 55 ...
##  $ math   : int  41 53 54 47 57 51 42 45 54 52 ...
##  $ science: int  47 63 58 53 53 63 53 39 58 50 ...
##  $ socst  : int  57 61 31 56 61 61 61 36 51 51 ...
R classes

R objects belong to classes. Most functions only accept objects of a specific class, so it is important to know the classes of our objects. Objects can belong to more than one class, and users can define classes to control the inputs of their functions.

The class function lists all classes to which the object belongs. If class returns a basic data type (e.g. "numeric", "character", "integer"), the object has an implicit class of "vector" (array) for one-dimensional objects and "matrix" for multi-dimensional objects.

#d is of class data.frame
class(d)
## [1] "data.frame"
#d$math is of class integer vector
class(d$math)
## [1] "integer"
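For example, one-dimensional and two-dimensional objects built from a basic type report their classes like this (note that in recent versions of R, 4.0 and later, matrices also report class "array" alongside "matrix"):

```r
# a one-dimensional object of a basic type behaves as a vector
v <- c(1.5, 2.5, 3.5)
class(v)       # "numeric"
is.vector(v)   # TRUE

# a two-dimensional object built from the same kind of data is a matrix
m <- matrix(1:4, nrow = 2)
class(m)       # includes "matrix"
is.matrix(m)   # TRUE
```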
Generic functions

Generic functions remove the need for the user to remember the classes of objects that functions support.

Generic functions accept objects from multiple classes. They then pass the object to a specific function (called a method) designed for the object's class. The methods for different classes can have widely diverging purposes.

For example, when passing a data.frame to the generic plot function, plot passes the data.frame to a function called plot.data.frame, which creates a scatter plot matrix of all variables in the data.frame. In contrast, passing a regression model object to plot produces regression diagnostic plots instead.

#plot calls plot.data.frame if given a data.frame input
plot(d)
#plot calls plot.lm if given an lm regression object input
plot(lm(write ~ read, data=d), ask=F)
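Dispatch can also be seen with a minimal user-defined class (the class name "grade" here is made up for illustration):

```r
# define a method of the generic print() for a made-up class "grade"
print.grade <- function(x, ...) {
  cat("grade object with", length(x), "scores\n")
}

g <- c(57, 68, 44)
class(g) <- "grade"

# the generic print() now dispatches to print.grade()
print(g)
```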
Generic functions, classes and methods

If given a generic function name as an argument, the methods function lists all of the class-specific functions (methods) that the generic can dispatch to. If given a class as an argument, methods lists all functions that have a method for that class.

#what type of objects does generic function plot accept?
methods(plot)
##  [1] plot,ANY-method        plot,color-method      plot.acf*             
##  [4] plot.ACF*              plot.augPred*          plot.compareFits*     
##  [7] plot.correspondence*   plot.data.frame*       plot.decomposed.ts*   
## [10] plot.default           plot.dendrogram*       plot.density*         
## [13] plot.ecdf              plot.factor*           plot.formula*         
## [16] plot.function          plot.gam*              plot.ggplot*          
## [19] plot.gls*              plot.goodfit*          plot.gtable*          
## [22] plot.hclust*           plot.histogram*        plot.HoltWinters*     
## [25] plot.intervals.lmList* plot.isoreg*           plot.jam*             
## [28] plot.lda*              plot.lm*               plot.lme*             
## [31] plot.lmList*           plot.loddsratio*       plot.loglm*           
## [34] plot.mca*              plot.medpolish*        plot.mlm*             
## [37] plot.nffGroupedData*   plot.nfnGroupedData*   plot.nls*             
## [40] plot.nmGroupedData*    plot.pdMat*            plot.ppr*             
## [43] plot.prcomp*           plot.princomp*         plot.profile*         
## [46] plot.profile.nls*      plot.ranef.lme*        plot.ranef.lmList*    
## [49] plot.raster*           plot.ridgelm*          plot.shingle*         
## [52] plot.simulate.lme*     plot.spec*             plot.stepfun          
## [55] plot.stl*              plot.structable*       plot.table*           
## [58] plot.trellis*          plot.ts                plot.tskernel*        
## [61] plot.TukeyHSD*         plot.Variogram*        plot.zoo*             
## see '?methods' for accessing help and source code
#what functions accept data.frames as arguments?
methods(class="data.frame")
##  [1] $              $<-            [              [[            
##  [5] [[<-           [<-            aggregate      anti_join     
##  [9] anyDuplicated  arrange_       as.data.frame  as.list       
## [13] as.matrix      as.tbl         as.tbl_cube    by            
## [17] cbind          coerce         collapse       collect       
## [21] compute        corresp        dim            dimnames      
## [25] dimnames<-     distinct_      do_            droplevels    
## [29] duplicated     edit           filter_        format        
## [33] formula        fortify        full_join      ggplot        
## [37] group_by_      group_indices_ group_size     groups        
## [41] head           initialize     inner_join     intersect     
## [45] is.na          lda            left_join      loglm1        
## [49] Math           melt           merge          mutate_       
## [53] n_groups       na.contiguous  na.exclude     na.omit       
## [57] Ops            parallel       parallelplot   plot          
## [61] print          prompt         qda            rbind         
## [65] rename_        right_join     row.names      row.names<-   
## [69] rowsum         same_src       sample_frac    sample_n      
## [73] select_        semi_join      setdiff        setequal      
## [77] show           slice_         slotsFromS3    split         
## [81] split<-        splom          stack          str           
## [85] subset         summarise_     summary        Summary       
## [89] t              tail           tbl_vars       transform     
## [93] type_sum       ungroup        union          unique        
## [97] unstack        within        
## see '?methods' for accessing help and source code
Descriptive Stats

summary is a generic function to summarize many types of R objects, including datasets.

When used on a data frame, summary returns distributional summaries of variables in the dataset.

summary(d)
##        id            female           race           ses      
##  Min.   :  1.0   Min.   :0.000   Min.   :1.00   Min.   :1.00  
##  1st Qu.: 50.8   1st Qu.:0.000   1st Qu.:3.00   1st Qu.:2.00  
##  Median :100.5   Median :1.000   Median :4.00   Median :2.00  
##  Mean   :100.5   Mean   :0.545   Mean   :3.43   Mean   :2.06  
##  3rd Qu.:150.2   3rd Qu.:1.000   3rd Qu.:4.00   3rd Qu.:3.00  
##  Max.   :200.0   Max.   :1.000   Max.   :4.00   Max.   :3.00  
##      schtyp          prog           read          write     
##  Min.   :1.00   Min.   :1.00   Min.   :28.0   Min.   :31.0  
##  1st Qu.:1.00   1st Qu.:2.00   1st Qu.:44.0   1st Qu.:45.8  
##  Median :1.00   Median :2.00   Median :50.0   Median :54.0  
##  Mean   :1.16   Mean   :2.02   Mean   :52.2   Mean   :52.8  
##  3rd Qu.:1.00   3rd Qu.:2.25   3rd Qu.:60.0   3rd Qu.:60.0  
##  Max.   :2.00   Max.   :3.00   Max.   :76.0   Max.   :67.0  
##       math         science         socst     
##  Min.   :33.0   Min.   :26.0   Min.   :26.0  
##  1st Qu.:45.0   1st Qu.:44.0   1st Qu.:46.0  
##  Median :52.0   Median :53.0   Median :52.0  
##  Mean   :52.6   Mean   :51.9   Mean   :52.4  
##  3rd Qu.:59.0   3rd Qu.:58.0   3rd Qu.:61.0  
##  Max.   :75.0   Max.   :74.0   Max.   :71.0
#summary is a generic function that accepts data.frames
#there should be a summary.data.frame
methods(summary)
##  [1] summary,ANY-method             summary,DBIObject-method      
##  [3] summary,diagonalMatrix-method  summary,sparseMatrix-method   
##  [5] summary.aov                    summary.aovlist*              
##  [7] summary.aspell*                summary.assocstats*           
##  [9] summary.check_packages_in_dir* summary.connection            
## [11] summary.corAR1*                summary.corARMA*              
## [13] summary.corCAR1*               summary.corCompSymm*          
## [15] summary.corExp*                summary.corGaus*              
## [17] summary.corIdent*              summary.corLin*               
## [19] summary.corNatural*            summary.corRatio*             
## [21] summary.corSpher*              summary.corStruct*            
## [23] summary.corSymm*               summary.data.frame            
## [25] summary.Date                   summary.default               
## [27] summary.ecdf*                  summary.factor                
## [29] summary.gam*                   summary.ggplot*               
## [31] summary.glm                    summary.gls*                  
## [33] summary.goodfit*               summary.infl*                 
## [35] summary.Kappa*                 summary.lm                    
## [37] summary.lme*                   summary.lmList*               
## [39] summary.loddsratio*            summary.loess*                
## [41] summary.loglm*                 summary.manova                
## [43] summary.matrix                 summary.mlm*                  
## [45] summary.modelStruct*           summary.negbin*               
## [47] summary.nls*                   summary.nlsList*              
## [49] summary.packageStatus*         summary.pdBlocked*            
## [51] summary.pdCompSymm*            summary.pdDiag*               
## [53] summary.PDF_Dictionary*        summary.PDF_Stream*           
## [55] summary.pdIdent*               summary.pdIdnot*              
## [57] summary.pdLogChol*             summary.pdMat*                
## [59] summary.pdNatural*             summary.pdSymm*               
## [61] summary.pdTens*                summary.polr*                 
## [63] summary.POSIXct                summary.POSIXlt               
## [65] summary.ppr*                   summary.prcomp*               
## [67] summary.princomp*              summary.proc_time             
## [69] summary.reStruct*              summary.rlm*                  
## [71] summary.shingle*               summary.srcfile               
## [73] summary.srcref                 summary.stepfun               
## [75] summary.stl*                   summary.table                 
## [77] summary.trellis*               summary.tukeysmooth*          
## [79] summary.varComb*               summary.varConstPower*        
## [81] summary.varExp*                summary.varFixed*             
## [83] summary.varFunc*               summary.varIdent*             
## [85] summary.varPower*              summary.yearmon*              
## [87] summary.yearqtr*               summary.zoo*                  
## see '?methods' for accessing help and source code
Conditional Summaries 1

If we want conditional summaries, for example only for those students with high reading scores (read >= 60), we first subset the data using filter, then summarize as usual.

R permits nested function calls, where the results of one function are passed directly as an argument to another function. Here, filter returns a dataset containing observations where read >= 60. This data subset is then passed to summary to obtain distributions of the variables in the subset.

Note that filter is in the dplyr package.

summary(filter(d, read >= 60))
##        id            female           race          ses      
##  Min.   :  3.0   Min.   :0.000   Min.   :1.0   Min.   :1.00  
##  1st Qu.: 76.5   1st Qu.:0.000   1st Qu.:4.0   1st Qu.:2.00  
##  Median :108.5   Median :0.000   Median :4.0   Median :3.00  
##  Mean   :109.8   Mean   :0.482   Mean   :3.7   Mean   :2.38  
##  3rd Qu.:143.2   3rd Qu.:1.000   3rd Qu.:4.0   3rd Qu.:3.00  
##  Max.   :200.0   Max.   :1.000   Max.   :4.0   Max.   :3.00  
##      schtyp          prog           read          write     
##  Min.   :1.00   Min.   :1.00   Min.   :60.0   Min.   :43.0  
##  1st Qu.:1.00   1st Qu.:2.00   1st Qu.:63.0   1st Qu.:57.0  
##  Median :1.00   Median :2.00   Median :65.0   Median :60.0  
##  Mean   :1.18   Mean   :1.95   Mean   :65.5   Mean   :59.5  
##  3rd Qu.:1.00   3rd Qu.:2.00   3rd Qu.:68.0   3rd Qu.:65.0  
##  Max.   :2.00   Max.   :3.00   Max.   :76.0   Max.   :67.0  
##       math         science         socst     
##  Min.   :35.0   Min.   :44.0   Min.   :41.0  
##  1st Qu.:55.8   1st Qu.:55.0   1st Qu.:56.0  
##  Median :60.5   Median :61.0   Median :61.0  
##  Mean   :60.2   Mean   :59.7   Mean   :60.9  
##  3rd Qu.:66.2   3rd Qu.:65.2   3rd Qu.:66.0  
##  Max.   :75.0   Max.   :74.0   Max.   :71.0
Conditional Summaries 2

We can separate the data in other ways, such as by groups. Let's look at the means of the last 5 variables for each type of program, prog.

Here, we are asking the by function to apply the colMeans function to variables located in columns 7 through 11, and to calculate those means by groups denoted in the variable prog.

by(d[, 7:11], d$prog, colMeans)
## d$prog: 1
##    read   write    math science   socst 
##  49.756  51.333  50.022  52.444  50.600 
## -------------------------------------------------------- 
## d$prog: 2
##    read   write    math science   socst 
##  56.162  56.257  56.733  53.800  56.695 
## -------------------------------------------------------- 
## d$prog: 3
##    read   write    math science   socst 
##   46.20   46.76   46.42   47.22   45.02
Histograms

Typically it is easier to inspect variable distributions with graphics. For the following graphics, we use the ggplot2 package. Histograms are often used for continuous variable distributions...

ggplot(d, aes(x = write)) + geom_histogram()
Density Plots

...as are kernel density plots...

ggplot(d, aes(x = write)) + geom_density()
Boxplots

...as well as boxplots, which show the median, lower and upper quartiles (or hinges) and the full range.

ggplot(d, aes(x = 1, y = math)) + geom_boxplot()
Conditional Visualization 1

We can also plot graphs by group to better understand our data. Here we examine the densities of write for each type of program, prog.

# density plots by program type
ggplot(d, aes(x = write)) + geom_density() + facet_wrap(~ prog)
Conditional Visualization 2

We could also view boxplots of math by type of program.

Here we plot math scores for each group in prog.

ggplot(d, aes(x = factor(prog), y = math)) + geom_boxplot()
Extended Visualization 2

Here we demonstrate the flexibility of R by plotting the distributions of all of our continuous variables in a single boxplot.

The function melt from the reshape2 package stacks the values located in columns 7 through 11 of our dataset on top of one another to form one column called "value", and stacks the variable names associated with each value in a second column called "variable".

Thus we are asking for a boxplot of "values" grouped by "variable".

Try running just melt(d[, 7:11]) first. You'll notice there are 1000 rows and 2 columns, since it stacks the 200 values of five different variables on top of one another.

ggplot(melt(d[, 7:11]), aes(x = variable, y = value)) + geom_boxplot()
## No id variables; using all as measure variables
Extended Visualization 3

Finally, we can get boxplots of these same variables colored by program type by specifying a fill argument.

ggplot(melt(d[, 6:11], id.vars = "prog"),
       aes(x = variable, y = value, fill = factor(prog))) +
  geom_boxplot()
Graphics in R

This presentation has made heavy use of the ggplot2 package to visualize data in R. ggplot2 is an implementation of Leland Wilkinson's Grammar of Graphics, a framework for thinking about and creating graphs.

However, there are many other packages and ways to make graphics in R. Another very popular system uses the lattice package. This is a recommended package, which means that it ships with R so everyone has it. Here are some examples.

# load lattice
require(lattice)

# simple scatter plot
xyplot(read ~ write, data = d)
# conditioned scatter plot
xyplot(read ~ write | prog, data = d)
# conditioning on two variables
xyplot(read ~ write | prog * schtyp, data = d)
# box and whisker plots
bwplot(read ~ factor(prog), data = d)

It is also possible to make graphs in base R, but often they look much simpler and less aesthetically pleasing, unless you take the time to customize them.

Categorical Data 1

We can look at the distributions of categorical variables with frequency tables.

xtabs( ~ female, data = d)
## female
##   0   1 
##  91 109
xtabs( ~ race, data = d)
## race
##   1   2   3   4 
##  24  11  20 145
xtabs( ~ prog, data = d)
## prog
##   1   2   3 
##  45 105  50
Categorical Data 2

Two-way cross tabs.

xtabs( ~ ses + schtyp, data = d)
##    schtyp
## ses  1  2
##   1 45  2
##   2 76 19
##   3 47 11
Categorical Data 3

Three-way cross tabs.

(tab3 <- xtabs( ~ ses + prog + schtyp, data = d))
## , , schtyp = 1
## 
##    prog
## ses  1  2  3
##   1 14 19 12
##   2 17 30 29
##   3  8 32  7
## 
## , , schtyp = 2
## 
##    prog
## ses  1  2  3
##   1  2  0  0
##   2  3 14  2
##   3  1 10  0
Categorical Independence

We can test whether ses and schtyp are independent with a permutation test from the vcd package.

(tab2 <- xtabs( ~ ses + schtyp, data = d))
##    schtyp
## ses  1  2
##   1 45  2
##   2 76 19
##   3 47 11
set.seed(10)
(testtab2 <- coindep_test(tab2, n = 5000))
## 
## 	Permutation test for conditional independence
## 
## data:  tab2
## f(x) = 2.01, p-value = 0.025
Visualizing Cat Data (VCD) 1

The area of each cell in a mosaic plot corresponds to the frequency from the crosstabs. mosaic comes from the vcd package.

# simple mosaic plot
mosaic(tab2)
VCD 2

To visually assess whether the variables are independent, let's shade the cells based on Pearson residuals from what we would observe if the data were independent.

mosaic(tab2, gp = shading_hsv,
  gp_args = list(p.value = testtab2$p.value, interpolate = -1:2))
VCD 3

We can condition just as we did with cross tabs to visualize a three-way cross tab.

cotabplot(~ ses + prog | schtyp, data = d, panel = cotab_coindep, n = 5000)
Correlations 1

As a last step in our data exploration, we would like some quick looks at bivariate (pairwise) relationships in our data. Correlation matrices provide quick summaries of these pairwise relationships.

If there are no missing data, we can use the cor function with default arguments. Otherwise, we could use the use argument to get listwise or pairwise deletion. Check the help file (?cor) for how to use the use argument.

Correlations 2

Here we are requesting all pairwise correlations among variables in columns 7 through 11.

cor(d[, 7:11])
##            read   write    math science   socst
## read    1.00000 0.59678 0.66228 0.63016 0.62148
## write   0.59678 1.00000 0.61745 0.57044 0.60479
## math    0.66228 0.61745 1.00000 0.63073 0.54448
## science 0.63016 0.57044 0.63073 1.00000 0.46511
## socst   0.62148 0.60479 0.54448 0.46511 1.00000
Correlations 3

If there are missing data, listwise deletion uses only complete observations.

cor(d[, 7:11], use = "complete.obs")
##            read   write    math science   socst
## read    1.00000 0.59678 0.66228 0.63016 0.62148
## write   0.59678 1.00000 0.61745 0.57044 0.60479
## math    0.66228 0.61745 1.00000 0.63073 0.54448
## science 0.63016 0.57044 0.63073 1.00000 0.46511
## socst   0.62148 0.60479 0.54448 0.46511 1.00000
Correlations 4

If there are missing data, pairwise deletion uses all pairwise-complete observations.

cor(d[, 7:11], use = "pairwise.complete.obs")
##            read   write    math science   socst
## read    1.00000 0.59678 0.66228 0.63016 0.62148
## write   0.59678 1.00000 0.61745 0.57044 0.60479
## math    0.66228 0.61745 1.00000 0.63073 0.54448
## science 0.63016 0.57044 0.63073 1.00000 0.46511
## socst   0.62148 0.60479 0.54448 0.46511 1.00000
Visual Summaries, Continuous Variables

We can inspect univariate and bivariate relationships using a scatter plot matrix.

ggpairs(d[, 7:11])
[Figure: scatter plot matrix of read, write, math, science, and socst]

Modifying Data

Modifying Data

This section demonstrates reordering and modifying your data.

Let's begin by reading in our dataset again and storing it in object d.

# read data in and store in an easy to use name to save typing
d <- read.csv("http://www.ats.ucla.edu/stat/data/hsb2.csv")
Sorting

We can sort data using the arrange function from the dplyr package.

Here we are requesting that arrange sort by female, and then by math. The function arrange returns the sorted dataset.

library(dplyr)
d <- arrange(d, female, math)
head(d)
##    id female race ses schtyp prog read write math science socst
## 1 167      0    4   2      1    1   63    49   35      66    41
## 2 128      0    4   3      1    2   39    33   38      47    41
## 3  49      0    3   3      1    3   50    40   39      49    47
## 4  22      0    1   2      1    3   42    39   39      56    46
## 5 134      0    4   1      1    1   44    44   39      34    46
## 6 117      0    4   3      1    3   34    49   39      42    56
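arrange sorts ascending by default. For comparison, base R can produce the same kind of sort with order; a sketch on a hypothetical toy data frame, where -score sorts a numeric column in descending order:

```r
df <- data.frame(g = c(1, 0, 1, 0), score = c(10, 30, 20, 40))

# sort by g ascending, then by score descending
df[order(df$g, -df$score), ]
```

With dplyr, the equivalent descending sort would be arrange(df, g, desc(score)).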
Coding Categorical Variables as Factors

Within R, categorical variables are typically coded as factors, a special class that allows for value labeling.

Here, using the factor function, we convert all categorical variables to factors and label their values.

To spare us from unnecessary typing, we use the mutate function from the dplyr package to let R know that all conversions to factors and labeling should occur within the dataset d.
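Before applying this to d, here is a minimal sketch of factor itself on a hypothetical toy vector coded 0/1, like female in our data:

```r
# convert a 0/1 vector to a labeled factor
f <- factor(c(0, 1, 0), levels = 0:1, labels = c("male", "female"))
levels(f)  # "male" "female"
table(f)   # male: 2, female: 1
```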

#library(dplyr)
str(d)
## 'data.frame':	200 obs. of  11 variables:
##  $ id     : int  167 128 49 22 134 117 153 140 58 133 ...
##  $ female : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ race   : int  4 4 3 1 4 4 4 4 4 4 ...
##  $ ses    : int  2 3 3 2 1 3 2 2 2 2 ...
##  $ schtyp : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ prog   : int  1 2 3 3 1 3 3 3 3 3 ...
##  $ read   : int  63 39 50 42 44 34 39 44 55 50 ...
##  $ write  : int  49 33 40 39 44 49 31 41 41 31 ...
##  $ math   : int  35 38 39 39 39 39 40 40 40 40 ...
##  $ science: int  66 47 49 56 34 42 39 50 44 34 ...
##  $ socst  : int  41 41 47 46 46 56 51 26 41 31 ...
d <- mutate(d,
            id = factor(id),
            female = factor(female, levels = 0:1, labels = c("male", "female")),
            race = factor(race, levels = 1:4, labels = c("Hispanic", "Asian", "African American", "White")),
            schtyp = factor(schtyp, levels = 1:2, labels = c("public", "private")),
            prog = factor(prog, levels = 1:3, labels = c("general", "academic", "vocational"))
)
Results

Here are the results of converting our categorical variables to factors.

str(d)
## 'data.frame':	200 obs. of  11 variables:
##  $ id     : Factor w/ 200 levels "1","2","3","4",..: 167 128 49 22 134 117 153 140 58 133 ...
##  $ female : Factor w/ 2 levels "male","female": 1 1 1 1 1 1 1 1 1 1 ...
##  $ race   : Factor w/ 4 levels "Hispanic","Asian",..: 4 4 3 1 4 4 4 4 4 4 ...
##  $ ses    : int  2 3 3 2 1 3 2 2 2 2 ...
##  $ schtyp : Factor w/ 2 levels "public","private": 1 1 1 1 1 1 1 1 1 1 ...
##  $ prog   : Factor w/ 3 levels "general","academic",..: 1 2 3 3 1 3 3 3 3 3 ...
##  $ read   : int  63 39 50 42 44 34 39 44 55 50 ...
##  $ write  : int  49 33 40 39 44 49 31 41 41 31 ...
##  $ math   : int  35 38 39 39 39 39 40 40 40 40 ...
##  $ science: int  66 47 49 56 34 42 39 50 44 34 ...
##  $ socst  : int  41 41 47 46 46 56 51 26 41 31 ...
summary(d)
##        id         female                  race          ses      
##  1      :  1   male  : 91   Hispanic        : 24   Min.   :1.00  
##  2      :  1   female:109   Asian           : 11   1st Qu.:2.00  
##  3      :  1                African American: 20   Median :2.00  
##  4      :  1                White           :145   Mean   :2.06  
##  5      :  1                                       3rd Qu.:3.00  
##  6      :  1                                       Max.   :3.00  
##  (Other):194                                                     
##      schtyp            prog          read          write     
##  public :168   general   : 45   Min.   :28.0   Min.   :31.0  
##  private: 32   academic  :105   1st Qu.:44.0   1st Qu.:45.8  
##                vocational: 50   Median :50.0   Median :54.0  
##                                 Mean   :52.2   Mean   :52.8  
##                                 3rd Qu.:60.0   3rd Qu.:60.0  
##                                 Max.   :76.0   Max.   :67.0  
##                                                              
##       math         science         socst     
##  Min.   :33.0   Min.   :26.0   Min.   :26.0  
##  1st Qu.:45.0   1st Qu.:44.0   1st Qu.:46.0  
##  Median :52.0   Median :53.0   Median :52.0  
##  Mean   :52.6   Mean   :51.9   Mean   :52.4  
##  3rd Qu.:59.0   3rd Qu.:58.0   3rd Qu.:61.0  
##  Max.   :75.0   Max.   :74.0   Max.   :71.0  
## 
Scoring and Recoding

Often we need to create variables from other variables. For example, we may want to sum individual test items to form a total score. Or, we may want to convert a continuous scale into several categories, such as letter grades.

We use the mutate function to tell R that all variables created and referenced are within the d dataset.

Below we create a total score and use the cut function to recode continuous ranges into categories. Note that cut's intervals are right-closed by default, so breaks of c(0, 140, 180, ...) produce intervals such as (0, 140] and (140, 180].

#create a total score as sum of 4 scores
#recode into letter grade by breaking into categories
#library(dplyr)
d <- mutate(d,
               total = read+write+math+science,
               grade = cut(total,
                            breaks = c(0, 140, 180, 210, 234, 300),
                            labels = c("F", "D", "C", "B", "A"))
               )

# view results
summary(d[, c("total", "grade")])
##      total     grade 
##  Min.   :139   F: 1  
##  1st Qu.:180   D:51  
##  Median :210   C:50  
##  Mean   :210   B:49  
##  3rd Qu.:234   A:49  
##  Max.   :277
Standardize and Average

We can get standardized (Z) scores with the scale function, and get average read scores for each level of ses using the ave function, which itself calls mean. We use mutate again to simplify the coding.

#library(dplyr)
d <- mutate(d,
            zread = as.numeric(scale(read)),
            readmean = ave(read, ses, FUN = mean)
            )

head(d[, c("read", "zread", "readmean")])
##   read    zread readmean
## 1   63  1.05043   51.579
## 2   39 -1.29036   56.500
## 3   50 -0.21750   56.500
## 4   42 -0.99776   51.579
## 5   44 -0.80270   48.277
## 6   34 -1.77803   56.500
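As a quick check of what scale does, a minimal sketch on a toy vector: the result is centered at 0 with standard deviation 1.

```r
x <- c(2, 4, 6)
z <- as.numeric(scale(x))
mean(z)  # 0
sd(z)    # 1
```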
Scoring

We can also take the mean of a set of variables, ignoring missing values, which is useful for creating composite scores.

d$rowmean <- rowMeans(d[, 7:10], na.rm=TRUE)
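A minimal sketch on a toy matrix shows why na.rm = TRUE matters here:

```r
m <- rbind(c(1, 2, NA), c(4, 5, 6))
rowMeans(m)                # NA for the first row, since it contains an NA
rowMeans(m, na.rm = TRUE)  # 1.5 5.0
```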

Managing Data

Directories in R

At times it is convenient to look at directory contents within an R session. Let's get the current working directory, list its files, and change the working directory.

getwd()
## [1] "C:/website/stat/r/seminars"
list.files()
##  [1] "figures"           "ggplot2_intro"     "intro.htm"        
##  [4] "intro.R"           "intro.Rhtml"       "R.svg"            
##  [7] "R.vsd"             "Repeated_Measures" "Rmatrix.htm"      
## [10] "ui"
#setwd("/path/to/directory")
Subsetting Observations

Suppose we would like to split our dataset into 2 datasets, one for each gender. Here we use the filter function to create the 2 subsets, and we store each subset in a new object.

dfemale <- filter(d, female == "female")
dmale <- filter(d, female == "male")
Subsetting Variables

Often, datasets come with many more variables than we want. We can use select to keep only the variables we need.

# note that select is special, so we do not need to quote the variable names
duse <- select(d, id, female, read, write)
# note the - preceding c(female... , which means drop these variables
ddropped <- select(d, -c(female, read, write))
Adding Observations (appending)

If we were given separate files for males and females but wanted to stack them (add observations), we could append the datasets together row-wise with rbind.

dboth <- rbind(dfemale, dmale)
dim(dfemale)
## [1] 109  16
dim(dmale)
## [1] 91 16
dim(dboth)
## [1] 200  16
Merging Data

If instead we were given separate files describing the same students, but with different variables, we could merge datasets to combine both sets of variables into 1 dataset, using the merge function. Note that the ID variable need not have the same name in both datasets; if the names differed, we could have said by.x = "id.x", by.y = "id.y".

dall <- merge(duse, ddropped, by = "id", all = TRUE)
dim(duse)
## [1] 200   4
dim(ddropped)
## [1] 200  13
dim(dall)
## [1] 200  16
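A sketch of the by.x/by.y idea, with hypothetical toy data frames whose ID columns have different names:

```r
a <- data.frame(sid = 1:3, read = c(50, 60, 70))
b <- data.frame(pupil = c(2, 3, 4), write = c(55, 65, 75))

# match a$sid against b$pupil; all = TRUE keeps unmatched rows, filled with NA
merge(a, b, by.x = "sid", by.y = "pupil", all = TRUE)
```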

Analyzing Data

Analyzing Categorical Data 1

We often analyze relationships between categorical variables using chi-square tests. chisq.test accepts either raw data or a frequency table; here we use a frequency table.

(tab <- xtabs(~ ses + schtyp, data = d))
##    schtyp
## ses public private
##   1     45       2
##   2     76      19
##   3     47      11
chisq.test(tab)
## 
## 	Pearson's Chi-squared test
## 
## data:  tab
## X-squared = 6.33, df = 2, p-value = 0.042
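As noted above, chisq.test can also take the two raw variables directly and build the table internally; a sketch on hypothetical toy vectors (expect a small-counts warning here):

```r
x <- c("a", "a", "b", "b", "a", "b")
y <- c("u", "v", "u", "v", "u", "v")

# equivalent to chisq.test(table(x, y))
chisq.test(x, y)
```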
Analyzing Categorical Data 2

If you are concerned about small cell sizes, you can compute the p-value by Monte Carlo simulation. Here we request 90,000 replicates.

chisq.test(tab, simulate.p.value=TRUE, B = 90000)
## 
## 	Pearson's Chi-squared test with simulated p-value (based on 90000
## 	replicates)
## 
## data:  tab
## X-squared = 6.33, df = NA, p-value = 0.042
t-tests 1

t.test performs t-tests, which compare means. Here we show a one-sample t-test comparing the mean of write to 50, and a paired-samples t-test comparing the means of write and read.

t.test(d$write, mu = 50)
## 
## 	One Sample t-test
## 
## data:  d$write
## t = 4.14, df = 199, p-value = 5.1e-05
## alternative hypothesis: true mean is not equal to 50
## 95 percent confidence interval:
##  51.453 54.097
## sample estimates:
## mean of x 
##    52.775
with(d, t.test(write, read, paired = TRUE))
## 
## 	Paired t-test
## 
## data:  write and read
## t = 0.867, df = 199, p-value = 0.39
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.69414  1.78414
## sample estimates:
## mean of the differences 
##                   0.545
t-tests 2

Perhaps we would like to compare the means of write between males and females. Here are independent samples t-tests with equal variance assumed and not assumed.

t.test(write ~ female, data = d, var.equal=TRUE)
## 
## 	Two Sample t-test
## 
## data:  write by female
## t = -3.73, df = 198, p-value = 0.00025
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -7.4418 -2.2981
## sample estimates:
##   mean in group male mean in group female 
##               50.121               54.991
t.test(write ~ female, data = d)
## 
## 	Welch Two Sample t-test
## 
## data:  write by female
## t = -3.66, df = 170, p-value = 0.00034
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -7.4992 -2.2407
## sample estimates:
##   mean in group male mean in group female 
##               50.121               54.991
ANOVA and Regression

ANOVAs and ordinary least-squares regression are just linear models, so we use lm for both. You should typically store the results of an estimation command in an object, as it often contains a wealth of information.

The object storing the results of lm can then be supplied to anova for an ANOVA table partitioning the variance sequentially, or to summary for a table of regression parameters and measures of fit.

If you would like ANOVAs using other types of sums of squares, get the car package from CRAN.

m <- lm(write ~ prog * female, data = d)
anova(m)
## Analysis of Variance Table
## 
## Response: write
##              Df Sum Sq Mean Sq F value  Pr(>F)    
## prog          2   3176    1588   23.25 8.9e-10 ***
## female        1   1129    1129   16.53 7.0e-05 ***
## prog:female   2    326     163    2.39   0.095 .  
## Residuals   194  13249      68                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(m)
## 
## Call:
## lm(formula = write ~ prog * female, data = d)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -21.62  -5.14   1.04   6.12  21.17 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    49.14       1.80   27.25   <2e-16 ***
## progacademic                    5.47       2.17    2.52   0.0124 *  
## progvocational                 -7.32       2.49   -2.93   0.0038 ** 
## femalefemale                    4.11       2.47    1.66   0.0979 .  
## progacademic:femalefemale      -1.14       2.95   -0.39   0.7005    
## progvocational:femalefemale     5.03       3.41    1.48   0.1413    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.26 on 194 degrees of freedom
## Multiple R-squared:  0.259,	Adjusted R-squared:  0.24 
## F-statistic: 13.6 on 5 and 194 DF,  p-value: 2.38e-11
Regression continued

We can easily update a model, for instance by adding a continuous predictor, say read, using update.

summary(m2 <- update(m, . ~ . + read))
## 
## Call:
## lm(formula = write ~ prog + female + read + prog:female, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.219  -4.467   0.426   4.608  14.507 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  22.9406     3.1881    7.20  1.3e-11 ***
## progacademic                  3.8293     1.8147    2.11  0.03614 *  
## progvocational               -3.7044     2.1127   -1.75  0.08113 .  
## femalefemale                  7.0732     2.0806    3.40  0.00082 ***
## read                          0.4948     0.0531    9.32  < 2e-16 ***
## progacademic:femalefemale    -4.0012     2.4791   -1.61  0.10816    
## progvocational:femalefemale   1.5617     2.8598    0.55  0.58563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.88 on 193 degrees of freedom
## Multiple R-squared:  0.489,	Adjusted R-squared:  0.473 
## F-statistic: 30.8 on 6 and 193 DF,  p-value: <2e-16
Regression Diagnostics 1

Using the plot function on an object storing the results of lm will produce regression diagnostic plots. Here we arrange them in a 2x2 grid using par.

par(mfrow = c(2, 2))
plot(m2)
[Figure: 2x2 grid of regression diagnostic plots for m2]
Regression Diagnostics 2

We can get specific diagnostic plots, such as a density plot of the residuals (assumed normally distributed for inference).

plot(density(resid(m2)))
[Figure: density plot of the residuals of m2]
Regression 3

Interactions are easy to add to models using *. This notation automatically includes the lower-order terms: prog * read expands to prog + read + prog:read.

summary(m3 <- lm(write ~ prog * read, data = d))
## 
## Call:
## lm(formula = write ~ prog * read, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.595  -4.773   0.557   5.375  18.405 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         27.33720    6.17869    4.42  1.6e-05 ***
## progacademic         2.82708    7.56135    0.37  0.70890    
## progvocational      -2.88581    8.36821   -0.34  0.73058    
## read                 0.48228    0.12214    3.95  0.00011 ***
## progacademic:read   -0.01768    0.14413   -0.12  0.90250    
## progvocational:read  0.00059    0.17122    0.00  0.99725    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.48 on 194 degrees of freedom
## Multiple R-squared:  0.393,	Adjusted R-squared:  0.377 
## F-statistic: 25.1 on 5 and 194 DF,  p-value: <2e-16
Regression 4

To get the multi-degree-of-freedom test for the interaction, we can compare models with and without the interaction.

m3b <- update(m3, . ~ . - prog:read)
anova(m3b, m3)
## Analysis of Variance Table
## 
## Model 1: write ~ prog + read
## Model 2: write ~ prog * read
##   Res.Df   RSS Df Sum of Sq    F Pr(>F)
## 1    196 10861                         
## 2    194 10860  2      1.37 0.01   0.99
Estimated Means 1

We can get the estimated (predicted) cell means using predict.

First, using expand.grid we create a new dataset containing all possible crossings of predictor values for which we would like to predict the outcome. We then use the model coefficients to predict our cell means, and store them back in the new dataset.
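As a minimal illustration of what expand.grid produces (toy values):

```r
# every combination of the two inputs: 2 x 2 = 4 rows
expand.grid(prog = c("general", "academic"), female = c("male", "female"))
```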

newdat <- with(d, expand.grid(prog = levels(prog), female = levels(female)))
(newdat <- cbind(newdat, predict(m, newdat, se=TRUE)))
##         prog female    fit se.fit  df residual.scale
## 1    general   male 49.143 1.8033 194         8.2639
## 2   academic   male 54.617 1.2054 194         8.2639
## 3 vocational   male 41.826 1.7231 194         8.2639
## 4    general female 53.250 1.6869 194         8.2639
## 5   academic female 57.586 1.0851 194         8.2639
## 6 vocational female 50.963 1.5904 194         8.2639
Estimated Means 2

Plots of predicted values can be nice.

ggplot(newdat, aes(x = prog, y = fit, colour = female)) +
  geom_errorbar(aes(ymin = fit - se.fit, ymax = fit + se.fit), width=.25) +
  geom_point(size=3)
[Figure: predicted means with standard error bars, by prog and female]
Estimated Means 3

This is a good example of why informative variable names pay off. Had we named the columns well from the start, this graph would have been ready to go.

colnames(newdat)[1:3] <- c("Program", "Sex", "Write")

ggplot(newdat, aes(x = Program, y = Write, colour = Sex)) +
  geom_errorbar(aes(ymin = Write - se.fit, ymax = Write + se.fit), width=.25) +
  geom_point(size=3)
[Figure: same plot with relabeled axes and legend]
Logistic Regression 1

Generalized linear models include Gaussian (OLS), Poisson, and logistic regression. Let's look at logistic regression.

Notice we are storing the results of the regression in model object m4.

m4 <- glm(schtyp ~ ses + read, data = d, family = binomial)
summary(m4)
## 
## Call:
## glm(formula = schtyp ~ ses + read, family = binomial, data = d)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -0.809  -0.628  -0.549  -0.434   2.183  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  -3.3804     1.0887   -3.10   0.0019 **
## ses           0.4784     0.2918    1.64   0.1011   
## read          0.0131     0.0196    0.67   0.5046   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 175.87  on 199  degrees of freedom
## Residual deviance: 171.61  on 197  degrees of freedom
## AIC: 177.6
## 
## Number of Fisher Scoring iterations: 4
Logistic Regression 2

Odds ratios are commonly reported for logistic regression; they are just the exponentiated coefficients. Here coef extracts the coefficients from the model object, which are then exponentiated.

exp(coef(m4))
## (Intercept)         ses        read 
##    0.034034    1.613556    1.013178
Logistic Regression 3

We can also get confidence intervals for the coefficients or the odds ratios.

confint(m4)
## Waiting for profiling to be done...
##                 2.5 %    97.5 %
## (Intercept) -5.592098 -1.302186
## ses         -0.083122  1.067576
## read        -0.025416  0.051926
exp(confint(m4))
## Waiting for profiling to be done...
##                 2.5 %  97.5 %
## (Intercept) 0.0037272 0.27194
## ses         0.9202387 2.90832
## read        0.9749046 1.05330
Predicted Probabilities 1

For the ANOVA, we got estimated cell means. For logistic regression, let's get the predicted probabilities.

newdat <- expand.grid(ses = unique(d$ses), read = seq(min(d$read), max(d$read)))
newdat$PredictedProbability <- predict(m4, newdat, type = "response")
Predicted Probabilities 2

Now we can plot the predicted probabilities to see how read scores and ses affect the probability of being in a private versus public school.

ggplot(newdat, aes(x = read, y = PredictedProbability, colour = factor(ses))) +
  geom_line()
[Figure: predicted probability of private school across read, one line per level of ses]
Nonparametric Tests 1

We can use nonparametric tests when we are not sure that our data meet the distributional assumptions of the statistical inference tests we would like to use.

# equivalent of one-sample t-test
wilcox.test(d$write, mu = 50)
## 
## 	Wilcoxon signed rank test with continuity correction
## 
## data:  d$write
## V = 13200, p-value = 3.7e-05
## alternative hypothesis: true location is not equal to 50
# Wilcoxon signed-rank test, equivalent of paired samples t-test
with(d, wilcox.test(write, read, paired=TRUE))
## 
## 	Wilcoxon signed rank test with continuity correction
## 
## data:  write and read
## V = 9260, p-value = 0.37
## alternative hypothesis: true location shift is not equal to 0
Nonparametric Tests 2
# Wilcoxon rank-sum test, equivalent of independent samples t-test
wilcox.test(write ~ female, data = d)
## 
## 	Wilcoxon rank sum test with continuity correction
## 
## data:  write by female
## W = 3610, p-value = 0.00087
## alternative hypothesis: true location shift is not equal to 0
Nonparametric Tests 3
# Kruskal-Wallis test, equivalent of one-way ANOVA
kruskal.test(write ~ prog, data = d)
## 
## 	Kruskal-Wallis rank sum test
## 
## data:  write by prog
## Kruskal-Wallis chi-squared = 34, df = 2, p-value = 4e-08

For More Information

Getting Help - Manual

The built-in reference manual is available:

?lm

and searchable:

??regression
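Relatedly, ??regression is shorthand for help.search("regression"), and apropos searches visible objects by name pattern; a sketch:

```r
# same search as ??regression (opens the help viewer, so commented out):
# help.search("regression")

# character vector of visible objects whose names contain "test"
apropos("test")
```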
Getting Help - Vignettes

Many packages include vignettes with more detailed examples.

vignette()
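You can also list the vignettes of a single package; a sketch using grid, a base package that ships vignettes:

```r
# vignettes available in one installed package
vignette(package = "grid")

# open a specific vignette by name (launches a viewer, so commented out):
# vignette("grid", package = "grid")
```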
Getting Help - Converting

For those of you switching from other statistical packages, there are dedicated guides; some are available for free on SpringerLink.

Getting Help - Online
Books on R

There is an extensive list of books on R maintained at http://www.r-project.org/doc/bib/R-books.html.