Link to all slides:

www.ats.ucla.edu/stat/r/seminars/intro.htm

For just the code (right-click to download):

`R`

is a programming environment

- uses a well-developed but simple programming language
- allows for rapid development of new tools according to user demand
- these tools are distributed as packages, which any user can
download to customize the
`R`

environment

Base `R`

and most `R`

packages are available for download from the
Comprehensive R Archive Network (CRAN)

- cran.r-project.org
- base
`R`

comes with a number of basic data management, analysis, and graphical tools `R`

's power and flexibility, however, lie in its array of packages (currently more around 6,000!)

`R`

, but most users prefer a graphical interface. For starters:
More advanced users may prefer a good text editor with plugins for syntax highlighting,
code completion, etc. for `R`

such as:

For the purposes of this seminar, we will be using the following packages frequently:

`foreign`

package to read data files from other stats packages`xlsx`

package (requires Java to be installed, same architecture as your`R`

version, also the`rJava`

package and`xlsxjars`

package)`dplyr`

package for various data management tasks`reshape2`

package to easily melt data to long form`ggplot2`

package for elegant data visualization using the*Grammar of Graphics*`GGally`

package for scatter plot matrices`vcd`

package for visualizing and analyzing categorical data

To use packages in `R`

, we must first install them using
the `install.packages`

function, which typically downloads the
package from CRAN and installs it for use

#install.packages("foreign") #install.packages("xlsx") #install.packages("dplyr") #install.packages("reshape2") #install.packages("ggplot2") #install.packages("GGally") #install.packages("vcd")

If we know we will need a particular package for our current R
session, we must load it into the R environment using
the `library`

or `require`

functions

library(foreign) library(xlsx) library(dplyr) library(reshape2) require(ggplot2) require(GGally) require(vcd)

For Windows and Mac, there are binary files to install `R`

.
You can download these and just click to install them. The defaults are
sensible and if you do not have a reason otherwise, are generally a good
idea to accept.

On various Linux distributions, you can use `apt-get install r-base`

if that distribution maintains a version of `R`

.

For Windows, Mac, and Linux, if you have the appropriate tools installed,
you can build `R`

from source. The definitive guide to this is available
http://cran.r-project.org/doc/manuals/R-admin.html

Once `R`

is installed, assuming you have sufficient privileges,
you can install and update most `R`

packages. To install them is just:

`install.packages("packagename")`

To update packages, you can update them all at once using:

`update.packages()`

You can also specify specific CRAN mirrors to use. For example the UCLA Statistics Department maintains a mirror

`install.packages("packagename", repos = "http://cran.stat.ucla.edu/")`

You can use this code to either load the package, or if that does not work, install the package and then load it.

# foreign ships with R so no need to install require(xlsx) || {install.packages("xlsx"); require(xlsx)}

## [1] TRUE

require(reshape2) || {install.packages("reshape2"); require(reshape2)}

## [1] TRUE

require(ggplot2) || {install.packages("ggplot2"); require(ggplot2)}

## [1] TRUE

require(GGally) || {install.packages("GGally"); require(GGally)}

## [1] TRUE

require(vcd) || {install.packages("vcd"); require(vcd)}

## [1] TRUE

To get a description of the version of R and its attached packages
used in the current session, we can use the `sessionInfo`

function

sessionInfo()

## R version 3.2.2 (2015-08-14) ## Platform: x86_64-w64-mingw32/x64 (64-bit) ## Running under: Windows 7 x64 (build 7601) Service Pack 1 ## ## locale: ## [1] LC_COLLATE=English_United States.1252 ## [2] LC_CTYPE=English_United States.1252 ## [3] LC_MONETARY=English_United States.1252 ## [4] LC_NUMERIC=C ## [5] LC_TIME=English_United States.1252 ## ## attached base packages: ## [1] grid stats graphics grDevices utils datasets methods ## [8] base ## ## other attached packages: ## [1] lattice_0.20-33 vcd_1.4-1 GGally_0.5.0 ggplot2_1.0.1 ## [5] reshape2_1.4.1 dplyr_0.4.2 xlsx_0.5.7 xlsxjars_0.6.1 ## [9] rJava_0.9-6 foreign_0.8-65 knitr_1.10.5 ## ## loaded via a namespace (and not attached): ## [1] Rcpp_0.12.0 magrittr_1.5 MASS_7.3-43 munsell_0.4.2 ## [5] colorspace_1.2-6 R6_2.1.0 stringr_1.0.0 highr_0.5 ## [9] plyr_1.8.3 tools_3.2.2 parallel_3.2.2 gtable_0.1.2 ## [13] DBI_0.3.1 lazyeval_0.1.10 lmtest_0.9-34 assertthat_0.1 ## [17] digest_0.6.8 formatR_1.2 evaluate_0.7 labeling_0.3 ## [21] stringi_0.5-5 scales_0.2.5 reshape_0.8.5 zoo_1.7-12 ## [25] proto_0.3-10

`R`

code can be entered into the command line directly
or saved to a script, which can be run inside a session using
the `source`

function

Commands are separated either by a `;`

or by a newline.

`R`

is case sensitive.

The `#`

character at the beginning of a line signifies
a comment, which is not executed.

Help files for `R`

functions are accessed by preceding the name of
the function with `?`

(e.g. `?require`

).

`??keyword`

searches R documentation for `keyword`

(e.g. `??logistic`

)

`R`

stores both data and output from data analysis (as well as
everything else) in *objects*

Things are assigned to and stored in objects using
the `<-`

or `=`

operator

A list of all objects in the current session can be obtained
with `ls()`

# assign the number 3 to object called abc abc <- 3 # list all objects in current session ls()

## [1] "abc" "buildInfo" "d" "dall" "dat.csv" ## [6] "dat.dta" "dat.spss" "dat.tab" "dat.xls" "dboth" ## [11] "ddropped" "dfemale" "dmale" "duse" "f" ## [16] "m" "m2" "m3" "m3b" "m4" ## [21] "newdat" "tab" "tab2" "tab3" "testtab2"

`R`

works most easily with datasets stored as text files. Typically,
values in text files are separated, or delimited, by tabs or spaces:

gender id race ses schtyp prgtype read write math science socst 0 70 4 1 1 general 57 52 41 47 57 1 121 4 2 1 vocati 68 59 53 63 31 0 86 4 3 1 general 44 33 54 58 31 0 141 4 3 1 vocati 63 44 47 53 56

or by commas (CSV file):

gender,id,race,ses,schtyp,prgtype,read,write,math,science,socst 0,70,4,1,1,general,57,52,41,47,57 1,121,4,2,1,vocati,68,59,53,63,61 0,86,4,3,1,general,44,33,54,58,31 0,141,4,3,1,vocati,63,44,47,53,56

Base R functions `read.table`

and `read.csv`

can read in data stored as text files, delimited by almost
anything (notice the `sep =`

option)

Although we are retrieving files over the internet for this class, these functions are typically used for files saved to disk.

Note how we are assigning the loaded data to objects.

# comma separated values dat.csv <- read.csv("http://www.ats.ucla.edu/stat/data/hsb2.csv") # tab separated values dat.tab <- read.table("http://www.ats.ucla.edu/stat/data/hsb2.txt", header=TRUE, sep = "\t")

We can read in datasets from other statistical analysis software using functions found
in the `foreign`

package

require(foreign) # SPSS files dat.spss <- read.spss("http://www.ats.ucla.edu/stat/data/hsb2.sav", to.data.frame=TRUE) # Stata files dat.dta <- read.dta("http://www.ats.ucla.edu/stat/data/hsb2.dta")

Datasets are often saved as Excel spreadsheets. Here we utilize the `xlsx`

package and `Java`

to download an Excel dataset.

# these two steps only needed to read excel files from the internet f <- tempfile("hsb2", fileext=".xls") download.file("http://www.ats.ucla.edu/stat/data/hsb2.xls", f, mode="wb") dat.xls <- read.xlsx(f, sheetIndex=1)

If you have trouble getting `Java`

and
the `xlsx`

package installed and working, just
click "save as" in Excel and export the data to a comma separated values file (.csv).

R has ways to look at the dataset at a glance or as a whole.

# first few rows head(dat.csv)

## id female race ses schtyp prog read write math science socst ## 1 70 0 4 1 1 1 57 52 41 47 57 ## 2 121 1 4 2 1 3 68 59 53 63 61 ## 3 86 0 4 3 1 1 44 33 54 58 31 ## 4 141 0 4 3 1 3 63 44 47 53 56 ## 5 172 0 4 2 1 2 47 52 57 53 61 ## 6 113 0 4 2 1 2 44 52 51 63 61

# last few rows tail(dat.csv)

## id female race ses schtyp prog read write math science socst ## 195 179 1 4 2 2 2 47 65 60 50 56 ## 196 31 1 2 2 2 1 55 59 52 42 56 ## 197 145 1 4 2 1 3 42 46 38 36 46 ## 198 187 1 4 2 2 1 57 41 57 55 52 ## 199 118 1 4 2 1 1 55 62 58 58 61 ## 200 137 1 4 3 1 2 63 65 65 53 61

# variable names colnames(dat.csv)

## [1] "id" "female" "race" "ses" "schtyp" "prog" "read" ## [8] "write" "math" "science" "socst"

# pop-up view of entire data set (uncomment to run) # View(dat.csv)

Once read in, datasets in R are typically stored as *data frames*, which
have a matrix structure. Observations are arranged as rows and
variables, either numerical or categorical, are arranged as
columns.

Individual rows, columns, and cells in a data frame can be accessed through many methods of indexing.

We most commonly use `object[row,column]`

notation.

# single cell value dat.csv[2,3]

## [1] 4

# omitting row value implies all rows; here all rows in column 3 dat.csv[,3]

## [1] 4 4 4 4 4 4 3 1 4 3 4 4 4 4 3 4 4 4 4 4 4 4 3 1 1 3 4 4 4 2 4 4 4 4 4 ## [36] 4 4 4 1 4 4 4 4 3 4 4 3 4 4 1 2 4 1 4 4 1 4 1 4 1 4 4 4 4 4 4 4 4 4 1 ## [71] 4 4 4 4 4 1 4 4 4 1 4 4 4 1 4 4 4 4 4 4 2 4 4 1 4 4 4 4 1 4 4 4 3 4 4 ## [106] 4 4 4 3 4 4 1 4 4 1 4 4 4 4 3 1 4 4 4 3 4 4 2 4 3 4 2 4 4 4 4 4 3 1 3 ## [141] 1 4 4 1 4 4 4 4 1 3 3 4 4 1 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 1 3 2 3 ## [176] 4 4 4 4 4 4 4 4 4 2 2 4 2 4 3 4 4 4 2 4 2 4 4 4 4

# omitting column values implies all columns; here all columns in row 2 dat.csv[2,]

## id female race ses schtyp prog read write math science socst ## 2 121 1 4 2 1 3 68 59 53 63 61

# can also use ranges - rows 2 and 3, columns 2 and 3 dat.csv[2:3, 2:3]

## female race ## 2 1 4 ## 3 0 4

We can also access variables directly by using their names, either
with `object[,"variable"]`

notation
or `object$variable`

notation.

# get first 10 rows of variable female using two methods dat.csv[1:10, "female"]

## [1] 0 1 0 0 0 0 0 0 0 0

dat.csv$female[1:10]

## [1] 0 1 0 0 0 0 0 0 0 0

The `c`

function is widely used to combine values of
common type together to form a vector.

For example, it can be used to access non-sequential rows and columns from a data frame.

# get column 1 for rows 1, 3 and 5 dat.csv[c(1,3,5), 1]

## [1] 70 86 172

# get row 1 values for variables female, prog and socst dat.csv[1,c("female", "prog", "socst")]

## female prog socst ## 1 0 1 57

If there were no variable names, or we wanted to change the names, we could use `colnames`

.

colnames(dat.csv) <- c("ID", "Sex", "Ethnicity", "SES", "SchoolType", "Program", "Reading", "Writing", "Math", "Science", "SocialStudies") # to change one variable name, just use indexing colnames(dat.csv)[1] <- "ID2"

We can save our data in a number of formats, including text, Excel .xlsx, and in other statistical software formats like Stata .dta. Note that the code below is commented, so it will not run.

The function `write.dta`

comes
from the `foreign`

package, while `write.xlsx`

comes from the `xlsx`

package.

#write.csv(dat.csv, file = "path/to/save/filename.csv") #write.table(dat.csv, file = "path/to/save/filename.txt", sep = "\t", na=".") #write.dta(dat.csv, file = "path/to/save/filename.dta") #write.xlsx(dat.csv, file = "path/to/save/filename.xlsx", sheetName="hsb2") # save to binary R format (can save multiple datasets and R objects) #save(dat.csv, dat.dta, dat.spss, dat.txt, file = "path/to/save/filename.RData")

Now we're going to read some data in and store it in the
object, `d`

. We prefer short names for objects that we will
use frequently.

We can now easily explore and get to know these data, which contain a number of school, test, and demographic variables for 200 students.

d <- read.csv("http://www.ats.ucla.edu/stat/data/hsb2.csv")

Using `dim`

, we get the number of observations(rows) and
variables(columns) in `d`

.

Using `str`

, we get the structure of `d`

, including the class(type) of all variables

dim(d)

## [1] 200 11

str(d)

## 'data.frame': 200 obs. of 11 variables: ## $ id : int 70 121 86 141 172 113 50 11 84 48 ... ## $ female : int 0 1 0 0 0 0 0 0 0 0 ... ## $ race : int 4 4 4 4 4 4 3 1 4 3 ... ## $ ses : int 1 2 3 3 2 2 2 2 2 2 ... ## $ schtyp : int 1 1 1 1 1 1 1 1 1 1 ... ## $ prog : int 1 3 1 3 2 2 1 2 1 2 ... ## $ read : int 57 68 44 63 47 44 50 34 63 57 ... ## $ write : int 52 59 33 44 52 52 59 46 57 55 ... ## $ math : int 41 53 54 47 57 51 42 45 54 52 ... ## $ science: int 47 63 58 53 53 63 53 39 58 50 ... ## $ socst : int 57 61 31 56 61 61 61 36 51 51 ...

`summary`

is a generic function to summarize many types
of `R`

objects, including datasets.

When used on a dataset, `summary`

returns distributional
summaries of variables in the dataset.

summary(d)

## id female race ses ## Min. : 1.0 Min. :0.000 Min. :1.00 Min. :1.00 ## 1st Qu.: 50.8 1st Qu.:0.000 1st Qu.:3.00 1st Qu.:2.00 ## Median :100.5 Median :1.000 Median :4.00 Median :2.00 ## Mean :100.5 Mean :0.545 Mean :3.43 Mean :2.06 ## 3rd Qu.:150.2 3rd Qu.:1.000 3rd Qu.:4.00 3rd Qu.:3.00 ## Max. :200.0 Max. :1.000 Max. :4.00 Max. :3.00 ## schtyp prog read write ## Min. :1.00 Min. :1.00 Min. :28.0 Min. :31.0 ## 1st Qu.:1.00 1st Qu.:2.00 1st Qu.:44.0 1st Qu.:45.8 ## Median :1.00 Median :2.00 Median :50.0 Median :54.0 ## Mean :1.16 Mean :2.02 Mean :52.2 Mean :52.8 ## 3rd Qu.:1.00 3rd Qu.:2.25 3rd Qu.:60.0 3rd Qu.:60.0 ## Max. :2.00 Max. :3.00 Max. :76.0 Max. :67.0 ## math science socst ## Min. :33.0 Min. :26.0 Min. :26.0 ## 1st Qu.:45.0 1st Qu.:44.0 1st Qu.:46.0 ## Median :52.0 Median :53.0 Median :52.0 ## Mean :52.6 Mean :51.9 Mean :52.4 ## 3rd Qu.:59.0 3rd Qu.:58.0 3rd Qu.:61.0 ## Max. :75.0 Max. :74.0 Max. :71.0

If we want conditional summaries, for example only for those students with
high reading scores (read >= 60), we first subset the data using `filter`

, then
summarize as usual.

`R`

permits nested function calls, where the results of
one function are passed directly as an argument to another
function. Here, `filter`

returns a dataset containing
observations where `read`

>= 60. This data subset is then
passed to `summary`

to obtain distributions of the
variables in the subset.

Note that `filter`

is in the `dplyr`

package.

summary(filter(d, read >= 60))

## id female race ses ## Min. : 3.0 Min. :0.000 Min. :1.0 Min. :1.00 ## 1st Qu.: 76.5 1st Qu.:0.000 1st Qu.:4.0 1st Qu.:2.00 ## Median :108.5 Median :0.000 Median :4.0 Median :3.00 ## Mean :109.8 Mean :0.482 Mean :3.7 Mean :2.38 ## 3rd Qu.:143.2 3rd Qu.:1.000 3rd Qu.:4.0 3rd Qu.:3.00 ## Max. :200.0 Max. :1.000 Max. :4.0 Max. :3.00 ## schtyp prog read write ## Min. :1.00 Min. :1.00 Min. :60.0 Min. :43.0 ## 1st Qu.:1.00 1st Qu.:2.00 1st Qu.:63.0 1st Qu.:57.0 ## Median :1.00 Median :2.00 Median :65.0 Median :60.0 ## Mean :1.18 Mean :1.95 Mean :65.5 Mean :59.5 ## 3rd Qu.:1.00 3rd Qu.:2.00 3rd Qu.:68.0 3rd Qu.:65.0 ## Max. :2.00 Max. :3.00 Max. :76.0 Max. :67.0 ## math science socst ## Min. :35.0 Min. :44.0 Min. :41.0 ## 1st Qu.:55.8 1st Qu.:55.0 1st Qu.:56.0 ## Median :60.5 Median :61.0 Median :61.0 ## Mean :60.2 Mean :59.7 Mean :60.9 ## 3rd Qu.:66.2 3rd Qu.:65.2 3rd Qu.:66.0 ## Max. :75.0 Max. :74.0 Max. :71.0

We can separate the data in other ways, such as by groups. Let's look
at the means of the last 5 variables for each type of
program, `prog`

.

Here, we are asking the `by`

function to apply
the `colMeans`

function to variables located in columns 7
through 11, and to calculate those means by groups denoted in the
variable `prog`

.

by(d[, 7:11], d$prog, colMeans)

## d$prog: 1 ## read write math science socst ## 49.756 51.333 50.022 52.444 50.600 ## -------------------------------------------------------- ## d$prog: 2 ## read write math science socst ## 56.162 56.257 56.733 53.800 56.695 ## -------------------------------------------------------- ## d$prog: 3 ## read write math science socst ## 46.20 46.76 46.42 47.22 45.02

Typically it is easier to inspect variable distributions with
graphics. For the following graphics, we use the `ggplot2`

package. Histograms are often used for continuous variable distributions...

ggplot(d, aes(x = write)) + geom_histogram()

...as are kernel density plots...

ggplot(d, aes(x = write)) + geom_density()

...as well as boxplots, which show the median, lower and upper quartiles (or hinges) and the full range.

ggplot(d, aes(x = 1, y = math)) + geom_boxplot()

We can also plot graphs by group to better understand our data. Here we examine the
densities of `write`

for each type of program, `prog`

.

# density plots by program type ggplot(d, aes(x = write)) + geom_density() + facet_wrap(~ prog)

We could also view boxplots of `math`

by type of
program.

Here we plot math scores for each group in `prog`

.

ggplot(d, aes(x = factor(prog), y = math)) + geom_boxplot()

Here we demonstrate the flexibility of `R`

by plotting the distributions of all of our continuous
variables in a single boxplot.

The function `melt`

from
the `reshape2`

package stacks the values located in
columns 7 through 11 of our dataset on top
of one another to form one column called "value", and stacks the
variable names associated with each value in a
second column called "variable".

Thus we are asking for a boxplot of "values" grouped by "variable".

Try running just `melt(d[, 7:11])`

first. You'll notice there are 1000 rows and 2 columns, since it stacks the 200 values of five different variables on top of one another.

ggplot(melt(d[, 7:11]), aes(x = variable, y = value)) + geom_boxplot()

Finally we can get boxplots of these same variables coloured by program
type by specifying a `fill`

argument.

ggplot(melt(d[, 6:11], id.vars = "prog"), aes(x = variable, y = value, fill = factor(prog))) + geom_boxplot()

This presentation has made heavy use of the `ggplot2`

package
to visualize data in `R`

. `ggplot2`

is an implementation of
Leland Wilkinson's *Grammar of Graphics* which is a framework for thinking
about and creating graphs.

However, there are many other packages and ways to make graphics in `R`

.
Another very popular system uses the `lattice`

package. This is a recommended
package, which means that it ships with `R`

so everyone has it.
Here are some examples.

# load lattice require(lattice) # simple scatter plot xyplot(read ~ write, data = d)

# conditioned scatter plot xyplot(read ~ write | prog, data = d)

# conditioning on two variables xyplot(read ~ write | prog * schtyp, data = d)

# box and whisker plots bwplot(read ~ factor(prog), data = d)

It is also possible to make graphs in base `R`

, but often they look
much simpler and less aesthetically pleasing, *unless* you take the time
to customize them.

We can look at the distributions of categorical variables with frequency tables.

xtabs( ~ female, data = d)

## female ## 0 1 ## 91 109

xtabs( ~ race, data = d)

## race ## 1 2 3 4 ## 24 11 20 145

xtabs( ~ prog, data = d)

## prog ## 1 2 3 ## 45 105 50

Two-way cross tabs.

xtabs( ~ ses + schtyp, data = d)

## schtyp ## ses 1 2 ## 1 45 2 ## 2 76 19 ## 3 47 11

Three-way cross tabs.

(tab3 <- xtabs( ~ ses + prog + schtyp, data = d))

## , , schtyp = 1 ## ## prog ## ses 1 2 3 ## 1 14 19 12 ## 2 17 30 29 ## 3 8 32 7 ## ## , , schtyp = 2 ## ## prog ## ses 1 2 3 ## 1 2 0 0 ## 2 3 14 2 ## 3 1 10 0

We can test whether `ses`

and `schtyp`

are
independent with a permutation test from the `vcd`

package.

(tab2 <- xtabs( ~ ses + schtyp, data = d))

## schtyp ## ses 1 2 ## 1 45 2 ## 2 76 19 ## 3 47 11

set.seed(10) (testtab2 <- coindep_test(tab2, n = 5000))

## ## Permutation test for conditional independence ## ## data: tab2 ## f(x) = 2.01, p-value = 0.025

The area of each cell in a mosaic plot corresponds to the
frequency from the crosstabs. `mosaic`

comes from
the `vcd`

package.

# simple mosaic plot mosaic(tab2)

To visually understand whether the variables are independent, let's shade the cells based on pearsonized residuals from what we would observe if the data were independent.

mosaic(tab2, gp = shading_hsv, gp_args = list(p.value = testtab2$p.value, interpolate = -1:2))

We can condition just like we did with cross tabs, to visualize a three-way cross tabs.

cotabplot(~ ses + prog | schtyp, data = d, panel = cotab_coindep, n = 5000)

As a last step in our data exploration, we would like some quick looks at bivariate (pairwise) relationships in our data. Correlation matrices provide quick summaries of these pairwise relationships.

If there are no missing data, we can use the `cor`

function with default arguments. Otherwise, we could use
the `use`

argument to get listwise or pairwise
deletion. Check the help file (`?cor`

) for how to use the `use`

argument.

Here we are requesting all pairwise correlations among variables in columns 7 through 11.

cor(d[, 7:11])

## read write math science socst ## read 1.00000 0.59678 0.66228 0.63016 0.62148 ## write 0.59678 1.00000 0.61745 0.57044 0.60479 ## math 0.66228 0.61745 1.00000 0.63073 0.54448 ## science 0.63016 0.57044 0.63073 1.00000 0.46511 ## socst 0.62148 0.60479 0.54448 0.46511 1.00000

If there are missing data, for listwise deletion, use only complete observations.

cor(d[, 7:11], use = "complete.obs")

## read write math science socst ## read 1.00000 0.59678 0.66228 0.63016 0.62148 ## write 0.59678 1.00000 0.61745 0.57044 0.60479 ## math 0.66228 0.61745 1.00000 0.63073 0.54448 ## science 0.63016 0.57044 0.63073 1.00000 0.46511 ## socst 0.62148 0.60479 0.54448 0.46511 1.00000

If there are missing data, for pairwise deletion, use pairwise complete observations.

cor(d[, 7:11], use = "pairwise.complete.obs")

## read write math science socst ## read 1.00000 0.59678 0.66228 0.63016 0.62148 ## write 0.59678 1.00000 0.61745 0.57044 0.60479 ## math 0.66228 0.61745 1.00000 0.63073 0.54448 ## science 0.63016 0.57044 0.63073 1.00000 0.46511 ## socst 0.62148 0.60479 0.54448 0.46511 1.00000

We can inspect univariate and bivariate relationships using a scatter plot matrix.

ggpairs(d[, 7:11])

This section demonstrates reordering and modifying your data.

Let's begin by reading in our dataset again and storing it in
object `d`

.

# read data in and store in an easy to use name to save typing d <- read.csv("http://www.ats.ucla.edu/stat/data/hsb2.csv")

We can sort data using the `arrange`

function from the `dplyr`

package.

Here we are requesting that `arrange`

sort by `female`

, and then by
`math`

. The function ` arrange`

returns the sorted dataset .

library(dplyr) d <- arrange(d, female, math) head(d)

## id female race ses schtyp prog read write math science socst ## 1 167 0 4 2 1 1 63 49 35 66 41 ## 2 128 0 4 3 1 2 39 33 38 47 41 ## 3 49 0 3 3 1 3 50 40 39 49 47 ## 4 22 0 1 2 1 3 42 39 39 56 46 ## 5 134 0 4 1 1 1 44 44 39 34 46 ## 6 117 0 4 3 1 3 34 49 39 42 56

Within `R`

, categorical variables are typically coded as *factors*, a special class that allows for value
labeling.

Here, using the `factor`

function, we convert all categorical variables to factors and
label their values.

To spare us from unnecessary typing, we use the `mutate`

function from the `dplyr`

package to let `R`

know that all conversions to
factors and labeling should
occur within the dataset `d`

#library(dplyr) str(d)

## 'data.frame': 200 obs. of 11 variables: ## $ id : int 167 128 49 22 134 117 153 140 58 133 ... ## $ female : int 0 0 0 0 0 0 0 0 0 0 ... ## $ race : int 4 4 3 1 4 4 4 4 4 4 ... ## $ ses : int 2 3 3 2 1 3 2 2 2 2 ... ## $ schtyp : int 1 1 1 1 1 1 1 1 1 1 ... ## $ prog : int 1 2 3 3 1 3 3 3 3 3 ... ## $ read : int 63 39 50 42 44 34 39 44 55 50 ... ## $ write : int 49 33 40 39 44 49 31 41 41 31 ... ## $ math : int 35 38 39 39 39 39 40 40 40 40 ... ## $ science: int 66 47 49 56 34 42 39 50 44 34 ... ## $ socst : int 41 41 47 46 46 56 51 26 41 31 ...

d <- mutate(d, id = factor(id), female = factor(female, levels = 0:1, labels = c("male", "female")), race = factor(race, levels = 1:4, labels = c("Hispanic", "Asian", "African American", "White")), schtyp = factor(schtyp, levels = 1:2, labels = c("public", "private")), prog = factor(prog, levels = 1:3, labels = c("general", "academic", "vocational")) )

Here are the results of our changing our categorical variables to factors.

str(d)

## 'data.frame': 200 obs. of 11 variables: ## $ id : Factor w/ 200 levels "1","2","3","4",..: 167 128 49 22 134 117 153 140 58 133 ... ## $ female : Factor w/ 2 levels "male","female": 1 1 1 1 1 1 1 1 1 1 ... ## $ race : Factor w/ 4 levels "Hispanic","Asian",..: 4 4 3 1 4 4 4 4 4 4 ... ## $ ses : int 2 3 3 2 1 3 2 2 2 2 ... ## $ schtyp : Factor w/ 2 levels "public","private": 1 1 1 1 1 1 1 1 1 1 ... ## $ prog : Factor w/ 3 levels "general","academic",..: 1 2 3 3 1 3 3 3 3 3 ... ## $ read : int 63 39 50 42 44 34 39 44 55 50 ... ## $ write : int 49 33 40 39 44 49 31 41 41 31 ... ## $ math : int 35 38 39 39 39 39 40 40 40 40 ... ## $ science: int 66 47 49 56 34 42 39 50 44 34 ... ## $ socst : int 41 41 47 46 46 56 51 26 41 31 ...

summary(d)

## id female race ses ## 1 : 1 male : 91 Hispanic : 24 Min. :1.00 ## 2 : 1 female:109 Asian : 11 1st Qu.:2.00 ## 3 : 1 African American: 20 Median :2.00 ## 4 : 1 White :145 Mean :2.06 ## 5 : 1 3rd Qu.:3.00 ## 6 : 1 Max. :3.00 ## (Other):194 ## schtyp prog read write ## public :168 general : 45 Min. :28.0 Min. :31.0 ## private: 32 academic :105 1st Qu.:44.0 1st Qu.:45.8 ## vocational: 50 Median :50.0 Median :54.0 ## Mean :52.2 Mean :52.8 ## 3rd Qu.:60.0 3rd Qu.:60.0 ## Max. :76.0 Max. :67.0 ## ## math science socst ## Min. :33.0 Min. :26.0 Min. :26.0 ## 1st Qu.:45.0 1st Qu.:44.0 1st Qu.:46.0 ## Median :52.0 Median :53.0 Median :52.0 ## Mean :52.6 Mean :51.9 Mean :52.4 ## 3rd Qu.:59.0 3rd Qu.:58.0 3rd Qu.:61.0 ## Max. :75.0 Max. :74.0 Max. :71.0 ##

Often we need to create variables from other variables. For example, we may want to sum individual test items to form a total score. Or, we may want to convert a continuous scale into several categories, such as letter grades.

We use the `mutate`

function to tell R that all variables created and referenced are within the `d`

dataset.

Below we create a total score and use the `cut`

function to recode continuous
ranges into categories.

#create a total score as sum of 4 scores #recode into letter grade by breaking into categories #library(dplyr) d <- mutate(d, total = read+write+math+science, grade = cut(total, breaks = c(0, 140, 180, 210, 234, 300), labels = c("F", "D", "C", "B", "A")) ) # view results summary(d[, c("total", "grade")])

## total grade ## Min. :139 F: 1 ## 1st Qu.:180 D:51 ## Median :210 C:50 ## Mean :210 B:49 ## 3rd Qu.:234 A:49 ## Max. :277

We can get standardized (Z) scores with the `scale`

function, and get
average `read`

scores for each level of `ses`

using the `ave`

function, which itself calls `mean`

. We use `mutate`

again to simplify the coding.

#library(dplyr) d <- mutate(d, zread = as.numeric(scale(read)), readmean = ave(read, ses, FUN = mean) ) head(d[, c("read", "zread", "readmean")])

## read zread readmean ## 1 63 1.05043 51.579 ## 2 39 -1.29036 56.500 ## 3 50 -0.21750 56.500 ## 4 42 -0.99776 51.579 ## 5 44 -0.80270 48.277 ## 6 34 -1.77803 56.500

We can also take the mean of a set of variables, ignoring missing values, which is useful for creating composite scores.

d$rowmean <- rowMeans(d[, 7:10], na.rm=TRUE)

At times it is convenient to look at directory contents within
an `R`

session. Let's get the current working directory, list its files, and change the
working directory.

getwd()

## [1] "C:/website/stat/r/seminars"

list.files()

## [1] "figures" "intro.htm" "intro.R" ## [4] "intro.Rhtml" "R.svg" "R.vsd" ## [7] "Repeated_Measures" "Rmatrix.htm" "ui"

```
#setwd("/path/to/directory")
```

Suppose we would like to subset our dataset into 2 datsets, one for
each gender. Here we use the `filter`

function to split
the dataset into 2 subsets, and we store each subset in a new object.

dfemale <- filter(d, female == "female") dmale <- filter(d, female == "male")

Often, datasets come with many more variable than we want. We can
also use `select`

to keep only the variables we
need.

# note that select is special, so we do not need to quote the variable names duse <- select(d, id, female, read, write) # note the - preceding c(female... , which means drop these variables ddropped <- select(d, -c(female, read, write))

If we were given separate files for males and females but wanted to
stack them (add observations) we can append the datasets together
row-wise with `rbind`

.

dboth <- rbind(dfemale, dmale) dim(dfemale)

## [1] 109 16

dim(dmale)

## [1] 91 16

dim(dboth)

## [1] 200 16

If instead we were given separate files describing the same students, but
with different variables, we could merge datasets to combine both
sets of variables into 1 dataset, using
the `merge`

function. Note that we do not need to use the
same variable as the id in both datasets. We could have
said, `by.x = "id.x"`

, `by.y = "id.y"`

.

dall <- merge(duse, ddropped, by = "id", all = TRUE) dim(duse)

## [1] 200 4

dim(ddropped)

## [1] 200 13

dim(dall)

## [1] 200 16

We often analyze relationships between categorical variables using chi square
tests. `chisq.test`

can use raw data, or you can give it
a frequency table. We do the latter.

(tab <- xtabs(~ ses + schtyp, data = d))

## schtyp ## ses public private ## 1 45 2 ## 2 76 19 ## 3 47 11

chisq.test(tab)

## ## Pearson's Chi-squared test ## ## data: tab ## X-squared = 6.33, df = 2, p-value = 0.042

If you are concerned about small cell sizes, you can use Monte Carlo simulation. Here we do 90,000 simulations.

chisq.test(tab, simulate.p.value=TRUE, B = 90000)

## ## Pearson's Chi-squared test with simulated p-value (based on 90000 ## replicates) ## ## data: tab ## X-squared = 6.33, df = NA, p-value = 0.042

`t.test`

performs t-tests, used to compare pairs of means. Here we show a one sample t-test
comparing the mean of `write`

to a mean of 50, and a paired samples t-test comparing the
means of `write`

and `read`

.

t.test(d$write, mu = 50)

## ## One Sample t-test ## ## data: d$write ## t = 4.14, df = 199, p-value = 5.1e-05 ## alternative hypothesis: true mean is not equal to 50 ## 95 percent confidence interval: ## 51.453 54.097 ## sample estimates: ## mean of x ## 52.775

with(d, t.test(write, read, paired = TRUE))

## ## Paired t-test ## ## data: write and read ## t = 0.867, df = 199, p-value = 0.39 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## -0.69414 1.78414 ## sample estimates: ## mean of the differences ## 0.545

Perhaps we would like to compare the means of `write`

between males and females. Here are independent samples t-tests with equal variance assumed and
not assumed.

t.test(write ~ female, data = d, var.equal=TRUE)

## ## Two Sample t-test ## ## data: write by female ## t = -3.73, df = 198, p-value = 0.00025 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## -7.4418 -2.2981 ## sample estimates: ## mean in group male mean in group female ## 50.121 54.991

t.test(write ~ female, data = d)

## ## Welch Two Sample t-test ## ## data: write by female ## t = -3.66, df = 170, p-value = 0.00034 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## -7.4992 -2.2407 ## sample estimates: ## mean in group male mean in group female ## 50.121 54.991

ANOVAs and ordinary least-squares regression are just linear
models, so we use `lm`

for both. You should typically
store the results of an estimation command in an object, as it often
contains a wealth of information.

The object storing the results of `lm`

can then be
supplied to `anova`

for an ANOVA table partitioning the
variance sequentially, or to `summary`

for a table of regression
parameters and measures of fit.

If you would like ANOVAs using other types of sums of squares, get
the `car`

package from CRAN.

m <- lm(write ~ prog * female, data = d) anova(m)

## Analysis of Variance Table ## ## Response: write ## Df Sum Sq Mean Sq F value Pr(>F) ## prog 2 3176 1588 23.25 8.9e-10 *** ## female 1 1129 1129 16.53 7.0e-05 *** ## prog:female 2 326 163 2.39 0.095 . ## Residuals 194 13249 68 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(m)

## ## Call: ## lm(formula = write ~ prog * female, data = d) ## ## Residuals: ## Min 1Q Median 3Q Max ## -21.62 -5.14 1.04 6.12 21.17 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 49.14 1.80 27.25 <2e-16 *** ## progacademic 5.47 2.17 2.52 0.0124 * ## progvocational -7.32 2.49 -2.93 0.0038 ** ## femalefemale 4.11 2.47 1.66 0.0979 . ## progacademic:femalefemale -1.14 2.95 -0.39 0.7005 ## progvocational:femalefemale 5.03 3.41 1.48 0.1413 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 8.26 on 194 degrees of freedom ## Multiple R-squared: 0.259, Adjusted R-squared: 0.24 ## F-statistic: 13.6 on 5 and 194 DF, p-value: 2.38e-11

We can easily update a model, for instance by adding a continuous
predictor, say `read`

, using `update`

.

summary(m2 <- update(m, . ~ . + read))

## ## Call: ## lm(formula = write ~ prog + female + read + prog:female, data = d) ## ## Residuals: ## Min 1Q Median 3Q Max ## -17.219 -4.467 0.426 4.608 14.507 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 22.9406 3.1881 7.20 1.3e-11 *** ## progacademic 3.8293 1.8147 2.11 0.03614 * ## progvocational -3.7044 2.1127 -1.75 0.08113 . ## femalefemale 7.0732 2.0806 3.40 0.00082 *** ## read 0.4948 0.0531 9.32 < 2e-16 *** ## progacademic:femalefemale -4.0012 2.4791 -1.61 0.10816 ## progvocational:femalefemale 1.5617 2.8598 0.55 0.58563 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 6.88 on 193 degrees of freedom ## Multiple R-squared: 0.489, Adjusted R-squared: 0.473 ## F-statistic: 30.8 on 6 and 193 DF, p-value: <2e-16

Using the `plot`

function on a object storing the
results of `lm`

will produce regression diagnostic
plots. Here we are arranging them in a 2x2 square using `par`

.

par(mfrow = c(2, 2)) plot(m2)

We can get specific diagnostic plots, such as a density plot of the residuals (assumed normally distributed for inference).

plot(density(resid(m2)))

Interactions are easy to add to models using `*`

. This notation also automatically
creates the lower order terms.

summary(m3 <- lm(write ~ prog * read, data = d))

## ## Call: ## lm(formula = write ~ prog * read, data = d) ## ## Residuals: ## Min 1Q Median 3Q Max ## -17.595 -4.773 0.557 5.375 18.405 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 27.33720 6.17869 4.42 1.6e-05 *** ## progacademic 2.82708 7.56135 0.37 0.70890 ## progvocational -2.88581 8.36821 -0.34 0.73058 ## read 0.48228 0.12214 3.95 0.00011 *** ## progacademic:read -0.01768 0.14413 -0.12 0.90250 ## progvocational:read 0.00059 0.17122 0.00 0.99725 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 7.48 on 194 degrees of freedom ## Multiple R-squared: 0.393, Adjusted R-squared: 0.377 ## F-statistic: 25.1 on 5 and 194 DF, p-value: <2e-16

To get the multi-degree of freedom test for the interaction, we can compare a model with and without the interaction.

m3b <- update(m3, . ~ . - prog:read) anova(m3b, m3)

## Analysis of Variance Table ## ## Model 1: write ~ prog + read ## Model 2: write ~ prog * read ## Res.Df RSS Df Sum of Sq F Pr(>F) ## 1 196 10861 ## 2 194 10860 2 1.37 0.01 0.99

We can get the estimated (predicted) cell means
using `predict`

.

First, using `expand.grid`

we create a new dataset containing
all possible crossings of predictor values for which we would like
to predict the outcome. We then use the model coefficients to
predict our cell means, and store them back in the new dataset.

newdat <- with(d, expand.grid(prog = levels(prog), female = levels(female))) (newdat <- cbind(newdat, predict(m, newdat, se=TRUE)))

## prog female fit se.fit df residual.scale ## 1 general male 49.143 1.8033 194 8.2639 ## 2 academic male 54.617 1.2054 194 8.2639 ## 3 vocational male 41.826 1.7231 194 8.2639 ## 4 general female 53.250 1.6869 194 8.2639 ## 5 academic female 57.586 1.0851 194 8.2639 ## 6 vocational female 50.963 1.5904 194 8.2639

Plots of predicted values can be nice.

ggplot(newdat, aes(x = prog, y = fit, colour = female)) + geom_errorbar(aes(ymin = fit - se.fit, ymax = fit + se.fit), width=.25) + geom_point(size=3)

This was a good example why it is nice to use good variable names. If we had, this graph would be ready to go.

colnames(newdat)[1:3] <- c("Program", "Sex", "Write") ggplot(newdat, aes(x = Program, y = Write, colour = Sex)) + geom_errorbar(aes(ymin = Write - se.fit, ymax = Write + se.fit), width=.25) + geom_point(size=3)

Generalized linear models include OLS regression, Poisson, and logistic. Let's look at logistic regression.

Notice we are storing the results of the regression in model object m4.

m4 <- glm(schtyp ~ ses + read, data = d, family = binomial) summary(m4)

## ## Call: ## glm(formula = schtyp ~ ses + read, family = binomial, data = d) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -0.809 -0.628 -0.549 -0.434 2.183 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -3.3804 1.0887 -3.10 0.0019 ** ## ses 0.4784 0.2918 1.64 0.1011 ## read 0.0131 0.0196 0.67 0.5046 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 175.87 on 199 degrees of freedom ## Residual deviance: 171.61 on 197 degrees of freedom ## AIC: 177.6 ## ## Number of Fisher Scoring iterations: 4

Odds ratios are common to report in logistic regression. These are
just the exponentiated coefficients. Here `coef`

is
extracting the coefficents from the model object. The coefficients
are then exponentiated.

exp(coef(m4))

## (Intercept) ses read ## 0.034034 1.613556 1.013178

We can also get confidence intervals for the coefficients or the odds ratios.

confint(m4)

## 2.5 % 97.5 % ## (Intercept) -5.592098 -1.302186 ## ses -0.083122 1.067576 ## read -0.025416 0.051926

exp(confint(m4))

## 2.5 % 97.5 % ## (Intercept) 0.0037272 0.27194 ## ses 0.9202387 2.90832 ## read 0.9749046 1.05330

For the ANOVA, we got estimated cell means. For logistic regression, let's get the predicted probabilities.

newdat <- expand.grid(ses = unique(d$ses), read = seq(min(d$read), max(d$read))) newdat$PredictedProbability <- predict(m4, newdat, type = "response")

Now we can plot the predicted probabilities to see
how `read`

scores and `ses`

affect the
probability of being in a private versus public school.

ggplot(newdat, aes(x = read, y = PredictedProbability, colour = factor(ses))) + geom_line()

We can use nonparametric tests when we are not sure that our data meet the distributional assumptions of the statistical inference tests we would like to use.

# equivalent of one sampe t-test wilcox.test(d$write, mu = 50)

## ## Wilcoxon signed rank test with continuity correction ## ## data: d$write ## V = 13200, p-value = 3.7e-05 ## alternative hypothesis: true location is not equal to 50

# Wilcoxon sign-ranked test, equivalent of paired samples t-test with(d, wilcox.test(write, read, paired=TRUE))

## ## Wilcoxon signed rank test with continuity correction ## ## data: write and read ## V = 9260, p-value = 0.37 ## alternative hypothesis: true location shift is not equal to 0

# Wilcoxon test, equivalent of indepdent samples t-test wilcox.test(write ~ female, data = d)

## ## Wilcoxon rank sum test with continuity correction ## ## data: write by female ## W = 3610, p-value = 0.00087 ## alternative hypothesis: true location shift is not equal to 0

# Kruskal-Wallis test, equivalent of one-way ANOVA kruskal.test(write ~ prog, data = d)

## ## Kruskal-Wallis rank sum test ## ## data: write by prog ## Kruskal-Wallis chi-squared = 34, df = 2, p-value = 4e-08

Built-in reference manual is available:

`?lm`

and searchable:

`??regression`

Many packages include vignettes, for more detailed examples.

`vignette()`

For those of you switching packages:

- R for SAS and SPSS Users - Muenchen 2011
- R for Stata Users - Muenchen and Hilbe 2010

Available for free on SpringerLink.

There is an extensive list of books on `R`

maintained at
http://www.r-project.org/doc/bib/R-books.html.