R is a programming environment
R and most R packages are available for download from the Comprehensive R Archive Network (CRAN)
R comes with a number of basic data management, analysis, and graphical tools
R's power and flexibility, however, lie in its array of packages (currently more than 6,000!)
You can work directly in R, but most users prefer a graphical interface. For starters:
More advanced users may prefer a good text editor with plugins for syntax highlighting, code completion, etc. for R, such as:
For the purposes of this seminar, we will be using the following packages frequently:
foreign package to read data files from other stats packages
readxl package for reading Excel files
dplyr package for various data management tasks
reshape2 package to easily melt data to long form
ggplot2 package for elegant data visualization using the Grammar of Graphics
GGally package for scatter plot matrices
vcd package for visualizing and analyzing categorical data
To use packages in R, we must first install them using the install.packages function, which typically downloads the package from CRAN and installs it for use.
If we know we will need a particular package for our current R session, we must load it into the R environment using the library function.
For Windows and Mac, there are binary files to install R.
You can download these and just click to install them. The defaults are sensible, and unless you have a reason to change them, it is generally a good idea to accept them.
On various Linux distributions, you can use apt-get install r-base if that distribution maintains a version of R.
For Windows, Mac, and Linux, if you have the appropriate tools installed, you can build R from source. The definitive guide to this, the R Installation and Administration manual, is available on CRAN.
Once R is installed, assuming you have sufficient privileges, you can install and update most R packages. To install one, call install.packages with the package name in quotes.
To update all installed packages at once, call update.packages().
You can also specify a particular CRAN mirror to use. For example, the UCLA Statistics Department maintains a mirror:
install.packages("packagename", repos = "http://cran.stat.ucla.edu/")
You can use this code to either load the package, or if that does not work, install the package and then load it.
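One way to write such a load-or-install step is sketched below. The helper name load_or_install is hypothetical (it is not part of R or of this seminar), and we demonstrate it on stats, which ships with R, so nothing is actually downloaded.

```r
# Hypothetical helper: load a package, installing it from CRAN first
# only if it is not already available on this machine.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # uses the default CRAN mirror unless repos is set
  }
  library(pkg, character.only = TRUE)
}

# stats is part of every R installation, so this only loads it.
load_or_install("stats")
```

For a package that is missing, the same call would first try to install it, which requires an internet connection and write access to the package library.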
To get a description of the version of R and its attached packages used in the current session, we can use the sessionInfo function.
R code can be entered into the command line directly or saved to a script, which can be run inside a session using the source function.
Commands are separated either by a ; or by a newline.
R is case sensitive.
The # character at the beginning of a line signifies a comment, which is not executed.
Help files for R functions are accessed by preceding the name of the function with ? (e.g. ?require).
??keyword searches R documentation for keyword.
R stores both data and output from data analysis (as well as everything else) in objects.
Things are assigned to and stored in objects using the <- or = operator.
A list of all objects in the current session can be obtained with ls().
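The short sketch below pulls these basics together. The object names (x, y, X, total) are arbitrary and chosen only for illustration.

```r
# Everything after a # on a line is a comment and is not executed.
x <- 5; y = 10        # two commands on one line, separated by ;
X <- "upper case"     # R is case sensitive: X and x are different objects
total <- x + y        # the result is stored in a new object
total
ls()                  # lists the objects in the current session

# Help lookups are left commented out because they open a help viewer:
# ?mean               # help file for the function mean
# ??regression        # search the documentation for "regression"
```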
R works most easily with datasets stored as text files. Typically,
values in text files are separated, or delimited, by tabs or spaces:
gender id  race ses schtyp prgtype read write math science socst
0      70  4    1   1      general 57   52    41   47      57
1      121 4    2   1      vocati  68   59    53   63      61
0      86  4    3   1      general 44   33    54   58      31
0      141 4    3   1      vocati  63   44    47   53      56
or by commas (CSV file):
gender,id,race,ses,schtyp,prgtype,read,write,math,science,socst
0,70,4,1,1,general,57,52,41,47,57
1,121,4,2,1,vocati,68,59,53,63,61
0,86,4,3,1,general,44,33,54,58,31
0,141,4,3,1,vocati,63,44,47,53,56
Base R functions read.table and read.csv can read in data stored as text files, delimited by almost anything (notice the sep = option)
Although we are retrieving files over the internet for this class, these functions are typically used for files saved to disk.
Note how we are assigning the loaded data to objects.
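As a minimal sketch, we can read the four example rows shown earlier without touching the disk or the internet by passing them through the text = argument. With real files, the first argument would be a file path or URL instead. The object names dat.csv and dat.txt are our own choice.

```r
# The same four students, once comma-delimited and once space-delimited.
csv_text <- "gender,id,race,ses,schtyp,prgtype,read,write,math,science,socst
0,70,4,1,1,general,57,52,41,47,57
1,121,4,2,1,vocati,68,59,53,63,61
0,86,4,3,1,general,44,33,54,58,31
0,141,4,3,1,vocati,63,44,47,53,56"

txt_text <- "gender id race ses schtyp prgtype read write math science socst
0 70 4 1 1 general 57 52 41 47 57
1 121 4 2 1 vocati 68 59 53 63 61
0 86 4 3 1 general 44 33 54 58 31
0 141 4 3 1 vocati 63 44 47 53 56"

# read.csv assumes a header row and a comma separator.
dat.csv <- read.csv(text = csv_text)

# read.table needs to be told about the header and the delimiter.
dat.txt <- read.table(text = txt_text, header = TRUE, sep = " ")

dat.csv
```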
We can read in datasets from other statistical analysis software using functions found in the foreign package.
Datasets are often saved as Excel spreadsheets. Here we utilize the readxl package to read in the Excel file. We need to download the file first.
R has ways to look at the dataset at a glance or as a whole.
Once read in, datasets in R are typically stored as data frames, which have a matrix structure. Observations are arranged as rows and variables, either numerical or categorical, are arranged as columns.
Individual rows, columns, and cells in a data frame can be accessed through many methods of indexing.
We most commonly use object[row, column] notation.
We can also access variables directly by using their names, either with object[, "variable"] notation or object$variable notation.
The c function is widely used to combine values of common type together to form a vector.
For example, it can be used to access non-sequential rows and columns from a data frame.
If there were no variable names, or we wanted to change the names, we could use the colnames function.
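The indexing styles above can be sketched on the four example rows shown earlier. The new column name female is only an illustration of renaming.

```r
d <- read.csv(text = "gender,id,race,ses,schtyp,prgtype,read,write,math,science,socst
0,70,4,1,1,general,57,52,41,47,57
1,121,4,2,1,vocati,68,59,53,63,61
0,86,4,3,1,general,44,33,54,58,31
0,141,4,3,1,vocati,63,44,47,53,56")

d[2, 3]                       # single cell: row 2, column 3
d[2, ]                        # all of row 2
d[, 7]                        # all of column 7 (read)
d[, "math"]                   # a column by name
d$math                        # the same column with $ notation
d[c(1, 3), c("id", "read")]   # non-sequential rows, columns picked with c()

# Rename the first column (illustrative new name).
colnames(d)[1] <- "female"
colnames(d)
```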
We can save our data in a number of formats, including text, Excel .xlsx, and in other statistical software formats like Stata .dta. Note that the code below is commented, so it will not run.
The function write.dta comes from the foreign package, while write.xlsx comes from the xlsx package.
Now we're going to read some data in and store it in the object d. We prefer short names for objects that we will use frequently.
We can now easily explore and get to know these data, which contain a number of school, test, and demographic variables for 200 students.
Using dim, we get the number of observations (rows) and variables (columns) in d.
Using str, we get the structure of d, including the class of d and the data type of all column variables (usually one of "numeric", "integer", "logical", or "character").
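A minimal sketch of both functions follows, using only the four example rows shown earlier rather than the full 200-student file, so the dimensions differ from the seminar's.

```r
d <- read.csv(text = "gender,id,race,ses,schtyp,prgtype,read,write,math,science,socst
0,70,4,1,1,general,57,52,41,47,57
1,121,4,2,1,vocati,68,59,53,63,61
0,86,4,3,1,general,44,33,54,58,31
0,141,4,3,1,vocati,63,44,47,53,56")

dim(d)   # number of rows, then number of columns
str(d)   # class of d plus the type and first values of each column
```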
R objects belong to classes. Most functions only accept objects of a specific class, so it is important to know the classes of our objects. Objects can belong to more than one class, and users can define classes to control the inputs of their functions.
The class function lists all classes to which the object belongs. If class returns a basic data type (e.g. "numeric", "character", "integer"), the object has an implicit class of "vector" (array) for one-dimensional objects and "matrix" for multi-dimensional objects.
Generic functions remove the need for the user to remember the classes of objects that functions support.
Generic functions accept objects from multiple classes. They then pass the object to a specific function (called a method) designed for the object's class. The methods for different classes can have widely diverging purposes.
For example, when passing a data.frame to the generic plot function, plot passes the data.frame to a function called plot.data.frame, which creates a scatter plot matrix of all variables in the data.frame. In contrast, passing a regression model object to plot produces regression diagnostic plots instead.
If given a generic function name as an argument, the methods function will list all specific functions (methods) that the generic function searches for a class match. If given a class as an argument, methods will find all specific functions that accept that class.
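To make the dispatch mechanism concrete, here is a small sketch that defines a toy generic. The generic describe and its two methods are hypothetical and exist only for this illustration. The toy data frame is also made up.

```r
toy <- data.frame(id = c(1, 2), score = c(57, 68))
class(toy)         # "data.frame"
class(toy$score)   # "numeric"

# A hypothetical generic: UseMethod picks a method from the class of x.
describe <- function(x, ...) UseMethod("describe")

# Method used when x is a data frame.
describe.data.frame <- function(x, ...) {
  paste("data frame with", nrow(x), "rows")
}

# Fallback method for every other class.
describe.default <- function(x, ...) {
  paste("object of class", class(x)[1])
}

describe(toy)    # dispatches to describe.data.frame
describe(1:3)    # dispatches to describe.default

# Built-in generics work the same way; these list their methods.
head(methods("summary"))
head(methods(class = "data.frame"))
```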
summary is a generic function to summarize many types of R objects, including datasets.
When used on a data frame, summary returns distributional summaries of variables in the dataset.
If we want conditional summaries, for example only for those students with high reading scores (read >= 60), we first subset the data using filter, then summarize as usual.
R permits nested function calls, where the results of one function are passed directly as an argument to another function.
Here, filter returns a dataset containing only the observations where read >= 60. This data subset is then passed to summary to obtain distributions of the variables in the subset.
filter is in the dplyr package.
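The nested call can be sketched as follows. To keep the example free of package dependencies we swap in base R's subset as a stand-in for filter; with dplyr loaded, filter(d, read >= 60) would play the same role. The data are the four example rows shown earlier.

```r
d <- read.csv(text = "gender,id,race,ses,schtyp,prgtype,read,write,math,science,socst
0,70,4,1,1,general,57,52,41,47,57
1,121,4,2,1,vocati,68,59,53,63,61
0,86,4,3,1,general,44,33,54,58,31
0,141,4,3,1,vocati,63,44,47,53,56")

summary(d)   # distributional summaries of every column

# Nested call: the subset is built first, then handed to summary.
# subset() is a base R stand-in for dplyr's filter() here.
summary(subset(d, read >= 60))

# The same thing in two steps, keeping the subset for later use.
high <- subset(d, read >= 60)
summary(high$read)
```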
We can separate the data in other ways, such as by groups. Let's look at the means of the last 5 variables for each type of program.
Here, we are asking the by function to apply the colMeans function to variables located in columns 7 through 11, and to calculate those means by groups denoted in the program type variable.
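A minimal sketch of this call on the four example rows shown earlier, where the program type column is named prgtype as in the sample file header:

```r
d <- read.csv(text = "gender,id,race,ses,schtyp,prgtype,read,write,math,science,socst
0,70,4,1,1,general,57,52,41,47,57
1,121,4,2,1,vocati,68,59,53,63,61
0,86,4,3,1,general,44,33,54,58,31
0,141,4,3,1,vocati,63,44,47,53,56")

# Apply colMeans to columns 7 through 11 separately for each program type.
group_means <- by(d[, 7:11], d$prgtype, colMeans)
group_means

# Results for one group can be pulled out by name.
group_means[["general"]]
```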
Typically it is easier to inspect variable distributions with
graphics. For the following graphics, we use the
ggplot2 package. Histograms are often used for continuous variable distributions...
...as are kernel density plots...
...as well as boxplots, which show the median, lower and upper quartiles (or hinges) and the full range.
We can also plot graphs by group to better understand our data. Here we examine the distribution of write for each type of program.
We could also view boxplots of math by type of program.
Here we plot math scores for each program group.
Here we demonstrate the flexibility of R by plotting the distributions of all of our continuous variables in a single boxplot.
The melt function from the reshape2 package stacks the values located in columns 7 through 11 of our dataset on top of one another to form one column called "value", and stacks the variable names associated with each value in a second column called "variable".
Thus we are asking for a boxplot of "value" grouped by "variable".
Try running just melt(d[, 7:11]) first. You'll notice there are 1000 rows and 2 columns, since it stacks the 200 values of five different variables on top of one another.
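The stacking idea can be sketched without extra packages. Below we swap in base R's stack as a stand-in for melt (it names its columns "values" and "ind", so we rename them to match), and draw the boxplot with base graphics rather than ggplot2. With only the four example rows shown earlier, the long dataset has 20 rows instead of 1000.

```r
d <- read.csv(text = "gender,id,race,ses,schtyp,prgtype,read,write,math,science,socst
0,70,4,1,1,general,57,52,41,47,57
1,121,4,2,1,vocati,68,59,53,63,61
0,86,4,3,1,general,44,33,54,58,31
0,141,4,3,1,vocati,63,44,47,53,56")

# stack() is a base R stand-in for reshape2::melt() in this sketch.
long <- stack(d[, 7:11])
names(long) <- c("value", "variable")   # match melt's column names

dim(long)    # 4 students x 5 variables = 20 rows, 2 columns
head(long)

# One box per original variable.
boxplot(value ~ variable, data = long)
```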
Finally we can get boxplots of these same variables coloured by program type by specifying a fill (or colour) aesthetic.
This presentation has made heavy use of the ggplot2 package to visualize data in R.
ggplot2 is an implementation of Leland Wilkinson's Grammar of Graphics, which is a framework for thinking about and creating graphs.
However, there are many other packages and ways to make graphics in R.
Another very popular system uses the lattice package. This is a recommended package, which means that it ships with R so everyone has it.
Here are some examples.
It is also possible to make graphs in base
R, but often they look
much simpler and less aesthetically pleasing, unless you take the time
to customize them.
We can look at the distributions of categorical variables with frequency tables.
Two-way cross tabs.
Three-way cross tabs.
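A minimal sketch of one-, two-, and three-way tables with base R's table and xtabs, using the four example rows shown earlier:

```r
d <- read.csv(text = "gender,id,race,ses,schtyp,prgtype,read,write,math,science,socst
0,70,4,1,1,general,57,52,41,47,57
1,121,4,2,1,vocati,68,59,53,63,61
0,86,4,3,1,general,44,33,54,58,31
0,141,4,3,1,vocati,63,44,47,53,56")

# One-way frequency table.
tab1 <- table(d$prgtype)
tab1

# Two-way cross tab, written with a formula.
tab2 <- xtabs(~ gender + prgtype, data = d)
tab2

# Three-way cross tab: one gender-by-program table per level of ses.
tab3 <- xtabs(~ gender + prgtype + ses, data = d)
tab3
```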
We can test whether two categorical variables are independent with a permutation test from the vcd package.
The area of each cell in a mosaic plot corresponds to the frequency from the crosstabs.
mosaic comes from the vcd package.
To visually understand whether the variables are independent, let's shade the cells based on Pearson residuals, that is, standardized departures of the observed counts from what we would expect if the variables were independent.
We can condition just like we did with cross tabs to visualize a three-way cross tab.
As a last step in our data exploration, we would like some quick looks at bivariate (pairwise) relationships in our data. Correlation matrices provide quick summaries of these pairwise relationships.
If there are no missing data, we can use the cor function with default arguments. Otherwise, we could use the use argument to get listwise or pairwise deletion. Check the help file (?cor) for how to use the use argument.
Here we are requesting all pairwise correlations among variables in columns 7 through 11.
If there are missing data, for listwise deletion, use only complete observations.
If there are missing data, for pairwise deletion, use pairwise complete observations.
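The three cases can be sketched as follows. We use the four example rows shown earlier and blank out one math score ourselves to create a missing value, so that value is an artificial modification for illustration.

```r
d <- read.csv(text = "gender,id,race,ses,schtyp,prgtype,read,write,math,science,socst
0,70,4,1,1,general,57,52,41,47,57
1,121,4,2,1,vocati,68,59,53,63,61
0,86,4,3,1,general,44,33,54,58,31
0,141,4,3,1,vocati,63,44,47,53,56")

scores <- d[, 7:11]
cor(scores)                 # no missing data: defaults are fine

scores$math[2] <- NA        # artificially introduce one missing value
default_cor <- cor(scores)  # correlations involving math are now NA

# Listwise deletion: drop every row that has any missing value.
listwise <- cor(scores, use = "complete.obs")

# Pairwise deletion: for each pair, use all rows complete on that pair.
pairwise <- cor(scores, use = "pairwise.complete.obs")

listwise["read", "write"]   # based on 3 students
pairwise["read", "write"]   # based on all 4 students
```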
We can inspect univariate and bivariate relationships using a scatter plot matrix.
This section demonstrates reordering and modifying your data.
Let's begin by reading in our dataset again and storing it in d.
We can sort data using the arrange function from the dplyr package.
Here we are requesting that arrange sort by female, and then by math. The function arrange returns the sorted dataset.
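A dependency-free sketch of the same two-key sort is shown below, using base R's order as a stand-in for arrange. With dplyr loaded, arrange(d, gender, math) would be the equivalent call. In the four example rows the first column is named gender rather than female.

```r
d <- read.csv(text = "gender,id,race,ses,schtyp,prgtype,read,write,math,science,socst
0,70,4,1,1,general,57,52,41,47,57
1,121,4,2,1,vocati,68,59,53,63,61
0,86,4,3,1,general,44,33,54,58,31
0,141,4,3,1,vocati,63,44,47,53,56")

# order() returns the row positions that sort by gender, then by math.
# This is a base R stand-in for dplyr's arrange().
d_sorted <- d[order(d$gender, d$math), ]
d_sorted[, c("id", "gender", "math")]
```

Like arrange, this returns a sorted copy and leaves d itself unchanged unless we assign the result back to d.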
In R, categorical variables are typically coded as factors, a special class that allows for value labeling.
Here, using the factor function, we convert all categorical variables to factors and label their values.
To spare us from unnecessary typing, we use the mutate function from the dplyr package to let R know that all conversions to factors and labeling should occur within the dataset d.
Here are the results of changing our categorical variables to factors.
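A minimal sketch of the conversion follows. We use base R's within as a stand-in for mutate, and the value labels (male/female, low/middle/high) are our assumption about what the numeric codes mean; check the codebook for your own data.

```r
d <- read.csv(text = "gender,id,race,ses,schtyp,prgtype,read,write,math,science,socst
0,70,4,1,1,general,57,52,41,47,57
1,121,4,2,1,vocati,68,59,53,63,61
0,86,4,3,1,general,44,33,54,58,31
0,141,4,3,1,vocati,63,44,47,53,56")

# within() evaluates the assignments inside d, so we can write gender
# instead of d$gender. The labels are assumed meanings of the codes.
d <- within(d, {
  gender  <- factor(gender, levels = 0:1, labels = c("male", "female"))
  ses     <- factor(ses, levels = 1:3, labels = c("low", "middle", "high"))
  prgtype <- factor(prgtype)
})

str(d[, c("gender", "ses", "prgtype")])
table(d$gender)
```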
Often we need to create variables from other variables. For example, we may want to sum individual test items to form a total score. Or, we may want to convert a continuous scale into several categories, such as letter grades.
We use the mutate function to tell R that all variables created and referenced are within the dataset d.
Below we create a total score and use the cut function to recode continuous ranges into categories.
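As a sketch, we can build a total from four of the test scores and cut it into letter grades. The choice of scores and the break points below are hypothetical, chosen only to illustrate cut, and we assign columns directly in base R rather than through mutate.

```r
d <- read.csv(text = "gender,id,race,ses,schtyp,prgtype,read,write,math,science,socst
0,70,4,1,1,general,57,52,41,47,57
1,121,4,2,1,vocati,68,59,53,63,61
0,86,4,3,1,general,44,33,54,58,31
0,141,4,3,1,vocati,63,44,47,53,56")

# Total score: the sum of four test scores for each student.
d$total <- d$read + d$write + d$math + d$science

# Recode the total into grades. Intervals are (low, high], so a total
# of exactly 210 would fall in the "C" range. Break points are made up.
d$grade <- cut(d$total,
               breaks = c(0, 140, 180, 210, 234, 300),
               labels = c("F", "D", "C", "B", "A"))

d[, c("id", "total", "grade")]
```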
We can get standardized (Z) scores with the scale function, and get the mean of read scores for each level of a grouping variable with the ave function, which itself calls mean. We use mutate again to simplify the coding.
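Both functions can be sketched in base R on the four example rows shown earlier. We group by ses here as an illustrative choice, and the new column names zread and readmean are our own.

```r
d <- read.csv(text = "gender,id,race,ses,schtyp,prgtype,read,write,math,science,socst
0,70,4,1,1,general,57,52,41,47,57
1,121,4,2,1,vocati,68,59,53,63,61
0,86,4,3,1,general,44,33,54,58,31
0,141,4,3,1,vocati,63,44,47,53,56")

# scale() centers and divides by the standard deviation. It returns a
# one-column matrix, so we convert it to a plain numeric vector.
d$zread <- as.numeric(scale(d$read))

# ave() returns, for every row, the mean of read within that row's
# ses group (mean is the default summary function).
d$readmean <- ave(d$read, d$ses)

d[, c("id", "ses", "read", "zread", "readmean")]
```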
We can also take the mean of a set of variables, ignoring missing values, which is useful for creating composite scores.
At times it is convenient to look at directory contents within an R session. Let's get the current working directory, list its files, and change the working directory.
Suppose we would like to subset our dataset into 2 datasets, one for each gender. Here we use the filter function to split the dataset into 2 subsets, and we store each subset in a new object.
Often, datasets come with many more variables than we want. We can use select to keep only the variables we need.
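Both operations can be sketched with base R indexing, used here as a stand-in for dplyr's filter and select. We assume that gender 0 codes males and 1 codes females, and the object names are our own.

```r
d <- read.csv(text = "gender,id,race,ses,schtyp,prgtype,read,write,math,science,socst
0,70,4,1,1,general,57,52,41,47,57
1,121,4,2,1,vocati,68,59,53,63,61
0,86,4,3,1,general,44,33,54,58,31
0,141,4,3,1,vocati,63,44,47,53,56")

# Split rows into two datasets (assumed coding: 0 = male, 1 = female).
# With dplyr this would be filter(d, gender == 0) and so on.
dmale   <- d[d$gender == 0, ]
dfemale <- d[d$gender == 1, ]

# Keep only the columns we need.
# With dplyr this would be select(d, id, gender, read, write).
dsmall <- d[, c("id", "gender", "read", "write")]

nrow(dmale); nrow(dfemale); names(dsmall)
```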
If we were given separate files for males and females but wanted to stack them (add observations), we can append the datasets together using rbind.
If instead we were given separate files describing the same students, but with different variables, we could merge datasets to combine both sets of variables into 1 dataset, using the merge function. Note that the id variable does not need to have the same name in both datasets. We could have specified, for example, by.x = "id.x", by.y = "id.y".
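A minimal sketch of appending and merging on the four example rows shown earlier. The split into pieces and the renamed id columns (id.x, id.y) are constructed here only so that we have something to put back together.

```r
d <- read.csv(text = "gender,id,race,ses,schtyp,prgtype,read,write,math,science,socst
0,70,4,1,1,general,57,52,41,47,57
1,121,4,2,1,vocati,68,59,53,63,61
0,86,4,3,1,general,44,33,54,58,31
0,141,4,3,1,vocati,63,44,47,53,56")

# Appending: stack two datasets that have the same columns.
dmale   <- d[d$gender == 0, ]
dfemale <- d[d$gender == 1, ]
dboth   <- rbind(dmale, dfemale)

# Merging: two datasets on the same students with different variables,
# where the id column has a different name in each.
reading <- data.frame(id.x = d$id, read = d$read)
maths   <- data.frame(id.y = d$id, math = d$math)
dmerged <- merge(reading, maths, by.x = "id.x", by.y = "id.y")
dmerged
```

The merged result keeps the key under the name used in the first dataset and is sorted by that key.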
We often analyze relationships between categorical variables using chi-square tests.
chisq.test can use raw data, or you can give it a frequency table. We do the latter.
If you are concerned about small cell sizes, you can use Monte Carlo simulation. Here we do 90,000 simulations.
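Both versions of the test can be sketched as below. Because the seminar's data file is not available here, we use R's built-in mtcars dataset as a stand-in, cross-tabulating number of cylinders by transmission type. The seed is arbitrary and only makes the simulation reproducible.

```r
# Frequency table from a built-in dataset (stand-in for the seminar data).
tab <- table(mtcars$cyl, mtcars$am)
tab

# Standard chi-square test on the table. Some expected counts are small,
# so R warns that the approximation may be poor; we silence that here.
asym <- suppressWarnings(chisq.test(tab))
asym

# Monte Carlo p-value based on 90,000 simulated tables.
set.seed(1)
sim <- chisq.test(tab, simulate.p.value = TRUE, B = 90000)
sim
```

The test statistic is the same in both runs; only the way the p-value is obtained differs.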
t.test performs t-tests, used to compare pairs of means. Here we show a one sample t-test comparing the mean of write to a mean of 50, and a paired samples t-test comparing the means of two test scores.
Perhaps we would like to compare the means of a test score between males and females. Here are independent samples t-tests with equal variance assumed and not assumed.
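The four variants can be sketched with built-in datasets standing in for the seminar data: mtcars for the one-sample and independent-samples tests, and sleep (ten subjects measured under two drugs) for the paired test. The comparison value of 20 is arbitrary.

```r
# One-sample t-test: is mean mpg different from 20?
one <- t.test(mtcars$mpg, mu = 20)

# Paired t-test: the same 10 subjects measured under two conditions.
drug1 <- sleep$extra[sleep$group == 1]
drug2 <- sleep$extra[sleep$group == 2]
paired <- t.test(drug1, drug2, paired = TRUE)

# Independent samples, equal variances assumed (pooled variance).
pooled <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)

# Independent samples, equal variances not assumed (Welch, the default).
welch <- t.test(mpg ~ am, data = mtcars)

one; paired; pooled; welch
```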
ANOVAs and ordinary least-squares regression are just linear
models, so we use
lm for both. You should typically
store the results of an estimation command in an object, as it often
contains a wealth of information.
The object storing the results of lm can then be supplied to anova for an ANOVA table partitioning the variance sequentially, or to summary for a table of regression parameters and measures of fit.
If you would like ANOVAs using other types of sums of squares, get the car package from CRAN.
We can easily update a model, for instance by adding a continuous predictor, using the update function.
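A minimal sketch of this workflow follows, with the built-in mtcars dataset standing in for the seminar data. The model object names m and m2 are our own.

```r
# One-way ANOVA as a linear model: mpg by number of cylinders.
m <- lm(mpg ~ factor(cyl), data = mtcars)

anova(m)     # ANOVA table, variance partitioned sequentially
summary(m)   # regression coefficients and measures of fit

# Add a continuous predictor (weight) without retyping the model.
# ". ~ ." means "same outcome, same predictors as before".
m2 <- update(m, . ~ . + wt)
summary(m2)
```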
Using the plot function on an object storing the results of lm will produce regression diagnostic plots. Here we are arranging them in a 2x2 square using par.
We can get specific diagnostic plots, such as a density plot of the residuals (assumed normally distributed for inference).
Interactions are easy to add to models using *. This notation also automatically creates the lower order terms.
To get the multi-degree of freedom test for the interaction, we can compare a model with and without the interaction.
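The comparison can be sketched as follows, again with mtcars standing in for the seminar data and with model names of our own choosing.

```r
# Main effects only.
m_main <- lm(mpg ~ factor(cyl) + factor(am), data = mtcars)

# Using * adds both main effects and their interaction.
m_int <- lm(mpg ~ factor(cyl) * factor(am), data = mtcars)
names(coef(m_int))

# F test comparing the two nested models. The interaction between a
# 3-level and a 2-level factor uses (3 - 1) * (2 - 1) = 2 degrees of freedom.
cmp <- anova(m_main, m_int)
cmp
```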
We can get the estimated (predicted) cell means. Using expand.grid, we create a new dataset containing all possible crossings of predictor values for which we would like to predict the outcome. We then use the model coefficients to predict our cell means, and store them back in the new dataset.
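These steps can be sketched with an interaction model fitted to the built-in mtcars data, which stands in for the seminar data. The names newdat and pred are our own.

```r
m_int <- lm(mpg ~ factor(cyl) * factor(am), data = mtcars)

# Every combination of the predictor values we want predictions for.
newdat <- expand.grid(cyl = c(4, 6, 8), am = c(0, 1))

# predict() applies the model coefficients to each row of newdat.
newdat$pred <- predict(m_int, newdata = newdat)
newdat
```

Because this model includes the full interaction, each prediction equals the observed mean of mpg in that cell.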
Plots of predicted values can be nice.
This was a good example of why it is nice to use good variable names. If we had, this graph would be ready to go.
Generalized linear models include OLS regression, Poisson regression, and logistic regression. Let's look at logistic regression.
Notice we are storing the results of the regression in model object m4.
Odds ratios are common to report in logistic regression. These are just the exponentiated coefficients. Here coef is extracting the coefficients from the model object. The coefficients are then exponentiated.
We can also get confidence intervals for the coefficients or the odds ratios.
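A sketch of these steps follows, fitted to the built-in mtcars data as a stand-in: the outcome is transmission type (am, coded 0/1) and the predictor is car weight. We reuse the object name m4, but the model itself is not the seminar's. For the intervals we use confint.default, which gives Wald intervals; confint would give profile-likelihood intervals instead.

```r
# Logistic regression: family = binomial gives the logit link.
m4 <- glm(am ~ wt, data = mtcars, family = binomial)
summary(m4)

# Odds ratios are the exponentiated coefficients.
or <- exp(coef(m4))
or

# Wald confidence intervals, on the coefficient and odds ratio scales.
ci_coef <- confint.default(m4)
ci_or   <- exp(ci_coef)
cbind(OR = or, ci_or)
```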
For the ANOVA, we got estimated cell means. For logistic regression, let's get the predicted probabilities.
Now we can plot the predicted probabilities to see how read scores and ses affect the probability of being in a private versus public school.
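Getting and plotting predicted probabilities can be sketched as below. As a stand-in for the school data we again use mtcars, predicting the probability of a manual transmission from weight, and we draw the curve with base graphics rather than ggplot2. The grid of weights is arbitrary.

```r
m4 <- glm(am ~ wt, data = mtcars, family = binomial)

# Grid of predictor values to predict at.
newdat <- data.frame(wt = seq(2, 5, by = 0.5))

# type = "response" returns probabilities instead of log odds.
newdat$prob <- predict(m4, newdata = newdat, type = "response")
newdat

plot(newdat$wt, newdat$prob, type = "l",
     xlab = "Weight (1000 lbs)", ylab = "Predicted probability of manual")
```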
We can use nonparametric tests when we are not sure that our data meet the distributional assumptions of the statistical inference tests we would like to use.
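Two commonly used rank-based tests in base R are sketched below; which tests suit a given analysis is a judgment call, and these are only illustrations. The built-in mtcars data stands in for the seminar data.

```r
# Wilcoxon rank-sum (Mann-Whitney) test: a rank-based alternative to the
# independent samples t-test. exact = FALSE requests the normal
# approximation, which is needed here because the data contain ties.
wt_test <- wilcox.test(mpg ~ am, data = mtcars, exact = FALSE)
wt_test

# Kruskal-Wallis test: a rank-based alternative to one-way ANOVA.
kw_test <- kruskal.test(mpg ~ factor(cyl), data = mtcars)
kw_test
```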
Built-in reference manual is available:
Many packages include vignettes, for more detailed examples.
For those of you switching packages:
Available for free on SpringerLink.
There is an extensive list of books on
R maintained at