Link to all slides:

www.ats.ucla.edu/stat/r/seminars/intro.htm

`R` is a programming environment

- uses a well-developed but simple programming language
- allows for rapid development of new tools according to user demand
- these tools are distributed as packages, which any user can download to customize the `R` environment

Base `R` and most `R` packages are available for download from the Comprehensive R Archive Network (CRAN)

- cran.r-project.org
- base `R` comes with a number of basic data management, analysis, and graphical tools
- `R`'s power and flexibility, however, lie in its array of packages (currently around 6,000!)

You can work directly in `R`, but most users prefer a graphical interface. For starters:

More advanced users may prefer a good text editor with plugins for syntax highlighting, code completion, etc. for `R`, such as:

For the purposes of this seminar, we will be using the following packages frequently:

- `foreign` package to read data files from other stats packages
- `readxl` package for reading Excel files
- `dplyr` package for various data management tasks
- `reshape2` package to easily melt data to long form
- `ggplot2` package for elegant data visualization using the *Grammar of Graphics*
- `GGally` package for scatter plot matrices
- `vcd` package for visualizing and analyzing categorical data

To use packages in `R`, we must first install them using the `install.packages` function, which typically downloads the package from CRAN and installs it for use.

If we know we will need a particular package for our current R session, we must load it into the R environment using the `library` or `require` functions.

For Windows and Mac, there are binary files to install `R`. You can download these and just click to install them. The defaults are sensible, and unless you have a reason to do otherwise, it is generally a good idea to accept them.

On Linux distributions that use `apt-get` (such as Debian and Ubuntu), you can use `apt-get install r-base` if that distribution maintains a version of `R`.

For Windows, Mac, and Linux, if you have the appropriate tools installed, you can build `R` from source. The definitive guide to this is available at http://cran.r-project.org/doc/manuals/R-admin.html

Once `R` is installed, assuming you have sufficient privileges, you can install and update most `R` packages. To install a package, use:

`install.packages("packagename")`

To update packages, you can update them all at once using:

`update.packages()`

You can also specify a particular CRAN mirror to use. For example, the UCLA Statistics Department maintains a mirror:

`install.packages("packagename", repos = "http://cran.stat.ucla.edu/")`

You can use this code to either load the package, or if that does not work, install the package and then load it.
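One way to write the load-or-install pattern just described is sketched below. The helper name `load_or_install` is hypothetical (it is not part of base `R`), and the sketch assumes a CRAN mirror is already configured if an install is actually needed.

```r
# Hypothetical helper: attach a package, installing it first if it is missing.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # only reached when the package is not installed
  }
  library(pkg, character.only = TRUE)
}

# "stats" ships with R, so this call never triggers an install
load_or_install("stats")
```

Passing the package name as a string (with `character.only = TRUE`) is what lets the same helper be reused for any package, for example `load_or_install("foreign")`.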

To get a description of the version of R and its attached packages used in the current session, we can use the `sessionInfo` function.

`R` code can be entered into the command line directly or saved to a script, which can be run inside a session using the `source` function.

Commands are separated either by a `;` or by a newline.

`R` is case sensitive.

The `#` character starts a comment, which runs to the end of the line and is not executed.

Help files for `R` functions are accessed by preceding the name of the function with `?` (e.g. `?require`).

`??keyword` searches R documentation for `keyword` (e.g. `??logistic`).

`R` stores both data and output from data analysis (as well as everything else) in *objects*.

Things are assigned to and stored in objects using the `<-` or `=` operator.

A list of all objects in the current session can be obtained with `ls()`.
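As a small illustration of assignment and `ls()`, the sketch below creates a few objects and lists them. The object names (`x`, `y`, `total`) are made up for this example.

```r
x <- 5            # assignment with <-
y = c(1, 2, 3)    # assignment with = also works at the top level
total <- x + sum(y); total   # two commands on one line, separated by ;

# R is case sensitive: `Total` would be a different (here, undefined) object
ls()              # lists the objects created so far
```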

`R` works most easily with datasets stored as text files. Typically, values in text files are separated, or delimited, by tabs or spaces:

    gender id race ses schtyp prgtype read write math science socst
    0 70 4 1 1 general 57 52 41 47 57
    1 121 4 2 1 vocati 68 59 53 63 61
    0 86 4 3 1 general 44 33 54 58 31
    0 141 4 3 1 vocati 63 44 47 53 56

or by commas (CSV file):

    gender,id,race,ses,schtyp,prgtype,read,write,math,science,socst
    0,70,4,1,1,general,57,52,41,47,57
    1,121,4,2,1,vocati,68,59,53,63,61
    0,86,4,3,1,general,44,33,54,58,31
    0,141,4,3,1,vocati,63,44,47,53,56

Base R functions `read.table` and `read.csv` can read in data stored as text files, delimited by almost anything (notice the `sep =` option).

Although we are retrieving files over the internet for this class, these functions are typically used for files saved to disk.

Note how we are assigning the loaded data to objects.
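To try these functions without any files, the sketch below feeds the same four sample rows to `read.table` and `read.csv` through their `text` argument. In practice the first argument would be a file path or URL; the object names `d_space` and `d_csv` are chosen only for this example.

```r
space_text <- "gender id race ses schtyp prgtype read write math science socst
0 70 4 1 1 general 57 52 41 47 57
1 121 4 2 1 vocati 68 59 53 63 61
0 86 4 3 1 general 44 33 54 58 31
0 141 4 3 1 vocati 63 44 47 53 56"

csv_text <- "gender,id,race,ses,schtyp,prgtype,read,write,math,science,socst
0,70,4,1,1,general,57,52,41,47,57
1,121,4,2,1,vocati,68,59,53,63,61
0,86,4,3,1,general,44,33,54,58,31
0,141,4,3,1,vocati,63,44,47,53,56"

# sep = "" means "any whitespace" (spaces or tabs); header = TRUE reads column names
d_space <- read.table(text = space_text, header = TRUE, sep = "")

# read.csv is read.table with sep = "," and header = TRUE as defaults
d_csv <- read.csv(text = csv_text)

d_csv
```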

We can read in datasets from other statistical analysis software using functions found in the `foreign` package.

Datasets are often saved as Excel spreadsheets. Here we utilize the `readxl` package to read in the Excel file. We need to download the file first.

R has ways to look at the dataset at a glance or as a whole.

Once read in, datasets in R are typically stored as *data frames*, which
have a matrix structure. Observations are arranged as rows and
variables, either numerical or categorical, are arranged as
columns.

Individual rows, columns, and cells in a data frame can be accessed through many methods of indexing.

We most commonly use `object[row,column]` notation.

We can also access variables directly by using their names, either with `object[,"variable"]` notation or `object$variable` notation.

The `c` function is widely used to combine values of a common type into a vector.

For example, it can be used to access non-sequential rows and columns from a data frame.
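The indexing notations and the use of `c` can be tried on a small data frame. The data frame `toy` below is a made-up stand-in for the seminar dataset.

```r
toy <- data.frame(
  id    = c(70, 121, 86, 141),
  read  = c(57, 68, 44, 63),
  write = c(52, 59, 33, 44)
)

toy[2, 3]        # single cell: row 2, column 3
toy[1:2, ]       # rows 1 and 2, all columns
toy[, "read"]    # one variable by name
toy$write        # the same idea with $ notation

# c() builds the vectors used to pick non-sequential rows and columns
toy[c(1, 4), c("id", "write")]
```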

If there were no variable names, or we wanted to change the names, we could use `colnames`.

We can save our data in a number of formats, including text, Excel .xlsx, and in other statistical software formats like Stata .dta. Note that the code below is commented, so it will not run.

The function `write.dta` comes from the `foreign` package, while `write.xlsx` comes from the `xlsx` package.

Now we're going to read some data in and store it in the object `d`. We prefer short names for objects that we will use frequently.

We can now easily explore and get to know these data, which contain a number of school, test, and demographic variables for 200 students.

Using `dim`, we get the number of observations (rows) and variables (columns) in `d`.

Using `str`, we get the structure of `d`, including the *class* of `d` and the data *type* of all column variables (usually one of "numeric", "integer", "logical", or "character").

R objects belong to *classes*. Most functions only accept objects of a specific class, so it is important to know the classes of our objects. Objects can belong to more than one class, and users can define classes to control the inputs of their functions.

The `class` function lists all classes to which the object belongs. If `class` returns a basic data type (e.g. "numeric", "character", "integer"), the object has an implicit class of "vector" (array) for one-dimensional objects and "matrix" for multi-dimensional objects.
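A quick way to see `dim`, `str`, and `class` at work is on a tiny data frame. The data frame `toy` below is invented for illustration and is much smaller than the seminar dataset.

```r
toy <- data.frame(
  id      = c(70L, 121L, 86L),
  prgtype = c("general", "vocati", "general"),
  read    = c(57, 68, 44)
)

dim(toy)         # number of rows, then number of columns
str(toy)         # class of the object plus the type of each column
class(toy)       # "data.frame"
class(toy$read)  # "numeric"
class(toy$id)    # "integer"
```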

Generic functions remove the need for the user to remember the classes of objects that functions support.

Generic functions accept objects from multiple classes. They then pass the object to a specific function (called a *method*) designed for the object's class. The methods for different classes can have widely diverging purposes.

For example, when passing a data.frame to the generic `plot` function, `plot` passes the data.frame to a function called `plot.data.frame`, which creates a scatter plot matrix of all variables in the data.frame. In contrast, passing a regression model object to `plot` produces regression diagnostic plots instead.

If given a generic function name as an argument, the `methods` function will list all specific functions (methods) that the generic function searches for a class match. If given a class as an argument, `methods` will find all specific functions that accept that class.
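To make the dispatch mechanism concrete, here is a minimal sketch of a user-defined generic with two methods. The generic `describe` is hypothetical and exists only for this example; `UseMethod` and `methods` are the real base `R` functions.

```r
# A toy generic: UseMethod() picks the method that matches the class of x
describe <- function(x, ...) UseMethod("describe")

describe.default <- function(x, ...) "some other kind of object"

describe.data.frame <- function(x, ...) {
  paste("data frame with", nrow(x), "rows and", ncol(x), "columns")
}

toy <- data.frame(a = 1:3, b = c("x", "y", "z"))
describe(toy)      # dispatches to describe.data.frame
describe("text")   # no character method, so describe.default is used

methods(describe)              # methods available for this generic
methods(class = "data.frame")  # methods that accept a data frame
```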

`summary` is a generic function to summarize many types of `R` objects, including datasets.

When used on a data frame, `summary` returns distributional summaries of variables in the dataset.

If we want conditional summaries, for example only for those students with high reading scores (read >= 60), we first subset the data using `filter`, then summarize as usual.

`R` permits nested function calls, where the results of one function are passed directly as an argument to another function. Here, `filter` returns a dataset containing observations where `read` >= 60. This data subset is then passed to `summary` to obtain distributions of the variables in the subset.

Note that `filter` is in the `dplyr` package.
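The nested call described above might look like the following sketch. It assumes the `dplyr` package is installed, and it uses a small invented data frame `toy` rather than the seminar data.

```r
library(dplyr)

toy <- data.frame(
  id    = 1:5,
  read  = c(57, 68, 44, 63, 60),
  write = c(52, 59, 33, 44, 62)
)

# Two steps: subset first, then summarize
high <- filter(toy, read >= 60)
summary(high)

# The same thing as one nested call
summary(filter(toy, read >= 60))
```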

We can separate the data in other ways, such as by groups. Let's look at the means of the last 5 variables for each type of program, `prog`.

Here, we are asking the `by` function to apply the `colMeans` function to variables located in columns 7 through 11, and to calculate those means by groups denoted in the variable `prog`.
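The same `by` and `colMeans` combination can be sketched on a small invented data frame. Here the score variables sit in columns 2 through 4 rather than 7 through 11, so the column index differs from the seminar code.

```r
toy <- data.frame(
  prog  = c("general", "general", "vocational", "vocational", "academic", "academic"),
  read  = c(57, 44, 68, 63, 50, 60),
  write = c(52, 33, 59, 44, 55, 65),
  math  = c(41, 54, 53, 47, 60, 70)
)

# Apply colMeans to the score columns, separately for each level of prog
group_means <- by(toy[, 2:4], toy$prog, colMeans)
group_means

# Results for one group can be pulled out by name
group_means[["general"]]
```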

Typically it is easier to inspect variable distributions with graphics. For the following graphics, we use the `ggplot2` package. Histograms are often used for continuous variable distributions...

...as are kernel density plots...

...as well as boxplots, which show the median, lower and upper quartiles (or hinges) and the full range.

We can also plot graphs by group to better understand our data. Here we examine the densities of `write` for each type of program, `prog`.

We could also view boxplots of `math` by type of program.

Here we plot math scores for each group in `prog`.

Here we demonstrate the flexibility of `R` by plotting the distributions of all of our continuous variables in a single boxplot.

The function `melt` from the `reshape2` package stacks the values located in columns 7 through 11 of our dataset on top of one another to form one column called "value", and stacks the variable names associated with each value in a second column called "variable".

Thus we are asking for a boxplot of "value" grouped by "variable".

Try running just `melt(d[, 7:11])` first. You'll notice there are 1000 rows and 2 columns, since it stacks the 200 values of five different variables on top of one another.
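The stacking is easier to see on a tiny example. The sketch below assumes the `reshape2` package is installed and uses an invented data frame with 4 rows and 3 numeric variables, so the melted result has 4 x 3 = 12 rows.

```r
library(reshape2)

toy <- data.frame(
  read  = c(57, 68, 44, 63),
  write = c(52, 59, 33, 44),
  math  = c(41, 53, 54, 47)
)

# With no id variables, melt() treats every column as a measured variable
long <- melt(toy)

dim(long)    # rows = (rows of toy) x (number of variables); 2 columns
head(long)   # columns are named "variable" and "value"
```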

Finally we can get boxplots of these same variables coloured by program type by specifying a `fill` argument.

This presentation has made heavy use of the `ggplot2` package to visualize data in `R`. `ggplot2` is an implementation of Leland Wilkinson's *Grammar of Graphics*, which is a framework for thinking about and creating graphs.

However, there are many other packages and ways to make graphics in `R`. Another very popular system uses the `lattice` package. This is a recommended package, which means that it ships with `R` so everyone has it. Here are some examples.

It is also possible to make graphs in base `R`, but often they look much simpler and less aesthetically pleasing, *unless* you take the time to customize them.

We can look at the distributions of categorical variables with frequency tables.

Two-way cross tabs.

Three-way cross tabs.
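Frequency tables and cross tabs of this kind can be produced with base `R`'s `table` and `xtabs`. The sketch below uses an invented data frame; the variable names mirror the seminar data but the values are made up.

```r
toy <- data.frame(
  ses    = c("low", "middle", "high", "middle", "low", "high", "middle", "high"),
  schtyp = c("public", "public", "private", "public", "public", "private", "private", "public"),
  female = c(0, 1, 1, 0, 1, 0, 1, 0)
)

one_way   <- table(toy$ses)                               # frequency table
two_way   <- xtabs(~ ses + schtyp, data = toy)            # two-way cross tab
three_way <- xtabs(~ ses + schtyp + female, data = toy)   # one two-way table per level of female

one_way
two_way
three_way
```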

We can test whether `ses` and `schtyp` are independent with a permutation test from the `vcd` package.

The area of each cell in a mosaic plot corresponds to the frequency from the crosstabs. `mosaic` comes from the `vcd` package.

To visually understand whether the variables are independent, let's shade the cells based on Pearson residuals, which measure how far the observed counts are from what we would expect if the variables were independent.

We can condition just like we did with cross tabs, to visualize a three-way cross tab.

As a last step in our data exploration, we would like some quick looks at bivariate (pairwise) relationships in our data. Correlation matrices provide quick summaries of these pairwise relationships.

If there are no missing data, we can use the `cor` function with default arguments. Otherwise, we could use the `use` argument to get listwise or pairwise deletion. Check the help file (`?cor`) for how to use the `use` argument.

Here we are requesting all pairwise correlations among variables in columns 7 through 11.

If there are missing data, for listwise deletion, use only complete observations.

If there are missing data, for pairwise deletion, use pairwise complete observations.
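The difference between the two deletion strategies shows up clearly on a small example with a couple of missing values. The data frame `toy` below is invented for this purpose.

```r
toy <- data.frame(
  read  = c(57, 68, 44, 63, 50),
  write = c(52, 59, 33, NA, 46),
  math  = c(41, 53, 54, 47, NA)
)

# Listwise deletion: only rows with no missing values at all (rows 1 to 3 here)
r_listwise <- cor(toy, use = "complete.obs")

# Pairwise deletion: each correlation uses every row complete for that pair
r_pairwise <- cor(toy, use = "pairwise.complete.obs")

r_listwise
r_pairwise
```

The read/write correlation is based on three rows under listwise deletion but four rows under pairwise deletion, which is why the two matrices differ.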

We can inspect univariate and bivariate relationships using a scatter plot matrix.

This section demonstrates reordering and modifying your data.

Let's begin by reading in our dataset again and storing it in object `d`.

We can sort data using the `arrange` function from the `dplyr` package.

Here we are requesting that `arrange` sort by `female`, and then by `math`. The function `arrange` returns the sorted dataset.
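A minimal sketch of this two-key sort, assuming `dplyr` is installed and using a small invented data frame in place of `d`:

```r
library(dplyr)

toy <- data.frame(
  female = c(1, 0, 1, 0, 1),
  math   = c(53, 41, 47, 54, 40)
)

# Sort by female first, then by math within each value of female
sorted <- arrange(toy, female, math)
sorted

# arrange() returns a new data frame; toy itself is left unchanged
toy
```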

Within `R`, categorical variables are typically coded as *factors*, a special class that allows for value labeling.

Here, using the `factor` function, we convert all categorical variables to factors and label their values.

To spare us from unnecessary typing, we use the `mutate` function from the `dplyr` package to let `R` know that all conversions to factors and labeling should occur within the dataset `d`.

Here are the results of changing our categorical variables to factors.

Often we need to create variables from other variables. For example, we may want to sum individual test items to form a total score. Or, we may want to convert a continuous scale into several categories, such as letter grades.

We use the `mutate` function to tell R that all variables created and referenced are within the `d` dataset.

Below we create a total score and use the `cut` function to recode continuous ranges into categories.
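The recoding step can be sketched with `cut` on its own, outside of `mutate`. The scores, break points, and grade labels below are invented for illustration and are not the ones used in the seminar.

```r
toy <- data.frame(
  read  = c(40, 65, 50, 70, 30),
  write = c(40, 60, 50, 70, 35),
  math  = c(40, 60, 50, 70, 30)
)

# Total score as the sum of the three tests
toy$total <- toy$read + toy$write + toy$math

# By default each interval is closed on the right: (0,140], (140,180], ...
toy$grade <- cut(toy$total,
                 breaks = c(0, 140, 180, 210, 300),
                 labels = c("D", "C", "B", "A"))
toy
```

Because the intervals are right-closed, a total of exactly 210 falls in the third interval ("B"), not the fourth.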

We can get standardized (Z) scores with the `scale` function, and get average `read` scores for each level of `ses` using the `ave` function, which itself calls `mean`. We use `mutate` again to simplify the coding.
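Both functions can be tried in base `R` without `mutate`. The sketch below uses an invented data frame, and the new column names `zread` and `meanread` are chosen only for this example.

```r
toy <- data.frame(
  ses  = c("low", "low", "high", "high", "high"),
  read = c(40, 50, 60, 70, 80)
)

# scale() returns a one-column matrix, so convert it back to a plain vector
toy$zread <- as.numeric(scale(toy$read))

# ave() applies mean() within each level of ses and repeats the result per row
toy$meanread <- ave(toy$read, toy$ses)

toy
```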

We can also take the mean of a set of variables, ignoring missing values, which is useful for creating composite scores.
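One common way to build such a composite is `rowMeans` with `na.rm = TRUE`, sketched here on an invented data frame with a few missing scores.

```r
toy <- data.frame(
  read  = c(57, NA, 44),
  write = c(52, 59, NA),
  math  = c(41, 53, 54)
)

# Mean of the available scores in each row; missing values are skipped
toy$composite <- rowMeans(toy[, c("read", "write", "math")], na.rm = TRUE)
toy
```

Each row's mean is taken over however many scores are present, so the second and third rows average two values rather than three.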

At times it is convenient to look at directory contents within an `R` session. Let's get the current working directory, list its files, and change the working directory.

Suppose we would like to subset our dataset into 2 datasets, one for each gender. Here we use the `filter` function to split the dataset into 2 subsets, and we store each subset in a new object.

Often, datasets come with many more variables than we want. We can also use `select` to keep only the variables we need.

If we were given separate files for males and females but wanted to stack them (add observations), we can append the datasets together row-wise with `rbind`.

If instead we were given separate files describing the same students, but with different variables, we could merge datasets to combine both sets of variables into 1 dataset, using the `merge` function. Note that the id variable does not need to have the same name in both datasets. We could have said, `by.x = "id.x"`, `by.y = "id.y"`.
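Stacking with `rbind` and joining with `merge` can both be sketched on small invented data frames. The column name `student_id` is hypothetical and is used to show `by.x` and `by.y` with differently named keys.

```r
males   <- data.frame(id = c(1, 2), female = 0, read = c(57, 44))
females <- data.frame(id = c(3, 4), female = 1, read = c(68, 63))

# Add observations: both data frames must have the same columns
both <- rbind(males, females)

# A second dataset on the same students, with the key under a different name
scores <- data.frame(student_id = c(4, 3, 2, 1), math = c(47, 53, 54, 41))

# Add variables: match rows of `both` and `scores` on their respective keys
merged <- merge(both, scores, by.x = "id", by.y = "student_id")
merged
```

The merged result keeps the key under the name used in the first dataset (`id`) and is sorted by that key.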

We often analyze relationships between categorical variables using chi square tests. `chisq.test` can use raw data, or you can give it a frequency table. We do the latter.

If you are concerned about small cell sizes, you can use Monte Carlo simulation. Here we do 90,000 simulations.
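Both versions of the test can be sketched on a made-up frequency table. The counts below are invented; `simulate.p.value` and `B` are the `chisq.test` arguments that request a Monte Carlo p-value and set the number of simulated tables.

```r
set.seed(42)  # the simulated p-value depends on random draws

tab <- as.table(matrix(
  c(12, 5, 7, 9, 4, 10), nrow = 2,
  dimnames = list(schtyp = c("public", "private"),
                  ses    = c("low", "middle", "high"))
))

# Standard chi-square test on the frequency table
res <- chisq.test(tab)
res

# Monte Carlo version with 90,000 simulated tables
res_sim <- chisq.test(tab, simulate.p.value = TRUE, B = 90000)
res_sim
```

The test statistic is the same in both cases; only the way the p-value is obtained changes.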

`t.test` performs t-tests, used to compare pairs of means. Here we show a one sample t-test comparing the mean of `write` to a mean of 50, and a paired samples t-test comparing the means of `write` and `read`.

Perhaps we would like to compare the means of `write` between males and females. Here are independent samples t-tests with equal variance assumed and not assumed.
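The four tests just described might be written as in the sketch below. The data are simulated purely for illustration, so the numbers will not match the seminar output.

```r
set.seed(1)
toy <- data.frame(female = rep(c(0, 1), each = 20))
toy$write <- rnorm(40, mean = 52, sd = 9)
toy$read  <- toy$write + rnorm(40, mean = 0, sd = 5)

# One sample: is the mean of write different from 50?
t_one <- t.test(toy$write, mu = 50)

# Paired samples: write versus read for the same students
t_paired <- t.test(toy$write, toy$read, paired = TRUE)

# Independent samples, equal variances assumed
t_equal <- t.test(write ~ female, data = toy, var.equal = TRUE)

# Independent samples, equal variances not assumed (Welch, the default)
t_welch <- t.test(write ~ female, data = toy)

t_equal
t_welch
```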

ANOVAs and ordinary least-squares regression are just linear models, so we use `lm` for both. You should typically store the results of an estimation command in an object, as it often contains a wealth of information.

The object storing the results of `lm` can then be supplied to `anova` for an ANOVA table partitioning the variance sequentially, or to `summary` for a table of regression parameters and measures of fit.
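A minimal sketch of this workflow, using simulated data with a three-level program variable (the values are invented, so results will differ from the seminar):

```r
set.seed(2)
toy <- data.frame(prog = factor(rep(c("general", "academic", "vocational"), each = 20)))
toy$write <- 50 + 5 * (toy$prog == "academic") + rnorm(60, sd = 8)

# Store the fitted model in an object
m <- lm(write ~ prog, data = toy)

anova(m)     # sequential ANOVA table
summary(m)   # coefficients, standard errors, R-squared

# The stored object holds more: coefficients, residuals, fitted values, ...
names(m)
```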

If you would like ANOVAs using other types of sums of squares, get the `car` package from CRAN.

We can easily update a model, for instance by adding a continuous predictor, say `read`, using `update`.

Using the `plot` function on an object storing the results of `lm` will produce regression diagnostic plots. Here we are arranging them in a 2x2 square using `par`.

We can get specific diagnostic plots, such as a density plot of the residuals (assumed normally distributed for inference).

Interactions are easy to add to models using `*`. This notation also automatically creates the lower order terms.

To get the multi-degree of freedom test for the interaction, we can compare a model with and without the interaction.
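The comparison can be sketched with `update` and `anova` on simulated data. The design below is balanced and invented for illustration; the model names `m_main` and `m_int` are arbitrary.

```r
set.seed(3)
toy <- data.frame(
  prog   = factor(rep(c("general", "academic", "vocational"), times = 40)),
  female = factor(rep(c("male", "female"), each = 60)),
  read   = rnorm(120, mean = 52, sd = 10)
)
toy$write <- 20 + 0.6 * toy$read + rnorm(120, sd = 7)

# Main effects only
m_main <- lm(write ~ prog + female + read, data = toy)

# Add the interaction; same model as lm(write ~ prog * female + read, data = toy)
m_int <- update(m_main, . ~ . + prog:female)

# F test of the interaction: (3 - 1) x (2 - 1) = 2 degrees of freedom
cmp <- anova(m_main, m_int)
cmp
```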

We can get the estimated (predicted) cell means using `predict`.

First, using `expand.grid` we create a new dataset containing all possible crossings of predictor values for which we would like to predict the outcome. We then use the model coefficients to predict our cell means, and store them back in the new dataset.
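These two steps can be sketched as follows, using simulated data and a model with two crossed factors. The names `newdat` and `pred` are arbitrary choices for this example.

```r
set.seed(5)
toy <- data.frame(
  prog   = factor(rep(c("general", "academic", "vocational"), times = 20)),
  female = factor(rep(c("male", "female"), each = 30))
)
toy$write <- rnorm(60, mean = 52, sd = 9)

m <- lm(write ~ prog * female, data = toy)

# Every combination of the predictor levels: 3 x 2 = 6 rows
newdat <- expand.grid(prog = levels(toy$prog), female = levels(toy$female))

# Predicted cell means, stored alongside the predictor values
newdat$pred <- predict(m, newdata = newdat)
newdat
```

Because this model includes the full interaction, the predictions coincide with the observed mean of `write` in each cell.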

Plots of predicted values can be nice.

This is a good example of why it is nice to use good variable names. If we had, this graph would be ready to go.

Generalized linear models include OLS regression, Poisson, and logistic. Let's look at logistic regression.

Notice we are storing the results of the regression in model object m4.

Odds ratios are common to report in logistic regression. These are just the exponentiated coefficients. Here `coef` is extracting the coefficients from the model object. The coefficients are then exponentiated.

We can also get confidence intervals for the coefficients or the odds ratios.
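A sketch of these steps on simulated data follows. The outcome `private` and the single predictor `read` are invented stand-ins, and the model is simpler than the seminar's `m4`; `confint` here gives profile-likelihood intervals, which on older versions of `R` relies on the recommended `MASS` package being installed.

```r
set.seed(4)
toy <- data.frame(read = round(rnorm(200, mean = 52, sd = 10)))
toy$private <- rbinom(200, size = 1, prob = plogis(-3 + 0.04 * toy$read))

m <- glm(private ~ read, data = toy, family = binomial)

coef(m)        # coefficients on the log-odds scale
exp(coef(m))   # odds ratios

ci <- confint(m)   # confidence intervals for the coefficients
exp(ci)            # confidence intervals for the odds ratios

# Predicted probabilities at chosen reading scores
newdat <- data.frame(read = c(40, 50, 60, 70))
newdat$prob <- predict(m, newdata = newdat, type = "response")
newdat
```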

For the ANOVA, we got estimated cell means. For logistic regression, let's get the predicted probabilities.

Now we can plot the predicted probabilities to see how `read` scores and `ses` affect the probability of being in a private versus public school.

We can use nonparametric tests when we are not sure that our data meet the distributional assumptions of the statistical inference tests we would like to use.

Built-in reference manual is available:

`?lm`

and searchable:

`??regression`

Many packages include vignettes, for more detailed examples.

`vignette()`

For those of you switching packages:

- R for SAS and SPSS Users - Muenchen 2011
- R for Stata Users - Muenchen and Hilbe 2010

Available for free on SpringerLink.

There is an extensive list of books on `R` maintained at http://www.r-project.org/doc/bib/R-books.html.