Introducing R

Statistical Consulting Group

UCLA Institute for Digital Research & Education

Links to slides and code

Link to all slides:

www.ats.ucla.edu/stat/r/seminars/intro.htm

For just the code (right-click to download):

www.ats.ucla.edu/stat/r/seminars/intro.R

Installing R and Packages

Introduction

R is a programming environment

Base R and packages

Base R and most R packages are available for download from the Comprehensive R Archive Network (CRAN)

Downloading and Installing R

Interacting with R

You can work directly in R, but most users prefer a graphical interface. For starters:

More advanced users may prefer a good text editor with plugins for syntax highlighting, code completion, etc. for R such as:

Diagram of How R Works

Seminar Packages

For the purposes of this seminar, we will be using the following packages frequently:

Installing Packages

To use packages in R, we must first install them using the install.packages function, which typically downloads the package from CRAN and installs it for use

Loading Packages

If we know we will need a particular package for our current R session, we must load it into the R environment using the library or require functions

Additional Notes on Installing R

For Windows and Mac, there are binary installers for R. You can download these and just click to install them. The defaults are sensible, so accept them unless you have a reason to do otherwise.

On Debian-based Linux distributions (such as Ubuntu), you can use apt-get install r-base if that distribution maintains a version of R. Other distributions offer R through their own package managers.

For Windows, Mac, and Linux, if you have the appropriate tools installed, you can build R from source. The definitive guide to this is available at http://cran.r-project.org/doc/manuals/R-admin.html

Once R is installed, assuming you have sufficient privileges, you can install and update most R packages. To install a package, use:

install.packages("packagename")

To update packages, you can update them all at once using:

update.packages()

You can also choose which CRAN mirror to use. For example, the UCLA Statistics Department maintains a mirror:

install.packages("packagename", repos = "http://cran.stat.ucla.edu/")

Code to load or install then load

You can use this code to either load the package, or if that does not work, install the package and then load it.
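
One way to write this pattern, as a minimal sketch. The helper name load_or_install is hypothetical, and the call uses the stats package (which ships with R) so that nothing is downloaded here.

```r
# Hypothetical helper: attach a package, installing it first only if needed.
load_or_install <- function(pkg) {
  if (!require(pkg, character.only = TRUE)) {
    install.packages(pkg)                # needs internet access and write permission
    library(pkg, character.only = TRUE)  # stops with an error if the install failed
  }
  invisible(TRUE)
}

# "stats" ships with R, so this call only attaches it.
load_or_install("stats")
```

require returns FALSE with a warning when the package is missing, which is what lets the if branch fall through to install.packages.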

Basic info on R session

To get a description of the version of R and its attached packages used in the current session, we can use the sessionInfo function

R programming

R code can be entered into the command line directly or saved to a script, which can be run inside a session using the source function

Commands are separated either by a ; or by a newline.

R is case sensitive.

The # character starts a comment: everything from # to the end of the line is not executed.

Help files for R functions are accessed by preceding the name of the function with ? (e.g. ?require).

??keyword searches R documentation for keyword (e.g. ??logistic)

R programming

R stores both data and output from data analysis (as well as everything else) in objects

Things are assigned to and stored in objects using the <- or = operator

A list of all objects in the current session can be obtained with ls()
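
A short illustration of assignment and ls(). The object names are arbitrary and chosen only for this sketch.

```r
x <- 5            # assign with <-
y = c(2, 4, 6)    # = also assigns at the top level
X <- "upper"      # R is case sensitive: X and x are different objects
z <- x * y        # vectorized arithmetic: 10 20 30
ls()              # names of the objects in the current environment
```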

Entering Data

Dataset files

R works most easily with datasets stored as text files. Typically, values in text files are separated, or delimited, by tabs or spaces:


gender id race ses schtyp prgtype read write math science socst
0 70 4 1 1 general 57 52 41 47 57
1 121 4 2 1 vocati 68 59 53 63 61
0 86 4 3 1 general 44 33 54 58 31
0 141 4 3 1 vocati 63 44 47 53 56

or by commas (CSV file):


gender,id,race,ses,schtyp,prgtype,read,write,math,science,socst
0,70,4,1,1,general,57,52,41,47,57
1,121,4,2,1,vocati,68,59,53,63,61
0,86,4,3,1,general,44,33,54,58,31
0,141,4,3,1,vocati,63,44,47,53,56

Reading in Data 1

Base R functions read.table and read.csv can read in data stored as text files, delimited by almost anything (notice the sep = option)

Although we are retrieving files over the internet for this class, these functions are typically used for files saved to disk.

Note how we are assigning the loaded data to objects.
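
A self-contained sketch of both functions. To avoid depending on a file or URL, it passes two rows of the example data through the text = argument; with real data you would give a file path or URL as the first argument instead. The object names are assumptions.

```r
# Two rows of the example data, once space-delimited and once comma-delimited.
space_txt <- "gender id race ses schtyp prgtype read write math science socst
0 70 4 1 1 general 57 52 41 47 57
1 121 4 2 1 vocati 68 59 53 63 61"

csv_txt <- "gender,id,race,ses,schtyp,prgtype,read,write,math,science,socst
0,70,4,1,1,general,57,52,41,47,57
1,121,4,2,1,vocati,68,59,53,63,61"

# sep = "" (the default) means any run of whitespace separates values.
dat_space <- read.table(text = space_txt, header = TRUE, sep = "")

# read.csv is read.table with sep = "," and header = TRUE preset.
dat_csv <- read.csv(text = csv_txt)

dat_space
dat_csv
```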

Reading in Data 2

We can read in datasets from other statistical analysis software using functions found in the foreign package

Reading in Excel Files

Datasets are often saved as Excel spreadsheets. Here we use the readxl package to read in the Excel file. We need to download the file first.

Viewing Data

R has ways to look at the dataset at a glance or as a whole.

Data frames

Once read in, datasets in R are typically stored as data frames, which have a matrix structure. Observations are arranged as rows and variables, either numerical or categorical, are arranged as columns.

Individual rows, columns, and cells in a data frame can be accessed through many methods of indexing.

We most commonly use object[row,column] notation.

More variable indexing

We can also access variables directly by using their names, either with object[,"variable"] notation or object$variable notation.

The c function for combining values into a vector

The c function is widely used to combine values of common type together to form a vector.

For example, it can be used to access non-sequential rows and columns from a data frame.
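
The indexing styles described above can be sketched on a small made-up data frame (the values are for illustration only):

```r
d <- data.frame(id    = c(70, 121, 86, 141),
                read  = c(57, 68, 44, 63),
                write = c(52, 59, 33, 44),
                math  = c(41, 53, 54, 47))

d[2, 3]                      # single cell: row 2, column 3 (write), which is 59
d[1, ]                       # entire first row
d[, 2]                       # entire second column, returned as a vector
d[, "read"]                  # the same column, selected by name
d$math                       # $ notation
d[c(1, 3), c("id", "math")]  # c() picks non-sequential rows and chosen columns
```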

Variable Names

If there were no variable names, or we wanted to change the names, we could use colnames.
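
For instance, with placeholder column names invented for this sketch:

```r
d <- data.frame(a = 1:3, b = c(57, 68, 44))
colnames(d)                     # current names: "a" "b"
colnames(d) <- c("id", "read")  # replace all names at once
colnames(d)[2] <- "reading"     # or change a single name by position
colnames(d)
```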

Saving Data

We can save our data in a number of formats, including text, Excel .xlsx, and in other statistical software formats like Stata .dta. Note that the code below is commented, so it will not run.

The function write.dta comes from the foreign package, while write.xlsx comes from the xlsx package.
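
As a base R illustration of the text-file case only, this sketch writes a small made-up data frame to a temporary CSV file and reads it back. The tempfile path is a stand-in for a real file name.

```r
d <- data.frame(id = c(70, 121), read = c(57, 68))

out_file <- tempfile(fileext = ".csv")      # temporary path; use your own in practice
write.csv(d, out_file, row.names = FALSE)   # row.names = FALSE omits the row-number column

d2 <- read.csv(out_file)                    # read it back to check the round trip
d2
```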

Exploring Data

Now we're going to read some data in and store it in the object, d. We prefer short names for objects that we will use frequently.

We can now easily explore and get to know these data, which contain a number of school, test, and demographic variables for 200 students.

Description of Dataset

Using dim, we get the number of observations (rows) and variables (columns) in d.

Using str, we get the structure of d, including the class of d and the data type of all column variables (usually one of "numeric", "integer", "logical", or "character").
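
A small sketch of both functions, using a made-up three-row data frame in place of the seminar dataset:

```r
d <- data.frame(id   = 1:3,
                prog = c("general", "vocati", "general"),
                read = c(57, 68, 44))

dim(d)   # rows, then columns: 3 3
str(d)   # class of d, plus the type and first values of each column
```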

R classes

R objects belong to classes. Most functions only accept objects of a specific class, so it is important to know the classes of our objects. Objects can belong to more than one class, and users can define classes to control the inputs of their functions.

The class function lists all classes to which the object belongs. If class returns a basic data type (e.g. "numeric", "character", "integer"), the object is a plain vector of that type; objects with dimensions report "matrix" or "array" instead.
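
A few quick checks of what class reports for common objects:

```r
class(1:3)                 # "integer"
class(c(1.5, 2))           # "numeric"
class("a")                 # "character"

m <- matrix(1:4, nrow = 2)
inherits(m, "matrix")      # TRUE: inherits() tests membership in a class

class(data.frame(x = 1))   # "data.frame"
```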

Generic functions

Generic functions remove the need for the user to remember the classes of objects that functions support.

Generic functions accept objects from multiple classes. They then pass the object to a specific function (called a method) designed for the object's class. The methods for different classes can have widely diverging purposes.

For example, when passing a data.frame to the generic plot function, plot passes the data.frame to a function called plot.data.frame, which creates a scatter plot matrix of all variables in the data.frame. In contrast, passing a regression model object to plot produces regression diagnostic plots instead.

Generic functions, classes and methods

If given a generic function name as an argument, the methods function will list all specific functions (methods) that the generic function searches for a class match. If given a class as an argument, methods will find all specific functions that accept that class.
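
To make the dispatch mechanism concrete, here is a toy generic. The name describe and its methods are hypothetical, written only for this sketch; they are not part of R.

```r
# A generic only decides which method to call, based on the class of x.
describe <- function(x, ...) UseMethod("describe")

# Methods are ordinary functions named generic.class.
describe.numeric <- function(x, ...) {
  paste("numeric vector with mean", mean(x))
}
describe.character <- function(x, ...) {
  paste("character vector with", length(unique(x)), "unique values")
}
describe.default <- function(x, ...) {
  paste("object of class", class(x)[1])
}

describe(c(1, 2, 3))         # dispatches to describe.numeric
describe(c("a", "b", "a"))   # dispatches to describe.character
describe(TRUE)               # no logical method, so describe.default runs

methods(describe)            # methods found for the toy generic
methods(class = "data.frame")  # methods available for data frames
```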

Descriptive Stats

summary is a generic function to summarize many types of R objects, including datasets.

When used on a data frame, summary returns distributional summaries of variables in the dataset.

Conditional Summaries 1

If we want conditional summaries, for example only for those students with high reading scores (read >= 60), we first subset the data using filter, then summarize as usual.

R permits nested function calls, where the results of one function are passed directly as an argument to another function. Here, filter returns a dataset containing observations where read >= 60. This data subset is then passed to summary to obtain distributions of the variables in the subset.

Note that filter is in the dplyr package.

Conditional Summaries 2

We can separate the data in other ways, such as by groups. Let's look at the means of the last 5 variables for each type of program, prog.

Here, we are asking the by function to apply the colMeans function to variables located in columns 7 through 11, and to calculate those means by groups denoted in the variable prog.
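
A minimal sketch of the same idea on a made-up data frame with two score columns (selected by name rather than by position 7 through 11):

```r
d <- data.frame(prog  = c("general", "vocati", "general", "vocati"),
                read  = c(57, 68, 44, 63),
                write = c(52, 59, 33, 44))

# by(data, grouping variable, function): colMeans runs once per group.
res <- by(d[, c("read", "write")], d$prog, colMeans)
res

res[["general"]]   # means for one group: read 50.5, write 42.5
```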

Histograms

Typically it is easier to inspect variable distributions with graphics. For the following graphics, we use the ggplot2 package. Histograms are often used for continuous variable distributions...

Density Plots

...as are kernel density plots...

Boxplots

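...as well as boxplots, which show the median, the lower and upper quartiles (or hinges), and whiskers extending to the most extreme values within 1.5 times the interquartile range of the hinges; points beyond the whiskers are drawn individually.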

Conditional Visualization 1

We can also plot graphs by group to better understand our data. Here we examine the densities of write for each type of program, prog.

Conditional Visualization 2

We could also view boxplots of math by type of program.

Here we plot math scores for each group in prog.

Extended Visualization 2

Here we demonstrate the flexibility of R by plotting the distributions of all of our continuous variables in a single boxplot.

The function melt from the reshape2 package stacks the values located in columns 7 through 11 of our dataset on top of one another to form one column called "value", and stacks the variable names associated with each value in a second column called "variable".

Thus we are asking for a boxplot of "values" grouped by "variable".

Try running just melt(d[, 7:11]) first. You'll notice there are 1000 rows and 2 columns, since it stacks the 200 values of five different variables on top of one another.

Extended Visualization 3

Finally we can get boxplots of these same variables coloured by program type by specifying a fill argument.

Graphics in R

This presentation has made heavy use of the ggplot2 package to visualize data in R. ggplot2 is an implementation of Leland Wilkinson's Grammar of Graphics, a framework for thinking about and creating graphs.

However, there are many other packages and ways to make graphics in R. Another very popular system uses the lattice package. This is a recommended package, which means that it ships with R so everyone has it. Here are some examples.

It is also possible to make graphs in base R, but often they look much simpler and less aesthetically pleasing, unless you take the time to customize them.

Categorical Data 1

We can look at the distributions of categorical variables with frequency tables.

Categorical Data 2

Two-way cross tabs.

Categorical Data 3

Three-way cross tabs.
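
One-way, two-way, and three-way tables can be sketched in base R with table and xtabs. The data frame below is made up for illustration.

```r
d <- data.frame(ses    = c(1, 2, 3, 3, 2, 2),
                schtyp = c(1, 1, 1, 2, 1, 2),
                female = c(0, 1, 0, 0, 1, 1))

tab1 <- table(d$ses)                               # one-way frequency table
tab2 <- xtabs(~ ses + schtyp, data = d)            # two-way cross tab
tab3 <- xtabs(~ ses + schtyp + female, data = d)   # three-way: one slice per level of female

tab1
tab2
tab3
```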

Categorical Independence

We can test whether ses and schtyp are independent with a permutation test from the vcd package.

Visualizing Cat Data (VCD) 1

The area of each cell in a mosaic plot corresponds to the frequency from the crosstabs. mosaic comes from the vcd package.

VCD 2

To visually understand whether the variables are independent, let's shade the cells based on Pearson residuals, which measure how far each observed count departs from what we would expect if the variables were independent.

VCD 3

We can condition just like we did with cross tabs, to visualize a three-way cross tab.

Correlations 1

As a last step in our data exploration, we would like some quick looks at bivariate (pairwise) relationships in our data. Correlation matrices provide quick summaries of these pairwise relationships.

If there are no missing data, we can use the cor function with default arguments. Otherwise, we could use the use argument to get listwise or pairwise deletion. Check the help file (?cor) for how to use the use argument.

Correlations 2

Here we are requesting all pairwise correlations among variables in columns 7 through 11.

Correlations 3

If there are missing data, for listwise deletion, use only complete observations.

Correlations 4

If there are missing data, for pairwise deletion, use pairwise complete observations.
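
The effect of the use argument can be seen on a small made-up data frame with two missing values:

```r
d <- data.frame(read  = c(57, 68, 44, 63, 47),
                write = c(52, 59, 33, 44, NA),
                math  = c(41, 53, 54, NA, 40))

r_default  <- cor(d)                                 # NA for any pair involving a missing value
r_listwise <- cor(d, use = "complete.obs")           # only rows with no missing values at all
r_pairwise <- cor(d, use = "pairwise.complete.obs")  # each pair uses every row where both are present

r_default
r_listwise
r_pairwise
```

Listwise deletion here keeps three rows for every correlation, while pairwise deletion keeps four rows for read with write and for read with math.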

Visual Summaries, Continuous Variables

We can inspect univariate and bivariate relationships using a scatter plot matrix.

Modifying Data

This section demonstrates reordering and modifying your data.

Let's begin by reading in our dataset again and storing it in object d.

Sorting

We can sort data using the arrange function from the dplyr package.

Here we are requesting that arrange sort by female, and then by math. The function arrange returns the sorted dataset.

Coding Categorical Variables as Factors

Within R, categorical variables are typically coded as factors, a special class that allows for value labeling.

Here, using the factor function, we convert all categorical variables to factors and label their values.

To spare us from unnecessary typing, we use the mutate function from the dplyr package to let R know that all conversions to factors and labeling should occur within the dataset d
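
A base R sketch of the conversion itself, assigning with $ rather than through mutate. The value labels ("male"/"female", "low"/"middle"/"high") are assumptions made for illustration.

```r
d <- data.frame(female = c(0, 1, 0, 0),
                ses    = c(1, 2, 3, 3))

# levels = the codes stored in the data; labels = the text shown for each code.
d$female <- factor(d$female, levels = 0:1, labels = c("male", "female"))
d$ses    <- factor(d$ses, levels = 1:3, labels = c("low", "middle", "high"))

str(d)
table(d$ses)
```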

Results

Here are the results of converting our categorical variables to factors.

Scoring and Recoding

Often we need to create variables from other variables. For example, we may want to sum individual test items to form a total score. Or, we may want to convert a continuous scale into several categories, such as letter grades.

We use the mutate function to tell R that all variables created and referenced are within the d dataset.

Below we create a total score and use the cut function to recode continuous ranges into categories.
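
A base R sketch of the scoring and recoding step, using with() in place of mutate. The cut points and grade labels are hypothetical.

```r
d <- data.frame(read    = c(57, 68, 44),
                write   = c(52, 59, 33),
                math    = c(41, 53, 54),
                science = c(47, 63, 58),
                socst   = c(57, 61, 31))

# Total score: sum of the five tests for each student (254, 304, 220).
d$total <- with(d, read + write + math + science + socst)

# cut() maps ranges to categories; intervals are (0,230], (230,270], (270,400].
d$grade <- cut(d$total, breaks = c(0, 230, 270, 400), labels = c("C", "B", "A"))
d
```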

Standardize and Average

We can get standardized (Z) scores with the scale function, and get average read scores for each level of ses using the ave function, which itself calls mean. We use mutate again to simplify the coding.
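
Both functions can be sketched in base R on made-up data (assigning with $ instead of mutate):

```r
d <- data.frame(ses  = c(1, 1, 2, 2),
                read = c(50, 60, 40, 80))

# scale() returns a one-column matrix, so as.numeric() flattens it to a vector.
d$z_read <- as.numeric(scale(d$read))

# ave() computes the group mean and repeats it for every row in that group.
d$mean_read_ses <- ave(d$read, d$ses)
d
```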

Scoring

We can also take the mean of a set of variables, ignoring missing values, which is useful for creating composite scores.
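
One way to do this is rowMeans with na.rm = TRUE, shown here on made-up data with two missing values. The column name composite is an assumption.

```r
d <- data.frame(read  = c(57, NA, 44),
                write = c(52, 59, NA),
                math  = c(41, 53, 54))

# Each row's mean uses only the scores that are present.
d$composite <- rowMeans(d[, c("read", "write", "math")], na.rm = TRUE)
d$composite   # 50 56 49
```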

Managing Data

Directories in R

At times it is convenient to look at directory contents within an R session. Let's get the current working directory, list its files, and change the working directory.

Subsetting Observations

Suppose we would like to split our dataset into 2 datasets, one for each gender. Here we use the filter function to create the 2 subsets, and we store each subset in a new object.

Subsetting Variables

Often, datasets come with many more variables than we want. We can use select to keep only the variables we need.

Adding Observations (appending)

If we were given separate files for males and females but wanted to stack them (add observations) we can append the datasets together row-wise with rbind.
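
A minimal sketch with two made-up data frames. rbind requires both to have the same column names.

```r
males   <- data.frame(id = c(70, 86), female = 0, read = c(57, 44))
females <- data.frame(id = 121, female = 1, read = 68)

both <- rbind(males, females)   # stacks the rows of females under males
both
```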

Merging Data

If instead we were given separate files describing the same students, but with different variables, we could merge the datasets to combine both sets of variables into 1 dataset, using the merge function. Note that the id variable does not need to have the same name in both datasets. We could have said, by.x = "id.x", by.y = "id.y".
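
A sketch of a merge where the key has a different name in each dataset. Both data frames are made up.

```r
scores <- data.frame(id.x = c(70, 121, 86), read = c(57, 68, 44))
demo   <- data.frame(id.y = c(121, 70, 86), ses  = c(2, 1, 3))

# Rows are matched on id.x == id.y; row order in the inputs does not matter.
merged <- merge(scores, demo, by.x = "id.x", by.y = "id.y")
merged
```

The key column in the result takes its name from by.x, and rows are sorted by the key by default.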

Analyzing Data

Analyzing Cat Data 1

We often analyze relationships between categorical variables using chi square tests. chisq.test can use raw data, or you can give it a frequency table. We do the latter.

Analyzing Cat Data 2

If you are concerned about small cell sizes, you can use Monte Carlo simulation. Here we do 90,000 simulations.
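
Both versions of the test can be sketched on a frequency table. The counts below are hypothetical, not taken from the seminar data.

```r
set.seed(1)  # the simulated p-value depends on random draws

# Hypothetical 2 x 2 table of counts.
tab <- matrix(c(12, 5, 7, 9), nrow = 2,
              dimnames = list(ses = c("low", "high"),
                              schtyp = c("public", "private")))

res_asym <- chisq.test(tab)                                       # asymptotic chi-square test
res_sim  <- chisq.test(tab, simulate.p.value = TRUE, B = 90000)   # Monte Carlo p-value

res_asym
res_sim
```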

t-tests 1

t.test performs t-tests, used to compare pairs of means. Here we show a one sample t-test comparing the mean of write to a mean of 50, and a paired samples t-test comparing the means of write and read.

t-tests 2

Perhaps we would like to compare the means of write between males and females. Here are independent samples t-tests with equal variance assumed and not assumed.
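
The four kinds of t-test mentioned above can be sketched on simulated data. The data frame is generated here only so the code runs on its own.

```r
set.seed(42)
d <- data.frame(female = rep(0:1, each = 20),
                write  = rnorm(40, mean = 52, sd = 9),
                read   = rnorm(40, mean = 52, sd = 10))

t_one    <- t.test(d$write, mu = 50)                          # one sample, against 50
t_paired <- t.test(d$write, d$read, paired = TRUE)            # paired samples
t_equal  <- t.test(write ~ female, data = d, var.equal = TRUE) # independent, equal variances
t_welch  <- t.test(write ~ female, data = d)                  # independent, Welch (the default)

t_one
t_welch
```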

ANOVA and Regression

ANOVAs and ordinary least-squares regression are just linear models, so we use lm for both. You should typically store the results of an estimation command in an object, as it often contains a wealth of information.

The object storing the results of lm can then be supplied to anova for an ANOVA table partitioning the variance sequentially, or to summary for a table of regression parameters and measures of fit.

If you would like ANOVAs using other types of sums of squares, get the car package from CRAN.

Regression continued

We can easily update a model, for instance by adding a continuous predictor, say read, using update.
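
A sketch of the workflow on simulated data (generated here only to make the code self-contained; the model names m1 and m2 are assumptions):

```r
set.seed(1)
d <- data.frame(prog = factor(rep(c("general", "academic", "vocational"), each = 20)),
                read = rnorm(60, mean = 52, sd = 10))
d$write <- 20 + 0.5 * d$read + rnorm(60, sd = 5)

m1 <- lm(write ~ prog, data = d)   # one-way ANOVA expressed as a linear model
anova(m1)                          # sequential ANOVA table
summary(m1)                        # coefficients and measures of fit

m2 <- update(m1, . ~ . + read)     # same model plus read; the dots mean "as before"
summary(m2)
```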

Regression Diagnostics 1

Using the plot function on an object storing the results of lm will produce regression diagnostic plots. Here we are arranging them in a 2x2 square using par.

Regression Diagnostics 2

We can get specific diagnostic plots, such as a density plot of the residuals (assumed normally distributed for inference).

Regression 3

Interactions are easy to add to models using *. This notation also automatically creates the lower order terms.

Regression 4

To get the multi-degree of freedom test for the interaction, we can compare a model with and without the interaction.
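
A sketch of the comparison on simulated data. The model names are assumptions.

```r
set.seed(1)
d <- data.frame(prog = factor(rep(c("general", "academic", "vocational"), each = 20)),
                read = rnorm(60, mean = 52, sd = 10))
d$write <- 20 + 0.5 * d$read + rnorm(60, sd = 5)

m_main <- lm(write ~ prog + read, data = d)   # main effects only
m_int  <- lm(write ~ prog * read, data = d)   # prog + read + prog:read

cmp <- anova(m_main, m_int)   # F test for the two interaction terms jointly
cmp
```

With three levels of prog, the interaction adds two coefficients, so the test has 2 numerator degrees of freedom.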

Estimated Means 1

We can get the estimated (predicted) cell means using predict.

First, using expand.grid we create a new dataset containing all possible crossings of predictor values for which we would like to predict the outcome. We then use the model coefficients to predict our cell means, and store them back in the new dataset.
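
A sketch of these two steps on simulated data. The variables and the model are stand-ins chosen for this illustration.

```r
set.seed(1)
d <- expand.grid(prog   = c("academic", "general", "vocational"),
                 female = c("male", "female"),
                 rep    = 1:10)                 # 10 students per cell
d$write <- rnorm(nrow(d), mean = 52, sd = 9)

m <- lm(write ~ prog * female, data = d)

# All crossings of the predictor values: 3 x 2 = 6 rows.
newdat <- expand.grid(prog = levels(d$prog), female = levels(d$female))

# Predicted cell means, stored back in the new dataset.
newdat$pred <- predict(m, newdata = newdat)
newdat
```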

Estimated Means 2

Plots of predicted values can be nice.

Estimated Means 3

This is a good example of why it is nice to use good variable names. If we had used them, this graph would be ready to go.

Logistic Regression 1

Generalized linear models include OLS regression, Poisson regression, and logistic regression. Let's look at logistic regression.

Notice we are storing the results of the regression in model object m4.

Logistic Regression 2

Odds ratios are commonly reported in logistic regression. These are just the exponentiated coefficients. Here coef extracts the coefficients from the model object, and the coefficients are then exponentiated.

Logistic Regression 3

We can also get confidence intervals for the coefficients or the odds ratios.
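
A sketch on simulated data, with a hypothetical binary outcome named private. It uses confint.default, which gives Wald intervals based on the standard errors; this is a choice made for the sketch and may differ from the interval method used in the seminar.

```r
set.seed(2)
d <- data.frame(read = rnorm(200, mean = 52, sd = 10))
d$private <- rbinom(200, size = 1, prob = plogis(-3 + 0.04 * d$read))

m4 <- glm(private ~ read, data = d, family = binomial)

or    <- exp(coef(m4))             # odds ratios
ci    <- confint.default(m4)       # Wald intervals on the log-odds scale
ci_or <- exp(ci)                   # the same intervals on the odds-ratio scale

or
ci_or
```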

Predicted Probabilities 1

For the ANOVA, we got estimated cell means. For logistic regression, let's get the predicted probabilities.

Predicted Probabilities 2

Now we can plot the predicted probabilities to see how read scores and ses affect the probability of being in a private versus public school.

Nonparametric Tests 1

We can use nonparametric tests when we are not sure that our data meet the distributional assumptions of the statistical inference tests we would like to use.
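
As an illustration, base R offers rank-based counterparts to the t-tests and one-way ANOVA shown earlier. Which tests the seminar ran is not stated here, so the three below are an assumption, applied to simulated data.

```r
set.seed(3)
d <- data.frame(female = rep(0:1, each = 15),
                prog   = factor(rep(c("a", "b", "c"), times = 10)),
                write  = rnorm(30, mean = 52, sd = 9))

w_one <- wilcox.test(d$write, mu = 50)            # signed rank test, one sample
w_two <- wilcox.test(write ~ female, data = d)    # rank sum test, two groups
k_grp <- kruskal.test(write ~ prog, data = d)     # Kruskal-Wallis, three or more groups

w_one
w_two
k_grp
```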

Nonparametric Tests 2

Nonparametric Tests 3

For More Information

Getting Help - Manual

A built-in reference manual is available:

?lm

and searchable:

??regression

Getting Help - Vignettes

Many packages include vignettes, for more detailed examples.

vignette()

Getting Help - Converting

For those of you switching packages:

Available for free on SpringerLink.

Getting Help - Online

Books on R

There is an extensive list of books on R maintained at http://www.r-project.org/doc/bib/R-books.html.