UCLA Academic Technology Services HomeServicesClassesContactJobs
Help the Stat Consulting Group by giving a gift             
Loading

R FAQ
How can I visualize longitudinal data in ggplot2?

In this page, we demonstrate how to create spaghetti plots, explore overall trends, and look for interactions in longitudinal data using ggplot2. With longitudinal or repeated measures data, there are often two aspects that are interesting. First, how much variability is there between individual units at a given time or measure? Second, how much variance is their within units over time or measure? One convenient way to visualize both is using a spaghetti plot. This draws a plot with separate lines for each unit. The space between lines represents between unit variability, the change in each line (slope) represents within variability. Let's use the person period tolerance data set from Applied Longitudinal Data Analysis: Modeling Change and Event Occurrence by Judith D. Singer and John B. Willett for our example.

Basics

First, we need to read the data in, convert the numeric id and sex indicators to factor class variables, and load the ggplot2 package that we will use to make the graphs.


## read in data set (tolerance data from the ALDA book)
tolerance <- read.table("http://www.ats.ucla.edu/stat/r/examples/alda/data/tolerance1_pp.txt",
  sep = ",", header = TRUE)

## change id and male to factor variables
tolerance <- within(tolerance, {
  id <- factor(id)
  male <- factor(male, levels = 0:1, labels = c("female", "male"))
})

## view the first few rows of the dataset
head(tolerance)

  id age tolerance   male exposure time
1  9  11      2.23 female     1.54    0
2  9  12      1.79 female     1.54    1
3  9  13      1.90 female     1.54    2
4  9  14      2.12 female     1.54    3
5  9  15      2.66 female     1.54    4
6 45  11      1.12   male     1.16    0

## load ggplot2 package to make the graphs
require(ggplot2)

Now we are ready to start creating graphs. We begin by plotting tolerance on the y axis and time on the x axis. The ggplot2 package is based on the Grammar of Graphics by Leland Wilkinson. The theoretical structure behind how a graph is created is similar to how we might form a sentence. There are basic, structural components, things that say how one variable relates to another, and modifiers (like adjectives or adverbs) that control things like size, colour, shape, etc. All graphs start off with a base component that defines the data to be used, and indicates how variables are mapped to the graph. In our case, the data set is 'tolerance', we are mapping time to the x axis, tolerance (the variable inside the dataset, not the data set itself) to the y axis, and we will group things by id. All of this information is created and stored in the object, 'p' (for plot).


## define base for the graphs and store in object 'p'
p <- ggplot(data = tolerance, aes(x = time, y = tolerance, group = id))

Note that no graph is created yet. All we have said is how to map the data to the graph. We have not indicated what should be drawn. The shapes that are drawn are called geoms. Geoms are typically very basic building blocks (although there are more complex ones) consisting of things like points, lines, and polygons. These few things can be combined to create a virtually endless set of visualizations. For now, we will just look at points and lines. Notice how the variable mapping remains unchanged. All we need to vary is the shape being created.


## just plotting points (a.k.a., scatterplot)
p + geom_point()

##simple spaghetti plot
p + geom_line()

More Customized Graphs

This grammar of graphics system may seem difficult for a simple little scatter plot, and it is. Its value is in flexibility to create far more complex visualizations with minimal change. We see the spaghetti plot, but perhaps there are differences based on sex. We could compare the mean tolerance for females and males, but what if there was an interaction between time and male such that the effect of time on tolerance depends on whether an individual is female or male? Visual inspection is easy; we will simply create two separate subplots for females and males. To facilitate detection of main sex effects, we plot both subplots on the same scales (the default).


## facet (condition) the graph base on the male variable
p + geom_line() + facet_grid(. ~ male)

Aside from the one extremely high male, nothing visually 'jumps out', but it might be nice to see the means too. We add a summary function to the graph that calculates and plots the means. Note that because we are still faceting based on male (female/male), ggplot2 knows to calculate and plot the means separately. We override the groups = id we had set in our 'p' object, otherwise we would calculate the mean at each time, for each id...not very interesting.


p +
  geom_line() +
  stat_summary(aes(group = 1), geom = "point", fun.y = mean, shape = 17, size = 3) +
  facet_grid(. ~ male)

Besides easy conditioning, there is another benefit to doing the mean (or any graphed statistic) calculation within the graph call. ggplot2 has a particular order it operates. First, data is read and variables are mapped to axes or any other aspect of the graph, then any transformations are applied, next faceting or conditioning is performed, then the statistics are calculated, and finally the graphs are rendered. This means that if one were to transform the data (say with a log or square transformation), the means will be calculated on the transformed data. When trying out different ways of visualizing data, it is very convenient that the graphs are automatically updated to the transformation, facetting, etc.


## Here we plot the log base 10 of tolerance
## the means are now calculated on the logged values i.e.,
## mean(log(y)) NOT log(mean(y)) --- these two are quite different things
p +
  geom_line() +
  stat_summary(aes(group = 1), geom = "point", fun.y = mean, shape = 17, size = 3) +
  scale_y_log10() +
  facet_grid(. ~ male)

In the previous examples, the function used to calculate the summary y value was mean, but it could be any function. For example given the high score for on male at time 3, perhaps we would rather use medians. The change is trivial.


p +
  geom_line() +
  stat_summary(aes(group = 1), geom = "point", fun.y = median, shape = 17, size = 3) +
  facet_grid(. ~ male)

Alternately, we could plot the upper and lower quartile (as a side note, we might consider connecting the lower and upper quartiles with a line [like error bars]).


## note the probs argument is for the quantile function as for the 25th and 75th percentiles
p +
  geom_line() +
  stat_summary(aes(group = 1), geom = "point",
    fun.y = quantile, probs = c(.25, .75), shape = 17, size = 3) +
  facet_grid(. ~ male)

The point is that the framework is flexible---you can theoretically use any function for a summary. Now, we might be interested in estimating the overall trend in the data. One option is to add a line using locally weighted regression (lowess) to "smooth" over all the variability and give a sense of the overall or average trend. It just takes one short line of code and is automatically calculated separate by male.


## note again, we use group = 1, so the smooth is not calculated separately for each id
p +
  geom_line() +
  stat_smooth(aes(group = 1)) +
  stat_summary(aes(group = 1), geom = "point", fun.y = mean, shape = 17, size = 3) +
  facet_grid(. ~ male)

The lowess looks fairly linear, so we might choose a linear smooth and turn off the standard error shading.


p +
  geom_line() +
  stat_smooth(aes(group = 1), method = "lm", se = FALSE) +
  stat_summary(aes(group = 1), geom = "point", fun.y = mean, shape = 17, size = 3) +
  facet_grid(. ~ male)

Adding smooths is more flexible than we have shown thus far. Suppose that between time 1 and 2, an intervention occurred, and we wish to fit a piecewise linear model rather than an overall smooth. We can do this by creating a dummy variable (pre/post intervention) and its interaction with time. The only change is a slightly more complex formula. The default is y ~ x. "I(x > 1)" creates a dummy (TRUE/FALSE) variable if x (time in this case) is greater than 1. The "*" in the formula asks for the main effects and the interaction between x and the dummy variable from x. Of course ggplot2 takes care of fitting the model separately by male and plotting it for us. Now we can see that the trend line 'jumps' after time 1, and the slope is allowed to change (although the change appears minimal suggestion there is not an interaction between our hypothetical intervention and time.


p +
  geom_line() +
  stat_smooth(aes(group = 1), method = "lm", formula = y ~ x * I(x > 1), se = FALSE) +
  stat_summary(aes(group = 1), fun.y = mean, geom = "point", shape = 17, size = 3) +
  facet_grid(. ~ male)

Additional Types of Smooths

stat_smooth is not limited to simple linear models. We convert the factor variable male back to 0/1 for female/male and plot the data points as well as fit a logistic regression predicting male from exposure. The line plotted is the predicted probability.


ggplot(tolerance, aes(x = exposure, y = as.numeric(male) - 1)) +
  geom_point() +
  stat_smooth(size = 1, method = "glm", se = FALSE, family = binomial)

We are not even limited by basic or default modeling functions. To demonstrate this, we will use a different data set that is built into R, the 'mtcars' data. Specifically, we will look at the relationship between miles per gallon (mpg) and horsepower (hp).


head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

## plot base + points
p2 <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point()

One thing to notice is that into the 'p2' object, we saved both the basic plot setup and the request to add points. This saves typing down the road if we know we always want points plotted in our graph. A quick visual of the data indicates the relationship may not be linear. This is confirmed when we look at a linear smooth. The fit is poor at the extremes.


## a quick visual indicates we may not have a linear relationship
p2

## looking at a linear fit, we see it is poor at the extremes
p2 + stat_smooth(method = "lm", formula = y ~ x, size = 1)

For the sake of demonstration, we will try to model the data using a generalized additive model (GAM) from the 'mgcv' package with a smooth on the x predictor variable. First we load the required package, and then show how it is easily used inside our graph.


## load a package to fit generalized additive models (GAMs)
require(mgcv)

## we now fit a GAM adding a penalized smoother with x
p2 + stat_smooth(method = "gam", formula = y ~ s(x), size = 1)

The GAM with a smooth seems to fit the data better than the straight line did. We could also customize the basis dimension. Arbitrarily, we choose 3.


p2 + stat_smooth(method = "gam", formula = y ~ s(x, k = 3), size = 1)

Another, perhaps more traditional, way of modeling the data would be to use orthogonal polynomial terms, with (judging from the GAM models) a second order (quadratic) polynomial.


## R can automatically creates these using the poly() function
p2 + stat_smooth(method = "lm", formula = y ~ poly(x, 2), size = 1)

In summary, we have created a variety of plots using ggplot2 to visualize longitudinal data and demonstrated how it is possible to easily add summary statistics, look for interactions with categorical variables through faceting, try data transformations, and look at linear and nonlinear effects.

Created using R version 2.13.1 (2011-07-08), ggplot2_0.8.9, and mgcv_1.7-6. Last updated: 09/27/2011.


How to cite this page

Report an error on this page or leave a comment

UCLA Researchers are invited to our Statistical Consulting Services
We recommend others to our list of Other Resources for Statistical Computing Help
These pages are Copyrighted (c) by UCLA Academic Technology Services


The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.