UCLA Academic Technology Services

R Learning Module
Coding for Categorical Variables in Regression Models

R provides at least three functions for obtaining contrast variables for use in regression or ANOVA.  For those shown below, the default contrast coding is "treatment" coding, which is another name for "dummy" coding. This is the coding most familiar to statisticians. Dummy (treatment) coding creates dichotomous variables in which each level of the categorical variable is contrasted with a specified reference level. Take the variable race, which has four levels. A typical dummy coding scheme picks a reference level, say level 1 (the default), and creates three dichotomous variables: one contrasting level 2 with level 1, one contrasting level 3 with level 1, and one contrasting level 4 with level 1.
R has built-in functions for four contrast coding schemes (treatment, Helmert, sum and polynomial), but we will focus on treatment (dummy) coding since it is the most popular choice among data analysts. For more information about the other contrast coding systems and how to implement them in R, please refer to R Library: Coding systems for categorical variables.
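To make the dummy coding scheme described above concrete, here is a minimal sketch that builds the three indicator variables by hand and compares them with the matrix R generates. The vector grp and the names d2, d3 and d4 are hypothetical stand-ins for illustration; they are not part of the hsb2 data.

```r
# Hypothetical toy variable with four levels (not the hsb2 data)
grp <- c(1, 2, 3, 4, 2, 4)

# Dummy (treatment) coding by hand, with level 1 as the reference:
# each indicator is 1 for its own level and 0 otherwise
d2 <- as.integer(grp == 2)
d3 <- as.integer(grp == 3)
d4 <- as.integer(grp == 4)
cbind(grp, d2, d3, d4)

# The same scheme as generated by R for a four-level factor:
# one row per level, one column per non-reference level
contr.treatment(4)
```

Observations at the reference level have zeros in all three indicators, which is why only three variables are needed for four levels.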
For the examples on this page we will be using the hsb2 data frame (hsb2.csv).
Let's first read in the data frame and create the factor variable race.f based on the variable race.  We will then use the is.factor function to determine if the variable we create is indeed a factor variable, and then we will use the lm function to perform a regression.
hsb2 <- read.table('c:/hsb2.csv', header = TRUE, sep = ",")
attach(hsb2)

1. The factor function

# creating the factor variable
race.f <- factor(race)
is.factor(race.f)
[1] TRUE
race.f[1:15]
[1] 4 4 4 4 4 4 3 1 4 3 4 4 4 4 3
Levels: 1 2 3 4
lm(write ~ (race.f))
Call:
lm(formula = write ~ (race.f))

Coefficients:
(Intercept) race.f2   race.f3   race.f4 
    46.458   11.542    1.742     7.597

You can also use the factor function within the lm function, saving the step of creating the factor variable first.

lm(write ~ factor(race))
Call:
lm(formula = write ~ factor(race))

Coefficients:
(Intercept)  factor(race)2  factor(race)3  factor(race)4 
     46.458         11.542          1.742          7.597 
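It may help to see how treatment-coded coefficients relate to group means. The sketch below uses a small made-up data frame (toy, score and grp are hypothetical names, not the hsb2 variables) and checks that the intercept is the mean of the reference level and that each remaining coefficient is that level's mean minus the reference mean.

```r
# Hypothetical toy data (not hsb2): a score for three groups
toy <- data.frame(
  score = c(10, 12, 20, 22, 30, 34),
  grp   = factor(c(1, 1, 2, 2, 3, 3))
)

fit <- lm(score ~ grp, data = toy)
group_means <- tapply(toy$score, toy$grp, mean)

# Intercept = mean of the reference level (level 1)
coef(fit)[["(Intercept)"]]
group_means[["1"]]

# Each remaining coefficient = that level's mean minus the reference mean
coef(fit)[c("grp2", "grp3")]
group_means[c("2", "3")] - group_means[["1"]]
```

Read the same way, the hsb2 output above says that the mean of write at level 1 of race is 46.458 and that the mean at level 2 is 11.542 points higher.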

2.  Using the C function

The C function (this must be an upper-case "C") allows you to create several different kinds of contrasts, including treatment, Helmert, sum and poly.  Treatment is another name for dummy coding.  Sum stands for contrasts that sum to zero, such as the type used in ANOVA models.  Poly is short for polynomial.  Three arguments are used with this function.  The first names the factor to be used, the second indicates the type of contrast to be used (e.g., treatment, helmert, etc.), and the third indicates the number of contrasts to be set.  The default for the third argument is one less than the number of levels of the factor variable.  We will start out by using the treatment contrast.  We will accept the default setting for the number of contrasts, so that argument can be omitted.
race.ct <- C(race.f, treatment)
attributes(race.ct)
$levels
[1] "1" "2" "3" "4"

$class
[1] "factor"

$contrasts
[1] "contr.treatment"

lm(write ~ (race.ct))

Call:
lm(formula = write ~ (race.ct))

Coefficients:
(Intercept) race.ct2 race.ct3 race.ct4 
  46.458     11.542   1.742    7.597
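The sum contrast mentioned above is requested through C() in the same way. The following is a sketch on made-up data (score, grp and grp.cs are hypothetical names, not hsb2 variables). With sum-to-zero coding and equal group sizes, the intercept is the mean of the group means and each coefficient is a level's deviation from it.

```r
# Hypothetical toy data (not hsb2)
score <- c(10, 12, 20, 22, 30, 34)
grp   <- factor(c(1, 1, 2, 2, 3, 3))

# Request sum-to-zero contrasts; the contrast name is typed in lower case
grp.cs <- C(grp, sum)
attributes(grp.cs)$contrasts

fit_sum <- lm(score ~ grp.cs)
coef(fit_sum)

# By hand: grand mean of the group means, then deviations of levels 1 and 2
m <- tapply(score, grp, mean)
c(mean(m), m[1] - mean(m), m[2] - mean(m))
```

The deviation for the last level is not printed because it is determined by the others: the deviations sum to zero.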
Now we will try an example using the Helmert coding system, which compares each subsequent level to the mean of the previous levels.  For example, the third level will be compared with the mean of the first two levels, and the fourth level will be compared to the mean of the first three levels.  Also note that R is case-sensitive, so the name of the contrast type must be typed in all lower-case letters:  typing Helmert instead of helmert produces an error message that does not make clear that the capital H is the problem.  We will make two objects using this type of coding:  for the first one we will accept the default number of contrasts, and in the second one we will specify explicitly that three contrasts are to be made (one less than the four levels of race).  As you will see, the difference is found in the output of the attributes function, not in the results of the lm.
race.ch <- C(race.f, helmert)

attributes(race.ch)

$levels
[1] "1" "2" "3" "4"

$class
[1] "factor"

$contrasts
[1] "contr.helmert"
lm(write ~ (race.ch))

Call:
lm(formula = write ~ (race.ch))

Coefficients:
(Intercept)   race.ch1   race.ch2   race.ch3 
  51.6784      5.7708     -1.3431    0.7923 
race.ch1 <- C(race.f, helmert, 3)

attributes(race.ch1)

$levels
[1] "1" "2" "3" "4"

$class
[1] "factor"

$contrasts
   [,1] [,2] [,3]
1   -1   -1   -1
2    1   -1   -1
3    0    2   -1
4    0    0    3

lm(write ~ race.ch1)

Call:
lm(formula = write ~ race.ch1)

Coefficients:
(Intercept)   race.ch11   race.ch12   race.ch13 
  51.6784      5.7708      -1.3431     0.7923
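The Helmert coefficients are easier to interpret once they are tied back to group means. The sketch below uses made-up data (score and grp are hypothetical names, not hsb2 variables) and assumes equal group sizes. With R's contr.helmert scaling, the k-th coefficient is the difference between the mean of level k+1 and the mean of the preceding level means, divided by k+1.

```r
# Hypothetical toy data (not hsb2), three groups of two observations
score <- c(10, 12, 20, 22, 30, 34)
grp   <- C(factor(c(1, 1, 2, 2, 3, 3)), helmert)

fit_h <- lm(score ~ grp)
coef(fit_h)

# Group means, then the same quantities computed by hand
m <- tapply(score, grp, mean)
by_hand <- c(
  mean(m),                    # intercept: mean of the group means
  (m[2] - m[1]) / 2,          # level 2 vs level 1, divided by 2
  (m[3] - mean(m[1:2])) / 3   # level 3 vs mean of levels 1 and 2, divided by 3
)
by_hand
```

The same scaling shows up in the hsb2 results above: the first Helmert coefficient, 5.7708, is half of the treatment coefficient 11.542 for level 2.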

3.  Using the contr. function

The contr. function is a little different from the preceding functions, in that it is really two functions.  In most cases, you will have a function on both sides of <- .  On the left side you will usually have the contrasts() function, and on the right contr.treatment(), contr.helmert(), or whatever contrast you want to use.  We suggest that you first look at the help file for this function, as the arguments are different for each type of contrast (i.e., treatment, Helmert, sum and poly).  For the treatment contrast, the arguments are n, base and contrasts.  There is no default for the n argument, so this number must be specified.  The default for the base argument is 1, meaning that the first level is used as the reference level.  The default for the contrasts argument is TRUE.

One advantage of the two-function method is that it lets you change the reference level through the base argument of contr.treatment().  The session-wide default contrast types are a separate setting, controlled through the options() function (see the entry for contrasts in the help file for options).
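As a rough sketch of the options() route: the contrasts option holds the names of the default contrast functions for unordered and ordered factors, and changing it affects every factor that has no contrasts set explicitly. The factor grp and the object names below are hypothetical, and the previous setting is restored at the end.

```r
# Session-wide defaults: first entry for unordered factors, second for ordered
getOption("contrasts")

# Temporarily make sum-to-zero contrasts the default for unordered factors
old <- options(contrasts = c("contr.sum", "contr.poly"))

grp <- factor(c(1, 2, 3))      # hypothetical toy factor, no contrasts set
sum_default <- contrasts(grp)  # built with contr.sum because of the option

# Restore the previous setting so later examples are unaffected
options(old)
restored_default <- contrasts(grp)

sum_default
restored_default
```

Note that this option selects the type of coding, not the reference level.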

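First, we will use the contrasts() function by itself simply to show what it is doing.  Please note that this shows treatment coding only because treatment coding is the default:  on a factor with no contrasts set, contrasts() returns the default matrix, so by itself it does not display any other type of coding.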

a <- contrasts(race.f)

a

  2 3 4
1 0 0 0
2 1 0 0
3 0 1 0
4 0 0 1
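One way to see how this matrix is used is to look at the design matrix that a model formula produces. In the sketch below (grp is a hypothetical toy factor, and the default treatment contrasts are assumed to be in effect), each observation picks up the row of the contrast matrix that belongs to its level.

```r
# Hypothetical toy factor with four levels (not the hsb2 data)
grp <- factor(c(1, 2, 3, 4, 2))

# Contrast matrix: one row per level, one column per contrast
contrasts(grp)

# Design matrix: an intercept column plus, for every observation,
# the contrast-matrix row belonging to that observation's level
X <- model.matrix(~ grp)
X
```

The fifth observation is at level 2, so its row of X repeats the level 2 row of the contrast matrix after the intercept.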

Now let's use the contrasts() function with the contr.treatment() function.  The results from the linear model (the lm() function) should match those that we have obtained previously.  Note that the number given in the parentheses is the number of levels of the factor variable race.

contrasts(race.f) <- contr.treatment(4)

lm(write ~ race.f)

Call:
lm(formula = write ~ race.f)

Coefficients:
(Intercept)   race.f2   race.f3   race.f4 
  46.458       11.542    1.742     7.597

Now let's try changing the reference level to the second level of race.f.

contrasts(race.f) <- contr.treatment(4, base = 2)

lm(write ~ race.f)

Call:
lm(formula = write ~ race.f)

Coefficients:
(Intercept)   race.f1   race.f3   race.f4 
  58.000       -11.542   -9.800    -3.945 

Another way of doing the same thing would be to specify which levels of the factor variable race.f are to be included in the model.

lm(write ~ I(race.f == 1) + I(race.f == 3) + I(race.f == 4))

Call:
lm(formula = write ~ I(race.f == 1) + I(race.f == 3) + I(race.f == 4))

Coefficients:
(Intercept)   I(race.f == 1)TRUE   I(race.f == 3)TRUE   I(race.f == 4)TRUE 
  58.000         -11.542              -9.800               -3.945 
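A further alternative, not used elsewhere on this page, is the relevel() function from the stats package. It reorders the levels so that the chosen one comes first and therefore becomes the reference under the default treatment coding. The sketch below uses made-up data (score, grp and grp.r are hypothetical names, not hsb2 variables).

```r
# Hypothetical toy data (not hsb2)
score <- c(10, 12, 20, 22, 30, 34)
grp   <- factor(c(1, 1, 2, 2, 3, 3))

# Move level "2" to the front so that it becomes the reference level
grp.r <- relevel(grp, ref = "2")
levels(grp.r)

# Intercept is now the mean of level 2; the other coefficients are
# differences from that mean
fit_r <- lm(score ~ grp.r)
coef(fit_r)
```

Unlike the I() approach, this keeps a single factor in the formula, which may be more convenient when the factor has many levels.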

Now let's try using the Helmert coding.

contrasts(race.f) <- contr.helmert(4)

lm(write ~ race.f)

Call:
lm(formula = write ~ race.f)

Coefficients:
(Intercept)   race.f1   race.f2   race.f3 
  51.6784      5.7708    -1.3431   0.7923
detach(hsb2)

These pages are Copyrighted (c) by UCLA Academic Technology Services

