In R there are at least three different functions that can be used to obtain contrast variables for use in regression or ANOVA. For those shown below, the default contrast coding is "treatment" coding, which is another name for "dummy" coding. This is the coding most familiar to statisticians. "Dummy" or "treatment" coding consists of creating dichotomous variables in which each level of the categorical variable is contrasted with a specified reference level. For the variable race, which has four levels, a typical dummy coding scheme involves specifying a reference level, say level 1 (the default), and then creating three dichotomous variables, each of which contrasts one of the other levels with level 1. So, we would have one variable that contrasts level 2 with level 1, another that contrasts level 3 with level 1, and a third that contrasts level 4 with level 1. R has built-in functions for several contrast coding schemes (treatment, Helmert, sum and polynomial among them), but we will focus our attention on treatment (or dummy) coding, since it is the most popular choice among data analysts. For more information about different contrast coding systems and how to implement them in R, please refer to R Library: Coding systems for categorical variables.
For the examples on this page we will be using the hsb2 data frame (hsb2.csv).
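To make the dummy coding idea concrete, here is a minimal sketch using a small made-up factor g (a hypothetical stand-in, not part of the hsb2 data). It assumes R's default contrast settings are in effect and simply displays the contrast matrix and the design matrix that R would build from the factor.

```r
# Hypothetical toy factor with four levels (not the hsb2 data)
g <- factor(c(1, 2, 3, 4, 2, 4))

# Treatment (dummy) contrast matrix for four levels; level 1 is the reference,
# so its row contains only zeros
cm <- contr.treatment(4)
print(cm)

# Design matrix built from the factor: an intercept column plus three
# 0/1 columns, one for each non-reference level
mm <- model.matrix(~ g)
print(mm)
```

Each row of the design matrix has at most one 1 among the columns g2, g3 and g4; observations at the reference level have zeros in all three.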
Let's first read in the data frame and create the factor variable race.f based on the variable race. We will then use the is.factor function to determine if the variable we create is indeed a factor variable, and then we will use the lm function to perform a regression.
hsb2 <- read.table('c:/hsb2.csv', header = TRUE, sep = ",")
attach(hsb2)
# creating the factor variable
race.f <- factor(race)
is.factor(race.f)
[1] TRUE
race.f[1:15]
 [1] 4 4 4 4 4 4 3 1 4 3 4 4 4 4 3
Levels: 1 2 3 4
lm(write ~ (race.f))
Call:
lm(formula = write ~ (race.f))
Coefficients:
(Intercept)      race.f2      race.f3      race.f4
     46.458       11.542        1.742        7.597
You can also use the factor function within the lm function, saving the step of creating the factor variable first.
lm(write ~ factor(race))
Call:
lm(formula = write ~ factor(race))

Coefficients:
  (Intercept)  factor(race)2  factor(race)3  factor(race)4
       46.458         11.542          1.742          7.597
The C function (this must be an upper-case "C") allows you to create several different kinds of contrasts, including treatment, Helmert, sum and poly. Treatment is another name for dummy coding. Sum stands for contrasts that sum to zero, such as the type used in ANOVA models. Poly is short for polynomial. Three arguments are used with this function. The first names the factor to be used, the second indicates the type of contrast to be used (e.g., treatment, helmert, etc.), and the third indicates the number of contrasts to be set. The default for the third argument is one less than the number of levels of the factor variable. We will start out by using the treatment contrast. We will accept the default setting for the number of contrasts, so that argument can be omitted.
race.ct <- C(race.f, treatment)
attributes(race.ct)
$levels
[1] "1" "2" "3" "4"

$class
[1] "factor"

$contrasts
[1] "contr.treatment"

lm(write ~ (race.ct))

Call:
lm(formula = write ~ (race.ct))

Coefficients:
(Intercept)     race.ct2     race.ct3     race.ct4
     46.458       11.542        1.742        7.597
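As a rough illustration of two of the other contrast types that C accepts, the sketch below applies sum and poly to a small made-up factor f (a hypothetical stand-in for race.f, not the hsb2 data) and uses contrasts() to display the resulting contrast matrices.

```r
# Hypothetical four-level factor standing in for race.f
f <- factor(c(1, 2, 3, 4, 4, 3, 2, 1))

f.sum  <- C(f, sum)    # "sum to zero" contrasts
f.poly <- C(f, poly)   # orthogonal polynomial contrasts

# contrasts() expands the stored contrast name into the actual matrix
sum.mat  <- contrasts(f.sum)
poly.mat <- contrasts(f.poly)

print(sum.mat)
print(round(poly.mat, 3))

# In both schemes each contrast column sums to zero across the four levels
print(colSums(sum.mat))
print(round(colSums(poly.mat), 10))
```

Note that the polynomial contrasts treat the levels as equally spaced, which is usually only meaningful for ordered categories.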
Now we will try an example using the Helmert coding system, which compares each subsequent level with the mean of the previous levels. For example, the third level will be compared with the mean of the first two levels, and the fourth level will be compared with the mean of the first three levels. Also note that, like most functions in R, C is case-sensitive: the argument giving the type of contrast must be in all lower-case letters (i.e., typing Helmert will give you a strange error message that does not tell you that the problem is that you need to use a lower-case h (helmert)). We will make two objects using this type of coding: for the first one we will accept the default number of contrasts to be created, and for the second one we will specify that three contrasts are to be made (because the variable race has four levels). As you will see, the difference is found in the output of the attributes function, not in the results of lm.
race.ch <- C(race.f, helmert)
attributes(race.ch)
$levels
[1] "1" "2" "3" "4"

$class
[1] "factor"

$contrasts
[1] "contr.helmert"

lm(write ~ (race.ch))

Call:
lm(formula = write ~ (race.ch))

Coefficients:
(Intercept)     race.ch1     race.ch2     race.ch3
    51.6784       5.7708      -1.3431       0.7923

race.ch1 <- C(race.f, helmert, 3)
attributes(race.ch1)
$levels
[1] "1" "2" "3" "4"

$class
[1] "factor"

$contrasts
  [,1] [,2] [,3]
1   -1   -1   -1
2    1   -1   -1
3    0    2   -1
4    0    0    3

lm(write ~ race.ch1)

Call:
lm(formula = write ~ race.ch1)

Coefficients:
(Intercept)    race.ch11    race.ch12    race.ch13
    51.6784       5.7708      -1.3431       0.7923
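The Helmert coefficients can be related to the group means by working through the contrast matrix shown above: the intercept is the unweighted mean of the four group means, and the k-th coefficient is the mean of level k+1 minus the average of the preceding level means, divided by k+1. The following sketch checks this relationship on simulated data. The data, the seed and the names g, y, m, b and expected are all made up for illustration; this is not the hsb2 file.

```r
set.seed(1)

# Hypothetical unbalanced four-group data (not the hsb2 data)
g <- factor(rep(1:4, times = c(5, 8, 6, 7)))
y <- rnorm(length(g), mean = c(46, 58, 48, 54)[g], sd = 9)

# Fit the model with Helmert contrasts
contrasts(g) <- contr.helmert(4)
b <- unname(coef(lm(y ~ g)))

# The four observed group means
m <- as.numeric(tapply(y, g, mean))

# Coefficients implied by the Helmert contrast matrix
expected <- c(
  mean(m),                     # intercept: unweighted mean of the group means
  (m[2] - m[1]) / 2,           # level 2 vs level 1
  (m[3] - mean(m[1:2])) / 3,   # level 3 vs mean of levels 1-2
  (m[4] - mean(m[1:3])) / 4    # level 4 vs mean of levels 1-3
)

print(all.equal(b, expected))
```

The output shown earlier on this page is consistent with this reading: the treatment-coded fit implies group means of about 46.458, 58.000, 48.200 and 54.055, whose average is about 51.678, the Helmert intercept.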
The contr. function is a little different from the preceding functions, in that it is really two functions. In most cases, you will have a function on both sides of <- . On the left side you will usually have the contrasts() function, and on the right contr.treatment(), contr.helmert(), or whatever contrast you want to use. We suggest that you first look at the help file for this function, as the arguments are different for each type of contrast (i.e., treatment, Helmert, sum and poly). For the treatment contrast, the arguments are n, base and contrasts. There is no default for the n argument, so this number must be specified. The default for the base argument is 1, meaning that the first level is used as the reference level. The default for the contrasts argument is TRUE. One advantage of using the two-function method is that it allows you to change the reference level if you like, through the base argument. The default contrast type itself can be changed using the options() function (see the help file for contrasts).
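As a small sketch of how the base and contrasts arguments of contr.treatment() behave, the calls below build contrast matrices for a four-level factor without attaching them to any data. The object names ct.base2 and ct.full are made up for illustration.

```r
# base = 2 makes the second level the reference: its row is all zeros
ct.base2 <- contr.treatment(4, base = 2)
print(ct.base2)

# contrasts = FALSE returns one indicator column per level,
# i.e. a 4 x 4 identity matrix with no reference level dropped
ct.full <- contr.treatment(4, contrasts = FALSE)
print(ct.full)
```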
First, we will use the contrasts() function by itself simply to show what it is doing. Please note that while the example works for treatment coding, it does not work for other types of coding.
a <- contrasts(race.f)
a
  2 3 4
1 0 0 0
2 1 0 0
3 0 1 0
4 0 0 1
Now let's use the contrasts() function with the contr.treatment() function. The results from the linear model (the lm() function) should match those that we have obtained previously. Note that the number given in the parentheses is the number of levels of the factor variable race.
contrasts(race.f) <- contr.treatment(4)
lm(write ~ race.f)

Call:
lm(formula = write ~ race.f)

Coefficients:
(Intercept)      race.f2      race.f3      race.f4
     46.458       11.542        1.742        7.597
Now let's try changing the reference level to the second level of race.f.
contrasts(race.f) <- contr.treatment(4, base = 2)
lm(write ~ race.f)

Call:
lm(formula = write ~ race.f)

Coefficients:
(Intercept)      race.f1      race.f3      race.f4
     58.000      -11.542       -9.800       -3.945
Another way of doing the same thing would be to specify which levels of the factor variable race.f are to be included in the model.
lm(write ~ I(race.f == 1) + I(race.f == 3) + I(race.f == 4))

Call:
lm(formula = write ~ I(race.f == 1) + I(race.f == 3) + I(race.f == 4))

Coefficients:
       (Intercept)  I(race.f == 1)TRUE  I(race.f == 3)TRUE  I(race.f == 4)TRUE
            58.000             -11.542              -9.800              -3.945
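The agreement between the two approaches is not specific to the hsb2 data. As a minimal check, the sketch below fits both versions to simulated data and compares the coefficients; the data, the seed and the names g, g2, y, fit_contr and fit_ind are made up for illustration.

```r
set.seed(2)

# Hypothetical four-group data (not the hsb2 data)
g <- factor(rep(1:4, each = 6))
y <- rnorm(length(g), mean = c(46, 58, 48, 54)[g], sd = 9)

# Approach 1: treatment contrasts with level 2 as the reference
g2 <- g
contrasts(g2) <- contr.treatment(4, base = 2)
fit_contr <- lm(y ~ g2)

# Approach 2: explicit indicators for every level except level 2
fit_ind <- lm(y ~ I(g == 1) + I(g == 3) + I(g == 4))

# The coefficient names differ, but the estimates agree
print(all.equal(unname(coef(fit_contr)), unname(coef(fit_ind))))

# In both fits the intercept is the mean of the reference level (level 2)
print(c(intercept = unname(coef(fit_contr))[1], mean_level2 = mean(y[g == 2])))
```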
Now let's try using the Helmert coding.
contrasts(race.f) <- contr.helmert(4)
lm(write ~ race.f)

Call:
lm(formula = write ~ race.f)

Coefficients:
(Intercept)      race.f1      race.f2      race.f3
    51.6784       5.7708      -1.3431       0.7923
detach(hsb2)
These pages are Copyrighted (c) by UCLA Academic Technology Services