|
|
|
||||
|
|
|||||
Note: the code on this page works with R 2.5.0.
Example 2. A biologist may be interested in food choices that alligators make. Adult alligators might have difference preference than young ones. The outcome variable here will be the types of food, and the predictor variables might be the length of the alligators and other environmental variables.
Example 3. Several brands of similar products are on the market, and you want to study brand choices based on gender and age. For example, a recent finding of a market research group claims that among digital camera choices, women prefer Kodak more than men and men prefer Canon more than women.
For our data analysis example, we will expand our third example with a hypothetical data set. The data set contains information on 735 subjects who were asked their preference on three brands of some product (e.g., car or TV). Included in the data set are the information on subjects' gender and age. You can get access to the data set in R by typing the command below. Note that R requires forward slashes (/) not back slashes (\) when specifying a file location even if the file is on your hard drive.
mydata <- read.csv(url("http://www.ats.ucla.edu/stat/r/dae/mlogit.csv"))
attach(mydata)
names(mydata)
[1] "brand" "female" "age"
The outcome variable is brand. The variable female is coded as 0 for male and 1 for female. Let's start with some descriptive statistics on the variables of our interest.
table(brand)
brand
1 2 3
207 307 221
table(female)
female
0 1
269 466
summary(age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
24.0 32.0 32.0 32.9 34.0 38.0
sd(age)
[1] 2.333195
xtabs(~ female + brand)
brand
female 1 2 3
0 92 99 78
1 115 208 143
Now we have warmed up to building our model. Our goal is to associate the brand choices with age and gender. We will assume a linear relationship between the transformed outcome variable and our predictor variables female and age. In order to run multinomial logistic regression in R, you first need to install the package (sometimes called a library) VGAM. Once you have installed VGAM, you will need to tell R you want to use this library before you can access the commands in it you do this using the command library() shown in the first line of code below. The following line of code specifies the model. vgl( ) is the R function that runs the model. Within the parentheses first we see the model brand~female+age. The command family=multinomial() indicates the type of model to be run (multinomial logistic regression). Finally, the command na.action=na.pass, tells R that when it encounters a missing value, it should omit that case. The final line asks for a summary of the model.
library(VGAM)
mlogit<- vglm(brand~female+age, family=multinomial(), na.action=na.pass)
summary(mlogit)
Call:
vglm(formula = brand ~ female + age, family = multinomial(),
na.action = na.pass)
Pearson Residuals:
Min 1Q Median 3Q Max
log(mu[,1]/mu[,3]) -5.5632 -0.44331 -0.32370 0.55468 7.7720
log(mu[,2]/mu[,3]) -4.7219 -0.68004 -0.44685 0.97285 1.7861
Coefficients:
Value Std. Error t value
(Intercept):1 22.721396 2.058016 11.04043
(Intercept):2 10.946741 1.493160 7.33126
female:1 -0.465941 0.226089 -2.06087
female:2 0.057873 0.196427 0.29463
age:1 -0.685908 0.062626 -10.95243
age:2 -0.317702 0.044007 -7.21939
Number of linear predictors: 2
Names of linear predictors: log(mu[,1]/mu[,3]), log(mu[,2]/mu[,3])
Dispersion Parameter for multinomial family: 1
Residual Deviance: 1405.941 on 1464 degrees of freedom
Log-likelihood: -702.9707 on 1464 degrees of freedom
Number of Iterations: 5
In the output above, the first thing we see is the call, this is R reminding us what the model we ran was, what options we specified, etc.. Next we see information on the distribution of the pearson's residuals which help us assess how well the model fits. There are two sets of these residuals, one for the predictions for being in category 1 (brand=1) versus category 3 (brand=3), and the other is for category 2 versus category 3.
The next part of the output is labeled "Coefficients," and includes the coefficients, as well as their standard errors and associated t-values. For each independent variable (and the intercept) are listed twice, once for each of the categories of the outcome variable brand. They correspond to two equations:
log(P(brand=1)/P(brand=3)) = b_10 + b_11*female + b_12*age
log(P(brand=2)/P(brand=3)) = b_20 + b_21*female + b_22*age,
with b's being the raw regression coefficients from the output.
For example, we can say that for one unit change in the variable age, the log of the ratio of the two probabilities, P(brand=1)/P(brand=3), will be decrease by 0.696, and the log of the ratio of the two probabilities P(brand=2)/P(brand=3) will decrease by 0.318. Therefore, we can say that, in general, the older a person is, the more he/she will prefer brand 3.
The ratio of the probability of choosing one outcome category over the probability of choosing the reference category is often referred as relative risk (and it is also sometimes referred as odds). So another way of interpreting the regression results is in terms of relative risk. We can say that for one unit change in the variable age, we expect the relative risk of choosing brand 1 over 3 to increase by exp(-0.686) = 0.504. So we can say that the relative risk is lower for older people. For a dichotomous predictor variable such as female, we can say that the ratio of the relative risks of choosing brand 1 over 2 for females (compared to males) is exp(-0.465)=0.628. To get the exponentiated coefficients in R, type:
exp(coef(mlogit)) (Intercept):1 (Intercept):2 female:1 female:2 age:1 age:2 7.375253e+09 5.676874e+04 6.275441e-01 1.059580e+00 5.036326e-01 7.278198e-01
The intercepts are produced, but they are not particularly interpretable in this form, as discussed above, our primary interest is in the coefficients themselves.
We can also present the regression results in terms of probabilities. For example, we can fix age at its mean and calculate the probabilities of choosing each brand for both females and males. The first line of the code shown below creates an object called female, it is a vector containing the two values that the variable female takes on (0 and 1). The second line of the code creates an object called age that takes on only one value, in this case we have set age to its mean using the command mean(data.frame$varname), but they could be set at any value (e.g. age=35). These objects must have the same names as the variables in your logistic regression above (e.g. in this example the mean for age must be named age). On the third line of code, we combine the objects we just created (female and age) into a data frame called newdata1. This is the new dataset we will used to create our predicted probabilities.
female <- c(0,1) age <- c(mean(mydata$age)) newdata1 <- data.frame(female,age)
Now that we have the data frame we want to use to calculate the predicted probabilities, we can tell R to create the predicted probabilities. While relatively compact, the first line of code below is important and contains a lot of instructions. The newdata1$predicted tells R that we want to create a new variable in the dataset (data frame) newdata1 called predicted, and that the values of predicted should be predictions made using the predict( ) command. The commands within the parentheses tell R that the predictions should be based on the analysis mlogit, with values of the predictor variables coming from newdata1 and that the type of prediction is a predicted probability (type="response"). The second line of the code tells R to list the values in the data frame newdata1. Although not particularly pretty, this is a table of predicted probabilities. The values of female and age are listed, followed by three columns labeled "predicted.1," "predicted.2," and "predicted.3," these contain the predicted probability that brand equals 1, 2, and 3 respectively. The predicted probability of a female, of average (mean) age selecting brand 1 is .21, the predicted probability for a male of the same age selecting brand 1 is .31.
newdata1$predicted <-predict(mlogit,newdata=newdata1,type="response") newdata1 female age predicted.1 predicted.2 predicted.3 1 0 32.90068 0.3066382 0.4306335 0.2627283 2 1 32.90068 0.2111245 0.5006218 0.2882537
We can do something very similar to create a table of predicted probabilities varying the value of age. Above, we created objects called age and female, and then bound them into a single data frame. This time, we are going to create newdata2 by creating a data frame and declaring the values within it at the same time. In the first line below, the command data.frame( ) tells R that we want to create a data frame. Within the parentheses, the command female=mean(female) tells R that the first variable in the data frame should be called female and that it should equal to the mean of female for every case in the data frame. The command age=seq(24,38,1) tells R that the second variable in the data frame should be called age and that it should take on a sequence of values from 24 to 38, in increments of 1 (e.g. the first three values of age in the data frame will be 24, 25, and 26). The second line of code creates an object called predicted, which contains the predicted probability that each case in newdata2 will belong to each of the categories of brand (i.e. that they will select a given brand). The third line of this code is a little different than the code for varying female, instead of instructing R to output all of the data in the data frame as we did above, we instruct R to print the values of age from newdata2 in the first column, followed by the values (predicted probabilities) from the object predicted. The column labeled age contains the values of age for which the predicted probabilities were calculated, the three columns to the right of age (labeled 1, 2, and 3, corresponding to the categories of brand) contain the predicted probability that a respondent at a given age will select each brand, holding gender constant. In the table below we see that the predicted probability of a 24 year old respondent selecting brand one (brand=1) is .93, while the predicted probability of a 38 year old selecting brand 1 is only .02, holding female constant.
newdata2<-data.frame(female=mean(female),age=seq(24,38,1)) predicted<-predict(mlogit,newdata=newdata2,type="response") cbind(age=newdata2$age,predicted) age 1 2 3 1 24 0.92899816 0.06861511 0.002386724 2 25 0.89941144 0.09600047 0.004588090 3 26 0.85882705 0.13247402 0.008698921 4 27 0.80448944 0.17933102 0.016179541 5 28 0.73417486 0.23650733 0.029317809 6 29 0.64732127 0.30135266 0.051326074 7 30 0.54638721 0.36759172 0.086021074 8 31 0.43766645 0.42551854 0.136815012 9 32 0.33049966 0.46436160 0.205138740 10 33 0.23458232 0.47631100 0.289106674 11 34 0.15670566 0.45982232 0.383472017 12 35 0.09901625 0.41987680 0.481106948 13 36 0.05960989 0.36529504 0.575095068 14 37 0.03446841 0.30525074 0.660280857 15 38 0.01929236 0.24690554 0.733802098
We can also present the regression result graphically. For example, we can create objects containing the predicted probabilities, and then plot them against values of the predictor variable. In the example below, we plot the predicted probability that the respondent selects brand 1 (brand=1) against age separated by the variable female (i.e. predicted probabilities are calculated separately for males and females). Much like the example above, the first two lines of code below create a new data frame where age varies, but this time, female is set equal to zero. The new data frame is then used to create an object called male_p, which contains the predicted probabilities. The third and forth lines of code repeat the same process, except that in newdata4, female=1 so that the object female_p contains the predicted probabilities for females. The next block of code creates the objects that will be used in the plot command, age contains the values of age that will go on the x axis. The next line of code creates an object containing the predicted probability that brand equals 1 at different values of age (these are the values to be plotted on the y-axis). Because the object male_p contains predicted probabilities for all three values of brand, each in its own column, the command male_p[,1] tells R that we want male_p1 to contain all values in the first column of male_p. The next line of code does the same for females by taking on the values from the first column of female_p. If we wanted the predicted probabilities for brand=2 we would take all the values from the second column of male_p using the command male_p[,2].
newdata3<-data.frame(female=0,age=seq(24,38,1)) male_p<-predict(mlogit,newdata=newdata3,type="response") newdata4<-data.frame(female=1,age=seq(24,38,1)) female_p<-predict(mlogit,newdata=newdata4,type="response")X age<-newdata3$age male_p1<-male_p[,1] female_p1<-female_p[,1]
Now that we have set up the objects we want to graph, we use the plot command to plot the predicted probabilities against age. The first line of code tells R that we want to create a plot, that we want values of the object male_p1 to be graphed on the y-axis, and values of the object age on the x-axis. The option type="l" indicates that the plot should be a line graph, col="blue" indicates that the line should be blue, lwd=1 indicates that the line width should be 1 (a thin line), and the commands ylab= and xlab= set the labels for the y and x axes respectively, if these are not specified, they default to the object names. The line of code lines(female_p1~age,col="red",lwd=1) adds an additional line to our graph using the values from the object female_p1 for the y-axis (this plots the predicted probabilities for females), and the values from the object age on the x-axis. The command col="red" indicates that this line should be red. The final line of code adds a legend to the plot. The first two values given (33.5,.93) are the y- and x-coordinates indicating where the legend should go (note, these values are based on the scale of the variables on the y and x axes). The command c("Males","Females") provides the labels for the lines (the first line is males, the second line is females), col=c("blue","red"), indicates that the color of the first line is blue, and the second line is red, and lwd=c(1,1) indicates that both the first and second lines are of width 1. All of these options must be specified for your legend to work properly. Note: You can save this graph by right clicking on it, or using the savePlot command.
plot(male_p1~age,type="l",col="blue",lwd=1,
ylab="Predicted Probability for Brand = 1",
xlab="Age")
lines(female_p1~age,col="red",lwd=1)
legend(33.5,.93,c("Males","Females"),col=c("blue","red"),lwd=c(1,1))
The current version of xtable does not work with the package VGAM, nor are we aware of any other R package to output formatted tables based on the output from VGAM. One possible setup for a multinomial logistic regression table is shown below.
b S.E. z exp(b) brand = 1 vs 3 Intercept 22.721396*** 2.058016 11.04043 female -0.465941* 0.226089 -2.06087 .627544 age -0.685908*** 0.062626 -10.95243 .503633 brand = 2 vs 3 Intercept 10.946741*** 1.493160 7.33126 female 0.057873 0.196427 0.29463 1.059580 age -0.317702*** 0.044007 -7.21939 .503633
Presenting the multinomial regression results can be somewhat tricky since there are multiple equations and multiple comparisons to present. One has to keep in mind which predictor variable is changing, as well as the comparison being made (e.g. category 1 versus category 3 of the variable being predicted). For example, the effect of female is 1.0596 on the relative risk of choosing brand 2 over 3, meaning that the percent increase of relative risk (or rather loosely, the odds) of choosing brand 2 over 3 from male to female is about 6 percent.
Sometimes, it might be desirable to present the results in terms of probabilities. But because that multinomial logistic model is not a linear model, it is not sufficient to present the regression results for just one set of values. For example, we can create a table of probabilities based on gender for a subject with an average age as shown below.
newdata5<-data.frame(female=c(0,1),age=mean(mydata$age)) newdata5$pred<-predict(mlogit,newdata=newdata5,type="response") newdata5 female age pred.1 pred.2 pred.3 1 0 32.90068 0.3066382 0.4306335 0.2627283 2 1 32.90068 0.2111245 0.5006218 0.2882537
We can say that overall a person of average age will be more likely to choose brand 2 regardless of gender. All seems good. But what if we change age from its mean to two units down?
newdata6<-data.frame(female=c(0,1),age=mean(mydata$age)-2) newdata6$pred<-predict(mlogit,newdata=newdata6,type="response") newdata6 female age pred.1 pred.2 pred.3 1 0 30.90068 0.5291631 0.3558369 0.1150000 2 1 30.90068 0.4029471 0.4575086 0.1395443
Now we can see that we will have to change our description! In general, the effect of the variable female depends on the values of other predictor variables in the model. In our case, it depends on the value of the variable age. For this reason, it is more informative to present the predicted probability plots (as shown before).
UCLA Researchers are invited to our Statistical Consulting Services
We recommend others to our list of Other Resources for Statistical Computing Help
These pages are Copyrighted (c) by UCLA Academic Technology Services