|
|
|
||||
|
Help the Stat Consulting Group by
giving a gift
| |||||
|
Loading
|
|||||
Note: the code on this page works with R 2.8.1.
Example 2. A biologist may be interested in food choices that alligators make. Adult alligators might have difference preference than young ones. The outcome variable here will be the types of food, and the predictor variables might be the length of the alligators and other environmental variables.
Example 3. Several brands of similar products are on the market, and you want to study brand choices based on gender and age. For example, a recent finding of a market research group claims that among digital camera choices, women prefer Kodak more than men and men prefer Canon more than women.
For our data analysis example, we will expand our third example with a hypothetical data set. The data set contains information on 735 subjects who were asked their preference on three brands of some product (e.g., car or TV). Included in the data set are the information on subjects' gender and age. You can get access to the data set in R by typing the command below. Note that R requires forward slashes (/) not back slashes (\) when specifying a file location even if the file is on your hard drive.
mydata <- read.csv(url("http://www.ats.ucla.edu/stat/r/dae/mlogit.csv"))
attach(mydata)
names(mydata)
[1] "brand" "female" "age"
The outcome variable is brand. The variable female is coded as 0 for male and 1 for female. Let's start with some descriptive statistics on the variables of our interest.
table(brand)
brand
1 2 3
207 307 221
table(female)
female
0 1
269 466
summary(age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
24.0 32.0 32.0 32.9 34.0 38.0
sd(age)
[1] 2.333195
xtabs(~ female + brand)
brand
female 1 2 3
0 92 99 78
1 115 208 143
Now we have warmed up to building our model. Our goal is to associate the brand choices with age and gender. We will assume a linear relationship between the transformed outcome variable and our predictor variables female and age. In order to run multinomial logistic regression in R, you first need to install the package (sometimes called a library) mlogit. Once you have installed mlogit, you will need to tell R you want to use this library before you can access the commands in it you do this using the command library() shown in the first line of code below.
Before running the model, we need to prepare the data. First we indicate that the outcome variable, brand, is a factor. Then we use the mlogit.data command to expand the data. For each observation in our original dataset, we now have 3 observations--one for each value of brand. In this new dataset, the brand number appears under alt and whether or not the given brand was selected by the person in the brand variable as a TRUE or FALSE.
library(mlogit)
mydata[1:10,]
brand female age
1 1 0 24
2 1 0 26
3 1 0 26
4 1 1 27
5 1 1 27
6 3 1 27
7 1 0 27
8 1 0 27
9 1 1 27
10 1 0 27
mydata$brand<-as.factor(mydata$brand)
mldata<-mlogit.data(mydata, varying=NULL, choice="brand", shape="wide")
mldata[1:10,]
chid alt brand female age
1.1 1 1 TRUE 0 24
1.2 1 2 FALSE 0 24
1.3 1 3 FALSE 0 24
2.1 2 1 TRUE 0 26
2.2 2 2 FALSE 0 26
2.3 2 3 FALSE 0 26
3.1 3 1 TRUE 0 26
3.2 3 2 FALSE 0 26
3.3 3 3 FALSE 0 26
4.1 4 1 TRUE 1 27
We can now run the model. In the mlogit command, we can indicate alternative-specific and individual-specific variables. We are only interested in individual-specific variables, which are indicated after the | symbol. We will consider brand 1 to be our reference level and indicate this in the model. The final line asks for a summary of the model.
mlogit.model<- mlogit(brand~1|female+age, data = mldata, reflevel="1")
summary(mlogit.model)
Call:
mlogit(formula = brand ~ 1 | female + age, data = mldata, reflevel = "1")
Frequencies of alternatives:
1 2 3
0.28163 0.41769 0.30068
Newton-Raphson maximisation
gradient close to zero. May be a solution
5 iterations, 0h:0m:0s
g'(-H)^-1g = 7.65E-19
Coefficients :
Estimate Std. Error t-value Pr(>|t|)
alt2 -11.774655 1.774612 -6.6351 3.244e-11 ***
alt3 -22.721396 2.058028 -11.0404 < 2.2e-16 ***
alt2:female 0.523814 0.194247 2.6966 0.007004 **
alt3:female 0.465941 0.226090 2.0609 0.039315 *
alt2:age 0.368206 0.055003 6.6943 2.167e-11 ***
alt3:age 0.685908 0.062627 10.9524 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Log-Likelihood: -702.97
McFadden R^2: 0.70981
Likelihood ratio test : chisq = 185.85 (p.value=< 2.22e-16)
In the output above, the first thing we see is the call. This is R reminding us what the model we ran was, what options we specified, etc.. Next we see the distribution of the three outcome levels followed by information on the model-fitting process: the maximization method used, the number of iterations required, the run time needed.
The next part of the output is labeled "Coefficients," and includes the coefficients, as well as their standard errors and associated t-values and p-values. Each independent variable (and the intercept, presented first) appear twice, once for each of the categories of the outcome indicated by "alt2" and "alt3". They correspond to two equations:
log(P(brand=2)/P(brand=1)) = b_10 + b_11*female + b_12*age
log(P(brand=3)/P(brand=1)) = b_20 + b_21*female + b_22*age,
with b's being the raw regression coefficients from the output.
For example, we can say that for one unit change in the variable age, the log of the ratio of the two probabilities, P(brand=2)/P(brand=1), will be increased by 0.368, and the log of the ratio of the two probabilities P(brand=3)/P(brand=1) will be increased by 0.686. Therefore, we can say that, in general, the older a person is, the more he/she will prefer brand 2 or 3 to brand 1.
The ratio of the probability of choosing one outcome category over the probability of choosing the reference category is often referred as relative risk (and it is also sometimes referred as odds). So another way of interpreting the regression results is in terms of relative risk. We can say that for one unit change in the variable age, we expect the relative risk of choosing brand 2 over 1 to increase by exp(.3682) = 1.45. So we can say that the relative risk is higher for older people. For a dichotomous predictor variable such as female, we can say that the ratio of the relative risks of choosing brand 2 over 1 for female and male is exp(.5238) = 1.69. We can exponentiate our coefficients to display the regression results in the language of risk.
exp(coef(mlogit.model))
alt2 alt3 alt2:female alt3:female alt2:age alt3:age
7.697192e-06 1.355886e-10 1.688456e+00 1.593514e+00 1.445140e+00 1.985574e+00The intercepts are produced, but they are not particularly interpretable in this form, as discussed above, our primary interest is in the coefficients themselves.
We can also present the regression results in terms of probabilities. For example, we can look at a range of age values and calculate the probabilities of choosing each brand for both females and males. In the code below, we first create a new dataset called newdata. We then generate the logit-scale predicted values using the model coefficients. For brand = 1, this value is zero.
newdata <- data.frame(cbind(age = rep(24:38, 2), female = c(rep(0, 15), rep(1, 15)))) logit1 <- rep(0, 30) logit2 <- -11.774655 + 0.523814*newdata$female + 0.368206*newdata$age logit3 <- -22.721396 + 0.465941*newdata$female + 0.685908*newdata$age
We can exponentiate these logit values and scale them, knowing that the probabilities of the three brands must sum to one.
logits <- cbind(logit1, logit2, logit3)
p.unscaled <- exp(logits)
p <- cbind(newdata, (p.unscaled / rowSums(p.unscaled)))
colnames(p) <- c("age", "female", "pred.1", "pred.2", "pred.3")
p
age female brand1 brand2 brand3
1 24 0 0.94795816 0.05022934 0.001812499
2 25 0 0.92560894 0.07087706 0.003514003
3 26 0 0.89429648 0.09896224 0.006741277
4 27 0 0.85114671 0.13611383 0.012739463
5 28 0 0.79313274 0.18329623 0.023571034
6 29 0 0.71788189 0.23975656 0.042361548
7 30 0 0.62507370 0.30168836 0.073237938
8 31 0 0.51809923 0.36136835 0.120532423
9 32 0 0.40487478 0.40810106 0.187024165
10 33 0 0.29639700 0.43174829 0.271854709
11 34 0 0.20299479 0.42731836 0.369686853
12 35 0 0.13057937 0.39723827 0.472182360
13 36 0 0.07951504 0.34957118 0.570913780
14 37 0 0.04627580 0.29400185 0.659722352
15 38 0 0.02598192 0.23854878 0.735469305
16 24 1 0.91532111 0.08189009 0.002788802
17 25 1 0.88079306 0.11387846 0.005328489
18 26 1 0.83412901 0.15585141 0.010019579
19 27 1 0.77287790 0.20868837 0.018433729
20 28 1 0.69562006 0.27143715 0.032942796
21 29 1 0.60315882 0.34012512 0.056716067
22 30 1 0.49959271 0.40712986 0.093277432
23 31 1 0.39240326 0.46212482 0.145471914
24 32 1 0.29086656 0.49502864 0.214104800
25 33 1 0.20320808 0.49978974 0.297002182
26 34 1 0.13411359 0.47668223 0.389204181
27 35 1 0.08404267 0.43168392 0.484273412
28 36 1 0.05034163 0.37368279 0.575975587
29 37 1 0.02903195 0.31143107 0.659536975
30 38 1 0.01623118 0.25162033 0.732148482
Although not particularly pretty, this is a table of predicted probabilities. The values of female and age are listed, followed by three columns labeled "pred.1," "pred.2," and "pred.3," these contain the predicted probability that brand equals 1, 2, and 3 respectively.
We can present these regression predictions graphically. For example, we can plot the predicted probabilities that the respondent selects brand 1 against age separated by the variable female using the dataset above. The first line of code below creates a plot with the values of pred.1 where female is 0 to be graphed against age. The option type="l" indicates that the plot should be a line graph, col="blue" indicates that the line should be blue. The next line of code adds an additional line to our graph for the predicted probabilities where female is 1. We then add a legend.
plot(p$pred.1[p$female==0]~p$age[p$female==0],type="l",col="blue",lwd=1,
ylab="Predicted Probability for Brand = 1",
xlab="Age")
lines(p$pred.1[p$female==1]~p$age[p$female==1],col="red",lwd=1)
legend(33.5,.93,c("Males","Females"),col=c("blue","red"),lwd=c(1,1))
Multinomial logit models assume the independence of irrelevant alternatives (IIA). We can test this assumption in R using the Hausman-McFadden test. To implement this test, we first run two mlogit models, one with all of the brands and one with a subset of brands. We then use the hmftest command.
m1<- mlogit(brand~1|female+age, data = mldata, reflevel="1")
m2<- mlogit(brand~1|female+age, data = mldata, reflevel="1", alt.subset=c("1", "2"))
hmftest(m1, m2)
Hausman-McFadden test
data: mldata
chisq = 1.7323, df = 3, p-value = 0.6298
alternative hypothesis: IIA is rejected
Based on the p-value, we fail to reject the null of IIA.
The current version of xtable does not work with the package mlogit. One possible setup for a multinomial logistic regression table is shown below.
b S.E. t exp(b) brand = 2 vs 1 Intercept -11.774655*** 1.774612 -6.6351 female 0.523814* 0.194247 2.6966 1.688455 age 0.368206*** 0.055003 6.6943 1.445140 brand = 3 vs 1 Intercept -22.721396*** 2.058028 -11.0404 female 0.465941* 0.226090 2.0609 1.593513 age 0.685908*** 0.062627 10.9524 1.985574
Presenting the multinomial regression results can be somewhat tricky since there are multiple equations and multiple comparisons to present. One has to keep in mind which predictor variable is changing, as well as the comparison being made (e.g. category 1 versus category 3 of the variable being predicted).
UCLA Researchers are invited to our Statistical Consulting Services
We recommend others to our list of Other Resources for Statistical Computing Help
These pages are Copyrighted (c) by UCLA Academic Technology Services