UCLA Academic Technology Services HomeServicesClassesContactJobs
Help the Stat Consulting Group by giving a gift             
Loading

R Data Analysis Examples
Multinomial Logistic Regression

Note: the code on this page works with R 2.8.1.


Examples

Example 1. People's occupational choices might be influenced by their parents' occupations and their own education level. We can study the relationship of one's occupation choice with education level and father's occupation.  The occupational choices will be the outcome variable which consists of categories of occupations.

Example 2. A biologist may be interested in food choices that alligators make. Adult alligators might have difference preference than young ones. The outcome variable here will be the types of food, and the predictor variables might be the length of the alligators and other environmental variables.

Example 3. Several brands of similar products are on the market, and you want to study brand choices based on gender and age. For example, a recent finding of a market research group claims that among digital camera choices, women prefer Kodak more than men and men prefer Canon more than women.

Description of the Data

For our data analysis example, we will expand our third example with a hypothetical data set. The data set contains information on 735 subjects who were asked their preference on three brands of some product (e.g., car or TV).  Included in the data set are the information on subjects' gender and age. You can get access to the data set in R by typing the command below. Note that R requires forward slashes (/) not back slashes (\) when specifying a file location even if the file is on your hard drive.

mydata <- read.csv(url("http://www.ats.ucla.edu/stat/r/dae/mlogit.csv"))
attach(mydata)
names(mydata)

[1] "brand"  "female" "age"   

The outcome variable is brand. The variable female is coded as 0 for male and 1 for female. Let's start with some descriptive statistics on the variables of our interest.

table(brand)
brand
  1   2   3 
207 307 221 

table(female)
female
  0   1 
269 466 

summary(age)
Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
24.0    32.0    32.0    32.9    34.0    38.0 

sd(age)
[1] 2.333195

xtabs(~ female + brand)
      brand
female   1   2   3
     0  92  99  78
     1 115 208 143

Some Strategies You Might Try

Using the Multinomial Logit Model

Now we have warmed up to building our model. Our goal is to associate the brand choices with age and gender. We will assume a linear relationship between the transformed outcome variable and our predictor variables female and age. In order to run multinomial logistic regression in R, you first need to install the package (sometimes called a library) mlogit. Once you have installed mlogit, you will need to tell R you want to use this library before you can access the commands in it you do this using the command library() shown in the first line of code below.

Before running the model, we need to prepare the data. First we indicate that the outcome variable, brand, is a factor. Then we use the mlogit.data command to expand the data.  For each observation in our original dataset, we now have 3 observations--one for each value of brand. In this new dataset, the brand number appears under alt and whether or not the given brand was selected by the person in the brand variable as a TRUE or FALSE.

library(mlogit)

mydata[1:10,]
   brand female age
1      1      0  24
2      1      0  26
3      1      0  26
4      1      1  27
5      1      1  27
6      3      1  27
7      1      0  27
8      1      0  27
9      1      1  27
10     1      0  27

mydata$brand<-as.factor(mydata$brand)
mldata<-mlogit.data(mydata, varying=NULL, choice="brand", shape="wide")

mldata[1:10,]
    chid alt brand female age
1.1    1   1  TRUE      0  24
1.2    1   2 FALSE      0  24
1.3    1   3 FALSE      0  24
2.1    2   1  TRUE      0  26
2.2    2   2 FALSE      0  26
2.3    2   3 FALSE      0  26
3.1    3   1  TRUE      0  26
3.2    3   2 FALSE      0  26
3.3    3   3 FALSE      0  26
4.1    4   1  TRUE      1  27

We can now run the model.  In the mlogit command, we can indicate alternative-specific and individual-specific variables. We are only interested in individual-specific variables, which are indicated after the | symbol.  We will consider brand 1 to be our reference level and indicate this in the model. The final line asks for a summary of the model.


mlogit.model<- mlogit(brand~1|female+age, data = mldata, reflevel="1")
summary(mlogit.model)

Call:
mlogit(formula = brand ~ 1 | female + age, data = mldata, reflevel = "1")

Frequencies of alternatives:
      1       2       3 
0.28163 0.41769 0.30068 

Newton-Raphson maximisation 
gradient close to zero. May be a solution 
5 iterations, 0h:0m:0s 
g'(-H)^-1g = 7.65E-19 

Coefficients :
              Estimate Std. Error  t-value  Pr(>|t|)    
alt2        -11.774655   1.774612  -6.6351 3.244e-11 ***
alt3        -22.721396   2.058028 -11.0404 < 2.2e-16 ***
alt2:female   0.523814   0.194247   2.6966  0.007004 ** 
alt3:female   0.465941   0.226090   2.0609  0.039315 *  
alt2:age      0.368206   0.055003   6.6943 2.167e-11 ***
alt3:age      0.685908   0.062627  10.9524 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Log-Likelihood: -702.97
McFadden R^2:  0.70981 
Likelihood ratio test : chisq = 185.85 (p.value=< 2.22e-16)

In the output above, the first thing we see is the call. This is R reminding us what the model we ran was, what options we specified, etc.. Next we see the distribution of the three outcome levels followed by information on the model-fitting process: the maximization method used, the number of iterations required, the run time needed.

The next part of the output is labeled "Coefficients," and includes the coefficients, as well as their standard errors and associated t-values and p-values. Each independent variable (and the intercept, presented first) appear twice, once for each of the categories of the outcome indicated by "alt2" and "alt3". They correspond to two equations:

log(P(brand=2)/P(brand=1)) = b_10 + b_11*female + b_12*age
log(P(brand=3)/P(brand=1)) = b_20 + b_21*female + b_22*age,

with b's being the raw regression coefficients from the output.

For example, we can say that for one unit change in the variable age, the log of the ratio of the two probabilities, P(brand=2)/P(brand=1), will be increased by 0.368, and the log of the ratio of the two probabilities P(brand=3)/P(brand=1) will be increased by 0.686. Therefore, we can say that, in general, the older a person is, the more he/she will prefer brand 2 or 3 to brand 1.

The ratio of the probability of choosing one outcome category over the probability of choosing the reference category is often referred as relative risk (and it is also sometimes referred as odds).  So another way of interpreting the regression results is in terms of relative risk. We can say that for one unit change in the variable age, we expect the relative risk of choosing brand 2 over 1 to increase by exp(.3682) = 1.45. So we can say that the relative risk is higher for older people. For a dichotomous predictor variable such as female, we can say that the ratio of the relative risks of choosing brand 2 over 1 for female and male is exp(.5238) = 1.69. We can exponentiate our coefficients to display the regression results in the language of risk.

exp(coef(mlogit.model))
        alt2         alt3  alt2:female  alt3:female     alt2:age     alt3:age 
7.697192e-06 1.355886e-10 1.688456e+00 1.593514e+00 1.445140e+00 1.985574e+00

The intercepts are produced, but they are not particularly interpretable in this form, as discussed above, our primary interest is in the coefficients themselves.

We can also present the regression results in terms of probabilities. For example, we can look at a range of age values and calculate the probabilities of choosing each brand for both females and males.  In the code below, we first create a new dataset called newdata. We then generate the logit-scale predicted values using the model coefficients.  For brand = 1, this value is zero.


newdata <- data.frame(cbind(age = rep(24:38, 2), female = c(rep(0, 15), rep(1, 15))))
logit1 <- rep(0, 30)
logit2 <- -11.774655 + 0.523814*newdata$female + 0.368206*newdata$age
logit3 <- -22.721396 + 0.465941*newdata$female + 0.685908*newdata$age

We can exponentiate these logit values and scale them, knowing that the probabilities of the three brands must sum to one.


logits <- cbind(logit1, logit2, logit3)
p.unscaled <- exp(logits)
p <- cbind(newdata, (p.unscaled / rowSums(p.unscaled)))
colnames(p) <- c("age", "female", "pred.1", "pred.2", "pred.3")

p
   age female     brand1     brand2      brand3
1   24      0 0.94795816 0.05022934 0.001812499
2   25      0 0.92560894 0.07087706 0.003514003
3   26      0 0.89429648 0.09896224 0.006741277
4   27      0 0.85114671 0.13611383 0.012739463
5   28      0 0.79313274 0.18329623 0.023571034
6   29      0 0.71788189 0.23975656 0.042361548
7   30      0 0.62507370 0.30168836 0.073237938
8   31      0 0.51809923 0.36136835 0.120532423
9   32      0 0.40487478 0.40810106 0.187024165
10  33      0 0.29639700 0.43174829 0.271854709
11  34      0 0.20299479 0.42731836 0.369686853
12  35      0 0.13057937 0.39723827 0.472182360
13  36      0 0.07951504 0.34957118 0.570913780
14  37      0 0.04627580 0.29400185 0.659722352
15  38      0 0.02598192 0.23854878 0.735469305
16  24      1 0.91532111 0.08189009 0.002788802
17  25      1 0.88079306 0.11387846 0.005328489
18  26      1 0.83412901 0.15585141 0.010019579
19  27      1 0.77287790 0.20868837 0.018433729
20  28      1 0.69562006 0.27143715 0.032942796
21  29      1 0.60315882 0.34012512 0.056716067
22  30      1 0.49959271 0.40712986 0.093277432
23  31      1 0.39240326 0.46212482 0.145471914
24  32      1 0.29086656 0.49502864 0.214104800
25  33      1 0.20320808 0.49978974 0.297002182
26  34      1 0.13411359 0.47668223 0.389204181
27  35      1 0.08404267 0.43168392 0.484273412
28  36      1 0.05034163 0.37368279 0.575975587
29  37      1 0.02903195 0.31143107 0.659536975
30  38      1 0.01623118 0.25162033 0.732148482

Although not particularly pretty, this is a table of predicted probabilities. The values of female and age are listed, followed by three columns labeled "pred.1," "pred.2," and "pred.3," these contain the predicted probability that brand equals 1, 2, and 3 respectively.

We can present these regression predictions graphically. For example, we can plot the predicted probabilities that the respondent selects brand 1 against age separated by the variable female using the dataset above. The first line of code below creates a plot with the values of pred.1 where female is 0 to be graphed against age. The option type="l" indicates that the plot should be a line graph, col="blue" indicates that the line should be blue. The next line of code adds an additional line to our graph for the predicted probabilities where female is 1. We then add a legend.


plot(p$pred.1[p$female==0]~p$age[p$female==0],type="l",col="blue",lwd=1,
      ylab="Predicted Probability for Brand = 1",
      xlab="Age")
lines(p$pred.1[p$female==1]~p$age[p$female==1],col="red",lwd=1)
legend(33.5,.93,c("Males","Females"),col=c("blue","red"),lwd=c(1,1))

Testing the assumption of independence of irrelevant alternatives

Multinomial logit models assume the independence of irrelevant alternatives (IIA).  We can test this assumption in R using the Hausman-McFadden test.  To implement this test, we first run two mlogit models, one with all of the brands and one with a subset of brands.  We then use the hmftest command. 

m1<- mlogit(brand~1|female+age, data = mldata, reflevel="1")
m2<- mlogit(brand~1|female+age, data = mldata, reflevel="1", alt.subset=c("1", "2"))

hmftest(m1, m2)

        Hausman-McFadden test

data:  mldata 
chisq = 1.7323, df = 3, p-value = 0.6298
alternative hypothesis: IIA is rejected

Based on the p-value, we fail to reject the null of IIA. 

Sample Write-up of the Analysis

The current version of xtable does not work with the package mlogit. One possible setup for a multinomial logistic regression table is shown below.

                      b         S.E.        t       exp(b)
brand = 2 vs 1
Intercept     -11.774655*** 1.774612  -6.6351
female          0.523814*   0.194247   2.6966    1.688455
age             0.368206*** 0.055003   6.6943    1.445140
brand = 3 vs 1
Intercept     -22.721396*** 2.058028 -11.0404
female          0.465941*   0.226090   2.0609    1.593513
age             0.685908*** 0.062627  10.9524    1.985574    

Presenting the multinomial regression results can be somewhat tricky since there are multiple equations and multiple comparisons to present. One has to keep in mind which predictor variable is changing, as well as the comparison being made (e.g. category 1 versus category 3 of the variable being predicted).

Cautions, Flies in the Ointment

See Also


How to cite this page

Report an error on this page or leave a comment

UCLA Researchers are invited to our Statistical Consulting Services
We recommend others to our list of Other Resources for Statistical Computing Help
These pages are Copyrighted (c) by UCLA Academic Technology Services


The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California