UCLA Academic Technology Services

Stata Data Analysis Examples
Multinomial Logistic Regression

Examples

Example 1. People's occupational choices might be influenced by their parents' occupations and their own education level. We can study the relationship of one's occupation choice with education level and father's occupation.  The occupational choices will be the outcome variable which consists of categories of occupations.

Example 2. A biologist may be interested in the food choices that alligators make. Adult alligators might have different preferences than young ones. The outcome variable here will be the type of food, and the predictor variables might be the length of the alligator and other environmental variables.

Example 3. Several brands of similar products are on the market, and you want to study brand choices based on gender and age. For example, a recent finding of a market research group claims that, among digital camera choices, women are more likely than men to prefer Kodak and men are more likely than women to prefer Canon.

Description of the Data

For our data analysis example, we will expand our third example with a hypothetical data set. The data set contains information on 735 subjects who were asked which of three brands of some product (e.g., car or TV) they prefer. The data set also includes each subject's gender and age. You can get access to the data set in Stata by typing:

use http://www.ats.ucla.edu/stat/stata/dae/mlogit, clear
describe

Contains data from mlogit.dta
  obs:           735                          brand choices
 vars:             3                          19 Jan 2006 09:43
 size:         5,145 (99.4% of memory free)   (_dta has notes)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
brand           byte   %9.0g                  
female          byte   %8.0g                  
age             byte   %9.0g                  
-------------------------------------------------------------------------------
Sorted by:  age

The outcome variable is brand. The variable female is coded as 0 for male and 1 for female. Let's start with some descriptive statistics for the variables of interest.

tab brand
      brand |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |        207       28.16       28.16
          2 |        307       41.77       69.93
          3 |        221       30.07      100.00
------------+-----------------------------------
      Total |        735      100.00
bysort brand: sum age female
-------------------------------------------------------------------------
-> brand = 1
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         age |       207    31.48792    2.108374         24         38
      female |       207    .5555556    .4981086          0          1
-------------------------------------------------------------------------
-> brand = 2
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         age |       307    32.84365    1.824395         28         38
      female |       307    .6775244     .468187          0          1
-------------------------------------------------------------------------
-> brand = 3
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         age |       221    34.30317    2.347811         27         38
      female |       221    .6470588    .4789695          0          1

Some Strategies You Might Try

Using the Multinomial Logit Model

We are now ready to build our model. Our goal is to relate brand choice to age and gender. We will assume a linear relationship between the transformed outcome (the log of the ratio of each brand's probability to that of a base category) and our predictor variables female and age. Since there are multiple categories, we must choose a base category as the comparison group. Here our choice is the first brand (brand=1).

mlogit brand female age, base(1)
Iteration 0:   log likelihood = -795.89581
Iteration 1:   log likelihood = -709.10396
Iteration 2:   log likelihood = -703.08391
Iteration 3:   log likelihood = -702.97081
Iteration 4:   log likelihood =  -702.9707
Multinomial logistic regression                   Number of obs   =        735
                                                  LR chi2(4)      =     185.85
                                                  Prob > chi2     =     0.0000
Log likelihood =  -702.9707                       Pseudo R2       =     0.1168
------------------------------------------------------------------------------
       brand |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
2            |
      female |   .5238143   .1942466     2.70   0.007      .143098    .9045307
         age |   .3682065   .0550031     6.69   0.000     .2604024    .4760106
       _cons |  -11.77466    1.77461    -6.64   0.000    -15.25283   -8.296483
-------------+----------------------------------------------------------------
3            |
      female |   .4659414   .2260895     2.06   0.039      .022814    .9090688
         age |   .6859082   .0626265    10.95   0.000     .5631626    .8086539
       _cons |   -22.7214   2.058027   -11.04   0.000    -26.75505   -18.68774
------------------------------------------------------------------------------
(brand==1 is the base outcome)

The output above has two parts, labeled with the categories of the outcome variable brand. They correspond to two equations:

log(P(brand=2)/P(brand=1)) = b_10 + b_11*female + b_12*age
log(P(brand=3)/P(brand=1)) = b_20 + b_21*female + b_22*age,

with b's being the raw regression coefficients from the output.
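Because the three probabilities must sum to one, the two equations also determine each individual probability. The following is a sketch in the same notation; the symbols eta_1 and eta_2 are our own shorthand for the two right-hand sides and do not appear in the Stata output.

```latex
\begin{aligned}
\eta_1 &= b_{10} + b_{11}\,\text{female} + b_{12}\,\text{age}, \qquad
\eta_2 = b_{20} + b_{21}\,\text{female} + b_{22}\,\text{age},\\[4pt]
P(\text{brand}=1) &= \frac{1}{1 + e^{\eta_1} + e^{\eta_2}},\\[4pt]
P(\text{brand}=2) &= \frac{e^{\eta_1}}{1 + e^{\eta_1} + e^{\eta_2}},\\[4pt]
P(\text{brand}=3) &= \frac{e^{\eta_2}}{1 + e^{\eta_1} + e^{\eta_2}}.
\end{aligned}
```

Dividing the second or third line by the first recovers the two log-ratio equations, which is why the base category needs no coefficients of its own.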

For example, we can say that for a one-unit increase in the variable age, the log of the ratio of the two probabilities, P(brand=2)/P(brand=1), increases by 0.368, and the log of the ratio of the two probabilities P(brand=3)/P(brand=1) increases by 0.686. Therefore, we can say that, in general, the older a person is, the more likely he/she is to prefer brand 2 or brand 3 over brand 1.

The ratio of the probability of choosing one outcome category over the probability of choosing the reference category is often referred to as relative risk (and sometimes, loosely, as odds).  So another way of interpreting the regression results is in terms of relative risk. For a one-unit increase in the variable age, we expect the relative risk of choosing brand 2 over brand 1 to be multiplied by exp(.3682) = 1.45, that is, to increase by about 45 percent. So we can say that the relative risk is higher for older people. For a dichotomous predictor variable such as female, the ratio of the relative risk of choosing brand 2 over 1 for females to that for males is exp(.5238) = 1.69. We can use the rrr option of the mlogit command to display the regression results as relative risk ratios.

mlogit, rrr
Multinomial logistic regression                   Number of obs   =        735
                                                  LR chi2(4)      =     185.85
                                                  Prob > chi2     =     0.0000
Log likelihood =  -702.9707                       Pseudo R2       =     0.1168
------------------------------------------------------------------------------
       brand |        RRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
2            |
      female |   1.688456   .3279768     2.70   0.007     1.153843    2.470772
         age |    1.44514   .0794872     6.69   0.000     1.297452     1.60964
-------------+----------------------------------------------------------------
3            |
      female |   1.593514   .3602768     2.06   0.039     1.023076     2.48201
         age |   1.985574   .1243495    10.95   0.000     1.756218    2.244884
------------------------------------------------------------------------------
(brand==1 is the base outcome)
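
As a quick check, the RRR column can be reproduced by exponentiating the stored coefficients. This is a minimal sketch that assumes the mlogit model above is still the active estimation and that its equations are named after the outcome values 2 and 3, as the output suggests. The five-unit comparison is our own illustration and is not part of the original output.

```stata
* Run after: mlogit brand female age, base(1)
* Relative risk ratio for a one-unit increase in age, brand 2 versus brand 1.
display exp([2]_b[age])
* Relative risk ratio for female versus male, brand 3 versus brand 1.
display exp([3]_b[female])
* The same age effect accumulated over a five-unit increase in age.
display exp(5*[2]_b[age])
```

The first two values should agree with the corresponding entries of the RRR table. The third shows that relative risk ratios multiply, rather than add, across units of the predictor.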

We can also present the regression results in terms of probabilities. For example, we can fix age at its mean and calculate the probabilities of choosing each brand for both females and males.  A suite of Stata commands called spost, written by J. Scott Long and Jeremy Freese, supports post-estimation interpretation of regression models, including multinomial regression models. It can be downloaded by following the link presented by the Stata command findit spost9_ado (see "How can I use the findit command to search for programs and get additional help?" for more information about using findit). We will use the prtab command to get the predicted probabilities.

prtab female
mlogit: Predicted probabilities for brand
Predicted probability of outcome 2
----------------------
   female | Prediction
----------+-----------
        0 |     0.4306
        1 |     0.5006
----------------------
Predicted probability of outcome 3
----------------------
   female | Prediction
----------+-----------
        0 |     0.2627
        1 |     0.2883
----------------------
Predicted probability of outcome 1
----------------------
   female | Prediction
----------+-----------
        0 |     0.3066
        1 |     0.2111
----------------------
       female        age
x=  .63401361   32.90068
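
The prtab values can also be computed by hand from the coefficients, which makes explicit how the model turns linear predictors into probabilities. The sketch below does this for a male (female = 0) at the mean age reported by prtab. The scalar names xb2, xb3 and denom are our own, and the code assumes the mlogit model above is the active estimation with equations named 2 and 3.

```stata
* Run after: mlogit brand female age, base(1)
* Linear predictors for a male (female = 0) at the mean age shown by prtab.
scalar xb2 = [2]_b[_cons] + [2]_b[female]*0 + [2]_b[age]*32.90068
scalar xb3 = [3]_b[_cons] + [3]_b[female]*0 + [3]_b[age]*32.90068
* The base outcome (brand 1) has a linear predictor of zero, and exp(0) = 1.
scalar denom = 1 + exp(xb2) + exp(xb3)
display "P(brand=1) = " 1/denom
display "P(brand=2) = " exp(xb2)/denom
display "P(brand=3) = " exp(xb3)/denom
```

The three results should closely match the female = 0 rows of the prtab tables. Replacing the 0 with 1 in both linear predictors gives the corresponding values for females.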

We can also present the regression results graphically. For example, we can create three variables p1, p2 and p3 for the predicted probabilities and plot them against a predictor variable. In the example below, we plot p1 against age separately for males and females.

predict p1 p2 p3
sort age
line p1 age if female ==0 || line p1 age if female==1, legend(order(1 "male" 2 "female"))
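
The same approach extends to all three outcomes. As a sketch, assuming p1, p2 and p3 have been created by the predict command above, the following plots the three predicted probabilities against age for males only. The titles and legend labels are our own choices.

```stata
* Assumes p1, p2 and p3 were created by the predict command above.
* Within one gender the predictions depend only on age, so each line is smooth.
line p1 p2 p3 age if female==0, sort ///
    legend(order(1 "brand 1" 2 "brand 2" 3 "brand 3")) ///
    ytitle("Predicted probability") title("Males")
```

Changing the condition to female==1 (and the title accordingly) gives the matching plot for females.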

Sample Write-up of the Analysis

We will use the estout command to create a table of the results that might be more appropriate for publication.  This command is user-written, so type findit estout to download it (see "How can I use the findit command to search for programs and get additional help?" for more information about using findit).

estout, eform drop(_cons) cells(b(star fmt(%8.4f)) se(par)) stats(r2_p chi2 p, labels("Pseudo R-Square")) ///
        unstack varwidth(20) modelwidth(10) collabels(, none) mlabels("Multinomial Model")
                        Multinomial Model                    
                                 2               3   
female                      1.6885**        1.5935*  
                          (0.3280)        (0.3603)   
age                         1.4451***       1.9856***
                          (0.0795)        (0.1243)   
Pseudo R-Square             0.1168                   
chi2                      185.8502                   
p                           0.0000                   

Presenting multinomial regression results can be somewhat tricky, since there are multiple equations and multiple comparisons to present. For example, the table above only shows the relative risk ratios for 2 versus 1 and 3 versus 1. How about 2 versus 3? As the number of outcome categories increases, the number of possible comparisons grows as well, and much faster. A very useful command is listcoef, also written by Long and Freese, which saves us this trouble. For example:

listcoef female, help
mlogit (N=735): Factor Change in the Odds of brand 
Variable: female (sd=     .48)
    Odds comparing|
Group 1 vs Group 2|      b         z     P>|z|     e^b   e^bStdX
------------------+---------------------------------------------
2       -3        |   0.05787    0.295   0.768   1.0596   1.0283
2       -1        |   0.52381    2.697   0.007   1.6885   1.2872
3       -2        |  -0.05787   -0.295   0.768   0.9438   0.9725
3       -1        |   0.46594    2.061   0.039   1.5935   1.2518
1       -2        |  -0.52381   -2.697   0.007   0.5923   0.7769
1       -3        |  -0.46594   -2.061   0.039   0.6275   0.7988
----------------------------------------------------------------
       b = raw coefficient
       z = z-score for test of b=0
   P>|z| = p-value for z-test
     e^b = exp(b) = factor change in odds for unit increase in X
 e^bStdX = exp(b*SD of X) = change in odds for SD increase in X

With the help option above, we get an explanation of each column of the output. We might present this table as part of the write-up as well. In more detail, we can say that, holding all other variables constant, being female multiplies the relative risk of choosing brand 2 over brand 3 by 1.0596; that is, the relative risk (or, rather loosely, the odds) of choosing brand 2 over 3 is about 6 percent higher for females than for males. Note that this difference is not statistically significant (p = 0.768).
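
If listcoef is not installed, the brand 2 versus brand 3 comparison can also be obtained with built-in commands. The sketch below shows two hedged alternatives: a linear combination of the stored coefficients, and a refit with a different base outcome. It assumes the equations are named 2 and 3, as in the output above.

```stata
* Run after: mlogit brand female age, base(1)
* Difference of the female coefficients in equations 2 and 3,
* reported as a relative risk ratio (brand 2 versus brand 3).
lincom [2]female - [3]female, rrr

* Alternatively, refit the model with brand 3 as the base outcome.
* This replaces the active estimation results.
mlogit brand female age, base(3) rrr
```

Both routes should give a ratio of about 1.06 for female, in line with the first row of the listcoef table. The lincom route leaves the original model in memory, which is convenient when further post-estimation commands follow.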

Sometimes, it might be desirable to present the results in terms of probabilities. But because the multinomial logistic model is not linear in the probabilities, it is not sufficient to present predicted probabilities for just one set of predictor values. To see why, we first create a table of probabilities by gender for a subject of average age, as shown below.

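preserve 
* Set every subject's age to the sample mean; restore undoes this afterwards.
sum age
replace age=r(mean)
* Replace the earlier predictions with ones evaluated at the mean age.
drop p1 p2 p3
predict p1 p2 p3
tabstat p1 p2 p3, by(female) 
restore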
  female |        p1        p2        p3
---------+------------------------------
       0 |  .3066382  .4306335  .2627283
       1 |  .2111244  .5006219  .2882537
---------+------------------------------
   Total |  .2460812  .4750071  .2789118
----------------------------------------

We can say that a person of average age is most likely to choose brand 2 regardless of gender, although the preference for brand 2 is more pronounced among females (0.50) than among males (0.43). All seems good. But what if we lower age to two units below its mean?

preserve 
* Same steps as before, but with age set two units below the sample mean.
sum age
replace age=r(mean)-2
drop p1 p2 p3
predict p1 p2 p3
tabstat p1 p2 p3, by(female) 
restore
  female |        p1        p2        p3
---------+------------------------------
       0 |  .5291631  .3558369      .115
       1 |  .4029471  .4575086  .1395443
---------+------------------------------
   Total |  .4491404  .4202981  .1305614
----------------------------------------

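Now we have to change our description: at this age, males are most likely to choose brand 1 (0.53), while females are still most likely to choose brand 2 (0.46), though only narrowly ahead of brand 1 (0.40). In general, the effect of the variable female on the predicted probabilities depends on the values of the other predictor variables in the model. In our case, it depends on the value of the variable age. For this reason, it is more informative to present the predicted probability plots (as shown before).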
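
Rather than editing age by hand for each scenario, predicted probabilities can be tabulated over a grid of ages in one step. The sketch below assumes a release of Stata that provides the margins and marginsplot commands, which are newer than the commands used elsewhere on this page. The age grid of 24 to 38 in steps of 2 is our own choice, based on the observed range of age.

```stata
* Run after: mlogit brand female age, base(1)
* Predicted probability of brand 1 for males and females over a grid of ages.
margins, at(female=(0 1) age=(24(2)38)) predict(outcome(1))
* Plot the results with age on the horizontal axis.
marginsplot, xdimension(age)
```

Changing outcome(1) to outcome(2) or outcome(3) gives the corresponding tables and plots for the other two brands.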


These pages are Copyrighted (c) by UCLA Academic Technology Services

