### Mplus Class Notes Analyzing Data: Latent Class and Other Mixture Models

Mixture models are measurement models that use observed variables as indicators of one or more nominal (i.e. categorical) latent variables. One way to think about mixture models is that one is attempting to identify subsets or "classes" of observations within the observed data. The latent variable (classes) is categorical, but the indicators may be either categorical or continuous. The term latent class analysis is often used to refer to a mixture model in which all of the observed indicator variables are categorical.

Mplus version 5.2 was used for these examples.

### 1.0 Latent class analysis

The examples on this page use a dataset with information on high school students' academic histories.  In the first example below, a 2 class model is estimated using four dichotomous variables as indicators (category 1 = no, category 2 = yes). The variables are whether the student had taken honors math (hm), honors English (he), or vocational classes (voc); and whether the student reported they were unlikely to go to college (nocol). The expected classes are academically oriented students (i.e. students who took honors classes, did not take vocational classes and reported they were likely to go to college), and students who are less academically oriented. The dataset for this example is lca.dat.

The input file for this model is shown below. The usevariables option of the variable: command specifies which variables will be used in this analysis (necessary when not all of the variables in the dataset are used). The classes option identifies the name of the latent variable (in this case c), followed by the number of classes to be estimated in parentheses (in this case 2). Note that the class variable(s) can be assigned any valid variable name. The categorical option of the variable: command tells Mplus which variables are categorical. The type option of the analysis: command specifies the type of model to be estimated, in this case a mixture model.

TITLE:	A latent class analysis (LCA)
Data:
file is lca.dat;
Variable:
names are hm he voc nocol ach9-ach12;
usevariables are hm he voc nocol ;
classes = c (2);
categorical = hm he voc nocol ;
Analysis:
type = mixture;


The output for this model is shown below.

INPUT READING TERMINATED NORMALLY

A latent class analysis (LCA)

SUMMARY OF ANALYSIS

Number of groups                                                 1
Number of observations                                         500

Number of dependent variables                                    4
Number of independent variables                                  0
Number of continuous latent variables                            0
Number of categorical latent variables                           1

Observed dependent variables

Binary and ordered categorical (ordinal)
HM          HE          VOC         NOCOL

Categorical latent variables
C

Estimator                                                      MLR
Information matrix                                        OBSERVED
Optimization Specifications for the Quasi-Newton Algorithm for
Continuous Outcomes
Maximum number of iterations                                 100
Convergence criterion                                  0.100D-05
Optimization Specifications for the EM Algorithm
Maximum number of iterations                                 500
Convergence criteria
Loglikelihood change                                 0.100D-06
Relative loglikelihood change                        0.100D-06
Derivative                                           0.100D-05
Optimization Specifications for the M step of the EM Algorithm for
Categorical Latent variables
Number of M step iterations                                    1
M step convergence criterion                           0.100D-05
Basis for M step termination                           ITERATION
Optimization Specifications for the M step of the EM Algorithm for
Censored, Binary or Ordered Categorical (Ordinal), Unordered
Categorical (Nominal) and Count Outcomes
Number of M step iterations                                    1
M step convergence criterion                           0.100D-05
Basis for M step termination                           ITERATION
Maximum value for logit thresholds                            15
Minimum value for logit thresholds                           -15
Minimum expected cell size for chi-square              0.100D-01
Optimization algorithm                                         EMA
Random Starts Specifications
Number of initial stage random starts                         10
Number of final stage optimizations                            2
Number of initial stage iterations                            10
Initial stage convergence criterion                    0.100D+01
Random starts scale                                    0.500D+01
Random seed for generating random starts                       0

Input data file(s)
lca.dat
Input data format  FREE

SUMMARY OF CATEGORICAL DATA PROPORTIONS

HM
Category 1    0.678
Category 2    0.322
HE
Category 1    0.686
Category 2    0.314
VOC
Category 1    0.322
Category 2    0.678
NOCOL
Category 1    0.334
Category 2    0.666

RANDOM STARTS RESULTS RANKED FROM THE BEST TO THE WORST LOGLIKELIHOOD VALUES

Final stage loglikelihood values at local maxima, seeds, and initial stage start numbers:

-965.244  93468            3
-965.244  939021           8

THE MODEL ESTIMATION TERMINATED NORMALLY

TESTS OF MODEL FIT

Loglikelihood

H0 Value                        -965.244
H0 Scaling Correction Factor       1.013
for MLR

Information Criteria

Number of Free Parameters              9
Akaike (AIC)                    1948.488
Bayesian (BIC)                  1986.420
(n* = (n + 2) / 24)

Chi-Square Test of Model Fit for the Binary and Ordered Categorical
(Ordinal) Outcomes

Pearson Chi-Square

Value                              6.287
Degrees of Freedom                     6
P-Value                           0.3918

Likelihood Ratio Chi-Square

Value                              5.605
Degrees of Freedom                     6
P-Value                           0.4688

FINAL CLASS COUNTS AND PROPORTIONS FOR THE LATENT CLASSES
BASED ON THE ESTIMATED MODEL

Latent
Classes

1        136.38198          0.27276
2        363.61802          0.72724

FINAL CLASS COUNTS AND PROPORTIONS FOR THE LATENT CLASS PATTERNS
BASED ON ESTIMATED POSTERIOR PROBABILITIES

Latent
Classes

1        136.38170          0.27276
2        363.61830          0.72724

CLASSIFICATION QUALITY

Entropy                         0.904

CLASSIFICATION OF INDIVIDUALS BASED ON THEIR MOST LIKELY LATENT CLASS MEMBERSHIP

Class Counts and Proportions

Latent
Classes

1              127          0.25400
2              373          0.74600

Average Latent Class Probabilities for Most Likely Latent Class Membership (Row)
by Latent Class (Column)

1        2

1   0.986    0.014
2   0.030    0.970

MODEL RESULTS

Two-Tailed
Estimate       S.E.  Est./S.E.    P-Value

Latent Class 1

Thresholds
HM$1              -2.063      0.373     -5.536      0.000
HE$1              -1.724      0.300     -5.755      0.000
VOC$1              2.331      0.389      5.985      0.000
NOCOL$1            2.078      0.320      6.490      0.000

Latent Class 2

Thresholds
HM$1               2.091      0.182     11.502      0.000
HE$1               2.056      0.180     11.401      0.000
VOC$1             -2.187      0.203    -10.760      0.000
NOCOL$1           -1.937      0.183    -10.613      0.000

Categorical Latent Variables

Means
C#1               -0.981      0.116     -8.468      0.000

RESULTS IN PROBABILITY SCALE

Latent Class 1

HM
Category 1         0.113      0.037      3.025      0.002
Category 2         0.887      0.037     23.799      0.000
HE
Category 1         0.151      0.038      3.934      0.000
Category 2         0.849      0.038     22.056      0.000
VOC
Category 1         0.911      0.031     28.987      0.000
Category 2         0.089      0.031      2.817      0.005
NOCOL
Category 1         0.889      0.032     28.072      0.000
Category 2         0.111      0.032      3.514      0.000

Latent Class 2

HM
Category 1         0.890      0.018     50.016      0.000
Category 2         0.110      0.018      6.181      0.000
HE
Category 1         0.887      0.018     48.873      0.000
Category 2         0.113      0.018      6.256      0.000
VOC
Category 1         0.101      0.018      5.472      0.000
Category 2         0.899      0.018     48.748      0.000
NOCOL
Category 1         0.126      0.020      6.267      0.000
Category 2         0.874      0.020     43.498      0.000

LATENT CLASS ODDS RATIO RESULTS

Latent Class 1 Compared to Latent Class 2

HM
Category > 1      63.670     25.875      2.461      0.014
HE
Category > 1      43.795     14.941      2.931      0.003
VOC
Category > 1       0.011      0.005      2.351      0.019
NOCOL
Category > 1       0.018      0.007      2.768      0.006

QUALITY OF NUMERICAL RESULTS

Condition Number for the Information Matrix              0.600E-01
(ratio of smallest to largest eigenvalue)


Towards the top of the output, under FINAL CLASS COUNTS..., Mplus gives the final counts and proportions for the classes in several ways. First it gives the counts (i.e. the number of cases in each class) and proportions based on the estimated model, and on the posterior probabilities. This gives the proportion (and count) of individuals estimated to be in each class in the model. Below that, Mplus gives the classification based on most likely class membership, which is an alternative method of assigning individuals to classes. Based on the estimated model and posterior probabilities we see that about 27% of students belong to class 1, and about 73% belong to class 2. Based on most likely class membership, about 25% of students belong to class 1 and the remaining 75% to class 2. Under MODEL RESULTS the thresholds for the classes are listed. Thresholds are on the logit scale, and hence, can be somewhat difficult to interpret. The same information is given in a more interpretable scale under RESULTS IN PROBABILITY SCALE. Here we see that the probability that an individual in class 1 will be in category 2 of the variable hm is .89. In other words, the estimated probability of a student in class 1 taking honors math is about .89.
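The correspondence between the logit-scale thresholds and the probability-scale results can be checked by hand. The sketch below is written in Python rather than Mplus, and its function names are illustrative only. It assumes the usual logistic parameterization for a binary indicator, in which the probability of category 1 is 1 / (1 + exp(-threshold)), and it assumes that the C#1 mean is a logit with the last class as the reference.

```python
import math

def prob_category_1(threshold):
    """P(category 1 | class) for a binary indicator with a logit threshold."""
    return 1.0 / (1.0 + math.exp(-threshold))

def class_1_proportion(c1_mean):
    """Class 1 proportion in a 2-class model (class 2 taken as the reference)."""
    return math.exp(c1_mean) / (1.0 + math.exp(c1_mean))

# Thresholds for Latent Class 1, copied from MODEL RESULTS above
thresholds_class_1 = {"hm": -2.063, "he": -1.724, "voc": 2.331, "nocol": 2.078}

for name, tau in thresholds_class_1.items():
    p1 = prob_category_1(tau)
    print(f"{name}: P(category 1) = {p1:.3f}, P(category 2) = {1 - p1:.3f}")

# Mean of C#1, copied from the Categorical Latent Variables section
print(f"class 1 proportion = {class_1_proportion(-0.981):.3f}")
```

Up to rounding of the reported estimates, these values agree with the entries under RESULTS IN PROBABILITY SCALE and with the class proportions based on the estimated model.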

### 2.0 Using both categorical and continuous indicator variables

Above we estimated a specific case of a mixture model, a latent class analysis, in which all of the indicators are categorical. In this example the model contains both categorical and continuous indicators. In addition to the four categorical variables used in the example above, this model includes four continuous variables: the student's score on a measure of academic achievement for each of the four years of high school (ach9-ach12). The achievement variables have been centered so that each has a mean of zero. The only substantive difference between the input file for this model and the one for the LCA estimated above is that the usevariables option has been dropped because all variables in the dataset are used in the model. In general, the only difference between the input file for a mixture model with all categorical indicators and the input for a model that includes continuous variables is the type of variables included.

Title: Categorical and continuous indicators
Data:
file is lca.dat;
Variable:
names are hm he voc nocol ach9-ach12;
classes = c (2);
categorical = hm he voc nocol ;
Analysis:
type = mixture;

Below is the output for this model.

*** WARNING in MODEL command
All variables are uncorrelated with all other variables within class.
Check that this is what is intended.
1 WARNING(S) FOUND IN THE INPUT INSTRUCTIONS

Categorical and continuous indicators

SUMMARY OF ANALYSIS

Number of groups                                                 1
Number of observations                                         500

Number of dependent variables                                    8
Number of independent variables                                  0
Number of continuous latent variables                            0
Number of categorical latent variables                           1

Observed dependent variables

Continuous
ACH9        ACH10       ACH11       ACH12

Binary and ordered categorical (ordinal)
HM          HE          VOC         NOCOL

Categorical latent variables
C

Estimator                                                      MLR
Information matrix                                        OBSERVED
Optimization Specifications for the Quasi-Newton Algorithm for
Continuous Outcomes
Maximum number of iterations                                 100
Convergence criterion                                  0.100D-05
Optimization Specifications for the EM Algorithm
Maximum number of iterations                                 500
Convergence criteria
Loglikelihood change                                 0.100D-06
Relative loglikelihood change                        0.100D-06
Derivative                                           0.100D-05
Optimization Specifications for the M step of the EM Algorithm for
Categorical Latent variables
Number of M step iterations                                    1
M step convergence criterion                           0.100D-05
Basis for M step termination                           ITERATION
Optimization Specifications for the M step of the EM Algorithm for
Censored, Binary or Ordered Categorical (Ordinal), Unordered
Categorical (Nominal) and Count Outcomes
Number of M step iterations                                    1
M step convergence criterion                           0.100D-05
Basis for M step termination                           ITERATION
Maximum value for logit thresholds                            15
Minimum value for logit thresholds                           -15
Minimum expected cell size for chi-square              0.100D-01
Optimization algorithm                                         EMA
Random Starts Specifications
Number of initial stage random starts                         10
Number of final stage optimizations                            2
Number of initial stage iterations                            10
Initial stage convergence criterion                    0.100D+01
Random starts scale                                    0.500D+01
Random seed for generating random starts                       0

Input data file(s)
lca.dat
Input data format  FREE

SUMMARY OF CATEGORICAL DATA PROPORTIONS

HM
Category 1    0.678
Category 2    0.322
HE
Category 1    0.686
Category 2    0.314
VOC
Category 1    0.322
Category 2    0.678
NOCOL
Category 1    0.334
Category 2    0.666

RANDOM STARTS RESULTS RANKED FROM THE BEST TO THE WORST LOGLIKELIHOOD VALUES

Final stage loglikelihood values at local maxima, seeds, and initial stage start numbers:

-3842.353  unperturbed      0
-3842.353  462953           7

THE MODEL ESTIMATION TERMINATED NORMALLY

TESTS OF MODEL FIT

Loglikelihood

H0 Value                       -3842.353
H0 Scaling Correction Factor       0.982
for MLR

Information Criteria

Number of Free Parameters             21
Akaike (AIC)                    7726.706
Bayesian (BIC)                  7815.213
(n* = (n + 2) / 24)

Chi-Square Test of Model Fit for the Binary and Ordered Categorical
(Ordinal) Outcomes

Pearson Chi-Square

Value                              7.628
Degrees of Freedom                     6
P-Value                           0.2666

Likelihood Ratio Chi-Square

Value                              6.974
Degrees of Freedom                     6
P-Value                           0.3233

FINAL CLASS COUNTS AND PROPORTIONS FOR THE LATENT CLASSES
BASED ON THE ESTIMATED MODEL

Latent
Classes

1        367.56581          0.73513
2        132.43419          0.26487

FINAL CLASS COUNTS AND PROPORTIONS FOR THE LATENT CLASS PATTERNS
BASED ON ESTIMATED POSTERIOR PROBABILITIES

Latent
Classes

1        367.56581          0.73513
2        132.43419          0.26487

CLASSIFICATION QUALITY

Entropy                         0.998

CLASSIFICATION OF INDIVIDUALS BASED ON THEIR MOST LIKELY LATENT CLASS MEMBERSHIP

Class Counts and Proportions

Latent
Classes

1              368          0.73600
2              132          0.26400

Average Latent Class Probabilities for Most Likely Latent Class Membership (Row)
by Latent Class (Column)

1        2

1   0.999    0.001
2   0.000    1.000

MODEL RESULTS

Two-Tailed
Estimate       S.E.  Est./S.E.    P-Value

Latent Class 1

Means
ACH9              -2.058      0.055    -37.121      0.000
ACH10             -2.061      0.051    -40.656      0.000
ACH11             -0.987      0.055    -18.070      0.000
ACH12             -0.990      0.052    -19.023      0.000

Thresholds
HM$1               2.021      0.162     12.453      0.000
HE$1               2.075      0.166     12.521      0.000
VOC$1             -2.075      0.166    -12.525      0.000
NOCOL$1           -1.931      0.157    -12.280      0.000

Variances
ACH9               1.116      0.073     15.346      0.000
ACH10              0.956      0.058     16.601      0.000
ACH11              1.031      0.059     17.382      0.000
ACH12              0.946      0.060     15.727      0.000

Latent Class 2

Means
ACH9               1.988      0.091     21.870      0.000
ACH10              1.971      0.087     22.653      0.000
ACH11              0.987      0.081     12.248      0.000
ACH12              0.829      0.080     10.425      0.000

Thresholds
HM$1              -2.101      0.282     -7.440      0.000
HE$1              -1.954      0.266     -7.354      0.000
VOC$1              2.267      0.302      7.514      0.000
NOCOL$1            2.306      0.303      7.617      0.000

Variances
ACH9               1.116      0.073     15.346      0.000
ACH10              0.956      0.058     16.601      0.000
ACH11              1.031      0.059     17.382      0.000
ACH12              0.946      0.060     15.727      0.000

Categorical Latent Variables

Means
C#1              1.021      0.102     10.055      0.000

RESULTS IN PROBABILITY SCALE

Latent Class 1

HM
Category 1         0.883      0.017     52.665      0.000
Category 2         0.117      0.017      6.977      0.000
HE
Category 1         0.888      0.016     54.096      0.000
Category 2         0.112      0.016      6.792      0.000
VOC
Category 1         0.112      0.016      6.794      0.000
Category 2         0.888      0.016     54.114      0.000
NOCOL
Category 1         0.127      0.017      7.283      0.000
Category 2         0.873      0.017     50.207      0.000

Latent Class 2

HM
Category 1         0.109      0.027      3.974      0.000
Category 2         0.891      0.027     32.487      0.000
HE
Category 1         0.124      0.029      4.296      0.000
Category 2         0.876      0.029     30.323      0.000
VOC
Category 1         0.906      0.026     35.304      0.000
Category 2         0.094      0.026      3.658      0.000
NOCOL
Category 1         0.909      0.025     36.451      0.000
Category 2         0.091      0.025      3.632      0.000

LATENT CLASS ODDS RATIO RESULTS

Latent Class 1 Compared to Latent Class 2

HM
Category > 1       0.016      0.005      3.066      0.002
HE
Category > 1       0.018      0.006      3.188      0.001
VOC
Category > 1      76.870     26.448      2.906      0.004
NOCOL
Category > 1      69.181     23.612      2.930      0.003

QUALITY OF NUMERICAL RESULTS

Condition Number for the Information Matrix              0.275E-01
(ratio of smallest to largest eigenvalue)

Towards the top of the output is a message warning us that all of the variables are uncorrelated within class. This "warning" does not imply a problem with the model; it is merely there to remind the user that the restriction exists. Whether this restriction is appropriate must be determined by the user. In addition to the thresholds for the categorical items (which were included in the output for the previous example), the output for this model includes means and variances for the continuous indicators (i.e. ach9-ach12). The means for the academic achievement variables (ach9-ach12) are all lower in the first class than in the second class. The first class is also less likely to have taken honors classes (hm and he), more likely to have taken vocational classes (voc), and more likely to report being unlikely to go to college (nocol). Although the order of the classes has reversed (i.e. the class we have called "academically oriented students" is class 2 in this model), the results of this model are consistent with the results from the model in the first example. The models in both examples are consistent with the hypothesis that there are two types of students: those who are academically oriented, and those who are not. Note that by default, Mplus specifies the model so that the variances of the continuous class indicators (ach9-ach12) are assumed equal across all classes; this assumption may or may not be appropriate.
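The equal-variance default is visible in the number of free parameters. As a rough check, the Python sketch below counts the parameters of this model and recomputes AIC and BIC from the loglikelihood using the standard formulas; the function and variable names are illustrative and are not Mplus options.

```python
import math

def information_criteria(loglik, n_params, n_obs):
    """Return (AIC, BIC) using the standard definitions."""
    deviance = -2.0 * loglik
    aic = deviance + 2 * n_params
    bic = deviance + n_params * math.log(n_obs)
    return aic, bic

n_classes = 2
n_binary = 4       # hm, he, voc, nocol
n_continuous = 4   # ach9-ach12

n_params = (
    n_classes * n_binary        # one threshold per binary indicator per class
    + n_classes * n_continuous  # one mean per continuous indicator per class
    + n_continuous              # variances, held equal across classes
    + (n_classes - 1)           # mean of the categorical latent variable (C#1)
)

# Loglikelihood and sample size copied from the output above
aic, bic = information_criteria(-3842.353, n_params, 500)
print(n_params, round(aic, 3), round(bic, 3))
```

Under this counting, the model has 21 free parameters, as reported in the output. Estimating a separate set of variances in each class would add n_continuous × (n_classes − 1) = 4 parameters.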

### 3.0 Saving Class Assignments

In addition to the output file produced by Mplus, it is possible to save class membership information for each case in the dataset to a text file. This text file can later be used with Mplus or read into another statistical package. To do this the savedata: command is added to the input file. The file option gives the name of the file in which the class assignments should be saved (i.e. class.txt). Whenever the file option is used, all of the variables used in the analysis are saved in an external file. The save = cprob; option specifies that the class probabilities should be saved, in addition to the variables used in estimation. Additional variables that were not used in the analysis, but which you wish to include in the saved file, for example, an id variable, can be included by adding the auxiliary option (e.g. auxiliary = id;) to the variable: command.

Title:	Saving class probabilities
Data:
file is lca.dat;
Variable:
names are hm he voc nocol ach9-ach12;
usevariables are hm he voc nocol ;
classes = c (2);
categorical = hm he voc nocol ;
Analysis:
type = mixture;
Savedata:
file is class.txt;
save = cprob;

The output file for this model contains all of the information contained in the output for the model in the first example, plus additional output associated with the savedata: command. This additional output appears towards the end of the output file, and is shown below.

SAVEDATA INFORMATION

Order and format of variables

HM             F10.3
HE             F10.3
VOC            F10.3
NOCOL          F10.3
CPROB1         F10.3
CPROB2         F10.3
C              F10.3

Save file
class.txt

Save file format
7F10.3

Save file record length    5000

The additional output associated with the savedata: command lists the variables in the order in which they appear in the saved dataset. Note that the 4 observed variables used in estimation are listed first, followed by three variables associated with the latent class assignment. The variables CPROB1 and CPROB2 give the probability that each case is in class 1 or class 2, respectively. The variable C contains the class assignment based on posterior probabilities. Below the list of variables the name of the file, and information on the format of the file are shown.
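To see where CPROB1 and CPROB2 come from, the posterior class probabilities for a given response pattern can be approximated from the estimates in the first example using Bayes' rule. The following Python sketch is an illustration, not Mplus's own computation; it assumes that the data are coded 0 = category 1 (no) and 1 = category 2 (yes), and it works from the rounded estimates printed in the output, so small discrepancies in the third decimal place are expected.

```python
# Class proportions based on the estimated model (first example)
class_proportions = [0.27276, 0.72724]

# P(category 2, i.e. "yes") for each indicator, from RESULTS IN PROBABILITY SCALE
p_yes = [
    {"hm": 0.887, "he": 0.849, "voc": 0.089, "nocol": 0.111},  # class 1
    {"hm": 0.110, "he": 0.113, "voc": 0.899, "nocol": 0.874},  # class 2
]

def posterior_class_probs(responses):
    """Posterior P(class | responses), assuming indicators are independent within class."""
    joint = []
    for proportion, probs in zip(class_proportions, p_yes):
        likelihood = 1.0
        for name, value in responses.items():
            likelihood *= probs[name] if value == 1 else 1.0 - probs[name]
        joint.append(proportion * likelihood)
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical student: honors math, honors English, no vocational classes,
# reported being unlikely to go to college
student = {"hm": 1, "he": 1, "voc": 0, "nocol": 1}
cprob = posterior_class_probs(student)
most_likely = cprob.index(max(cprob)) + 1  # analogous to the saved variable C
print([round(p, 3) for p in cprob], most_likely)
```

The class with the largest posterior probability is the one that would be recorded as the student's most likely class.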

The file class.txt is a text file that can be read by a large number of programs. The first few lines of this file are shown below. Based on the information in the output file, we know that the first four columns contain each student's value for the variables hm, he, voc, and nocol (in that order), the next two columns contain each student's estimated probability of membership in class 1 and class 2, and the final column contains the student's class assignment.

1.000     1.000     0.000     1.000     0.963     0.037     1.000
1.000     0.000     0.000     0.000     0.971     0.029     1.000
0.000     0.000     1.000     1.000     0.000     1.000     2.000
1.000     1.000     0.000     0.000     0.999     0.001     1.000
1.000     1.000     0.000     0.000     0.999     0.001     1.000
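As one way of reading the saved file into another program, the Python sketch below parses the records into dictionaries keyed by the variable names listed under SAVEDATA INFORMATION. It assumes the fields are separated by whitespace and that there are no missing-value flags in the file; the five records shown above are embedded so that the example is self-contained.

```python
COLUMNS = ["hm", "he", "voc", "nocol", "cprob1", "cprob2", "c"]

# The first five records of class.txt, copied from the listing above.
# For the real file, pass open("class.txt") to parse_savedata instead.
SAMPLE = """\
1.000     1.000     0.000     1.000     0.963     0.037     1.000
1.000     0.000     0.000     0.000     0.971     0.029     1.000
0.000     0.000     1.000     1.000     0.000     1.000     2.000
1.000     1.000     0.000     0.000     0.999     0.001     1.000
1.000     1.000     0.000     0.000     0.999     0.001     1.000
"""

def parse_savedata(lines):
    """Parse whitespace-separated records into a list of dicts keyed by COLUMNS."""
    records = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
        records.append(dict(zip(COLUMNS, (float(f) for f in fields))))
    return records

records = parse_savedata(SAMPLE.splitlines())

for record in records:
    # The class with the larger posterior probability should match column c
    modal_class = 1 if record["cprob1"] >= record["cprob2"] else 2
    print(int(record["c"]), modal_class, record["cprob1"], record["cprob2"])
```

In these five records the saved class assignment agrees with the class that has the larger saved probability.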

### 4.0 Plots

Plots based on the estimated model can also be requested by adding the plot: command to the input file. The type option specifies the type of plots desired; in this case, plot3 requests all plots available for this model. The series option gives the variables to be included in the plots; it can contain either categorical or continuous variables (but not both at the same time). The list of variables in the series option is followed by (*), which uses the defaults for the scaling of the x-axis in the plots. For more information on scaling of the x-axis see the Mplus manual.

Title: Categorical and continuous indicators
Data:
file is lca.dat;
Variable:
names are hm he voc nocol ach9-ach12;
classes = C (2);
categorical = hm he voc nocol ;
Analysis:
type = mixture;
Plot:
type = plot3;
series = ach9-ach12(*);

From the Graph menu select View graphs. Because the variables we wish to plot are continuous, we select Estimated means; for categorical variables we would select Estimated probabilities. The options under View graphs are somewhat limited for this model. If you were to specify a model in which class membership was predicted by additional variables, a larger variety of graphs would be available.

This graph, sometimes called a profile plot, shows graphically the latent class means given in the MODEL RESULTS section of the output for the second example. By default, the x-axis starts at zero and increases in units of one for each of the observed variables. In our example, this means that the mean for the variable ach9 is shown at 0, followed by ach10 at 1, etc.

The legend tells us that class 1 is shown in red, and class 2 in green. It also gives the proportion of cases in each class; in this case an estimated 74% of students are in class 1, and 26% are in class 2. This information can be found in the output under the heading "Final Class Counts and Proportions for the Latent Classes Based on the Estimated Model". Consistent with the means shown in the output for example 2, the plot shows that students in class 1 have lower average scores on all four of the achievement variables (ach9-ach12) than students in class 2.
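The same profile plot can be approximated outside Mplus from the class-specific means in the output. The Python sketch below builds the plotted points with the default x-axis scaling described above (0, 1, 2, 3). The optional plot_profiles function assumes the third-party matplotlib package is installed, and its output file name is hypothetical; none of this reproduces the Mplus graph module itself.

```python
VARIABLES = ["ach9", "ach10", "ach11", "ach12"]

# Class-specific means copied from MODEL RESULTS in the second example
CLASS_MEANS = {
    1: [-2.058, -2.061, -0.987, -0.990],
    2: [1.988, 1.971, 0.987, 0.829],
}

def profile_series():
    """Return {class: [(x, mean), ...]} with x = 0, 1, 2, ... for each variable."""
    x_positions = range(len(VARIABLES))
    return {cls: list(zip(x_positions, means)) for cls, means in CLASS_MEANS.items()}

def plot_profiles(path="profile_plot.png"):
    """Draw the profile plot to a PNG file (requires matplotlib; path is illustrative)."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for cls, points in profile_series().items():
        xs, ys = zip(*points)
        ax.plot(xs, ys, marker="o", label=f"Class {cls}")
    ax.set_xticks(range(len(VARIABLES)))
    ax.set_xticklabels(VARIABLES)
    ax.set_ylabel("Estimated mean")
    ax.legend()
    fig.savefig(path)
    plt.close(fig)

series = profile_series()
for cls, points in series.items():
    print(cls, points)
```

Calling plot_profiles() writes the figure to disk; the printed series alone is enough to confirm that the class 1 means lie below the class 2 means at every time point.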
