Programs for doing Latent Class Analysis

I have used a few different programs for doing latent class analysis, but recently found some excellent web pages at:

http://members.aol.com/KMarkus/lca.html,

and

http://ourworld.compuserve.com/homepages/jsuebersax/index.htm

These pages list the programs I use, along with several others. Keith Markus also gives a bibliography, to which I would add the (older) book:

Haberman SJ (1979): Analysis of qualitative data. Volume 2. New developments. New York: Academic Press.

and another program:

the e1071 R library function lca() (see CRAN, the R repository).

A very good program listed on these sites is:

LEM, Jeroen Vermunt's program for "log-linear and event history analysis with missing data using the EM algorithm".

LEM can fit the same models as LCAG, the program I previously preferred, and is available in Windows and DOS versions. It can be freely downloaded.

Examples using LEM

Below are a couple of examples from genetic epidemiology fitted using LEM, including an analysis of a classical twin study.

An ordinary latent class analysis

There are 10 binary variables observed in a sample of asthmatics: a positive or negative skin prick test for nine aeroallergens, and high or low total serum IgE. Because the table is so sparse, which makes the usual chi-square goodness-of-fit statistics unreliable, I have used the AIC to select the "best" number of latent classes. The same solution is selected when a representative subset of the variables is used in order to increase the cell sizes. A principal components analysis of the rank correlations of the original variables (without dichotomization) extracts two components that are interpretable as the two dimensions underlying the preferred four-class solution in the LCA.

*
* Unrestricted latent class model
*
* 2 latent classes with ten manifest variables:
*
* Skin prick tests for
*  Alternaria, Aspergillus, Canary Grass, Rye Grass, [ "Outdoor" allergens ]
*  Cat, Dog, Cockroach, D. pter, house dust          [ "Indoor" allergens  ]
* plus High total serum IgE (above 100 IU/ml)
* in a sample of unrelated asthmatics (one per twin pair)
* 
* One latent variable
*
lat 1
*
* Ten manifest variables
*
man 10
*
* All the manifest variables have two levels
* The number of levels of the latent variable X is increased from run to run
* to choose a parsimonious summary of the data
*
dim 2 2 2 2 2 2 2 2 2 2 2 
*
* labels
*
lab X A B C D E F G H I J  
*
* Model: Frequencies of two classes of X plus conditional probabilities
*        of membership for the 10 manifest variables
*
mod X   
    A|X 
    B|X 
    C|X 
    D|X 
    E|X 
    F|X 
    G|X 
    H|X 
    I|X 
    J|X
*
* 2x2x2x2x2x2x2x2x2x2 contingency table for manifest variables
*
* Just a trifle sparse (!) but one obtains similar answers when using a
* smaller number of variables so that cell sizes are larger (collapse
* across some of the variables below).
*
dat [ 26 0 0 0 6 2 3 2 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0
      0 0 0 1 0 0 0 0 2 4 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0
      0 0 2 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
      0 0 1 0 0 0 0 1 0 0 1 0 1 1 0 0 0 0 1 2 0 0 0 0 0 0 1 0 0 0
      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 3 0 0 0 0 1 0 0 0
      0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
      0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
      0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0
      0 1 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      3 0 0 1 0 2 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      1 0 1 0 0 0 0 0 0 3 0 0 0 0 0 0 0 3 0 0 0 0 0 0 3 3 0 0 1 0
      0 0 0 1 ]
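
The AIC comparison described for this example can be sketched outside LEM. The following Python fragment is only an illustration, assuming the standard definition AIC = -2 log L + 2p and the usual parameter count for an unrestricted latent class model (class proportions plus one conditional probability per binary item per class). The function names are mine, and the log-likelihood values in the usage lines are placeholders, not results from the asthma data; in practice they would be read from the output of each run.

```python
def lca_free_parameters(n_classes, item_levels):
    """Free parameters in an unrestricted latent class model.

    n_classes - 1 class proportions, plus (levels - 1) conditional
    probabilities for every item within every class.
    """
    class_params = n_classes - 1
    item_params = n_classes * sum(levels - 1 for levels in item_levels)
    return class_params + item_params


def aic(log_likelihood, n_params):
    """Akaike information criterion: smaller is better."""
    return -2.0 * log_likelihood + 2.0 * n_params


def best_by_aic(log_likelihoods, item_levels):
    """Return (n_classes, AIC) with the smallest AIC.

    log_likelihoods maps number of classes -> maximised log-likelihood.
    """
    scores = {
        k: aic(ll, lca_free_parameters(k, item_levels))
        for k, ll in log_likelihoods.items()
    }
    best = min(scores, key=scores.get)
    return best, scores[best]


# Ten binary manifest variables, as in the model above.
item_levels = [2] * 10
param_counts = {k: lca_free_parameters(k, item_levels) for k in range(1, 6)}
print(param_counts)

# PLACEHOLDER log-likelihoods, invented purely to show the call;
# they are NOT output from LEM for these data.
placeholder_loglik = {1: -900.0, 2: -820.0, 3: -800.0, 4: -785.0, 5: -780.0}
print(best_by_aic(placeholder_loglik, item_levels))
```

Some programs report an AIC based on the likelihood-ratio statistic and its degrees of freedom instead. For a fixed data table that version differs from the one above only by a constant, so it ranks the models in the same order.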

A two-group, two-latent-variable model for twins

Two observed binary traits (asthma and hayfever) are modelled as being due to a single underlying latent variable (atopy). The correlation between the twins in atopy is expressed as a log odds ratio, one for the MZ twin group and one for the DZ twin group. The log-linear model parameters and Wald tests in the output (the "GXY table") test the goodness-of-fit of the "shared environment only" model (OR_MZ = OR_DZ), and find it lacking.

The overall model fit is not good: there is evidence for twin concordance in asthma and hayfever above that due to genetic factors common to asthma and hayfever. Trait-specific association and symmetry constraints (exchangeability of Twin 1 and Twin 2) are easy to add to the model.
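
As a reminder of what the log odds ratio for the latent twin-by-twin table measures, here is a small Python sketch. It assumes a 2x2 table of estimated joint probabilities for atopy in Twin 1 (rows) and Twin 2 (columns), and a Wald z statistic for the difference between two log odds ratios given their standard errors. The function names, the probability tables and the standard errors are all hypothetical illustrations; the real estimates and standard errors come from the LEM output.

```python
import math


def log_odds_ratio(table):
    """Log odds ratio of a 2x2 table [[p11, p12], [p21, p22]]."""
    (p11, p12), (p21, p22) = table
    return math.log((p11 * p22) / (p12 * p21))


def wald_z_for_difference(log_or_1, se_1, log_or_2, se_2):
    """Wald z for H0: the two log odds ratios are equal.

    Assumes the two estimates are independent (separate twin groups).
    """
    return (log_or_1 - log_or_2) / math.sqrt(se_1 ** 2 + se_2 ** 2)


# HYPOTHETICAL latent atopy tables (Twin 1 by Twin 2), not LEM estimates.
mz_table = [[0.40, 0.10], [0.10, 0.40]]
dz_table = [[0.30, 0.20], [0.20, 0.30]]

lor_mz = log_odds_ratio(mz_table)
lor_dz = log_odds_ratio(dz_table)

# HYPOTHETICAL standard errors; real ones are reported by the program.
z = wald_z_for_difference(lor_mz, 0.30, lor_dz, 0.40)
print(lor_mz, lor_dz, z)
```

A z value far from zero would argue against OR_MZ = OR_DZ, which is the comparison the Wald tests in the GXY table make.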

*
* MZ and DZ female twins Asthma and Hayfever 1980 - latent variable model
*
* X and Y are the latent variables for atopy in T1 and T2
*
* G is the zygosity group
* A and B are hayfever and asthma in T1
* C and D are hayfever and asthma in T2
*
* 2 latent variables
*
lat 2
*
* 5 manifest variables
*
man 5
*
* All variables have only 2 levels in this case
* List latent variables then manifest variables
*
dim  2 2 2 2 2 2 2  
lab  X Y G A B C D  
*
* Model XY is the 2x2 table of the latent variable atopy in T1 
* versus atopy in T2.  This table is allowed to be different in
* the MZ and DZ groups.
*
* The measurement model is constant across groups and twins (T1 v T2)
*
mod G
    XY|G
    A|X 
    B|X 
    C|Y eq1 A|X 
    D|Y eq1 B|X
*
* Data for MZ then DZ twins
*
dat [ 585  21  126  26  
       20  12    9   5 
      132   7  137  34  
       28   4   40  46  

      338  21   95  23 
       19   2    4   3
       94   9   65  14 
       28   5   18   9 ]
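
Before running a model like this it is worth checking that the data block has one cell for every combination of the manifest variables. The Python sketch below is an illustrative check only: it copies the twin data from the input above, compares the number of cells with the product of the manifest dimensions (G, A, B, C, D), and totals the pairs per zygosity group. It assumes that G is the slowest-varying index, in line with the "Data for MZ then DZ twins" comment. The helper name parse_dat is mine, not part of LEM.

```python
import math

# Twin data copied from the LEM input above.
DAT_BLOCK = """
dat [ 585  21  126  26
       20  12    9   5
      132   7  137  34
       28   4   40  46

      338  21   95  23
       19   2    4   3
       94   9   65  14
       28   5   18   9 ]
"""


def parse_dat(block):
    """Return the integer cell counts between '[' and ']'."""
    inside = block[block.index("[") + 1 : block.index("]")]
    return [int(token) for token in inside.split()]


# Levels of the manifest variables G A B C D (latent X and Y excluded).
manifest_levels = [2, 2, 2, 2, 2]

counts = parse_dat(DAT_BLOCK)
expected_cells = math.prod(manifest_levels)

if len(counts) != expected_cells:
    raise ValueError(
        f"expected {expected_cells} cells, found {len(counts)}"
    )

# Assumption: the first half of the cells is the MZ group, the second DZ.
half = expected_cells // 2
mz_total = sum(counts[:half])
dz_total = sum(counts[half:])
print(len(counts), mz_total, dz_total)
```

The same check applies to the ten-variable table in the first example, where the product of the manifest dimensions is 1024.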