Latent Class Analysis

Presented by Nicholas Branic

UCI Stats n’ Snacks

December 9, 2014

Presentation Overview

• What is latent class analysis?

• Writing MPlus code and running LCA

• Importing MPlus output into Stata

• …And fixing an irritating importation problem

What is Latent Class Analysis?

• Data-driven technique for identifying group classifications

• “Latent” classes

• Shared characteristics within a unique dataset

• Groups not specified a priori

• But, groups may mirror existing theory/literature

• Rather, specify variables/attributes for classifications


• Generate predicted probabilities of class membership


• Identify class membership for cases in dataset

• For example, changing home mortgage loan activity across SoCal tracts

Class sizes in the example solution (19 classes), largest to smallest:

Tracts   Percent      Tracts   Percent
447      11.3%        165      4.2%
435      11.0%        162      4.1%
383      9.7%         148      3.7%
372      9.4%         109      2.8%
275      7.0%         100      2.5%
273      6.9%         86       2.2%
229      5.8%         77       2.0%
205      5.2%         75       1.9%
183      4.6%         51       1.3%
173      4.4%

Class labels in the order listed on the slide: 11, 13, 16, 4, 17, 15, 12, 14, 8, 18, 2, 19, 7, 5, 3, 6, 1, 9, 10
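To make the step from predicted probabilities to class membership concrete, here is a minimal Python sketch. It assumes each case is assigned to the class with the highest predicted probability; the probabilities and function names are hypothetical and are not taken from the tract data.

```python
from collections import Counter


def assign_classes(probabilities):
    """Return the 1-based class with the highest probability for each case."""
    assigned = []
    for row in probabilities:
        best_index = max(range(len(row)), key=lambda k: row[k])
        assigned.append(best_index + 1)
    return assigned


def class_sizes(assigned):
    """Count cases per class and express each count as a percent of all cases."""
    counts = Counter(assigned)
    total = len(assigned)
    return {c: (n, round(100 * n / total, 1)) for c, n in sorted(counts.items())}


# Hypothetical predicted probabilities for five cases and three classes.
probs = [
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.20, 0.30, 0.50],
    [0.70, 0.20, 0.10],
    [0.05, 0.90, 0.05],
]
classes = assign_classes(probs)
sizes = class_sizes(classes)
print(classes)
print(sizes)
```

A table of class counts and percents like the one above is this kind of tally, computed over every case in the dataset.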


• Other statistical techniques similar to LCA:

• Exploratory factor analysis

• Principal components analysis

• Confirmatory factor analysis

• K-means cluster analysis

• Hot spot analysis

• …And I’m sure there are more examples

So How Can I Use It?

• Steps for using Stata and MPlus:

• In Stata:

• Prepare dataset for LCA

• Use “outfile” command to produce data as .txt file

• In MPlus:

• Write input file to run latent class analysis

• Execute the model

• Produce .txt output file

• In Stata:

• Import .txt output into Stata

• Clean LCA results

• Merge LCA results into original dataset

Preparing Your Data

• Open your full dataset

• Remove any cases that feature entirely missing data

• In Stata, any case that has all “.” values

• A shortcut: use the “mdesc” command (downloadable .ado file)

• Sort your data – not necessary, but not a bad idea

• Save out a copy of prepared data

• You’ll merge the LCA results to this dataset later

Exporting Your Data

• Use the “outfile” command to create a .txt form of your data
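The preparation and export steps above can be illustrated outside Stata with a small Python sketch. It is not a replacement for “outfile”; it only shows the intended result: cases with every value missing are dropped, and the remaining cases are written one per line as plain text with “.” for missing. The sample rows are hypothetical, and the sketch assumes every column is an analysis variable.

```python
import io

MISSING = "."  # Stata's missing-value symbol, as noted above


def drop_all_missing(rows):
    """Keep only rows that have at least one non-missing value."""
    return [row for row in rows if any(value != MISSING for value in row)]


def write_plain_text(rows, handle):
    """Write one case per line, values separated by single spaces, no header."""
    for row in rows:
        handle.write(" ".join(str(value) for value in row) + "\n")


# Hypothetical data: the second case is missing on every variable.
rows = [
    ["1", "0.25", "3"],
    [".", ".", "."],
    ["3", ".", "7"],
]
kept = drop_all_missing(rows)

buffer = io.StringIO()  # stands in for the .txt file
write_plain_text(kept, buffer)
text = buffer.getvalue()
print(text)
```

The row order of the kept cases is unchanged, which matters later because the LCA results are merged back by row position.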

Writing LCA Code in MPlus

• To estimate models in Mplus, you need to write an input (.inp) file

• Need to include specific fields in code (e.g. TITLE, DATA, VARIABLE)

• Use “!” to write comments in code (like “*” in Stata)

• Each line of code cannot be longer than 80 characters

• MPlus window shows character count for selected line at bottom (e.g. Col 69)
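For long input files, checking line lengths by eye is tedious. The Python sketch below flags any line over the 80-character limit stated above; the function name and the sample text are hypothetical.

```python
MAX_COLUMNS = 80  # line-length limit stated above


def find_long_lines(text, limit=MAX_COLUMNS):
    """Return (line_number, length) for every line longer than the limit."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if len(line) > limit:
            problems.append((number, len(line)))
    return problems


# Hypothetical input text: the second line is 81 characters long.
sample = "TITLE: example\n" + "x" * 81 + "\n! a short comment\n"
long_lines = find_long_lines(sample)
print(long_lines)
```

To check a real file, read its contents into a string first and pass that string to `find_long_lines`.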


• First, specify the TITLE and DATA fields


• Under VARIABLE field, include all variables in dataset


• Indicate MISSING ARE field (for Stata, this will be “.”)

• USEVAR ARE lists the subset of variables to include in the LCA


• The CLASSES field indicates the number of classes to estimate


• ANALYSIS specifies the model you will run

• TYPE = missing mixture

• STARTS indicates the number of model iterations, that is, estimation runs that each begin from a different random set of starting values


• Next, specify the MODEL – four important parts


• For MODEL, write “%OVERALL%”


• The “%c#2%” part indicates the class solution

• At a minimum, you will always have at least the c#2 block of code

• For a three-class solution, you would repeat a “%c#3%” section, etc.


• Next, include all of your LCA-selected variables in two sections

• The first section is enclosed in brackets [ ]


• The second section has the same variables but without brackets


• After MODEL, specify the OUTPUT field

• For our purposes, use “sampstat” and “standardized”
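As a rough sketch of how the fields described above fit together, the Python snippet below assembles the text of an input file and wraps long variable lists so no line exceeds 80 characters. Several details are assumptions rather than the presenter's file: the variable names (id, y1 to y30), the output file name, the STARTS value “100 20”, and the keyword spellings NAMES ARE, FILE IS, and SAVE =. Check the result against the Mplus User's Guide before running it.

```python
import textwrap

MAX_WIDTH = 80  # line-length limit stated earlier


def wrap(statement, indent="  "):
    """Break a long statement over several lines, none wider than MAX_WIDTH."""
    return textwrap.fill(
        statement,
        width=MAX_WIDTH,
        initial_indent=indent,
        subsequent_indent=indent + "  ",
        break_long_words=False,
        break_on_hyphens=False,
    )


def build_input(title, data_file, names, usevars, n_classes, starts, out_file):
    """Assemble input-file text in the field order described above."""
    var_list = " ".join(usevars)
    lines = [
        "TITLE: " + title,
        "DATA: FILE IS " + data_file + ";",
        "VARIABLE:",
        wrap("NAMES ARE " + " ".join(names) + ";"),
        "  MISSING ARE .;",
        wrap("USEVAR ARE " + var_list + ";"),
        f"  CLASSES = c({n_classes});",
        "ANALYSIS:",
        "  TYPE = missing mixture;",
        f"  STARTS = {starts};",
        "MODEL:",
        "  %OVERALL%",
    ]
    # One block per class from c#2 upward, following the bullets above.
    for k in range(2, n_classes + 1):
        lines.append(f"  %c#{k}%")
        lines.append(wrap("[" + var_list + "];"))  # first section, in brackets
        lines.append(wrap(var_list + ";"))         # same variables, no brackets
    lines += [
        "OUTPUT: sampstat standardized;",
        "SAVEDATA:",
        "  FILE IS " + out_file + ";",
        "  SAVE = CPROBABILITIES;",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical variable names and settings, for illustration only.
text = build_input(
    title="Three-class LCA sketch",
    data_file="mplus_df2_hmda_test.txt",
    names=["id"] + [f"y{i}" for i in range(1, 31)],
    usevars=[f"y{i}" for i in range(1, 31)],
    n_classes=3,
    starts="100 20",
    out_file="df2_c3.txt",
)
print(text)
```

Generating the file this way means moving from a two-class to a three-class model only requires changing `n_classes` and `out_file`.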

Writing LCA Code for MPlus

• Finally, write the SAVEDATA field to kick out the LCA results

• Indicates the name of the file to create (this will always be a .txt file)

• Indicates what information to save out (we want “CPROBABILITIES”)


• Some additional notes:

• You need to include “;” to denote the end of different sections of code

• Read through the example input file to see all of the necessary locations


• Remember that you need to keep each line of code within 80 characters

• Otherwise, MPlus will cut off any code at 81 characters or beyond

• Remember that you can use “!” to include comments in your code

• All files referenced in your input file will be .txt files

• The data you’re calling in (e.g. “mplus_df2_hmda_test.txt”)

• The output that you save out (e.g. “df2_c2.txt”)

Running the Latent Class Analysis

• Click on the “Run” icon to begin estimation


• MPlus will estimate your model according to the number of iterations specified in your input file (e.g. 100 iterations), each with different starting values for estimation


• How long will estimation take?

• Depends on a few factors:

• The size of your dataset

• The number of included variables

• The number of specified classes

• The number of specified model iterations

• Your computer’s processing speed and memory

• I’ve had models take 30 seconds and models that run for 60+ hours

I Ran My Model…Now What?

• After completing your LCA estimation, scroll through and review the output file (.out) that MPlus generates

• Some things to look for:

• “MODEL ESTIMATION TERMINATED NORMALLY”

Post-Estimation Review


• Bayesian Information Criterion (BIC)

BIC is used to compare different models (e.g. a two-class versus a three-class solution) and see which provides a better fit for your data.

A lower BIC value indicates a better fit, so keep testing models until the BIC stops declining and begins to increase again



• The number/percent of cases that fall into each estimated class



• The entropy score for your model

“Entropy” values range between zero and one, where a value of one means that the classes are perfectly distinct from one another.

You want this value to be as close to one as possible. I don’t know if there is an accepted threshold or cutoff for entropy levels that are “too low.” I also don’t know whether entropy is reported in published research as an indicator of model fit or quality.



• The end of the output file shows how long your model took to estimate
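For intuition about the entropy score mentioned above, the Python sketch below computes the commonly used relative entropy statistic from predicted class probabilities. I am assuming this matches the value reported in the output file; the probabilities are hypothetical.

```python
import math


def relative_entropy(probabilities):
    """Entropy-based classification quality on a 0-1 scale (1 = clear-cut)."""
    n_cases = len(probabilities)
    n_classes = len(probabilities[0])
    uncertainty = 0.0
    for row in probabilities:
        for p in row:
            if p > 0:  # treat 0 * log(0) as 0
                uncertainty -= p * math.log(p)
    return 1.0 - uncertainty / (n_cases * math.log(n_classes))


# Hypothetical predicted probabilities for three cases and two classes.
probs = [
    [0.95, 0.05],
    [0.10, 0.90],
    [0.60, 0.40],
]
score = relative_entropy(probs)
print(round(score, 3))
```

Cases with probabilities near 0.5 (like the third one) pull the score down, while cases assigned with near certainty push it toward one.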


• After estimating your model, try estimating a new model with one additional class

• For this example, I ran a two-class solution, so next I would specify three classes

• This way, I can find the optimal class solution for my data (by comparing BIC values between models)

Running the Next LCA Model

• Open the two-class input file, use “Save As” to save a new three-class input file, and make just a few edits:

• Change the number of classes from (2) to (3)


• Copy and paste the variable list in the MODEL field and then change the header to “%c#3%” for a three-class model

Note: you still need to keep the “%c#2%” section from before, so now you will have a c#2 section followed by a c#3 section in your input file.


• Change the name of the .txt data file that MPlus will kick out


• After running a three-class solution, review the output file created by MPlus -- if the BIC decreased, then create a new input file for a four-class solution and estimate this new model

• Repeat these steps until your BIC value stops decreasing and instead begins to increase – the model with the lowest BIC is your optimal solution!

           C2 Model       C3 Model       C4 Model       C5 Model
Class 1    891 (82.2%)    142 (13.1%)    129 (11.9%)    96 (8.9%)
Class 2    193 (17.8%)    669 (61.7%)    177 (16.3%)    216 (20.0%)
Class 3                   273 (25.2%)    397 (36.6%)    156 (14.4%)
Class 4                                  381 (35.1%)    288 (26.6%)
Class 5                                                 328 (30.3%)
BIC        100,225.73     94,359.24      91,605.63      89,817.38
Entropy    0.969          0.945          0.948          0.949
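The BIC comparison can be written as a short Python sketch. The BIC values are the ones reported for the C2 through C5 models; the function names are hypothetical.

```python
def pick_lowest_bic(bic_by_classes):
    """Return the number of classes whose model has the lowest BIC."""
    return min(bic_by_classes, key=bic_by_classes.get)


def still_declining(bic_by_classes):
    """True if the largest model tested so far also has the lowest BIC."""
    largest = max(bic_by_classes)
    return pick_lowest_bic(bic_by_classes) == largest


# BIC values for the two- through five-class models.
bic = {2: 100225.73, 3: 94359.24, 4: 91605.63, 5: 89817.38}
best = pick_lowest_bic(bic)
keep_going = still_declining(bic)
print(best, keep_going)
```

Here the BIC is still falling at five classes, so under the stopping rule described earlier the next step would be to estimate a six-class model.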

Importing MPlus Output into Stata

• After identifying your optimal model, read your .txt LCA output back into Stata and merge into your original dataset

• My preference: use the “stcmd” commands, which call StatTransfer from within Stata

• Easily convert .txt to .dta format

• For example:

• inputst df2_c5.txt

• outputst df2_c5.dta /y

• Alternatively, you could open the .txt file in Excel, save as a .csv file, and use the “import delimited” command in Stata

…What Just Happened?

• MPlus uses asterisks (“*”) to denote missing data in its output file

• Conversely, Stata uses periods (“.”)

• These asterisks cause a number of issues in your dataset:

• Turn numeric variables into strings

• Cause data to “shift” columns to the left

• Pull your predicted probabilities and class ID out of proper columns
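To see why the asterisks matter, here is a minimal Python sketch that reads the saved values while keeping every column in place. It assumes the values on each line are separated by whitespace. The sample lines are hypothetical, and this is not the logic of the “mpluslcafix” command.

```python
def parse_line(line, missing_marker="*"):
    """Split one whitespace-delimited line, mapping the marker to None."""
    values = []
    for token in line.split():
        values.append(None if token == missing_marker else float(token))
    return values


# Hypothetical saved lines: two indicators, two class probabilities, class ID.
lines = [
    "0.250 1.500 0.900 0.100 1.000",
    "*     2.000 0.200 0.800 2.000",
]
rows = [parse_line(line) for line in lines]
print(rows)
```

Because the asterisk is replaced rather than dropped, the second case keeps five values, so its probabilities and class ID stay in the same columns as the first case.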

The Solution? Shift the Data Back

• I wrote an .ado file for Stata that will automatically reverse the data shifting problem with MPlus LCA output

• I gave this .ado file an imaginative title:

• “mpluslcafix”

Fixing Your MPlus Output

• With your MPlus output loaded into Stata, enter the following commands:

adopath + "<foldercontainingadofile>"

mpluslcafix




• The “mpluslcafix” command will save out a new .dta file with your corrected LCA data

• Merge this new file back into your original dataset:

use <originaldataset>, clear

merge 1:1 _n using <newdataset>

• For example:

use df2_hmda_lca_test_tomerge, clear

merge 1:1 _n using df2_c5
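Because “merge 1:1 _n” matches cases by row position, the two files need the same cases in the same order. The Python sketch below illustrates that idea with a row-count check added as a precaution; the field names and values are hypothetical.

```python
def merge_by_row(original_rows, lca_rows):
    """Pair rows by position, the way a merge on observation number does."""
    if len(original_rows) != len(lca_rows):
        raise ValueError("row counts differ; the files would not line up")
    merged = []
    for original, lca in zip(original_rows, lca_rows):
        combined = dict(original)
        combined.update(lca)
        merged.append(combined)
    return merged


# Hypothetical rows: the original data and the LCA results, in the same order.
original = [{"tract": "A"}, {"tract": "B"}]
lca = [{"class_id": 2}, {"class_id": 1}]
merged = merge_by_row(original, lca)
print(merged)
```

This is why the prepared dataset should not be re-sorted or have cases dropped between exporting it and merging the results back in.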


• Now, you can use your latent class analysis results in statistical models!

• Hooray!

Thanks for Listening!

• Please feel free to email me with questions, comments, issues:

• nbranic@uci.edu

• Also, please help me to “stress test” the .ado file!

• Try it on different types of data

• Try to break it

• Let me know if you find glitches so that I can fix them
