Chapter 2

Writing a simple Mplus program: learning the language
2.1 INTRODUCTION
In this chapter we introduce you to the basic and most commonly used commands in
Mplus. We make no attempt to be comprehensive – for most analyses, only a small
subset of the Mplus commands is needed. Once you have got used to the limited number
of commands introduced here it will then be relatively simple to build up your repertoire
as you proceed through the book or by consulting the Mplus User’s Guide. We illustrate
the use of the Mplus commands by slowly building an input file to fit a simple factor
model to the string measurements given in Display 1.1. After reading this chapter you
should have an understanding of the basic vocabulary and syntax of Mplus and,
hopefully, be able to put together your own input files. You should also be able to read
and understand simple input files prepared by other Mplus users, together with the output
produced by executing the commands in the Mplus programs.
2.2 RUNNING Mplus
Mplus runs as a batch program. This means that in order to fit a model we have to use the
Mplus Editor to set up an input file containing the appropriate variable definitions and
model-fitting commands. Execution of the input file using RUN then leads to the creation
of an output file that can be examined at leisure using the Editor. Prior to execution,
the input file will be saved using a user-specified file name such as string.inp, and the
corresponding output file produced by Mplus will automatically be called string.out.
Typically, the input file will refer to a third file containing the data (string.dat, for
example).
2.3 CREATING AN INPUT FILE
Having entered the Mplus Editor, the first step is either to open an existing input file or to
create a new one from scratch (it is nearly always more convenient to proceed by editing
an existing input file, if available). Our first example is shown in Display 2.1. Note that
although we have used words or phrases typed with either upper case letters (TITLE, for
example) or lower case letters (type=general, for example), Mplus makes no distinction
between upper and lower case letters. We have introduced these differences here in order
to facilitate the explanations of what we are describing and asking Mplus to do. There are
ten Mplus commands, most of which are optional:
TITLE        (Optional)
DATA         (Required)
VARIABLE     (Required)
DEFINE       (Optional)
ANALYSIS     (Optional)
MODEL        (Optional)
OUTPUT       (Optional)
SAVEDATA     (Optional)
PLOT         (Optional)
MONTECARLO   (Optional)
The DATA and VARIABLE commands are required in all analyses. Our simple example
in Display 2.1 uses TITLE, DATA, VARIABLE, ANALYSIS, MODEL and OUTPUT.
Note that Mplus commands can come in any order. All commands should begin on a
new line (the fact that there are blank lines in our input file is ignored by Mplus – these have
been introduced for clarity) and end with a colon (:). The words and phrases printed in
lower case letters in our input file are the command options. Semicolons (;) separate
command options. There can be more than one option per line, but we have put them on
separate lines to improve clarity. Commands and options can be shortened to four or
more letters for convenience, but we have not done so in our example. Any text on a line
to the right of an exclamation mark (!) is ignored by Mplus and can therefore be used to
provide explanatory comments. Finally, note that the maximum line length permitted
within Mplus is 80 characters.
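To make these rules concrete, here is an illustrative fragment (not a complete input file): the command name VARIABLE and the option usevariables have been shortened to their first four letters, each option ends with a semicolon, and the text to the right of each exclamation mark is ignored:

   VARI:                                       ! VARIABLE shortened to four letters
        names are ruler graham brian andrew;   ! each option ends with a semicolon
        usev are ruler graham;                 ! usevariables, also shortened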
2.4 PROGRAM INPUT
It is easy to be overwhelmed by the many possibilities in a programme such as Mplus.
There are very many options (a great strength of Mplus) and the more advanced ones can
be a source of confusion, particularly for the beginner. We start, therefore, by describing
the ten key commands, together with the most commonly used options. The latter are not
always necessary in any given run (Mplus uses a sensible set of defaults) but it is
probably a good idea for the beginner (and anyone else for that matter) to get into the
habit of making the essential options explicit. It makes it easier for a third party to
understand what has been done, if nothing else. We will now describe the ten commands
in turn, again indicating which are optional by the addition of ‘Optional’ in brackets
where appropriate.
TITLE: (Optional)
The TITLE command is a way in which you can describe the analysis. The content can
be anything you like!
DATA:
The DATA command provides information about the data set to be used in the analysis.
In our example (Display 2.1) we have provided the option ‘file is string.dat;’. The
string.dat file contains the following:
6.3
4.1
5.1
5.0
5.0
3.2
3.6
4.5
4.8
3.1
3.8
4.1
6.0
3.5
4.5
4.3
5.7
3.3
1.3
5.8
2.8
6.7
1.5
2.1
4.6
7.6
2.5
4.0
2.5
1.7
4.8
2.4
5.2
1.2
1.8
3.4
6.0
2.2
5.2
2.8
1.4
4.2
2.0
5.3
1.1
1.6
4.1
6.3
1.6
5.0
2.6
1.6
5.5
2.1
6.0
1.2
1.8
3.9
6.5
2.0
This is a raw data file (i.e. free format, containing each of the individual observations or
measurements, obtained from a single sample or group of individuals), which could have
been explicitly specified as ‘format=free; type=individual; ngroups=1;’. For once, we
have ignored our own recommendation to make all options explicit! It is possible to read
data from two or more groups of observations and it is also possible to read summary
statistics (covariances or correlations, for example) rather than the raw data themselves
(an attractive option for secondary analyses of published data that are only available in
this form). We refer you to the Mplus User’s Guide for these options (although some of
them will be used later in the present book).
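For example (the file name string_cov.dat is invented for the purpose of illustration), a DATA command for reading a covariance matrix rather than raw data might read as follows; note that the number of observations has to be supplied when only summary statistics are provided:

   DATA:
        file is string_cov.dat;    ! hypothetical file holding a covariance matrix
        type is covariance;        ! summary statistics rather than raw data
        nobservations are 15;      ! sample size must be given explicitly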
VARIABLE:
The VARIABLE command specifies the characteristics of the variables in the data set to
be analysed. First we need to provide names (in order) of all the variables in the data file.
Those in string.dat are defined by the option ‘names are ruler graham brian andrew;’ (see
Display 2.1). Next we need to list the variables that are to be used in any given analysis.
The input program listed in Display 2.1 is for a simple linear regression of Graham’s
guesses on the measurements provided by the ruler (we are not fitting a model involving
the guesses provided by Brian or Andrew). Here we specify the option ‘usevariables ruler
graham;’.
Other commonly-used options within the VARIABLE command are those used to define
missing values and to distinguish binary/ordinal variables from quantitative ones. The
option ‘missing are ruler(99);’, for example, would tell Mplus that the code 99 is being
used to identify missing values for the measurements provided by the ruler (but there are
none in our data set). In a typical social science data set we might have a variable called
‘sex’ with values of 1 (male) or 2 (female). Here we would specify the option ‘categorical
are sex;’ (we do not have to specify the actual codes for the categories).
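Putting these options together, a single VARIABLE command for a hypothetical data set of this kind (our string variables plus a sex variable, with 99 as the missing value code for the ruler) might read:

   VARIABLE:
        names are sex ruler graham brian andrew;
        usevariables are ruler graham sex;
        missing are ruler(99);     ! 99 flags missing ruler measurements
        categorical are sex;       ! 1 = male, 2 = female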
DEFINE: (Optional)
This command is used to transform existing variables or to create new ones.
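For example (both new variable names are our own, and this is only a sketch of the possibilities), we could take logarithms of the ruler measurements or average the three guesses:

   DEFINE:
        logruler = log(ruler);                   ! transform an existing variable
        mguess = (graham + brian + andrew)/3;    ! create a new variable

Note that a new variable created in this way has to be added to the end of the usevariables list in the VARIABLE command before it can be used in the analysis.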
ANALYSIS: (Optional)
This is used to provide the technical details of the analysis. Typically, we use
‘type=general; estimator=ml;’ as a default. But we do not always use this option. If we
were to fit a latent class (finite mixture) model, for example, then we would use
‘type=mixture;’ instead. Quite often we might wish to obtain bootstrapped standard errors
(and confidence intervals – see the OUTPUT command, below). If so, we add, for
example, ‘bootstrap=1000;’ (asking for 1000 bootstrapped samples – this number being
specified by the user).
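Putting these pieces together, an ANALYSIS command requesting maximum likelihood estimation with 1000 bootstrap samples would read:

   ANALYSIS:
        type=general;
        estimator=ml;
        bootstrap=1000;            ! 1000 bootstrap samples for standard errors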
MODEL: (Optional)
This is used to describe details of the model to be estimated. The simple regression in
Display 2.1, for example, is specified by ‘graham on ruler;’. A single factor model for the
three guesses of string length would be described by ‘factor by graham brian andrew;’
(the new name for the latent variable ‘factor’ being provided by the investigator). It is
important to carefully distinguish the use of ‘on’ (in a regression model) from ‘by’ (in a
measurement or factor analysis model). Other options will be described as we come
across them.
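Setting the two kinds of statement side by side may help to fix the distinction (the first is the MODEL command from Display 2.1, the second the single factor model just described; they would, of course, appear in separate input files):

   MODEL:   graham on ruler;                  ! regression: graham regressed on ruler
   MODEL:   factor by graham brian andrew;    ! measurement: factor indicated by the guesses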
OUTPUT: (Optional)
Mplus produces a lot of output by default! But if you want more then there is plenty
available. Particularly useful options here are ‘sampstat;’ (producing simple summary
statistics for the variables being analysed), ‘standardized;’ (producing a variety of
standardised parameter estimates), and ‘cinterval;’ or possibly ‘cinterval(bootstrap);’
(producing confidence intervals for the parameters – the latter using bootstrap sampling,
used in association with ‘bootstrap=1000;’ in the ANALYSIS command). The use of
‘residual;’ produces residuals to aid examination of the adequacy of the model.
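An OUTPUT command asking for all of these extras at once (and assuming that ‘bootstrap=1000;’ appears in the ANALYSIS command) would read:

   OUTPUT:
        sampstat;
        standardized;
        residual;
        cinterval(bootstrap);      ! bootstrap confidence intervals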
SAVEDATA: (Optional)
This command is used to save data of various sorts and a variety of analysis results.
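For example (the file name is our own invention), the following would save the analysis variables, together with the estimated factor scores, to an external file:

   SAVEDATA:
        file is string_scores.dat; ! hypothetical file to receive the saved records
        save = fscores;            ! estimated factor scores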
PLOT: (Optional)
This provides graphical displays of both observed data and analysis results.
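For example, the following makes histograms and scatterplots of the observed variables available for viewing after the run:

   PLOT:
        type = plot1;              ! histograms and scatterplots of observed variables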
MONTECARLO: (Optional)
The final command allows us to specify Monte Carlo simulation studies.
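A minimal sketch (the variable names, sample size and number of replications are arbitrary choices for illustration) looks like this; the population and analysis models are then given in MODEL POPULATION and MODEL commands:

   MONTECARLO:
        names are y1-y3;           ! three simulated variables
        nobservations = 100;       ! sample size for each simulated data set
        nreps = 500;               ! number of simulated data sets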
2.5 PUTTING IT ALL TOGETHER
You have already seen Display 2.1. An example of an input file containing the commands
for a simple factor analysis of the string data (excluding the ruler) is shown in Display
2.2. Apart from its title, the second input file is the same as the first until we get to
‘usevariables graham brian andrew;’. The next difference comes in the MODEL
command where we specify ‘tlength by graham brian andrew;’ (tlength being a latent
variable or factor). We have added four lines of comment (partly to illustrate the use of
comments to annotate an input file and partly to explain where the name ‘tlength’ has
come from – we could just as easily have used variable names such as ‘f’ or ‘f1’).
2.6 THE OUTPUT FILE
Running the program input file in Display 2.2 produces several pages of output. We have
split this output into four consecutive sections (Displays 2.3 to 2.6). We start with
Display 2.3. This is just a reprint of the initial input file but with the added line at the
bottom ‘INPUT READING TERMINATED NORMALLY’. This is a good sign!
Display 2.4 provides a summary of the analysis that has been requested. There are 15
observations (i.e. 15 pieces of string) from a single sample (number of groups is equal to
1). There are three dependent variables (graham, brian and andrew) and a single
continuous latent variable (tlength). The estimation procedure is maximum likelihood
(ml) – we’ll ignore the other technical details. Finally, at the bottom of this display are
the specifically requested summary statistics – sample means, covariances and
correlations.
The key statement at the top of Display 2.5 is ‘THE MODEL ESTIMATION TERMINATED
NORMALLY’. If this is not seen in your own output files (i.e. it is replaced by some sort of
error message or warning), or if it is accompanied by a warning or error message, then it is
a sign that something may have gone wrong.
Display 2.5 provides several indicators of model fit. For the time being we will ignore all
of these except the first two chi-square tests. The first one (labelled ‘Chi-Square Test
of Model Fit’) provides us with a test of whether the model fits the observed data (the
latter being the observed means and covariance matrix). The chi-square given is zero with
zero degrees of freedom. You will remember that this particular model (a single factor
model for three measurements) is just identified and therefore fits the data perfectly.
Ignore the P-Value given with this test – it is not defined for zero degrees of freedom (it
is not zero!). More interesting chi-square tests will arise when we begin to introduce
model constraints (see next chapter). The next chi-square test – the one labelled ‘Chi-Square Test of Model Fit for the Baseline Model’ – provides a test for the
correlations between the three guesses of string length. The baseline model specifies that
these three correlations (hence three degrees of freedom) are all equal to zero. Luckily the
baseline model does not fit (P-Value<0.001)! A low chi-square value for the baseline
model would have implied that there were no associations to model.
Now we get to the interesting bit of the output – the parameter estimates (Display 2.6).
This part of the output consists of a table with four columns. For each parameter there
is the estimated value (except when it is subject to a constraint), its standard error, the
ratio of the estimate to its standard error, and, finally, a two-tailed P-value based on this
ratio. Note that this P-Value is only of any use if the interesting null hypothesis is the one
that specifies that the true value of the parameter is zero.
We start with the estimated factor loadings (regression coefficients reflecting the linear
dependence of the guesses on the unknown factor). That for graham is constrained to be 1
by default (this sets the scale of measurement). The loadings for brian and andrew are
then interpreted as systematic relative biases of these two sets of guesses with respect to
those of graham. Both Brian and Andrew produce higher guesses (on average) than does
Graham. Note that the default value for the mean of the latent variable (tlength) is zero
(again helping to determine the scale of measurement). The estimated intercept terms (i.e.
the mean values of the guesses when factor=0) for the three guesses are all very similar.
We then get an estimated value for the variance of factor, together with the estimated
error variances for the three sets of guesses. Remember that a relatively large error
variance implies low precision. Taken at face value, the results appear to indicate that
andrew has the greatest precision, followed by graham, then brian. But we should be
careful. We have not yet tried a formal test of the equality of the three error variances.
And, in fact, that test would only make sense if we’d already established a common value
for the three factor loadings (i.e. a common scale of measurement). We’ll return to this
problem in Chapter 3.
Now let’s think about the reported P-values. When we are looking at the estimated factor
loadings the null hypothesis of interest is equality across the three sets of guesses. We are
not interested in a value of zero (this would be equivalent to asking whether a particular
set of guesses were correlated with the measurements provided by the ruler. We know
they are!). Similarly, we are not interested in testing whether the intercepts are zero (not, at
least when the model has been specified so that the mean of factor is zero). Finally, we
know that string lengths are not guessed without random measurement error, so the P-values for the three error variances are of no real interest to us here. The take-home
message is that you should not automatically be searching for and interpreting P-Values
associated with the parameter estimates. They will often be of great interest, but not always; they have no useful function in the present output. We will return to these string data in the next chapter to illustrate how you might set up a series of sensible hypotheses based on a slightly different specification of the basic measurement
model (we’ll even acknowledge that we’ve used a ruler).
2.7 SUMMARY
We have described the function of the basic building blocks for the construction of an
Mplus input file. We have then described how to put them together and illustrated the
ideas through fitting a simple measurement (factor analysis) model to the guesses of
string length described in Chapter 1. We have also described how to read and interpret
the essential components of the resulting output file. You should now be in a position to
start running simple Mplus jobs, at least for the simpler regression and factor analysis
models. The next chapter will describe this model-fitting activity in a bit more detail.
Display 2.1 A simple Mplus input file
TITLE:
Simple bivariate regression of
Graham's guesses on measurements provided
by the ruler
- using data from Display 1.1
DATA:
file is string.dat;
VARIABLE: names are ruler graham brian andrew;
usevariables ruler graham;
ANALYSIS: type=general;
estimator=ml;
MODEL:
graham on ruler;
OUTPUT:
sampstat;
DISPLAY 2.2
A simple factor analysis
TITLE:
Single common factor model
Factor indicated by guesses of Graham,
Brian and Andrew
- using data from Display 1.1
DATA:
file is string.dat;
VARIABLE: names are ruler graham brian andrew;
usevariables graham brian andrew;
ANALYSIS: type=general;
estimator=ml;
MODEL:      tlength by graham brian andrew;
!           tlength (i.e. true length) is a new name
!           provided by the analyst, labelling the latent
!           variable explaining the covariance between
!           the measured (manifest) variables
!
OUTPUT:     sampstat;
DISPLAY 2.3
The start of an output file
INPUT INSTRUCTIONS
TITLE:
Single common factor model
Factor indicated by guesses of Graham,
Brian and Andrew
- using data from Display 1.1
DATA:
file is string.dat;
VARIABLE: names are ruler graham brian andrew;
usevariables graham brian andrew;
ANALYSIS: type=general;
estimator=ml;
MODEL:      tlength by graham brian andrew;
!           tlength (i.e. true length) is a new name
!           provided by the analyst, labelling the latent
!           variable explaining the covariance between
!           the measured (manifest) variables
!
OUTPUT:     sampstat;
INPUT READING TERMINATED NORMALLY
DISPLAY 2.4
The next bit of output – the model and the data
Single common factor model
Factor indicated by guesses of Graham,
Brian and Andrew
- using data from Display 1.1
SUMMARY OF ANALYSIS

Number of groups                                            1
Number of observations                                     15

Number of dependent variables                               3
Number of independent variables                             0
Number of continuous latent variables                       1

Observed dependent variables
  Continuous
   GRAHAM      BRIAN       ANDREW

Continuous latent variables
   TLENGTH

Estimator                                                  ML
Information matrix                                   OBSERVED
Maximum number of iterations                             1000
Convergence criterion                               0.500D-04
Maximum number of steepest descent iterations              20

Input data file(s)
  string.dat
Input data format  FREE

SAMPLE STATISTICS

     SAMPLE STATISTICS

           Means
              GRAHAM        BRIAN         ANDREW
              ________      ________      ________
      1        3.433         3.427         3.767

           Covariances
              GRAHAM        BRIAN         ANDREW
              ________      ________      ________
 GRAHAM        1.980
 BRIAN         2.116         2.478
 ANDREW        2.398         2.649         3.020

           Correlations
              GRAHAM        BRIAN         ANDREW
              ________      ________      ________
 GRAHAM        1.000
 BRIAN         0.955         1.000
 ANDREW        0.981         0.968         1.000
DISPLAY 2.5
The output file – the fit of the model
THE MODEL ESTIMATION TERMINATED NORMALLY

TESTS OF MODEL FIT

Chi-Square Test of Model Fit

          Value                              0.000
          Degrees of Freedom                     0
          P-Value                           0.0000

Chi-Square Test of Model Fit for the Baseline Model

          Value                             90.842
          Degrees of Freedom                     3
          P-Value                           0.0000

CFI/TLI

          CFI                                1.000
          TLI                                1.000

Loglikelihood

          H0 Value                         -38.647
          H1 Value                         -38.647

Information Criteria

          Number of Free Parameters              9
          Akaike (AIC)                      95.294
          Bayesian (BIC)                   101.667
          Sample-Size Adjusted BIC          74.191
            (n* = (n + 2) / 24)

RMSEA (Root Mean Square Error Of Approximation)

          Estimate                           0.000
          90 Percent C.I.                    0.000  0.000
          Probability RMSEA <= .05           0.000

SRMR (Standardized Root Mean Square Residual)

          Value                              0.000
DISPLAY 2.6
The output file – the parameter estimates
MODEL RESULTS

                                                    Two-Tailed
                    Estimate       S.E.  Est./S.E.    P-Value

 TLENGTH  BY
    GRAHAM             1.000      0.000    999.000    999.000
    BRIAN              1.105      0.088     12.612      0.000
    ANDREW             1.252      0.066     18.934      0.000

 Intercepts
    GRAHAM             3.433      0.363      9.451      0.000
    BRIAN              3.427      0.406      8.431      0.000
    ANDREW             3.767      0.449      8.395      0.000

 Variances
    TLENGTH            1.915      0.723      2.649      0.008

 Residual Variances
    GRAHAM             0.064      0.034      1.873      0.061
    BRIAN              0.141      0.060      2.352      0.019
    ANDREW             0.018      0.040      0.442      0.658

QUALITY OF NUMERICAL RESULTS

     Condition Number for the Information Matrix      0.968E-03
       (ratio of smallest to largest eigenvalue)