Psy 524
Ainsworth
Psy 524 Lab #2
Data Screening
Write any answers and paste any figures into this document after the appropriate question, and either print it out and hand it in or email it in by class on Tuesday, Feb. 17th.
1. Create a .lsp file out of the forclass.sav data (download the "Creating a .lsp file"
document on the class webpage).
a. Save forclass.sav as a tab-delimited file (forclass.dat).
b. Open arcinstruction.txt in Notepad; insert actual information in any line
that has a ! in front of it, and delete the first two lines of directions.
c. Open forclass.dat in Notepad as well (you may need to open Notepad
separately so that both are open at the same time). Delete the variable
names from the top of the data file and use them in the .lsp syntax.
Copy all of the columns in the forclass.dat data and insert them
between the parentheses. "Save as" forclass.lsp.
Dataset = Forclass
Begin description
Data set I had lying around and thought I’d use as an example
End description
Begin variables
Col 0 = gender = female = 1 and male = 0
Col 1 = ethn = ethnicity
Col 2 = sos = social opinion survey
Col 3 = ego = egotism
Col 4 = n = neuroticism
Col 5 = e = extroversion
Col 6 = o = openness
Col 7 = a = agreeableness
Col 8 = c = conscientiousness
Col 9 = soitot
End variables
Begin data
(
1 0 89 162.5 26 35 36 36 …
0 1 52 151 38 36 34 40 …
0 0 57 185.5 32 43 43 52 …
1 0 57 179.5 28 46 31 35 …
1 0 90 167 42 53 47 37 …
… 51 33 48 48 31 …
… -.487677316090359 -.987537222508109 -.840172443534852 .53362825330652 -.362217276609241 …
All the way to the end of the data file; make sure to include the closing parenthesis below.
)
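If you would rather script step 1 than edit the files by hand in Notepad, here is a rough Python sketch of the same conversion. The function name and the dataset/description defaults are mine, not part of the assignment; it assumes the first line of the .dat file holds the tab-separated variable names, just as the instructions describe.

```python
# Sketch: build a .lsp file from a tab-delimited .dat file whose first
# line contains the variable names (mirrors steps 1a-1c above).
def dat_to_lsp(dat_path, lsp_path, dataset="Forclass",
               description="Example data"):
    with open(dat_path) as f:
        lines = [ln.rstrip("\n") for ln in f if ln.strip()]
    names, rows = lines[0].split("\t"), lines[1:]
    header = [f"Dataset = {dataset}",
              "Begin description", description, "End description",
              "Begin variables"]
    # One "Col i = name" line per variable, as in the template above
    header += [f"Col {i} = {name}" for i, name in enumerate(names)]
    header += ["End variables", "Begin data", "("]
    # Data rows go between the parentheses, whitespace-separated
    body = [r.replace("\t", " ") for r in rows]
    with open(lsp_path, "w") as f:
        f.write("\n".join(header + body + [")"]) + "\n")
```

After saving the tab-delimited file, `dat_to_lsp("forclass.dat", "forclass.lsp")` would write the .lsp in one step.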
2. Look at the graphical relationships among variables. Open Arc and load
forclass.lsp. Create a multipanel plot of SOI (fixed axis on horizontal) versus
N, E, A, O, C, ego, and SOS, marking by gender (assume females are 1). To
do this, go to Graph and Fit, move SOI into "Fixed Axis" and the others (n
through sos) into "Changing Axis", move gender into "Mark by", and then hit
OK. Set OLS = 1 and use the bottom slide bar to cycle through the predictors.
Which variables have, or appear to have, relatively strong relationships with
SOI? Copy and paste the graph with the strongest relationship below.
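As a numeric cross-check on what the multipanel plot shows, you could rank the predictors by their correlation with SOI. This is only a sketch, not part of the assignment; it assumes the data have been loaded into a pandas DataFrame with the same lowercase column names as the .lsp file.

```python
import pandas as pd

def strongest_predictors(df, target="soitot",
                         predictors=("n", "e", "o", "a", "c", "ego", "sos")):
    """Return (predictor, r) pairs sorted by |r| with the target,
    strongest relationship first."""
    r = {p: df[target].corr(df[p]) for p in predictors}
    return sorted(r.items(), key=lambda kv: -abs(kv[1]))
```

The predictor at the top of the list is the one whose panel should show the tightest scatter around the OLS line.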
3. Looking at the same multipanel plot from #2, put OLS back at 0 and select
"Remove Linear Trend". Do any of the variables seem to be heteroscedastic?
There's a slight pattern of tighter variance at the lower sos values.
4. Examine the histograms for each of the variables (except gender, ethn, and
case numbers) and identify any variables with outliers (disconnection) or
skewness (use 10 bins in each histogram). Go to Graph and Fit -> Plot of, and
move each variable in one at a time. If there is a variable with an outlier, copy
and paste it below and then delete the outliers from the distribution. If there
is a variable that is skewed, copy and paste it below.
ego has an outlier
soitot is skewed
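A rough programmatic stand-in for spotting disconnected points in the histograms: flag cases whose standardized score exceeds 3.29, the usual two-tailed p < .001 cutoff for univariate outliers. The function name and the default threshold are my choices, not anything in Arc.

```python
import numpy as np

def flag_outliers(x, z_cut=3.29):
    """Indices of values whose |z| exceeds the cutoff
    (3.29 corresponds to p < .001, two-tailed)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std(ddof=1)
    return np.where(np.abs(z) > z_cut)[0]
```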
5. If there was a variable from #4 that is skewed, transform it. Go to the forclass
menu (or whatever you put in the "dataset =" part of the .lsp syntax).
Click on Transform, move the variable over, change it to a log transform, put
the number 3 in the "c" box (this is a constant you're adding to take the
variable off a scale that includes 0), and hit OK. Plot a histogram of this newly
transformed variable and paste it below.
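The same transform written out in Python, assuming Arc's log transform with constant c amounts to log10(x + c). The base-10 assumption is mine; check Arc's documentation if the base matters for your write-up.

```python
import numpy as np

def log_transform(x, c=3):
    """Log transform with additive constant c, as in step 5:
    c shifts the scale off zero before taking the log.
    Assumes base 10 (not confirmed against Arc)."""
    x = np.asarray(x, dtype=float)
    return np.log10(x + c)
```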
6. Download the salary data set from the class webpage. Using SPSS, open it (it
is called "salery.sav"; it was spelled that way when I got it).
a. Create a casenum variable by using Compute -> casenum =
$casenum.
b. Calculate Mahalanobis distances. Analyze -> Regression -> Linear,
and move casenum into Dependent and everything else except
"female" into Independent(s). Hit "Save" and click on
"Mahalanobis". Continue -> OK. Close the output because you don't
need it. What is the cutoff value for multivariate outliers given these
predictors? The cutoff would be 18.467 (chi-square with 4 dfs at
alpha = .001). And are there any cases that qualify as multivariate
outliers? Yes. If so, which case(s)? Subject number 14, with a
Mahalanobis distance of 19.03417.
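For reference, the regression trick in 6b is just a convenient way to get Mahalanobis distances out of SPSS; the distances depend only on the predictors, not on the dummy dependent variable. Computed directly (a numpy/scipy sketch with my own function names):

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_d2(X):
    """Squared Mahalanobis distance of each row of X from the
    column means, using the sample covariance matrix."""
    X = np.asarray(X, dtype=float)
    diff = X - X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    # D^2_i = (x_i - mean)' S^{-1} (x_i - mean) for each case i
    return np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

def outlier_cutoff(n_predictors, alpha=0.001):
    """Chi-square critical value: a case is a multivariate outlier
    if its D^2 exceeds this."""
    return chi2.ppf(1 - alpha, df=n_predictors)
```

`outlier_cutoff(4)` reproduces the 18.467 cutoff quoted above.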
c. Split the file by Data -> Split File; click on "Compare groups" and
move "female" into "Groups Based on". What function does this
serve? It allows you to screen women and men separately for
outliers. Repeat "b" above. Did the values change? Yes, the values
changed. Are there any multivariate outliers this time? No. What
did this exercise demonstrate? That screening by groups can lead to
different results than screening without grouping. If you are
performing a grouped analysis (ANOVA, MANOVA, etc.), screen by
groups separately.
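To see why grouped screening matters, here is a small pandas sketch (the function and column names are hypothetical): a score that looks extreme against the pooled sample can be perfectly ordinary within its own group, and vice versa.

```python
import pandas as pd

def groupwise_z(df, group_col, value_cols):
    """Within-group z-scores: each value is standardized against its
    own group's mean and SD, mirroring what Split File does for the
    screening in 6c."""
    g = df.groupby(group_col)[list(value_cols)]
    return (df[list(value_cols)] - g.transform("mean")) / g.transform("std")
```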
7. Download “social.sav” from the class webpage. Perform a missing value
analysis.
a. Transform -> Compute and enter num = nvalid(ciccomp to supcomp)
-> OK. Then Data -> Select Cases -> "If condition is satisfied" -> If,
enter num <> 0 into the window -> Continue, and set "Unselected
cases are" to DELETED. This gets rid of any subject without scores on
any quantitative variable.
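The NVALID filter in 7a has a direct pandas analogue (column list hypothetical): keep only rows with at least one valid score on the quantitative variables.

```python
import pandas as pd

def drop_all_missing(df, cols):
    """Drop rows that are missing on every column in cols,
    like num <> 0 with unselected cases deleted."""
    num_valid = df[list(cols)].notna().sum(axis=1)
    return df[num_valid != 0]
```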
b. Analyze -> Missing Value Analysis; include everything except gender,
order, and num in Quantitative Variables.
c. Click on "Descriptives", select "t tests with groups…" and
"Include probabilities in table", then Continue.
d. Select "EM", then hit the "EM" button -> "Save completed data" ->
"File" and title it social2.sav; click Continue and then hit OK.
e. Interpret the output and pick the variable causing the largest
dependency in the other variables' missingness. What is the
probability from Little's MCAR test? Repeat the steps above,
removing the variable causing the biggest problem. What is the
MCAR probability now? Pretend it's OK and continue.
[SPSS output: the "Separate Variance t Tests" table, reporting t, df, P(2-tail), # Present, # Missing, Mean(Present), and Mean(Missing) for each pair of the variables CICCOMP, QDICOMP, SRCHCOMP, EICOMP, SUBCOMP, OOCOMP, and SUPCOMP, is omitted here.]
This is a piece of the output. If you look at the rows for eicomp and oocomp, you'll
see that there is a significant dependency with srchcomp. You could get rid of
eicomp and oocomp, because both of them have missingness dependent on
srchcomp, but then you're losing two variables. I would cheat and get rid of
srchcomp instead.
Little's MCAR test: chi-square = 200.386, df = 167, Prob = .04. This is significant, so
you really shouldn't impute the data.
If you remove srchcomp, the dependencies seem to be OK; there's one at p =
.041, but that's close enough not to worry about. Little's test is still at a probability of
.044, which isn't great, but even though I can't find a reference for it, I bet you
can interpret this as a violation only when it is lower than .001 or something like that
(don't quote me).
f. Highlight gender and order in the data window and copy them. Open
social2.sav and insert the variables into the new data set.
8. Explore the new data separately by gender. You should:
a. Test for skewness and univariate outliers using Explore
i. Calculate z for skewness for each variable and tell me which
variables violate this test.
qdicomp, eicomp, subcomp, and oocomp all violate skewness.
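The z test in 8a.i divides the skewness statistic by its approximate standard error, sqrt(6/N); |z| > 3.29 flags significant skew at p < .001. A sketch (scipy's `skew` returns the sample g1, which is close enough for screening purposes):

```python
import numpy as np
from scipy.stats import skew

def skew_z(x):
    """z for skewness: skewness divided by its approximate
    standard error sqrt(6/N)."""
    x = np.asarray(x, dtype=float)
    return skew(x) / np.sqrt(6.0 / len(x))
```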
ii. Take care of any outliers as you see fit (paste graphs with
outliers, tell me why it's an outlier, and explain what you did to
fix it: delete, change, etc.).
This is all up to you; there is no right or wrong with this, so
I'm not going to give you a definitive answer, because there
isn't one.
Although it does seem like the skewness in qdicomp is fixed by
simply fixing the one extreme outlier (delete it or move it in).
b. Do a further test of normality by asking for a P-P plot (make sure to
split the file first). Graphs -> Legacy Dialogs -> P-P, move the
variables you want over, and hit Continue. Paste the graph that seems
to show the worst violation and interpret it (tell me what it means;
refer to the book for help).
Any of the plots where the dots do not fall on the line indicate that the
variable is not normally distributed, and in the detrended plot most of
the violation seems to indicate a curvilinear pattern instead of a
linear one.
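For interpretation, it may help to see what a P-P plot actually pairs up: the empirical cumulative proportion of each sorted score against the normal CDF evaluated at that score. Points on the diagonal mean the data match a normal distribution. A sketch (the plotting-position formula below is one common choice, not necessarily the one SPSS uses):

```python
import numpy as np
from scipy.stats import norm

def pp_points(x):
    """Return (theoretical, empirical) cumulative probabilities
    for a normal P-P plot of x."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    empirical = (np.arange(1, n + 1) - 0.5) / n  # plotting positions
    theoretical = norm.cdf((x - x.mean()) / x.std(ddof=1))
    return theoretical, empirical
```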
c. Perform square root transformations using Compute and page 83,
"Syntax for Common Data Transformations", or the last page of the
"Data Screening Checklist" on the webpage (I would use "Paste" and
run it from the syntax, hint hint). Run Explore again and tell me
which variables this worked for (normalized), if any, by looking at the
histograms and z for skewness.
It seems like oocomp and eicomp are the best candidates for
transformation. With the square root transforms sqrt(29 - eicomp) and
sqrt(29 - oocomp), they get normalized in terms of the z test, even
though they are rather odd looking.
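The reflect-and-square-root transform used here, as a one-liner. The constant 29 is taken from the answer above and presumably equals the scale maximum plus 1; treat that as an assumption about these particular variables.

```python
import numpy as np

def reflect_sqrt(x, k=29):
    """Reflect a negatively skewed variable around k (assumed to be
    scale maximum + 1) and take the square root, as in sqrt(29 - x)."""
    return np.sqrt(k - np.asarray(x, dtype=float))
```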
d. Do the same thing as in c for any variables the square root didn't
normalize, but using a log10 transformation of the original variables
(not the square-rooted ones). Does this help any? For which variables?
Don't really need to do it.