Statistical Modeling - University of Idaho

Statistical Modeling:
Building a Better Mouse Trap, and others
Dec 10, 2012 at the University of Hong Kong
Stephen Sauchi Lee
Associate Professor of Statistics
Affiliated Professor of Bioinformatics and Computational Biology
Department of Statistics
University of Idaho
Moscow, Idaho, USA
Statistical Modeling
On 3 projects
I. Building a Better Mouse Trap?
The Incremental Utility Behind the
Methodology of Risk Assessment
II. Predicting Parkinson Disease Status
III. Demographic Impacts on Social
Vulnerability in Norway
Academy of Criminal Justice Sciences NYC 2012-03-16
Zachary Hamilton, PhD
Melanie-Angela Neuilly, PhD
Robert Barnoski, PhD
Washington State University
Pullman, Washington
Stephen S. Lee, PhD
University of Idaho
Moscow, Idaho

Four generations:
1) clinical judgment
2) static predictors
3) dynamic factors
4) automated
Regression methods utilized for instrument creation
• LSI-R: Logistic Regression
• COMPAS: Survival Regression

Recent advancements in prediction rarely utilized for
criminal risk assessment
• Decision trees
• Neural networks
• Latent class analysis
▪ Linear approaches assume equal additive quality for all factors (Steadman et al., 2000)
• Typically neglect interaction effects
▪ Machine-learning approaches mirror diagnostic processes more closely (Steadman et al., 2000)
▪ Machine-learning approaches most commonly used:
• Classification Trees (CT) and other recursive partitioning
models (CHAID, CART, ICT, Random Forests, etc.)
• Neural Networks (NN)
▪ Hierarchical question-decision tree model (Breiman, 1984)
• The final answer is the result of a series of conditioning answers (If this -> then that, etc.)
• Used in diagnostic reasoning
• No statistical significance
▪ Random Forests
• Inductive statistical learning
• Aggregation of hundreds of Classification Trees
[Figure: example classification tree. Starting point: total sample. First split: age at first arrest (≤25 vs. ≥25). Second split: prior arrests (≤2 vs. ≥2). Final split: terminal nodes labeled "Recidivism" or "No recidivism".]
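The split-selection step behind a classification tree can be sketched as follows. This is a minimal illustration in Python (not the R code used elsewhere in the deck), assuming Gini impurity as the splitting criterion; the ages and outcomes are made-up toy data, and the function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 outcome labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best 'value <= threshold' split."""
    best = None
    n = len(values)
    # The largest value is skipped because it would leave the right branch empty.
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


# Toy data (hypothetical): age at first arrest and a 0/1 recidivism flag.
ages = [18, 19, 21, 24, 27, 30, 35, 41]
recidivated = [1, 1, 1, 0, 0, 0, 1, 0]

threshold, impurity = best_split(ages, recidivated)
print(f"split at age <= {threshold}, weighted Gini = {impurity:.3f}")
```

A full tree repeats this search within each resulting subgroup, and a random forest aggregates many such trees grown on resampled data.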
▪ Developed in Artificial Intelligence research
▪ Data mining technique for pattern recognition
▪ Aims at modeling the lower-level brain functions
▪ Layered nodes of fact-sets instead of rules, used to train the network
▪ Based on the training data, the network “learns” to deduce the right answer to any new piece of information
▪ Used in psychiatric diagnosis
[Figure: neural network schematic. Inputs feed hidden neurons through weights and activation functions to produce predicted outputs; weights are recalculated based on predicted and actual outputs.]
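The idea of recalculating weights from predicted and actual outputs can be sketched with a single sigmoid neuron. This is a deliberately reduced illustration in Python: it has no hidden layer, and the learning rate, epoch count, and toy data are assumptions for the example, not settings from the study.

```python
import math


def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Predicted output of one neuron for input vector x."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_neuron(inputs, targets, learning_rate=0.5, epochs=2000):
    """Adjust weights after each case using (actual - predicted)."""
    weights = [0.0] * len(inputs[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(inputs, targets):
            error = y - predict(weights, bias, x)
            weights = [w + learning_rate * error * xi
                       for w, xi in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias


# Toy training data (hypothetical): the output is 1 if either input is 1.
toy_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
toy_targets = [0, 1, 1, 1]

weights, bias = train_neuron(toy_inputs, toy_targets)
predictions = [round(predict(weights, bias, x)) for x in toy_inputs]
print(predictions)
```

A multi-layer network applies the same error-driven update to every layer of weights, which is what lets it capture interactions that a single neuron cannot.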
▪ Studies using CT-like analyses, as well as NN, tend to make use of smaller samples (≤ 1,500) (except Berk et al., 2009; Palocsay et al., 2000; and Silver et al., 2000)
▪ Overall, results are mixed, but those finding significant improvement via CT use lack proper validation (Liu, 2010)
▪ Studies using NN show very split results (Liu, 2010)
▪ Overall, very few studies have investigated the utility of CT and NN for predicting recidivism
▪ Previous studies have been limited
• In power
• To violence prediction
▪ The current study remedies such limitations
• Close to ½ million cases
• General recidivism as well as possibilities for investigating offense-specific recidivism

Previously utilized LSI-R
• Found laborious by community corrections officers
• Evaluated to be strengthened by increase of static items (Barnoski, 2003)

Created current instrument in 2006
• Factors strongly related to recidivism: demographics, juvenile record,
commitments to DOC, felonies, misdemeanors, and violations
• Removed dynamic items (interview not required)
• Instrument scored from logistic regression - logit weights
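Scoring an instrument from logit weights can be sketched as below. The item names, weights, and intercept are hypothetical placeholders chosen for illustration (the deck does not list the fitted coefficients), and the sketch is in Python rather than the R used later in the deck.

```python
import math

# Hypothetical logit weights; the real instrument's coefficients are not shown here.
HYPOTHETICAL_INTERCEPT = -1.5
HYPOTHETICAL_WEIGHTS = {
    "male": 0.30,
    "adult_felonies": 0.15,
    "misd_property": 0.25,
}


def recidivism_probability(item_scores):
    """Sum weighted item scores on the logit scale, then map to a probability."""
    logit = HYPOTHETICAL_INTERCEPT + sum(
        HYPOTHETICAL_WEIGHTS[name] * value for name, value in item_scores.items()
    )
    return 1.0 / (1.0 + math.exp(-logit))


example_offender = {"male": 1, "adult_felonies": 2, "misd_property": 1}
probability = recidivism_probability(example_offender)
print(f"predicted probability = {probability:.3f}")
```

Because every item enters additively, a scoring sheet only needs the weights; no interview-based (dynamic) items are required.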

Comparable Predictive Validity for WA Sample (WSIPP, 2007)
• LSI-R AUC = .66
• WA Static Risk AUC = .74

24 variables included from current risk prediction instrument
• 3 year follow-up (release from incarceration)
• Any felony recidivism

2 step creation
• Construction sample
▪ All offenders released from prison or jail and placed on community supervision from 1986 to 2000 (N = 287,417)
• Validation sample
▪ All offenders released from prison or jail and placed on community supervision from 2001 to 2002 (N = 71,957)

Compare methods of prediction models
• Area under the receiver operating characteristic (AUC)
▪ Values in the .500s indicate no predictive accuracy
▪ Values in the .600s are weak, .700s moderate, and above .800 strong predictive accuracy
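The AUC can be read as the probability that a randomly chosen recidivist receives a higher risk score than a randomly chosen non-recidivist. A minimal Python sketch of that pairwise definition follows; the scores and outcomes are made-up toy values, not study data.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count one half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy risk scores and outcomes (hypothetical).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
toy_outcomes = [1, 1, 0, 1, 0, 1, 0, 0]

toy_auc = auc(toy_scores, toy_outcomes)
print(f"AUC = {toy_auc:.4f}")
```

The pairwise loop is quadratic in sample size, so for samples as large as those in this study a rank-based computation would be the practical choice; the value is the same.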
Descriptive Statistics (N=359,374)
Predictor                                        %/Mean(SD)
White (not included in model)                    79.7
1. Male                                          18.7
2. Age At Risk                                   31.7(10.2)
3. Adult Felonies                                2.1(1.9)
4. Juvenile Felony Score                         32
5. Juvenile Person Score                         6
6. Number of DOC Commitments                     2.0(1.7)
7. Homicide/Manslaughter                         1
8. Felony Sex                                    7
9. Felony Violent Property                       9
10. Felony Non-Domestic Violence Assault         16
11. Felony Domestic Violence Assault             2
12. Felony Weapon                                4
13. Felony Property                              85
14. Felony Drug                                  62
15. Felony Escape                                8
16. Misdemeanor Non-Domestic Violence Assault    23
17. Misdemeanor Domestic Violence Assault        21
18. Misdemeanor Sex                              3
19. Misdemeanor Domestic Violence Other          1
20. Misdemeanor Weapon                           4
21. Misdemeanor Property                         52
22. Misdemeanor Drug                             17
23. Misdemeanor Escape                           1
24. Misdemeanor Alcohol                          17
NewFelony (Outcome)                              44
▪ Extended validation sample of original instrument construction
• Strongest model predictors (weights) were: 1) Misd. Property, 2) Juvenile Felony, 3) Misd. Domestic Violence Assault, 4) Misd. Drug, 5) Misd. Sex, 6) Male
• Findings comparable to original instrument construction
Models             Construction Sample ROCs    Validation Sample ROCs
Original Sample    .756                        .742
Extended Sample    .750                        .749
Variable Importance in Random Forest Model
[Figure: IncNodePurity (axis 0 to 3000) by variable; variables listed in plotted order]
TotalAdultFelonyAdjs
TotalMisdPropertyScore
SentenceCntScore
TotalJuvenileFelonyScore
AgeScore
TotalFelPropertyScore
TotalMisdNonDvAsltScore
TotalMisdDvAssaultScore
TotalMisdDrugScore
TotalFelDrugScore
TotalMisdAlcoholScore
TotalFelSexScore
TotalFelNonDvAsltScore
MaleScore
TotalFelVioPropScore
TotalMisdSexScore
TotalJuvenilePersonScore
TotalMisdWeaponScore
TotalFelEscapeScore
TotalWeaponScore
TotalFelHomicideScore
TotalMisdEscapeScore
TotalMisdDvOtherScore
TotalFelDvAssaultScore
Strongest Model Predictors: 1) Felony Adjudications, 2) Misd. Property, 3) Sentence Length, 4) Juvenile Felony, 5) Age, 6) Felony Property
Model Comparisons
Models                 Construction Sample ROC (SE)    Validation Sample ROC (SE)
Logistic Regression    .750 (.001)                     .749* (.002)
Neural Network         .755* (.001)                    .750* (.002)
Random Forest          .750 (.001)                     .734 (.002)
[Figures: ROC Curves for the Construction Sample and ROC Curves for the Validation Sample. Axes: False Alarm Rate (x) vs. Hit Rate (y), each 0.0 to 1.0. Curves: Logistic Regression, Neural Net, Random Forest.]
Model Comparisons
▪ Significant differences found
• Neural network significantly greater predictive validity than random forest
• Neural network significantly greater predictive validity than logistic regression, but only in the construction sample
[Figure: ROC_Area (axis 0.730 to 0.755) for Logistic Regression, Neural Net, and Random Forest in the Construction Sample and Validation Sample.]
▪ Neural networks performed best, followed by logistic regression and random forest
▪ ROC differences between methods found to be significant, but not universally
▪ Preliminary nature of findings is stressed
▪ Lack of specificity of outcome measure and sample heterogeneity
• Any felony within 3 years
• Specialization and taxonomic structures not considered
▪ Unit of analysis is incarceration cycle
• Violation of independence assumption for repeat incarcerations
▪ Exclusion of dynamic predictors
▪ Add dynamic predictors to models
• Available since 2008
• Prior/preliminary findings indicate only modest improvement
▪ Examine impact of latent variable methods
• 4th potential model
▪ Disentangle heterogeneity
• Subgroup analyses based on offense specialties, i.e., drug, violent, sex offender
Predicting Parkinson’s disease status
with vocal dysphonia measurements
Roxana Hickey
Bioinformatics & Computational Biology
Statistics 519 Multivariate Statistics Term Project
Professor Stephen Lee
April 27, 2011
Outline
• Background
▫ Parkinson’s disease
▫ Vocal dysphonia
• Study dataset
• Statistical analyses
• Conclusions
Parkinson’s disease
• Neurological disorder that leads to shaking and difficulty with walking, movement and coordination [1]
• Affects >1 million people in North America [2]
▫ rapidly increased prevalence after age 60 [3]
• No cure, but medication available to alleviate symptoms, especially in early stages [4]
▫ early detection key to effective treatment strategies
http://www.healthtree.com/articles/parkinsons-disease/causes/
Parkinson’s disease & vocal impairment
• ~90% of individuals with Parkinson’s disease have some form of vocal impairment [5, 6]
▫ characteristics [7]
▪ dysphonia (impaired production of vocal sounds)
▪ dysarthria (problems with normal articulation in speech)
▫ may be one of earliest indicators of onset of illness [8]
• Tests for vocal impairment [9, 10]
▫ sustained phonations [11, 12] (focus of this study)
▪ produce single vowel and hold pitch constant
▫ running speech [12]
▪ speak standard sentences that contain a representative sample of linguistic units
Measures of assessing vocal dysphonia
• Traditional methods [11, 12]
▫ pitch (F0, fundamental frequency of vocal oscillation)
▫ absolute sound pressure level (loudness)
▫ jitter (variation in F0 from vocal cycle to vocal cycle)
▫ shimmer (variation in amplitude)
▫ noise-to-harmonics ratio
• Novel methods [13, 14]
▫ nonlinear dynamical systems theory and nonlinear time series analysis
▫ recurrence period density entropy
▫ detrended fluctuation analysis
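Jitter and shimmer as cycle-to-cycle variation can be sketched as follows. The sketch assumes the common "local" definition (mean absolute difference between consecutive cycles, relative to the mean, as a percentage), which may differ in detail from the MDVP software's variants; the period and amplitude values are made-up, and the code is Python rather than the deck's R.

```python
def relative_perturbation(values):
    """Mean absolute cycle-to-cycle difference as a percentage of the mean value."""
    differences = [abs(a - b) for a, b in zip(values, values[1:])]
    mean_difference = sum(differences) / len(differences)
    mean_value = sum(values) / len(values)
    return 100.0 * mean_difference / mean_value


# Toy measurements (hypothetical): pitch periods in seconds and peak amplitudes.
periods_s = [0.010, 0.011, 0.010, 0.009]
amplitudes = [1.0, 0.9, 1.0, 1.1]

jitter_percent = relative_perturbation(periods_s)    # variation in period (F0)
shimmer_percent = relative_perturbation(amplitudes)  # variation in amplitude
print(f"jitter = {jitter_percent:.1f}%, shimmer = {shimmer_percent:.1f}%")
```

The same function serves both measures because they differ only in which per-cycle quantity is compared.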
Measures of assessing vocal dysphonia
• Measurements differ in robustness [14]
▫ uncontrolled variation in acoustic environment
▫ physical condition and characteristics of subject
• Therefore, chosen measurement methods should be as robust as possible to this variation
▫ Goal of the study: identify an optimal feature set that is both robust to uncontrolled variation and able to classify patients with Parkinson’s disease based on vocal dysphonic symptoms
▪ Additional advantage: possibility of monitoring patients remotely
http://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/parkinsons.data
Subjects & methods
• Subjects
▫ 31 individuals
▪ 8 healthy
▪ 23 with Parkinson’s disease (PD)
▫ average of six sustained vowel phonations recorded from each subject
▪ Total n=195
• Calculation of features via software programs
▫ traditional measures
▫ non-standard measures, including new measure
proposed by authors: pitch period entropy
Variables
Grouping variable: status = 0 (healthy), = 1 (PD)

Attribute           Description
MDVP:Fo(Hz)         Average vocal fundamental frequency
MDVP:Fhi(Hz)        Maximum vocal fundamental frequency
MDVP:Flo(Hz)        Minimum vocal fundamental frequency

Measures of variation in fundamental frequency
MDVP:Jitter(%)      MDVP jitter as percentage
MDVP:Jitter(Abs)    MDVP absolute jitter in microseconds
MDVP:RAP            MDVP Relative Amplitude Perturbation
MDVP:PPQ            MDVP five-point Period Perturbation Quotient
Jitter:DDP          Average absolute difference of differences between cycles, divided by the average period

Measures of variation in amplitude
MDVP:Shimmer        MDVP local shimmer
MDVP:Shimmer(dB)    MDVP local shimmer in decibels
Shimmer:APQ3        3-pt Amplitude Perturbation Quotient
Shimmer:APQ5        5-pt Amplitude Perturbation Quotient
MDVP:APQ            MDVP 11-point Amplitude Perturbation Quotient
Shimmer:DDA         Avg abs. diff. between consecutive differences between the amplitudes of consecutive periods

Measures of ratio of noise to tonal components in voice
NHR                 Noise-to-Harmonics Ratio
HNR                 Harmonics-to-Noise Ratio

Nonlinear dynamical complexity measures
RPDE                Recurrence Period Density Entropy
D2                  Correlation dimension

Single fractal scaling exponent
DFA                 Detrended Fluctuation Analysis

Nonlinear measures of fundamental frequency variation
spread1             Nonlinear measure of fundamental frequency variation
spread2             Nonlinear measure of fundamental frequency variation
PPE                 Pitch period entropy

MDVP = (Kay Pentax) Multi-Dimensional Voice Program
Statistical analyses
• EDA
• PCA
• MANOVA
• Hotelling’s T²
• QDA
• Classification tree (with random forest)
EDA
[Figures: five exploratory plots comparing the two groups; legend: 0=healthy, 1=PD]
PCA
MANOVA
• template
H0: µhealthy = µParkinson’s
Hotelling’s T² test
• H0: µhealthy = µParkinson’s (p = 22)
• T² test statistic = 187.48
• df = 48 + 147 - 2 = 193
• critical T²(0.05; 22, 193) ≈ 47 (extrapolated)
• Conclusion: reject H0 (α = 0.05)
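Rather than extrapolating a critical value from a table, the two-sample T² statistic can be converted to an F statistic using the standard relation F = T² (n1 + n2 - p - 1) / (p (n1 + n2 - 2)), with p and n1 + n2 - p - 1 degrees of freedom. A small Python sketch of that arithmetic, using the group sizes and statistic reported on the slide:

```python
def hotelling_t2_to_f(t_squared, n1, n2, p):
    """Convert a two-sample Hotelling T-squared statistic to its F equivalent.

    Returns (F statistic, numerator df, denominator df).
    """
    numerator_df = p
    denominator_df = n1 + n2 - p - 1
    f_statistic = t_squared * denominator_df / (p * (n1 + n2 - 2))
    return f_statistic, numerator_df, denominator_df


# Values reported on the slide: 48 healthy and 147 PD recordings, 22 variables.
f_statistic, df1, df2 = hotelling_t2_to_f(187.48, n1=48, n2=147, p=22)
print(f"F = {f_statistic:.3f} on ({df1}, {df2}) degrees of freedom")
```

The resulting F value can then be compared against an F distribution with those degrees of freedom in any statistics package.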
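QDA
library(MASS)  # provides qda()
# Leave-one-out cross-validated QDA on the 22 voice measures (columns 2 to 23)
park.qda.cv <- qda(park.g[,2:23], park.g$group, CV=TRUE)
table(Actual=park.g$group, Classified=park.qda.cv$class)

        Classified
Actual     0     1
     0    35    13
     1     9   138

CV error rate: (13+9)/195 = 11.28%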
Classification tree (CART)
table(Actual=park.g$group, Classified=park.cart.pred)

        Classified
Actual     0     1
     0    38    10
     1     6   141

Error rate: (10+6)/195 = 8.21%
Random forests
park.rf <- randomForest(group~., data=park.g,
importance=TRUE, proximity=TRUE)
park.rf
Call:
randomForest(formula = group ~ ., data = park.g,
importance = TRUE, proximity = TRUE)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 4
OOB estimate of error rate: 9.23%
Confusion matrix:
     0    1  class.error
0   35   13       0.2708
1    5  142       0.0340
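The overall and per-class error rates follow directly from the confusion matrix. A short Python check of the arithmetic, using the random forest counts above (rows are actual classes, columns are predicted classes):

```python
def error_rates(confusion):
    """Return (overall error, per-class error list) from a square confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    overall = (total - correct) / total
    per_class = [(sum(row) - row[i]) / sum(row) for i, row in enumerate(confusion)]
    return overall, per_class


# Random forest out-of-bag confusion matrix from the slide.
rf_confusion = [
    [35, 13],   # actual 0 (healthy)
    [5, 142],   # actual 1 (PD)
]

overall_error, class_errors = error_rates(rf_confusion)
print(f"overall = {overall_error:.4f}, by class = {[round(e, 4) for e in class_errors]}")
```

The much higher error for the healthy class reflects the imbalance between the 48 healthy and 147 PD recordings.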
[Figure: cumulative error rates; legend: overall, healthy, PD]
Random forests
varImpPlot(park.rf)
Random forests
Conclusions
• Measurements of vocal dysphonia differ between
healthy and PD individuals (MANOVA,
Hotelling’s T2)
• QDA and classification tree able to separate
healthy from PD individuals using a reduced
“feature set”
▫ Error rates ~9-12%
▫ PPE, shimmer, average and high fundamental frequency measurements
▪ Little et al. (2009) concluded PPE greatly improved classification performance; cautioned against using traditional methods alone and suggested instead using novel methods such as PPE
REGIONAL DEMOGRAPHIC
IMPACTS ON DRIVERS OF
SOCIAL VULNERABILITY:
A LOCAL VIEW OF
NORWAY
Patrick Fitzsimons
Supported by National Science Foundation
Office of Polar Programs Grant ARC-0909191
Special Thanks to:
▪ Committee:
• Harley Johansen Ph.D.
• Tim Frazier Ph.D.
• Stephen Lee Ph.D.
▪ Family and Friends
Outline
▪ Climate change developments in the Arctic
▪ Migration patterns' relation to social vulnerability
▪ Methods and Results
• Creation of index for accessibility
• Examining recent migration patterns
• Multivariate analysis with social vulnerability drivers
▪ Discussion
[Map: Counties]
[Map: Municipalities, Barents & Non-Barents]
Research Questions
▪ Did the national government succeed in centralization and creating urban centers within northern Norway?
• Accessibility Index
▪ Have migration patterns changed with the changing development policy?
• Calculate municipal net migration over past 2 decades
▪ Is there a significant difference amongst driving variables of social vulnerability between northern and southern Norway?
• Multivariate analysis
Climate Change Effects in Arctic Europe
▪ Highest warming rate in last 3 decades was in Arctic.
▪ Barents Sea ice-free during past 5 years and declined at 12% per decade, faster than models predicted.
▪ Boreal forest moving poleward. Shrubs becoming trees. Tree line altitude increase = 100 meters, poleward = >30 km in some locations since 1909.
▪ Near-surface thaw of permafrost causing infrastructure problems and loss of palsa mires, causing release of methane and CO2.
▪ Reindeer and fish migrations changing.
Social Vulnerability
▪ “Social vulnerability occurs when unequal exposure to risk is coupled with unequal access to resources” (Morrow 2008).
▪ Groups potentially discriminated against are the socially, culturally, and economically marginalized (Mustafa 1998, Morrow 2008).
▪ Variables promoting social vulnerability include: poverty, minority status, gender, age, disability, human capital (Morrow 2008).
Why look at social vulnerability?
▪ Previous migration patterns can be detrimental
▪ Northern Norway still tied to natural resources
▪ Welfare state
• North receives subsidies and its debts are borne by the national government.
• OECD states Norway will have to change its social programs and transfer programs to maintain growth in the economy.
Methods and Results
▪ Indices of accessibility for the region
▪ Migration rates during the past 2 decades
▪ Multivariate analysis with social vulnerability variables.
Accessibility Index
▪ = cities with population of 20,000 or greater. (Magnet Cities)
▪ = distance from municipal centroid to city ( ).
▪ To be included, city had to be within 200 km of centroid.
[Map: Magnet Cities (cities > 20,000 people)]
[Map: Accessibility Index; classes: <3, 3.1 - 8.6, 8.7 - 18.6, > 18.6]
[Map: Net Migration (%) 1990 - 2009]
[Map: Population Change Aged 16-29 From 1990 to 2010; classes: -29.3 - -16.0, -16.0 - -11.7, -11.7 - 0.0, 0.1 - 7.9, 7.9 - 28.2]
Pressures from Outmigration
▪ Loss of potential labor force
▪ Loss of human capital
▪ Makes region less attractive for potential employers
▪ Less of a tax base
▪ Loss of potential progeny
Variables used in multivariate analysis
- Percent Net-migration (1990-2009)
- Percent Household Income < 150,000 NOK ≈ $22,000 USD (2010)
- Percent Household Income > 500,000 NOK ≈ $85,000 USD (2010)
- Percent elderly (Old age dependency) (2010)
- Percent employed in primary industries, i.e., mining, fishing, farming (2010)
- Percent Labor Force participation (2010)
- Percent unemployed (2010)
- Percent paid for Social Assistance (2009)
- Percent over age 25 with only completed primary education (2010)
- Percent over age 25 with secondary education attainment (2010)
- Percent over age 25 with attainment beyond secondary w/o completion of tertiary (2010)
- Percent over age 25 with attainment of tertiary education (2010)
- Percent Voter turnout (2008)
- Percent Municipal Net Loan to Gross Revenue (2010)
- Percent Municipal Net loan debt per capita (2010)
- Percent Municipal Long term debt to Revenue (2010)
Barents vs. Non-Barents
[Map: Barents (N=88) and Non-Barents (N=342) municipalities; N=430 in total]
           Df  Hotelling-Lawley  approx F  num Df  den Df     Pr(>F)
barentsF    1            1.5733    43.424      15     414  < 2.2e-16 ***
Residuals 428
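With only two groups (one hypothesis degree of freedom), the approximate F in the MANOVA output is the Hotelling-Lawley trace scaled by the ratio of denominator to numerator degrees of freedom. A short Python check of that arithmetic using the values in the output above; the small discrepancy comes from the trace being printed to four decimals.

```python
def f_from_hotelling_lawley(trace, numerator_df, denominator_df):
    """F statistic implied by a Hotelling-Lawley trace when there are two groups."""
    return trace * denominator_df / numerator_df


approx_f = f_from_hotelling_lawley(1.5733, numerator_df=15, denominator_df=414)
print(f"approx F = {approx_f:.3f}")
```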
Component (eigenvalue; % of total variance), with variables and (component loadings)
1. Age, Income, School, Migration and Labor Force (2.324; 33.75%)
   Percent Elderly (0.730)
   Income < 150,000 (0.762)
   Percent Primary Sector (0.651)
   Percent Tertiary School1 (-0.738)
   Percent Tertiary School2 (-0.641)
   Net Migration (-0.695)
   Labor Force Part. (-0.612)
   Income > 500,000 (-0.814)
2. Social Welfare (1.692; 17.90%)
   Percent Unemploy. (0.599)
   Upper Secondary Ed. (-0.576)
   Social Assistance (0.543)
   Labor Force Part. (-0.527)
3. Debt (1.305; 10.65%)
   Net Loan to Gross Rev. (0.617)
   Long Term debt (0.637)
   Net Loan Debt/capita (0.709)
4. Education (1.115; 7.77%)
   Tertiary Ed (-0.543)1
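The share of variance for each component can be checked against the reported values. The sketch below rests on an inference, not on anything stated in the slides: the reported percentages are reproduced when the values in the eigenvalue column are squared and divided by 16 (the number of standardized variables), which suggests the column may hold component standard deviations rather than eigenvalues.

```python
NUMBER_OF_VARIABLES = 16  # standardized variables entering the PCA

# Values from the "Eigenvalues" column and the reported % of total variance.
reported = {
    "1. Age, Income, School, Migration and Labor Force": (2.324, 33.75),
    "2. Social Welfare": (1.692, 17.90),
    "3. Debt": (1.305, 10.65),
    "4. Education": (1.115, 7.77),
}


def percent_variance(component_sd, n_variables=NUMBER_OF_VARIABLES):
    """Percent of total variance if the value is a component standard deviation."""
    return 100.0 * component_sd ** 2 / n_variables


computed = {name: percent_variance(sd) for name, (sd, _) in reported.items()}
for name, (sd, reported_pct) in reported.items():
    print(f"{name}: computed {computed[name]:.2f}%, reported {reported_pct:.2f}%")
```

The four components together account for roughly 70% of the total variance under either reading.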
[Figure: Plots of municipalities on First 3 Principal Components; legend: Barents, Non-Barents]
[Map: Standard Deviations on First Principal Component; classes: < -2.5, -2.5 - -1.5, -1.5 - -0.50, -0.50 - 0.50, 0.50 - 1.5, > 1.5 Std. Dev.]
QDA Analysis
▪ Quadratic Discriminant Analysis on the same 16 variables.
▪ Results illustrate a discernible difference between North and South.
[Figure legend: 1 = Non-Barents, 2 = Barents]
*** correct classification rate of 94.65%
*** cross-validated
Discussion
▪ Distinction between North and South urbanization
▪ Migration and social vulnerability
▪ Life Biography (20 somethings)
▪ Caveats
• Missing variables (ethnic minority)
• Indigenous group Sami
▪ Further research
• Community level analysis
Photo by Hildegun Johnsen
Questions, Feedback?
Thank you