Statistical Modeling: Building a Better Mouse Trap, and others
Dec 10, 2012 at the University of Hong Kong
Stephen Sauchi Lee
Associate Professor of Statistics
Affiliated Professor of Bioinformatics and Computational Biology
Department of Statistics, University of Idaho, Moscow, Idaho, USA

Statistical Modeling on 3 projects
I. Building a Better Mouse Trap? The Incremental Utility Behind the Methodology of Risk Assessment
II. Predicting Parkinson Disease Status
III. Demographic Impacts on Social Vulnerability in Norway

Academy of Criminal Justice Sciences, NYC, 2012-03-16
Zachary Hamilton, PhD; Melanie-Angela Neuilly, PhD; Robert Barnoski, PhD
Washington State University, Pullman, Washington
Stephen S. Lee, PhD
University of Idaho, Moscow, Idaho

Four generations of risk assessment:
• 1) clinical judgment
• 2) static predictors
• 3) dynamic factors
• 4) automated

Regression methods utilized for instrument creation
• LSI-R: Logistic Regression
• COMPAS: Survival Regression

Recent advancements in prediction rarely utilized for criminal risk assessment
• Decision trees
• Neural networks
• Latent class analysis

Linear approaches assume equal additive quality for all factors (Steadman et al., 2000)
• Typically neglect interaction effects
Machine-learning approaches mirror diagnostic processes more closely (Steadman et al., 2000)

Machine-learning approaches most commonly used:
• Classification Trees (CT) and other recursive partitioning models (CHAID, CART, ICT, Random Forests, etc.)
• Neural Networks (NN)

Classification Trees: hierarchical question-decision tree model (Breiman, 1984)
• The final answer is the result of a series of conditioning answers (if this, then that, etc.)
• Used in diagnostic reasoning
• No statistical significance
Random Forests
• Inductive statistical learning
• Aggregation of hundreds of Classification Trees

[Figure: example classification tree. Starting point: total sample. First split: age at first arrest (≤25 vs. ≥25). Second and final split: prior arrests (≤2: no recidivism; ≥2: recidivism). The remaining terminal node is labeled "No recidivism".]

Neural Networks
• Developed in Artificial Intelligence research
• Data mining technique for pattern recognition
• Aim at modeling the lower-level brain functions
• Layered nodes of fact-sets instead of rules, used to train the network
• Based on the training data, the network "learns" to deduce the right answer to any new piece of information
• Used in psychiatric diagnosis

[Figure: neural network diagram. Inputs feed through weights to hidden neurons with activation functions, then through weights to predicted outputs; weights are recalculated based on predicted and actual outputs.]

• Studies using CT-like analyses, as well as NN, tend to use smaller samples (≤ 1,500) (except Berk et al., 2009; Palocsay et al., 2000; and Silver et al., 2000)
• Overall, results are mixed, but studies finding significant improvement via CT lack proper validation (Liu, 2010)
• Studies using NN show very split results (Liu, 2010)
• Overall, very few studies have investigated the utility of CT and NN for predicting recidivism

Previous studies have been limited
• In power
• To violence prediction
The current study remedies such limitations
• Close to ½ million cases
• General recidivism as well as possibilities for investigating offense-specific recidivism

Previously utilized LSI-R
• Found laborious by community corrections officers
• Evaluated to be strengthened by increase of static items (Barnoski, 2003)
Created current instrument in 2006
• Factors strongly related to recidivism: demographics, juvenile record, commitments to DOC, felonies, misdemeanors, and violations
• Removed dynamic items (interview not required)
• Instrument scored from logistic regression (logit weights)

Comparable Predictive Validity for WA Sample (WSIPP, 2007)
• LSI-R
AUC = .66
• WA Static Risk AUC = .74

24 variables included from current risk prediction instrument
• 3-year follow-up (release from incarceration)
• Any felony recidivism
2-step creation
• Construction sample: all offenders released from prison or jail and placed on community supervision from 1986 to 2000 (N = 287,417)
• Validation sample: all offenders released from prison or jail and placed on community supervision from 2001 to 2002 (N = 71,957)
Compare methods of prediction models
• Area under the receiver operating characteristic curve (AUC)
▫ Values in the .500s indicate no predictive accuracy
▫ Values in the .600s are weak, in the .700s moderate, and above .800 strong predictive accuracy

Descriptive Statistics (N = 359,374)
Predictor                                        %/Mean (SD)
White (not included in model)                    79.7
1. Male                                          18.7
2. Age At Risk                                   31.7 (10.2)
3. Adult Felonies                                2.1 (1.9)
4. Juvenile Felony Score                         32
5. Juvenile Person Score                         6
6. Number of DOC Commitments                     2.0 (1.7)
7. Homicide/Manslaughter                         1
8. Felony Sex                                    7
9. Felony Violent Property                       9
10. Felony Non-Domestic Violence Assault         16
11. Felony Domestic Violence Assault             2
12. Felony Weapon                                4
13. Felony Property                              85
14. Felony Drug                                  62
15. Felony Escape                                8
16. Misdemeanor Non-Domestic Violence Assault    23
17. Misdemeanor Domestic Violence Assault        21
18. Misdemeanor Sex                              3
19. Misdemeanor Domestic Violence Other          1
20. Misdemeanor Weapon                           4
21. Misdemeanor Property                         52
22. Misdemeanor Drug                             17
23. Misdemeanor Escape                           1
24. Misdemeanor Alcohol                          17
NewFelony (Outcome)                              44

Extended validation sample of original instrument construction
• Strongest model predictors (weights) were: 1) Misd. Property, 2) Juvenile Felony, 3) Misd. Domestic Violence Assault, 4) Misd. Drug, 5) Misd.
Sex, 6) Male
• Findings comparable to original instrument construction

Models             Construction Sample ROCs    Validation Sample ROCs
Original Sample    .756                        .742
Extended Sample    .750                        .749

Variable Importance in Random Forest Model
[Figure: bar chart of IncNodePurity (axis 0 to 3000) by variable. Variables as listed on the chart: TotalAdultFelonyAdjs, TotalMisdPropertyScore, SentenceCntScore, TotalJuvenileFelonyScore, AgeScore, TotalFelPropertyScore, TotalMisdNonDvAsltScore, TotalMisdDvAssaultScore, TotalMisdDrugScore, TotalFelDrugScore, TotalMisdAlcoholScore, TotalFelSexScore, TotalFelNonDvAsltScore, MaleScore, TotalFelVioPropScore, TotalMisdSexScore, TotalJuvenilePersonScore, TotalMisdWeaponScore, TotalFelEscapeScore, TotalWeaponScore, TotalFelHomicideScore, TotalMisdEscapeScore, TotalMisdDvOtherScore, TotalFelDvAssaultScore]
Strongest model predictors: 1) Felony Adjudications, 2) Misd. Property, 3) Sentence Length, 4) Juvenile Felony, 5) Age, 6) Felony Property

Model Comparisons
Models                 Construction Sample ROC (SE)    Validation Sample ROC (SE)
Logistic Regression    .750 (.001)                     .749* (.002)
Neural Network         .755* (.001)                    .750* (.002)
Random Forest          .750 (.001)                     .734 (.002)

[Figure: ROC curves for the construction sample and for the validation sample. Axes: hit rate vs. false alarm rate. Curves: logistic regression, neural net, random forest.]

Model Comparisons
Significant differences found
• Neural network: significantly greater predictive validity than random forest
• Neural network: significantly greater predictive validity than logistic regression, but only in the construction sample
[Figure: ROC area (about 0.730 to 0.755) for logistic regression, neural net, and random forest, in the construction and validation samples.]

Neural networks performed best, followed by logistic regression and random forest
ROC differences between methods were significant, but not universally
The preliminary nature of the findings is stressed

Lack of specificity of
outcome measure and sample heterogeneity
• Any felony within 3 years
• Specialization and taxonomic structures not considered
Unit of analysis is incarceration cycle
• Violation of independence assumption for repeat incarcerations
Exclusion of dynamic predictors

Add dynamic predictors to models
• Available since 2008
• Prior/preliminary findings indicate only modest improvement
Examine impact of latent variable methods
• 4th potential model
Disentangle heterogeneity
• Subgroup analyses based on offense specialties, i.e. drug, violent, sex offender

Predicting Parkinson’s disease status with vocal dysphonia measurements
Roxana Hickey
Bioinformatics & Computational Biology
Statistics 519 Multivariate Statistics Term Project
Professor Stephen Lee
April 27, 2011

Outline
• Background
▫ Parkinson’s disease
▫ Vocal dysphonia
• Study dataset
• Statistical analyses
• Conclusions

Parkinson’s disease
• Neurological disorder that leads to shaking and difficulty with walking, movement and coordination [1]
• Affects >1 million people in North America [2]
▫ rapidly increased prevalence after age 60 [3]
• No cure, but medication available to alleviate symptoms, especially in early stages [4]
▫ early detection key to effective treatment strategies
http://www.healthtree.com/articles/parkinsons-disease/causes/

Parkinson’s disease & vocal impairment
• ~90% of individuals with Parkinson’s disease have some form of vocal impairment [5, 6]
▫ characteristics [7]: dysphonia (impaired production of vocal sounds) and dysarthria (problems with normal articulation in speech)
▫ may be one of the earliest indicators of onset of illness [8]
• Tests for vocal impairment [9, 10]
▫ sustained phonations [11, 12] (focus of this study): produce a single vowel and hold pitch constant
▫ running speech [12]: speak standard sentences that contain a representative sample of linguistic units

Measures of assessing vocal dysphonia
• Traditional methods [11, 12]
▫ pitch (F0, fundamental frequency of vocal oscillation)
▫ absolute sound pressure level
(loudness)
▫ jitter (variation in F0 from vocal cycle to vocal cycle)
▫ shimmer (variation in amplitude)
▫ noise-to-harmonics ratio
• Novel methods [13, 14]
▫ nonlinear dynamical systems theory and nonlinear time series analysis
▫ recurrence period density entropy
▫ detrended fluctuation analysis

Measures of assessing vocal dysphonia
• Measurements differ in robustness [14]
▫ uncontrolled variation in acoustic environment
▫ physical condition and characteristics of subject
• Therefore, chosen measurement methods should be as robust as possible to this variation
▫ Goal of the study: identify an optimal feature set that is both robust to uncontrolled variation and able to classify patients with Parkinson’s disease based on vocal dysphonic symptoms
▫ Additional advantage: possibility of monitoring patients remotely
http://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/parkinsons.data

Subjects & methods
• Subjects
▫ 31 individuals: 8 healthy, 23 with Parkinson’s disease (PD)
▫ average of six sustained vowel phonations recorded from each subject
▫ Total n=195
• Calculation of features via software programs
▫ traditional measures
▫ non-standard measures, including new measure proposed by authors: pitch period entropy

Variables
Grouping variable: status = 0 (healthy), = 1 (PD)

Attribute            Description
Vocal fundamental frequency
MDVP:Fo(Hz)          Average vocal fundamental frequency
MDVP:Fhi(Hz)         Maximum vocal fundamental frequency
MDVP:Flo(Hz)         Minimum vocal fundamental frequency
Measures of variation in fundamental frequency
MDVP:Jitter(%)       MDVP jitter as percentage
MDVP:Jitter(Abs)     MDVP absolute jitter in microseconds
MDVP:RAP             MDVP Relative Amplitude Perturbation
MDVP:PPQ             MDVP five-point Period Perturbation Quotient
Jitter:DDP           Average absolute difference of differences between cycles, divided by the average period
Measures of variation in amplitude
MDVP:Shimmer         MDVP local shimmer
MDVP:Shimmer(dB)     MDVP local shimmer in decibels
Shimmer:APQ3         3-pt Amplitude Perturbation Quotient
Shimmer:APQ5         5-pt Amplitude Perturbation Quotient
MDVP:APQ             MDVP 11-point Amplitude Perturbation Quotient
Shimmer:DDA          Avg. abs. diff. between consecutive differences between the amplitudes of consecutive periods
Measures of ratio of noise to tonal components in voice
NHR                  Noise-to-Harmonics Ratio
HNR                  Harmonics-to-Noise Ratio
Nonlinear dynamical complexity measures
RPDE                 Recurrence Period Density Entropy
D2                   Correlation dimension
Signal fractal scaling exponent
DFA                  Detrended Fluctuation Analysis
Nonlinear measures of fundamental frequency variation
spread1              Nonlinear measure of fundamental frequency variation
spread2              Nonlinear measure of fundamental frequency variation
PPE                  Pitch period entropy

MDVP = (Kay Pentax) Multi-Dimensional Voice Program

Statistical analyses
• EDA
• PCA
• MANOVA
• Hotelling’s T²
• QDA
• Classification tree (with random forest)

EDA
[Figures: five exploratory plots comparing groups; 0 = healthy, 1 = PD]

PCA
[Figure: PCA plot]

MANOVA
• H0: µ_healthy = µ_Parkinson’s

Hotelling’s T² test
• H0: µ_healthy = µ_Parkinson’s (p = 22)
• T² test statistic = 187.48
• df = 48 + 147 - 2 = 193
• critical value T²(0.05; 22, 193) ≈ 47 (extrapolated)
• Conclusion: reject H0 (α = 0.05)

QDA
library(MASS)  # provides qda()
park.qda.cv <- qda(park.g[, 2:23], park.g$group, CV = TRUE)
table(Actual = park.g$group, Classified = park.qda.cv$class)
      Classified
Actual   0   1
     0  35  13
     1   9 138
CV error rate: (13+9)/195 = 11.28%

Classification tree (CART)
table(Actual = park.g$group, Classified = park.cart.pred)
      Classified
Actual   0   1
     0  38  10
     1   6 141
Error rate: (10+6)/195 = 8.21%

Random forests
library(randomForest)
park.rf <- randomForest(group ~ ., data = park.g, importance = TRUE, proximity = TRUE)
park.rf
Call:
 randomForest(formula = group ~ ., data = park.g, importance = TRUE, proximity = TRUE)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 4
        OOB estimate of error rate: 9.23%
Confusion matrix:
    0   1 class.error
0  35  13      0.2708
1   5 142      0.0340
[Figure: cumulative error rates (overall, healthy, PD)]

Random forests
varImpPlot(park.rf)
[Figure: variable importance plot]

Conclusions
• Measurements of vocal dysphonia differ between healthy and PD individuals (MANOVA, Hotelling’s T²)
• QDA and classification tree able to separate healthy from PD individuals using a reduced “feature set”
▫ Error rates ~9-12%
▫ PPE, shimmer, average and high fundamental frequency measurements
• Little et al. (2009) concluded that PPE greatly improved classification performance; they cautioned against using traditional methods alone and suggested instead using novel methods such as PPE

REGIONAL DEMOGRAPHIC IMPACTS ON DRIVERS OF SOCIAL VULNERABILITY: A LOCAL VIEW OF NORWAY
Patrick Fitzsimons
Supported by National Science Foundation Office of Polar Programs Grant ARC-0909191

Special Thanks to:
• Committee: Harley Johansen Ph.D., Tim Frazier Ph.D., Stephen Lee Ph.D.
• Family and Friends

Outline
• Climate Change developments in the Arctic
• Migration Patterns relation to Social Vulnerability
• Methods and Results
▫ Creation of index for accessibility
▫ Examining recent migration patterns
▫ Multivariate analysis with social vulnerability drivers
• Discussion

[Map: counties]
[Map: municipalities, Barents & Non-Barents]

Research Questions
• Did the national government succeed in centralization and creating urban centers within northern Norway?
▫ Accessibility Index
• Have migration patterns changed with the changing development policy?
▫ Calculate municipal net migration over past 2 decades
• Is there a significant difference amongst driving variables of social vulnerability between northern and southern Norway?
▫ Multivariate analysis

Climate Change Effects in Arctic Europe
• Highest warming rate in last 3 decades was in Arctic.
• Barents Sea ice-free during past 5 years; sea ice declined at 12% per decade, faster than models predicted.
• Boreal forest moving poleward.
• Shrubs becoming trees.
• Tree line has risen 100 meters in altitude and moved >30 km poleward in some locations since 1909.
• Near-surface thaw of permafrost causing infrastructure problems and loss of palsa mires, causing release of methane and CO2.
• Reindeer and fish migrations changing.

Social Vulnerability
• “Social vulnerability occurs when unequal exposure to risk is coupled with unequal access to resources” (Morrow 2008).
• Groups potentially discriminated against are the socially, culturally, and economically marginalized (Mustafa 1998, Morrow 2008).
• Variables promoting social vulnerability include: poverty, minority status, gender, age, disability, human capital (Morrow 2008).

Why look at social vulnerability?
• Previous migration patterns can be detrimental
• Northern Norway still tied to natural resources
• Welfare state
▫ North receives subsidies and its debts are borne by the national government.
▫ OECD states Norway will have to change its social programs and transfer programs to maintain economic growth.

Methods and Results
• Indices of accessibility for the region
• Migration rates during the past 2 decades
• Multivariate analysis with social vulnerability variables

Accessibility Index
• [symbol missing] = cities with population of 20,000 or greater (Magnet Cities)
• [symbol missing] = distance from municipal centroid to city
• To be included, a city had to be within 200 km of the centroid.
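The inclusion rule for magnet cities can be written compactly. This is only a sketch, and the symbols are assumptions, since the slide's own notation is not available: let $P_j$ be the population of city $j$ and $d_{ij}$ the distance from the centroid of municipality $i$ to city $j$. The cities that count toward municipality $i$ are then

```latex
\[
  M_i \;=\; \bigl\{\, j \;:\; P_j \ge 20{,}000 \ \text{ and } \ d_{ij} \le 200\ \text{km} \,\bigr\}
\]
```

How the index combines the populations and distances of the cities in $M_i$ is not stated here, so no closed form is given for the index itself.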
[Map: magnet cities (cities > 20,000 people)]
[Map: accessibility index; classes <3, 3.1 - 8.6, 8.7 - 18.6, >18.6]
[Map: net migration (%), 1990 - 2009]
[Map: population change, aged 16-29, from 1990 to 2010; classes -29.3 to -16.0, -16.0 to -11.7, -11.7 to 0.0, 0.1 to 7.9, 7.9 to 28.2]

Pressures from Outmigration
• Loss of potential labor force
• Loss of human capital
• Makes region less attractive for potential employers
• Less of a tax base
• Loss of potential progeny

Variables used in multivariate analysis
- Percent Net-migration (1990-2009)
- Percent Household Income < 150,000 NOK ≈ $22,000 USD (2010)
- Percent Household Income > 500,000 NOK ≈ $85,000 USD (2010)
- Percent elderly (Old age dependency) (2010)
- Percent employed in primary industries, i.e. mining, fishing, farming (2010)
- Percent Labor Force participation (2010)
- Percent unemployed (2010)
- Percent paid for Social Assistance (2009)
- Percent over age 25 with only completed primary education (2010)
- Percent over age 25 with secondary education attainment (2010)
- Percent over age 25 with attainment beyond secondary w/o completion of tertiary (2010)
- Percent over age 25 with attainment of tertiary education (2010)
- Percent Voter turnout (2008)
- Percent Municipal Net Loan to Gross Revenue (2010)
- Percent Municipal Net loan debt per capita (2010)
- Percent Municipal Long-term debt to Revenue (2010)

Barents vs. Non-Barents
[Map: Barents (N=88) and Non-Barents (N=342) municipalities; N=430]

           Df  Hotelling-Lawley  approx F  num Df  den Df     Pr(>F)
barentsF    1            1.5733    43.424      15     414  < 2.2e-16 ***
Residuals 428

Component / Std. dev. (square root of eigenvalue) / % of total variance / Variables and (component loadings)
1. Age, Income, School, Migration and Labor Force / 2.324 / 33.75%
   Percent Elderly (0.730), Income < 150,000 (0.762), Percent Primary Sector (0.651), Percent Tertiary School 1 (-0.738), Percent Tertiary School 2 (-0.641), Net Migration (-0.695), Labor Force Part. (-0.612), Income > 500,000 (-0.814)
2. Social Welfare / 1.692 / 17.90%
   Percent Unemploy. (0.599), Upper Secondary Ed. (-0.576), Social Assistance (0.543), Labor Force Part. (-0.527)
3. Debt / 1.305 / 10.65%
   Net Loan to Gross Rev. (0.617), Long Term Debt (0.637), Net Loan Debt/capita (0.709)
4. Education / 1.115 / 7.77%
   Tertiary Ed 1 (-0.543)

[Figures: plots of municipalities on the first 3 principal components; Barents vs. Non-Barents]
[Map: standard deviations on first principal component; classes < -2.5, -2.5 to -1.5, -1.5 to -0.50, -0.50 to 0.50, 0.50 to 1.5, > 1.5 Std. Dev.]

QDA Analysis
• Quadratic Discriminant Analysis on the same 16 variables.
• Results illustrate a discernible difference between North and South.
• 1 = Non-Barents, 2 = Barents
• Correct classification rate of 94.65% (cross-validated)

Discussion
• Distinction between North and South
▫ Urbanization
▫ Migration and social vulnerability
▫ Life biography (20-somethings)
• Caveats
▫ Missing variables (ethnic minority)
▫ Indigenous group: Sami
• Further research
▫ Community-level analysis

Photo by Hildegun Johnsen
Questions, Feedback?
Thank you

Determining the Geographic Origin of Potatoes with Trace Metal Analysis Using Statistical and Neural Network Classifiers
The objective of this research was to develop a method to confirm the geographical authenticity of Idaho-labeled potatoes as Idaho-grown potatoes.
• Elemental analysis (K, Mg, Ca, Sr, Ba, V, Cr, Mn, Fe, Co, Ni, Cu, Zn, Mo, S, Cd, Pb, and P)
• PCA, CDA, discriminant function analysis, k-nearest neighbors, and neural network