Data Mining Methods: Applications, Problems and Opportunities in the Public Sector

Disease and Adverse Event Reporting, Surveillance, and Analysis
DIMACS, October 16 – 18, 2002
Rutgers University, Piscataway, NJ

John Stultz, MPH
SAS
October 18, 2002

Copyright © 2001, SAS Institute Inc. All rights reserved.

Outline
Data Mining Methods Used in Surveillance
• Classification & Prediction
• Association
• Clustering
• Link Analysis
Applications
Problems
Opportunities

What Is Data Mining?
SAS Institute defines data mining as the process of selecting, exploring, and modeling large amounts of data to uncover previously unknown patterns of data for an information advantage.

What Is Data Mining?
“The nontrivial extraction of implicit, previously unknown, and potentially useful information from data. It involves statistical and visualization techniques to discover and present knowledge in a form that may be easily comprehended.”

SAS Enterprise Miner

Classification and Prediction
Classification and Regression Trees
Logistic Regression
Neural Networks…

Classification and Prediction
[Diagram: data partitioned into Training, Validation, and Test sets; labels: Comparison, Selection, Tuning, Final Assessment]

Classification and Prediction: Principal Components/Dmneural Network
The Princomp/Dmneural node enables users to fit an additive nonlinear model that uses bucketed principal components as inputs to predict a binary or an interval target variable. The node can also perform a principal components analysis, and then pass the scored principal components to successor nodes for further analysis.
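To make "bucketed principal components" concrete, here is a minimal pure-Python sketch. It is not the Princomp/Dmneural algorithm itself: it only finds the first principal component by power iteration, scores each record on it, and places the scores into equal-width buckets. The function names, the toy data, and the choice of three buckets are all assumptions made for illustration.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (column means, unit direction) of the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: multiply by the covariance matrix and renormalize.
    # Assumes the data has nonzero variance and that the all-ones starting
    # vector is not orthogonal to the leading component.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return means, v


def score(row, means, direction):
    """Project one centered record onto the component direction."""
    return sum((x - m) * u for x, m, u in zip(row, means, direction))


def bucket(value, lo, hi, n_buckets):
    """Equal-width bucket index in the range 0 .. n_buckets - 1."""
    if hi == lo:
        return 0
    index = int((value - lo) / (hi - lo) * n_buckets)
    return min(index, n_buckets - 1)


# Hypothetical two-variable data set, for illustration only.
data = [(1, 2), (2, 1), (3, 3), (4, 5), (5, 4)]
means, direction = first_principal_component(data)
scores = [score(row, means, direction) for row in data]
buckets = [bucket(s, min(scores), max(scores), 3) for s in scores]
print("direction:", direction)
print("buckets:", buckets)
```

The bucket indices, rather than the raw scores, would then serve as inputs to a downstream model; how the real node chooses bucket boundaries is not stated in the slides.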
Classification and Prediction: User Defined Model
You can use the User Defined Model node to import and assess models that were not created with one of the Enterprise Miner modeling nodes.

Classification and Prediction: Ensemble Models
The Ensemble node enables users to combine the results from multiple models to create a single, integrated model for their data. This node performs:
• stratified modeling
• bagging
• boosting
• combined modeling

Classification and Prediction: Stratified Models
When you have a stratification variable (for example, a group variable such as GENDER or REGION) defined in a Group Processing node, the modeling node creates a separate model for each level of the stratification variable.

Classification and Prediction: Bagging and Boosting
Bagging and boosting models are created by resampling the training data and fitting a separate model for each sample. The predicted values (for interval targets) or the posterior probabilities (for a class target) are then averaged to form the ensemble model.

Classification and Prediction: Combined Models

Classification and Prediction: Two Stage Model

Classification and Prediction: Memory Based Reasoning
Uses a k-nearest neighbor approach to categorize or predict observations. Search algorithms include scan and Reduced Dimensionality Tree.

Association
Association Discovery
• “If item A is part of an event, then item B is also part of the event X percent of the time.”
Sequence Discovery
• “If item A is part of an event, then item B occurs after event A occurs.”
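The "X percent of the time" in the association rule above is usually called the rule's confidence. The sketch below computes it, along with support, for a single rule A → B. The function name and the toy symptom events are hypothetical and are not taken from any of the surveillance systems in this talk.

```python
def rule_stats(events, item_a, item_b):
    """Return (support, confidence) for the rule item_a -> item_b.

    support    = share of all events that contain both items
    confidence = share of events containing item_a that also contain item_b
    """
    with_a = [event for event in events if item_a in event]
    with_both = [event for event in with_a if item_b in event]
    support = len(with_both) / len(events)
    confidence = len(with_both) / len(with_a) if with_a else 0.0
    return support, confidence


# Hypothetical events: each set holds the items recorded in one report.
events = [
    {"fever", "cough"},
    {"fever", "cough", "rash"},
    {"fever"},
    {"cough"},
]
support, confidence = rule_stats(events, "fever", "cough")
print(f"support={support:.2f} confidence={confidence:.2f}")
```

Sequence discovery would additionally need the order or time of each item within an event, which this sketch does not model.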
Clustering
Clustering places objects into groups or clusters suggested by the data. Methods perform disjoint cluster analysis on the basis of Euclidean distances computed from one or more quantitative variables and seeds that are generated and updated by the algorithm.

Self Organizing Maps / Kohonen Vector Quantization
Kohonen Vector Quantization is a clustering method, whereas Self Organizing Maps (SOMs) are primarily dimension-reduction methods. As with Clustering, after the network maps have been created, the characteristics of the clusters can be profiled graphically and cluster IDs can be assigned to the data.

Link Analysis

Applications
National Database for clinical data centralized from 42 out of 49 hospitals with web access
• Indian Health Service

Applications
Real-Time Emergency Medical Services Surveillance
• Health and Human Services, San Diego County

Applications
Aberration detection methods during short-term syndrome-based surveillance
• CDC, California/Florida Departments of Health

Applications
Trends in Syndromic Surveillance data for Washington DC
• District of Columbia Department of Health

Applications
Ambulance dispatch and ER data sent via FTP to health department database
• New York City Health Department

Problems
Considerations for a Surveillance System
• What are the objectives/purpose?
• What are the data sources?
• What information needs to be gathered?
• Who are the data providers?
• How is the data to be collected?
• How often?
• Voluntary or mandatory?
• Who will collect data?
• How should the data be processed, maintained, and analyzed?
• How will the data reach those who need to know so that decisions and actions can be taken?

Opportunities
Data Format: XML…
Text Mining
Modeling Format: Predictive Modeling Markup Language (PMML)
Score Code: C Code
Software: Java Based

Thank You!