Data Mining Methods: Applications, Problems and Opportunities in the Public Sector

Disease and Adverse Event Reporting, Surveillance, and Analysis
DIMACS, October 16 – 18, 2002
Rutgers University, Piscataway, NJ
John Stultz, MPH
SAS
October 18, 2002
Copyright © 2001, SAS Institute Inc. All rights reserved.
Outline

Data Mining Methods Used in Surveillance
• Classification & Prediction
• Association
• Clustering
• Link Analysis
Applications
Problems
Opportunities

What Is Data Mining?
SAS Institute defines data mining as
the process of selecting, exploring, and
modeling large amounts of data to
uncover previously unknown patterns
of data for an information advantage.
What Is Data Mining?
“The nontrivial extraction of implicit,
previously unknown, and potentially
useful information from data. It
involves statistical and visualization
techniques to discover and present
knowledge in a form that may be
easily comprehended.”
SAS Enterprise Miner
Classification and Prediction

Classification and
Regression Trees

Logistic Regression

Neural Networks…
Classification and Prediction
Data partitions:
• Training
• Validation: comparison, selection, tuning
• Test: final assessment
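As a rough illustration of the three-way partition on this slide, the sketch below shuffles a set of records and splits it into training, validation, and test lists. The function name `partition`, the 60/20/20 proportions, and the fixed seed are assumptions chosen for the example; they are not taken from the presentation or from Enterprise Miner.

```python
import random

def partition(records, train=0.6, valid=0.2, seed=0):
    """Shuffle records and split them into training, validation, and test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split repeatable
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * valid)
    return (shuffled[:n_train],                    # used to fit models
            shuffled[n_train:n_train + n_valid],   # used to compare, select, and tune
            shuffled[n_train + n_valid:])          # held back for final assessment

# Hypothetical data: 100 record IDs.
training, validation, test = partition(range(100))
print(len(training), len(validation), len(test))
```

Keeping the test records out of every fitting and tuning step is what makes the final assessment an honest estimate.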
Classification and Prediction

Principal
Components/
Dmneural Network
The Princomp/Dmneural node enables users to fit an additive
nonlinear model that uses bucketed principal components as
inputs to predict a binary or an interval target variable. The
node can also perform a principal components analysis, and
then pass the scored principal components to successor
nodes for further analysis.
Classification and Prediction

User Defined Model
You can use the User Defined Model node to import and
assess models that were not created with one of the
Enterprise Miner modeling nodes.
Classification and Prediction

Ensemble Models
The Ensemble node enables users to combine the results from
multiple models to create a single, integrated model for their
data. This node performs:
• stratified modeling
• bagging
• boosting
• combined modeling
Classification and Prediction

Stratified Models
When you have a stratification variable (for example, a group
variable such as GENDER or REGION) defined in a Group
Processing node, the modeling node creates a separate model
for each level of the stratification variable.
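The idea of one model per level of a stratification variable can be sketched as follows. This is a minimal illustration, not the Group Processing node itself: the function `fit_stratified`, the `visits` field, and the sample rows are hypothetical, and each "model" is simply the mean of the target within its level.

```python
from collections import defaultdict
from statistics import mean

def fit_stratified(rows, group_key, target_key):
    """Fit one model per level of the group variable (here: the level's mean target)."""
    by_level = defaultdict(list)
    for row in rows:
        by_level[row[group_key]].append(row[target_key])
    # One fitted "model" per level; a real learner would replace mean().
    return {level: mean(values) for level, values in by_level.items()}

# Hypothetical records with REGION as the stratification variable.
rows = [
    {"REGION": "East", "visits": 10},
    {"REGION": "East", "visits": 14},
    {"REGION": "West", "visits": 3},
    {"REGION": "West", "visits": 5},
]
models = fit_stratified(rows, "REGION", "visits")
print(models)
```

Scoring a new record then means looking up the model for that record's level and applying it.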
Classification and Prediction

Bagging and Boosting
Bagging and boosting models are created by resampling the
training data and fitting a separate model for each sample. The
predicted values (for interval targets) or the posterior
probabilities (for a class target) are then averaged to form the
ensemble model.
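The resample, fit, and average steps described above can be sketched for an interval target as follows. This shows plain bagging only (boosting reweights the resampling and is not covered here). The function `bagged_prediction`, the number of models, and the sample values are assumptions for illustration, and the per-sample "model" is just the sample mean standing in for a real learner.

```python
import random
from statistics import mean

def bagged_prediction(targets, n_models=25, seed=0):
    """Average the predictions of models fit to bootstrap resamples of the targets."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_models):
        sample = [rng.choice(targets) for _ in targets]  # resample with replacement
        predictions.append(mean(sample))                 # fit one model per sample
    return mean(predictions)                             # average into the ensemble

# Hypothetical interval target values from a training set.
targets = [2.0, 4.0, 6.0, 8.0]
ensemble = bagged_prediction(targets)
print(ensemble)
```

For a class target, the same loop would average posterior probabilities instead of predicted values.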
Classification and Prediction

Combined Models
Classification and Prediction

Two Stage Model
Classification and Prediction

Memory Based Reasoning
Uses a k-nearest-neighbor approach to classify or predict
observations. Search methods include scan and Reduced
Dimensionality Tree.
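A minimal sketch of k-nearest-neighbor classification with a full scan of the training data is shown below. The function `knn_classify`, the choice of k = 3, Euclidean distance, and the labeled points are assumptions for illustration; the tree-based search mentioned above is not shown.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify query by majority vote among its k nearest training points.

    train is a list of (features, label) pairs; every point is scanned.
    """
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labeled observations with two numeric inputs.
train = [((1.0, 1.0), "low"), ((1.2, 0.8), "low"), ((0.9, 1.1), "low"),
         ((5.0, 5.0), "high"), ((5.2, 4.9), "high"), ((4.8, 5.1), "high")]
label = knn_classify(train, (1.1, 1.0))
print(label)
```

The scan costs one distance computation per training observation for every query, which is why indexed search structures matter on large data.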
Association
Association Discovery
• “If item A is part of an event, then item B is also
part of the event X percent of the time.”
Sequence Discovery
• “If item A is part of an event, then item B occurs
after item A occurs.”
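The "X percent of the time" in the association rule is usually called the rule's confidence. The sketch below computes it for a small set of events. The function `rule_confidence` and the symptom items are hypothetical, chosen only to make the arithmetic visible.

```python
def rule_confidence(events, a, b):
    """Return the share of events containing item a that also contain item b."""
    with_a = [event for event in events if a in event]
    if not with_a:
        return 0.0
    return sum(1 for event in with_a if b in event) / len(with_a)

# Hypothetical events, each a set of reported items.
events = [
    {"fever", "cough"},
    {"fever", "cough", "rash"},
    {"fever"},
    {"cough"},
]
confidence = rule_confidence(events, "fever", "cough")
print(f"{confidence:.0%}")
```

Here three events contain "fever" and two of those also contain "cough", so the rule holds in two thirds of the relevant events. Sequence discovery would additionally require an ordering (for example, a timestamp) on the items.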
Clustering
Clustering places objects into groups or clusters
suggested by the data.
Methods perform disjoint cluster analysis on the
basis of Euclidean distances computed from one
or more quantitative variables and seeds that are
generated and updated by the algorithm.
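Disjoint clustering on Euclidean distances with updated seeds can be sketched as a basic k-means loop. This is an illustration under simplifying assumptions, not the Clustering node's algorithm: the function `k_means` is hypothetical, the initial seeds are supplied by the caller rather than generated, and the iteration count is fixed.

```python
import math

def k_means(points, seeds, iterations=10):
    """Assign points to the nearest seed, then move each seed to its cluster mean."""
    seeds = [tuple(s) for s in seeds]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in seeds]
        for p in points:
            nearest = min(range(len(seeds)), key=lambda i: math.dist(p, seeds[i]))
            clusters[nearest].append(p)   # each point joins exactly one cluster
        # Update each seed to the mean of its members; keep it if the cluster is empty.
        seeds = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else seed
            for seed, members in zip(seeds, clusters)
        ]
    return seeds, clusters

# Hypothetical observations with two quantitative variables.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
seeds, clusters = k_means(points, seeds=[(0.0, 0.0), (10.0, 10.0)])
print(seeds)
```

Because distances are Euclidean, variables on very different scales are usually standardized before clustering.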
Self Organizing Maps
Kohonen Vector Quantization
Kohonen Vector Quantization is a clustering
method, whereas Self Organizing Maps (SOMs)
are primarily dimension-reduction methods.
As with Clustering, after the network maps have
been created, the characteristics of the clusters
can be profiled graphically and cluster IDs can be
assigned to the data.
Link Analysis
Applications

National Database for clinical data centralized from
42 out of 49 hospitals with web access
• Indian Health Service
Applications

Real-Time Emergency Medical Services Surveillance
• Health and Human Services, San Diego County
Applications

Aberration detection methods during short-term
syndrome-based surveillance
• CDC, California/Florida Departments of Health
Applications

Trends in Syndromic Surveillance data for
Washington DC
• District of Columbia Department of Health
Applications

Ambulance dispatch and ER data sent via FTP to
health department database.
• New York City Health Department
Problems

Considerations for a Surveillance System
• What are the objectives/purpose?
• What are the data sources?
• What information needs to be gathered?
• Who are the data providers?
• How is the data to be collected?
• How often?
• Voluntary or mandatory?
• Who will collect data?
• How should the data be processed, maintained and
analyzed?
• How will the data reach those who need to know so that
decisions and actions can be taken?
Opportunities
Data Format: XML…
Text Mining
Modeling Format: Predictive Model Markup
Language (PMML)
Score Code: C Code
Software: Java Based

Thank You!