DMBMI_chap1_review_basics

Machine Learning in BioMedical Informatics
SCE 5095: Special Topics Course
Instructor: Jinbo Bi
Computer Science and Engineering Dept.
1
Course Information

Instructor: Dr. Jinbo Bi
– Office: ITEB 233
– Phone: 860-486-1458
– Email: jinbo@engr.uconn.edu

– Web: http://www.engr.uconn.edu/~jinbo/
– Time: Mon / Wed. 2:00pm – 3:15pm
– Location: CAST 204
– Office hours: Mon. 3:30-4:30pm
HuskyCT
– http://learn.uconn.edu
– Log in with your NetID and password
– Illustration
2
Introduction of the instructor



Ph.D in Mathematics
Previous work experience:
– Siemens Medical Solutions Inc.
– Department of Defense, Bioanalysis
– Massachusetts General Hospital
Research interests:
– Cancer, psychiatric disorders, subtyping, GWAS, …
– http://labhealthinfo.uconn.edu/EasyBreathing
3
Course Information




Prerequisite: basics of linear algebra, calculus, and programming
Course textbooks (not required):
– Introduction to Data Mining (2005) by Pang-Ning Tan, Michael Steinbach, Vipin Kumar
– Pattern Recognition and Machine Learning (2006) by Christopher M. Bishop
– Pattern Classification (2nd edition, 2000) by Richard O. Duda, Peter E. Hart and David G. Stork
Additional class notes and copied materials will be given
Links to reading material will be provided
4
Course Information


Objectives:
– Introduce students to the basic concepts of machine learning and the state-of-the-art literature in data mining / machine learning
– Get to know some general topics in medical informatics
– Focus on some in-demand medical informatics problems, with hands-on experience applying data mining techniques
Format:
– Lectures, labs, paper reviews, a term project
5
Survey
Why are you taking this course?
What would you like to gain from this course?
What topics are you most interested in learning about from this course?
Any other suggestions?

(Please respond before NEXT THUR. You can also log in to HuskyCT, download the MS Word file, fill it in, and email it to me.)
6
Grading




In-Class Lab Assignments (3): 30%
Paper review (1): 10%
Term Project (1): 50%
Participation (1): 10%
7
Policy



Computers
Assignments must be submitted electronically via HuskyCT
Make-up policy
– If a lab assignment or a paper review assignment is missed, there will be a final take-home exam to make up
– If two of these assignments are missed, an additional lab assignment and a final take-home exam will be used to make up
8
Three In-class Lab Assignments
At a class where an in-class lab assignment is given, the class meeting will take place in a computer lab, and there will be no lecture
The computer lab will be at ITEB 138 (reserved by the TA)
The assignment is due at the beginning of the class one week after the assignment is given
If the assignment is handed in one or two days late, 10 points will be deducted for each day it is late
Assignments will be graded by our teaching assistant
9
Paper review
Topics of papers for review will be discussed
Each student selects 1 paper in each assignment, prepares slides, and presents the paper in 8–15 minutes in class
The goal is to look at the state-of-the-art research work in the related field
The paper review assignment is on topics of state-of-the-art data mining techniques
10
Term Project
Possible project topics will be provided as links; students are encouraged to propose their own
Teams of 1–2 students can be created
Each team needs to give a presentation in the last 1–2 weeks of the class (10–15 min)
Each team needs to submit a project report
– Definition of the problem
– Data mining approaches used to solve the problem
– Computational results
– Conclusion (success or failure)
11
Final Exam
If you need a make-up final exam, the exam will be provided on May 1st (Wed.)
Take-home exam
Due on May 9th (Thur.)
12
Three In-class Lab Assignments

BioMedical Informatics Topics (there are many; this course focuses on three)
– Cardiac ultrasound image categorization
– Computerized decision support for trauma patient care
– Computer-assisted diagnostic coding
13
Cardiac ultrasound view separation
14
Cardiac ultrasound view separation
Classification (or clustering) of views:
– Apical 4 chamber view
– Parasternal long axis view
– Parasternal short axis view
15
Trauma Patient Care

25 min of transport time / patient
High-frequency vital-sign waveforms (3 waveforms)
– ECG, SpO2, Respiratory
Low-frequency vital-sign time series (9 variables)
– Derived variables: ECG heart rate, SpO2 heart rate, SaO2 (arterial O2 saturation), respiratory rate
– Measured variables: NIBP (systolic, diastolic, MAP), NIBP heart rate, end-tidal CO2
Discrete patient attribute data (100 variables)
– Demographics, injury description, prehospital interventions, etc.
Vital signs used in decision-support algorithms (Propaq monitor): HR, RR, SaO2, SBP, DBP
16
Trauma Patient Care
17
Trauma Patient Care
[Diagram: heart rate, respiratory rate, oxygen saturation, and blood pressure are used to make a prediction of major bleeding]
18
Diagnostic coding
[Diagram: a hospital document DB holds patient notes; looking up ICD-9 codes maps each note to entries in a diagnostic code DB (e.g., 428 heart failure, 250 diabetes, 414, 429, AMI); the assigned codes feed insurance reimbursement and statistics (SCIP). Source: Siemens.]
19
Diagnostic coding
[Same diagram as the previous slide, zoomed in on an example patient note (a de-identified chest radiology report):]

ADM DIAGNOSIS: BRADYCARDIA ANEMIA CHF
ORD #: XXXXXXX DX XXXXXXX 14:10
PROCEDURE: CHEST - PA & LATERAL ACCXXXXXX
REPORT: CLINICAL HISTORY: CHEST PAIN. CHF. THERE ARE NO PRIOR STUDIES AVAILABLE FOR COMPARISON.
AP ERECT AND LATERAL VIEWS OF THE CHEST WERE OBTAINED. THE TRACHEA IS NORMAL IN POSITION. HEART IS MODERATELY ENLARGED. HEMIDIAPHRAGMS ARE SMOOTH. THERE ARE SMALL BILATERAL PLEURAL EFFUSIONS. THERE IS ENGORGEMENT OF THE PULMONARY VASCULARITY.
IMPRESSION:
1. CONGESTIVE HEART FAILURE WITH CARDIOMEGALY AND SMALL BILATERAL PLEURAL EFFUSIONS.
2. INCREASING OPACITY AT THE LEFT LUNG BASE LIKELY REPRESENTING PASSIVE ATELECTASIS.
20
Diagnostic coding
[Same diagram again, zoomed in on a longer example note (excerpts from a de-identified history and physical):]

FAMILY HISTORY: IS NONCONTRIBUTORY IN A PATIENT OF THIS AGE GROUP.
SOCIAL HISTORY: SHE IS DIVORCED. THE PATIENT CURRENTLY LIVES AT BERKS HEIM. SHE IS ACCOMPANIED TODAY ON THIS VISIT BY HER DAUGHTER. SHE DOES NOT SMOKE OR ABUSE ALCOHOLIC BEVERAGES.
PHYSICAL EXAMINATION: GENERAL: THIS IS AN ELDERLY, VERY PALE-APPEARING FEMALE WHO IS SITTING IN A WHEELCHAIR AND WAS EXAMINED IN HER WHEELCHAIR.
HEENT: SHE IS WEARING GLASSES. SITTING UPRIGHT IN A WHEELCHAIR. NECK: NECK VEINS WERE NONDISTENDED. I COULD NOT HEAR A LOUD CAROTID BRUIT. LUNGS: HAVE DIMINISHED BREATH SOUNDS AT THE BASES WITH NO LOUD WHEEZES, RALES OR RHONCHI. HEART: HEART TONES WERE BRADYCARDIC, REGULAR AND RATHER DISTANT WITH A SYSTOLIC MURMUR HEARD AT THE LEFT LOWER STERNAL BORDER. I COULD NOT HEAR A LOUD GALLOP RHYTHM WITH HER SITTING UPRIGHT OR A LOUD DIASTOLIC MURMUR. ABDOMEN: WAS SOFT AND NONTENDER. EXTREMITIES: ARE REMARKABLE FOR THE FACT THAT SHE HAS A BRACE ON HER LEFT LOWER EXTREMITY. THERE DID NOT APPEAR TO BE SIGNIFICANT PERIPHERAL EDEMA. NEUROLOGIC: SHE CLEARLY HAD RESIDUAL HEMIPARESIS FROM HER PREVIOUS STROKE, BUT SHE WAS AWAKE AND ALERT AND ANSWERING QUESTIONS APPROPRIATELY.
21
Machine Learning / Data Mining
Data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information
The ultimate goal of machine learning is the creation and understanding of machine intelligence
The main goal of statistical learning theory is to provide a framework for studying the problem of inference, that is, of gaining knowledge, making predictions, and making decisions from a set of data
22
Traditional Topics in Data Mining /AI
Fuzzy sets and fuzzy logic
– Fuzzy if-then rules
Evolutionary computation
– Genetic algorithms
– Evolutionary strategies
Artificial neural networks
– Back-propagation network (supervised learning)
– Self-organizing network (unsupervised learning; will not be covered)
23
Next Class
Continue with data mining topics
Review of some basics of linear algebra and probability
24
Last Class
Described the syllabus of this course
Talked about the HuskyCT website (illustration)
Briefly introduced 3 medical informatics topics
– Medical images: cardiac echo view recognition
– Numerical data: trauma patient care
– Free text: ICD-9 diagnostic coding
Introduced the definitions of data mining, machine learning, and statistical learning theory
25
Challenges in traditional techniques
Traditional techniques lack theoretical analysis of the behavior of the algorithms
Traditional techniques may be unsuitable due to
– Enormity of data
– High dimensionality of data
– Heterogeneous, distributed nature of data
[Diagram: overlapping fields: Statistics, Machine Learning / AI, Pattern Recognition, Soft Computing]
26
Recent Topics in Data Mining

Supervised learning such as classification and regression
– Support vector machines
– Regularized least squares
– Fisher discriminant analysis (LDA)
– Graphical models (Bayesian nets)
– Others
These topics draw from machine learning domains
27
Recent Topics in Data Mining
Unsupervised learning such as clustering
– K-means
– Gaussian mixture models
– Hierarchical clustering
– Graph-based clustering (spectral clustering)
Dimension reduction
– Feature selection
– Compact the feature space into a low-dimensional space (principal component analysis)
28
Statistical Behavior
Many perspectives to analyze how an algorithm handles uncertainty
Simple examples:
– Consistency analysis
– Learning bounds (an upper bound on the test error of the constructed model or solution)
"Statistical" not "deterministic"
– With probability p, the upper bound holds
29
Tasks in Data Mining
Prediction tasks (supervised problems)
– Use some variables to predict unknown or future values of other variables
Description tasks (unsupervised problems)
– Find human-interpretable patterns that describe the data
From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996
30
Problems in Data Mining
Inference
Classification [Predictive]
Regression [Predictive]
Clustering [Descriptive]
Deviation Detection [Predictive]
31
Classification: Definition

Given a collection of examples (training set)
– Each example contains a set of attributes; one of the attributes is the class
Find a model for the class attribute as a function of the values of the other attributes
Goal: previously unseen examples should be assigned a class as accurately as possible
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it
32
Classification Example
Training set:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test set:

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Training Set → Learn Classifier → Model → apply to Test Set
33
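As a toy illustration of the table above, the sketch below labels each test row with the class of its nearest training row. The distance function (categorical mismatch penalties plus a scaled income gap) is our own assumption, not something the slides prescribe.

```python
# Hypothetical 1-nearest-neighbor sketch for the Tid/Refund/Marital/Income
# table above; the distance function is an illustrative assumption.
train = [
    ("Yes", "Single",   125, "No"),
    ("No",  "Married",  100, "No"),
    ("No",  "Single",    70, "No"),
    ("Yes", "Married",  120, "No"),
    ("No",  "Divorced",  95, "Yes"),
    ("No",  "Married",   60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No",  "Single",    85, "Yes"),
    ("No",  "Married",   75, "No"),
    ("No",  "Single",    90, "Yes"),
]
test = [
    ("No", "Single", 75), ("Yes", "Married", 50), ("No", "Married", 150),
    ("Yes", "Divorced", 90), ("No", "Single", 40), ("No", "Married", 80),
]

def distance(a, b):
    # One point per categorical mismatch, plus the income gap scaled to
    # a comparable range (assumed weighting).
    d = (a[0] != b[0]) + (a[1] != b[1])
    return d + abs(a[2] - b[2]) / 100.0

def predict(x):
    nearest = min(train, key=lambda row: distance(x, row))
    return nearest[3]  # class label of the closest training example

labels = [predict(x) for x in test]
print(labels)
```

With this particular distance, the six "?" rows come out as mostly "No", with the Divorced/90K row labeled "Yes" because its nearest neighbor is training row 5.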
Classification: Application 1

High-Risk Patient Detection
– Goal: Predict if a patient will suffer a major complication after a surgical procedure
– Approach:
  Use the patient's vital signs before and after the surgical operation (heart rate, respiratory rate, etc.)
  Have expert medical professionals monitor patients to label which patients had a complication and which did not
  Learn a model for the class of after-surgery risk
  Use this model to detect potential high-risk patients for a particular surgical procedure
34
Classification: Application 2

Face recognition
– Goal: Predict the identity of a face image
– Approach:
  Align all images to derive the features
  Model the class (identity) based on these features
35
Classification: Application 3

Cancer Detection
– Goal: To predict the class (cancer or normal) of a sample (person), based on microarray gene expression data
– Approach:
  Use the expression levels of all genes as the features
  Label each example as cancer or normal
  Learn a model for the class of all samples
36
Classification: Application 4

Alzheimer's Disease Detection
– Goal: To predict the class (AD or normal) of a sample (person), based on neuroimaging data such as MRI and PET
– Approach:
  Extract features from neuroimages
  Label each example as AD or normal
  Learn a model for the class of all samples
[Figure: reduced gray matter volume (colored areas) detected by MRI voxel-based morphometry in AD patients compared to normal healthy controls]
37
Regression



Predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency
Greatly studied in statistics and the neural network field
Examples:
– Predicting sales amounts of a new product based on advertising expenditure
– Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
– Time series prediction of stock market indices
38
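For the one-input linear case, the closed-form least-squares fit can be written in a few lines. The sketch below is a minimal illustration; the data values are made up (think of x as advertising expenditure and y as sales).

```python
# Minimal simple linear regression via the closed-form least-squares solution.
# Data values are illustrative assumptions, not from the slides.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # e.g., advertising expenditure
ys = [3.1, 4.9, 7.2, 9.0, 10.8]  # e.g., sales amounts

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope and intercept minimizing the sum of squared residuals.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

def predict(x):
    return intercept + slope * x

print(round(slope, 3), round(intercept, 3))  # 1.95 1.15
```

The same normal-equations idea generalizes to several input variables, which is where the regularized least squares mentioned earlier comes in.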
Classification algorithms
K-Nearest-Neighbor classifiers
Naïve Bayes classifier
Neural Networks
Linear Discriminant Analysis (LDA)
Support Vector Machines (SVM)
Decision Trees
Logistic Regression
Graphical models
39
Clustering Definition


Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
– Data points in one cluster are more similar to one another
– Data points in separate clusters are less similar to one another
Similarity measures:
– Euclidean distance if attributes are continuous
– Other problem-specific measures
40
Illustrating Clustering
Euclidean Distance Based Clustering in 3-D space.
Intracluster distances are minimized
Intercluster distances are maximized
41
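A minimal k-means sketch makes the minimize-intracluster / maximize-intercluster idea concrete: assign each point to its nearest center, recompute each center as the mean of its cluster, and repeat until the assignments stop changing. The 2-D points and the initial centers below are illustrative assumptions.

```python
# Minimal k-means in 2-D (pure Python); data and initialization are made up.
points = [(1.0, 1.0), (1.5, 2.0), (1.2, 0.8),
          (8.0, 8.0), (8.5, 8.2), (7.8, 9.0)]
centers = [(0.0, 0.0), (10.0, 10.0)]  # assumed initial centers

def closest(p, cs):
    # Index of the nearest center by squared Euclidean distance.
    return min(range(len(cs)),
               key=lambda i: (p[0] - cs[i][0]) ** 2 + (p[1] - cs[i][1]) ** 2)

assign = None
while True:
    new_assign = [closest(p, centers) for p in points]
    if new_assign == assign:          # converged: assignments unchanged
        break
    assign = new_assign
    for i in range(len(centers)):     # recompute each center as cluster mean
        members = [p for p, a in zip(points, assign) if a == i]
        if members:                   # keep the old center if a cluster empties
            centers[i] = (sum(p[0] for p in members) / len(members),
                          sum(p[1] for p in members) / len(members))

print(assign)  # [0, 0, 0, 1, 1, 1]
```

In practice the result depends on the initialization; running several random restarts and keeping the lowest-cost solution is the usual remedy.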
Clustering: Application 1

High-Risk Patient Detection
– Goal: Predict if a patient will suffer a major complication after a surgical procedure
– Approach:
  Use the patient's vital signs before and after the surgical operation (heart rate, respiratory rate, etc.)
  Find patients whose symptoms are dissimilar from those of most other patients
42
Clustering: Application 2

Document Clustering:
– Goal: To find groups of documents that are similar to each other based on the important terms appearing in them
– Approach: Identify frequently occurring terms in each document; form a similarity measure based on the frequencies of different terms; use it to cluster
– Gain: Information retrieval can utilize the clusters to relate a new document or search term to clustered documents
43
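The term-frequency similarity described above can be sketched directly: count word frequencies per document and compare the count vectors with cosine similarity. The three toy documents are our own examples (no stop-word filtering or stemming, which a real system would add).

```python
# Term-frequency vectors compared with cosine similarity (toy documents).
from collections import Counter
from math import sqrt

docs = ["the game ended with a late goal",
        "a late goal decided the game",
        "stocks fell as interest rates rose"]

vectors = [Counter(d.split()) for d in docs]  # word -> frequency

def cosine(u, v):
    # Counter returns 0 for missing words, so the dot product is safe.
    dot = sum(u[t] * v[t] for t in u)
    norm = lambda w: sqrt(sum(c * c for c in w.values()))
    return dot / (norm(u) * norm(v))

print(round(cosine(vectors[0], vectors[1]), 3),  # similar sports sentences
      round(cosine(vectors[0], vectors[2]), 3))  # unrelated finance sentence
```

The pairwise cosine values then serve as the similarity measure fed to any of the clustering algorithms above.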
Illustrating Document Clustering


Clustering points: 3204 articles of the Los Angeles Times
Similarity measure: how many words are common in these documents (after some word filtering)

Category       Total Articles  Correctly Placed
Financial      555             364
Foreign        341             260
National       273             36
Metro          943             746
Sports         738             573
Entertainment  354             278
44
Clustering algorithms
K-Means
Hierarchical clustering
Graph-based clustering (spectral clustering)
Semi-supervised clustering
Others
45
Basics of probability

An experiment (random variable) is a well-defined process with observable outcomes
The set or collection of all outcomes of an experiment is called the sample space, S
An event E is any subset of outcomes from S
The probability of an event, P(E), is P(E) = (number of outcomes in E) / (number of outcomes in S)
46
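A tiny worked example of P(E) = |E| / |S|; the fair six-sided die is our example, not the slide's.

```python
# P(E) = (number of outcomes in E) / (number of outcomes in S),
# illustrated with one roll of a fair six-sided die.
S = {1, 2, 3, 4, 5, 6}               # sample space
E = {x for x in S if x % 2 == 0}     # event: the outcome is even

P_E = len(E) / len(S)
print(P_E)  # 0.5
```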
Probability Theory
Apples and Oranges
X: identity of the fruit (a = apple, o = orange)
Y: identity of the box (r = red, b = blue)
Prior: assume P(Y=r) = 40%, P(Y=b) = 60%
Likelihood:
P(X=a|Y=r) = 2/8 = 25%, P(X=o|Y=r) = 6/8 = 75%
P(X=a|Y=b) = 3/4 = 75%, P(X=o|Y=b) = 1/4 = 25%
Marginal: P(X=a) = 11/20, P(X=o) = 9/20
Posterior: P(Y=r|X=o) = 2/3, P(Y=b|X=o) = 1/3
47
Probability Theory

Marginal probability: p(X)
Conditional probability: p(Y|X)
Joint probability: p(X, Y)
48
Probability Theory

Sum Rule: p(X) = Σ_Y p(X, Y)
– The marginal prob of X equals the sum of the joint prob of X and Y with respect to Y
Product Rule: p(X, Y) = p(Y|X) p(X)
– The joint prob of X and Y equals the product of the conditional prob of Y given X and the prob of X
49
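The two rules can be checked numerically on the apples-and-oranges numbers from the earlier slide: the product rule builds the joint from prior and likelihood, the sum rule recovers the marginal, and their ratio gives the posterior.

```python
# Verifying the sum and product rules on the apples-and-oranges numbers.
prior = {"r": 0.4, "b": 0.6}                        # p(Y)
likelihood = {("a", "r"): 0.25, ("o", "r"): 0.75,   # p(X|Y)
              ("a", "b"): 0.75, ("o", "b"): 0.25}

# Product rule: joint p(X, Y) = p(X|Y) p(Y)
joint = {(x, y): likelihood[(x, y)] * prior[y] for (x, y) in likelihood}

# Sum rule: marginal p(X) = sum over Y of p(X, Y)
p_x = {x: sum(joint[(x, y)] for y in prior) for x in ("a", "o")}

# Bayes' rule: posterior p(Y=r | X=o) = p(X=o, Y=r) / p(X=o)
post_r_given_o = joint[("o", "r")] / p_x["o"]

print(round(p_x["a"], 4), round(post_r_given_o, 4))  # 11/20 and 2/3
```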
Illustration
[Figure: histograms of a joint distribution p(X, Y) over two values Y=1 and Y=2, with the marginals p(X) and p(Y) and the conditional p(X|Y=1)]
50
The Rules of Probability

Sum Rule: p(X) = Σ_Y p(X, Y)
Product Rule: p(X, Y) = p(X|Y) p(Y)
Bayes' Rule: p(Y|X) = p(X|Y) p(Y) / p(X)
– posterior ∝ likelihood × prior
51
Mean and Variance

The mean of a random variable X is the average value X takes
The variance of X is a measure of how dispersed the values that X takes are
The standard deviation is simply the square root of the variance
52
Simple Example

X = {1, 2} with P(X=1) = 0.8 and P(X=2) = 0.2
Mean
– 0.8 × 1 + 0.2 × 2 = 1.2
Variance
– 0.8 × (1 – 1.2)² + 0.2 × (2 – 1.2)² = 0.16
53
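The slide's example can be reproduced in a few lines, computing the mean and variance directly from the probability mass function.

```python
# Mean and variance for X in {1, 2} with P(X=1) = 0.8, P(X=2) = 0.2.
pmf = {1: 0.8, 2: 0.2}

mean = sum(x * p for x, p in pmf.items())                 # E[X]
variance = sum(p * (x - mean) ** 2 for x, p in pmf.items())  # E[(X - E[X])^2]
std_dev = variance ** 0.5

print(round(mean, 10), round(variance, 10))  # 1.2 0.16
```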
References
SC_prob_basics1.pdf (necessary)
SC_prob_basic2.pdf
Loaded to HuskyCT
54
Basics of Linear Algebra
55
Matrix Multiplication

The product of two matrices C = AB
Special cases: vector-vector product, matrix-vector product
[Figure: the dimensions of A, B, and their product C]
56
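The product rule C[i][j] = Σ_k A[i][k] B[k][j] (A is m×n, B is n×p, C is m×p) can be written directly; the sketch below uses plain nested lists.

```python
# Matrix product C = A B with C[i][j] = sum_k A[i][k] * B[k][j].
def matmul(A, B):
    m, n, p = len(A), len(B), len(B[0])
    assert all(len(row) == n for row in A), "inner dimensions must match"
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)]
            for i in range(m)]

A = [[1, 2],
     [3, 4]]
B = [[5, 6],
     [7, 8]]
print(matmul(A, B))  # [[19, 22], [43, 50]]
```

The matrix-vector special case falls out by treating the vector as an n×1 matrix, e.g. `matmul(A, [[1], [1]])`.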
Matrix Multiplication
57
Rules of Matrix Multiplication
[Figure: block illustration of the multiplication rules with matrices A, B, and C]
58
Orthogonal Matrix
A square matrix Q is orthogonal if Q^T Q = Q Q^T = I, the identity matrix (1's on the diagonal, 0's elsewhere)
59
Square Matrix – EigenValue, EigenVector
(λ, x) is an eigen pair of A if and only if Ax = λx, where x ≠ 0
λ is the eigenvalue; x is the eigenvector
60
Symmetric Matrix – EigenValue EigenVector
A is symmetric if A = A^T
Eigen-decomposition of A: A = QΛQ^T, where Q is orthogonal and Λ is diagonal
A ∈ R^{n×n} symmetric is positive semi-definite if x^T A x ≥ 0 for any x ∈ R^n (equivalently, λ_i ≥ 0, i = 1, …, n)
A ∈ R^{n×n} symmetric is positive definite if x^T A x > 0 for any nonzero x ∈ R^n (equivalently, λ_i > 0, i = 1, …, n)
61
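For a symmetric 2×2 matrix the definitions can be checked by hand: the eigenvalues of [[a, b], [b, c]] are ((a + c) ± sqrt((a − c)² + 4b²)) / 2, and positivity of x^T A x can be evaluated directly. The example matrix is our own choice.

```python
# Hand-rolled eigenvalues of a symmetric 2x2 matrix, plus a positive-
# definiteness spot check (example matrix is an illustrative assumption).
from math import sqrt

a, b, c = 2.0, 1.0, 2.0   # A = [[2, 1], [1, 2]], symmetric
disc = sqrt((a - c) ** 2 + 4 * b ** 2)
lam1 = (a + c + disc) / 2  # larger eigenvalue
lam2 = (a + c - disc) / 2  # smaller eigenvalue
print(lam1, lam2)  # 3.0 1.0

# Both eigenvalues are positive, so A is positive definite; for instance
# x = (1, -1) gives x^T A x = a*x1^2 + 2b*x1*x2 + c*x2^2 > 0:
x = (1.0, -1.0)
quad = a * x[0] ** 2 + 2 * b * x[0] * x[1] + c * x[1] ** 2
print(quad)  # 2.0
```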
Matrix Norms and Trace
Frobenius norm: ||A||_F = sqrt(Σ_ij a_ij²) = sqrt(trace(A^T A))
62
Singular Value Decomposition
A = UΣV^T, where U and V are orthogonal and Σ is diagonal (its entries are the singular values)
63
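One connection worth spelling out: the singular values of A are the square roots of the eigenvalues of the symmetric matrix A^T A. The sketch below checks this for a 2×2 example of our own choosing, reusing the symmetric 2×2 eigenvalue formula.

```python
# Singular values of a 2x2 matrix A computed as the square roots of the
# eigenvalues of A^T A (example matrix is an illustrative assumption).
from math import sqrt

A = [[3.0, 0.0],
     [4.0, 5.0]]

# A^T A, a symmetric 2x2 matrix
ata = [[A[0][0]**2 + A[1][0]**2,            A[0][0]*A[0][1] + A[1][0]*A[1][1]],
       [A[0][0]*A[0][1] + A[1][0]*A[1][1],  A[0][1]**2 + A[1][1]**2]]

a, b, c = ata[0][0], ata[0][1], ata[1][1]
disc = sqrt((a - c) ** 2 + 4 * b ** 2)
sigma1 = sqrt((a + c + disc) / 2)   # largest singular value
sigma2 = sqrt((a + c - disc) / 2)   # smallest singular value
print(round(sigma1, 4), round(sigma2, 4))  # sqrt(45) and sqrt(5)
```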
References
SC_linearAlg_basics.pdf (necessary)
SVD_basics.pdf
Loaded to HuskyCT
64
Summary
This is the end of the FIRST chapter of this course
Next class: cluster analysis
– General topics
– K-means
Slides after this one are backup slides; you can also check them to learn more
65
Neural Networks

Motivated by the biological brain neuron model introduced by McCulloch and Pitts in 1943
A neural network consists of
– Nodes (mimic neurons)
– Links between nodes (pass messages around, represent causal relationships)
All parts of a NN are adaptive (modifiable parameters)
Learning rules specify these parameters to finalize the NN
[Figure: a biological neuron, with dendrites, soma, nucleus, axon, myelin sheath, Schwann cells, nodes of Ranvier, and axon terminals]
66
Illustration of NN
[Figure: a node with inputs x1 and x2, weights w11 and w12, and an activation function producing the output y]
67
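The node in the figure computes y = f(w11·x1 + w12·x2) for some activation function f. A minimal sketch with a sigmoid activation; the weight and input values are illustrative assumptions.

```python
# Forward pass of a single node: weighted sum of inputs through a sigmoid.
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

w11, w12 = 0.5, -0.25   # assumed weights
x1, x2 = 2.0, 4.0       # assumed inputs

y = sigmoid(w11 * x1 + w12 * x2)   # here the weighted sum is exactly 0
print(y)  # 0.5
```

Training (e.g., back-propagation) then adjusts w11 and w12 so that y matches the desired outputs on labeled examples.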
Many Types of NN
Adaptive NN
Single-layer NN (perceptrons)
Multi-layer NN
Self-organizing NN
Different activation functions

Types of problems:
– Supervised learning
– Unsupervised learning
68
Classification: Additional Application

Sky Survey Cataloging
– Goal: To predict the class (star or galaxy) of sky objects, especially visually faint ones, based on telescopic survey images (from the Palomar Observatory)
– 3000 images with 23,040 x 23,040 pixels per image
– Approach:
  Segment the image
  Measure image attributes (features): 40 of them per object
  Model the class based on these features
  Success story: could find 16 new high red-shift quasars, some of the farthest objects that are difficult to find!
From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996
69
Classifying Galaxies
Courtesy: http://aps.umn.edu
Class:
• Stages of formation (Early, Intermediate, Late)
Attributes:
• Image features
• Characteristics of light waves received, etc.
Data size:
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB
70
Challenges of Data Mining






Scalability
Dimensionality
Complex and Heterogeneous Data
Data Quality
Data Ownership and Distribution
Privacy Preservation
71
Application of Prob Rules
Assume P(Y=r) = 40%, P(Y=b) = 60%
P(X=a|Y=r) = 2/8 = 25%
P(X=o|Y=r) = 6/8 = 75%
P(X=a|Y=b) = 3/4 = 75%
P(X=o|Y=b) = 1/4 = 25%
p(X=a) = p(X=a,Y=r) + p(X=a,Y=b)
= p(X=a|Y=r)p(Y=r) + p(X=a|Y=b)p(Y=b)
=0.25*0.4 + 0.75*0.6 = 11/20
P(X=o) = 9/20
p(Y=r|X=o) = p(Y=r,X=o)/p(X=o)
= p(X=o|Y=r)p(Y=r)/p(X=o)
= 0.75*0.4 / (9/20) = 2/3
72
The Gaussian Distribution
73
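The slide's figure did not survive extraction; for reference, the univariate density is N(x | μ, σ²) = exp(−(x − μ)² / (2σ²)) / sqrt(2πσ²). A minimal sketch:

```python
# Univariate Gaussian density N(x | mu, sigma^2).
from math import exp, pi, sqrt

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

# Peak of the standard normal is 1/sqrt(2*pi) at x = mu:
print(round(gaussian_pdf(0.0), 4))  # 0.3989
```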
Gaussian Mean and Variance
74
The Multivariate Gaussian
[Figure: the density surface of a two-dimensional Gaussian over variables x and y]
75