Alberta Ingenuity & CMASTE
Purpose : This lesson is based on research by Nasimeh Asgarian at the Alberta Ingenuity Centre for Machine Learning (AICML) at the University of Alberta. This research falls under the banner of bioinformatics, which is the application of computing science techniques to solve problems in biological and medical science.
Nasimeh has been given access to medical data from 80 patients at the Cross Cancer Institute. It is known historically that approximately 1/3 of patients who have had prostate cancer treatment will exhibit toxicity, resulting in bleeding. Nasimeh is using Machine Learning techniques to try to help doctors improve predictability of this bleeding. Each patient has 50 000 different genetic variant dimensions called SNPs. (This stands for Single Nucleotide Polymorphism and is pronounced “snips”.) This data is gathered from physicians at the Cross Cancer Institute. The SNP dimensions are not numerical, but are represented by the heterocyclic bases of human DNA, namely A, C, T, G for adenine, cytosine, thymine, and guanine. This is a huge data set that must be analyzed. To reduce that volume, only the 51 most important SNP dimensions are actually used. Here is a reduced example of this for 3 patients:
Patient   SNP 1   SNP 2   SNP 3   . . .   SNP 51   Bleed
1         C       C       T       . . .   G        +
2         C       T       G       . . .   A        -
3         A       G       C       . . .   T        -
The machine learning techniques used to analyze the data include Linear Separators, Decision Trees, Support Vector Machines, and Naïve Bayes Tests. This last technique applies conditional probability, a concept linked to the IB diploma program mathematics curriculum. Naïve Bayes tests are based on Bayes Law, but with strong independence assumptions between the dimensions, which makes the design naïve (oversimplified). Despite this simplification, the results have been very successful for many complex problems studied.
Problem : To improve physicians’ success in predicting toxicity in patients after prostate cancer treatment by finding the SNPs that are most strongly linked to this toxicity.
Hypothesis : Without any analysis, we can predict that about 33% of prostate cancer patients will exhibit toxicity (bleeding) after treatment. Using Naïve Bayes Tests on a select group of the medical dimensions (SNPs) for each patient, we can improve the success rate in prediction of toxicity.
Prediction : We can significantly improve predictability of toxicity using Machine Learning techniques.
AICML2ProstateCancer Centre for Machine Learning 1/5
Design : Students in Pure Math 30 work with permutations and probability, and students in the IB diploma program learn Bayes Law for conditional probability as:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
where A and B are two distinct events.
Procedure :
1) If each patient has 51 SNP dimensions, each of which is represented by any one of the bases A, C, T, or G, then how many different arrangements of these dimensions are possible for any patient?
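One way to check an answer to question 1 is sketched below in Python. It assumes the 51 dimensions are filled independently, so the choices multiply (the fundamental counting principle); the variable names are illustrative only.

```python
# Each SNP dimension can hold one of 4 bases, and the 51 dimensions
# are assumed to be filled independently, so the choices multiply.
BASES = ["A", "C", "T", "G"]
NUM_SNPS = 51

arrangements = len(BASES) ** NUM_SNPS
print(arrangements)            # exact integer count
print(f"{arrangements:.3e}")   # the same count in scientific notation
```

The count is 4^51, a 31-digit number of roughly 5.07 × 10^30, which hints at why the full data set cannot be explored by hand.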
Next, an example of how Bayes Law works is helpful here:
It is estimated that about 2% of the world population has diabetes (event A). The test for diabetes indicates whether a patient has a blood glucose level above normal, and this test is 85% accurate (event B is a positive test). Based on this, determine the probability that:
a) a person will test positive for diabetes given that they actually have diabetes
b) a person is actually diabetic given that they have tested positive for diabetes
a)
$$P(B \mid A) = \frac{P(B \cap A)}{P(A)} = \frac{0.85 \times 0.02}{0.85 \times 0.02 + 0.15 \times 0.02} = \frac{0.85 \times 0.02}{0.02} = 0.85$$
Here the numerator is the probability of having the disease and testing +, and the denominator is the probability of having the disease and testing + as well as having the disease but testing -.
This result is intuitive and seems trivial in the sense that 0.85 is simply the accuracy of the diabetes test.
b)
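$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{0.02 \times 0.85}{0.02 \times 0.85 + 0.98 \times 0.15} \approx 0.104$$
Here the numerator is the probability of testing positive and having the disease, and the denominator is the probability of testing + and having the disease as well as testing + but not having the disease.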
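This result intuitively seems low, but it means that only about 10% of people who test positive for high blood glucose levels are actually diabetic. Because only 2% of the population has diabetes, the false positives among the 98% without the disease outnumber the true positives. It doesn’t necessarily indicate that the diabetes test is not effective.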
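The calculation in part b) can be sketched in Python as follows. The function and parameter names (`posterior`, `sensitivity`, `false_positive_rate`) are illustrative and not from the lesson, and reading "85% accurate" as an 85% true-positive rate together with a 15% false-positive rate is an assumption carried over from the worked solution.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """Return P(disease | positive test) using Bayes Law."""
    true_positive = prior * sensitivity                   # P(A and B)
    false_positive = (1 - prior) * false_positive_rate    # P(not A and B)
    return true_positive / (true_positive + false_positive)

# Values from the diabetes example: 2% prevalence, 85% accurate test.
# Assumption: the test is wrong 15% of the time for healthy people too.
p = posterior(prior=0.02, sensitivity=0.85, false_positive_rate=0.15)
print(round(p, 3))  # 0.104
```

Changing `prior` shows how strongly the answer depends on how common the disease is: the rarer the condition, the smaller the share of positive tests that are true positives.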
For more practice/understanding I suggest the interactive Java applet demonstrating Bayes Law in a medical testing application, found at: www.gametheory.net/Mike/applets/Bayes/Bayes.html
Evidence : In our medical situation, one event will be the DNA base (A, C, T, or G) for that SNP dimension and the second event will be the chance of bleeding (toxicity).
For the DNA bases,
$$P(C) = P(T) = P(G) = P(A) = 0.25$$
if each base is equally likely to occur. Then, for the complements,
$$P(\overline{C}) = P(\overline{T}) = P(\overline{G}) = P(\overline{A}) = 0.75$$
The probability of each of the multiple events $A \cap B$ is experimental and comes from the actual patient data. Let’s use a sample of 20 patients and only 1 dimension, SNP 1.
SNP 1   A  C  C  G  T  T  A  G  C  T  A  C  G  G  T  T  A  A  C  C
bleed   -  -  -  +  -  +  +  -  +  -  -  -  -  -  +  +  +  -  +  +
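Analysis : From this table we can find the following probabilities simply by counting:
$$P(A \cap \text{bleed}^{+}) = \frac{2}{20}, \quad P(A \cap \text{bleed}^{-}) = \frac{3}{20}, \quad P(C \cap \text{bleed}^{+}) = \frac{3}{20}, \quad P(C \cap \text{bleed}^{-}) = \frac{3}{20}$$
$$P(G \cap \text{bleed}^{+}) = \frac{1}{20}, \quad P(G \cap \text{bleed}^{-}) = \frac{3}{20}, \quad P(T \cap \text{bleed}^{+}) = \frac{3}{20}, \quad P(T \cap \text{bleed}^{-}) = \frac{2}{20}$$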
Also, from the initial information we know that $P(\text{bleed}^{+}) = \frac{1}{3}$ and $P(\text{bleed}^{-}) = \frac{2}{3}$.
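The counting step can be sketched in Python. The two strings below transcribe the SNP 1 and bleed rows of the 20-patient sample in order; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction

# SNP 1 base and bleeding outcome for the 20 sample patients, in table order.
snp1  = "ACCGTTAGCTACGGTTAACC"
bleed = "---+-++-+-----+++-++"

# Count how often each (base, outcome) pair occurs.
pair_counts = Counter(zip(snp1, bleed))
n = len(snp1)

# Joint probability P(base and outcome) as an exact fraction of the sample.
joint = {pair: Fraction(count, n) for pair, count in pair_counts.items()}

for base in "ACGT":
    for outcome in "+-":
        count = pair_counts[(base, outcome)]
        print(f"P({base} and bleed{outcome}) = {count}/{n}")
```

Using `Fraction` keeps the results exact, although it stores them in lowest terms (for example 2/20 becomes 1/10), which is why the loop prints the raw counts over 20 instead.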
Eg) Find the probability that a patient has base G for SNP 1, given that they were not toxic.
$$P(G \mid \text{bleed}^{-}) = \frac{P(G \cap \text{bleed}^{-})}{P(\text{bleed}^{-})} = \frac{3/20}{2/3} = \frac{3}{20} \times \frac{3}{2} = \frac{9}{40}$$
Eg) Find the probability that a patient was toxic, given that their SNP 1 dimension was A.
$$P(\text{bleed}^{+} \mid A) = \frac{P(\text{bleed}^{+} \cap A)}{P(A)} = \frac{2/20}{0.25} = \frac{2}{20} \times \frac{4}{1} = \frac{8}{20} = \frac{2}{5}$$
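The two worked examples can be reproduced with a short Python sketch. It follows the lesson's convention of taking the joint probabilities from the 20-patient sample while taking the marginals from prior information (the historical 1/3 bleeding rate and equally likely bases); the helper name `conditional` is illustrative.

```python
from fractions import Fraction

def conditional(joint, given):
    """Return P(X | Y) = P(X and Y) / P(Y)."""
    return joint / given

# Joint probabilities counted from the 20-patient sample.
p_G_and_no_bleed = Fraction(3, 20)
p_A_and_bleed = Fraction(2, 20)

# Marginals as used in the lesson: the historical bleeding rate and
# equally likely bases (not re-estimated from the sample).
p_no_bleed = Fraction(2, 3)
p_A = Fraction(1, 4)

print(conditional(p_G_and_no_bleed, p_no_bleed))  # 9/40
print(conditional(p_A_and_bleed, p_A))            # 2/5
```

If the marginal were estimated from the sample instead (11 of the 20 patients did not bleed), the first example would come out as 3/11 rather than 9/40, so it matters which source of information is used for the denominator.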
Use these examples to do the 8 questions that follow.
Evaluation :
2. Find each of the following conditional probabilities:
a) the probability of having base C for SNP 1, given that the patient had toxicity
b) the probability of having base A for SNP 1, given that the patient had toxicity
c) the probability of having base G for SNP 1, given that the patient had toxicity
d) the probability of having base T for SNP 1, given that the patient had toxicity
e) the probability that the patient was toxic, given that they had base C for SNP 1
f) the probability that the patient was toxic, given that they had base A for SNP 1
g) the probability that the patient was toxic, given that they had base G for SNP 1
h) the probability that the patient was toxic, given that they had base T for SNP 1
3. Even on this small scale, we hope to find some relationship between the two events. From the examples and your results in 2), can you find any connection between the SNP dimension and toxicity for patients?
Synthesis : Doing this type of calculation by hand for only 1 SNP dimension out of 51 and for only 20 patients out of 80 is neither difficult nor very time-consuming. But the entire data set would be enormous, and this is where Machine Learning techniques are used to process the data quickly. A Bayesian Network of probability calculations, run on a computer, would process this large data set.
When this is completed the researcher and the physician can easily observe the probability values and infer which bases in each SNP dimension are most strongly linked with toxicity in patients. This process can be used for many other medical situations as well.
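As a rough illustration of how a computer might combine several SNP dimensions at once, here is a minimal Naïve Bayes sketch in Python. It is not the researcher's actual method: the function names, the add-one (Laplace) smoothing, and the use of only the three patients and four SNP columns shown in the reduced table are all assumptions made for illustration.

```python
from collections import Counter

BASES = "ACGT"

def train(patients, labels):
    """Count outcomes, and count bases per SNP position within each outcome."""
    class_counts = Counter(labels)
    num_snps = len(patients[0])
    base_counts = {label: [Counter() for _ in range(num_snps)]
                   for label in class_counts}
    for snps, label in zip(patients, labels):
        for i, base in enumerate(snps):
            base_counts[label][i][base] += 1
    return class_counts, base_counts

def posterior(snps, class_counts, base_counts):
    """Return P(outcome | snps), treating SNP positions as independent."""
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # prior P(outcome) estimated from the sample
        for i, base in enumerate(snps):
            # Add-one smoothing (an assumption) avoids zero probabilities.
            score *= (base_counts[label][i][base] + 1) / (n + len(BASES))
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}

# The three patients from the reduced table (SNP 1, 2, 3 and 51 only).
patients = ["CCTG", "CTGA", "AGCT"]
labels = ["+", "-", "-"]

class_counts, base_counts = train(patients, labels)
result = posterior("CCTG", class_counts, base_counts)
print(result)
```

With only three patients the numbers mean very little, but the same two functions would run unchanged on 80 patients and 51 SNP dimensions, which is the point of handing this work to a computer.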
Even though students are only doing basic permutation and probability calculations, they can appreciate the volume of calculations needed to make informed conclusions and thus, appropriate medical decisions for researchers as well as physicians and their patients.
Sources:
1) Nasimeh Asgarian, AICML, University of Alberta Computer Sciences
2) Mathpower 12, Knill, George et al, McGraw-Hill Ryerson Publishing, Toronto, 2000
3) www.biochem.northwestern.edu/holmgren/Glossary/Definitions/Def-S/SNP.html
4) http://b-course.cs.helsinki.fi/obc/bayesnetprediction.html
5) www.cs.ualberta.ca/~greiner/Presentations.html/#IntroBN
6) www.gametheory.net/Mike/applets/Bayes/Bayes.html
7) www.cs.ualberta/research/areas/bioinformatics/profiles/index.php