>> Surajit Chaudhuri: Good morning. I'm very delighted to have Professor Tekin Ozsoyoglu
from Case Western University here. So I met Tekin I think 17 years ago when I visited Case
Western (inaudible) looking for a job. So (inaudible) is also here today, so I'm very delighted to
have both of them. Merhilo (phonetic) is my official host, so a lot of memories of those days.
And Professor Ozsoyoglu is -- has been a very senior faculty member at Case Western and has
been a contributor to our community for many, many years. He did his Ph.D. from University of
Alberta, Edmonton and his current research interests are around databases, bioinformatics, and
the Web.
So today I'm going to learn something that I don't know. I know nothing about the
bioinformatics area broadly, so it will be an education for me. So, please.
>> Gultekin Ozsoyoglu: Thank you, Surajit. So this is the group that worked on this. I
shouldn't really take credit for everything. Ali is a Ph.D. student of mine. He's with us for
about -- has been with us for five years. He knows more about chemistry than I do. Arum
(phonetic) is actually a master's student. He's finishing his degree and he's going to start working
at Microsoft at the end of August. And this is me over here and this is Mack (phonetic).
So what I'll talk about is metabolomic analysis. What is metabolomics? It's actually -- metabolites
are small-weight -- small molecular-weight molecules that are products of
various different metabolisms. And metabolome refers to the complete set of metabolites in
different tissues or organs. The amount of metabolites is -- you can get different numbers if you
ask different people, but I would say around 2500 of them. And metabolomics is the study of
distributions of the metabolome in biofluids. By that I mean blood, urine, et cetera.
So the recent technological advances, mass spectrometry and gas chromatography and so on,
have actually enabled us to -- have enabled biologists to measure these small-weight metabolites
in biofluids -- blood, urine, saliva, et cetera. So the question is, when you can measure these and
when you know what the normal values are and when they differ from normal values, what do
they mean? So then you go and take a blood test -- you have about 20, 30 different
measurements, and these are biomarkers, and so they won't metabolize. And you know what -- if
a single metabolite has a higher value, such as ketone in your urine, you know that you have
certain problems, the doctors know that. But the question is when you have 300 of these
metabolites that are lower or higher than the normal values, what type of problems do I have or
what type of physiological issues do I have, or maybe I -- my dietary intake has issues so I need
to adjust them.
So this is the question: What do they mean? There's no easy answer to this. The way in which
this is done is actually our second -- the third author over here is a very well known, world-class
biochemist, and his specialty is metabolic biochemistry. He can close his eyes and tell you what
happens if you start with a metabolite in this organ and how it interacts and how it changes, loses
its carbon (inaudible), how it produces energy and so on. So he actually suggested this problem
to us. And the standard approach is you ask this biochemist, and then he says, Okay, well, I
know that if you have alanine increase, arginine decrease, glucose increase in the blood, it may
mean one of these ten different possibilities. So our task is to do this computationally.
Metabolic network itself is very complex, and different metabolisms have different sets of
pathways, connections. Carbohydrate metabolism is about your pathways that actually deal with
carbohydrate consumption. Lipid metabolism is with lipids, (inaudible) metabolism is with
(inaudible). And all together the number of different pathways -- these are really specific
functional units. You can view them as graphs in your body in different organs that do certain
things. Each one of these different metabolisms actually involve sophisticated, really complex
number of reactions.
So the question is, if we have this network available to us -- which we do, for eight years now we
have been actually building and managing a metabolic network, it's on the Web, it's used across
the world by biologists, we built it with Microsoft technology, on the server side at least. On the
client side we use Java. So we already have this metabolic network, so what can you do with
this metabolic network to help this question over here, this fundamental question over here?
This will become more and more important in the future because essentially clinical doctors even
cannot answer these questions. If you have 2400, 200 of these increases and decreases in your
blood, and these measurements are becoming cheaper and cheaper, even -- the devices are
expensive, but once you get them, then the overall cost is very low, so it will be measured for
lots of people for these reasons. And then the question is, how can we learn from these and what
do these different values mean? This is the question that we deal with.
Obviously, the true approach should be the following: You have the whole metabolic network
dynamics figured out through a -- perhaps a very large differential equation, set of differential
equations, and then you look at the steady-state analysis of this network. But this is an incredible
task. It cannot be done -- it will not be done in our lifetime. Right now the state of the art is that
there are four or five reactions and that there are thousands of reactions in a metabolic network.
You can model it. You can come up with a partial differential equation to analyze its behavior,
but you cannot do more than, let's say -- the most that I'm aware of is about 25 to 30 reactions.
So you cannot use kinetics. You have to use something else.
So our goal is to automate the interpretation of metabolomics data, and for that we will use our
metabolic network database. And our metabolic network database actually has deficiencies, so
we revised it. And then by learning from biochemists. This is our first goal. And if we can
achieve this in the next three or four years, the next goal is to actually move forward and do more
along the lines of data mining, along the lines of doing computational things.
One of the problems that we encounter these days is -- you know, I have been using biochemists
for about seven or eight years, but even with that, I say something to Mr. Richard Hanson. He's
a great guy. He's an awesome guy. He says, That's stupid. So then we stop that and then we
move on.
So as we learn more, then we will be able to actually understand and provide computational
techniques to them. This approach that I propose over here, Richard Hanson for nine months
said it cannot be done. And then when we did it, he said, Oh, it's beautiful. Let's work more on
it. And so we keep working on it. Every week we get together, we try to improve it.
So we would like to answer questions of the type, what may have led to the increase or decrease
in the concentration of a metabolite? So the way in which these measurements are available is --
most of the time is with respect to a control subject. Control subject is a normal person. And
then there's this person whose measurements are -- they differ from the control subject's
measurements in terms of the concentration levels of the metabolite.
So are there alternative hypotheses or scenarios, and can we produce these? Are the results
consistent? I will just define what I mean by that. And then can we verify and score these
different alternatives? The way in which we will define the hypothesis will be essentially a path
in a graph database starting from blood measurement of a metabolite, ending with the same
blood measurement of another metabolite, and then looking at the increases and decreases. We
will look at perturbation analysis. We make no claims about steady-state behavior of the system. We
are just saying that if there's an increase of this metabolite, there is a time period increase,
perturbation in this metabolite and this metabolite in the organs. And you cannot do more than
that.
With humans, we have to target things for humans, you cannot really look at the metabolite
concentration levels in organs, unless you put a clamp around the liver of a person and then
squeeze it and then get the metabolites out. You can do it for mice and unfortunately for dogs,
but no more than that. Yes.
>>: Just to verify the scenario a little bit, you were talking about comparing this to the normal,
normal --
>> Gultekin Ozsoyoglu: Yes.
>>: -- person. How much variation is there among individuals and how much is a normal
variation over time for a single individual? Is that understood?
>> Gultekin Ozsoyoglu: That's not understood at all. Not at this stage. This is really -- this is
really the beginning of this technology. We are at the really forefront of the technology
(inaudible) extremely important for nutritional purposes into the future.
Actually, we are also working with a metabolomics expert, Dr. Henri Brunengraber from
Belgium. He's a well-known expert. And this information is not available at this point in time.
As we know more, it will be much more important. And we will relate these increases and
decreases to dietary problems, problems in specific organs, or to diseases.
So I will start with an example. Let's say that in the blood, not that we measured 200 or 300 of
these, but let's only start with five of them. Glutamine has increased. This is what you would --
when you go through this exercise class they say they're protein shakes. I can give you
glutamine, alanine and so on, right? So it's one of those. Glutamine has increased with respect
to a normal person by fourfold. Alanine, another metabolite, has increased by twofold. Urea has
increased by 0.54. Glucose, blood glucose has increased by 1.34, and this stands for
branched-chain amino acids. These are shorter amino acids. And it is a metabolite. It has not
changed with respect to a normal person.
What do they mean? So, in the first place, if you start with glucose over here, you will not go
too far. But if you start with glutamine, which is what we will do, you can actually start coming
to some alternative conclusions. These, we intend to keep them in our database to start with. It's
a daunting task to do this for all the metabolites across all the organs, but this is what we intend
to do eventually. But then after that we will actually do a complete analysis: searching a
network, looking at possible implications.
To start with, let's say glutamine may increase, and these are the four possible physiological
conditions, problems. It may be a problem in the muscle because of increased protein turnover,
or it may be a liver problem, there may be an increased production of glutamine in the liver
because of the urea cycle, which is another pathway within the metabolic network. There may
be a decreased uptake by kidney or by gut. Okay.
So this is a hand-drawn figure. We are coming close to this, but ultimately what we would like is
for our system to produce these at different levels for biochemists and also for clinical researchers.
So these are the five metabolites that we just observed changes. And this is a simplified
network. So, for instance, you see over here that glutamine is -- from here it gets transported
into kidney, and then these are actual paths, but then this is really a pathway over here which I
omitted because Richard Hanson said that that path is really -- it's reversed and it's (inaudible)
utilized for this specific interaction, so we have a simpler version. It's another pathway, it's
another pathway you see over here. They are really subgraphs.
Anyway, so we are going to essentially do perturbation analysis and chase the increases and
decreases of these metabolites in the blood backward and forward. It's not a forward chase. It's
actually back -- forward and backward. We go in all directions. It will become clear what I
mean in a minute.
So the way in which we will do this is, again, we are following what Richard Hanson told us that
we should be doing, we are -- he said that when I -- he last -- consulting for metabolomic
companies, this is big business, and they measure these things and then they say what kind of
problems should (inaudible) for these measurements tell us, and he looks at them -- he says, The
way I do it, I know that I have some patterns. If glutamine has increased, alanine decreased,
and -- I'm just making this up -- arginine has increased, then it's likely that these are the
problems. Because he can close his eyes and go through these networks. I'm not kidding. He's
amazing. On the board he immediately starts throwing reactions with the catalyzing -- with all
the activators, inhibitors and so on and so forth.
So he says that I look at -- I define a pattern as a set of metabolite changes that may be related to
a physiological condition. So this is one metabolite change, ketones in urine. If you have
relatives that have ketones, and if you have relatives that have diabetes, you would immediately
recognize this. If you're not controlling your diabetes well, you will have ketones in your urine.
These are actually short -- these are energy units. They are actually lovely. They are really
needed for your body. But they are not good if they are being excreted by your body into your
urine. That means that you have a problem. And then also if the blood glucose is about 200
milligrams per deciliter, then in all likelihood you have diabetes. So this is a pattern, a simple
pattern. This is known by doctors. Doctors actually look at these immediately and say, ah-hah,
this patient may have diabetes.
So this is what -- we will make use of that in our problem. These four patterns actually are the
patterns that he told us, which we discuss, right, pattern P1 is glutamine increase, protein
turnover, muscle, liver problem with urea cycle and kidney and gut problems. So the way in
which we -- computationally the way to find this is by chasing this. For instance, let's start with
this. Glutamine has increased in blood. Assuming that the transport mechanism is 1:1 -- in other
words, increase of this glutamine means that it was produced in the muscle more and, therefore,
it was excreted into the blood more. This is a very simplistic view of the transport mechanism,
but this is what we will use. And this is what actually metabolomics researchers use at this point
in time. They are a reasonable -- (inaudible) community they have -- every year they have
metabolomic conference, metabolomics conference. And actually we will present this work over
there as well.
So glutamine in blood has increased. It's because perhaps glutamine is increased in muscle. And
you see that glutamate, another metabolite, has increased. It's pretty small, and then it's resulted
in glutamine increase. And then following this, glutamate increase is because the
branched-chain amino acids have increased and this immediately means that there was a protein
turnover over here. This is a problem in muscle.
So then obviously you can say, hey, why didn't you go through this path rather than this path?
We didn't go through this path because, even though the kinetic models are not available, through
years and years of biochemistry research biochemists know that -- this is ammonia. Ammonia
produces very little glutamine. The ammonia amount in your muscle is very small, whereas
most glutamine production doesn't come from here, it comes from here. So this is an important
observation. We've modeled it into our system.
So, essentially, this is actually also hand-drawn, but we actually produced this on the fly.
Glutamine has increased in blood. This means that there's an increase. There are four possible
options. Perhaps glutamine has increased in muscle, and then this is caused because glutamate
has increased in muscle. These are abbreviations for these metabolites. And glutamate in muscle
is increased because branched-chain amino acid has increased and this is because there was a
protein turnover.
But look at our original observations. Our original observations said that branched-chain amino
acids have not increased. So this is a contradiction. So then that means that what we have
observed cannot be because of protein turnover in muscle. This is exactly how we proceed.
So then my conclusion is that this cannot have happened because branched-chain amino acids
actually have not increased in the blood, so this path is actually -- it can be eliminated. Yeah.
>>: Isn't there some sort of noise that maybe a probabilistic interpretation that might work better
than just culling a whole branch from the tree (inaudible).
>> Gultekin Ozsoyoglu: No. I mean, it's not really noise. These paths are very well known.
>>: I mean noise in the sense that you say there is no (inaudible), but what does that mean? I
mean, presumably there's always some amount of change --
>> Gultekin Ozsoyoglu: So -- so -- okay --
>>: -- (inaudible) thresholds (inaudible) --
>> Gultekin Ozsoyoglu: That's a very good question. Actually these, the techniques, this
high-energy physics equipment, mass spectrometry and so on, actually use very sophisticated
software too. And, actually, if you change the device that you use from -- that uses this
technique versus this technique, like move from mass spectrometry to gas chromatography, you
will have variations in there. Yes, that's true.
So from one device to another, they will vary slightly. But if you observe a fourfold increase,
then whatever noise there is in there, that is -- that's obviously (inaudible).
>>: Is your threshold for -- an up arrow says either fourfold increase or not?
>> Gultekin Ozsoyoglu: Okay, no. So what I observe over here -- so the technique that I use is
very blunt. This is the beginning, right? I don't use actual values. Because actual values you
cannot use. I only say that -- at this point I measured in the blood glutamine increase. This is
because immediately the glutamine has increased in muscle. I'm not saying that in another five
minutes or after a certain period of time glutamine will still be high in the muscle that --
>>: I guess I'm just trying to figure out how you define increase with respect to your raw
measurements. When you say that this increase, what's your definition?
>> Gultekin Ozsoyoglu: Okay. So I'm going to model the reactions. In other words, these
reactions -- each reaction has an input and an output, a substrate and a product. And, actually, if substrate
increases, product also increases.
>>: I mean, you're starting with measurements for a mass spectrometer or something like that --
>> Gultekin Ozsoyoglu: Right.
>>: -- so for each node there you get a raw -- a real value measurement.
>> Gultekin Ozsoyoglu: Only in biofluids. These are -- this isn't an organ, now, muscle. You
cannot know, you cannot measure this. So, therefore, that's why it's a limited technique. You
cannot do any more than that. Unless you -- as I said, you get a clamp and then squeeze the
muscle and then extract the metabolites that those -- this is exactly what they do to mice anyway,
why doing them, they measure them. Or you use cancer-related research, very disruptive
research.
Okay. So moving on. So we have -- we can verify that (inaudible) is invalid, simply because I
said this already, it should -- it really implies -- by following the network, it implies that BCAA,
branched-chain amino acids, must have increased, but -- and then therefore they must have
increased in the blood, but they did not, so therefore I eliminate this.
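The elimination step described here can be sketched in a few lines of Python. This is only a minimal illustration, not the speaker's system: the names `OBSERVED_BLOOD`, `P1_MUSCLE`, `P2_LIVER`, `first_conflict` and `status` are made up for the example, only the direction of each change is kept (as in the talk), and the simplified 1:1 transport assumption is taken literally, so a change derived in an organ is compared directly with the blood observation. Urea is left out of the observations.

```python
# Direction of change in blood relative to the control subject:
# +1 = increase, 0 = no change, -1 = decrease.
OBSERVED_BLOOD = {"glutamine": +1, "alanine": +1, "glucose": +1, "BCAA": 0}

# A hypothesis is an ordered chain of derived events: (metabolite, organ, direction).
# P1: protein turnover in muscle; P2: urea cycle problem in liver.
P1_MUSCLE = [
    ("glutamine", "muscle", +1),
    ("glutamate", "muscle", +1),
    ("BCAA", "muscle", +1),
]
P2_LIVER = [
    ("glutamine", "liver", +1),
    ("ammonia", "liver", +1),
]


def first_conflict(chain, observed):
    """Return the first derived event that contradicts a blood observation, or None."""
    for metabolite, organ, direction in chain:
        # Assumed 1:1 transport: a change in the organ implies the same change in blood.
        if metabolite in observed and observed[metabolite] != direction:
            return (metabolite, organ, direction)
    return None


def status(chain, observed):
    """A chain with a conflict is invalid; otherwise it may (not must) be valid."""
    if first_conflict(chain, observed) is not None:
        return "invalid"
    return "may be valid"


for name, chain in [("P1", P1_MUSCLE), ("P2", P2_LIVER)]:
    print(name, status(chain, OBSERVED_BLOOD))
```

P1 is rejected because it requires a BCAA increase that was not observed in blood; P2 survives only in the weak sense that nothing observed contradicts it.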
On the other hand, the second part over here may be valid. It may be valid. We are not saying
that it's valid; we are only saying it may be valid. So this path, glutamine in blood has increased
and it's because glutamine in liver has increased and it's because ammonia has increased. And if
ammonia increases, then there's something wrong in your urea cycle. So this is what this whole
thing -- it says that.
So let's take a look at this. In -- glutamine has increased, but you see that glutamine is produced
by liver and released into the blood. And glutamine increase may be because of NH3, ammonia,
over here. And ammonia is actually consumed by the urea cycle. If the urea cycle doesn't
consume ammonia, then there's an excess ammonia over here and it results in larger amounts of
glutamine and, then, therefore, you will observe (inaudible) yeah.
>>: (Inaudible) this one in your work you are just classifying as increase or decrease.
>> Gultekin Ozsoyoglu: That's right.
>>: You're not actually worrying about the actual amount --
>> Gultekin Ozsoyoglu: Right.
>>: (Inaudible.)
>> Gultekin Ozsoyoglu: The whole metabolomics society at this point only deals with -- for
normal people and for -- with only increases and decreases. You can actually go to actual
concentration values if you are doing cancer research, and you can actually measure things in
cells, obviously, not on cancer patients.
>>: Even with that sort of (inaudible).
>> Gultekin Ozsoyoglu: No. Actually, if you are actually looking at cell, you can actually
observe within a given cell metabolite, concentrations improves, and then what they do, they
actually try to -- some of them, actually, they do dynamic analysis; they define -- they fit it to a
set of differential equations and then they pass judgment on the basis of the fact that if you --
colon cancer, this colon cancer cell behaves this way to this drug and so on. But not for regular
patients. I mean, if you are getting your blood measurements, that's all you have. That's all you
have. Yes.
>>: (Inaudible) independent of metabolites (inaudible)?
>> Gultekin Ozsoyoglu: Okay. Yeah. You know, seven years ago when I started this, this is
exactly what I asked my genetics researchers. Yeah. You can -- I can abstract it to you. I will
abstract it to you. But whatever it is that I do will be very faithful to the underlying biology. If
it's not faithful, then you're faking yourself.
I had a meeting with Richard Hanson, who is a great guy, and I said that I had this idea about an
extension of this, and I said, I'm writing a grant proposal busily right now, he needs a copy. And
I said, Richard, can we do this? And he looked at me, that's -- he said, That's BS. You should
never go that way. They will immediately kill your proposal. The biochemists will kill your
proposal. I threw it away right away.
In other words, whatever abstractions that you have has to be extremely faithful. Also, you can't
speak in naive terms. You know? When you have a reaction, you don't talk about an input to the
reaction and output to the reaction. You have to talk about the substrate to the reaction and a
product to the reaction, activators and inhibitors and so on.
Okay. So moving on. So, actually -- wait. Well, okay. I think I'm going to go back. So this
actually -- so all we do is given a possible set of hypotheses we invalidate them and link them to
physiological conditions. I think I need to go a little bit faster because the example is taking -- it
has taken half an hour. And then we can -- those we cannot eliminate. They may be valid. We
are not saying that they are valid; they may be valid.
So I'm going to skip the third hypothesis. Actually, the third hypothesis that there's a problem
in -- this was -- I think this was -- this was gut, I think. And then P4 must be kidney. You can
invalidate them as well. You can eliminate them.
So the approach is automated ways of eliminating hypotheses from among a list of likely
hypotheses, and we managed to eliminate three of them and one of them may be valid. And a
broader approach is actually to look at all hypotheses. Remember, I said that this is not a path
that's taken frequently in your muscle. Your muscle has very little ammonia, so therefore
glutamine cannot be produced in significant amounts through the ammonia, so therefore
that's not a path that you need to take. But if you are sick, actually, if your body is under stress, your
metabolic network is very flexible. It starts compensating and it's not -- how it compensates is
not understood. It is indeed possible that -- well, ammonia is our only example, but some other
path may be active, may become active because you are sick. So then all bets are off.
So, anyway, the broader approach is to look at all hypotheses and provide them to the
researchers. If you look at all hypotheses that are invalid, with 300 measurements, you can go
down to about 200 hypotheses, which is still large.
So let's take a look at the abstraction. This is the abstraction now. Now we are there. There is a
metabolite. This is an input to this reaction. Each reaction is catalyzed, controlled, by an
enzyme, which is (inaudible). And then it produces this metabolite. So M2 is metabolized into
M6, and if this has increased, this also has increased. I'm sorry?
>>: (Inaudible.)
>> Gultekin Ozsoyoglu: M4. What did I say?
>>: M6.
>> Gultekin Ozsoyoglu: Oh, I'm sorry. M4. And M4 is a substrate, or input, to this reaction,
catalyzed by this enzyme, and then it produces M6. So you see that these two reactions use the
same substrate, same input, whereas over here or over here, this reaction uses two inputs, two
substrates, and they are called co-substrates. This is a simpler model. In reality, this enzyme
over here is controlled -- the effectiveness of the enzyme is activated or inhibited by activators
and inhibitors, which complicates this model. We model them as well, but I'm not going to talk
about them in this talk.
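One way to picture the abstraction is as a small reaction graph. The sketch below is an assumption-laden illustration: the slide is not visible in the transcript, so the three reactions, the enzyme labels `E1` to `E3`, and the helper names `Reaction`, `downstream` and `co_substrates` are hypothetical, chosen only to match the relationships mentioned in the talk (M1 feeds M2, M2 feeds M4, and M4 together with co-substrate M5 feeds M6). Activators and inhibitors are ignored, as in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reaction:
    name: str
    enzyme: str          # the enzyme that catalyzes the reaction
    substrates: tuple    # inputs; more than one means co-substrates
    products: tuple      # outputs


# Hypothetical toy network (not the speaker's database).
NETWORK = [
    Reaction("R1", "E1", ("M1",), ("M2",)),
    Reaction("R2", "E2", ("M2",), ("M4",)),
    Reaction("R3", "E3", ("M4", "M5"), ("M6",)),
]


def downstream(metabolite, network):
    """All metabolites reachable by following reactions from substrate to product."""
    seen, frontier = set(), [metabolite]
    while frontier:
        current = frontier.pop()
        for reaction in network:
            if current in reaction.substrates:
                for product in reaction.products:
                    if product not in seen:
                        seen.add(product)
                        frontier.append(product)
    return seen


def co_substrates(metabolite, network):
    """Other substrates consumed together with `metabolite` in some reaction."""
    partners = set()
    for reaction in network:
        if metabolite in reaction.substrates:
            partners.update(s for s in reaction.substrates if s != metabolite)
    return partners


print(sorted(downstream("M2", NETWORK)))      # metabolites downstream of M2
print(sorted(co_substrates("M4", NETWORK)))   # co-substrates of M4
```

In the river analogy, M2 is upstream of M6 exactly when M6 is in `downstream("M2", NETWORK)`.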
So the way in which we will talk is if it's -- this is a river. So M2 is upstream to M6, and
therefore M6 is downstream to M2. I will repeatedly use the same analogy. So this is essentially
a graph network, right? So what we have observed are observed events -- we say that this
metabolite is observed, it's an observed event -- to have this concentration level change, increase
by X-fold, decrease by X-fold, or no change. This is the only information that we use. And then
so -- so you can actually say that two hours after -- this is just an example -- two hours after the
treatment with a certain drug the level of M4 has increased by twofold, the level of M5 has not
changed, and now we are going to derive events in terms of changes, level changes of metabolites
in organs, in tissues.
So this cannot be measured. And at this point we are not -- they are not of interest to us. We
only want to eventually validate or invalidate different hypotheses or different patterns.
So essentially we will derive events using this reasoning. A metabolite -- concentration of a
metabolite may increase either in the blood or in the organs. It's because it's produced more,
because substrates, they're produced more. The inputs of that remote substrate -- they were
more -- larger number of -- larger amounts of substrates that -- for the production of this
metabolite. Or it's consumed less. Because it's consumed less, then its concentration level stays
high. Okay. This is just one reasoning. Or you can do the reverse reasoning over here. So then
you can do a cascading effect over here as well.
If M6 has increased, then going backward, because of this reaction, M4 has increased. This is a
perturbation analysis, (inaudible) behavior, and larger amounts of M4 is because M2 has
increased, so I have a cascading effect over here of M2 over M6. So this is a cascading effect.
And then you can negate the increases and decreases.
This is the reasoning. If a metabolite is observed to increase, it's either produced more or
consumed less, right? That's all there is to it. There are no other options.
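The two-way reasoning (produced more, or consumed less) can be written as a one-step rule over a reaction list. This is a hedged sketch under stated assumptions: `REACTIONS` is a hypothetical toy network, `explain` is an illustrative name, and directions are encoded as +1 for increase and -1 for decrease, which is an encoding chosen for the example rather than anything given in the talk.

```python
# Hypothetical toy network: each reaction is (substrates, products).
REACTIONS = [
    (("M1",), ("M2",)),
    (("M2",), ("M4",)),
    (("M4", "M5"), ("M6",)),
]


def explain(metabolite, direction, reactions):
    """One-step candidate explanations for a change in `metabolite`.

    direction is +1 (increase) or -1 (decrease). Each candidate is a tuple
    (reason, other_metabolite, implied_direction).
    """
    candidates = []
    for substrates, products in reactions:
        if metabolite in products:
            # Production side: a substrate changed in the same direction.
            for substrate in substrates:
                candidates.append(("production", substrate, direction))
        if metabolite in substrates:
            # Consumption side: the reaction ran less (or more), so its
            # product changed in the opposite direction.
            for product in products:
                candidates.append(("consumption", product, -direction))
    return candidates


# M4 observed to increase: either M2 went up, or M4 was consumed less (M6 down).
for candidate in explain("M4", +1, REACTIONS):
    print(candidate)
```

Negating `direction` gives the mirrored reasoning for a decrease: produced less or consumed more.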
So if we model this in various different ways -- I'm going to skip these. This is a formalization.
I do have the paper. I sent the paper to Surajit. You're welcome to take a look at it. We
submitted it. It's being reviewed by Journal of Computational Bioinformatics, computational
biology, I think.
So perhaps this is an observed event. This is in the blood, but these are in an organ, so we chase
these backward and forward. So if M4, let's say, has increased by twofold, it may be that M2 has
increased or it may be that M2 -- the production of M2 has increased and then the production of
M5, which is a co-substrate to this, has decreased. If this is decreased and then this reaction
did not use enough of M4, both of these are needed, so therefore it implies a parallel change.
The M5 has decreased, and therefore M6 has decreased. These are your alternatives.
So then also you can incorporate dietary intake to this and physiological conditions. There are
certain metabolites that our body does not provide, so they are taken externally through dietary
intake. So we modeled them. Essential amino acids are actually the metabolites that are not
produced by our body. You have to take them as part of your diet. And then also, as I -- I gave
this as an example, if there is a physiological process, as a result of that, the metabolite
concentration levels change, such as protein turnover that I illustrated.
So we capture these, but forget this modeling over here. I'll give an example over here. Dietary
intake increases, so it produces -- it has this production event. It produces more of tryptophan in
blood. And protein turnover increases alanine in muscle. So let's move on.
You can also -- you can actually -- you can model additional external events in terms of
physiological changes as well. If your calcium dietary intake increases, then this results in an
increase -- increase intake. So let's move on.
Let's characterize what do we mean by conflict. Conflict is that if I start with an observed event
and then later on I follow metabolic pathway, I end up with the conclusion that contradicts with
this observation. Instead of decrease I have an increase. Then that's a contradiction. Then I
need to stop at this point. This is the characterization of conflicts. So we can start with these
observed events and chase them backward and forward and horizontally through the
co-substrates across the network. And I can have a closure of all these events, right? And this is
indeed what we do. With a smaller network right at this point in time. We don't have the full
network.
And so this -- we refer to this as the closure of an event. And then I can define the closure tree.
If I start with this event -- let's say this metabolite has increased. And then this second event is a
child event, because -- child of this parent event because it's derived from the parent event, so
then I can define a closure tree. I actually -- if you compute everything, then it's a closure tree. If
you don't compute everything, which sometimes we cannot, then it's a hypothesis tree, it's
incomplete.
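A rough sketch of how such a tree of derived events might be built, with conflicts marked where a derived event contradicts an observation. Everything here is an assumption made for illustration: the toy `REACTIONS` network, the observations (M4 increased, M5 unchanged, borrowed from the earlier drug-treatment example), and the names `children`, `conflicts` and `build_tree`. To keep it short, only the backward step (a substrate changed the same way) and the horizontal step (a co-substrate changed the opposite way) are chased; the forward step is left out.

```python
# Hypothetical toy network: each reaction is (substrates, products).
REACTIONS = [
    (("M1",), ("M2",)),
    (("M2",), ("M4",)),
    (("M4", "M5"), ("M6",)),
]
# Observed events: +1 = increase, 0 = no change, -1 = decrease.
OBSERVED = {"M4": +1, "M5": 0}


def conflicts(event, observed):
    """A derived event conflicts if the metabolite was observed with another direction."""
    metabolite, direction = event
    return metabolite in observed and observed[metabolite] != direction


def children(event, reactions):
    """Child events derived from a parent event in one step."""
    metabolite, direction = event
    derived = []
    for substrates, products in reactions:
        if metabolite in products:
            # Backward: a substrate changed in the same direction.
            derived.extend((s, direction) for s in substrates)
        if metabolite in substrates:
            # Horizontal: a co-substrate changed in the opposite direction.
            derived.extend((s, -direction) for s in substrates if s != metabolite)
    return derived


def build_tree(event, reactions, observed, on_path=frozenset()):
    """Expand an event into a tree; conflicting children are kept as marked leaves."""
    node = {"event": event, "conflict": False, "children": []}
    on_path = on_path | {event[0]}
    for child in children(event, reactions):
        if child[0] in on_path:
            continue  # do not revisit a metabolite already on this path
        if conflicts(child, observed):
            node["children"].append({"event": child, "conflict": True, "children": []})
        else:
            node["children"].append(build_tree(child, reactions, observed, on_path))
    return node


tree = build_tree(("M4", +1), REACTIONS, OBSERVED)
print(tree)
```

With these inputs the branch through M2 and M1 survives, while the branch that needs M5 to decrease is marked as a conflict because M5 was observed unchanged.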
So to give you an example over here for this very simple network over here, metabolic network,
let's say that we have observed that M4 has increased by twofold. There's a typo over here.
Anyway, so if you only look at this path over here, increase of M4 is because M2 has increased,
so it corresponds to this path over here. M2 has increased. Increase of M2 is because M1 has
increased. And then M1 can only increase. Perhaps M1 is an essential amino acid. M1 can only
increase because of dietary intake. This symbol defines dietary intake, increase in dietary intake.
So this is a valid -- maybe valid path.
On the other hand, you can see the red ones over here are the contradictions and you can
eliminate them from this completely, and this would give you the closure tree for this specific
subnetwork.
So the notion of consistency is that in your closure tree, you don't have the same metabolite
observed in a contradicting manner. That's consistency. Minimality means that you don't have
your -- in your closure tree the same metabolite increasing in two different spots, so your closure
tree is a minimal tree.
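The two conditions can be sketched as simple checks over the events in a tree. This assumes each event is a `(metabolite, direction)` pair; the function names are hypothetical and the formal definitions are the ones in the paper, not these.

```python
from typing import Dict, Iterable, List, Tuple

Event = Tuple[str, str]  # (metabolite, direction)


def is_consistent(events: Iterable[Event]) -> bool:
    """True if no metabolite appears with two contradicting directions."""
    seen: Dict[str, str] = {}
    for metabolite, direction in events:
        # setdefault records the first direction seen and returns it afterwards.
        if seen.setdefault(metabolite, direction) != direction:
            return False
    return True


def is_minimal(events: List[Event]) -> bool:
    """True if no event (same metabolite, same direction) occurs twice."""
    return len(events) == len(set(events))
```

For example, a tree containing both `("M4", "up")` and `("M4", "down")` fails the consistency check, and a tree containing `("M4", "up")` twice fails the minimality check.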
So once you obtain this consistent and minimal tree, then from this you can decide
which paths are maybe valid, which paths are invalid.
So I will skip this because we are running out of time. So for this specific path over here, it's this
one over here and this is dietary increase. So the problem statement is this: given an organism,
and we are only dealing with humans at this stage, and a set of observed events, we compute all
hypotheses. And hypothesis is a root to leaf path. I start with an observation. I end up with
either a dietary -- consistent dietary change or consistent physiological change or a consistent --
another observed event. Then, essentially, instead of all these hypotheses, we find them and we
would like to rank them and would like to give them to the biochemist or clinical researcher.
This is a problem. Formally, it's defined in the paper. So I'm sort of giving you the intuition. Of
course, I have -- along the way I have taken liberties. The transport mechanism is not as simple
as I have modeled. But this is what current metabolism researchers do. If your blood glucose is
high, I immediately said that -- what is this? In muscle glucose is increased. We know that that's
not the case, right? I mean, glucose is gated by insulin. Glucose transport process from blood to
an organ is extremely complicated. It involves five or six different types of cells on your
pancreas. It involves four or five different mechanisms.
But modeling -- so we cannot really model this, but I simply -- what we are proposing is the next
stage of our development is actually define possible (inaudible) conditions over here for glucose
to be transported into a tissue. This we can do, and that's what we are planning to do next. Our
current transport mechanism is essentially if you observe an increase, then there's an increase in
downstream or there's an increase upstream in related organs. So these details, I will skip that.
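The simple transport rule described here can be sketched as follows, assuming a hypothetical table of which organs exchange metabolites. The `TRANSPORT_LINKS` table and the function name are illustrative assumptions; the rule ignores gating conditions such as insulin, as the speaker notes.

```python
from typing import Dict, List, Tuple

# Hypothetical adjacency of compartments that exchange metabolites.
TRANSPORT_LINKS: Dict[str, List[str]] = {
    "blood": ["liver", "muscle"],
    "liver": ["blood"],
    "muscle": ["blood"],
}


def transport_candidates(metabolite: str, organ: str,
                         direction: str) -> List[Tuple[str, str, str]]:
    """Derive the same change for the metabolite in every related organ."""
    return [(metabolite, other, direction)
            for other in TRANSPORT_LINKS.get(organ, [])]


print(transport_candidates("glucose", "blood", "up"))
```

Each derived event is only a candidate: it still has to survive the contradiction check against the observations.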
Another issue that I already mentioned to you, the dynamic analysis is called flux balance
analysis in biochemistry. This is a very complex and difficult process, but through years and
years of research in humans, as I said, if -- the biochemists know, for instance, that if arginine is
produced in this certain organ, arginine is consumed more into a urea rather than through this
reaction into guanidinoacetate over here. So actually I can say that arginine
consumption, metabolizing arginine is more on this path rather than this path. The numbers are
rather arbitrary, because it's not known. Only a truly metabolic biochemistry specialist can really
tell you something that's close to this, but it's rather arbitrary.
The point here is that in my closure tree, I can actually rank these paths because of this
knowledge. The likelihood of this happening is more; the likelihood of this happening is less.
So we incorporate this into our system.
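One way to sketch this ranking, assuming each branch carries a ratio and a path is scored by the product of the ratios along it. The 0.9 and 0.1 values are the arbitrary numbers from the talk; `alternative_product` is a placeholder name, and the product scoring rule is an assumption, not necessarily how the system combines ratios.

```python
from math import prod
from typing import Dict, List, Tuple

# Assumed branch ratios; steps without a known ratio count as 1.0.
FLUX_RATIO: Dict[Tuple[str, str], float] = {
    ("arginine", "urea"): 0.9,
    ("arginine", "alternative_product"): 0.1,
}


def path_score(path: List[str]) -> float:
    """Multiply the ratios of the consecutive steps along a path."""
    return prod(FLUX_RATIO.get(step, 1.0) for step in zip(path, path[1:]))


def rank(paths: List[List[str]]) -> List[List[str]]:
    """Order paths from most to least likely under the assumed ratios."""
    return sorted(paths, key=path_score, reverse=True)


ranked = rank([["arginine", "alternative_product"], ["arginine", "urea"]])
print(ranked[0])
```

With these numbers the path through urea is ranked ahead of the alternative branch.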
Physiological conditions I will skip. We actually do -- another thing that we do is XPath-like
analysis. We actually look at these different hypotheses, and then if a subpath occurs in these
different hypotheses, many times, many, many times, then we say that this is perhaps a more
likely subpath, can you make use of this. We give this to the biochemist. We do it at this
point in time, so therefore we percolate these hypotheses up in possible hypotheses lists or
invalid hypotheses lists, so we do some (inaudible) in that sense. But, as I said, Richard Hanson
doesn't like this at all. He says this doesn't make sense. But we do it anyway.
So I will skip this. So this a subpath that occurs frequently here, and then as a result we put it
into our interesting event set.
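The subpath summarization can be sketched as counting contiguous subpaths that recur across hypotheses. This assumes a fixed subpath length and a count threshold, both hypothetical parameters; the talk does not say how the system chooses them.

```python
from collections import Counter
from typing import Dict, List, Tuple


def frequent_subpaths(hypotheses: List[List[str]], length: int,
                      min_count: int) -> Dict[Tuple[str, ...], int]:
    """Return subpaths of the given length found in at least min_count hypotheses."""
    counts: Counter = Counter()
    for path in hypotheses:
        # A set so that a subpath counts once per hypothesis.
        windows = {tuple(path[i:i + length])
                   for i in range(len(path) - length + 1)}
        counts.update(windows)
    return {sub: n for sub, n in counts.items() if n >= min_count}


hyps = [["A", "B", "C", "D"], ["X", "B", "C", "Y"], ["A", "B", "Z"]]
print(frequent_subpaths(hyps, length=2, min_count=2))
```

Subpaths that pass the threshold would be the candidates for the interesting event set.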
I will skip all of these and go to our experimental evaluation. We tested this for a certain
disease, actually for a paper. In clinic there's a researcher, and we use his data and validated our
approach, but his paper's not published; he would not allow us to discuss this. But, essentially, his
data had much more than 34 metabolic measurements, but we only use 34 of them because our
database is a prototype database at this stage because our original system that we managed for
seven years does not have location information. We don't distinguish between organs.
So we had to produce a separate database for that, so therefore we have a small number of
pathways, and the pathways that we have, altogether I think we have 50 pathways right now and
we have 28 pathways in this metabolism -- I mean amino acid metabolism; 11 pathways in carbohydrate
metabolism; 11 pathways in lipid metabolism. And this is what we use, and therefore we use
this smaller number of observations.
And this is the data that we have used. We abstracted the actual increases with increases and
decreases. This is yielded, as I said. And with this data, the number of hypotheses -- remember
that we go backward and forward. At any node we go backward and forward, backward and
forward. So the number of hypotheses with our 50 pathways went up to 130,000 different
hypotheses. We eliminated the invalid ones. The invalid ones stayed up at about 3,000. But
then --
>>: (Inaudible.)
>> Gultekin Ozsoyoglu: I'm sorry?
>>: By 30,000.
>> Gultekin Ozsoyoglu: 30,000, yeah. 30,000. And the maximum hypotheses length was 70.
And then 70, when we showed this to Richard Hanson, Richard Hanson said this is totally
useless. Hypotheses length of 70 is incomprehensible even to the expert biochemists, metabolic
experts. So we have to do something better. So this describes how we eliminated in different
types of metabolism.
But we -- the number of -- using the observations that we had. Remember, the observations of
patterns, we achieved 95 percent reduction with 40 measured metabolites, and then when we increased
the metabolites to 80 metabolites, we eliminated about 99.9 percent of the hypotheses down to
300 hypotheses. But these 300 hypotheses are too many as well. I mean, it's humanly not possible to
interpret 300 hypotheses. The goal is to take it down to, let's say, somewhere, 20 -- you know,
case by case, 20 or something so that they can interpret these. And that's what we want to do.
And I think we can do it if we can model the patterns more and more.
So when we use summarization, I said these are XPath-like subpaths in metabolic network. We
actually reduced the total number of hypotheses by 99 percent, even with 34 metabolite
observations.
So I will skip this. As I said -- this shouldn't have been here. I'll skip this. Sorry about that. We
use it for actually for this specific -- the data came from there.
So this is our system. This system is actually a revised version of our system that's used -- it's a
production system, it's a development system. It's at this side, this side of the development
server. So these are the pathways processed. And you can browse them and so on. Our current
system does that. The new part that we built for this research is this (inaudible) prediction. So
users can upload the observations as an XML file. There's a sample over here that they can use,
actually. We have a sample, we provide a sample. And then they may actually or they may
manually increase, add what they observed one by one. And then when they do that, then -- and
also we use Ajax technology over here. We actually provide what's available. Instead of all the
metabolite, there are 2500 metabolites. And then you may actually specify them one by one in a
specific organ.
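The talk does not show the XML format, so the following is only a sketch under an assumed schema: an `observations` root with one `observation` element per measured metabolite, carrying `metabolite`, `organ`, and `change` attributes. All element and attribute names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Hypothetical sample file; the real sample provided by the system may differ.
SAMPLE = """<observations>
  <observation metabolite="glucose" organ="blood" change="increase"/>
  <observation metabolite="arginine" organ="liver" change="decrease"/>
</observations>"""


def load_observations(xml_text: str) -> List[Tuple[str, str, str]]:
    """Parse observations into (metabolite, organ, change) triples."""
    root = ET.fromstring(xml_text)
    return [(o.get("metabolite"), o.get("organ"), o.get("change"))
            for o in root.iter("observation")]


for triple in load_observations(SAMPLE):
    print(triple)
```

Each parsed triple corresponds to one observed event in a specific organ, which is the starting point for the closure computation.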
So this is let's say -- after uploading it, this is what you would have what's uploaded by the user.
These are the metabolites. These are the actual changes. And then you can choose whichever
one to be your closure tree for exploratory (inaudible). I have ten minutes. Slightly less than ten
minutes. Moving up.
And then when you click generate observation-supported hypothesis, it produces this. You can
save them. These are actually -- these are our paths, starting in -- with glycolate in blood,
glycolate in liver, colate, colate, glycolate biosynthesis in liver and so on and so forth. This is
what Richard Hanson says useless sprouting of pathways.
And then we visualize our system. This is also available. You can zoom in, zoom out and then
do lots of things. This is an original of our visualization in biological pathways.
We actually -- this is a new addition to our current system. It's being worked out. As I said, all
the metabolic network data sources on the Web, highly respected (inaudible) KEGG. We are
licensed to use their KEGG data. They never distinguish between liver and muscle because --
and they have to distinguish between liver and muscle, because the same -- you see that this
reaction over here, this pathway occurs in liver, blood, and muscle. It -- and there are other
pathways as well, so we measure -- visualize this, and then we do the (inaudible) prediction with
this too. Very well.
What have we done? We have actually modeled the observed measured metabolite changes. We
chased them in the metabolic network across different organs using a rather simple transport
mechanism, and then even with that we managed to eliminate a significant amount of
hypotheses. And we have also, as I said -- remember 0.9 and 0.1, this path is more favorable
than this path, that's actually called flux ratio analysis, so we incorporated flux ratio. The
limitation of the system, of course, it does not utilize the exact amounts. But utilizing exact
amounts is only possible with full kinetics, and full kinetics, it will not be -- I don't think -- will
not be discovered in our lifetimes.
So I think I would like to stop here rather than say more things and then maybe answer your
questions.
>>: Are there any competitive systems that are doing this kind of metabolic pathway analysis?
>> Gultekin Ozsoyoglu: Absolutely not. We are the first. All the companies, professional
companies -- actually, this company (inaudible) I think, the minute they learned that we are
doing this they said, Can you do this for us? Give us all the paths between two metabolites in
different organs, and we'll pay you. We haven't paid attention to it. But mapping these
metabolites through the network is not done yet. All the analysis that we have done is extremely
new. There are five or six Web-based data sources for metabolomics research, and there is one
highly respectable one in my alma mater out in University of Alberta. They are funded by $35
million from Canada. And on the site they say they have pattern information. If you have these
and these observations, then it's probably this disease. And it's actually FTP downloadable. We
downloaded their data.
>>: But it's pattern based. They're not doing network analysis.
>> Gultekin Ozsoyoglu: Nothing. Nothing. But I can use those patterns in my system, and then
I can actually do more. Of course, this is just the beginning, right? I mean, into the future, you
have to do data mining. Saying that this path is 0.9, the 0.1 is something that I invented. No one
knows. But my co-researcher, Richard Hanson, says that you know the amount of ammonia in
the muscle; it's minuscule. So only he knows it. And he's a world-class expert in this area.
>>: Can I just add something? I think this is motivated by the fact that it's very easy to measure
so many metabolites in a simple blood test in the lab. But then nobody knows what to do with
all this data. And there are very few experts, frankly, who can interpret this data. So this is --
motivation for this is to maybe capture some of this knowledge of these experts and help people
to sort out and eliminate or figure out the alternatives. There will be, of course, lots of positives
in the results of this. But the idea is just to limit the scope so that maybe there will be
meaningful (inaudible).
>>: You can see, though, why Hanson would want to see relatively short paths. Because if you
can't under- -- as you say, this is such early-stage stuff. Nobody's going to trust the computer's
analysis and go off and do what it says. So if it isn't -- if it isn't a line of logic you can really
understand, then what's the point?
>>: But, see, the idea is that -- on one hand you can't do anything with it, because you don't
know what to do. On the other hand, you have a maybe smaller scope. And as there are more
data embedded into this system, it will produce better results. But this is a very early stage of the
(inaudible).
>> Gultekin Ozsoyoglu: Our cancer researchers -- we have a cancer center. They are funded by
another $50 million over five years. Our cancer researchers went crazy on this. Our cancer
researchers really went crazy on this. They said actually cancer changes the metabolic networks
in completely unexpected ways. They say, Can I do (inaudible) over here? This path that's
insignificant suddenly becomes significant. You're a cancer patient, suddenly your muscles are
collecting ammonia because you have cancer. So what does it mean? How can I fix this? So
this can be used for cancer, but we are way early for that. Maybe in another ten years.
>>: Well, this -- as you said, this is sort of early days, and the question I have is the level of
modeling that they're currently doing, the level of analysis you're doing, are you beginning
already to get useful results? Can you narrow it down to (inaudible) --
>> Gultekin Ozsoyoglu: Right, right. Right. So there is one example that we did all the
experiments and then we said we cannot -- we were -- we were told -- we are told not -- we
cannot use it. There is a certain disease, and there's a researcher in clinic -- his results, his
observations, his conclusions, we produced them. Once we have them as patterns, we produce
them. Among those 300 hypotheses, we percolate to the top his observations. But we can't do
more than that, right? Ultimately it has to be interpreted by someone who's an expert.
>>: He's studying (inaudible) disease.
>> Gultekin Ozsoyoglu: You're not supposed to say that.
>>: No, but my point -- I believe one of us asked you this, was basically is the way you -- the
research that you are now at, what stage is it? Is it at the stage where it's beginning or ready to
produce some useful results (inaudible) --
>> Gultekin Ozsoyoglu: Right. So we are collaborating with Henri Brunengraber; he's doing
cancer research, colon cancer research. He's in the department of nutrition. And he's a very well
known metabolomics researcher. So we are trying to match our system where it's producing
results consistent with what he has with colon cancer cells. We just got the data and we are
working on that. But the one that we verified is the one that (inaudible). We literally found the
right ones. You had a question?
>>: Can you take the output, then, and just say if I just have these two additional tests, I could
produce the output (inaudible)?
>> Gultekin Ozsoyoglu: No. I mean, we really use all 34 --
>>: (Inaudible) had some additional tests to run.
>>: Additional tests, sure.
>> Gultekin Ozsoyoglu: Right, right. In other words --
>>: (Inaudible) exception is.
>> Gultekin Ozsoyoglu: The other exception --
>>: This is why we can't -- it can be done.
>> Gultekin Ozsoyoglu: It can be done. The ultimate goal is this: You know that when you
look at cholesterol -- and I think we need to stop, right? Ultimately you know that when your
cholesterol level is high in your blood, that's a biomarker. And that immediately says, oh, you
better have -- change your dietary intake, you are prone to heart disease. What we would like
to do with this is, oh, I have this pattern of 12 of these metabolites. This is deadly. You may
have this disease. Or you may have these physiological issues, so you start watching out.
So, in other words, we would like to make a biomarker out of not just a single observation, this is
what current technology is, we would like to make a biomarker out of a pattern of, and we would
like to locate them first ourselves and then have them verified in the lab. This cannot be done
without a system like this.
>>: So (inaudible) question, which is, I mean, in terms of using this from a CS perspective,
would the increase that you found (inaudible) for you (inaudible) so for (inaudible) techniques
that were particularly (inaudible) --
>> Gultekin Ozsoyoglu: Okay. So we found out that we manage -- not this system, but the
original system we manage at a professional level. We use Version Control, CVS; we use
Bugzilla. Every week we get together with ten master's and Ph.D. students. We fix the bugs and then
our co-researchers, Richard Hanson and the others, come. Not Richard Hanson; the others come.
And then we fix this. But we found out that at the server side, SQL Server works really well.
There's no problem with that. We keep changing the database, though. It's amazing.
Continuously we change it. And then as we change, we each have to change the code (inaudible)
technology, it's perfectly okay. In terms of --
>>: The techniques? What are the techniques?
>> Gultekin Ozsoyoglu: The techniques? I think these are essentially down -- at heart, these are
graph databases. So there are certain -- I haven't really described everything. If you increase
this, which we did, into -- instead of 50 pathways into about 70 pathways, our system, even
though it runs (inaudible) main memory on the server, 16 gig main memory, highly powerful
servers, they cannot stop. They cannot finish the closure tree computation. So we will have to
use original indexing techniques. We are looking into that. Even with only main memory
computations. If there's the system -- because it's not just chasing forward in the network; go
forward, backward, backward, backward, go forward, backward, backward, backward. So at
each step when you move, move backward, and these are all possible hypotheses.
>>: How big is this graph?
>> Gultekin Ozsoyoglu: How big is this graph? The full known metabolic network, BioCyc, has
about -- it's on the Web. Right now --
>>: (Inaudible?)
>> Gultekin Ozsoyoglu: I'm sorry?
>>: Are you using all of --
>> Gultekin Ozsoyoglu: No, no, no. We're not using all of it. Because KEGG doesn't have
location information. What they say, they give you a pathway and say you cannot tell whether
this part is in this organ or it's completely in this organ. Actually, you can take a pathway, liver
to kidney, some reactions you miss. So we will have to collect all that information from the
literature. We only have the limited work.
>>: So is this --
>> Gultekin Ozsoyoglu: The number of pathways altogether, across all the metabolisms, KEGG
right now has about 140 pathways. The largest one is 150 reactions. And they have it across all
organisms. And we have it -- we make it available -- if you type "PathCase," you will reach our
production system. And we built that over seven, eight years. PathCase, right?
>>: Sure. I just wanted to know how many (inaudible) --
>> Gultekin Ozsoyoglu: So we have it at the first page. I think we have about 35,000
metabolites in that full network. I mean, I don't know the exact number. It's listed. It keeps
changing. Every three months we download the new version, we up -- change it into -- change
our database.
>>: So (inaudible)?
>> Gultekin Ozsoyoglu: We don't use any supervised learning over here. Around this --
>>: Why not?
>> Gultekin Ozsoyoglu: Okay. So this is -- there's not much to do in terms of --
>>: No, but (inaudible).
>> Gultekin Ozsoyoglu: Into the future. Yeah, yeah. Into the future. We -- around PathCase
we did work on protein annotation, protein network annotation. We're published in the
(inaudible) in the ISMB. Protein annotation we use machine learning, supervised learning. And
we had extensive (inaudible) models. This research is just in its infancy. But I'm hoping that
once I understand it long enough, next semester I will sit and take a biochemistry course.
Literally. I mean, seven years I've been working on it, but next semester I'll sit down and take
that biochemistry course.
>>: (Inaudible.)
(Laughter.)
>> Gultekin Ozsoyoglu: I cannot handle that. I will audit it. I will audit it. So, anyway, once I
learn it more -- you know, the main -- one of the issues is I'm too aggressive, right? I say, Can
we do this? And it says, You are wrong. But sometimes it's actually -- those -- if you do it, then
it appreciates that the change is mine. So I think that the problem with communicating with
biologists we found -- geneticists, biochemists we found out is that the ones that we deal with are
really experts. They are -- they are not at this same computational-thinking level that we are. On
the other hand, we can go off base and do stupid things as well.
>> Surajit Chaudhuri: Okay. Thanks, Tekin.
>> Gultekin Ozsoyoglu: Thank you.
(Applause.)