>> Surajit Chaudhuri: Good morning. I'm very delighted to have Professor Tekin Ozsoyoglu from Case Western University here. So I met Tekin I think 17 years ago when I visited Case Western (inaudible) looking for a job. So (inaudible) is also here today, so I'm very delighted to have both of them. Merhilo (phonetic) is my official host, so a lot of memories of those days. And Professor Ozsoyoglu is -- has been a very senior faculty member at Case Western and has been a contributor to our community for many, many years. He did his Ph.D. at the University of Alberta, Edmonton, and his current research interests are around databases, bioinformatics, and the Web. So today I'm going to learn something that I don't know. I know nothing about the bioinformatics area broadly, so it will be an education for me. So, please. >> Gultekin Ozsoyoglu: Thank you, Surajit. So this is the group that worked on this. I shouldn't really take credit for everything. Ali is a Ph.D. student of mine. He's with us for about -- has been with us for five years. He knows more about chemistry than I do. Arum (phonetic) is actually a master's student. He's finishing his degree and he's going to start working at Microsoft at the end of August. And this is me over here and this is Mack (phonetic). So what I'll talk about is metabolomic analysis. What is metabolomics? It's actually -- metabolites are small-weight -- small molecular-weight molecules that are products of various different metabolisms. And the metabolome refers to the complete set of metabolites in different tissues or organs. The number of metabolites is -- you can get different numbers if you ask different people, but I would say around 2500 of them. And metabolomics is the study of the distributions of the metabolome in biofluids. By that I mean blood, urine, et cetera. 
So the recent technological increases, mass spectrometry and gas chromatography and so on, have actually enabled us to -- have enabled biologists to measure these small-weight metabolites in biofluids -- blood, urine, saliva, et cetera. So the question is, when you can measure these and when you know what the normal values are and when they differ from normal values, what do they mean? So then you go and take a blood test -- you have about 20, 30 different measurements, and these are biomarkers, and so they won't metabolize. And you know what -- if a single metabolite has a higher value, such as ketone in your urine, you know that you have certain problems, the doctors know that. But the question is when you have 300 of these metabolites that are lower or higher than the normal values, what type of problems do I have or what type of physiological issues do I have, or maybe I -- my dietary intake has issues so I need to adjust them. So this is the question: What do they mean? There's no easy answer to this. The way in which this is done is actually our second -- the third author over here is a very well known, world-class biochemist , and his specialty is metabolic biochemistry. He can close his eyes and tell you what happens if you start with metabolite in this organ and how it interacts and how it changes, loses its carbon (inaudible), how it produces energy and so on. So he actually suggested this problem to us. And the standard approach is you ask this biochemist, and then he says, Okay, well, I know that if you have alanine increase, arginine decrease, glucose increase in the blood, it may mean one of these ten different possibilities. So our task is to do this computationally. Metabolic network itself is very complex, and different metabolism have different set of pathways, connections. Carbohydrate metabolism is about your pathways that actually deal with carbohydrate consumption. Lipid metabolism is with lipids, (inaudible) metabolism is with (inaudible). 
And all together the number of different pathways -- these are really specific functional units. You can view them as graphs in your body in different organs that do certain things. Each one of these different metabolisms actually involves a sophisticated, really complex set of reactions. So the question is, if we have this network available to us -- which we do, for eight years now we have been actually building and managing a metabolic network, it's on the Web, it's used across the world by biologists, we built it with Microsoft technology, on the server side at least. On the client side we use Java. So we already have this metabolic network, so what can you do with this metabolic network to help with this question over here, this fundamental question over here? This will become more and more important in the future because essentially even clinical doctors cannot answer these questions. If you have 2400, 200 of these increases and decreases in your blood, and these measurements are becoming cheaper and cheaper, even -- the devices are expensive, but once you get them, then the overall cost is very low, so it will be measured for lots of people for these reasons. And then the question is, how can we learn from these and what do these different values mean? This is the question that we deal with. Obviously, the true approach should be the following: You have the whole metabolic network dynamics figured out through a -- perhaps a very large differential equation, a set of differential equations, and then you look at the steady-state analysis of this network. But this is an incredible task. It cannot be done -- it will not be done in our lifetime. Right now the state of the art is four or five reactions, and there are thousands of reactions in a metabolic network. You can model it. You can come up with a partial differential equation to analyze its behavior, but you cannot do more than, let's say -- the most that I'm aware of is about 25 to 30 reactions. 
So you cannot use kinetics. You have to use something else. So our goal is to automate the interpretation of metabolomics data, and for that we will use our metabolic network database. And our metabolic network database actually has deficiencies, so we revised it. And then by learning from biochemists. This is our first goal. And if we can achieve this in the next three or four years, the next goal is to actually move forward and do more along the lines of data mining, along the lines of doing computational things. One of the problems that we encounter these days is -- you know, I have been using biochemists for about seven or eight years, but even with that, I say something to Mr. Richard Hanson. He's a great guy. He's an awesome guy. He says, That's stupid. So then we stop that and then we move on. So as we learn more, then we will be able to actually understand and provide computational techniques to them. This approach that I propose over here, Richard Hanson for nine months said it cannot be done. And then when we did it, he said, Oh, it's beautiful. Let's work more on it. And so we keep working on it. Every week we get together, we try to improve it. So we would like to answer questions of the type, what may have led to the increase or decrease in the concentration of a metabolite? So the way in which these measurements are available is -most of the time is with respect to a control subject. Control subject is a normal person. And then there's this person whose measurements are -- they differ from the control subject's measurements in terms of the concentration levels of the metabolite. So are there alternative hypotheses or scenarios, and we can produce this. Our result's consistent. I will just define what I mean by that. And then can we verify and score these different alternatives. 
The way in which we will define the hypothesis will be essentially a path in a graph database starting from the blood measurement of a metabolite, ending with the same blood measurement of another metabolite, and then looking at the increases and decreases. We will look at perturbation analysis. We make no claims about the steady-state behavior of the system. We are just saying that if there's an increase of this metabolite, there is, for a time period, an increase, a perturbation, in this metabolite and this metabolite in the organs. And you cannot do more than that. With humans, we have to target things for humans, you cannot really look at the metabolite concentration levels in organs, unless you put a clamp around the liver of a person and then squeeze it and then get the metabolites out. You can do it for mice and unfortunately for dogs, but no more than that. Yes. >>: Just to verify the scenario a little bit, you were talking about comparing this to the normal, normal -- >> Gultekin Ozsoyoglu: Yes. >>: -- person. How much variation is there among individuals and how much is a normal variation over time for a single individual? Is that understood? >> Gultekin Ozsoyoglu: That's not understood at all. Not at this stage. This is really -- this is really the beginning of this technology. We are really at the forefront of the technology (inaudible) extremely important for nutritional purposes into the future. Actually, we are also working with a metabolomics expert, Dr. Henri Brunengraber from Belgium. He's a well-known expert. And this information is not available at this point in time. As we know more, it will be much more important. And we will relate these increases and decreases to dietary problems, problems in specific organs, or to diseases. So I will start with an example. Let's say that in the blood, not that we measured 200 or 300 of these, but let's only start with five of them. Glutamine has increased. 
This is what you would -when you go through this exercise class they say they're protein shakes. I can give you glutamine, alanine and so on, right? So it's one of those. Glutamine has increased with respect to a normal person by fourfold. Alanine, another metabolite, has increased by twofold. Urea has increased by 0.54. Glucose, blood glucose has increased by 1.34, and this stands for branched-chain amino acids. These are shorter amino acids. And it is a metabolite. It has not changed with respect to a normal person. What do they mean? So, in the first place, if you start with glucose over here, you will not go too far. But if you start with glutamine, which is what we will do, you can actually start coming to some alternative conclusions. These, we intend to keep them in our database to start with. It's a daunting task to do this for all the metabolites across all the organs, but this is what we intend to do eventually. But then after that we will actually do a complete analysis: searching a network, looking at possible implications. To start with, let's say glutamine may increase, and these are the four possible physiological conditions, problems. It may be a problem in the muscle because of increased protein turnover, or it may be a liver problem, there may be an increased production of glutamine in the liver because of the urea cycle, which is another pathway within the metabolic network. There may be a decreased uptake by kidney or by gut. Okay. So this is a hand-drawn figure. We are coming close to this, but ultimately what we would like, our system to produce these at different levels for biochemists and also for clinical researchers. So these are the five metabolites that we just observed changes. And this is a simplified network. 
So, for instance, you see over here that glutamine is -- from here it gets transported into kidney, and then these are actual paths, but then this is really a pathway over here which I omitted because Richard Hanson said that that path is really -- it's reversed and it's (inaudible) utilized for this specific interaction, so we have a simpler version. It's another pathway, it's another pathway you see over here. They are really subgraphs. Anyway, so we are going to essentially do perturbation analysis and chase the increases and decreases of these metabolites in the blood backward and forward. It's not a forward chase. It's actually back -- forward and backward. We go in all directions. It will become clear what I mean in a minute. So the way in which we will do this is, again, we are following what Richard Hanson told us that we should be doing. He does consulting for metabolomics companies, this is big business, and they measure these things and then they ask what kind of problems (inaudible) these measurements tell us, and he looks at them -- he says, The way I do it, I know that I have some patterns. If glutamine has increased, alanine decreased, and -- I'm just making this up -- arginine has increased, then it's likely that these are the problems. Because he can close his eyes and go through these networks. I'm not kidding. He's amazing. On the board he immediately starts throwing reactions with the catalyzing -- with all the activators, inhibitors and so on and so forth. So he says that I look at -- I define a pattern as a set of metabolite changes that may be related to a physiological condition. So this is one metabolite change, ketones in urine. If you have relatives that have ketones, and if you have relatives that have diabetes, you would immediately recognize this. If you're not controlling your diabetes well, you will have ketones in your urine. These are actually short -- these are energy units. 
They are actually lovely. They are really needed for your body. But they are not good if they are being excreted by your body into your urine. That means that you have a problem. And then also if the blood glucose is above 200 milligrams per deciliter, then in all likelihood you have diabetes. So this is a pattern, a simple pattern. This is known by doctors. Doctors actually look at these immediately and say, ah-hah, this patient may have diabetes. So this is what -- we will make use of that in our problem. These four patterns actually are the patterns that he told us, which we discussed, right, pattern P1 is glutamine increase, protein turnover, muscle, liver problem with urea cycle and kidney and gut problems. So the way in which we -- computationally the way to find this is by chasing this. For instance, let's start with this. Glutamine has increased in blood. Assuming that the transport mechanism is 1:1 -- in other words, an increase of this glutamine means that it was produced in the muscle more and, therefore, it was excreted into the blood more. This is a very simplistic view of the transport mechanism, but this is what we will use. And this is what actually metabolomics researchers use at this point in time. They are a reasonable -- (inaudible) community they have -- every year they have a metabolomics conference. And actually we will present this work over there as well. So glutamine in blood has increased. It's because perhaps glutamine is increased in muscle. And you see that glutamate, another metabolite, has increased. It's pretty small, and then it resulted in a glutamine increase. And then following this, the glutamate increase is because the branched-chain amino acids have increased and this immediately means that there was a protein turnover over here. This is a problem in muscle. So then obviously you can say, hey, why didn't you go through this path rather than this path? 
We didn't go through this path because, even though the kinetic models are not available, through years and years of biochemistry research biochemists know that -- this is ammonia. Ammonia produces very little glutamine. The ammonia amount in your muscle is very small, whereas most glutamine production doesn't come from here, it comes from here. So this is an important observation. We've modeled this in our system. So, essentially, this is actually also hand-drawn, but we actually produced this on the fly. Glutamine has increased in blood. This means that there's an increase. There are four possible options. Perhaps glutamine has increased in muscle, and then this is caused because glutamate has increased in muscle. These are abbreviations for these metabolites. And glutamate in muscle is increased because branched-chain amino acid has increased and this is because there was a protein turnover. But look at our original observations. Our original observations said that branched-chain amino acids have not increased. So this is a contradiction. So then that means that what we have observed cannot be because of protein turnover in muscle. This is exactly how we proceed. So then my conclusion is that this cannot have happened because branched-chain amino acids actually have not increased in the blood, so this path is actually -- it can be eliminated. Yeah. >>: Isn't there some sort of noise, so that maybe a probabilistic interpretation might work better than just culling a whole branch from the tree (inaudible). >> Gultekin Ozsoyoglu: No. I mean, it's not really noise. These paths are very well known. >>: I mean noise in the sense that you say there is no (inaudible), but what does that mean? I mean, presumably there's always some amount of change -- >> Gultekin Ozsoyoglu: So -- so -- okay -- >>: -- (inaudible) thresholds (inaudible) -- >> Gultekin Ozsoyoglu: That's a very good question. 
Actually these, the techniques, this high-energy physics equipment, mass spectrometry and so on, actually use very sophisticated software too. And, actually, if you change the device that you use from -- that uses this technique versus this technique, like move from mass spectrometry to gas chromatography, you will have variations in there. Yes, that's true. So from one device to another, they will vary slightly. But if you observe a fourfold increase, then whatever noise there is in there, that is -- that's obviously (inaudible). >>: Is your threshold for -- an up arrow says either fourfold increase or not? >> Gultekin Ozsoyoglu: Okay, no. So what I observe over here -- so the technique that I use is very blunt. This is the beginning, right? I don't use actual values. Because actual values you cannot use. I only say that -- at this point I measured in the blood glutamine increase. This is because immediately the glutamine has increased in muscle. I'm not saying that in another five minutes or after a certain period of time glutamine will still be high in the muscle that -- >>: I guess I'm just trying to figure out how you define increase with respect to your raw measurements. When you say that this increase, what's your definition? >> Gultekin Ozsoyoglu: Okay. So I'm going to model the reactions. In other words, these reactions -- each reaction is an input and output substrate in product. And, actually, if substrate increases, product also increases. >>: I mean, you're starting with measurements for a mass spectrometer or something like that -- >> Gultekin Ozsoyoglu: Right. >>: -- so for each node there you get a raw -- a real value measurement. >> Gultekin Ozsoyoglu: Only in biofluids. These are -- this isn't an organ, now, muscle. You cannot know, you cannot measure this. So, therefore, that's why it's a limited technique. You cannot do any more than that. 
Unless you -- as I said, you get a clamp and then squeeze the muscle and then extract the metabolites that those -- this is exactly what they do to mice anyway, why doing them, they measure them. Or you use cancer-related research, very disruptive research. Okay. So moving on. So we have -- we can verify that (inaudible) is invalid, simply because I said this already, it should -- it really implies -- by following the network, it implies that BCAA, branched-chain amino acids, must have increased, but -- and then therefore they must have increased in the blood, but they did not, so therefore I eliminate this. On the other hand, the second part over here may be valid. It may be valid. We are not saying that it's valid; we are only saying it may be valid. So this path, glutamine in blood has increased and it's because glutamine in liver has increased and it's because ammonia has increased. And if ammonia increases, then there's something wrong in your urea cycle. So this is what this whole thing -- it says that. So let's take a look at this. In -- glutamine has increased, but you see that glutamine is produced by liver and released into the blood. And glutamine increase may be because of NH3, ammonia, over here. And ammonia is actually consumed by the urea cycle. If the urea cycle doesn't consume ammonia, then there's an excess ammonia over here and it results in larger amounts of glutamine and, then, therefore, you will observe (inaudible) yeah. >>: (Inaudible) this one in your work you are just classifying as increase or decrease. >> Gultekin Ozsoyoglu: That's right. >>: You're not actually worrying about the actual amount -- >> Gultekin Ozsoyoglu: Right. >>: (Inaudible.) >> Gultekin Ozsoyoglu: The whole metabolomics society at this point only deals with -- for normal people and for -- with only increases and decreases. 
You can actually go to actual concentration values if you are doing cancer research, and you can actually measure things in cells, obviously, not on cancer patients. >>: Even with that sort of (inaudible). >> Gultekin Ozsoyoglu: No. Actually, if you are actually looking at cell, you can actually observe within a given cell metabolite, concentrations improves, and then what they do, they actually try to -- some of them, actually, they do dynamic analysis; they define -- they fit it to a set of differential equations and then they pass judgment on the basis or the fact that if you -colon cancer, this colon cancer cell behaves this way to this drug and so on. But not for regular patients. I mean, if you are getting your blood measurements, that's all you have. That's all you have. Yes. >>: (Inaudible) independent of metabolites (inaudible)? >> Gultekin Ozsoyoglu: Okay. Yeah. You know, seven years ago when I started this, this is exactly what I asked my genetics researchers. Yeah. You can -- I can abstract it to you. I will abstract it to you. But whatever it is that I do will be very faithful to the underlying biology. If it's not faithful, then you're faking yourself. I had a meeting with Richard Hanson, who is a great guy, and I said that I had this idea about an extension of this, and I said, I'm writing a grant proposal busily right now, he needs a copy. And I said, Richard, can we do this? And he looked at me, that's -- he said, That's BS. You should never go that way. They will immediately kill your proposal. The biochemists will kill your proposal. I threw it away right away. In other words, whatever abstractions that you have has to be extremely faithful. Also, you can't speak in naive terms. You know? When you have a reaction, you don't talk about an input to the reaction and output to the reaction. You have to talk about the substrate to the reaction and a product to the reaction, activators and inhibitors and so on. Okay. So moving on. So, actually -- wait. 
Well, okay. I think I'm going to go back. So this actually -- so all we do is, given a possible set of hypotheses, we invalidate them and link them to physiological conditions. I think I need to go a little bit faster because the example is taking -- it has taken half an hour. And then those we cannot eliminate, they may be valid. We are not saying that they are valid; they may be valid. So I'm going to skip the third hypothesis. Actually, the third hypothesis that there's a problem in -- this was -- I think this was -- this was gut, I think. And then P4 must be kidney. You can invalidate them as well. You can eliminate them. So the approach is automated ways of eliminating hypotheses from among a list of likely hypotheses, and we managed to eliminate three of them and one of them may be valid. And a broader approach is actually to enumerate all hypotheses. Remember, I said that this is not a path that's taken frequently in your muscle. Your muscle has very little ammonia, so therefore glutamine cannot be produced in significant amounts through the ammonia, so therefore that's not a path that you need to take. But if you are sick, actually, if your body is under stress, your metabolic network is very flexible. It starts compensating and it's not -- how it compensates is not understood. It is indeed possible that -- well, ammonia is our only example, but some other path may be active, may become active because you are sick. So then all bets are off. So, anyway, the broader approach is to look at all hypotheses and provide them to the researchers. If you look at all hypotheses that are invalid, with 300 measurements, you can go down to about 200 hypotheses, which is still large. So let's take a look at the abstraction. This is the abstraction now. Now we are there. There is a metabolite. This is an input to this reaction. Each reaction is catalyzed, controlled, by an enzyme, which is (inaudible). And then it produces this metabolite. 
So M2 is metabolized into M6, and if this has increased, this also has increased. I'm sorry? >>: (Inaudible.) >> Gultekin Ozsoyoglu: M4. What did I say? >>: M6. >> Gultekin Ozsoyoglu: Oh, I'm sorry. M4. And M4 is a substrate, or input, to this reaction, catalyzed by this enzyme, and then it produces M6. So you see that these two reactions use the same substrate, same input, whereas over here or over here, this reaction uses two inputs, two substrates, and they are called co-substrates. This is a simpler model. In reality, this enzyme over here is controlled -- the effectiveness of the enzyme is activated or inhibited by activators and inhibitors, which complicates this model. We model them as well, but I'm not going to talk about them in this talk. So the way in which we will talk is -- this is a river. So M2 is upstream to M6, and therefore M6 is downstream to M2. I will repeatedly use the same analogy. So this is essentially a graph network, right? So what we have observed are observed events -- we say that this metabolite is observed, it's an observed event -- to have this concentration level change, increase by X-fold, decrease by X-fold, or no change. This is the only information that we use. And then so -- so you can actually say that two hours after -- this is just an example -- two hours after the treatment with a certain drug the level of M4 has increased by twofold, the level of M5 has not changed, and now we are going to derive events in terms of changes, level changes of metabolites in organs, in tissues. So this cannot be measured. And at this point we are not -- they are not of interest to us. We only want to eventually validate or invalidate different hypotheses or different patterns. So essentially we will derive events using this reasoning. A metabolite -- the concentration of a metabolite may increase either in the blood or in the organs. It's because it's produced more, because substrates, they're produced more. 
The inputs of that -- the substrates -- there were larger amounts of substrates for the production of this metabolite. Or it's consumed less. Because it's consumed less, then its concentration level stays high. Okay. This is just one reasoning. Or you can do the reverse reasoning over here. So then you can do a cascading effect over here as well. If M6 has increased, then going backward, because of this reaction, M4 has increased. This is a perturbation analysis, (inaudible) behavior, and larger amounts of M4 is because M2 has increased, so I have a cascading effect over here of M2 on M6. So this is a cascading effect. And then you can negate the increases and decreases. This is the reasoning. If a metabolite is observed to increase, it's either produced more or consumed less, right? That's all there is to it. There are no other options. So if we model this in various different ways -- I'm going to skip these. This is a formalization. I do have the paper. I sent the paper to Surajit. You're welcome to take a look at it. We submitted it. It's being reviewed by Journal of Computational Bioinformatics, computational biology, I think. So perhaps this is an observed event. This is in the blood, but these are in an organ, so we chase these backward and forward. So if M4, let's say, has increased by twofold, it may be that M2 has increased or it may be that M2 -- the production of M2 has increased and then the production of M5, which is a co-substrate to this, has decreased. If this is decreased, then this reaction did not use enough of M4, both of these are needed, so therefore it implies a parallel change. M5 has decreased, and therefore M6 has decreased. These are your alternatives. So then also you can incorporate dietary intake to this and physiological conditions. There are certain metabolites that our body does not produce, so they are taken externally through dietary intake. So we modeled them. 
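The "produced more or consumed less" reasoning above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-reaction network (M1 to M2, M2 to M4, and M4 plus co-substrate M5 to M6) is a hypothetical reconstruction of the slide described in the talk, and the names `Reaction`, `REACTIONS`, and `explain_increase` are made up for this sketch; they are not the speaker's system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Reaction:
    substrates: tuple  # inputs (co-substrates if more than one)
    products: tuple    # outputs

# Hypothetical toy network, loosely following M1..M6 from the talk.
REACTIONS = [
    Reaction(("M1",), ("M2",)),
    Reaction(("M2",), ("M4",)),
    Reaction(("M4", "M5"), ("M6",)),
]

def explain_increase(metabolite, reactions):
    """List candidate events that could explain an observed increase."""
    candidates = []
    for r in reactions:
        if metabolite in r.products:
            # Produced more: an upstream substrate increased.
            for s in r.substrates:
                candidates.append((s, "up", "produced more"))
        if metabolite in r.substrates:
            # Consumed less: a co-substrate is scarce, so the reaction slows...
            for co in r.substrates:
                if co != metabolite:
                    candidates.append((co, "down", "consumed less"))
            # ...and the product of the slowed reaction drops as well.
            for p in r.products:
                candidates.append((p, "down", "consumed less"))
    return candidates

for event in explain_increase("M4", REACTIONS):
    print(event)
```

For an increase in M4 the sketch yields the same alternatives the talk lists: M2 increased, or M5 decreased together with M6.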
Essential amino acids are actually the metabolites that are not produced by our body. You have to take them as part of your diet. And then also, as I -- I gave this as an example, if there is a physiological process, as a result of that, the metabolite concentration levels change, such as protein turnover that I illustrated. So we capture these, but forget this modeling over here. I'll give an example over here. Dietary intake increases, so it produces -- it has this production event. It produces more of tryptophan in blood. And protein turnover increases alanine in muscle. So let's move on. You can also -- you can actually -- you can model additional external events in terms of physiological changes as well. If your calcium dietary intake increases, then this results in an increase -- increase intake. So let's move on. Let's characterize what we mean by conflict. Conflict is that if I start with an observed event and then later on I follow a metabolic pathway, I end up with a conclusion that contradicts this observation. Instead of a decrease I have an increase. Then that's a contradiction. Then I need to stop at this point. This is the characterization of conflicts. So we can start with these observed events and chase them backward and forward and horizontally through the co-substrates across the network. And I can have a closure of all these events, right? And this is indeed what we do. With a smaller network right at this point in time. We don't have the full network. And so this -- we refer to this as the closure of an event. And then I can define the closure tree. If I start with this event -- let's say this metabolite has increased. And then this second event is a child event, because -- child of this parent event because it's derived from the parent event, so then I can define a closure tree. Actually, if you compute everything, then it's a closure tree. 
If you don't compute everything, which sometimes we cannot, then it's a hypothesis tree, it's incomplete. So to give you an example over here for this very simple network over here, metabolic network, let's say that we have observed that M4 has increased by twofold. There's a typo over here. Anyway, so if you only look at this path over here, the increase of M4 is because M2 has increased, so it corresponds to this path over here. M2 has increased. The increase of M2 is because M1 has increased. And then M1 can only increase. Perhaps M1 is an essential amino acid. M1 can only increase because of dietary intake. This symbol defines dietary intake, an increase in dietary intake. So this is a valid -- maybe valid path. On the other hand, you can see the red ones over here are the contradictions and you can eliminate them from this completely, and this would give you the closure tree for this specific subnetwork. So the notion of consistency is that in your closure tree, you don't have the same metabolite observed in a contradicting manner. That's consistency. Minimality means that you don't have in your closure tree the same metabolite increasing in two different spots, so your closure tree is a minimal tree. So once you obtain this consistent and minimal tree, then from this you can decide which paths are maybe valid, which paths are invalid. So I will skip this because we are running out of time. So for this specific path over here, it's this one over here and this is dietary increase. So the problem statement is this: given an organism, and we are only dealing with humans at this stage, and a set of observed events, we compute all hypotheses. And a hypothesis is a root-to-leaf path. I start with an observation. I end up with either a dietary -- consistent dietary change or consistent physiological change or a consistent -- another observed event. 
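The problem statement above, hypotheses as root-to-leaf paths that end in a dietary change or another observed event, with contradictions pruned, can be illustrated with a short Python sketch. Everything here is an assumption made for illustration: the toy network, the choice of M1 as the only dietary (essential) metabolite, the rule that a co-substrate or product moves in the opposite direction when a metabolite is "consumed less", and the names `derive` and `enumerate_hypotheses`. It is not the formal definition from the paper, and it does not model physiological conditions.

```python
# Hypothetical toy network: (substrates, products) per reaction.
REACTIONS = [(("M1",), ("M2",)), (("M2",), ("M4",)), (("M4", "M5"), ("M6",))]
DIETARY = {"M1"}  # assumed essential metabolite, supplied only by diet
FLIP = {"up": "down", "down": "up"}

def derive(event):
    """Candidate child events that could explain `event` = (metabolite, direction)."""
    met, direction = event
    children = []
    for substrates, products in REACTIONS:
        if met in products:
            # Produced more (or less): substrates changed the same way.
            children += [(s, direction) for s in substrates]
        if met in substrates:
            # Consumed less (or more): co-substrates and products changed the other way.
            children += [(c, FLIP[direction]) for c in substrates if c != met]
            children += [(p, FLIP[direction]) for p in products]
    return children

def enumerate_hypotheses(root, observed):
    """Return (path, status) for every root-to-leaf path from an observed event."""
    results = []

    def walk(path):
        met, direction = path[-1]
        if len(path) > 1 and met in observed:
            # Leaf: another observed event, consistent or contradicting.
            status = "maybe valid" if observed[met] == direction else "invalid"
            results.append((path, status))
            return
        if met in DIETARY:
            # Leaf: explained by a change in dietary intake.
            results.append((path, "maybe valid"))
            return
        seen = {m for m, _ in path}
        # Minimality: do not revisit an unobserved metabolite on the same path.
        children = [e for e in derive(path[-1]) if e[0] in observed or e[0] not in seen]
        if not children:
            results.append((path, "unresolved"))
            return
        for child in children:
            walk(path + [child])

    walk([root])
    return results

observed = {"M4": "up", "M5": "same"}  # M4 up twofold, M5 unchanged
for path, status in enumerate_hypotheses(("M4", "up"), observed):
    print(status, path)
```

With M4 observed to increase and M5 unchanged, the only path that survives is M4 up, M2 up, M1 up (dietary intake), which mirrors the example in the talk; the other paths end in a contradiction with an observation.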
Then, essentially, we find all these hypotheses, and we would like to rank them and give them to the biochemist or clinical researcher. This is the problem. Formally, it's defined in the paper. So I'm sort of giving you the intuition. Of course, I have -- along the way I have taken liberties. The transport mechanism is not as simple as I have modeled. But this is what current metabolism researchers do. If your blood glucose is high, I immediately said that -- what is this? In muscle glucose is increased. We know that that's not the case, right? I mean, glucose is gated by insulin. Glucose transport process from blood to an organ is extremely complicated. It involves five or six different types of cells in your pancreas. It involves four or five different mechanisms. But modeling -- so we cannot really model this, but I simply -- what we are proposing as the next stage of our development is to actually define possible (inaudible) conditions over here for glucose to be transported into a tissue. This we can do, and that's what we are planning to do next. Our current transport mechanism is essentially if you observe an increase, then there's an increase downstream or there's an increase upstream in related organs. So these details, I will skip that. Another issue that I already mentioned to you, the dynamic analysis is called flux balance analysis in biochemistry. This is a very complex and difficult process, but through years and years of research in humans, as I said, the biochemists know, for instance, that if arginine is produced in this certain organ, arginine is consumed more into urea rather than through this reaction into guanidinoacetate over here. So actually I can say that arginine consumption, metabolizing arginine, is more on this path rather than this path. The numbers are rather arbitrary, because it's not known. 
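One way to picture how such branch preferences could feed into ranking is sketched below. This is an assumption-laden illustration, not the method from the paper: the table `FLUX_RATIO`, the functions `path_score` and `rank`, the product-of-ratios scoring rule, and the 0.9 and 0.1 weights are all hypothetical, and "guanidinoacetate" is a reading of the transcript's phonetic rendering.

```python
from math import prod
from typing import Dict, List, Tuple

# Hypothetical flux ratios: the assumed share of a metabolite consumed along each edge.
FLUX_RATIO: Dict[Tuple[str, str], float] = {
    ("arginine", "urea"): 0.9,
    ("arginine", "guanidinoacetate"): 0.1,
}


def path_score(path: List[str], default: float = 1.0) -> float:
    """Multiply the ratios of consecutive steps; unknown edges count as `default`."""
    return prod(FLUX_RATIO.get(edge, default) for edge in zip(path, path[1:]))


def rank(paths: List[List[str]]) -> List[List[str]]:
    """Order hypotheses from most to least favored under the assumed ratios."""
    return sorted(paths, key=path_score, reverse=True)


if __name__ == "__main__":
    candidates = [
        ["arginine", "guanidinoacetate"],
        ["arginine", "urea"],
    ]
    for p in rank(candidates):
        print(path_score(p), " -> ".join(p))
```

Because the ratios themselves are described as arbitrary, a score like this is only useful for ordering hypotheses relative to one another, not as a probability.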
Only a true metabolic biochemistry specialist can really tell you something that's close to this, but it's rather arbitrary. The point here is that in my closure tree, I can actually rank these paths because of this knowledge. The likelihood of this happening is more; the likelihood of this happening is less. So we incorporate this into our system. Physiological conditions I will skip. We actually do -- another thing that we do is XPath-like analysis. We actually look at these different hypotheses, and then if a subpath occurs in these different hypotheses many times, many, many times, then we say that this is perhaps a more likely subpath; can you make use of this? We give this to the biochemist. We do it at this point in time, and therefore we percolate these hypotheses up in possible hypotheses lists or invalid hypotheses lists, so we do some (inaudible) in that sense. But, as I said, Richard Hanson doesn't like this at all. He says this doesn't make sense. But we do it anyway. So I will skip this. So this is a subpath that occurs frequently here, and then as a result we put it into our interesting event set. I will skip all of these and go to our experimental evaluation. We tested this for a certain disease, actually for a paper. In clinic there's a researcher, and we used his data and validated our approach, but his paper's not published; he would not allow us to discuss this. But, essentially, his data had many more than 34 metabolic measurements, but we only use 34 of them because our database is a prototype database at this stage, because our original system that we managed for seven years does not have location information. We don't distinguish between organs. 
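The frequent-subpath idea mentioned above can be sketched as a simple count. The function name `frequent_subpaths`, the fixed subpath `length`, and the `min_support` threshold are assumptions for illustration; the talk does not say how subpaths are matched or what cutoff is used.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def frequent_subpaths(
    hypotheses: Sequence[Sequence[str]],
    length: int = 2,
    min_support: int = 2,
) -> List[Tuple[Tuple[str, ...], int]]:
    """Return contiguous subpaths of `length` steps found in at least
    `min_support` hypotheses, together with the number of hypotheses."""
    counts: Counter = Counter()
    for path in hypotheses:
        # Use a set so a subpath repeated inside one hypothesis counts once.
        seen = {
            tuple(path[i:i + length])
            for i in range(len(path) - length + 1)
        }
        counts.update(seen)
    return [(sub, n) for sub, n in counts.most_common() if n >= min_support]


if __name__ == "__main__":
    hyps = [
        ["A", "B", "C"],
        ["X", "B", "C"],
        ["A", "B", "D"],
    ]
    for sub, n in frequent_subpaths(hyps):
        print(n, " -> ".join(sub))
```

Subpaths that clear the threshold could then be used to move the hypotheses containing them up the list, which is the percolation step the speaker describes.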
So we had to produce a separate database for that, so therefore we have a small number of pathways, and the pathways that we have, altogether I think we have 50 pathways right now and we have 28 pathways in this metabolism -- I mean amino acid metabolism; 11 pathways in carbohydrate metabolism; 11 pathways in lipid metabolism. And this is what we use, and therefore we use this smaller number of observations. And this is the data that we have used. We abstracted the actual increases with increases and decreases. This is yielded, as I said. And with this data, the number of hypotheses -- remember that we go backward and forward. At any node we go backward and forward, backward and forward. So the number of hypotheses with our 50 pathways went up to 130,000 different hypotheses. We eliminated the invalid ones. The invalid ones stayed up at about 3,000. But then -- >>: (Inaudible.) >> Gultekin Ozsoyoglu: I'm sorry? >>: By 30,000. >> Gultekin Ozsoyoglu: 30,000, yeah. 30,000. And the maximum hypothesis length was 70. And then 70, when we showed this to Richard Hanson, Richard Hanson said this is totally useless. A hypothesis length of 70 is incomprehensible even to the expert biochemists, metabolic experts. So we have to do something better. So this describes how we eliminated in different types of metabolism. But we -- the number of -- using the observations that we had. Remember, the observations of patterns, we achieved 95 percent reduction with 40 measured metabolites, and then we increased the metabolites to 80 metabolites, we eliminated about 99.9 percent of the hypotheses down to 300 hypotheses. But these 300 hypotheses are too much as well. I mean, it's humanly not possible to interpret 300 hypotheses. The goal is to take it down to, let's say, somewhere, 20 -- you know, case by case, 20 or something so that they can interpret these. And that's what we want to do. And I think we can do it if we can model the patterns more and more. 
So when we use summarization, I said these are XPath-like subpaths in metabolic network. We actually reduced the total number of hypotheses by 99 percent, even with 34 metabolite observations. So I will skip this. As I said -- this shouldn't have been here. I'll skip this. Sorry about that. We use it for actually for this specific -- the data came from there. So this is our system. This system is actually a revised version of our system that's used -- it's a production system, it's a development system. It's at this side, this side of the development server. So these are the pathways processed. And you can browse them and so on. Our current system does that. The new part that we built for this research is this (inaudible) prediction. So users can now load the observations as an XML file. There's a sample over here that they can use, actually. We have a sample, we provide a sample. And then they may actually -- or they may manually add what they observed one by one. And then when they do that, then -- and also we use Ajax technology over here. We actually provide what's available. Instead of all the metabolites, there are 2500 metabolites. And then you may actually specify them one by one in a specific organ. So this is let's say -- after uploading it, this is what you would have, what's uploaded by the user. These are the metabolites. These are the actual changes. And then you can choose whichever one to be your closure tree for exploratory (inaudible). I have ten minutes. Slightly less than ten minutes. Moving up. And then when you click generate observation-supported hypothesis, it produces this. You can save them. These are actually -- these are our paths, starting in -- with glycocholate in blood, glycocholate in liver, cholate, cholate, glycocholate biosynthesis in liver and so on and so forth. This is what Richard Hanson calls a useless sprouting of pathways. And then we visualize our system. This is also available. 
You can zoom in, zoom out and then do lots of things. This is an original of our visualization in biological pathways. We actually -- this is a new addition to our current system. It's being worked out. As I said, all the metabolic network data sources on the Web, highly respected (inaudible) KEGG. We are licensed to use their KEGG data. They never distinguish between liver and muscle because -- and they have to distinguish between liver and muscle, because the same -- you see that this reaction over here, this pathway occurs in liver, blood, and muscle. It -- and there are other pathways as well, so we measure -- visualize this, and then we do the (inaudible) prediction with this too. Very well. What have we done? We have actually modeled the observed measured metabolite changes. We chased them in the metabolic network across different organs using a rather simple transport mechanism, and then even with that we managed to eliminate a significant amount of hypotheses. And we have also, as I said -- remember 0.9 and 0.1, this path is more favorable than this path, that's actually called flux ratio analysis, so we incorporated flux ratio. The limitation of the system, of course, is that it does not utilize the exact amounts. But utilizing exact amounts is only possible with full kinetics, and full kinetics, it will not be -- I don't think -- will not be discovered in our lifetimes. So I think I would like to stop here rather than say more things and then maybe answer your questions. >>: Are there any competitive systems that are doing this kind of metabolic pathway analysis? >> Gultekin Ozsoyoglu: Absolutely not. We are the first. All the companies, professional companies -- actually, this company (inaudible) I think, the minute they learned that we are doing this they said, Can you do this for us? Give us all the paths between two metabolites in different organs, and we'll pay you. We haven't paid attention to it. 
But mapping these metabolites through the network is not done yet. All the analysis that we have done is extremely new. There are five or six Web-based data sources for metabolomics research, and there is one highly respectable one in my alma mater, the University of Alberta. They are funded by $35 million from Canada. And on the site they say they have pattern information. If you have these and these observations, then it's probably this disease. And it's actually FTP downloadable. We downloaded their data. >>: But it's pattern based. They're not doing network analysis. >> Gultekin Ozsoyoglu: Nothing. Nothing. But I can use those patterns in my system, and then I can actually do more. Of course, this is just the beginning, right? I mean, into the future, you have to do data mining. Saying that this path is 0.9, the 0.1 is something that I invented. No one knows. But my co-researcher, Richard Hanson, says that you know the amount of ammonia in the muscle; it's minuscule. So only he knows it. And he's a world-class expert in this area. >>: Can I just add something? I think this is motivated by the fact that it's very easy to measure so many metabolites in a simple blood test in the lab. But then nobody knows what to do with all this data. And there are very few experts, frankly, who can interpret this data. So this is -- motivation for this is to maybe capture some of this knowledge of these experts and help people to sort out and eliminate or figure out the alternatives. There will be, of course, lots of positives in the results of this. But the idea is just to limit the scope so that maybe there will be meaningful (inaudible). >>: You can see, though, why Hanson would want to see relatively short paths. Because if you can't under- -- as you say, this is such early-stage stuff. Nobody's going to trust the computer's analysis and go off and do what it says. So if it isn't -- if it isn't a line of logic you can really understand, then what's the point? 
>>: But, see, the idea is that -- on one hand you can't do anything with it, because you don't know what to do. On the other hand, you have a maybe smaller scope. And as there are more data embedded into this system, it will produce better results. But this is a very early stage of the (inaudible). >> Gultekin Ozsoyoglu: Our cancer researchers -- we have a cancer center. They are funded by another $50 million over five years. Our cancer researchers went crazy on this. Our cancer researchers really went crazy on this. They said actually cancer changes the metabolic networks in completely unexpected ways. They say, Can I do (inaudible) over here? This path that's insignificant suddenly becomes significant. You're a cancer patient, suddenly your muscles are collecting ammonia because you have cancer. So what does it mean? How can I fix this? So this can be used for cancer, but we are way early for that. Maybe in another ten years. >>: Well, this -- as you said, this is sort of early days, and the question I have is the level of modeling that they're currently doing, the level of analysis you're doing, are you beginning already to get useful results? Can you narrow it down to (inaudible) -- >> Gultekin Ozsoyoglu: Right, right. Right. So there is one example that we did all the experiments and then we said we cannot -- we were -- we were told -- we are told not -- we cannot use it. There is a certain disease, and there's a researcher in clinic -- his results, his observations, his conclusions, we produced them. Once we have them as patterns, we produce them. Among those 300 hypotheses, we percolate to the top his observations. But we can't do more than that, right? Ultimately it has to be interpreted by someone who's an expert. >>: He's studying (inaudible) disease. >> Gultekin Ozsoyoglu: You're not supposed to say that. >>: No, but my point -- I believe one of us asked you this, was basically is the way you -- the research that you are now at, what stage is it? 
Is it at the stage where it's beginning or ready to produce some useful results (inaudible) -- >> Gultekin Ozsoyoglu: Right. So we are collaborating with Henri Brunengraber; he's doing cancer research, colon cancer research. He's in the department of nutrition. And he's a very well known metabolomics researcher. So we are trying to match our system where it's producing results consistent with what he has with colon cancer cells. We just got the data and we are working on that. But the one that we verified is the one that (inaudible). We literally found the right ones. You had a question? >>: Can you take the output, then, and just say if I just have these two additional tests, I could produce the output (inaudible)? >> Gultekin Ozsoyoglu: No. I mean, we really use all 34 -- >>: (Inaudible) had some additional tests to run. >>: Additional tests, sure. >> Gultekin Ozsoyoglu: Right, right. In other words -- >>: (Inaudible) exception is. >> Gultekin Ozsoyoglu: The other exception -- >>: This is why we can't -- it can be done. >> Gultekin Ozsoyoglu: It can be done. The ultimate goal is this: You know that when you look at cholesterol -- and I think we need to stop, right? Ultimately you know that when your cholesterol level is high in your blood, that's a biomarker. And that immediately says, oh, you better have -- change your dietary intake, you are prone to heart disease. What we would like to do with this is, oh, I have this pattern of 12 of these metabolites. This is deadly. You may have this disease. Or you may have these physiological issues, so you start watching out. So, in other words, we would like to make a biomarker out of not just a single observation, which is what current technology is; we would like to make a biomarker out of a pattern of them, and we would like to locate them first ourselves and then have them verified in the lab. This cannot be done without a system like this. 
>>: So (inaudible) question, which is, I mean, in terms of using this from a CS perspective, would the increase that you found (inaudible) for you (inaudible) so for (inaudible) techniques that were particularly (inaudible) -- >> Gultekin Ozsoyoglu: Okay. So we found out that we manage -- not this system, but the original system we manage at a professional level. We use version control, CVS; we use Bugzilla. Every week we get together with ten master's and Ph.D. students. We fix the bugs and then our co-researchers, Richard Hanson and the others, come. Not Richard Hanson; the others come. And then we fix this. But we found out that at the server side, SQL Server works really well. There's no problem with that. We keep changing the database, though. It's amazing. Continuously we change it. And then as we change, we each have to change the code (inaudible) technology, it's perfectly okay. In terms of -- >>: The techniques? What are the techniques? >> Gultekin Ozsoyoglu: The techniques? I think these are essentially down -- at heart, these are graph databases. So there are certain -- I haven't really described everything. If you increase this, which we did, into -- instead of 50 pathways into about 70 pathways, our system, even though it runs (inaudible) main memory on the server, 16 gig main memory, highly powerful servers, they cannot stop. They cannot finish the closure tree computation. So we will have to use original indexing techniques. We are looking into that. Even with only main memory computations. If there's the system -- because it's not just chasing forward in the network; go forward, backward, backward, backward, go forward, backward, backward, backward. So at each step when you move, move backward, and these are all possible hypotheses. >>: How big is this graph? >> Gultekin Ozsoyoglu: How big is this graph? The full known metabolic network, BioCyc, has about -- it's on the Web. Right now -- >>: (Inaudible?) >> Gultekin Ozsoyoglu: I'm sorry? 
>>: Are you using all of -- >> Gultekin Ozsoyoglu: No, no, no. We're not using all of it. Because KEGG doesn't have location information. What they say, they give you a pathway and say you cannot tell whether this part is in this organ or it's completely in this organ. Actually, you can take a pathway, liver to kidney, some reactions you miss. So we will have to collect all that information from the literature. We only have the limited work. >>: So is this -- >> Gultekin Ozsoyoglu: The number of pathways altogether, across all the metabolisms, KEGG right now has about 140 pathways. The largest one is 150 reactions. And they have it across all organisms. And we have it -- we make it available -- if you type "PathCase," you will reach our production system. And we built that over seven, eight years. PathCase, right? >>: Sure. I just wanted to know how many (inaudible) -- >> Gultekin Ozsoyoglu: So we have it at the first page. I think we have about 35,000 metabolites in that full network. I mean, I don't know the exact number. It's listed. It keeps changing. Every three months we download the new version, we up -- change it into -- change our database. >>: So (inaudible)? >> Gultekin Ozsoyoglu: We don't use any supervised learning over here. Around this -- >>: Why not? >> Gultekin Ozsoyoglu: Okay. So this is -- there's not much to do in terms of -- >>: No, but (inaudible). >> Gultekin Ozsoyoglu: Into the future. Yeah, yeah. Into the future. We -- around PathCase we did work on protein annotation, protein network annotation. We're published in the (inaudible) in the ISMB. Protein annotation we use machine learning, supervised learning. And we had extensive (inaudible) models. This research is just in its infancy. But I'm hoping that once I understand it long enough, next semester I will sit and take a biochemistry course. Literally. I mean, seven years I've been working on it, but next semester I'll sit down and take that biochemistry course. >>: (Inaudible.) 
(Laughter.) >> Gultekin Ozsoyoglu: I cannot handle that. I will audit it. I will audit it. So, anyway, once I learn it more -- you know, the main -- one of the issues is I'm too aggressive, right? I say, Can we do this? And it says, You are wrong. But sometimes it's actually -- those -- if you do it, then it appreciates that the change is mine. So I think that the problem with communicating with biologists we found -- geneticists, biochemists we found out is that the ones that we deal with are really experts. They are -- they are not at this same computational-thinking level that we are. On the other hand, we can go off base and do stupid things as well. >> Surajit Chaudhuri: Okay. Thanks, Tekin. >> Gultekin Ozsoyoglu: Thank you. (Applause.)