>> Tao Cheng: So let's get started. So today we are very happy to have Professor Vagelis Hristidis give us a talk. He is an associate professor at the University of California, Riverside, and I think he got his Ph.D. from the University of California, San Diego. So [inaudible] the same sunny state. And he's basically interested in bridging the gap between database systems and information retrieval. I think he has been supported by many firms, including the National Science Foundation, Department of Homeland Security and many other big companies. So today he will talk about collaborative tagging for web items. So let's welcome him. >> Vagelis Hristidis: Hello, everyone. Thanks for having me. And it's great to be back after my 2003 internship, which has led to many long-term friendships and collaborations. And actually this work is also a collaboration with Gautam Das, who I met while I was an intern here. So in this talk I will -- the main part will be presenting the results that were published in the [inaudible] last year, and then if I have some more time I will talk a little about my more recent work on data management for health, health data. Okay. So let's start. Okay. So the motivation is that if you go to any e-commerce website recently, you will see that there is a lot of new kinds of information popping up all the time, including the ratings, likes and the tags, which are the focus of this work. So as you see, so this was about cameras. This site is about songs, so people are tagging songs with what they think is useful for remembering or for other people to find. Also, you see here people are tagging trips. So when you go on a trip, you say this is adventure or this is art and so on. So it seems that many of the objects on the internet have a set of tags associated with them. So tagging -- there has been quite some work on tags, but mostly the work on tags before our work was on using tags as kind of extending the content of the object.
So in addition to the text and the items of the object you also see the tags, so you have a more, let's say, complete picture of the object, so you can increase your recall when you search for objects. Also in ranking you can use the tags to kind of extend the document. And also for -- there has also been work on classification, on trending, to see which tags are becoming more popular, and also there has been work on how to assign tags, and also you guys here have been working on how to tag objects and also how to predict what tags an object will acquire. So the question we're trying to answer here is if we can -- if the tags can be useful in designing better items, where items can be products, can be trips, you know, anything. And the answer is yes, as we'll see. So the question is how can we use tags to design a new camera. Suppose we are working for Canon and we say, you know, I want to do some research on the internet using the tags and I will propose a few new cameras that would probably be attractive to people, so I want to see what attributes a camera should have such that this camera will attract good tags. So the same way for music. Let's say I want to build a new song, create a new song, what kind of attributes a song should have to become popular and that, you know, [inaudible] and so on. So to repeat, the problem is how to, you know, use the tags to build products. So I'll use tags to decide what attributes a product should have. So more formally, the problem is the following. Suppose that we have a table that has two types of columns. The first type of column is the attributes of the product. For instance, if it's a camera, what kind of camera, what is the zoom, color, flash, and so on. And then the second part or the second set of columns are the tags that people have assigned to this camera.
And then we want to design K items, so we want to design, let's say, K new rows, so add K new rows to the table such that these new items will have a [inaudible] of attracting the set of tags we want them to attract. So, for instance, if I say I want to build an easy and lightweight camera, so I select, let's say, these two tags as desirable, then our algorithm will decide what kind of attributes a camera should have such that it will attract these two tags. Of course, you may also want to say, but I want to avoid this unreliable, have some [inaudible]. Okay. So the first challenge is how do you -- how do we decide how the attributes and the tags are related. So there has been some work on this, that this can be viewed as a classification problem. So given the attributes of the product, I want to -- for every tag I build a classifier to decide if the tag will be present or not present given the attributes. So, of course, there are many different types of classifiers, but given some research we did on previous work and some experiments we did, we found that the Naive Bayes classifier is a good solution for that. Of course, other classifiers are possible. So in this case what we do is we use the Naive Bayes classifier to estimate the probability that the product will attract tag Tj given o, which is the -- all the attributes of the product. So we want to compute the probability of a tag given that [inaudible] values. And this is the standard inference from the derivations from the Naive Bayes classifier. So in the end, the probability of a tag given an object is this equation here. So this is the probability of not having a tag, this is the probability of having a tag, regardless of what object we're talking about, and these are the pairwise probabilities of an attribute. For instance, this is zoom equals 3 given not the tag, this is zoom equals 3 given the tag. And then we use Rj to kind of summarize this term here.
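The slide's equation is not captured in the transcript, so the following is only a minimal sketch of the standard two-class Naive Bayes estimate that the description matches: Pr(Tj | o) = 1 / (1 + Rj), where Rj is the ratio Pr(not Tj) / Pr(Tj) multiplied by the per-attribute ratios Pr(a_i | not Tj) / Pr(a_i | Tj). The function name `tag_probability`, the Boolean encoding of attributes, the add-one smoothing parameter `alpha`, and the toy table are all assumptions made for illustration, not details taken from the talk.

```python
from typing import List, Sequence


def tag_probability(rows: List[Sequence[int]], has_tag: List[bool],
                    item: Sequence[int], alpha: float = 1.0) -> float:
    """Estimate Pr(tag | item) with two-class Naive Bayes over Boolean attributes.

    rows    -- attribute vectors of existing products (hypothetical encoding)
    has_tag -- whether each existing product carries the tag
    item    -- attribute vector of the candidate product
    alpha   -- smoothing constant (an assumption, not from the talk)
    """
    n = len(rows)
    n_pos = sum(has_tag)
    n_neg = n - n_pos
    # Smoothed class priors: Pr(tag) and Pr(not tag).
    p_pos = (n_pos + alpha) / (n + 2 * alpha)
    p_neg = (n_neg + alpha) / (n + 2 * alpha)
    # R starts as Pr(not tag) / Pr(tag) and accumulates one ratio per attribute.
    ratio = p_neg / p_pos
    for i, value in enumerate(item):
        match_pos = sum(1 for r, t in zip(rows, has_tag) if t and r[i] == value)
        match_neg = sum(1 for r, t in zip(rows, has_tag) if not t and r[i] == value)
        like_pos = (match_pos + alpha) / (n_pos + 2 * alpha)  # Pr(a_i | tag)
        like_neg = (match_neg + alpha) / (n_neg + 2 * alpha)  # Pr(a_i | not tag)
        ratio *= like_neg / like_pos
    return 1.0 / (1.0 + ratio)


# Toy table: attribute 0 co-occurs with the tag, attribute 1 does not matter.
toy_rows = [(1, 0), (1, 1), (0, 1), (0, 0)]
toy_has_tag = [True, True, False, False]
print(tag_probability(toy_rows, toy_has_tag, (1, 0)))  # roughly 0.75
```

On the toy table the candidate with attribute 0 set gets a probability of about 0.75 and the one without it about 0.25, which is the behavior the ratio form predicts.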
>>: [inaudible] >> Vagelis Hristidis: Yes. In our problem, the set of good tags is an input. So you say I want to build a product that will -- so let's say I want to build a product that will try to attract, let's say, these two tags, which means that you think that these two tags are positive. But you don't have to give all the positive tags as input, because you may want to build, like, say, one lightweight camera and then you want to build a professional camera, so you put different tags as input. >>: Seems to me that your output is new items. >> Vagelis Hristidis: Yes. >>: There's also an implicit assumption that it's easy to create these new items. >> Vagelis Hristidis: That's a very good -- that's a very good point. So the argument we make is that the item that we output would then have to be combined with some other, let's say, research on what is possible to build, because maybe you output an item that is not possible to build with current technology, right? Or maybe it's too expensive to build. So the result of this algorithm is just one of the inputs that the designer will take to decide what to build, along with market research and other things. Yes? >>: Is it limited to categorical attributes? So, for example, the tag [inaudible] would highly correlate with price, which is numerical. >> Vagelis Hristidis: Yeah, actually in the -- we defined most of the formulas in terms of Boolean, and in the paper we explain how it can be extended to categorical. And if I'm not mistaken, we also discuss numerical, but in a very simple way, like split into ranges, so not considering the order. Okay. [inaudible] conditional independence to do that on Bayes. So then if you're given a set of targets you want your product to satisfy, then the problem becomes how do you find the -- create the objects given a set of tags such that the sum of these probabilities is maximized across all the targets.
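As a concrete reading of that objective, here is a small brute-force sketch that enumerates every Boolean attribute combination and keeps the one with the largest sum of tag probabilities. It is exponential in the number of attributes and is meant only to pin down what is being maximized. The function name `best_item_brute_force` and the two toy probability functions are hypothetical stand-ins for per-tag classifiers.

```python
from itertools import product
from typing import Callable, Sequence, Tuple

Item = Tuple[int, ...]


def best_item_brute_force(num_attrs: int,
                          tag_probs: Sequence[Callable[[Item], float]]
                          ) -> Tuple[Item, float]:
    """Return the Boolean item maximizing the sum of Pr(tag | item) over the desired tags."""
    best, best_score = None, float("-inf")
    for item in product((0, 1), repeat=num_attrs):  # 2 ** num_attrs candidates
        score = sum(p(item) for p in tag_probs)
        if score > best_score:
            best, best_score = item, score
    return best, best_score


# Hypothetical stand-ins: attribute 0 is "big lens", attribute 1 is "touch screen".
def lightweight(item: Item) -> float:
    return 0.2 if item[0] else 0.9


def modern(item: Item) -> float:
    return 0.8 if item[1] else 0.3


toy_best, toy_best_score = best_item_brute_force(2, [lightweight, modern])
print(toy_best, toy_best_score)
```

With these toy functions the search picks the item without the big lens and with the touch screen, since that combination scores highest on both tags at once.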
So if you have two targets, let's say lightweight and modern [phonetic], you want to, you know, at the same time maximize both probabilities. Okay. So given this problem formulation, the first, probably expected, result is that this is an NP-hard problem, because you have to -- the naive way would be to consider all possible combinations of values for all the attributes, which is an exponential number, and then we prove that it's NP-hard. So then the next step is what algorithm can we propose. So we propose two algorithms. The first is an exact algorithm, which obviously only works for relatively small data sets, and then we have an approximate algorithm that works for bigger data sets. Okay. So let's present them one by one. So the exact algorithm is based on top-k algorithms. Actually that's a combination of two top-k algorithms. So before we talk about the exact, naive would be to -- as I said, to create all possible items, so this will be exponential, and then the exact algorithm, which we'll call the exact two-tier algorithm, is called two-tier because there's two tiers of top-k algorithms. So in the first tier we find what is the -- you know, we find what are the best products for each tag, and in the second tier we combine across all tags that are in the input. And obviously since this is an NP-hard problem, in the worst case this algorithm can be very slow, but as we see in the experiments, the average case performs very well. It's the same in all top-k algorithms. In the worst case it can be very bad. Okay. So these are the two tiers of the algorithm. So in the first tier you have one, let's say, top-k execution for every tag. So here we have shown that the user is looking to create a product that will maximize the probabilities of tags T1 to Tz.
So for each tag we have, let's say, a pipelined algorithm that will get next to [inaudible] suggested product that will maximize this tag independently of the other tags, and then in the top tier we combine all these tags together to build -- to decide which is the best product that globally maximizes the probability for the tags. So in the first -- now let's talk in a little more detail. In the lower level what we do is we -- for each of these top-k algorithms we split the -- we create one list for a subset of the attributes. So suppose we'll have, let's say, 50 attributes. We can build 10 lists of five attributes. And then for every list we consider all combinations of values for these attributes, which is, you know -- since these are relatively few [inaudible] lists, the number is not very high. For instance, if you have Boolean attributes and you have five of them, you have 2 to the 5 entries here, all combinations, and then for each combination we assign a score, which is how much this subset of attributes can contribute to the probability of having the tag Tz. And then -- so you have this ordered by this value that I mentioned, and then you have kind of a rank join algorithm, but there's a difference between -- because in rank join there is a join condition, but here there's no join condition. Everything joins with everything, which means all combinations from here and here and here, you know, can potentially be a result. There's no join condition. You have to do all the combinations for everything with everything. But, still, in practice it is not so bad, because that's what we will see. You don't have to go very deep in the list. Yeah, that's what I just mentioned. So you just -- all combinations and then you compute a score. And then on the top tier you have -- here you have a complete product, and then you want to find the complete product that maximizes all the tags. So this one is the same as the threshold algorithm.
So you -- these lists here are joined when it is the same product. So this was the exact algorithm, and now we'll present the approximate algorithm. So the good thing about this approximate algorithm is that it has an approximation guarantee, so it has a bound. So the algorithm is kind of inspired by an algorithm -- by a polynomial-time approximation algorithm that solves the subset sum problem. So it works as follows. The first step is to split the set of tags, T1 to Tz, to smaller groups where each one has raw tags. So then you have r number of groups. So then also we'll see the error bound will be 1 over r times 1 plus epsilon, so this r depends on, as I said, how many subproblems we create. That's what we'll see in the experiments. We create only one subproblem because we have a relatively small number of attributes, which means in that case we'll have approximation 1 over 1 plus epsilon. So epsilon is a user-specified constant. So the user specifies epsilon, and given epsilon, the algorithm will run in polynomial time to achieve this bound. So the way it works is as follows. So first you start with -- so this point is in z-dimensional space. So you have one dimension for every tag. So then -- and then the coordinates here -- so in this example suppose you have two tags, and then what this point says is that if you set zero for all four attributes -- suppose you have only four attributes -- then it will get the probability 0.3 to satisfy this tag and 0.18 to satisfy this tag. So that's all this point means. So what you do is -- again, here, suppose we have only four attributes, but, of course, in practice you have more attributes. So then what the algorithm -- how it starts, it starts with all zeros and we compute using our formula what is the probability for every tag, and then we flip one bit at a time, which means we flip one value at a time.
So the first step from all zeros, we flip the first bit, so we have all zeros or 1000, and we'll compute the score for every attribute -- sorry, for every tag again for these two. This was the same as before. And then we compute the score for this one. And then what we do is we do clustering so that we -- because our goal is to always have a polynomial number of candidate products, because if you become exponential, then your algorithm is not polynomial anymore, right? So we use a specific clustering algorithm such that we eliminate points that are relatively close to other points, so we can keep kind of a summary of all the points. So we do a clustering and we keep one representative from every cluster. And the details of the distance that we use to eliminate are in the paper. And then in the next iteration we flip the second bit, which continues from the previous slide. So the 0000 was eliminated. Now when we continue here, we flip the second bit, and let's say in this case these points are far enough from each other that they cannot be eliminated, and then you continue like this and then the next iteration and so on. And then in the end you are left with a set of candidates, and then you select the one that has the highest score across all tags. Okay. So now let's move to the experiments. So we used two data sets, one synthetic and one real. In the synthetic it's not uniformly distributed. We kind of tried to make it a little realistic, so, you know, there is some skew in most of the attributes. But, by the way, the important thing here is not so much the number of rows but the number of attributes, because the complexity of the algorithm depends on the number of columns, because if you want to compute all the possible -- to enumerate all possible products to build, the number of attributes is important. So because the number of rows is only used to compute the probabilities in Bayes, in Naive Bayes, but it's not very important.
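The flip-one-bit-then-cluster procedure described for the approximate algorithm can be sketched roughly as follows. This is a hedged illustration, not the paper's method: the talk defers the actual elimination distance to the paper, so the `trim` rule here (drop a point when an already-kept point is within a factor of 1 + delta in every tag coordinate) and the choice `delta = epsilon / (2 * num_attrs)` are borrowed from the textbook subset sum approximation scheme as assumptions, and no approximation bound is claimed for this sketch. The names `design_item`, `trim`, `close` and `toy_score` are hypothetical.

```python
import math
from typing import Callable, List, Tuple

Item = Tuple[int, ...]
Point = Tuple[float, ...]           # one probability per desired tag
Candidate = Tuple[Item, Point]


def close(p: Point, q: Point, delta: float) -> bool:
    """True if p and q are within a (1 + delta) factor in every coordinate."""
    return all(a <= b * (1 + delta) and b <= a * (1 + delta) for a, b in zip(p, q))


def trim(cands: List[Candidate], delta: float) -> List[Candidate]:
    """Keep one representative per group of nearby points (assumed rule)."""
    kept: List[Candidate] = []
    # Visit higher-scoring points first so each group keeps its best member.
    for item, point in sorted(cands, key=lambda c: sum(c[1]), reverse=True):
        if not any(close(point, kp, delta) for _, kp in kept):
            kept.append((item, point))
    return kept


def design_item(num_attrs: int, score: Callable[[Item], Point], epsilon: float) -> Item:
    """Start from all zeros, flip one attribute per round, trim after each round."""
    delta = epsilon / (2 * num_attrs)  # assumption borrowed from subset sum
    start = (0,) * num_attrs
    cands: List[Candidate] = [(start, score(start))]
    for i in range(num_attrs):
        flipped = []
        for item, _ in cands:
            new_item = item[:i] + (1,) + item[i + 1:]
            flipped.append((new_item, score(new_item)))
        cands = trim(cands + flipped, delta)
    # Select the surviving candidate with the highest total score across tags.
    return max(cands, key=lambda c: sum(c[1]))[0]


def toy_score(item: Item) -> Point:
    """Hypothetical two-tag scorer standing in for the Naive Bayes estimates."""
    weights = ((1.0, -0.5, 0.3, -2.0), (0.2, 0.7, -1.0, 0.4))
    return tuple(1 / (1 + math.exp(-sum(w * x for w, x in zip(ws, item))))
                 for ws in weights)


print(design_item(4, toy_score, 0.5))
```

A larger `epsilon` trims more aggressively and keeps fewer candidates per round; with a very small `epsilon` almost nothing is trimmed and the sketch degenerates into full enumeration.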
So even though the [inaudible] looks small, the number of attributes is important. So in the first experiment we compared the time to create the top products between three algorithms: the naive, which is enumerating all the combinations for the attributes, and then you have the exact two-tier which [inaudible] before the top-k algorithm, and then you have the polynomial approximation algorithm. And as expected, you see that as the number of attributes increases, then at some point only the approximation algorithm can scale. Yes? >>: Just briefly for clarification, you said the time to create the top products. What does that mean, top products, here? >> Vagelis Hristidis: In this case I think we're looking for the top one -- yeah, this I think is for the top one. So in order to find the product that will maximize the sum of probabilities of attracting the tags that you have input. So you want to -- so if you say I want to build a camera that will maximize modern and lightweight, this is the time that you need to build the -- select attributes such that these two tags -- the probability of having these two tags is maximized. And in this example we used epsilon 0.5 for the polynomial approximation algorithm. And as I said, we have only one subproblem. Okay. Now let's move to some qualitative experiments. So the goal here is to see if, you know, what we're finding kind of is similar to what users think about tags and products. So we use the Amazon Mechanical Turk to do a survey. So we had 30 users who participated. So we used the real data set which I described before. The real data set, we got it from crawling Amazon and then also augmenting the attributes from Google products, because we want to get more -- because Amazon has maybe 20 attributes, and then if we look up the product in Google products you can get maybe 30 more attributes, so we go to about 45 attributes in total for every product.
And the tags will come through Amazon, where the product has tags, so we have 55 tags on cameras. So the vocabulary of tags was, say, 55. And then in the first task what we do is we're going to build four cameras, two compact cameras and two SLR cameras. So what we do is we do kind of -- we asked experts in photography, and then we asked them which tags are desirable if you want to build a compact camera and which tags are desirable if you want to build an SLR camera. For instance, for SLR maybe the zoom, high zoom or high clarity or whatever. For compact it's thin or modern. So the tags that are important. And then given these inputs, we build two and two cameras [phonetic]. We also build cameras -- I mean, we decided what attributes to put in these cameras. And then we do a survey and we ask users to select between our designed cameras and the top cameras in these two categories, and we find that 65 percent of the users select our cameras, which, of course, are imaginary cameras because we don't even know if they're possible to build, right? Compared to the existing cameras. >>: Wouldn't that experiment be easy to beat? So I proposed -- first of all, you're not sure on price, right? And second of all, I propose just a new camera that has every single Boolean attribute, and most of those are positive. So -- >> Vagelis Hristidis: Well, actually you -- let me think. Because some of the -- some attributes have a kind of negative correlation with tags, right? Because if you put, let's say, a big lens, then maybe the tag modern [inaudible] lightweight will not be selected, right? So it's putting -- all the features on the camera doesn't make it necessarily the best camera. >>: Not necessarily. But -- >> Vagelis Hristidis: But, you're right, it would be a good baseline to compare against, yeah. >>: [Inaudible] >> Vagelis Hristidis: Well, who do you mean by we, we don't know?
You mean the designer or the users? >>: [Inaudible] >> Vagelis Hristidis: Yeah, exactly. So, I mean, building a product is a very complex process. So what we're doing is just give one more signal to the people who build the products. But, of course, we're not saying you should only build based on this. >>: Since you're getting feedback from the camera experts, so they are suggesting the tags, they can also potentially actually basically [inaudible], right? So this can potentially serve as sort of the gold standard, right? The ideal camera that they want to view, and you can compare this ideal camera from the expert with the camera that [inaudible] and see if you're getting -- >> Vagelis Hristidis: Well, that's -- doing that, you kind of bypass the tags. You go directly to the attributes and then you say, you know -- actually this is what people have been doing probably for years, right? They get experts to say what are the good attributes, we're going to build this. But in this work we say, you know, can -- using the tags, can it give you some extra signal that the attributes alone cannot give. The reason is that there is a lot of information about tags out there, so -- it's not the same as, you know, getting ten experts, if you have millions of users who are tagging and you can leverage all these opinions of many users. So it's kind of a model of going from a few experts to going to, you know, all the users because that's -- because [inaudible] two tags, you can get the opinion of more users. >>: In the end what you'll leverage is really the correlation between the attributes and the tags. So if you're an expert [inaudible] already has opinions on the design or the [inaudible] also potentially be able to tell you what attributes -- >> Vagelis Hristidis: Yes. But, again, you are assuming that you want to build a camera for the expert, but if you want to build a camera for the people, then you cannot just ask the experts. I mean, that is the idea.
The idea is that through leveraging tags, you can build something that is appealing to the masses and not only to the experts. I mean, an alternative, you can say, send out a question to 1 million people to [inaudible], but, you know, this is hard to do. So using the tags kind of implicitly gives you what they like. And the second qualitative experiment was the following. So -- we built six cameras designed for three groups of people: young students, retired, and professional photographers. So, again, we have kind of photography experts. I mean, I'm not saying people who work in Canon, but I'm talking, you know, people who are into photography. And they assign [inaudible] that are desirable for each of these three categories, and then we build cameras based on these tags, and then in the end we ask some other users, not the expert users but some regular users, to say, given these six cameras, given the attributes of the cameras, what tags do you think are appropriate for these cameras, and then we see that they select more or less the same tags as the experts selected, which means that -- this experiment kind of tests that there is a correlation between attributes and tags. So if you build -- if you select the attributes based on the tags, then other people, when they see the attributes, they will also decide that these tags are related. So it kind of confirms that the Naive Bayes is a realistic classifier for tags. Okay. So this concludes the first part of my talk, which was, I guess, the main part. So then to summarize, I think that, you know, the main contribution of this paper is kind of showing that the tags can be used for more things than people have used them for before. So and, actually, we will keep working on more problems on how tags can be used for advertising and other things. So we kind of report on new directions of how tags can be used. And then with [inaudible] also the algorithms that go with it, but I will say the problem itself is interesting.
And we present two algorithms, an exact algorithm based on top-k and an approximate algorithm based on the probabilistic principles. And then for the future, one thing we have been discussing is to try other classifiers like decision trees to see if we can do something better or similar or, as I said, to find more applications of tags, like in advertising. So I guess now if you have any comments or questions on this part of the talk, maybe now is a good time, and then we'll [inaudible] data. Okay. >>: [Inaudible] >> Vagelis Hristidis: Yeah. >>: And from there trying to figure out perhaps what is causing the existing products to be labeled -- >> Vagelis Hristidis: Definitely, that's very related, because -- yeah, some -- I mean, you can think of some of the tags as like positive sentiments, some of the tags are negative sentiment, and then you say if I want to build a product that will bring positive reviews, what attributes should I put. But I would say tags are more than positive or negative, because one tag can be positive for one camera and negative for another type of camera. Like, for instance, if it's a big lens, it's positive for SLR, it's negative for compact, so it's -- I would say maybe tags can give you a little bit more flexibility than sentiment, but it's very similar. Okay. So for the rest of the time I want to discuss a little about some ongoing work and some recent work we have been doing on health data management. So in the last few years I have been always trying to apply my research on health data, and because, as you know, my primary work is between structured data and text, so the health data is the ideal setting for these kinds of problems. So the kind of data sources that we have been working with, the main sources we have been -- is, first of all, we have structured health records, which are usually represented as XML or relational. So, you know, you can say what are the -- you know, what is the disease, bronchitis, medications, and so on.
And as you'll see here, also there is some free text. And this, by the way, is in a standard format. This is the HL7 CDA format. And then you have -- you know, inside here actually you can have some separate [inaudible] to tell you free text notes about the patient as a second kind of source, and then you have a very interesting and unique thing about the data. There's a lot of very rich ontologies and dictionaries that, you know -- mostly the NIH has, and there has been a great investment on the part of NIH and some other organizations to build very big ontologies which you cannot find in any other domain. So, for instance, you see this graph. This is a subgraph of this [inaudible]. Some [inaudible] is kind of a medical dictionary. And then it shows all associations between concepts. So asthmatic bronchitis and finding site of bronchial structure and so on. These graphs have millions of nodes, so it's not like, you know, the ACM classification, which has maybe 1,000 nodes. So these are really massive dictionaries. And then the last piece of data we work with is the literature from PubMed. So PubMed has biomedical publications, and, again, there are some interesting things. For instance, every publication is manually annotated with MeSH concepts, where -- MeSH is a [inaudible] dictionary. So there are people who work -- their full-time job is to get a publication and assign ten concepts to the publication, which, you know, again, gives some unique opportunities. So the kinds of problems I have been working on are entering health data, querying data, and sharing data. And I will talk briefly about each of these problems. So the first one, which is the newest project I'm working on, and so far we don't have any publication -- so the problem is how do you help users, which can be doctors, nurses, or administrators, to enter clinical notes.
So this is -- imagine a setting that a patient goes to the doctor and then, you know, they have a discussion, you know, what's your problem, what are your symptoms, or with a nurse. So there are two extreme ways to record such data. The first way is to just say I will record everything in text. So I'll just type everything in text or I use a tape recorder, you know, I -- [inaudible] recorder, and then, you know, I'll use a system to transform this to text. And the other extreme is to say I will go all structured, which means that whenever I want to enter, let's say, a disease, I will have to navigate the ontology, so maybe I will say issues and then I will say bronchitis and then I will go asthmatic bronchitis and then I will say check, this is the concept. And then medication, I will get the whole list of medications and see which one it is, how many milligrams. So the good thing is, as I mentioned before, because we have such rich ontologies, most of the concepts can be found somewhere. But [inaudible] user for every concept to try to find it. And actually there has been -- I have talked to a person who is in the IT of a military hospital, and they told me that personnel there spend hours every day to record the data in a structured way because they have some -- in some -- in the military hospitals, they have a requirement to put more structure to their notes than if you go to, let's say, a private doctor. So the question is how can you bridge the gap between these two. So we want to make it easy to enter notes but at the same time also have some structure so it's not completely unstructured. So some tools you can use for that are, first of all, we have -- often we have some clinical rules that say, for instance, if concept cognition, which can be [inaudible], contains dementia or dizziness and home meds contains psychotropic, then inquire about falls.
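To make the quoted rule concrete, here is a minimal sketch of how such a rule might drive data entry. The representation of a note as a mapping from section name to a set of concepts, the section keys `cognition` and `home_meds`, and the prompt strings are all hypothetical; the talk does not describe how the rules are actually encoded.

```python
from typing import Dict, Optional, Set

# A partially entered note: section name -> set of recorded concepts.
# A missing key means the section has not been filled in yet.
Note = Dict[str, Set[str]]


def falls_rule(note: Note) -> Optional[str]:
    """Evaluate: cognition has dementia or dizziness AND home meds has psychotropic."""
    cognition = note.get("cognition")
    home_meds = note.get("home_meds")
    if cognition is None or not (cognition & {"dementia", "dizziness"}):
        return None  # first condition not met, nothing to suggest
    if home_meds is None:
        # The rule cannot be evaluated yet, so ask for the missing section.
        return "record home meds"
    if "psychotropic" in home_meds:
        return "inquire about falls"
    return None


print(falls_rule({"cognition": {"dizziness"}}))
print(falls_rule({"cognition": {"dementia"}, "home_meds": {"psychotropic"}}))
```

The middle branch covers the case where the first condition already holds but the second cannot be checked, in which case the interface would ask for the missing section instead of staying silent.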
So using these rules can kind of guide you through the data entry, because if the first two conditions, let's say, are true, then maybe it will pop up about getting information about this one, or maybe if only the first condition is true, then maybe the system may suggest that you record some information about home meds to see if this rule can be evaluated. Another direction is to use dynamic entry forms, because as we will see -- as we see here -- so this is a screen shot from [inaudible] used by the VA hospitals, and actually this is being -- this year they are in the process of upgrading, so this is what has been used until now, but after a few months probably it will be a new version. So if you want to enter clinical notes, what you can do is you can -- on the left here there is a huge list of templates. It has thousands of templates. And because everybody can build a template or -- a template is, you know, a set of fields. And then you select -- you have to find which is the right template. Let's say you have a patient who comes here -- who comes and has trouble sleeping, let's say, and you want to record the sleeping problems. So you have to look over these thousands of templates and find the right template, and this is what a template looks like. So it has maybe some check boxes or some text boxes and so on. So then you fill out this template, and then the way it works now in the VA is that once you -- you save the template and then it is saved as text. So this is just used to help you enter the data, and then when you click okay it will be transferred to kind of a text file and it will be stored on the patient's records as text. So the one challenge here is how can you make this -- how can you allow the user to enter data without having to search through thousands of templates. So how can you know what template the user is -- needs. And you can personalize templates or, you know, learn templates.
Another thing we're working on currently with one of my students is how do you -- let's say suppose that you have a text editor and the user is typing a clinical note, and then how does the text editor kind of make -- try to guess what is the structure you are trying to do. For instance, if you say this is a 75-year-old, and then maybe the interface will say something like age equals 75, do you accept this, and you click yes and then this goes -- stored as structured, so kind of interactively add structure to the text. By the way, there are some tools that -- one of them is MetaMap, which is maintained by NIH, and then there's cTAKES, which is from Mayo Clinic, that -- what these tools do is that you can input a whole document, text document, and then they tell you which of the concepts in the dictionary are related to this document. So it will kind of try to -- do kind of information extraction on the text document given the dictionary. But, of course, this is kind of offline, so we want to do something interactively as the user is typing. And, of course, NLP [phonetic] is also a very important tool here. One challenge is that medical language is a little different than the common language because there's many shortcuts, many, you know -- sentences don't have verbs, so there are many things that are unique. Yes? >>: [inaudible] >> Vagelis Hristidis: Yeah. >>: [inaudible] >> Vagelis Hristidis: Yes. That's definitely -- that's a good point. Currently it's not supported by, you know, the existing system, but this shows you that even the low-hanging fruit, they have not implemented them. So, you know, the VA health record is supposed to be one of the most advanced out there. I mean, if you go to private doctors, as you know, it's even more old-fashioned interfaces. So your point is good. There's like a lot of opportunity to improve this. >>: You mentioned somewhere that you wanted, while the text is being entered, you wanted online characterization into structured data.
>> Vagelis Hristidis: Yeah. >>: Is that on-the-fly or [inaudible] requirement, a stringent requirement? For example, like if I am -- I'm going to a doctor. The nurse types in the message in text. Once I go back home [inaudible] takes the text, analyzes it and updates the database? >> Vagelis Hristidis: Yeah, but the problem with that is that -- these tools are not 100 percent correct, right? They make mistakes. So that's why if you do it interactively, you know, you can -- the user can confirm that these are the correct ways of extracting the data, because there's no perfect tool that will -- given the text, will find the perfect concepts. Okay. So the second direction I'm working on is querying health data. And, also, I'm very interested in user interfaces for querying health data. So not only -- I mean, traditionally we started on ranking, but there are also many other things that, you know, need to be taken care of. So in general -- here I just put a few bullets on, in general, what kind of things are important for the user experience when the user is searching. So ranking, obviously, is one important thing. But it has been shown that it's not the only important thing. So the other important things are how do you formulate a query, to help the user formulate a query, for example, do some [inaudible], how do you present the results, so do you present the results like in Google or Bing where, you know, they are listed one after the other, or do you also do some grouping, some graphics. Also, you know, how do you handle user feedback. The user clicks on a result, do you want to give more relevant results, or maybe you want to personalize to the user or, you know, suggest query reformulations. So specifically for health data, you know, all these questions are open. So, for instance, what is a good answer is one big challenge. Because suppose you're a doctor and you type something like breast cancer complications, for instance. What are the semantics of the answer?
Are you looking for your own patients? Are you looking for -- for patients of the hospital? Are you looking for literature? And if you're looking for your patients, what are you looking for? Are you looking for the names? Are you looking for the part that talks about breast cancer complications? You know, it's not clear what is a good answer. And also how can you use maybe the context of what you're doing to say what's a good answer. For instance, if a patient is kind of visiting, you have the file of the patient open, then probably you want something related to this patient, so the patient should become the context [inaudible] that question. Then -- by the way, there is some work called [inaudible] where what they do is -- if you're looking at the patient record, then it will find some literature that is related to the patient's record. And there are some pretty simple ways that people have been using. For instance, you extract some key words from the patient record and then you submit them to PubMed and see what is related. So there's nothing very fancy that people have done there. Also -- I'll also mention granularity: do you want to show the whole record of the patient, or do you want to show what's specific to the query? [inaudible] ranking is also how do you -- what are the semantics of ranking for health data? Do you want to rank patients by how serious they are, by time, by location? Also, what [inaudible] conditions do you display. Suppose -- let's say that you have a -- what I envision is, let's say, one single text box, and then you can search everything from literature to your patients to other patients to studies to experiments. So then you can imagine you have facet conditions. You say I want to see literature, I want to see my patients, I want to see, you know, studies about medications, and so on and so on. So you have facets which can be fixed or can be dynamic based on the query. User interface, you know, of course is very important.
So the first question is: is the web search interface a good interface, with the text box and then the list of results, or is there some other user interface that would be more appropriate? Also, you have personalization in a way, and not only personalization on a personal level but also on, let's say, a stakeholder type level. So you have patients, doctors, administrators who search the same data, but they're searching from a different perspective. So how do you achieve that? >>: I have a question here. So this looks a lot like [inaudible]? >> Vagelis Hristidis: Well, okay. One unique thing, I think, is that you have these dictionaries here, so -- which offers some -- it's a kind of unique input. That's one thing. Now, the other unique thing is that people have been working for decades on enterprise search, whereas this one is, you know, a much newer area, and there are, you know, many reasons why people think there hasn't been the right amount of effort to build these things. And I guess the semantics of the queries are different. I mean, of course, also in enterprise search you can define different types of semantics, but -- hmm. But, yeah, I think that there are many semantics here, many types of queries that have not been addressed, like, for instance, there are different settings of query. I'm a doctor, I have a patient in the office, I do a query. That's one setting. And then another setting is that I go home at night, I want to see a summary of my patients -- I mean, for all of them, you can say, you know, there has been some related work, but then they also are, you know, unique. And, by the way -- so this is, again, the VA EHR system, so I'm just showing this to show you what is currently available in terms of querying. So, actually, I'm not showing you probably the right screen, but the only way that you can query in this system is by querying on the patient name.
So you just say I want to open the file of this patient, then put the patient name or the patient ID and then you get the -- let's say this is the file of the patient. There's no other search functionality. And this is not only the VA. Most of the health records don't have any search functionality, which, you know, you think is surprising, but, you know, people maybe have not agreed on what is useful to search. Maybe searching would make things more complicated and confuse users. So there are many reasons why there is no search. But, on the other hand, there is also some research that says that the users would like to have search. So this is -- now I'll present very briefly a couple of, you know, previous works we had on searching health data. So the first is about how to do keyword search on XML where you also have -- you are aware also of the health ontologies. So the idea is that suppose you have a health record which is in XML, which can be viewed as a tree, and then you also have some dictionaries. So suppose that -- let's say one of the query keywords matches the dictionary but doesn't match the health record itself. How can you still, you know, say that maybe this record is relevant because, for instance, the query says asthma, [inaudible] has bronchitis, and I know that these two are related through some path here. So this is one of the works we have done. Another work is on how do you navigate the results of a search on PubMed. So, as I said, the interesting thing in PubMed which is kind of unique is that every publication is annotated manually with a set of about ten concepts, and the concepts come from a hierarchy of concepts, the MeSH hierarchy. So what you can do is you can -- one way to display the results is to say that I will organize the results on the tree. So, for instance, this is cell physiology, 161, meaning that 161 of the results have been annotated with this concept.
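The idea of organizing results on a concept tree with a count at each node can be sketched as follows. This is only an illustration under stated assumptions, not the speaker's system: the hierarchy is a toy tree given as a child-to-parent map, the concept names and paper IDs are invented, and a result is assumed to count toward a node if it is annotated with that concept or any of its descendants.

```python
from collections import defaultdict

def concept_counts(parent_of, annotations):
    """Count, for each concept, the results annotated with it or a descendant.

    `parent_of` maps a concept to its parent (roots are absent from the map);
    `annotations` maps a result ID to the concepts it is annotated with.
    The hierarchy is assumed to be a tree (no cycles).
    """
    results_under = defaultdict(set)
    for result_id, concepts in annotations.items():
        for concept in concepts:
            node = concept
            # Walk up to the root, crediting the result to every ancestor.
            while node is not None:
                results_under[node].add(result_id)
                node = parent_of.get(node)
    return {concept: len(ids) for concept, ids in results_under.items()}

# Toy hierarchy and annotations, invented for illustration only.
PARENT_OF = {
    "Cell Death": "Cell Physiology",
    "Cell Growth": "Cell Physiology",
    "Cell Physiology": "Biological Phenomena",
}
ANNOTATIONS = {
    "paper1": ["Cell Death"],
    "paper2": ["Cell Growth", "Cell Death"],
    "paper3": ["Biological Phenomena"],
}

counts = concept_counts(PARENT_OF, ANNOTATIONS)
for concept in sorted(counts):
    print(concept, counts[concept])
```

Using sets rather than plain counters matters here: a paper annotated with two sibling concepts should still count only once at their common parent.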
So then doing that, the problem is that if you have thousands of results, the tree can be very big and have thousands of nodes because every node has -- every paper has maybe about ten annotations. So then even displaying the tree may not be very useful because the tree can be as big as the list of results. So then what we did is we have some algorithms so that you can navigate the tree in a more efficient way, so kind of skipping some of the levels depending on some assumptions on what is useful for the user [inaudible] to the user and we will do some jumping of levels such that we minimize the expected navigation time. So we have some cost model. Based on the cost model -- you also have, here at Microsoft, developed such models, you know, saying that when I read something, it has cost 1, when I click something, it has cost 1 -- and then given this cost model, [inaudible] tree. And, finally, the third direction I'm working on, which, again, is very new and we don't have any publications, and actually this is more -- I'm working on this with some people from the nursing school. So the problem is the following. I'm not sure exactly what the technical challenges are, but this is an interesting problem. So [inaudible] the following. For older people who are staying at home and have home care, what happens is that there are agencies, state agencies, that send people called case managers to visit these older people once a month to see, you know, if they need some help. And this is mostly about the people who -- it's more important for people who don't have families or who are very low-income and cannot afford their own care, so the state takes care of them. So what happens is the case manager goes there once in a while and then the case manager has a form, and the form is like maybe seven pages and has check boxes, you know, the house is clean, the person is -- has a broken something, there is no food in the patient's fridge and these kinds of things.
And then -- and also, you know, what kind of medication the patient is taking. And then this -- so then the case managers take these forms back to their home care institute, which is usually nonprofit, you know, supported by the state. And then, also, the patients at some point go to the doctor, and because the doctor -- because the patients, you know, are very old, many times they have dementia and they cannot, you know, communicate very well with the doctor. So the problem is that at this point there is no communication between the doctor and the case manager. Because at best what happens is that the case manager will submit a form and then maybe will fax it to the doctor's office, and then this form will be, you know, buried somewhere in the doctor's office and the doctor will never see the form. So then the doctor cannot prescribe the right medication because he or she doesn't have the communication with the case manager who knows the patient. So there is a kind of broken communication, so -- and this kind of shows what's happening. So the case manager tries to contact the doctor through the nurse or voicemail or fax, and then the patient goes to the doctor and the doctor usually doesn't have this information. So the idea is how can technology help in making this communication better by building some -- let's say some central portal that, you know, all case managers, patients, and physicians can access. So maybe one day, let's say, the patient goes to the doctor, maybe the doctor can just look at a website and see what the case manager has said about the patient. So now what are the technical challenges? The technical challenges are how do you make the user interface easy so that the doctor can just, you know, on one screen see the summary of what's important for the patient. How can the case manager -- again, make it easy for case managers to update the information of the patient without having to go through many pages.
You know, there has been some work on how to do -- recently some work on how do [inaudible], like how do you order the set of questions of user surveys so that the most important ones go first and then the ones that kind of -- can be inferred from other questions do not have to be asked. So these kinds of things. How do -- so you want to minimize the time of the users. That's the purpose of this work. How do you build a portal such that it minimizes the effort from all parties. Also, how do you decide if an alert should be sent to any of them so that you don't get those -- you know, there's a big issue with -- on [inaudible] systems that, you know, you don't alert too much. You don't always pop up windows or send emails with alerts, so you minimize alerts. And, finally, which kind of leads back to the question we had before, some of the properties of health data which are and are not unique, they are -- if you take maybe each of them separately, maybe it's not unique, but if you take them all together, maybe they become something more unique. So you have [inaudible] issue, you have little missing values, dirty data, you have a mix of text and structure, you have a lot of shortcuts, like [inaudible]. You have a lot of negated phrases. So, you know, the doctor might say the patient does not have diabetes, and this is kind of very common practice, that you want to explicitly say what the patient does not have, so that means a simple keyword search may fail. Time stamps are important. They're everywhere. So you have this -- you know, how do you handle time. And then this concludes my talk. So I would like to thank my students, and this is where you can find more information. [applause]. >> Tao Cheng: Questions? >>: So you're building tools for this in this medical area. This is a very practical, hands-on activity. It's something you just sort of [inaudible] and developed it on your own [inaudible] hope for the best, that's not going to work.
>> Vagelis Hristidis: Yes. So actually that's the big challenge in working in this area, that you have to work closely with the collaborators from these areas. So, specifically, I have a very good collaborator from the VA, and he's a medical doctor and doing medical informatics at the VA. And also I have some other contacts which -- so I try to meet with them to see, you know, to do something that has a chance that they will use eventually. So this is not the kind of research you can do in your lab with your students. And that's actually one main challenge in doing that, because you depend on other people. So you cannot say, you know, I will work hard and I will, you know, make the deadline and submit a paper, because maybe you wait for the user survey and the user survey takes months to -- >>: [inaudible] usability testing which is also a big piece of this. These computer science students, are they okay with that? That's pretty much an essential ingredient [inaudible]. >> Vagelis Hristidis: You mean usability -- you mean if usability is part of computer science or not? >>: Well, just surveying. It's very labor intensive. Whether these are HCI students that are happy to do it or whether they're database students, this is tedious -- >> Vagelis Hristidis: Well, I -- I think that the students like to get out of their strict area and do something else, because I think that -- I think it's interesting for students to have -- you know, to not have only one focus but to get some experience from other areas. >>: So you haven't had any resistance? >> Vagelis Hristidis: No, no. I mean, I don't have any resistance from students. The only challenge is to get time from doctors, because they're very busy. It's hard to convince them that your collaboration with them is going to bring them something positive. >>: Well, there's that. If you're doing these sort of interfaces, you're working with nurses, and my impression is they're all grossly overworked.
>> Vagelis Hristidis: Yeah. >>: You're just adding [inaudible] getting research done, they're more oriented in getting their work done so they can go home at a reasonable time. >> Vagelis Hristidis: Yeah. And I'm working with a nurse who's -- she's an assistant professor. So that's motivated to work until she gets done. >>: Works best. >> Vagelis Hristidis: Yes. >>: Thanks. >> Vagelis Hristidis: Thanks. >>: All right. [inaudible] setting up all these user surveys and seeing [inaudible]. >> Vagelis Hristidis: That's right. Yes. >> Tao Cheng: Thanks. [applause]
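The point made in the talk that negated phrases ("the patient does not have diabetes") can make a simple keyword search fail can be illustrated with a minimal sketch. It is not the speaker's method and is far simpler than real clinical NLP tools: it assumes a small hand-picked list of negation cue words and treats a term as negated if a cue appears within a few words before it in the same sentence. The cue list, window size, and `mentions` helper are all hypothetical.

```python
import re

# Hypothetical, deliberately tiny list of negation cue words.
NEGATION_CUES = ("no", "not", "denies", "without")

def mentions(note, term, window=5):
    """Label each occurrence of a single-word `term` as 'affirmed' or 'negated'.

    A mention is 'negated' if a cue word appears among the `window` words
    that precede it within the same sentence.
    """
    term = term.lower()
    labels = []
    for sentence in re.split(r"[.;!?]", note.lower()):
        tokens = re.findall(r"[a-z]+", sentence)
        for i, token in enumerate(tokens):
            if token == term:
                preceding = tokens[max(0, i - window):i]
                negated = any(cue in preceding for cue in NEGATION_CUES)
                labels.append("negated" if negated else "affirmed")
    return labels

note = "The patient does not have diabetes. Mother has diabetes."

# A plain keyword match cannot tell the two mentions apart.
keyword_hit = "diabetes" in note.lower()
labels = mentions(note, "diabetes")
print(keyword_hit, labels)
```

A fixed word window is a crude heuristic; it will mislabel sentences where the negation applies to a different concept, which is part of why the talk argues for letting the user confirm extracted structure interactively.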