>> Tao Cheng: So let's get started. So today we are very happy to have
Professor Vagelis Hristidis give us a talk. He is an associate professor at the
University of California, Riverside, and I think he got his Ph.D. from the
University of California, San Diego. So [inaudible] the same sunny state. And
he's basically interested in bridging the gap between database systems and
information retrieval. His work has been supported by many sources, including the
National Science Foundation, the Department of Homeland Security and many
big companies.
So today he will talk about collaborative tagging for web items. So let's
welcome him.
>> Vagelis Hristidis: Hello, everyone. Thanks for having me. And it's great to be
back after my 2003 internship, which has led to many long-term friendships and
collaborations. And actually this work is also a collaboration with Gautam Das,
who I met while I was an intern here.
So in this talk I will -- the main part will be presenting the results that were
published in [inaudible] last year, and then if I have some more time I will talk
a little about my more recent work on data management for health data.
Okay. So let's start. Okay. So the motivation is that if you go to any e-commerce
website these days, you will see that there are a lot of new kinds of information
popping up all the time, including ratings, likes and tags, which are the
focus of this work.
So as you see, this one was about cameras. This site is about songs, so people
are tagging songs with what they think is useful for remembering or for other
people to find. Also, you see here people are tagging trips. So when you go on
a trip, you say this is adventure or this is art and so on.
So it seems that many of the objects on the internet have a set of tags
associated with them. So tagging -- there has been quite some work on tags, but
most of the work on tags before ours was on using tags to kind of extend
the content of the object. So in addition to the text and the rest of the content of
the object you also see the tags, so you have a more, let's say, complete picture of
the object and you can increase your recall when you search for objects. Also in
ranking you can use the tags to kind of extend the document.
There has also been work on classification and trending, seeing which
tags are becoming more popular, and there has been work on how to assign
tags; also, you guys here have been working on how to tag objects and
how to predict what tags an object will acquire.
So the question we're trying to answer here is if we can -- if the tags can be
useful in designing better items where items can be products, can be trips, you
know, anything. And the answer is yes, as we'll see.
So the question is how can we use tags to design a new camera. Suppose we
are working for Canon and we say, you know, I want to do some research on the
internet using the tags and I will propose a few new cameras that would probably
be attractive to people, so I want to see what attributes a camera should have
such that this camera will attract good tags.
So the same way for music. Let's say I want to build a new song, create a new
song: what kind of attributes should a song have to become popular and, you
know, [inaudible] and so on.
So, to repeat, the problem is how to, you know, use the tags to build products. So
I'll use tags to decide what attributes a product should have.
So more formally, the problem is the following. Suppose that we have a table
that has two types of columns. The first type of column is the attributes of the
product. For instance, if it's a camera: what kind of camera, what is the zoom,
the color, flash, and so on. And then the second set of columns
is the tags that people have assigned to this camera.
And then we want to design K items, so we want to add K new rows to the table
such that these new items will have a [inaudible] of attracting the set of tags
we want them to attract.
So, for instance, if I say I want to build an easy and lightweight camera, so I
select, let's say, these two tags as desirable, then our algorithm will decide what
kind of attributes a camera should have such that it will attract these two tags.
Of course, you may also want to say, but I want to avoid this unreliable tag, have
some [inaudible].
Okay. So the first challenge is how do we decide how the attributes and the tags
are related. So there has been some work showing that this can be viewed as a
classification problem. So given the attributes of the product, for every tag I
build a classifier to decide if the tag will be present or not present given the
attributes.
So, of course, there are many different types of classifiers, but given some
research we did in previous work and some experiments we did, we found that
the Naive Bayes classifier is a good solution for that. Of course, any classifier
could be plugged in.
So in this case what we do is we use the Naive Bayes classifier to estimate the
probability that the product will attract tag Tj given o, which is the set of all the
attributes of the product.
So we want to compute the probability of a tag given the [inaudible] values.
And this is the standard inference from the derivations of the Naive Bayes
classifier. So in the end, the probability of a tag given an object is this equation
here. So this is the probability of not having the tag, this is the probability of
having the tag, regardless of what object we're talking about, and these are the
pairwise probabilities of an attribute value. For instance, this is zoom equals 3
given no tag, and this is zoom equals 3 given the tag. And then we use rj to kind of
summarize this term here.
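For reference, this is presumably the standard Naive Bayes form of that equation; the slide itself is not reproduced in the transcript, so the notation below, and the reading of rj as the ratio term, is an assumption:

```latex
% Assumed reconstruction of the slide's equation (standard Naive Bayes odds form).
P(T_j \mid o) \;=\; \frac{1}{\,1 + \dfrac{P(\neg T_j)}{P(T_j)}\displaystyle\prod_{i}\frac{P(A_i = a_i \mid \neg T_j)}{P(A_i = a_i \mid T_j)}\,}
```

so that, writing r_j for the whole ratio term, the probability is simply 1 / (1 + r_j).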
>>: [inaudible]
>> Vagelis Hristidis: Yes. In our problem, the set of good tags is an input. So
you say I want to build a product that will -- so let's say I want to build a product
that will try to attract, let's say, these two tags, which means that you think that
these two tags are positive. But you don't have to give all the positive tags as
input, because you may want to build, let's say, one lightweight camera, and then
you want to build a professional camera, so you put different tags as input.
>>: Seems to me that your output is new items.
>> Vagelis Hristidis: Yes.
>>: There's also an implicit assumption that it's easy to create
these new items.
>> Vagelis Hristidis: That's a very good -- that's a very good point. So the
argument we make is that the item that we output would then have to be
combined with some other, let's say, research on what is possible to build,
because maybe you output an item that is not possible to build with current
technology, right? Or maybe it's too expensive to build.
So the result of this algorithm is just one of the inputs that the designer will take
to decide what to build, along with market research and other things.
Yes?
>>: Is it limited to categorical attributes? So, for example, the tag [inaudible] would
highly correlate with price, which is numerical.
>> Vagelis Hristidis: Yeah, actually we defined most of the formulas in
terms of Boolean attributes, and in the paper we explain how it can be extended to
categorical. And if I'm not mistaken, we also discuss numerical attributes, but in a
very simple way, like splitting into ranges, so not considering the order.
Okay. [inaudible] conditional independence to do that in Naive Bayes. So then, if
you're given a set of target tags you want your product to satisfy, the problem
becomes how do you create the object given that set of tags such that
the sum of these probabilities is maximized across all the targets. So if you have
two targets, let's say lightweight and modern [phonetic], you know, you at the same
time maximize both probabilities.
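As a concrete illustration of the objective just described, here is a minimal sketch in Python, assuming Boolean attributes and hypothetical, precomputed Naive Bayes statistics per tag; it is not the authors' code, just the formula above plus the sum over the desired tags.

```python
def tag_probability(item, tag_stats):
    """Naive Bayes estimate of P(tag | item).
    `item` maps attribute name -> Boolean value; `tag_stats` holds hypothetical
    precomputed values: "p_tag" = P(tag), plus dictionaries keyed by
    (attribute, value) for P(attr=value | tag) and P(attr=value | not tag)."""
    # Odds form: P(tag | item) = 1 / (1 + P(~tag)/P(tag) * product of likelihood ratios)
    ratio = (1.0 - tag_stats["p_tag"]) / tag_stats["p_tag"]
    for attr, value in item.items():
        ratio *= (tag_stats["p_attr_given_not_tag"][(attr, value)]
                  / tag_stats["p_attr_given_tag"][(attr, value)])
    return 1.0 / (1.0 + ratio)


def design_score(item, desired_tags, stats_per_tag):
    """Objective from the talk: the sum of P(tag | item) over the desired tags."""
    return sum(tag_probability(item, stats_per_tag[t]) for t in desired_tags)
```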
Okay. So given this problem formulation, the first, probably expected, result is that
this is an NP-hard problem, because the Naive way would be to
consider all possible combinations of values for all the attributes, which is an
exponential number, and we prove that it's NP-hard.
So then the next step is what algorithm can we propose. So we propose two
algorithms. The first is an exact algorithm which obviously only works for
relatively small data sets, and then we have an approximate algorithm that works
for bigger data sets.
Okay. So let's present them one by one. So the exact algorithm is based on top-k
algorithms. Actually it's a combination of two top-k algorithms. So before we
talk about the exact one: the Naive approach would be, as I said, to create all
possible items, so this would be exponential. And then the exact algorithm, which
we call the exact two-tier algorithm, is called two-tier because there are two tiers
of top-k algorithms.
So in the first tier we find what are the best products for each tag, and in the
second tier we combine across all tags that are in the input.
And obviously, since this is an NP-hard problem, in the worst case this algorithm
can be very slow, but as we see in the experiments, the average case performs
very well. It's the same with all top-k algorithms: in the worst case they can be
very bad.
Okay. So these are the two tiers of the algorithm. So in the first tier you have
one, let's say, top-k execution for every tag. So here we have shown that the
user is looking to create a product that will maximize the probabilities of tags T1 to
Tz. So for each tag we have, let's say, a pipelined algorithm that will get the next
[inaudible] suggested product that will maximize this tag independently of the
other tags, and then in the top tier we combine all these tags together to decide
which is the best product that globally maximizes the probability for
the tags.
So in the first -- now let's talk in a little more detail. In the lower level, what we do
for each of these top-k algorithms is we create one list for a
subset of the attributes. So suppose we have, let's say, 50 attributes. We can
build 10 lists of five attributes each.
And then for every list we consider all combinations of values for these attributes,
which is, you know -- since these are relatively few attributes per list, the number is
not very high. For instance, if the attributes are Boolean and you have five of them,
you have 2 to the 5 entries here, all combinations. And then for each combination we
assign a score, which is how much this subset of attributes can contribute to the
probability of having the tag Tz.
And then -- so you have this list ordered by the value that I mentioned, and then you
have kind of a rank join algorithm, but there's a difference, because
in rank join there is a join condition, but here there's no join condition.
Everything joins with everything, which means all combinations from here and
here and here, you know, can potentially be in the result. There's no join condition.
You have to do all the combinations of everything with everything.
But, still, in practice it is not so bad, because, as we will see, you don't
have to go very deep in the lists.
Yeah, that's what I just mentioned. So you just -- all combinations and then you
compute a score.
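A rough sketch of what one lower-tier list could look like, assuming Boolean attributes and reusing the hypothetical `tag_stats` structure from the earlier sketch; the paper's actual bookkeeping and pruning are not shown.

```python
import itertools
import math

def build_sublist(attr_group, tag_stats):
    """For one small group of Boolean attributes, enumerate every combination of
    values and score it by its additive, log-space contribution to P(tag | item),
    i.e. the sum over the group of log P(a | tag) - log P(a | not tag).
    Returns the combinations sorted best-first, forming one lower-tier list."""
    entries = []
    for values in itertools.product([False, True], repeat=len(attr_group)):
        assignment = dict(zip(attr_group, values))
        score = sum(
            math.log(tag_stats["p_attr_given_tag"][(a, v)])
            - math.log(tag_stats["p_attr_given_not_tag"][(a, v)])
            for a, v in assignment.items()
        )
        entries.append((score, assignment))
    entries.sort(key=lambda entry: entry[0], reverse=True)
    return entries
```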
And then in the top tier you have -- here you have a complete product, and then
you want to find the complete product that maximizes all the tags.
So this one is the same as the threshold algorithm. So these lists here join
when it is the same product. So this was the exact algorithm, and now we'll
present the approximate algorithm.
So the good thing about this approximate algorithm is that it has an
approximation guarantee, so it has a bound. So the algorithm is kind of inspired
by a polynomial-time approximation algorithm that solves the
subset sum problem.
So it works as follows. The first step is to split the set of tags, T1 to Tz, into
smaller groups, so that you have r groups.
And then, as we'll see, the error bound will be 1 over r times 1 plus epsilon, so this
r depends, as I said, on how many subproblems we create. As we'll see in the
experiments, we create only one subproblem, because we have a relatively
small number of attributes, which means in that case we'll have approximation 1
over 1 plus epsilon.
So epsilon is a user-specified constant. So the user specifies epsilon, and given
epsilon, the algorithm will run in polynomial time to achieve this bound.
So the way it works is as follows. So first you start with -- so each point here is in
z-dimensional space. So you have one dimension for every tag.
So then -- and the coordinates here -- so in this example suppose you have
two tags, and then what this point says is that if you set all four attributes to
zero -- suppose you have only four attributes -- then the product will have probability
0.3 of attracting this tag and 0.18 of attracting this tag. So that's all this point means.
So what you do is -- again, here, suppose we have only four attributes, but, of
course, in practice you have more attributes. So how the algorithm starts: it
starts with all zeros, and we compute using our formula what is the
probability for every tag, and then we flip one bit at a time, which means we flip
one value at a time. So in the first step, starting from all zeros, we change the first
bit, so we have all zeros or 1000, and we compute the score for every attribute --
sorry, for every tag -- again for these two. This one was the same as before, and then
we compute the score for this one.
And then what we do is we do clustering, because our goal is to
always have a polynomial number of candidate products, because if the set becomes
exponential, then your algorithm is not polynomial anymore, right?
So we use a specific clustering algorithm such that we eliminate points that are
relatively close to other points, so we can keep kind of a summary of all the
points. So we do a clustering and we keep one representative from every
cluster. And the details of the distance that we use to eliminate points are in the paper.
And then in the next iteration we flip the second bit, which continues from the
previous slide. So the 0000 was eliminated. Now, when we continue here, we
flip the second bit, and let's say in this case these points are far enough from
each other that they cannot be eliminated, and then you continue like this to
the next iteration and so on.
And then in the end you are left with a set of candidates, and then you select the
one that has the highest score across all tags.
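A compressed sketch of the candidate-generation loop just described, assuming Boolean attributes and reusing `tag_probability` from the earlier sketch; the `close` test below is a stand-in for the paper's actual trimming rule and its (1 + epsilon) bookkeeping.

```python
def approximate_design(attributes, desired_tags, stats_per_tag, delta=0.1):
    """Fix one attribute per iteration, tracking each candidate item as a point
    in tag-probability space, and trim points that are close to an already-kept
    point so the candidate set stays polynomial in size."""

    def tag_vector(item):
        return tuple(tag_probability(item, stats_per_tag[t]) for t in desired_tags)

    def close(p, q):
        # Hypothetical trimming test: every coordinate within a (1 + delta) factor.
        return all(qi <= pi * (1.0 + delta) and pi <= qi * (1.0 + delta)
                   for pi, qi in zip(p, q))

    candidates = [dict.fromkeys(attributes, False)]        # start from all zeros
    for attr in attributes:
        expanded = []
        for item in candidates:
            expanded.append(item)                           # keep attr = False
            flipped = dict(item)
            flipped[attr] = True                            # flip this one bit
            expanded.append(flipped)
        # Keep one representative per cluster of nearby points, best-scoring first.
        kept, kept_points = [], []
        for item in sorted(expanded, key=lambda it: sum(tag_vector(it)), reverse=True):
            point = tag_vector(item)
            if not any(close(point, other) for other in kept_points):
                kept.append(item)
                kept_points.append(point)
        candidates = kept
    # Return the surviving candidate with the highest total probability.
    return max(candidates, key=lambda it: sum(tag_vector(it)))
```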
Okay. So now let's move to the experiments. So we used two data sets, one
synthetic and one real. The synthetic one is not uniformly distributed. We kind of
tried to make it a little realistic, so, you know, there is some skew in most of the
attributes.
But, by the way, the important thing here is not so much the number of rows but
the number of attributes, because the complexity of the algorithm is in the
number of columns: if you want to enumerate all possible products to build, the
number of attributes is what matters. The number of rows is only used to compute
the probabilities in Naive Bayes, so it's not very important. So even though the
[inaudible] looks small, the number of attributes is what is important.
So in the first experiment we compared the time to create the top products
between three algorithms: the Naive one, which enumerates all the combinations
of attribute values; the exact two-tier, which is the top-k algorithm [inaudible]
before; and the polynomial approximation algorithm.
And as expected, you see that as the number of attributes increases, at
some point only the approximation algorithm can scale.
Yes?
>>: Just briefly for clarification, you said the time to create the top products.
What does that mean, top products, here?
>> Vagelis Hristidis: In this case I think we're looking for the top one -- yeah, this
I think is for the top one. So in order to find the product that will maximize the
sum of probabilities of attracting the tags that you have input. So if you say I want
to build a camera that will maximize modern and lightweight,
this is the time that you need to select the attributes such that the probability of
having these two tags is maximized.
And in this example we used epsilon 0.5 for the polynomial approximation
algorithm. And as I said, we have only one subproblem.
Okay. Now let's move to some qualitative experiments. So the goal here is to
see if what we're finding is kind of similar to what users think
about tags and products.
So we used Amazon Mechanical Turk to do a survey. So we had 30
users who participated. So we used the real data set which I described before.
The real data set, we got it by crawling Amazon and then also augmenting the
attributes from Google Products, because we wanted to get more -- because
Amazon has maybe 20 attributes, and then if we look up the product in Google
Products we can get maybe 30 more attributes, so we get to about 45 attributes
in total for every product.
And the tags come from Amazon, where the products have tags, so we have
55 tags on cameras. So the vocabulary of tags was, say, 55. And then in the
first task what we do is we build four cameras, two compact cameras
and two SLR cameras.
So what we do is we asked experts in photography which tags are desirable if you
want to build a compact camera and which tags are desirable if you want to build an
SLR camera. For instance, for SLR maybe high zoom or high clarity or whatever;
for compact it's thin or modern. So those are the tags that are important.
And then given these inputs, we built two and two cameras, I mean, we decided
what attributes to put in these cameras. And then we do a survey and we ask users
to select between our designed cameras and the top cameras in these two categories,
and we find that 65 percent of the users select our cameras -- which, of course, are
imaginary cameras, because we don't even know if they're possible to build, right? --
compared to the existing cameras.
>>: Wouldn't that experiment be easy to beat? So I propose -- first of all, you're
not sure on price, right? And second of all, I propose just a new camera that has
every single Boolean attribute, and most of those are positive. So --
>> Vagelis Hristidis: Well, actually you -- let me think. Because for
some attributes there is a kind of negative correlation with tags, right? Because if you
put, let's say, a big lens, then maybe the tag, you know, [inaudible] lightweight will
not be selected, right? So putting all the features on the camera doesn't
necessarily make it the best camera.
>>: Not necessarily. But --
>> Vagelis Hristidis: But, you're right, it would be a good baseline to compare
against, yeah.
>>: [Inaudible]
>> Vagelis Hristidis: Well, who do you mean by we, we don't know? You mean
the designer or the users?
>>: [Inaudible]
>> Vagelis Hristidis: Yeah, exactly. So, I mean, building a product is a very
complex process. So what we're doing is just give one more signal to the people
who build the products. But, of course, we're not saying you should only build
based on this.
>>: Since you're getting feedback from the camera experts, so they are
suggesting the tags, they can also potentially actually basically [inaudible], right?
So this can potentially serve as sort of the gold standard, right? The ideal
camera that they want to build, and you can compare this ideal camera from the
expert with the camera that [inaudible] and see if you're getting --
>> Vagelis Hristidis: Well, that's -- doing that, you kind of bypass the tags. You
go directly to the attributes and then you say, you know -- actually this is what
people have been doing probably for years, right? They get experts to say what
are the good attributes, and we're going to build this.
But in this work we say, you know, can using the tags give you some
extra signal that the attributes alone cannot give. The reason is that there is a lot
of information in tags out there, so it's not the same as, you know, getting ten
experts; you have millions of users who are tagging and you can leverage the
opinions of all these users. So it's kind of a model of going from a few experts to
going to, you know, all the users, because that's -- because through [inaudible]
tags, you can get the opinion of more users.
>>: In the end what you leverage is really the correlation between the attributes
and the tags. So an expert [inaudible] who already has opinions on the design
or the [inaudible] could also potentially be able to tell you what attributes --
>> Vagelis Hristidis: Yes. But, again, you are assuming that you want to build a
camera for the expert, but if you want to build a camera for the people, then you
cannot just ask the experts. I mean, that is the idea. The idea is that through
leveraging tags, you can build something that is appealing to the masses and not
only to the experts.
I mean, as an alternative, you can say send out a question to 1 million people to
[inaudible], but, you know, this is hard to do. So using the tags kind of implicitly
gives you what they like.
And the second qualitative experiment was the following. So we built six
cameras designed for three groups of people: young students, retired people, and
professional photographers.
So, again, we have kind of photography experts. I mean, I'm not saying people
who work at Canon, but, you know, people who are into photography.
And they assigned the [inaudible] that are desirable for each of these three categories,
and then we built cameras based on these tags, and then in the end we asked
some other users, not the expert users but some regular users, to say: given
these six cameras, given the attributes of the cameras, what tags do you think
are appropriate for these cameras? And then we see that they select more or less
the same tags as the experts selected, which means that -- this experiment kind of
tests that there is a correlation between attributes and tags. So if you build -- if
you select the attributes based on the tags, then other people, when they see the
attributes, will also decide that these tags are related. So it kind of confirms
that Naive Bayes is a realistic classifier for tags.
Okay. So this concludes the first part of my talk, which was, I guess, the main
part.
So then, to summarize, I think that, you know, the main contribution of this paper
is kind of showing that tags can be used for more things than people have
used them for before. And, actually, we will keep working on more problems on
how tags can be used for advertising and other things.
So we kind of report on a new direction of how tags can be used. And then, with
[inaudible], also the algorithms that go with it, but I would say the problem itself is
interesting.
And we present two algorithms, an exact algorithm based on top-k and an
approximate algorithm based on probabilistic principles. And for future work,
one thing we have been discussing is to try other classifiers, like decision trees, to
see if we can do something better or similar or, as I said, to find more
applications of tags, like in advertising.
So I guess now if you have any comments or questions on this part of the talk,
maybe now is a good time and then we'll [inaudible] data.
Okay.
>>: [Inaudible]
>> Vagelis Hristidis: Yeah.
>>: And from there trying to figure out perhaps what is causing the existing
products to be labeled --
>> Vagelis Hristidis: Definitely, that's very related, because -- yeah, some -- I
mean, you can think of some of the tags as, like, positive sentiments and some of the
tags as negative sentiments, and then you say, if I want to build a product that will
bring positive reviews, what attributes should I put in.
But I would say tags are more than positive and negative, because one tag can be
positive for one camera and negative for another type of camera. Like, for instance,
a big lens is positive for an SLR and negative for a compact, so I would say
maybe tags can give you a little bit more flexibility than sentiment, but it's very
similar.
Okay. So for the rest of the time I want to discuss a little bit some ongoing
work and some recent work we have been doing on health data management.
So in the last few years I have been trying to apply my research to health
data, because, as you know, my primary work is between structured data
and text, and health data is the ideal setting for these kinds of problems.
So the kinds of data sources that we have been working with, the main sources,
are, first of all, structured health records, which are
usually represented as XML or relational data. So, you know, you can say what is
the disease, bronchitis, the medications, and so on.
And as you see here, there is also some free text. And this, by the way, is in a
standard format. This is the HL7 CDA format.
And then you have -- you know, inside here you can actually have some
separate [inaudible] to tell you free text notes about the patient, as a second kind
of source. And then you have a very interesting and unique thing about this data:
there are a lot of very rich ontologies and dictionaries that, you know -- mostly the
NIH has them, and there has been a great investment on the part of the NIH and some
other organizations to build very big ontologies, which you cannot find in any
other domain.
So, for instance, you see this graph. This is a subgraph of [inaudible]. [inaudible]
is kind of a medical dictionary. And then it shows all the associations
between concepts.
So asthmatic bronchitis, and finding site of bronchial structure, and so on. These
graphs have millions of nodes, so it's not like, you know, the ACM classification,
which has maybe 1,000 nodes. So these are really massive dictionaries.
And then the last piece of data we work with is the literature from PubMed. So
PubMed has biomedical publications, and, again, there are some interesting
things. For instance, every publication is manually annotated with MeSH
concepts, where MeSH is a [inaudible] dictionary. So there are people whose
full-time job is to take a publication and assign ten concepts to the
publication, which, you know, again, gives some unique opportunities.
So the kinds of problems I have been working on are entering health data,
querying data, and sharing data. And I will talk briefly about each of these
problems.
So the first one, which is the newest project I'm working on, and so far we don't
have any publication -- so the problem is how do you help users, who can be
doctors, nurses, or administrators, to enter clinical notes.
So imagine a setting where a patient goes to the doctor and then, you
know, they have a discussion, you know, what's your problem, what are your
symptoms, or with a nurse.
So there are two extreme ways to record such data. The first way is to just say I
will record everything in text. So I'll just type everything as text, or I use a tape
recorder, you know, a [inaudible] recorder, and then, you know, I'll use a system
to transform this to text.
And the other extreme is to say I will go all structured, which means that
whenever I want to enter, let's say, a disease, I will have to navigate the ontology,
so maybe I will say issues, and then I will say bronchitis, and then I will go to
asthmatic bronchitis and then I will say, check, this is the concept. And then for the
medication, I will go through the whole list of medications and see which one it is, how
many milligrams.
So the good thing is, as I mentioned before, because we have such rich
ontologies, the concepts can be found somewhere. But it's [inaudible] for the
user, for every concept, to try to find it.
And actually I have talked to a person who is in the IT department of a
military hospital, and they told me that personnel there spend hours every day
recording the data in a structured way, because in the military hospitals they have
a requirement to put more structure in their notes than if you go to, let's say, a
private doctor.
So the question is how can you bridge the gap between these two. So we want
to make it easy to enter notes but at the same time also have some structure, so
it's not completely unstructured.
So some tools you can use for that: first of all, often we have some
clinical rules that say, for instance, if cognition, which can be [inaudible],
contains dementia or dizziness, and home meds contains psychotropic, then
inquire about falls. So using these rules can kind of guide the data entry,
because if the first two conditions, let's say, are true, then maybe a prompt will pop
up about getting information about this one; or maybe if the first condition is
true, then the system may suggest that you record some information
about home meds so that this rule can be evaluated.
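To make the idea concrete, a rule like the one quoted could be encoded roughly as follows; the field names and concept strings are made up for illustration, not taken from any real system.

```python
# Each rule pairs trigger conditions over the structured note with a prompt.
# "if" means: for every listed field, at least one of the listed concepts is present.
RULES = [
    {"if": {"cognition": {"dementia", "dizziness"},
            "home_meds": {"psychotropic"}},
     "then": "Inquire about falls"},
]

def pending_prompts(note):
    """Return prompts whose trigger fields all match the partially entered note.
    `note` maps a field name to the set of concepts recorded so far."""
    return [rule["then"] for rule in RULES
            if all(note.get(field, set()) & concepts
                   for field, concepts in rule["if"].items())]

# Once dementia and a psychotropic medication are recorded, the falls prompt fires.
print(pending_prompts({"cognition": {"dementia"}, "home_meds": {"psychotropic"}}))
```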
Another direction is to use dynamic entry forms, because, as we see here -- so this
is a screen shot from [inaudible] used by the VA hospitals, and actually this year they
are in the process of upgrading, so this is what has been used until now, but after a
few months there will probably be a new version.
So if you want to enter clinical notes, what you can do is -- on the left
here there is a huge list of templates. It has thousands of templates, because
everybody can build a template. A template is, you know, a set of fields.
And then you select -- you have to find which is the right template. Let's say you
have a patient who comes in and has trouble sleeping, let's say, and you want to
record the sleeping problems. So you have to look over these thousands of
templates and find the right template, and this is how a template looks.
So it has maybe some check boxes or some text boxes and so on. So then you
fill out this template, and the way it works now in the VA is that once you save the
template, it is saved as text. So this is just used to help you enter the data, and
then when you click okay it will be transferred to kind of a text file and it will be
stored in the patient's record as text.
So the one challenge here is how can you allow the user to enter data without
having to search through thousands of templates. So how can you know what
template the user needs. And you can personalize templates or, you know, learn
templates.
Another thing we're working on currently with one of my students is -- let's say,
suppose that you have a text editor and the user is typing a clinical note; how does
the text editor kind of try to guess what is the structure you are trying to enter?
For instance, if you say this is a 75-year-old, then maybe the interface will say
something like age equals 75, do you accept this, and you click yes and then
this gets stored as structured data, so it kind of interactively adds structure to the text.
By the way, there are some tools -- one of them is MetaMap, which is
maintained by the NIH, and then there's cTAKES, which is from the Mayo Clinic --
and what these tools do is that you can input a whole text document, and they will
tell you which of the concepts in the dictionary are related to this document.
So they will kind of do information extraction on the text document
given the dictionary.
But, of course, this is kind of offline, so we want to do something interactive as the
user is typing.
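As a toy illustration of the interactive version (this is not the MetaMap or cTAKES API; the dictionary entries and identifiers below are placeholders), a longest-match lookup against a concept dictionary could run each time the user pauses typing:

```python
# Placeholder concept dictionary; the identifiers are invented, not real codes.
CONCEPTS = {
    "asthmatic bronchitis": "CONCEPT-001",
    "bronchitis": "CONCEPT-002",
    "dementia": "CONCEPT-003",
}

def suggest_concept(text, max_words=4):
    """Return the longest dictionary phrase ending at the cursor, if any,
    together with its identifier, so the editor can offer it as structured data."""
    words = text.lower().split()
    for n in range(min(max_words, len(words)), 0, -1):   # prefer longer matches
        phrase = " ".join(words[-n:])
        if phrase in CONCEPTS:
            return phrase, CONCEPTS[phrase]
    return None

print(suggest_concept("75-year-old patient with asthmatic bronchitis"))
```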
And, of course, NLP is also a very important tool here. One challenge
is that medical language is a little different from common language, because
there are many shortcuts, many, you know -- sentences don't have verbs, so there
are many things that are unique.
Yes?
>>: [inaudible]
>> Vagelis Hristidis: Yeah.
>>: [inaudible]
>> Vagelis Hristidis: Yes. That's definitely -- that's a good point. Currently it's
not supported by, you know, the existing system, but this shows you that even
the low-hanging fruit, they have not implemented. So, you know, the VA
health record is supposed to be one of the most advanced out there.
I mean, if you go to private doctors, as you know, the interfaces are even more
old-fashioned. So your point is good. There's a lot of opportunity to improve
this.
>>: You mentioned somewhere that you wanted, while the text is being entered,
online characterization into structured data.
>> Vagelis Hristidis: Yeah.
>>: Is that on-the-fly, or is [inaudible] requirement a stringent requirement? For
example, like if I am -- I'm going to a doctor. The nurse types in the message in
text. Once I go back home, [inaudible] takes the text, analyzes it, and updates the
database?
>> Vagelis Hristidis: Yeah, but the problem with that is that these tools are not
100 percent correct, right? They make mistakes. So that's why, if you do it
interactively, you know, the user can confirm that these are the correct
ways of extracting the data, because there's no perfect tool that, given the
text, will find the perfect concepts.
Okay. So the second direction I'm working on is querying health data.
And, also, I'm very interested in user interfaces for querying health data. So not
only -- I mean, traditionally we started with ranking, but there are also many other
things that, you know, need to be taken care of.
So in general -- here I just put a few bullets on, in general, what kinds of things are
important for the user experience when the user is searching. So ranking,
obviously, is one important thing. But it has been shown that it's not the only
important thing.
So the other important things are how do you formulate a query -- how to help the
user formulate a query, for example, do some [inaudible] -- and how do you present
the results: do you present the results like in Google or Bing, you know, one after
the other, or do you also do some grouping, some graphics?
Also, you know, how do you handle user feedback? If the user clicks on a result,
do you want to give more relevant results, or maybe you want to personalize to
the user or, you know, suggest query formulations.
So specifically for health data, you know, all these questions are open. So, for
instance, what is a good answer is one big challenge. Because suppose you're a
doctor and you type something like breast cancer complications, for instance.
What are the semantics of the answer? Are you looking for your own patients?
Are you looking for patients of the hospital? Are you looking for literature?
And if you're looking for your patients, what are you looking for? Are you looking
for the names? Are you looking for the part of the record that talks about breast
cancer complications?
You know, it's not clear what a good answer is. And also, how can you use
maybe the context of what you're doing to decide what's a good answer?
For instance, if a patient is visiting and you have the file of the patient open,
then probably something related to this patient, so the patient should become the
context [inaudible] that question.
Then -- by the way, there is some work called [inaudible] where, if you're looking
at a patient record, it will find some literature that is related to the patient's record.
And there are some pretty simple ways that people have been using. For
instance, you extract some keywords from the patient record and then you
submit them to PubMed and see what is related. So there's nothing very fancy
that people have done there.
Also -- I'll also mention granularity: do you want to show the whole record of the
patient, or do you want to see what's specific to the query? [inaudible] ranking is
also -- what are the semantics of ranking for health data? Do you want to
rank patients by how serious they are, by time, by location?
Also, what [inaudible] conditions do you display? Suppose -- let's say that what I
envision is, let's say, one single search box, and then you can search everything,
from literature to your patients to other patients to studies to experiments. So then
you can imagine you have facet conditions. You say I want to see literature, I want
to see my patients, I want to see, you know, studies about medications, and so on.
So you have facets, which can be fixed or can be dynamic based on the query.
The user interface, you know, of course is very important. So the first question is:
is the web search interface a good interface, with the text box and then the list of
results, or is there some other user interface that would be more appropriate?
Also, you have personalization, and not only personalization on a
personal level but also on, let's say, a stakeholder-type level. So you have
patients, doctors, administrators who search the same data, but they're searching
from a different perspective. So how do you achieve that?
>>: I have a question here. So this looks a lot like [inaudible]?
>> Vagelis Hristidis: Well, okay. One unique thing, I think, is that you have these
dictionaries here, which offer -- it's a kind of unique input. That's one
thing.
Now, the other unique thing is that people have been working for decades on
enterprise search, whereas this one is, you know, a much newer area -- newer for,
you know, many reasons -- and people think there hasn't been the right amount of
effort to build these things.
And I guess the semantics of the queries are different. I mean, of course, also in
enterprise search you can define different types of semantics, but -- hmm.
But, yeah, I think that there are many semantics here, many types of queries, that
have not been addressed. Like, for instance, there are different settings of a query.
I'm a doctor, I have a patient in the office, I do a query -- that's one setting. And then
another setting is that I go home at night and I want to see a summary of my
patients -- I mean, all of them. You can say, you know, there has been some related
work, but then they also are, you know, unique.
And, by the way -- so this is, again, the VA EHR system; I'm just showing this
to show you what is currently available in terms of querying.
So, actually, I'm probably not showing you the right screen, but the only way that
you can query in this system is by querying on the patient name. So you just say
I want to open the file of this patient, then you put the patient name or the patient ID
and then you get -- let's say this is the file of the patient. There's no other search
functionality.
And this is not only the VA. Most of the health records don't have any search
functionality, which, you know, you would think is surprising, but, you know, maybe
people have not agreed on what is useful to search. Maybe searching would
make things more complicated and confuse users. So there are many reasons
why there is no search. But, on the other hand, there is also some research that
says that the users would like to have search.
So now I'll present very briefly a couple of, you know, previous works we
had on searching health data. So the first is about how to do keyword search on
XML where you are also aware of the health ontologies.
So the idea is that suppose you have a health record which is in XML, so it can be
viewed as a tree, and then you also have some dictionaries. So suppose that, let's
say, one of the query keywords matches the dictionary but doesn't match the
health record itself. How can you still, you know, say that maybe this record is
relevant? Because, for instance, the query says asthma, the record [inaudible] has
bronchitis, and I know that these two are related through some path here. So this is
one of the works we have done.
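A minimal sketch of the relatedness check behind that example, assuming the ontology is available as a plain concept graph; the scoring discount mentioned in the comment is an illustrative choice, not the paper's formula.

```python
from collections import deque

def ontology_distance(graph, source, target, max_hops=3):
    """Shortest number of edges between two concepts in a concept graph
    (dict: concept -> iterable of related concepts), or None if they are
    farther apart than max_hops."""
    if source == target:
        return 0
    seen, frontier = {source}, deque([(source, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if hops >= max_hops:
            continue
        for neighbor in graph.get(node, ()):
            if neighbor == target:
                return hops + 1
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, hops + 1))
    return None

# A record mentioning "bronchitis" can still count for the query "asthma" if the
# two concepts are connected; e.g. discount the match by 1 / (1 + distance).
toy_graph = {"asthma": {"asthmatic bronchitis"}, "asthmatic bronchitis": {"bronchitis"}}
print(ontology_distance(toy_graph, "asthma", "bronchitis"))   # prints 2
```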
Another work is on how you navigate the results of a search on PubMed. So,
as I said, the interesting thing about PubMed, which is kind of unique, is that every
publication is annotated manually with a set of about ten concepts, and the
concepts come from a hierarchy of concepts, the MeSH hierarchy.
So what you can do is -- one way to display the results is to say that I
will organize the results on the tree. So, for instance, this cell physiology 161
means that 161 of the results have been annotated with this concept.
So the problem with doing that is that if you have thousands of results, the tree
can be very big and have thousands of nodes, because every paper has maybe
about ten annotations. So then even displaying the tree may not be very useful,
because the tree can be as big as the list of results.
So then what we did is we have some algorithms so that you can navigate the
tree in a more efficient way, kind of skipping some of the levels depending on
some assumptions about what is useful [inaudible] to the user, and we do
some jumping of levels such that we minimize the expected navigation
time. So we have a cost model -- here at Microsoft you have also developed, you
know, cost models -- saying that when I read something, it has cost 1, when I click
something, it has cost 1, and then given this cost model, [inaudible] the tree.
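A toy version of that cost model, just to show the shape of the computation; the recursion, the uniform costs, and the assumption that the user drills down in proportion to result counts are all simplifications of what the actual work does.

```python
def expected_navigation_cost(node, read_cost=1.0, click_cost=1.0):
    """Expected cost to reach a result under `node`: the user reads every child
    label shown (one read each), then clicks into a child (one click) chosen
    with probability proportional to how many results it covers, and recurses.
    `node` is a dict with "count" (results underneath) and "children"."""
    children = node.get("children", [])
    if not children:
        return 0.0
    total = sum(child["count"] for child in children)
    cost = read_cost * len(children)
    for child in children:
        p = child["count"] / total
        cost += p * (click_cost + expected_navigation_cost(child, read_cost, click_cost))
    return cost

# Skipping a level that has a single child saves that child's read and click,
# which is the kind of saving the level-jumping algorithms try to maximize.
tree = {"count": 161, "children": [
    {"count": 161, "children": [{"count": 100, "children": []},
                                {"count": 61, "children": []}]}]}
print(expected_navigation_cost(tree))
```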
And, finally, the third direction I'm working on -- again, this one is very new
and we don't have any publications, and actually I'm working on
this with some people from the nursing school.
So the problem is the following. I'm not sure exactly what the technical
challenges are, but this is an interesting problem.
So it's [inaudible] the following. For older people who are staying at home and have
home care, what happens is that there are agencies, state agencies, that send
people called case managers to visit these older people once a month to see,
you know, if they need some help.
And this is mostly about the people who -- it's most important for people who
don't have families or who are very low-income and cannot afford their own
care, so the state takes care of them.
So what happens is the case manager goes there once in a while, and the
case manager has a form, and the form is maybe seven pages and has
check boxes, you know, the house is clean, the person has a broken
something, there is no food in the patient's fridge, and these kinds of things. And
also, you know, what kind of medication the patient is taking.
And then the case managers take the forms back to their institute of
home care, which is usually nonprofit, you know, supported by the state. And
then, also, the patients at some point go to the doctor, and because the patients,
you know, are very old, many times they have dementia and they cannot, you know,
communicate very well with the doctor.
So the problem is that at this point there is no communication between the doctor
and the case manager. Because at best what happens is that the case manager
will fill out a form and then maybe will fax it to the doctor's office, and then this
form will be, you know, buried somewhere in the doctor's office and the doctor
will never see the form. So then the doctor cannot prescribe the right medication,
because he or she doesn't have the communication with the case manager who
knows the patient.
So there is kind of a broken communication -- and this slide kind of shows what's
happening. So the case manager tries to contact the doctor through the nurse or
voicemail or fax, and then the patient goes to the doctor and the doctor usually
doesn't have this information.
So the idea is how can technology help in making this communication better by
building, let's say, some central portal that, you know, all case managers,
patients, and physicians can access. So maybe one day, let's say, when the patient
goes to the doctor, the doctor can just look into a website and see what
the case manager has said about the patient.
So now, what are the technical challenges? The technical challenges are how do
you make the user interface easy so that the doctor can just, you know, see on one
screen a summary of what's important for the patient. How can the case
manager -- again, how can you make it easy for case managers to update the
information about the patient without having to go through many pages.
You know, there has been some work recently on how to do [inaudible], like how
do you order the set of questions of user surveys so that the most important ones go
first and then the ones that can kind of be inferred from other questions do not have
to be asked. So these kinds of things. So you want to minimize the time of the users.
That's the purpose of this work: how do you build a portal such that it minimizes the
effort from all parties.
Also, how do you decide if an alert should be submitted to any of them, so that
you don't get those -- you know, there's a big issue with [inaudible] systems
that, you know, you don't want to alert too much. You don't always pop up windows
or send emails with alerts, so you minimize alerts.
And, finally, which kind of led to the question we had before, some of the
properties of shared data which are and are not unique -- if you take
maybe each of them separately, maybe it's not unique, but if you take them all
together, maybe they become something more unique.
So you have [inaudible] issues, you have missing values, dirty data, you have
a mix of text and structure, you have a lot of shortcuts, like [inaudible]. You have
a lot of negated phrases. So, you know, the doctor might say the patient does
not have diabetes, and it is very common practice to explicitly say what the patient
does not have, which means a simple keyword search may fail.
Time stamps are important. They're everywhere. So you have this question of,
you know, how do you handle time.
And then this concludes my talk. So I would like to thank my students, and this is
where you can find more information.
[applause].
>> Tao Cheng: Questions?
>>: So you're building tools in this medical area. This is a very practical,
hands-on activity. If it's something you just sort of [inaudible] and develop on
your own [inaudible] hope for the best, that's not going to work.
>> Vagelis Hristidis: Yes. So actually that's the big challenge in working in this
area: you have to work closely with collaborators from these areas. So,
specifically, I have a very good collaborator from the VA, and he's a medical
doctor doing medical informatics at the VA. And also I have some other
contacts, so I try to meet with them to see, you know, to do something
that has a chance that they will use eventually.
So this is not the kind of research you can do in your lab with your students. And
that's actually one main challenge of doing this, because you depend on other
people. So you cannot say, you know, I will work hard and I will, you know, make
the deadline and submit a paper, because maybe you are waiting for the user survey
and the user survey takes months to --
>>: [inaudible] usability testing, which is also a big piece of this. These computer
science students, are they okay with that? That's pretty much an essential
ingredient [inaudible].
>> Vagelis Hristidis: You mean usability -- you mean if usability is part of
computer science or not?
>>: Well, just surveying. It's very labor intensive. Whether these are HCI
students that are happy to do it or whether they're database students, this is
tedious --
>> Vagelis Hristidis: Well, I -- I think that the students like to get out of their strict
area and do something else, because I think it's interesting for
students to, you know, not have only one focus but to get some
experience from other areas.
>>: So you haven't had any resistance?
>> Vagelis Hristidis: No, no. I mean, I don't have any resistance from students.
The only challenge is to get time from doctors, because they're very busy. It's
hard to convince them that your collaboration with them is going to bring them
something positive for them.
>>: Well, there's that. If you're doing these sort of interfaces, you're working with
nurses, and my impression is they're all grossly overworked.
>> Vagelis Hristidis: Yeah.
>>: You're just adding [inaudible] getting research done; they're more oriented
toward getting their work done so they can go home at a reasonable time.
>> Vagelis Hristidis: Yeah. And I'm working with a nurse who's -- she's an
assistant professor. So she's motivated to work until she gets it done.
>>: Works best.
>> Vagelis Hristidis: Yes.
>>: Thanks.
>> Vagelis Hristidis: Thanks.
>>: All right. [inaudible] setting up all these user surveys and seeing [inaudible].
>> Vagelis Hristidis: That's right. Yes.
>> Tao Cheng: Thanks.
[applause]