>> Susan Dumais: So good afternoon, folks. It's my pleasure to welcome Mykola Pechenizkiy,
who is at the University of Eindhoven, where he's been for, what, five or six years? Seven, okay.
And he's done a lot of work in the area of social media analysis, the area of information
dynamics, how information changes over time, and in fact, he's going to talk about that today.
But a lot of this work strikes me as a leading indicator of what we see now in data science, where
you have interesting real-world problems, large-scale data sets, and thinking about new ways of
capturing some of that at scale. So it's my pleasure to introduce him, and today he's going to talk
about how to handle concept drift in predictive modeling, because the world is not stationary and
IID, and so it's an absolutely important question that a lot of models overlook, so it's fun to hear
about it. Thanks.
>> Mykola Pechenizkiy: Thanks. Good afternoon, everyone. It's my pleasure and an honor to be
here, and thanks, Susan, for the kind introduction and for giving me the chance to give this talk.
So today, I will talk about how to get ready for change, and I provided a subtitle for the talk for
disambiguation, just to make sure that this talk is not about climate change, not about global
warming and not about regional cooling -- hence a couple of pictures from Dallas a few days
back, but probably you saw more on TV.
So I'm going to talk about predictive modeling, and many of you know that making predictions is
a hard job, quite a few wise and very experienced people, during their career, made quite a few
predictions. Not all of them were correct, so you might recognize some of the famous quotes
about what is the potential of new technology, if it's going to take off and so forth, but these are
different kinds of predictions. So these are based primarily on the expertise and intuition of
experts or gurus, so I am going to talk instead about data-driven analytics and the problem of concept drift.
So the outline for the talk is very straightforward, very simple. I'm going to explain what I mean
by predictive analytics and why it is important to consider concept drift, and what the typical
approaches to handling concept drift are. After that, I will try to emphasize that, from the
applications' perspective, it's interesting to see how many challenges haven't been addressed yet
and what kind of interesting opportunities for further research this provides. And, at the end, I
will make my own forecast, or prediction, of how the field is likely to develop further.
So by predictive analytics, I just take a reductionist approach to say, well, essentially, it's about
data mining and about knowledge discovery. We have a possibility to collect large volumes of
data, process this data to get some useful predictive patterns, models, summaries, which can give
us an insight into phenomena, get us some actionable patterns and things like that. So the most
basic supervised learning may be presented in the following way. We have some historical data that
tell us how our target population behaves. We can induce a model from this historical data, such
that given previously unseen instances, we can generate a label. Like, for instance, we want to
build a classifier to determine which of the e-mails are important and which are not. So based on
the previously collected e-mails and knowing which of those were relevant and which were not,
we build a model, so next time, when you receive a message about a talk like this one, you can be
sure you'll not miss it. And then you can think about various types of predictive modeling tasks,
so whether it be classification, determining the relevance of an e-mail or finding out who the user
is, performing user profiling and understanding whether this is a novice user or an expert, what
kind of information need the user has and so forth. So we can think about various types of
regression tasks, for instance, making predictions of how you would score a talk like this one,
or performing ranking in case of search engines, in case of recommendation systems -- or
perform time series prediction. For instance, when we want to estimate the popularity of our
website, when we want to predict the amount of traffic we can attract or the click-through rate
on a particular banner or news item and so forth. So there are lots of approaches for predictive
modeling for each of these tasks. In this talk, I will focus primarily on classification examples.
However, most of the discussion we will have can be related to other predictive modeling tasks,
as well, including ranking and prediction and so forth. So if we think about a geometrical
representation, you can consider a number of instances which are represented in some
multidimensional space. They have labels, which tell us to which class they belong. And there is
a true decision boundary, which allows us to discriminate instances of one class from the other
class. So if you ever tried to build a predictive model in practice, you're
familiar with lots of major pitfalls, like we need to have clean input, or at least somehow
reasonable input to build a good model. We need to come up with representative features. We
need to come up with the right problem formulation, such that we optimize for the right things,
avoid overfitting and avoid false predictors, think carefully about operational settings, what is
known to us when, what kind of attributes, descriptive, predictive, which attributes we can
manipulate and so on. However, I would like to focus on just one single aspect related to the
problem of concept drift, which, in my opinion, is understudied in different communities and even
more so in practice. So think about the following problem: we want to find out whether
a particular antibiotic is going to be effective or not against a certain type of pathogen. So we
have historical data about patients, about their demographics, hospitalization data, and we know
for which patients some antibiotics were effective or not effective with respect to certain
pathogens. So we can use our favorite classifier to determine whether certain types of antibiotics
will be effective or not, given a new patient. However, the problem is that antibiotics -- excuse
me. The problem is that different pathogens can develop resistance to antibiotics over time. So
a classifier which was effective in the past over time may become ineffective, and this may
happen due to various reasons. So the following cartoon gives the main intuition. So one of the
pathogens can develop resistance, and once this happens, there are actually different ways this
information can be communicated to other pathogens. So a piece of information can be given to
another pathogen, or there could be a case of so-called microbial sex or other types of
relationships that can spread the information about resistance and make the pool of antibiotics
currently used at the hospital ineffective. So the goal is to find out when such things
happen and find out how the model should be adapted or adjusted to keep it up to date and
useful. So when we think about streaming data, we can think about different types of situations or
different reasons why things may change over time. So I think about four different characteristic
examples. So we can think about changes in personal interests, well, like think about
recommendation systems or search systems. When we try to figure out what people are
interested in, we build a user model, build a user profile, of short-term, long-term interests,
current interests, and because many of those are not stable, we need to monitor for those
changes. Think about e-mail classification example, when we want to identify which e-mails are
relevant, which e-mails are spam. We can build a classifier which is accurate at the current
moment in time -- however, some of those spammers unfortunately are quite smart, so they try to
bypass the spam filters, and they develop newer and newer strategies for how to generate e-mails.
So we need to deal with these adversarial activities, again, recognize them explicitly through the
analysis of the input stream or by monitoring the performance of our classifiers to identify these
changes and adjust to them.
So you can also think about changes in the population as such: think about an economic crisis
that changes the profiles of people who apply for credit. So in credit scoring applications, again,
we would need to account for such global changes in the models. So, finally, you can think
about complex environments where we can explicitly monitor lots and lots of pieces of
information. However, it is very -- well, it's intractable to model all possible combinations of
factors, and we just need to reduce the model only to those few elements which are critical for
our application. So many of you might be familiar with the DARPA challenge for driverless
cars, where the winning team was describing how they used adaptive learning strategies to analyze
whether a surface is drivable or not drivable, whether it's a paved road or an unpaved road under
different lighting conditions and things like that -- so simple adaptive learning strategies can do the
job.
So many of you are also familiar with the Netflix competition, where, again, lots of different
ideas were proposed, different factors affecting the performance of predictive models were
studied, and part of it was related to the temporal dynamics of ratings: how items become
more popular over time, how, for instance, item-side effects and user-side effects
can be observed with respect to rating scale, with respect to user interests, with respect to
popularity of the items, seasonal effects and so forth. In the machine-learning and data-mining
communities, people sometimes emphasize the difference between real concept
drift and virtual concept drift. So think about our original example, where we have instances
belonging to two different classes and we have a decision boundary. So we talk about real
concept drift when this true decision boundary changes. And then we need to update our
classifiers to learn this new boundary. However, there could also be different reasons for virtual
drift, when the true decision boundary remains the same but the distribution of the data changes,
and it still affects the performance of classifiers. So, as discussed, these different cases
can be decomposed even further. However, from a predictive modeling point of view, as soon as
changes in the data, or in the labels, so in the decision boundary, affect our prediction decisions, we
need to detect this and decide what to do next. So if you go back to the basic supervised learning
idea under concept drift, so now we need to question whether the population, about which we
had some historical data and learned our model, is still the same or different. And, if it's
different, what shall we do with our model? How are we going to update it?
So we can think about two major approaches: how to manipulate the training data, how to select
the most representative -- for instance, the most recent -- instances, and how to perform the model
update or relearning of the models, whether it's a single model or an ensemble of multiple
models. So we've been looking at different strategies and some general high-level
framework which would allow us to describe major types of approaches for handling concept drift,
and then it can be represented with the following simplified view. So we have input data. From
this data, we learn the model and we cast predictions. Besides that, we accumulate some
relevance feedback about the performance of our models, we monitor our input data, we monitor
how well our classifiers perform, and if we detect changes, we can manifest them to our data
management system or learning subsystem such that it becomes more up to date and can
generate more accurate predictions. We can also alarm the identified changes either to the
domain expert or to other sub-modules of a system. So, consequently, taken from this
perspective, you can think about various ways to characterize the major approaches for
handling concept drift with respect to memory management, different types of change detection
techniques, how to characterize the properties of learning techniques, and then how we perform
the monitoring and evaluation of these learning approaches.
So, when I talk about memory management, you can think again about short-term memory, when
we have a small portion of data that we keep in main memory and can use for building a
model, or you can also think about long-term memory, which is captured in the models
themselves. And effectively, you can think about various strategies: how to maintain a
training window of fixed or variable size, how to introduce some forgetting mechanisms such
that we, for instance, down-weight the importance of instances which were observed far in the
past and boost the importance of the most recent stuff.
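To make this concrete, here is a minimal Python sketch of both ideas -- a fixed-size sliding window as short-term memory, and exponential fading as a forgetting mechanism that down-weights old instances. The window size and fading factor are illustrative choices, not values from the talk:

```python
from collections import deque

class SlidingWindow:
    """Short-term memory: keep only the most recent instances."""
    def __init__(self, size=1000):
        self.window = deque(maxlen=size)  # oldest instances fall out automatically

    def add(self, x, y):
        self.window.append((x, y))

    def training_data(self):
        return list(self.window)

def fading_weights(n, fading_factor=0.99):
    """Exponential forgetting: the newest instance gets weight 1.0,
    and each older instance is down-weighted by fading_factor per step."""
    return [fading_factor ** (n - 1 - i) for i in range(n)]

# Usage: retrain on the window, weighting recent instances more heavily.
window = SlidingWindow(size=5)
for t in range(8):
    window.add([t], t % 2)
data = window.training_data()        # only the 5 most recent instances remain
weights = fading_weights(len(data))  # [~0.96, ..., 1.0]
```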
>>: So you're talking about it as if you get the same treatment for every, say, query or every
instance. Do you look at all at models that may have different, say, forgetting functions for
different queries or different subsets of the data?
>> Mykola Pechenizkiy: So this is a very good question. I'll try to come back to it a bit later
during this talk. So the mainstream of concept drift research assumes that we deal with a single
object. So, for instance, we have a time-series data that tell us something about an industrial
process, or we talk about an individual user and we monitor interests of this user. So think about
a single classifier or a single learning mechanism. But you're absolutely right. So in many cases,
we have a multitude of objects that we trace over time, and it makes perfect sense on the one
hand to build different models, and on the other hand try to learn from multiple objects such that
they contribute to each other. However, in the most basic settings, it is considered to be all
uniform, so we have similar kinds of data about similar instances. So, with respect to change
detection, again, you can think about various types of approaches, like what can be monitored
and what kind of analysis we can perform: for instance, sequential analysis like the cumulative
sum test, control charts, monitoring of two distributions, using some contextual monitoring and
the like.
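As one concrete instance of such sequential analysis, here is a minimal one-sided cumulative sum (CUSUM) sketch over a stream of monitored values; the slack and the alarm threshold are illustrative tuning parameters:

```python
def cusum(stream, target=0.0, slack=0.5, threshold=5.0):
    """One-sided cumulative sum test: accumulate deviations above
    target + slack and alarm when the sum exceeds the threshold."""
    s = 0.0
    for t, x in enumerate(stream):
        s = max(0.0, s + (x - target - slack))
        if s > threshold:
            return t  # time step at which a change is signaled
    return None

# The monitored mean shifts upward at index 5; the alarm fires a few steps later.
stream = [0.1, -0.2, 0.0, 0.3, -0.1] + [2.0] * 10
print(cusum(stream))  # 8
```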
So from the point of view of the properties of the learning models, again, we can think about
various types of categories, for instance, whether they perform retraining of
the models from scratch or try to apply some online or incremental learning where our models
are informed about a detected change, or whether we perform evolving or blind adaptation, so we don't
know whether change happened or not, but we do anticipate, and therefore we update the
models. So, finally, whether we build individual models or try to maintain an ensemble of
individual predictors. So this can be discussed in a much more fine-grained way. However,
instead, I will try to focus on four major categories and I'll give an idea of how simple these
approaches could be. So we can think about four major types of approaches. So, in the first
case, think about the use or non-use of detection mechanisms which would signal information about
the change and trigger an update of the model, versus evolving approaches which just adapt
the models at every step, so we don't know whether a change happened or not, but because we know it
may happen, we perform the adaptation. And on the other side, we can think about individual
models or ensembles. In the first case, typically, they think about some reactive modeling with
some forgetting mechanism or an ensemble approach which tries to maintain some memory over
time. So, for instance, we can think about a simple forgetting mechanism where we have a fixed
sliding window. It moves over time, so our model is retrained or incrementally updated by
analyzing new instances and forgetting the most outdated ones. So it's a very simple approach, but it
can lead to some reasonable results. So this next one is based on explicit detection of a change.
So, again, we have training data and we have a change detection mechanism. Once we have detected
that there was a change, we disregard the data which is no longer relevant and build a model on
the new stuff. So, again, you can think about where we can hunt for useful information about the
change, whether it's in the input, in the model itself, in the output or in the analysis of the performance of
the models. So all of this can be useful for manifesting change. Most frequently, we analyze two
windows of some statistic, which can be computed on the input, on model parameters or on
outputs. We have a reference window, which we assume captures the current stable behavior of
the model, and we have the most recent window. We compare the two, and if there is an observed
difference between these two sets, we manifest a change, and different statistical tests can be
used for that.
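A minimal sketch of this two-window scheme, using a two-sample Kolmogorov-Smirnov test as the statistical test; the monitored statistic, the window sizes and the significance level are all illustrative assumptions:

```python
import random
from scipy.stats import ks_2samp

def change_detected(reference, recent, alpha=0.01):
    """Compare the reference window against the most recent window and
    manifest a change if the two samples differ significantly."""
    statistic, p_value = ks_2samp(reference, recent)
    return p_value < alpha

random.seed(0)
reference = [random.gauss(0.0, 1.0) for _ in range(500)]  # stable behavior
recent = [random.gauss(1.5, 1.0) for _ in range(500)]     # stream after a shift
print(change_detected(reference, recent))  # True: the distributions differ
```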
So we can also build an ensemble of classifiers to track how well different models perform. So
again, think about a timeline: we have multiple windows, and on these windows we build
classifiers. Every time we need to cast a new prediction, we ask every classifier to cast a
vote and apply some voting mechanism. So think about the initial situation: we have multiple
models and initial weights. They all cast a prediction. We know what the true label is, so
those models which were correct get a high weight, those which were wrong get a low weight, and
we keep doing this; this way, we always maintain a pool of classifiers and know
which of those are the most adequate, or the most accurate, on the most recent data.
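A minimal sketch of such weighted voting, in the spirit of weighted-majority schemes; the multiplicative penalty beta and the two toy models are illustrative:

```python
def weighted_vote(models, weights, x):
    """Every classifier casts a vote; votes are combined by current weights."""
    scores = {}
    for model, w in zip(models, weights):
        label = model(x)
        scores[label] = scores.get(label, 0.0) + w
    return max(scores, key=scores.get)

def update_weights(models, weights, x, y_true, beta=0.5):
    """Once the true label is known, down-weight the models that were wrong."""
    return [w * (1.0 if model(x) == y_true else beta)
            for model, w in zip(models, weights)]

# Two toy "classifiers" trained on different windows of the stream.
old_model = lambda x: 0            # learned the outdated concept
new_model = lambda x: 1            # learned the current concept
models, weights = [old_model, new_model], [1.0, 1.0]
for _ in range(3):                 # after the drift, the true label is always 1
    weights = update_weights(models, weights, x=None, y_true=1)
print(weights)                     # [0.125, 1.0]: the pool tracks recent data
print(weighted_vote(models, weights, x=None))  # 1
```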
So, finally, we can think about some contextual approaches, where we try to identify regions in the data, and for
each of the regions we find out the most suitable classifier, which has the highest accuracy or the
highest generalization performance on that piece of data. And whenever we have a new instance
to classify, we check in the instance space which other instances are similar to it and apply the
corresponding classifiers which are expected to perform best in that neighborhood. So given
these different types of techniques, we also can think, well, which of those are most effective or
which of those would be useful in which of the settings? And we can also think about different
kinds of change. Is it gradual change, sudden change or recurrent change? Is it expected
somehow, or is it predictable, or is it completely unexpected? And, consequently, in which cases
we can be reactive or where is there room for being proactive, anticipating change, making
predictions about it and making models ready to use. So if we look into these four major groups
of strategies for handling concept drift: whenever we deal with sudden drift, reactive models
based on explicit change detection and some forgetting mechanism would be doing just fine, and
similar approaches can deal with gradual drift; for recurrent drift, we need to have some form of
meta-learning and some form of context awareness to capture the recurrence of particular
concepts. So, given this overview, of course,
there is also the question of how well these different approaches work in practice, why there are
so many of them, and, if there is no one best approach for all situations, how we are going to choose
the most appropriate one for a particular application in mind. So many of you might be familiar with
the cross-industry standard process for data mining (CRISP-DM), which was proposed some years ago
and was appreciated by industry because it specified the different steps at the process level,
starting from business understanding, data understanding and data preparation, to data mining,
evaluation and deployment, and each of the steps was described with respect to operational
settings: what are the useful things to take into consideration, what are the available techniques
to be used and so forth. So now, if you look into streaming settings, effectively, all of these feedback loops,
which in the past were assumed to be performed by domain experts. So the domain experts would
see how well a particular model performs and then try to fine-tune the parameters, or
reconsider what kind of technique should be used, or reconsider what would be a good representation
and so forth. So now, in streaming settings, we try to automate all of these feedback loops as
much as possible. So we monitor the performance of classification models or predictive models,
and whenever something goes wrong, we need to take a decision on how to update the model, how
to change the representation of the data and so on.
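Schematically, such an automated feedback loop might look like the following sketch, where `train`, `detector` and the batch source stand in for whatever learner, change detector and data stream a given application provides:

```python
def monitoring_loop(batches, train, detector):
    """Automate the feedback loop: cast predictions, monitor the error
    signal, and rebuild the model on recent data when the detector fires."""
    kept = []       # data-management subsystem: batches kept for training
    model = None
    for X, y in batches:
        if model is not None:
            errors = [int(model(x) != label) for x, label in zip(X, y)]
            if detector(errors):  # change manifested: alarm and adapt
                kept = []         # disregard data that is no longer relevant
                model = None
        kept.append((X, y))
        if model is None:         # (re)learn from the data we kept
            xs = [x for bX, _ in kept for x in bX]
            ys = [label for _, bY in kept for label in bY]
            model = train(xs, ys)
    return model

# Example detector: alarm when the batch error rate exceeds a threshold.
detector = lambda errors: sum(errors) / len(errors) > 0.3
```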
So we were curious to see what people think about handling concept drift in practice, and
we did an extensive literature analysis to see how well different techniques work in practice, or
at least how different researchers believe they work in practice. And we did also quite a few
case studies ourselves, and we were surprised to see that some of the popular approaches might
work or might not work, and then we were trying to figure out what was the reason, and then
which of those techniques are in general applicable or not applicable for particular operational
settings. And we realized that, in the area of concept drift research, the situation was quite
similar to early days of machine learning and data mining. So many researchers say temporal
dynamics is important, concept drift is important, but then real data is hard to get because of
privacy concerns or some proprietary reasons or because there is too much overhead to work
with real data, or sometimes people come up with a too generic or too abstract solution, so it's
not ready to be used in real applications. So what we often see is that people play with
benchmarks or artificial data sets and then try to generalize from those experiments. However,
because in data mining many journals, conferences and reviewers argue that, well, this is an
application field, so you need to show the relevance to practice, you need to show that your
methods do work in practice, many people try to invent applications and show that the ideas
do work in practice. This is something I relate to the problem of green aliens, and I
will say more about this in a minute. So in the case of concept drift research, at some point, it was really
close to some extreme, so people were playing with artificial data, there were a couple of
benchmarks, extremely popular, with a particular type of concept drift simulated, and in many,
many cases, people were often trying to use standard UCI benchmark data sets, manipulate them
in a certain way such that concept drift would be introduced, and then it would be captured by
certain techniques. And, in my opinion, this is really, really close to the situation described in
a paper that I really like, called Novel Efficient Automated Robust Eye Detection System for
Green Aliens, which describes the situation in the field quite well. It is a sarcastic short paper
and a very quick read, and I strongly recommend looking at it.
So this triggered us to rethink once again what kind of methods we have and in what kind of
application settings they can be used and what is the variety of application settings where
concept drift matters and we actually need different types of strategies to handle it. So we did a
very simple thing. We tried to identify different dimensions along which we can categorize or
characterize various types of applications. For instance, with respect to what types of data have
been used, what kind of problem formulation we have, what kind of changes we anticipate and
various types of operational settings we need to consider. So with respect to data, again: is it
time-series data, relational data or a mix of those? How is the data organized -- is it a high-speed
stream, or is the data coming in batches? Can we re-access the data? How do we deal with missing
data, and things like that. With respect to change: what kinds of change can we anticipate? Is it
just a single type, like only sudden or only gradual, or may different types of changes occur
simultaneously in the same application? What is the source of change? Is it about
adversary actions or changes in the population, or, again, could multiple reasons be present at the
same time? What are our expectations about change, and what are our expectations about the desired
actions if we know that a change appeared? Because, actually, again, it really matters what we
are optimizing for if we know what the goals are. And with respect to operational settings,
again, we need to be very careful in understanding whether the labels are available to us
immediately or with some delay, or whether they are never available. Are they available as ground
truth or as some proxies for ground truth, or can we just recompute from historical data, from
offline data, how likely it was that we were correct or not correct, and other things? So we
also analyzed various types of applications per industry, like in finance and banking, in security,
in e-learning, entertainment, search, recommender systems and things like that -- and we also tried to
categorize them by different types of applications, like whether it's about monitoring and control
or whether it's more about personalization tasks in search and recommendations, whether it's
more about management and planning, when we need to perform demand prediction tasks, or
some ubiquitous applications, when we have, for instance, location-based services or other things
which we integrated in other, larger systems. And then, actually, we analyzed these different
types of applications and saw what kind of drifts typically occur, how frequently we have or do
not have labels available, how hard they are to get, how objective or subjective they are and so on.
And we also started to look on different methods and what we have and what we don't. And
then, it's interesting to see that, if you look into mainstream research on handling concept drift,
these would be the most-assumed settings. So change is typically considered to be
unpredictable. Typically, it's sudden; if we observe multiple changes over time, they are
considered to be independent of each other, so there is no opportunity to learn from the
recurrence of changes. Typically we analyze only single objects, so we don't analyze multiple
objects. We assume no closed-loop control, so there is no effect or reinforcement on
the behavior of a particular system, even if it's an adaptive application, like a recommender
system or an information-retrieval system; again, there is no
assumption about biases in the data. And typically it is assumed that, given historical data, we
can replay it multiple times and fine-tune the parameters and get reasonable estimates how well a
particular technique would perform. And besides this, most of the approaches for
handling concept drift assume that the true labels are available right after casting the prediction,
which, as you can imagine, is rarely true in practice. So, in reality, if you think about a wide spectrum
of applications, changes often recur multiple times for a given object of interest, but they also
recur across multiple objects with related behavior. Typically, we need to monitor for different
types of changes. Beyond that, we can think about a multisensor environment or
multisensor data, where the same object is described by different feature subsets. If they
are analyzed independently, they give a somewhat unreliable signal. However, if we analyze them
together, they can help us to detect concept drift much better. In many cases, we have no
idea what the ground truth is, so we can only make some guess about it. In many cases, we know
quite a few bits about the process we model, and this background knowledge can be utilized to
build better predictive models, but also to have better ideas about what kinds of changes we
anticipate and how they can be detected.
>> Susan Dumais: Have you seen many examples where the ground truth changes over time, or
at least the labels change? I want to say the ground truth changes.
>> Mykola Pechenizkiy: Right, so there are a few aspects. So you can think about change of
label, simple change in interest, so assume you have a classifier which determines whether the
user is interested or not interested in a particular topic, like soccer. And over some period of
time, you observe that the person was interested, and then you recommend, you keep
recommending items about soccer, and the user doesn't click, so this is a change in ground truth,
in the sense that you know that the user had this interest. Now, the user changed the interest.
But what I also mean here is that in many cases we cannot make an association between the
labels and the ground truth, so we collect information -- like, for instance, we collect implicit
feedback, but we are not sure if this corresponds to actual interest of the user. Or, in many cases,
when we do the sensory data, so the goal is to reconstruct the signal, so we don't know what the
ground truth is, so we try to predict it. And then based on analysis of historical data over a long
window, we can reconstruct this ground truth with a certain degree of certainty, but we're still
not sure if it is ground truth or not.
And this is also quite important when we try to optimize the performance of techniques on
historical data: we can fine-tune the parameters to optimize for the labels, but effectively, those
labels can be quite noisy and we end up optimizing for the wrong thing. So, consequently, if you
look into the peculiarities of various types of applications, you can think about more advanced
approaches, both for reactive and proactive handling of concept drift. So whenever we think
about recurring changes, there is room for meta learning approaches, how to recognize similar
situations from the past, and then use more accurate detection and prediction. So whenever we
know that there are different, related sources of data, so we can learn from that external data,
which can be considered in centralized settings but also in distributed settings, and then we can
think about context-aware approaches, so I will try to give a few ideas of each of those. So
regarding the context, so think about the problem of outlier detection. So we have some
seasonality in the data, it's not sufficient to perform simple thresholding, so we need to have
some approach which will capture this seasonality. So, similarly, when we want to perform
change detection, we may want to figure out like what is normal behavior of a particular system,
so here is a particular example of sensory data. The goal is to reconstruct the true signal. This is
just a mass measurement of fuel in an industrial process. It's a very clean task, like it's very easy
to coordinate. So every moment in time, we need to say what is the actual value of this timeseries data, so we need to get rid of noise, we need to detect change points, determined with
these red circles, and we need to do this as quickly as possible such that we start relearning the
models again and again and again. So there are quite a few challenges to address here,
asymmetric outliers and also some distortions in different phases of the process, but there is more
to it. First of all, we see that there is a clear pattern -- we have a fuel feeding phase and a fuel
consumption phase, and they follow each other -- so we can try to see how we can get
information about this periodicity and make use of it.
Another interesting aspect here is that there are different factors which affect the behavior of
this time series. For instance, the fuel type would affect how many outliers we have and some
other properties of the signal. So if we know some of those factors and we can model them
explicitly, again, there is room for methods and approaches where we learn models for
different types of fuel, so that in online settings we can recognize the current quality or type of
fuel and apply the corresponding model, which would make better use of the data. So
a nice type of problem is demand prediction, like predicting how much traffic we can attract in a
certain type of day for a certain type of query and so on. So we studied this problem in case of
food wholesales prediction, which was important for the company for stock replenishment. So if
you underestimate the demand, there would be empty shelves. If you overestimate the demand,
there would be lots of perishable goods wasted. There are also interesting questions about how badly
you can do if you predict the demand too late or too early, and what kinds of costs are associated with that.
So this problem is difficult as such, because it's hard to predict the behavior of people, when they
want one or another type of product and in what quantities. So we formulated this as a time-series
prediction task, which was augmented with lots of information about weather, about the
calendar, about holidays, about promotions, so we were trying to make the predictions. And
then, actually, this project started very interestingly. So the company said that they were using
a simple moving average to cast the predictions, and you think like, oh, we will use any state-of-the-art
predictor and then we will do much better. And we picked a few dozen nicely
behaving time series, we applied some simple ensembles of regression models and we got
somewhat better results. At least it would beat the moving average that was used by 10% to 20%,
so we were quite encouraged. We talked to the company, showed the results. They gave us a
much larger pool of products. It was several hundred products, so we applied the same
model, we compared across multiple products. Our models were worse than moving averages.
And then, in fact, there is a good explanation for that. So many of these products are simply
unpredictable. So if they're just noise, it doesn't matter how hard you try to predict them. If
there is no signal, you cannot do much more than just smooth it out. So we reformulated the
problem into how to determine which of the products behave in a somewhat predictable way, and
for those which can be predicted, we can use intelligent predictors. Otherwise, we just need to
apply smoothing.
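A sketch of that reformulation: first check whether a product's series is predictable at all by comparing a candidate predictor against the moving-average baseline, and fall back to smoothing when it is not. The error measure, the margin and the `fit_predict` hook are illustrative assumptions:

```python
def moving_average_forecast(series, window=4):
    """Baseline: predict each point as the mean of the previous values."""
    return [sum(series[t - window:t]) / window
            for t in range(window, len(series))]

def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def choose_predictor(series, fit_predict, window=4, margin=0.9):
    """Use the intelligent predictor only for products where it clearly
    beats the moving-average baseline; otherwise just smooth."""
    actual = series[window:]
    baseline_error = mae(actual, moving_average_forecast(series, window))
    model_error = mae(actual, fit_predict(series, window))
    return "model" if model_error < margin * baseline_error else "smoothing"
```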
>>: Can you give an example of an unpredictable food?
>>: Milk for some reason occurs to me.
>> Mykola Pechenizkiy: Actually, many of the products are hard to predict because you don't
know when the season starts and the season ends. For instance, even for --
>>: There is seasonality --
>> Mykola Pechenizkiy: Exactly, so there is seasonality, and then even for those simple cases,
when you know, for instance, that ice cream sells better in summer, or meat sells better
during nice weather, sometimes it's very hard to know when, in the minds of people, it's
already summer, or whether this weather is indeed good weather. I can tell you a bit more. So the
way we continued: we started to think, well, if we cannot make good predictions by
analyzing individual products, maybe we can collect information from multiple related products,
and then we can make better predictions. And obviously, this can be done in different ways, like,
for instance, analyzing the taxonomy of products. Here are [indiscernible], doing a sort of
content-based analysis and performing clustering and then building models for each cluster. And I
thought this would be the most intuitive and the easiest thing to try out, and I must say that I
lived in Finland for a while, and there are a few types of beer, and typically, in Finland, people go
to town and take beer with them. And most of the beers taste alike, so you would expect that
they behave similarly. Of course, I didn't realize that once you move to Holland, where you
have hundreds of different types of beer or Belgian beer and Dutch beer, the situation is a bit
different. So there is beer that you have on Christmas, beer which is more popular on hot
summer days, in the fall, etc., etc. So it's not just about different glasses which you need to use
for different kinds of beers, but it's also about different habits of what types of beer to drink and
when. Therefore, we quickly realized that, okay, well, analyzing the descriptors
of the items is of no help, and then we need to analyze the behavior of different products and
then see how similar they are, and then based on this analysis try to build group models and then
see how much better we can do. And indeed, it gave some improvement, but still, it was an amazingly
difficult task, a difficult predictive task. So what we ended up with: we tried to figure out when
we can determine some seasonalities, when we can determine external events, discontinuation of
products, new contracts, the appearance of a product which is complementary to some other
products and so on. So it was an interesting study with quite a few lessons. So you can also
think about distributed settings, so, for instance, distributed classification, when you have
multiple hospitals, they perform diagnostics of certain diseases, they accumulate data. Quite
often, they cannot share data with each other, but they can share models, so they can share some
patterns, and then, for instance, in the case of antibiotic resistance, this would be an example
when you know that something is going on, something is happening, change is occurring, so you
need to update the model. So perhaps similar changes may occur at other hospitals, and you
want to alert them upfront, the same way as when you are on a highway and you see a traffic
jam: if you suffer from it, you want to notify the people behind you to be prepared. So a similar mechanism
can be analyzed for handling concept drift, and then this is something that we've been also doing
in the recent past. So think about -- well, actually, this is also a real example of how the weather
pattern looked one day before I had to go to Dallas, so most of the flights were canceled, but
luckily, there was a change and the day after it became much better. Anyway, so the idea is that,
again, we have distributed settings, we have multiple weather stations, and then whenever you
have a cyclone with some warm weather or cold weather, then, when you know the direction of
the wind and you know about the connections, you can proactively anticipate
changes of the weather in the related regions. So this idea can be considered in multiple settings,
so think about social networks or think about other ways when one object can affect other
objects. There could be different kinds of relationships that can be learned from historical data.
So we studied handling concept drift in peer-to-peer settings. Again, assume that
there are multiple peers. Each tries to build a classification model -- for instance, a classification
into positive and negative classes -- and different concepts are being observed for each of the peers. And
if you see how those change over time, you can see that actually there are some relationships
between the peers. So, for instance, in this case, something that we observed for one peer would
later on be followed in another peer. And this is quite a strong pattern over a
long period of time. So the problem is how to analyze these dependencies, how to mine rules or
episodes which will tell us how likely it is that a change in one peer would manifest a change in
the other peer, with a certain delay and with a certain level of confidence. So once you can solve this
task, you can think about building different predictive learning strategies. In our case, we came
up with an interesting ensemble approach where we would maintain two pools of classifiers:
reactive classifiers, which monitor the peer itself for changes, and a pool of
proactive classifiers, which are based on the information that is acquired from other peers. And
we came up with an efficient way to learn the dependencies between the peers.
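As a rough sketch of the intuition only (the actual approach is more careful than this): count how often a change detected at one peer is followed, within some delay, by a change at another peer, and turn the counts into rule confidences. The delay window and confidence threshold are illustrative:

```python
def mine_peer_dependencies(change_times, max_delay=5, min_confidence=0.7):
    """change_times: dict peer -> sorted list of time steps at which a change
    was detected. Returns rules (A, B, confidence) meaning: a change at A is
    followed by a change at B within max_delay time steps."""
    rules = []
    for a, times_a in change_times.items():
        for b, times_b in change_times.items():
            if a == b or not times_a:
                continue
            followed = sum(any(0 < tb - ta <= max_delay for tb in times_b)
                           for ta in times_a)
            confidence = followed / len(times_a)
            if confidence >= min_confidence:
                rules.append((a, b, confidence))
    return rules

# Usage: changes at peer "A" tend to be followed at peer "B" two steps later.
logs = {"A": [10, 40, 70], "B": [12, 42, 72, 90], "C": [55]}
print(mine_peer_dependencies(logs))  # includes ("A", "B", 1.0)
```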
Another interesting aspect is how we can develop the users' trust in what we do, so this is
generally a very important problem in recommendation tasks and retrieval tasks and predictive
modeling tasks. So now we have yet an additional layer of complexity, so we want to explain
not only predictive models, but we also want to explain the change, why this is a change, what
might be possible cause of the change, what we can or should do to the model according to this
change and so on. So here, I would like to give you an example of stress analytics to tell you
what interesting aspects -- 10 minutes.
>>: [Indiscernible].
>> Mykola Pechenizkiy: So I'll give this case very quickly and then try to conclude. So think
about the basic notion of stress. There are different stress factors -- like, for instance, a sudden
shock -- and we have our sympathetic system, which is sensitive to these stress factors, and
our body can react in different ways: the heart rate may increase, sweat production may increase,
and other things. And then we can build devices, similar to these watch-like devices, to
monitor these physiological signals, and then once you detect a change, you beep and alert the
person that he or she is stressed. Well, maybe it's not just about online alerting but about
accumulating this data and making people aware of how much they are stressed, during what
times of day, during what kinds of activities and so on. And then this was an idea we explored
from different perspectives, so we've been trying to build a framework which would allow us to
accumulate information about different stressors, about typical activities that people perform
during the day -- so all kinds of agenda or calendar data -- and then link this data to the actual stress
patterns that were detected. So most of the work was related to analysis of galvanic skin
response data, so we monitor increased levels of sweat production, but we also played with analysis
of voice signals, trying to mix these few things. We also tried to link this with information from
personal correspondence of people, like linking their Twitter account, Facebook account, e-mail
account to see how much positive and negative stuff is being said or how much positive and
negative stuff you get from colleagues or from family and so on. And then, once a sufficient
amount of data is collected, you can also do some interesting pattern mining to analyze what
kinds of activities bring the most stress or less stress, but I must say, this is not the part where we
are; we don't have that much data to do this part. But what we have been doing: we were trying
to analyze the different patterns of stress and how those can be detected. So, typically, what
you can see or expect to see is the following pattern. First, you're in a normal state, then there is
a quick arousal. Then there is this aroused state, and then there is a relaxation period, and people are
back to normal.
>>: Is it normal to be -- normative or normal?
>> Mykola Pechenizkiy: This is a good question. It might take me more than 10 minutes, but to
give a very short answer: I thought that a lot was known about stress, and that I would be
able to contribute, from a computer science point of view, how to analyze the patterns. But the
more I read about stress studies, the more I understood that, actually, not that much is known
about stress. And, in my opinion, what is normal is a very good question. I must have a slide on
this, too. So in practice, the detection of change is somewhat different, so you can have all kinds
of noise in the data, which needs to be processed. There could be all kinds of external factors,
like people doing exercises. There could be a loose contact with the device and all other things
which bring, again, interesting and additional things to consider. So, here, interpretation of the
signal is extremely difficult, so here again, we don't know the ground truth, and it's very hard to
obtain it, even in controlled settings. So we need to find out whether the person is stressed or
not, how much stress, what would be evidence for that and so on.
>>: Do you at least have metadata for this? Do you have metadata in this?
>> Mykola Pechenizkiy: For some...
>>: Like eating, sleeping.
>> Mykola Pechenizkiy: For some, we do. So we conducted some pilot studies. What I
personally did, I ran some experiments with students during exams, and then I was asking them
to report, of course in an anonymous way, for which questions they thought they knew answers,
which they wouldn't know and this kind of stuff. Then, in a couple of courses, I hosted
presentations of project results, and what I was doing: I was wearing the device and I asked the
people presenting the projects to wear the device, to see how much people are in the
arousal state when they present the project to me, or when they present it to their peers, to other
students, or when they listen to other presentations and things like that. So now, in Eindhoven,
there are also studies with schoolteachers. So there is some interesting data. We also conducted
really controlled experiments with a small group of people -- there were 12 people -- and we were
actually running different tests to stress them out, to see the reaction and see how this
would be reflected in GSR and in voice and so on. So we also formulated this as a classification
task, and we were trying to find out what kind of other data we can collect such that we can
disambiguate different cases, like whether it's about physical exercises or real stress, what would
be a possible reason for the stress, and so on? So, again, in many cases, you think about how to
study the phenomenon versus how to come closer to the actual application, such that you can still use
a small device and then use, for instance, accelerometer data to recognize activities, and then
once you know the activities, you can disambiguate better. So, as also mentioned, we analyzed
GSR and speech, with some interesting results. Apparently, speech is much more
discriminative. It's much more predictive, but it's also much more person dependent, so you
need to fine-tune your models to the particular people you're interested in. But the million-dollar
question you raised: what is normal? Or, actually, what is stress? Is it good or is it bad? When we
talk about acute stress, is it because someone is enthusiastic and excited, or is it because someone
is in trouble? Same about relaxation: is it about someone being overworked, or is everything
actually going for the better? And what is the normal state -- is someone in a coma, or trying to
relax, or just in a normal state? And then, in many cases, again, the question is how to get the right metadata, how
to get the right interpretation and how to link it to different contexts. And this is, again, a very
interesting task, where change detection, predictive analytics and context awareness come into play
such that we can better understand the problem, we can better understand labels, we can better
understand reasons, we can explain things better. So there are also other domains where concept
drift has been studied recently, but I will skip this part and jump more into conclusions. So if
any one of you is interested in learning a bit more about the area of process mining and how concept
drift can be considered in that case, just talk to me and I will tell you more about this part. So a few
summary slides. So we observed quite a few cases -- just a few examples -- where handling concept
drift is important or essential in predictive analytics. We overviewed a few types of strategies for
how concept drift can be handled, or at least the mainstream approaches that people study, and we
looked into a variety of application settings which, on the one hand, bring more challenges. So
we need to come up with new techniques which actually would correspond to application
settings. For instance, how we can think of active learning strategies to cope with this concept
drift when labels are not immediately available, but we can request them from the user. But also
to understand what kind of interesting opportunities we have just relaxing the basic assumptions
behind concept drift research and coming closer to real application settings, where much more
data is available and things repeat over time and so on. So, therefore, in my opinion, in the near
future, we will see some transformation of research in handling concept drift. So current focus is
on blind adaptivity: detect change, adapt the models. In the future, we will try to look for
ways to recognize and use similar situations, how to understand change better, and how to add
transparency to the change detection mechanisms and adaptation mechanisms as well. And I
think we will also continue to work on this reference framework, which would give an opportunity
for different researchers working on concept drift problems to see what actual problems they
address and do not address; and then, for people working on applications where concept drift matters,
we can also show what kinds of approaches are applicable, what were the benefits and
drawbacks of different types of things and perhaps develop some interesting reference points and
standards for this part. And if I were to focus on just the major challenges or major next steps to be
taken, in my opinion, it's really, really crucial to go into real life, into practice, and to go from
experiments and simulated data to real application settings, just to see if adaptive applications
indeed benefit from concept drift detection and what the actual situation is. And to be able to do
that, in many, many cases, we do need to improve usability and trust such that domain experts
can understand what we are trying to achieve with change detection, what we're trying to achieve
with model update, how transparent they are, how this can be visualized, explained, connected to
business logic and the like. So I would like to stop here and then say a couple of words of
acknowledgments and thanks to the many colleagues and students who contributed to the lots of
case studies and the development of the techniques behind this talk. Especially, I would like to
acknowledge the work of two colleagues, Indre Zliobaite and Joao Gama, with whom we are
preparing survey papers and analyses of different application studies, and with whom we came up
with interesting approaches which address application settings. And, of course, it's always nice to hear
constructive criticism from reviewers. I cannot name them, because they are anonymous. And I
am really glad that we had collaboration with many industry partners to get real data and see
what is important for real application settings. Thank you.
>>: Thank you.
>> Susan Dumais: We had some questions during the talk, but we have time for others.
>>: So you kind of gave this characterization of different spaces of different concept drift, but
do you have -- it would be nice to see a list of actual problem domains and applications that
people are trying to solve within the concept drift problem. Do you have that kind of a listing, as
well as sort of concrete problems where concept drift happens?
>> Mykola Pechenizkiy: Right, so a very concrete problem is the monitoring of industrial
processes: you have sensor data which tells you about the quality of the output, so you need to
monitor how good your models and the outputs of those models are, and there are different
reasons why they become out of date. And actually, because of the severity of this problem,
many people started to look into the domain of soft sensors -- how to build adaptive
sensors which would monitor the stream and adapt to the new settings of the stream. So that is one
very big area where it is recognized to be a very serious problem, and there is lots of work
related to it. So you can think about several real applications related to control systems; think
about the problem of anesthesia control. So you have a patient and you need to decide whether
you need to inject more or less medicine, such that the person is between the bounds, like not
awake and not too much asleep. Typically this is done with controllers, and now people are looking
at how to optimize it further by detecting interpersonal and intrapersonal differences and then making
better adjustments. So if you think about the problem of personalization, this is a problem where
people have been emphasizing that concept drift matters for, I think, 15 years already, but if I look
at what has been done in this area, it's not that much.
>>: I agree sort of with your earlier characterization that people often talked about it mattering
without actually having the proof in data. So, for example, preferences: do preferences actually
drift that much, or is it the availability of products that is different, right? And so you're
emphasizing --
>>: Or changes in population.
>>: Do people actually change all that much? It's not clear for a lot of these cases whether the
distribution shift is different from the actual concept changing, and both happen. So coming back,
I think, to the example where Susan was asking about things changing category, one of the
examples is e-mail foldering. My view of the world sometimes changes, and I go back to
something that I put in one folder, and I put it somewhere else. I've actually changed the
concept, and everything else that was automatically categorized is now up in the air. How would
we actually reflect on that? How we would show that to the user is very different in that
kind of case than, I think, in others, and I think you can have cases where there are concrete
examples of that concept changing; that to me seems like a challenge.
>> Susan Dumais: I mean, you've talked about this as concept drift, but there's a lot that could
be going on. It's actually interesting.
>> Mykola Pechenizkiy: What you refer to is actually the separation of real concept drift and
virtual concept drift, but also, what I was trying to emphasize a few slides before that is that you
can think about different reasons why things change or why the decision boundary changes. So in
the case of spam classification, okay, well, preferences of people might not change, but because of
adversarial activities of spammers, you need to detect that your current models are not good
anymore or that, I don't know, maybe you'll have lots of spam about some rich people trying to
give their money away, and then you'll get more Viagra spam and more about some other types of
products. And then, just because you have this population shift, your decision boundary
changes, and then you need to detect it and then learn it early on. So depending on what kind of
change you anticipate, you would monitor either the input -- so if it is just about population shift,
it is enough to monitor the input space. However, if it is about a change in the real decision boundary,
you need to get labels, or proxies of the labels, and then see whether your
classification accuracy deteriorates or not, because otherwise you cannot capture it. And then
there could be a few more distinct points, so this was one of the points I was trying to emphasize,
that actually, the variety of applications and reasons why things change over time, they can be
quite different, and we need to come up with different strategies to address them. But your
original question was like, name an application where it is proven --
>>: Not name an application. I was just saying it would be an interesting contribution -- not just
the ontology of categorizing it, but here's a listing of 50, 60 applications and how they map into
that space.
>> Susan Dumais: Or which ones work well -- that would be really, really fun.
>> Mykola Pechenizkiy: This was our original idea, so this is something we were trying to do,
so we've been analyzing lots of papers which were of applied nature and they were trying to
emphasize what kind of interesting technique they developed and for what purposes. And so
we've been trying to list all kind of application areas, all kind of concrete tasks and what kind of
techniques people were using. But when we looked closer, so then you realize that, in most of
the cases, you don't have strong evidence that that particular approach did work well in practice,
simply because people played either with simulated data or with some benchmarks and
artificially introduced drift, and then we refocused a bit. But, indeed, this is an interesting
aspect. So, this is something that we typically try to do when we perform a case study, so we
always try to quantify what is the effect of not capturing concept drift. So you have a predictive
model and what would be effect so that you didn't detect the change point, or what would be an
effect that you detected it too late, or what is the effect that it is detected not accurately? And
then you can see what is the actual effect of this misdetection or false alarm on the behavior of
the system, or what is the tradeoff? What happens if many more false alarms are triggered, so
you rerun the model, what is the corresponding effect on the final performance? But I would say
that mainstream research and reality in the concept drift area are a bit disconnected. So there are
lots of interesting ideas proposed. They were tested on benchmarks, but for some of them, we
don't know yet whether they really work in practical settings, and then there are a few reasons for
that, and some of them I was trying to list. So there should be a much more systemic view on the
problem, so we have predictive analytics tasks and concept drift is just one of the ingredients.
And then you can tell how important it is and how to quantify these effects. So does it answer,
or does it relate?
>>: Are there some examples from industry where concept drift is accounted for and effective?
>> Mykola Pechenizkiy: Actually, it is so. You also need to realize that this is quite a large
problem space, and in many research areas, the same concept was called slightly differently. So,
for instance, temporal dynamics is known to be important, and it has been studied in different
area, including information retrieval, recommender systems, industrial applications. So, for
instance, in the Netflix competition -- you can say lots of interesting things about it, but nevertheless,
one of the interesting aspects is capturing this temporal dynamics: understanding how users'
rating scales shift, understanding how ratings change over time or how co-ratings change over
time, and many, many other effects. And then, if you take them into account, yes, you can do
better. Again, the question is whether you need to have an explicit change-detection mechanism,
or it is enough to build evolving models which just incrementally update. So this is another
aspect of it.
So if you look into the DARPA challenge, to me, this is a very good example of concept drift
research. Even though they don't name it explicitly this way, this is a very concrete example
of how you build models which evolve over time.
>> Susan Dumais: And then financial markets, the whole financial industry. Do they predict
change, or do they just continuously change the model?
>> Mykola Pechenizkiy: I have strong negative opinions about --
>> Susan Dumais: I see. On that note. Thanks again.