>> Susan Dumais: So good afternoon, folks. It's my pleasure to welcome Mykola Pechenizkiy, who is at the University of Eindhoven, where he's been for, what, five or six years? Seven, okay. And he's done a lot of work in the area of social media analysis, the area of information dynamics, how information changes over time, and in fact, he's going to talk about that today. But a lot of this work strikes me as a leading indicator of what we see now in data science, where you have interesting real-world problems, large-scale data sets, and thinking about new ways of capturing some of that at large scale. So it's my pleasure to introduce him, and today he's going to talk about how to handle concept drift in predictive modeling, because the world is not stationary and IID, and so it's an absolutely important question that a lot of the models overlook, so it's fun to hear about it. Thanks.

>> Mykola Pechenizkiy: Thanks. Good afternoon, everyone. It's my pleasure to be here and an honor to be here, and thanks, Susan, for the kind introduction and for giving me the chance to give this talk. So today, I will talk about how to get ready for change, and I provided a subtitle for the talk for disambiguation, just to make sure that this talk is not about climate change, not about global warming and not about regional cooling, so there are a couple of pictures from Dallas a few days back, but probably you saw more on TV. So I'm going to talk about predictive modeling, and many of you know that making predictions is a hard job; quite a few wise and very experienced people, during their careers, made quite a few predictions. Not all of them were correct, so you might recognize some of the famous quotes about what the potential of a new technology is, whether it's going to take off and so forth, but these are different kinds of predictions. These are based primarily on the expertise and intuition of experts or gurus, whereas I am going to talk about data-driven analytics and the problem of concept drift. So the outline for the talk is very straightforward, very simple. I'm going to explain what I mean by predictive analytics and why it is important to consider concept drift, what the typical approaches to handling concept drift are, and after that I will try to emphasize, from the applications' perspective, how many challenges haven't been addressed yet and what kind of interesting opportunities for further research this provides. And, at the end, I will make my own forecast or prediction of how the field is likely to develop further. So by predictive analytics, I just take a reductionist approach to say, well, essentially, it's about data mining and about knowledge discovery. We have the possibility to collect large volumes of data and process this data to get some useful predictive patterns, models, summaries, which can give us an insight into phenomena, get us some actionable patterns and things like that. So the most basic supervised learning setting may be presented in the following way. We have some historical data that tell us how our target population behaves. We can induce a model from this historical data, such that given previously unseen instances, we can generate a label. Like, for instance, we want to build a classifier to determine which of the e-mails are important and which are not.
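To make this basic supervised setup concrete, here is a minimal Python sketch (illustrative only, not part of the talk) of an e-mail relevance classifier; the tiny inline data set, the TF-IDF features and the logistic regression model are all assumed placeholder choices.

```python
# Minimal illustration of the supervised setting described above:
# historical e-mails with labels -> an induced model -> labels for unseen e-mails.
# The data and feature choices are toy placeholders, not the speaker's system.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

historical_emails = [
    "Talk today at 4pm on concept drift in predictive modeling",
    "Reminder: project meeting moved to Friday",
    "You have won a prize, click here to claim it",
    "Cheap watches and miracle pills, limited offer",
]
labels = ["relevant", "relevant", "irrelevant", "irrelevant"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(historical_emails, labels)      # induce a model from historical data

new_email = ["Invitation: seminar talk this afternoon in the lecture hall"]
print(model.predict(new_email))           # generate a label for an unseen instance
```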
So based on the previously collected e-mails and knowing which of those were relevant and which were not, we build a model, so next time, when you receive a message about a talk like this one, you can be sure you'll not miss it. And then you can think about various types of predictive modeling tasks, whether it be classification, determining the relevance of an e-mail or finding out who the user is, performing user profiling and understanding whether this is a novice user or an expert, what kind of information need the user has and so forth. We can think about various types of regression tasks, for instance, making predictions of how you would score a talk like this one, or performing ranking in the case of search engines or recommendation systems -- or performing time series prediction. For instance, when we want to estimate the popularity of our website, when we want to predict the amount of traffic we can attract or the click-through rate on a particular banner or news item and so forth. So there are lots of approaches to predictive modeling for each of these tasks. In this talk, I will focus primarily on classification examples. However, most of the discussion we will have can be related to other predictive modeling tasks as well, including ranking and prediction and so forth. So if we think about the geometrical representation, you can consider a number of instances which are represented in some multidimensional space. They have labels, which represent -- which tell us to which class they belong. And there is a true decision boundary, which allows us to discriminate instances of one class from the other class. So if you ever tried to build a predictive model in practice, you're familiar with lots of major pitfalls, like we need to have clean input, or at least somewhat reasonable input, to build a good model. We need to come up with representative features. We need to come up with the right problem formulation, such that we optimize for the right things, avoid overfitting and avoid false predictors, and think carefully about operational settings: what is known to us when, which attributes are descriptive or predictive, which attributes we can manipulate and so on. However, I would like to focus on just one single aspect related to the problem of concept drift, which, in my opinion, is understudied in different communities and even more so in practice. So think about the following problem. We want to find out whether a particular antibiotic is going to be effective or not against a certain type of pathogen. We have historical data about patients, about their demographics, hospitalization data, and we know for which patients some antibiotics were effective or not effective with respect to certain pathogens. So we can use our favorite classifier to determine whether certain types of antibiotics will be effective or not, given a new patient. However, the problem is that antibiotics -- excuse me. The problem is that different pathogens can develop resistance to antibiotics over time. So a classifier which was effective in the past may become ineffective over time, and this may happen due to various reasons. The following cartoon gives the main intuition. One of the pathogens can develop resistance, and once this happens, there are actually different ways this information can be communicated to other pathogens.
So a piece of information can be passed to the other pathogen, or there could be a case of so-called microbial sex or other types of relationships that can spread the information about resistance and make the pool of antibiotics currently used at the hospital ineffective. So the goal is to find out when such things happen and find out how the model should be adapted, adjusted, to keep it up to date and useful. So when we think about streaming data, we can think about different types of situations or different reasons why things may change over time, and I think about four characteristic examples. We can think about changes in personal interests; think about recommendation systems or search systems. When we try to figure out what people are interested in, we build a user model, build a user profile of short-term, long-term and current interests, and because many of those are not stable, we need to monitor for those changes. Think about the e-mail classification example, when we want to identify which e-mails are relevant and which e-mails are spam. We can build a classifier which is accurate at the current moment in time -- however, some of those spammers unfortunately are quite smart, so they try to bypass the spam filters, and they develop newer and newer strategies for how to generate e-mails. So we need to deal with these adversarial activities, again, recognize them explicitly through the analysis of the input stream or by monitoring the performance of our classifiers to identify these changes and adjust to them. You can also think about changes in the population as such; think about an economic crisis. That changes the profiles of people who apply for credit. So in a credit scoring application, again, we would need to account for such global changes in the models. Finally, you can think about complex environments where we can explicitly monitor lots and lots of pieces of information. However, it is very -- well, it's intractable to model all possible combinations of factors, and we just need to reduce the model to only those few elements which are critical for our application. So many of you might be familiar with the DARPA challenge for driverless cars, where the winning team was describing how they used adaptive learning strategies to analyze whether a surface is drivable or nondrivable, whether it's a paved road or an unpaved road under different lighting conditions and things like that -- so simple adaptive learning strategies can do the job. Many of you are also familiar with the Netflix competition, where, again, lots of different ideas were proposed, different factors affecting the performance of predictive models were studied, and part of it was related to temporal dynamics of ratings: how items become more popular over time, how, for instance, item-side effects and user-side effects can be observed with respect to the rating scale, with respect to user interests, with respect to the popularity of the items, seasonal effects and so forth. So in the machine-learning and data-mining communities, people sometimes emphasize the difference between real concept drift and virtual concept drift. Think about our original example, where we have instances belonging to two different classes and we have a decision boundary. We talk about real concept drift when this true decision boundary changes, and then we need to update our classifiers to learn this new boundary. However, there could also be different reasons for virtual drift, when the true decision boundary remains the same but the distribution of the data changes, and it still affects the performance of classifiers.
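As a toy illustration (again, not from the talk) of the distinction just drawn, the following sketch trains a linear classifier on historical data and then scores it under a changed labeling rule versus a shifted input distribution; the boundaries and distributions are invented for the example.

```python
# Toy illustration of real vs. virtual concept drift on 2-D data.
# Real drift: the labeling rule (true decision boundary) itself changes.
# Virtual drift: the rule stays the same, but the input distribution shifts.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def label_old(X):                 # original concept: class = [x0 + x1 > 1]
    return (X[:, 0] + X[:, 1] > 1.0).astype(int)

def label_new(X):                 # changed concept (real drift)
    return (X[:, 0] - X[:, 1] > 0.0).astype(int)

X_hist = rng.uniform(0, 1, size=(500, 2))
clf = LogisticRegression().fit(X_hist, label_old(X_hist))

X_real = rng.uniform(0, 1, size=(500, 2))        # same inputs, new labeling rule
X_virt = rng.uniform(0.5, 1.5, size=(500, 2))    # shifted inputs, old labeling rule

print("accuracy under real drift   :", clf.score(X_real, label_new(X_real)))
print("accuracy under virtual drift:", clf.score(X_virt, label_old(X_virt)))
# With a well-specified model the virtual-drift score can stay high; it tends to
# degrade when the fitted boundary only approximates the true one in the regions
# that were densely sampled in the historical data.
```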
So, as discussed, these different cases can be decomposed even further. However, from a predictive modeling point of view, as soon as changes in the data, or in the labels, so in the decision boundary, affect our prediction decisions, we need to detect this and decide what to do next. So if we go back to the basic supervised learning idea under concept drift, now we need to question whether the population, about which we had some historical data and learned our model, is still the same or is different. And, if it's different, what shall we do with our model? How are we going to update it? So we can think about two major approaches: how to manipulate the training data, how to select the most representative, for instance the most recent, instances, and how to perform the model update or the learning of the models, whether it's a single model or an ensemble of multiple models. So we've been looking at different strategies and at some general high-level framework which would allow us to describe the major types of approaches for handling concept drift, and it can be represented with the following simplified view. We have input data. From this data, we learn the model and we cast predictions. Besides that, we accumulate some relevance feedback about the performance of our models, we monitor our input data, we monitor how well our classifiers perform, and if we detect changes, we can signal them to our data management system or learning subsystem such that it becomes more up to date and can generate more accurate predictions. Besides that, we can also raise an alarm about the identified changes, either to the domain expert or to other sub-modules of a system. So, consequently, taking this perspective, you can think about various ways to characterize the major approaches for handling concept drift: with respect to memory management, different types of change detection techniques, the properties of the learning techniques, and how we perform the monitoring and evaluation of these learning approaches. When I talk about memory management, you can think about short-term memory, when we have a small portion of data that we keep in main memory and can use for building a model, or you can also think about long-term memory, which is captured in the models themselves. And effectively, you can think about various strategies: how to maintain a training window of fixed size or variable size, how to introduce some forgetting mechanisms such that we, for instance, down-weight the importance of instances which were observed far in the past and boost the importance of the most recent stuff.

>>: So you're talking about it as if you get the same treatment for every, say, query or every instance. Do you look at all at models that may have different, say, forgetting functions for different queries or different subsets of the data?

>> Mykola Pechenizkiy: So this is a very good question. I'll try to come back to it a bit later during this talk. The mainstream of concept drift research assumes that we deal with a single object. So, for instance, we have time-series data that tells us something about an industrial process, or we talk about an individual user and we monitor the interests of this user. So think about a single classifier or a single learning mechanism.
But you're absolutely right. So in many cases, we have a multitude of objects that we trace over time, and it makes perfect sense on the one hand to build different models, and on the other hand to try to learn from multiple objects such that they contribute to each other. However, in the most basic settings, it is considered to be all uniform, so we have similar kinds of data about similar instances. With respect to change detection, again, you can think about various types of approaches, like what can be monitored and what kind of analysis we can perform: for instance, sequential analysis like the cumulative sum test, control charts, monitoring of two distributions, using some contextual monitoring and the like. From the point of view of the properties of the learning models, again, we can think about various categories: for instance, whether they perform retraining of the models from scratch or apply some online or incremental learning where the models are informed about a detected change, or whether we perform evolving or blind adaptation, so we don't know whether a change happened or not, but we do anticipate it, and therefore we update the models. And, finally, whether we build individual models or try to maintain an ensemble of individual predictors. So this can be discussed in a much more fine-grained way. However, instead, I will try to focus on four major categories and I'll give an idea of how simple these approaches could be. So we can think about four major types of approaches. On one axis, think about using or not using detection mechanisms, which would trigger information about the change and trigger an update of the model, versus evolving approaches, which just adapt the models at every step: we don't know whether a change happened or not, but because we know it may happen, we perform the adaptation. And on the other axis, we can think about individual models or ensembles. In the first case, typically, people think about some reactive modeling with a forgetting mechanism, or an ensemble approach which tries to maintain some memory over time. So, for instance, we can think about a simple forgetting mechanism where we have a fixed sliding window. It moves over time, so our model is retrained or incrementally updated by taking in new instances and forgetting the most outdated ones. A very simple approach, but it can lead to some reasonable results. The next one is based on explicit detection of a change. Again, we have training data and we have a change detection mechanism. Once we detect that there was a change, we disregard the data which is no longer relevant and build a model on the new stuff. So, again, you can think about how we can hunt for useful information about the change, whether it's in the input, in the model itself, in the output or in the analysis of the performance of the models. All of this can be useful for detecting a change. Most frequently, we analyze two windows of some statistic, which can be computed, again, on the input, on the model parameters or on the outputs. We have a reference set which we assume captures the current stable behavior of the model. Then we have the most recent set. We compare the two, and if there is an observed difference between these two sets, we signal a change, and different statistical tests can be used for that.
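A minimal sketch of the two-window scheme just described, assuming the monitored statistic is a stream of 0/1 error indicators and using a simple two-proportion z-test; the window sizes and threshold are illustrative assumptions, not values from the talk.

```python
# Two-window change detection over a stream of 0/1 error indicators
# (1 = the deployed model misclassified the instance). A reference window
# captures the assumed stable behavior, the most recent window is compared
# against it, and a change is flagged when the recent error rate is
# significantly higher.
import math
import random
from collections import deque

class TwoWindowDetector:
    def __init__(self, ref_size=200, recent_size=50, z_threshold=3.0):
        self.ref_size = ref_size
        self.recent_size = recent_size
        self.z_threshold = z_threshold
        self.reference = []
        self.recent = deque(maxlen=recent_size)

    def add(self, error):
        """Feed one 0/1 error indicator; return True if a change is flagged."""
        if len(self.reference) < self.ref_size:
            self.reference.append(error)      # still filling the reference window
            return False
        self.recent.append(error)
        if len(self.recent) < self.recent_size:
            return False
        n_ref, n_rec = len(self.reference), len(self.recent)
        p_ref = sum(self.reference) / n_ref
        p_rec = sum(self.recent) / n_rec
        pooled = (sum(self.reference) + sum(self.recent)) / (n_ref + n_rec)
        se = math.sqrt(pooled * (1 - pooled) * (1 / n_ref + 1 / n_rec)) or 1e-9
        if (p_rec - p_ref) / se > self.z_threshold:
            # change detected: forget the old windows so the model can be
            # retrained on data collected after the change
            self.reference = []
            self.recent.clear()
            return True
        return False

# Simulated stream: the error rate jumps from 10% to 40% at instance 400.
random.seed(0)
stream = [int(random.random() < 0.1) for _ in range(400)] + \
         [int(random.random() < 0.4) for _ in range(200)]
detector = TwoWindowDetector()
for t, e in enumerate(stream):
    if detector.add(e):
        print("change flagged at instance", t)
```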
So we can build an ensemble of classifiers to track how well different models perform; again, think about a timeline. We have multiple windows, and on these windows we build classifiers. Every time we need to cast a new prediction, we ask every classifier to cast a vote and apply some voting mechanism. So think about the initial situation: we have multiple models and initial weights. They all cast a prediction. We know what the true label is, so those guys which were correct get a high weight, those which were wrong get a low weight, and then we keep doing this, and this way we always maintain a pool of classifiers and know which of those are the most adequate or the most accurate on the most recent data. Finally, we can think about some contextual approaches, where we try to identify regions in the data, and for each of the regions we find the most suitable classifier, which has the highest accuracy or the highest generalization performance on that piece of data. And whenever we have a new instance to classify, we check in the instance space which other instances are similar to it and apply the corresponding classifiers which are expected to perform best in that neighborhood. So given these different types of techniques, we can also think, well, which of those are most effective or which of those would be useful in which settings? And we can also think about different kinds of change: is it gradual change, sudden change or recurrent change? Is it expected somehow, is it predictable, or is it completely unexpected? And, consequently, in which cases can we be reactive, and where is there room for being proactive, anticipating change, making predictions about it and making models ready to use? So if we look into these four major groups of strategies for handling concept drift: whenever we deal with sudden drift, with reactive models based on explicit change detection and some forgetting mechanism we would be doing just fine, and similar approaches can deal with gradual drift, but for recurrent drift, we need to have some form of meta-learning and some form of context awareness to capture the recurrence of particular concepts. So, given this overview, of course, there is also the question of how well these different approaches work in practice, why there are so many of them, and, if there is no one best approach for all situations, how we are going to choose the most appropriate one for a particular application in mind. So many of you might be familiar with the cross-industry standard process for data mining, CRISP-DM, which was proposed some years ago and was appreciated by industry because it specified the different steps at the process level, starting from business understanding, data understanding and data preparation, to data mining, evaluation and deployment, and each of the steps was described with respect to operational settings, what the useful things to take into consideration are, what the available techniques to be used are and so forth. So now, if you look into streaming settings, think about all of these feedback loops, which in the past were assumed to be performed by domain experts. The domain experts would see how well a particular model performs and then try to fine-tune the parameters, or reconsider what kind of technique to use, or reconsider what would be a good representation and so forth. In streaming settings, we try to automate all of these feedback loops as much as possible. So we monitor the performance of classification models or predictive models, and whenever something goes wrong, we need to take a decision on how to update the model, how to change the representation of the data and so on.
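Returning to the weighted voting described above, here is a small sketch, loosely in the spirit of dynamic-weighted-majority schemes; the member interface (a predict method per classifier), the penalty factor beta and the pruning threshold are illustrative assumptions, not the talk's specific method.

```python
# Weighted-majority voting over classifiers trained on different time windows:
# members vote with their weights, and once the true label arrives, members
# that voted wrongly are down-weighted and eventually pruned.

class WeightedEnsemble:
    def __init__(self, beta=0.7, min_weight=0.01):
        self.members = []              # list of [classifier, weight]
        self.beta = beta               # multiplicative penalty for a wrong vote
        self.min_weight = min_weight   # prune members that fall below this weight

    def add_member(self, clf):
        """Add a classifier trained on the most recent window, with weight 1."""
        self.members.append([clf, 1.0])

    def predict(self, x):
        """Weighted vote of all current members (assumes at least one member)."""
        votes = {}
        for clf, weight in self.members:
            label = clf.predict(x)
            votes[label] = votes.get(label, 0.0) + weight
        return max(votes, key=votes.get)

    def update(self, x, true_label):
        """Once the true label is known, down-weight wrong members and prune."""
        for member in self.members:
            clf, weight = member
            if clf.predict(x) != true_label:
                member[1] = weight * self.beta
        self.members = [m for m in self.members if m[1] >= self.min_weight]
```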
So we were curious to see what people think about applying concept drift handling in practice, and we did an extensive literature analysis to see how well different techniques work in practice, or at least how well different researchers believe they work in practice. And we also did quite a few case studies ourselves, and we were surprised to see that some of the popular approaches might work or might not work, and then we were trying to figure out what the reason was, and which of those techniques are in general applicable or not applicable for particular operational settings. And we realized that, in the area of concept drift research, the situation was quite similar to the early days of machine learning and data mining. Many researchers say temporal dynamics is important, concept drift is important, but then real data is hard to get because of privacy concerns or some proprietary reasons or because there is too much overhead to work with real data, or sometimes people come up with a solution that is too generic or too abstract, so it's not ready to be used in real applications. So what we can often see is that people play with benchmarks or artificial data sets and then try to generalize from those experiments. However, because in data mining many journals, conferences and reviewers argue that, well, this is an application field, so you need to show the relevance to practice, you need to show that your methods do work in practice, many people try to invent applications and show that the ideas do work in practice. And this is something I relate to the problem of green aliens, and I will say more about this in a minute. So in the case of concept drift research, at some point it was really close to this extreme: people were playing with artificial data, there were a couple of extremely popular benchmarks with a particular type of concept drift simulated, and in many, many cases, people were trying to use standard UCI benchmark data sets, manipulate them in a certain way such that concept drift would be introduced, and then show that it would be captured by certain techniques. And, in my opinion, this is really, really close to the situation described in a paper that I really like, called Novel Efficient Automated Robust Eye Detection System for Green Aliens, and it describes the situation in the field quite well. It is a sarcastic short paper, a very quick read, and I strongly recommend having a look at it. So this triggered us to rethink once again what kind of methods we have, in what kind of application settings they can be used, what the variety of application settings where concept drift matters is, and where we actually need different types of strategies to handle it. So we did a very simple thing. We tried to identify different dimensions along which we can categorize or characterize various types of applications: for instance, with respect to what types of data are being used, what kind of problem formulation we have, what kind of changes we anticipate and the various types of operational settings we need to consider. So with respect to data, again: is it time series data, relational data or a mix of those? How is the data organized, is it a high-speed stream or is the data coming in batches? Can we re-access the data? How do we deal with missing data and things like that? With respect to change, what types of change can we anticipate? Is it just a single type, like only sudden or only gradual, or may different types of changes occur simultaneously in the same application?
What is the source of change? Is it about adversary actions or changes in the population, or, again, could there be multiple reasons present at the same time? What are our expectations about change, and what are our expectations about the desired actions if we know that a change appeared? Because, actually, it really matters what we are optimizing for, if we know what the goals are. And with respect to operational settings, again, we need to be very careful in understanding whether the labels are available to us immediately or with some delay, or whether they are never available. Are they available as ground truth or as some proxies for ground truth, or can we just recompute from historical data, from offline data, how likely it was that we were correct or not correct, and other things. So we also analyzed various types of applications per industry, like in finance and banking, in security, in e-learning, entertainment, search, recommender systems and things like that, and we also tried to categorize them by different types of applications: whether it's about monitoring and control, whether it's more about personalization tasks in search and recommendations, whether it's more about management and planning, where we need to perform demand prediction tasks, or some ubiquitous applications, where we have, for instance, location-based services or other things which are integrated into other, larger systems. And then, actually, we analyzed these different types of applications and saw what kinds of drift typically occur, how frequently we have or do not have labels available, how hard they are to obtain, how objective or subjective they are and so on. And we also started to look at different methods and what we have and what we don't. And it's interesting to see that, if you look into mainstream research on handling concept drift, these would be the most commonly assumed settings. Change is typically considered to be unpredictable. Typically, it's sudden, and if we observe multiple changes over time, they are considered to be independent of each other, so there is no opportunity to learn from the recurrence of changes; typically we analyze only single objects, not multiple objects. We assume no closed-loop control, so there is no effect of or reinforcement on the behavior of a particular system; even if it's an adaptive application, like a recommender system or an information-retrieval system, there is no assumption about the biases in the data, and typically it is assumed that, given historical data, we can replay it multiple times, fine-tune the parameters and get reasonable estimates of how well a particular technique would perform. And besides this, most of the approaches for handling concept drift assume that the true labels are available right after casting the prediction, which, you can imagine, in practice is rarely true. So, in reality, if you think about a wide spectrum of applications, changes often recur multiple times for a given object of interest, but they also recur across multiple objects with related behavior, so typically, we need to monitor for different types of changes. Beyond that, we can think about a multisensory environment or multisensory data, when the same object is described by different feature subsets. If they are analyzed independently, they give a somewhat unreliable signal. However, if we analyze them together, they can help us to detect concept drift much better.
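One simple way to exploit this, sketched below as a toy example rather than a method from the talk, is to run a cheap detector per feature subset and raise a fused alarm only when several of them agree; the per-stream mean-shift test, the window lengths and the quorum are all invented for illustration.

```python
# Toy sketch: run a cheap detector per feature subset / sensor and only raise a
# fused alarm when at least `quorum` of them agree, to reduce the effect of any
# single noisy signal. The per-stream test is a crude mean-shift check.
import numpy as np

def mean_shift_flag(reference, recent, z=3.0):
    """Flag if the recent chunk's mean deviates strongly from the reference mean."""
    mu, sigma = np.mean(reference), np.std(reference) + 1e-9
    return abs(np.mean(recent) - mu) / (sigma / np.sqrt(len(recent))) > z

def fused_drift(streams, ref_len=200, recent_len=30, quorum=2):
    """streams: list of 1-D arrays, one per feature subset, aligned in time."""
    flags = [mean_shift_flag(s[:ref_len], s[-recent_len:]) for s in streams]
    return sum(flags) >= quorum, flags

rng = np.random.default_rng(1)
s1 = np.concatenate([rng.normal(0, 1, 300), rng.normal(0.8, 1, 50)])   # drifts
s2 = np.concatenate([rng.normal(5, 2, 300), rng.normal(6.5, 2, 50)])   # drifts
s3 = rng.normal(-2, 1, 350)                                            # stable
print(fused_drift([s1, s2, s3]))
```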
So, in many cases, we have no idea what the ground truth is, so we can only make some guess about it. In many cases, we know quite a few things about the process we model, and this background knowledge can be utilized to build better predictive models, but also to have better ideas about what kind of changes we anticipate and how they can be detected.

>> Susan Dumais: Have you seen many examples where the ground truth changes over time, or at least the labels change? I want to say the ground truth changes.

>> Mykola Pechenizkiy: Right, so there are a few aspects. You can think about a change of label, a simple change in interest: assume you have a classifier which determines whether the user is interested or not interested in a particular topic, like soccer. And over some period of time, you observe that the person was interested, and then you keep recommending items about soccer, and the user doesn't click, so this is a change in ground truth, in the sense that you know that the user had this interest, and now the user has changed that interest. But what I also mean here is that in many cases we cannot make an association between the labels and the ground truth, so we collect information -- like, for instance, we collect implicit feedback, but we are not sure if this corresponds to the actual interest of the user. Or, in many cases, when we deal with sensory data, the goal is to reconstruct the signal, so we don't know what the ground truth is; we try to predict it. And then, based on analysis of historical data over a long window, we can reconstruct this ground truth with a certain degree of certainty, but we're still not sure if it is the ground truth or not. And this is also quite important when we try to optimize the performance of techniques on historical data: we can fine-tune the parameters to optimize for the labels, but effectively, those labels can be quite noisy and we optimize for the wrong thing. So, consequently, if you look into the peculiarities of various types of applications, you can think about more advanced approaches, both for reactive and for proactive handling of concept drift. Whenever we think about recurring changes, there is room for meta-learning approaches: how to recognize similar situations from the past and then use them for more accurate detection and prediction. Whenever we know that there are different, related sources of data, we can learn from that external data, which can be considered in centralized settings but also in distributed settings, and then we can think about context-aware approaches. I will try to give a few ideas of each of those. So regarding context, think about the problem of outlier detection. If we have some seasonality in the data, it's not sufficient to perform simple thresholding; we need to have some approach which will capture this seasonality. Similarly, when we want to perform change detection, we may want to figure out what the normal behavior of a particular system is, so here is a particular example of sensory data. The goal is to reconstruct the true signal. This is just a mass measurement of fuel in an industrial process. It's a very clean task, very easy to formulate. So at every moment in time, we need to say what the actual value of this time-series data is, so we need to get rid of noise, we need to detect change points, marked with these red circles, and we need to do this as quickly as possible such that we start relearning the models again and again.
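A rough sketch of those two subtasks, assuming a rolling-median filter for the spiky noise and a CUSUM-style statistic for the change points; the synthetic signal and all thresholds are invented for illustration, and the industrial system described in the talk is considerably more elaborate.

```python
# (1) suppress spiky outliers with a rolling median, (2) flag change points with
# a simple CUSUM-style statistic that is re-calibrated after each detection.
import numpy as np

def rolling_median(x, k=9):
    half = k // 2
    padded = np.pad(x, half, mode="edge")
    return np.array([np.median(padded[i:i + k]) for i in range(len(x))])

def cusum_change_points(x, calib=50, slack=0.6, threshold=12.0):
    """Two-sided CUSUM; the first `calib` samples after each (re)start define
    the in-control level and scale. Returns the indices of detected changes."""
    changes, start = [], 0
    while start + calib < len(x):
        segment = x[start:start + calib]
        target, scale = np.mean(segment), np.std(segment) + 1e-9
        pos = neg = 0.0
        detected = None
        for i in range(start + calib, len(x)):
            z = (x[i] - target) / scale
            pos = max(0.0, pos + z - slack)
            neg = max(0.0, neg - z - slack)
            if pos > threshold or neg > threshold:
                detected = i
                break
        if detected is None:
            break
        changes.append(detected)
        start = detected            # re-calibrate on data after the change
    return changes

# Synthetic stand-in: a level shift at t = 250 plus noise and sparse spikes.
rng = np.random.default_rng(2)
signal = np.concatenate([np.full(250, 20.0), np.full(250, 26.0)])
signal = signal + rng.normal(0, 0.5, signal.size)
spike_idx = rng.integers(0, signal.size, 15)
signal[spike_idx] = signal[spike_idx] + rng.normal(0, 8, 15)

cleaned = rolling_median(signal)
print("change points near:", cusum_change_points(cleaned))
```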
So there are quite a few challenges to address here, asymmetric outliers, also some distortions in different phases of the process, but there is more to it. First of all, we see that there is a clear pattern -- we have one phase with a fuel feeding process and one with a fuel consumption process, and they follow each other, so we can try to see how we can get information about this periodicity and make use of it. Another interesting aspect here is that there are different factors which affect the behavior of this time series: for instance, a different fuel type would affect how many outliers we have and some other properties of the signal. So if we know some of those factors and we can model them explicitly, again, there is room for methods and approaches where we can learn models for different types of fuel, so in online settings, we can recognize the current quality or type of fuel and apply the corresponding model, which would make better use of the data. Another nice type of problem is demand prediction, like predicting how much traffic we can attract on a certain type of day for a certain type of query and so on. We studied this problem in the case of food wholesale prediction, which was important for the company for stock replenishment. If you underestimate the demand, there will be empty shelves. If you overestimate the demand, there will be lots of perishable goods wasted. There are also interesting aspects of how badly you can do if you predict the demand too late or too early and what kind of costs are associated with it. So this problem is difficult as such, because it's hard to predict the behavior of people, when they want one or another type of product and in what quantities, so we formulated this as a time series prediction task, which was augmented with lots of information about weather, about the calendar, about holidays, about promotions, and we were trying to make the predictions. And actually, this project started very interestingly. The company said that they were using a simple moving average to cast the predictions, and you think, oh, we will use any state-of-the-art predictor and then we will do much better. And we picked a few dozen nicely behaving time series, we applied some simple ensembles of regression models and we got somewhat better results. It would beat the moving average that was used by at least 10% to 20%, so we were quite encouraged. We talked to the company and showed the results. They gave us a much larger pool of products, several hundred products, so we applied the same models and compared across multiple products. Our models were worse than the moving averages. And, in fact, there is a good explanation for that. Many of these products are simply unpredictable. If they're just noise, it doesn't matter how hard you try to predict them. If there is no signal, you cannot do much more than just smooth it out, and then we reformulated the problem into how to determine which of the products behave in a somewhat predictable way, and for those which can be predicted, we can use intelligent predictors. Otherwise, we just need to apply smoothing.

>>: Can you give an example of an unpredictable food?

>>: Milk for some reason occurs to me.

>> Mykola Pechenizkiy: Actually, many of the products are hard to predict because you don't know when the season starts and the season ends.
For instance, even for --

>>: There is seasonality --

>> Mykola Pechenizkiy: Exactly, so there is seasonality, and even for those simple cases, when you know, for instance, that ice cream sells better in summer, or meat sells better during nice weather, sometimes it's very hard to know when in the minds of people it's already summer, or whether this weather is indeed good weather. I can tell you more in just a bit. So the way we continued: we started to think, well, if we cannot make good predictions by analyzing individual products, maybe we can collect information from multiple related products, and then we can make better predictions. And obviously, this can be done in different ways, like, for instance, analyzing the taxonomy of products, [indiscernible], doing a sort of content-based analysis and performing clustering and then building models for each cluster. And I thought this would be the most intuitive and the easiest thing to try out, and I must say that I lived in Finland for a while, and there are a few types of beer there, and typically, in Finland, people go to town, and they take beer with them. And most of the beers taste alike, so you would expect that they behave similarly. Of course, I didn't realize that once you move to Holland, where you have hundreds of different types of beer, Belgian beer and Dutch beer, the situation is a bit different. So there is beer that you have at Christmas, beer which is more popular on hot summer days, in the fall, etc., etc. So it's not just about the different glasses which you need to use for different kinds of beer, but it's also about different habits, what types of beer you drink and when. Therefore, we quickly realized that, okay, analyzing the descriptors of the items is of no help, and we need to analyze the behavior of the different products, see how similar they are, and based on this analysis try to build group models and see how much better we can do. And indeed, it gave some improvement, but still, it was an amazingly difficult predictive task. So what we ended up with: we tried to figure out when we can determine some seasonalities, when we can determine external events, discontinuation of products, new contracts, the appearance of a product which is complementary to some other products and so on. So it was an interesting study with quite a few lessons.
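A toy sketch of the reformulated question from the wholesale example: for each product, check on held-out history whether a learned lag-based predictor actually beats the plain moving average by a margin, and fall back to smoothing otherwise; the Ridge model, window sizes, margin and synthetic series are illustrative assumptions, not the company's data or the actual models used.

```python
# "Is this product predictable?" screen: compare a simple learned predictor
# against a plain moving-average baseline on held-out history and only deploy
# the learned model where it clearly wins.
import numpy as np
from sklearn.linear_model import Ridge

def moving_average_forecast(series, window=7):
    return np.array([np.mean(series[t - window:t]) for t in range(window, len(series))])

def lagged_model_forecast(series, window=7):
    X = np.array([series[t - window:t] for t in range(window, len(series))])
    y = series[window:]
    split = len(y) // 2
    model = Ridge().fit(X[:split], y[:split])     # train on the first half
    preds = np.full(len(y), np.nan)
    preds[split:] = model.predict(X[split:])      # forecast the second half
    return preds

def choose_predictor(series, window=7, margin=0.9):
    ma = moving_average_forecast(series, window)
    ml = lagged_model_forecast(series, window)
    y = np.asarray(series[window:], dtype=float)
    mask = ~np.isnan(ml)                           # evaluate on the held-out half only
    err_ma = np.mean(np.abs(ma[mask] - y[mask]))
    err_ml = np.mean(np.abs(ml[mask] - y[mask]))
    # keep the learned model only if it clearly beats the baseline
    return "learned model" if err_ml < margin * err_ma else "moving average"

rng = np.random.default_rng(3)
t = np.arange(400)
seasonal_product = 50 + 20 * np.sin(2 * np.pi * t / 30) + rng.normal(0, 3, t.size)
noise_product = 50 + rng.normal(0, 10, t.size)
print("seasonal product ->", choose_predictor(seasonal_product))
print("noisy product    ->", choose_predictor(noise_product))
```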
You can also think about distributed settings, for instance, distributed classification, where you have multiple hospitals, they perform diagnostics of certain diseases and they accumulate data. Quite often, they cannot share data with each other, but they can share models, so they can share some patterns, and then, for instance, the case of antibiotic resistance would be an example where you know that something is going on, something is happening, a change is occurring, so you need to update the model. Perhaps similar changes may occur at other hospitals, and you want to alert them upfront, the same way, like when you are on the highway and you see a traffic jam: if you suffer, you want to notify the people behind you to be prepared. So a similar mechanism can be analyzed for handling concept drift, and this is something that we've also been doing in the recent past. So think about -- well, actually, this is also a real example of how the weather pattern looked one day before I had to go to Dallas, so most of the flights were canceled, but luckily, there was a change and the day after it became much better. Anyway, the idea is that, again, we have distributed settings, we have multiple weather stations, and then whenever you have a cyclone with some warm weather or cold weather, when you know the direction of the wind and when you know about the connections, you can proactively anticipate changes of the weather in the related regions. So this idea can be considered in multiple settings: think about social networks or think about other settings where one object can affect other objects. There could be different kinds of relationships that can be learned from historical data. So we studied handling concept drift in peer-to-peer settings, where again we assume that there are multiple peers. Each tries to build a classification model, for instance, classifying into a positive and a negative class, and different concepts are being observed for each of the peers. And if you see how those change over time, you can see that there are actually some relationships between the peers. So, for instance, in this case, something that we observed for peer A would later on be followed in another peer, and this is quite a strong pattern over a long period of time. So the problem is how to analyze these dependencies, how to mine rules or episodes which will tell us how likely it is that a change in one peer would manifest a change in another peer with a certain delay and with a certain level of confidence. Once you can solve this task, you can think about building different proactive learning strategies, and in our case, we came up with an interesting ensemble approach where we would maintain two pools of classifiers: reactive classifiers, which monitor the peer itself, where things change, and a pool of proactive classifiers, which are based on the information that is acquired from other peers. And we came up with an efficient way to learn the dependencies between the peers. Another interesting aspect is how we can develop users' trust in what we do; this is generally a very important problem in recommendation tasks, retrieval tasks and predictive modeling tasks. So now we have yet an additional layer of complexity: we want to explain not only the predictive models, but we also want to explain the change, why there is a change, what a possible cause of the change might be, what we can or should do to the model in response to this change and so on.
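A small sketch of the dependency-mining idea from the peer-to-peer setting above: given detected change times per peer, estimate how often a change at one peer is followed by a change at another within a delay window; the support and confidence thresholds and the toy event logs are illustrative assumptions, not the actual method or data.

```python
# Mining lagged dependencies between peers: given detected change events per
# peer (as time indices), estimate the confidence that a change at peer A is
# followed by a change at peer B within `max_delay` steps. Pairs with enough
# support and confidence could then trigger proactive adaptation at B.
from itertools import permutations

def lagged_dependencies(change_events, max_delay=10, min_support=3, min_conf=0.6):
    """change_events: dict peer -> sorted list of change time indices."""
    rules = []
    for a, b in permutations(change_events, 2):
        followed = 0
        for t in change_events[a]:
            if any(t < s <= t + max_delay for s in change_events[b]):
                followed += 1
        support = len(change_events[a])
        if support >= min_support and followed / support >= min_conf:
            rules.append((a, b, followed / support))
    return rules

# Illustrative change logs: peer B tends to change a few steps after peer A.
events = {
    "peer_A": [20, 60, 110, 170, 240],
    "peer_B": [25, 64, 116, 176, 300],
    "peer_C": [40, 150],
}
for a, b, conf in lagged_dependencies(events):
    print(f"change at {a} is followed by change at {b} within 10 steps "
          f"(confidence {conf:.2f})")
```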
So here, I would like to give you an example of stress analytics to tell you what interesting aspects -- 10 minutes.

>>: [Indiscernible].

>> Mykola Pechenizkiy: So I'll go through this case very quickly and then try to conclude. Think about the basic notion of stress. There are different stress factors, like, for instance, giving a talk, and we have our sympathetic nervous system, which is sensitive to these stress factors, and our body can react in different ways: the heart rate may increase, sweat production may increase, and other things. And then we can build devices, similar to these watch-like devices, to monitor these physiological signals, and once you detect a change, you beep and alert the person that he or she is stressed. Well, maybe it's not just about online alerting but about accumulating this data and making people aware of how much they are stressed, during what times of day, during what kinds of activities and so on. And this was an idea we explored from different perspectives, so we've been trying to build a framework which would allow us to accumulate information about different stressors, about typical activities that people perform during the day, so all kinds of agenda or calendar data, and then link this data to the actual stress patterns that were detected. Most of the work was related to the analysis of galvanic skin response data, so we monitor increased levels of sweat production, but we also played with the analysis of voice signals and with trying to mix these few things. We also tried to link this with information from people's personal correspondence, like linking their Twitter account, Facebook account, e-mail account, to see how much positive and negative stuff is being said or how much positive and negative stuff you get from colleagues or from family and so on. And then, once a sufficient amount of data is collected, you can also do some interesting pattern mining to analyze what kinds of activities bring the most stress and the least stress, but I must say, this is not the part where we are. We don't have that much data to do this part yet. But what we have been doing is trying to analyze the different patterns of stress and how those can be detected. Typically, what you can see or expect to see is the following pattern: first, you're in a normal state, then there is a quick arousal, then there is this aroused state, and then there is a relaxation period and people are back to normal.

>>: Is it normal to be -- normative or normal?

>> Mykola Pechenizkiy: This is a good question. It might take me more than 10 minutes, but to give a very short answer: I thought that a lot is known about stress, and that I would be able to contribute, from a computer science point of view, ways to analyze the patterns. But the more I read about stress studies, the more I understood that, actually, not that much is known about stress. And, in my opinion, what is normal is a very good question. I must have a slide on this, too. So in practice, the detection of change is somewhat different, so you can have all kinds of noise in the data, which needs to be processed. There can be all kinds of external factors, like people doing exercise. There can be a loose contact with the device and all the other things which bring, again, interesting and additional things to consider. So here, interpretation of the signal is extremely difficult; here again, we don't know the ground truth, and it's very hard to obtain it, even in controlled settings. So we need to find out whether the person is stressed or not, how much stress there is, what the evidence for that would be and so on.

>>: Do you at least have metadata for this? Do you have metadata in this?

>> Mykola Pechenizkiy: For some...

>>: Like eating, sleeping.

>> Mykola Pechenizkiy: For some, we do. So we conducted some pilot studies. What I personally did, I ran some experiments with students during exams, and then I was asking them to report, of course in an anonymous way, which questions they thought they knew the answers to, which they didn't, and this kind of stuff. Then, in a couple of courses, we had presentations of project results, and what I was doing, I was wearing the device and I asked the people presenting the projects to wear the device, to see how much people are in the aroused state when they present the project to me, or when they present it to their peers, to other students, or when they listen to other presentations and things like that.
So now, in Eindhoven, there are also studies with schoolteachers, so there is some interesting data. We also conducted really controlled experiments with a small group of people. There were 12 people, and we were actually running different tests to stress them out and then see the reaction and see how this would be reflected in GSR and in voice and so on. We also formulated this as a classification task, and we were trying to find out what kind of other data we can collect such that we can disambiguate different cases, like whether it's about physical exercise or real stress, what would be a possible reason for the stress, and so on. So, again, in many cases, you think about how to study the phenomenon versus how to come closer to an actual application such that you can still use a small device and then use, for instance, accelerometer data to recognize activities, and once you know the activities, you can disambiguate better. So, as also mentioned, we analyzed GSR and speech, with some interesting results. Apparently, speech is much more discriminative. It's much more predictive, but it's also much more person dependent, so you need to fine-tune your models to the particular people you're interested in. But the million-dollar question you raised: what is normal? Or, actually, what is stress? Is it good or is it bad? When we talk about acute stress, is it because someone is enthusiastic, excited, or is it because someone is in trouble? The same about relaxation: is it about someone being overworked, or is everything actually going for the better? And what is the normal state? Is someone in a coma, or trying to relax, or just in a normal state? And then, in many cases, again, the question is how to get the right metadata, how to get the right interpretation and how to link it to different contexts. And this is, again, a very interesting task, where change detection, predictive analytics and context awareness come into play such that we can better understand the problem, better understand the labels, better understand the reasons, and explain things better. So there are also other domains where concept drift has been studied recently, but I will skip this part and jump to the conclusions. If any one of you is interested in learning a bit more about the area of process mining and how concept drift can be considered in that case, just talk to me and I will tell you more about this part. So a few summary slides. We observed quite a few cases, just a few examples, where handling concept drift is important or even essential in predictive analytics. We overviewed a few types of strategies for how concept drift can be handled, or at least the mainstream approaches, what people study, and we looked into a variety of application settings which, on the one hand, bring more challenges. So we need to come up with new techniques which actually correspond to the application settings: for instance, we can think of active learning strategies to cope with concept drift when labels are not immediately available but we can request them from the user. But on the other hand, we also need to understand what kind of interesting opportunities we have just by relaxing the basic assumptions behind concept drift research and coming closer to real application settings, where much more data is available and things repeat over time and so on. Therefore, in my opinion, in the near future we will see some transformation of research in handling concept drift. The current focus is on blind adaptivity: detect change, adapt the models.
In the future, we will try to look into ways of recognizing and using similar situations, how to understand change better, how to add transparency to the change detection mechanisms and the adaptation mechanisms as well, and I think we will also continue to work on this reference framework, which would give an opportunity for different researchers working on concept drift problems to see what actual problems they address and do not address; and for people working on applications where concept drift matters, we can also show what kinds of approaches are applicable, what the benefits and drawbacks of the different types of techniques were, and perhaps develop some interesting reference points and standards for this part. And if I were to focus on just the major challenges or the major next steps to be taken, in my opinion, it's really, really crucial to go into real life, into practice, and to go from experiments and simulated data to real application settings, just to see if adaptive applications indeed benefit from concept drift detection and what the actual situation is. And to be able to do that, in many, many cases, we do need to improve usability and trust such that the domain experts can understand what we are trying to achieve with change detection, what we're trying to achieve with model updates, how transparent they are, how this can be visualized, explained, connected to business logic and the like. So I would like to stop here and say a couple of words of acknowledgment and thanks to the many colleagues and students who contributed to the many case studies and the development of the techniques behind this talk. Especially, I would like to acknowledge the work of two colleagues, Indre Zliobaite and Joao Gama, with whom we are preparing survey papers and analyses of different application studies and came up with interesting approaches which address application settings. And, of course, it's always nice to hear constructive criticism from reviewers. I cannot name them, because they are anonymous. And I am really glad that we had collaboration with many industry partners to get real data and see what is important for real application settings. Thank you.

>>: Thank you.

>> Susan Dumais: We had some questions during the talk, but we have time for others.

>>: So you kind of gave this characterization of different spaces of concept drift, but do you have -- it would be nice to see a list of actual problem domains and applications that people are trying to solve within the concept drift problem. Do you have that kind of a listing, as well as sort of concrete problems where concept drift happens?

>> Mykola Pechenizkiy: Right, so a very concrete problem is the monitoring of industrial processes: you have sensor data which tells you about the quality of the output, so you need to monitor how good your models and the outputs of those models are, and there are different reasons why they become out of date. And actually, because of the severity of this problem, many people started to look into the domain of soft sensors, like how to build adaptive sensors which would monitor the stream and adapt to the new settings of the stream. So that is one very big area where it is recognized to be a very serious problem and there is lots of work related to it. You can also think about several real applications related to control systems; think about the problem of anesthesia control.
So you have a patient and you need to decide whether you need to inject more or less medicine, such that the person stays between the bounds, not awake and not too deeply asleep. Typically this is done with controllers, and now people are looking at how to optimize it further by detecting interpersonal and intrapersonal differences and then making better adjustments. And if you think about the problem of personalization, this is a problem where people have been emphasizing that concept drift matters, already for, I think, 15 years, but if I look at what has been done in this area, it's not that much.

>>: I agree sort of with your earlier characterization that people often talked about it mattering, but actually having the proof in data, so for example, preferences, do preferences actually drift that much, or is it the availability of products that is different, right? And so you're emphasizing --

>>: Or changes in population.

>>: Do people actually change all that much? It's not clear, for a lot of these cases, whether the distribution shift is different from the actual concept changing, and both happen, so coming back, I think, to the example where Susan was asking about things changing category, one of the examples is e-mail foldering. My view of the world sometimes changes, and I go back to something that I put in one folder, and I put it somewhere else. I've actually changed the concept, and everything else that is automatically categorized is now up in the air. How would we actually reflect on that? How would we show that for the user? That's very different in that kind of case than, I think, in others, and I think you can have cases where there are concrete examples of that concept changing; that to me seems like a challenge.

>> Susan Dumais: I mean, you've talked about this as concept drift, but there's a lot that could be going on. It's actually interesting.

>> Mykola Pechenizkiy: What you refer to is actually -- is the separation of real concept drift and virtual concept drift, but also what I was trying to emphasize a few slides before that is that you can think about different reasons why things change or why the decision boundary changes. So in the case of spam classification, okay, preferences of people might not change, but because of the adversarial activities of spammers, you need to detect that your current models are not good anymore, or that, I don't know, maybe you had lots of spam about some rich people trying to give your money away, and then you'll get more Viagra spam and more about some other types of products. And then, just because you have this population shift, your decision boundary changes, and you need to detect it and learn it early on. So depending on what kind of change you anticipate, you would monitor either the input -- if it is just about population shift, it is enough to monitor the input space. However, if it is about a change in the real decision boundary, you need to get labels, or proxies for the labels, and then see whether your classification accuracy deteriorates or not, because otherwise you cannot capture it. And then there could be a few more distinct points, so this was one of the points I was trying to emphasize: the variety of applications and reasons why things change over time can be quite different, and we need to come up with different strategies to address them. But your original question was like, name an application where it is proven --

>>: Not name an application.
I was just saying it would be an interesting contribution -- not just the ontology of categorizing it, but here's a listing of 50, 60 applications and how they map into that space.

>> Susan Dumais: Or which ones work well -- that would be really, really fun.

>> Mykola Pechenizkiy: This was our original idea, so this is something we were trying to do. We've been analyzing lots of papers which were of an applied nature and were trying to emphasize what kind of interesting technique they developed and for what purposes. And so we've been trying to list all kinds of application areas, all kinds of concrete tasks and what kinds of techniques people were using. But when we looked closer, you realize that, in most of the cases, you don't have strong evidence that a particular approach did work well in practice, simply because people played either with simulated data or with some benchmarks and artificially introduced drift, and then we refocused a bit. But, indeed, this is an interesting aspect. This is something that we typically try to do when we perform a case study: we always try to quantify what the effect of not capturing concept drift is. So you have a predictive model, and what would be the effect if you didn't detect the change point, or detected it too late, or detected it inaccurately? Then you can see what the actual effect of this misdetection or false alarm is on the behavior of the system, or what the tradeoff is: what happens if many more false alarms are triggered, so you rerun the model, and what is the corresponding effect on the final performance? But I would say that mainstream research and reality in the concept drift area are a bit disconnected. There are lots of interesting ideas proposed. They were tested on benchmarks, but for some of them, we don't know yet whether they really work in practical settings, and there are a few reasons for that, some of which I was trying to list. So there should be a much more systemic view of the problem: we have predictive analytics tasks, and concept drift is just one of the ingredients. And then you can tell how important it is and how to quantify these effects. So does that answer it, or does it relate?

>>: Are there some examples from industry where concept drift is accounted for and effective?

>> Mykola Pechenizkiy: Actually, there are. You also need to realize that this is quite a large problem space, and in many research areas, the same concept was called slightly differently. So, for instance, temporal dynamics is known to be important, and it has been studied in different areas, including information retrieval, recommender systems and industrial applications. For instance, in the Netflix competition, you can say lots of interesting things about it, but nevertheless, one of the interesting aspects is capturing this temporal dynamics, like understanding how users' rating scales shift, understanding how ratings change over time or how co-ratings change over time, and many, many other effects. And then, if you take them into account, yes, you can do better. Again, the question is whether you need to have an explicit change-detection mechanism or it is enough to build evolving models which just update incrementally. So this is another aspect of it. And if you look into the DARPA challenge, to me, this is a very good example of concept drift research.
Even though they don't name it explicitly this way, this is a very concrete example of how you build models which evolve over time.

>> Susan Dumais: And then financial markets, the whole financial industry. Do they predict change, or do they just continuously change the model?

>> Mykola Pechenizkiy: I have strong negative opinions about --

>> Susan Dumais: I see. On that note. Thanks again.