>> Kaushik Chakrabarti: Professor Keogh is an associate professor in the
computer science and engineering department at the University of California, Riverside. His research interests are in the areas of machine learning and
information retrieval, although his special focus is in the area of mining and
searching large time series datasets.
And he's actually a really well-known name in the time series indexing area.
Actually, if you enter "time series indexing" in one of the most popular search
engines, it actually suggests Professor Keogh's name as one of the query
suggestions.
He has authored over 100 papers and he has several best paper awards in
SIGMOD, ICDM and KDD. Today he is going to talk about a set of primitives for
mining time series datasets. So without further ado, it is all yours, Eamonn.
>> Eamonn Keogh: Thank you for attending the talk. I like kind of controversial
talks or strange claims. So here's one. I'm going to claim that for mining time
series datasets that these three tools, shapelets, motifs and discords, are all you
need, that it kind of subsumes everything else out there. And if you can do these
things properly, everything else is going to be easy. So hopefully you will either
believe that or not by the end, but that's the claim.
Here is an outline of the talk. I'm going to talk about what are motifs, discords
and shapelets. I'm going to kind of gloss over how you find them efficiently,
which is actually, of course, very important for massive datasets. I'm not going to show
any data structures or algorithms. I'm going to basically try to convince you they are
useful through lots of case studies, and then, once you believe they are useful,
hopefully you will think about how to find them more efficiently.
As I mentioned, a lot of this work I conduct with my student Mueen, who is
here today.
So, again, the disclaimer actually is that -- I'm not going to talk about the
algorithms, data structures, notation very much. It is really to try to convince you
that these things actually are very, very useful.
So just briefly on the subject of the ubiquity of time series, to convince you that
time series is actually a useful problem to work on. You probably already believe this,
because time series is everywhere: in finance, in query logs, even in video in
some sense time series can be pulled out. And
even things that are not typical time series, like handwriting or music, can be kind
of massaged into time series, as we shall see.
So time series is ubiquitous. We are going to be able to mine it. How are we
going to do that? So, actually, here's the only page of notation. It is very trivial.
And the important fact is that for these time series, which of course can be very
massive, billions of data points, we are typically not interested in the global
properties. We don't really care about the maximum or the minimum or the
average of the entire data set. Almost always, what we are interested in is small
subsections. These subsections can be extracted by a sliding window. You
simply pick a length, say 5 seconds, and you slide it across, and you can pull
out the entire set of subsections.
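To make the sliding-window idea concrete, here is a minimal Python sketch of extracting all fixed-length subsections from a series; the z-normalization step is my assumption (it is standard practice in this line of work, but it is not spelled out in the talk):

```python
import numpy as np

def sliding_subsequences(ts, window):
    """Extract every subsequence of length `window` from a 1-D series,
    z-normalizing each one (normalization is an assumption, not stated in the talk)."""
    ts = np.asarray(ts, dtype=float)
    subs = []
    for start in range(len(ts) - window + 1):
        s = ts[start:start + window]
        std = s.std()
        s = (s - s.mean()) / std if std > 0 else s - s.mean()
        subs.append(s)
    return np.array(subs)
```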
And as we'll see in a moment, motifs, discords and shapelets are nothing but
subsections with special properties. And special properties actually make them
very interesting and useful, again as we'll see in a moment.
So let's jump right into the first example which is time series motifs. What are
these time series motifs? If we look at this industrial data set here, a question
you could ask is: Do the patterns repeat themselves ever? Let's say at the
length of this pink bar here. So do I find a pattern here that repeats, say, here
and so on and so forth?
Even in a small data set, it is hard to do this by eye, but here's actually the best
answer. So it happens that this thing here repeats approximately right here. If
you look at the zoom-in here, they are not identical, of course. That would be too
easy to solve. But they are very, very similar. You kind of have to believe that
something caused it to be that similar. Maybe the same mechanical set of valves
opening and closing at the right times produced a signal here which repeats.
Okay. So you can find them. What use are they? I will tell you some examples
in a moment. But you can kind of guess the utility of this. If you find this motif
and you notice that after you see it the first time, within five minutes the boiler
explodes, then you actually have a warning system that in the future, if you see
this again, you can actually sound an alarm and be ready for an explosion or
whatever it is.
So these are some case studies in motif discovery. These are all from the last
year or so. The first actually came from these guys in Harvard. They came to us
and they want to build a dictionary of all possible brain waves. People have been
trying this for the last, I think, 40 years but always by hand, by having doctors
look at these things and pull these things out using their knowledge. But these guys want to
do it totally black box. Dump in all the data and have a dictionary pop
out.
And the problem is that for just one hour of EEG in one trace, it takes about 24
hours with fast code to find the answer. And, of course, they haven't got just one
hour; they have eight hours from a person. They haven't got just one trace but maybe
128 from a skull cap. And they have thousands of patients and so on and so
forth. So there is no way to do this using even fast brute force search.
This gives you an idea what the data looks like. This is one hour of data, and
here's a tenfold zoom in, tenfold zoom in, tenfold zoom in. So they are really
interested in data at this kind of scale. This is where things basically happen. At
the longest scales, it is kind of irrelevant what happens; it's the local
patterns that we're interested in.
Okay. So here's the one hour of data. And there are 5 billion possible pairs you
could actually take randomly from this, compare them and see how similar they
are. And of those, the most similar pair is the motif. It is the most similar
repeated pattern in this entire data set.
And as it happens, this is the answer right here. Actually, looks a bit like a
square root sign.
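As a rough sketch of the definition (not the fast algorithm mentioned later), the motif is just the closest pair among all subsequences; this quadratic version is only feasible for small data, which is exactly the scalability problem described here:

```python
import numpy as np

def brute_force_motif(ts, window):
    """Return (distance, i, j): the closest pair of non-overlapping
    subsequences under Euclidean distance, i.e. the top motif.
    O(n^2) pairs, so this is a definition sketch, not the fast method."""
    subs = sliding_subsequences(ts, window)        # from the earlier sketch
    best = (float("inf"), -1, -1)
    for i in range(len(subs)):
        for j in range(i + window, len(subs)):     # skip trivially overlapping matches
            d = np.linalg.norm(subs[i] - subs[j])
            if d < best[0]:
                best = (d, i, j)
    return best
```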
So what is this good for? How interesting is this? Well, first of all, of course,
there has to be some motif in the data set by definition. Something will pop out.
But does it actually have any meaning? And so one thing is it is kind of
suggestive. If we search for these patterns, we find some similar examples in the
data again and in other data sets.
So it kind of suggests that it potentially has some meaning. And more interestingly,
if you look at the literature, we actually find this is a known pattern. This is
actually from a recent paper, and people have actually known about this before.
It has a medical name. It has a medical cause and so on and so forth. So we
basically discovered this automatically without any domain knowledge.
So what are motifs good for in general? What would you use them for? Here's
kind of a trivial example that's kind of interesting. Look at some text here. And
each individual word is actually a standard English word, nothing exciting,
interesting about this text in general. Imagine this is actually a time series
stream.
What you find actually is that, for example, in the green section of the text, some
letters are underrepresented. So here's the expected frequency and the observed
frequency of T, for example, and they're pretty close. But here the expected
frequency of E is about 13% and the observed frequency is exactly 0. So E
never appears in this case, which is actually very surprising.
So it is underrepresented.
And then in the pink text, here's the expected frequency of Z and the observed
frequency of Z, and of course Z is much more common than you would expect
by chance, so it is overrepresented. This is easy to do for discrete strings such as DNA and English
text. But now, with motifs, we can actually do this for time series.
So the previous example I showed you, I can actually see where it occurs in this
dataset. And the two original examples are here and here, and these other
check marks are other similar examples.
And what you actually see is that they happen pretty much at uniform frequency,
but occasionally you'll find they don't appear here at all. I call this a motif
vacuum. So for some reason, this motif doesn't appear here, and it is kind of
suggestive that maybe at this time something unusual happened in the brain.
In a few moments I will show you another example actually on a similar dataset
but with different doctors that's actually even more telling than this one here.
So here's one example of things you can do with motifs. You can find anomalous
time periods, either by the overabundance of a pattern or by the non-existence of
a pattern.
So here's another fun case study with insects. So this guy actually here, the beet
leafhopper, is a very interesting insect. It is basically a vegetarian mosquito, if
you'd like, right? So instead of attacking animals and humans, it actually
attacks plants. It sticks in its stylet, and it sucks out nutrients from the plant.
And that by itself actually isn't particularly harmful for the plant. That's fine. But
the problem is if one plant has a disease and this guy is going from plant to plant,
then very rapidly all your plants have a disease. And this causes about $400
million of damage in California each year. It is actually a very nasty insect for
that reason.
So the good news actually is that we can wire this guy up, right? Well, what we
actually do, or what entomologists do, is they attach a small gold wire to its back.
complete the circuit into the ground literally, or into the plant, and they measure
the voltage change, or whatever it is, over time. So now we have a time series
for this insect as it does its behavior.
So the good news, we can wire this guy up. But the bad news is that that is
really, really nasty and messy. So here's some examples. It is very difficult to
see any kind of structure. It looks pretty much random and nasty. So how are
we actually going to explore this data and understand what's going on?
Well, the answer you might guess is motifs. So here's one small subsection of
14 minutes. We look for motifs in this dataset, and we find one here and here
which correspond to this and this.
And, once again, when you see this you can't believe this is a coincidence, right?
This is so similar, it surely must mean something. As it happens, the beauty of
this actually is that we do have video which we can index back into. So we can
go back into the video at this time period and this time period and see what
actually happens.
And what turns out is that this actually corresponds to the moment the stylet is
actually injected into the plant. This is what that pattern actually looks like.
And the second example here, which you see is much more complicated and
noisy but, nevertheless, it is the second best motif, if you'd like, in this case,
actually what happens is the insect builds up some honeydew at its rear, and
the honeydew eventually forms a bridge between the insect and the
plant, which changes the circuit; then the bead breaks off and we go back down to zero
here.
So the cool thing is, we can actually take tens or hundreds of millions of time
series data points and we can summarize them into a few nice events. So prior
to this, basically, we had a grad student looking at these videos hour after hour
after hour with a notebook looking for interesting behavior, which is not very
scalable. With this, at least we can say "go look at these time periods, something
appears to be happening here which is actually prototypical or interesting," or
what have you.
So one thing you could do with this -- we actually haven't done this, but this is
something we're interested in -- is the following, which is if you find these motifs,
you can simply give them different discrete labels. This one I'll call A;
this one I'll call B. I might have a C pattern, a D pattern, and so on and so forth. And
now you can convert these incredibly massive, noisy time series datasets into a
database of discrete strings: B, B, C, A, B, C, C and so on and so forth.
And once you have these, you can actually now search for patterns with kind of
classic algorithms for DNA or for other kinds of discrete symbols. So, for
example, you might find that if you see a B followed by another B, then the next
thing you are going to find is going to be a C with very, very high probability.
Again, we actually haven't quite got this far yet, but this is actually next on our
list.
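A minimal sketch of that labeling idea: once motif occurrences have been mapped to letters, simple bigram counts give the kind of "a B followed by a B is usually followed by a C" statistics described above (the labels and probabilities here are purely illustrative):

```python
from collections import Counter

def transition_probabilities(labels):
    """Estimate P(next label | current label) from a sequence of motif labels
    such as ['B', 'B', 'C', 'A', 'B', 'C', 'C'] using bigram counts."""
    pairs = Counter(zip(labels, labels[1:]))
    firsts = Counter(labels[:-1])
    return {(a, b): n / firsts[a] for (a, b), n in pairs.items()}

# Example (illustrative labels only):
# transition_probabilities(list("BBCABCC"))[("B", "C")]  -> estimated P(C follows B)
```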
So one more example for motifs. This is actually again from another experiment
with EEGs although with a different set of doctors. This is actually from
University of California-San Diego.
So these doctors have lots of traces, very high dimensional, very, very nasty.
And they have the following problem, which is they're showing people pictures
actually much more complicated than this of surveillance -- satellite surveillance.
And occasionally there will be an airplane in the picture. And when the person
sees that, they're supposed to click a little button.
And what you actually want to do essentially is read their mind and figure out
when they see an airplane. So here's the basic experimental setup. So they've
actually tried this with various hard-coded rules. The question is, could you do
this automatically? And so what we actually do is we look for motifs. So we find
some motifs in this dataset. Here's one example.
And we ask ourselves, do these motifs actually appear with different frequency
depending upon what the person is actually seeing? And so what we are going
to do actually is plot the time series and every time we see one of these, we are
going to put down a green dot.
When we do that, here's what we see. These are basically independent traces, if you'd
like. This is time moving forward, and at this point, they are shown the stimulus. And
look at what we see.
So normally these green dots are pretty much uniformly distributed, but right after
they see the stimulus, at the right latency, we see an incredible burst of
these motifs.
And then shortly after that, which is harder to see, there is almost a vacuum of
motifs. They don't actually appear here again for a while, and then they settle
back into normal frequency again.
The doctors were very impressed with this, because they could actually do this
before, but only with hard-coded rules that doctors had worked out by looking at
the data backwards and forwards. This, of course, is totally black box. They
simply look for motifs that are correlated with the actual class itself.
So here's one final example of motifs. Just to give you an idea of scalability that
we're interested in, we have 40 million small thumbnail images. Here's some
examples here. And we want to find near duplicates. Exact duplicates are actually
easy to find with hashing. The problem is to find near duplicates.
And so, of course, these are not time series, but we actually can simply convert
the color images to red, green, blue histograms which are, basically, pseudo time
series so that we can look for them.
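A minimal sketch of that conversion, assuming the three histograms are simply concatenated and normalized (the talk only says the images become red, green, blue histograms; the bin count is also my assumption):

```python
import numpy as np

def rgb_pseudo_time_series(image, bins=64):
    """Convert an H x W x 3 uint8 image into a pseudo time series by
    concatenating normalized red, green and blue histograms.
    Bin count and concatenation order are assumptions."""
    parts = []
    for channel in range(3):
        hist, _ = np.histogram(image[:, :, channel], bins=bins, range=(0, 256))
        parts.append(hist / max(hist.sum(), 1))   # normalize for images of different sizes
    return np.concatenate(parts)
```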
And, of course, as you might guess, 40 million things are not going to live in main
memory. They are going to live on disk, and that is a problem, because the
brute force algorithm would be quadratic. And quadratic, when the data lives on
disk, means a lot of thrashing backwards and forwards. This would actually take a
long time to do naively. Actually, if you do this naively with the disk thrashing, we
are looking at a couple hundred years of time to solve this problem.
With our algorithm, we can actually solve this. So here are the answers. Here are the
repeated patterns. And actually they're not identical in any case. So if you look
at the dog here, it has a little red dot here which doesn't appear here; the
[inaudible] here has something lit up here which is not lit up here; and so on,
small differences. So these are near duplicates, not exact duplicates, which
we could find. And, amazingly, we could actually find these, I think, in a little bit
over 24 hours, so not 200 years but in a day.
So the last idea we'll use motifs for is what we call motif joins. In the past, you
can imagine we had one series and the motifs can come from anywhere. But
suppose I simply divided that into two halves and I say one must come from
here, one must come from here. That makes logical sense. I can do that.
What are they good for? So imagine you're in NASA, for example, and you have
some rocket telemetry and five years ago this rocket exploded or crashed and
then just recently this rocket exploded or crashed. You might want to ask
yourselves, what is the common thing between these two things? Maybe it has
something to do with the anomaly. So you do a motif join and you find that this
pattern appears in both of the crash ones. You test to see if it appears in the
non-crash ones and so on and so forth.
So a motif join can be quite useful. One question is, would it be scalable
because these datasets will be quite large. Let's see how scalable motif joins
could be.
So, first of all, let me mention something, is you can convert DNA into a time
series. It is actually a [inaudible] transform. The idea would simply be that you
walk across the DNA and if you see a G, you go up one unit. If you see an A,
you step up two units. If you see a C, you step down one unit and so on and so
forth.
So you simply walk across the DNA producing the time series.
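A minimal sketch of that walk; the steps for G, A and C follow the talk, while the step for T is my assumption (the speaker trails off with "and so on"):

```python
def dna_to_time_series(dna):
    """Walk across a DNA string, stepping up or down per base, to produce a time series."""
    steps = {"G": +1, "A": +2, "C": -1, "T": -2}   # the value for T is assumed
    level, series = 0, []
    for base in dna.upper():
        level += steps.get(base, 0)                # ignore ambiguous symbols such as N
        series.append(level)
    return series
```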
Okay. So here's two primates. Notice they actually have the same kind of
hairline, which is actually interesting. And we can convert their DNA into time
series. If we do that, we actually have 3 billion time series here, 3 billion time
series here. So a lot of data.
And we notice actually that the human has 23 chromosomes, the monkey has
21. So somewhere in history either we gained two or they lost two or some other
combination actually separates us. What that means is, when we do an alignment or
a join here, we can't expect a straight line, which we'd have from human to
human, right?
Somehow there must be kind of a non-linear join in this. We're actually going to
find that. What we're going to do is take a small sliding window of length 1,024,
slide it across here and across here, these two very large datasets, and find
the pair that joins the best.
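A sketch of the motif-join definition under those sliding windows; this is only the brute-force form of the idea, nothing like the scalable method actually used on the 3-billion-point genomes:

```python
import numpy as np

def brute_force_motif_join(subs_a, subs_b, k=5):
    """Return the k best-matching pairs, taking one subsequence from each dataset.
    `subs_a` and `subs_b` are arrays from the sliding-window sketch above."""
    pairs = []
    for i, a in enumerate(subs_a):
        for j, b in enumerate(subs_b):
            pairs.append((float(np.linalg.norm(a - b)), i, j))
    pairs.sort()
    return pairs[:k]   # (distance, index in A, index in B), best first
```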
So if we do that, where does it appear? The answer is it appears right here
which is not very interesting to see at this scale. Let me just take this section
here and zoom in on it. And I added a few extra points, so the second join, the
third join, the fourth join up to a few thousand.
What you actually see here, of course, is that it looks like, as is almost certainly
the case, that human chromosome 2 is actually composed of monkey
chromosomes 12 and 13. And you can also see, for example, that all of 12 maps
to 2, but for 13 there are actually two little gaps here, which presumably appear
somewhere else in the DNA. You have to actually go and look for those separately.
The point here is actually simply to show you the scalability of this. If you can join two
datasets of size 3 billion and 3 billion, you can probably solve problems at
industrial scale.
So, again, I'm not going to talk about algorithms very much, but how long
does this all take? Naively, it would take quadratic time to compare all to all,
and that, for any non-trivial dataset, would be very, very nasty. And for that reason,
there are dozens of researchers that have solved this problem approximately,
typically in n log n time with very high constants, but only for approximate
answers. So the answer they give you is good but not optimal.
Recently, Mueen, one of my students here, has come up with a beautiful exact
algorithm which is actually incredibly fast. And to give you an idea how fast it is:
for the EEG dataset, those guys were very smart computer scientists and Ph.D.s
and medical doctors, and they could solve one hour of data in about 24 hours.
We can actually do one hour of data in about two minutes.
And, again, for the other example with 40 million time series which live on disk,
we can solve this in -- I say hours. Actually probably tens of hours. Let's say a
day or two, so actually really scalable enough to handle these massive datasets.
So a quick summary of motifs and I will go on to the other examples. We can
find motifs now in very large datasets, and they have some potentially very
interesting things we can do with them. We can monitor the frequency of these
motifs in a data stream to do anomaly detection. And we can even sound an alarm if
we don't see a pattern, right? If I don't see this pattern for five minutes, that's
unusual. I can sound an alarm.
Usually anomaly detection only works when you see something.
This actually could work when you don't see something, which is kind of
interesting. And there are a few things we can still do with this, like finding motifs in
streams, which is basically future work.
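A minimal sketch of that alarm idea, assuming we already have a motif template, a match threshold, and a five-minute timeout (none of these specifics are given in the talk; all names here are illustrative):

```python
import time
import numpy as np

def monitor_motif_absence(sample_stream, motif, threshold, timeout_s=300):
    """Watch a stream of samples and warn if nothing close to the motif
    has been seen for `timeout_s` seconds."""
    window = len(motif)
    buffer = []
    last_seen = time.time()
    for sample in sample_stream:
        buffer.append(float(sample))
        if len(buffer) > window:
            buffer.pop(0)
        if len(buffer) == window and np.linalg.norm(np.asarray(buffer) - motif) <= threshold:
            last_seen = time.time()
        if time.time() - last_seen > timeout_s:
            print("ALARM: motif not seen for", timeout_s, "seconds")
            last_seen = time.time()   # reset so the alarm does not fire on every sample
```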
Okay. Shift gears very slightly and talk about something different, which is
shapelets. Actually, shapelets are basically supervised motifs, as we'll see. So
I'm going to show you this actually in the shape domain, but it works really for
time series, as we'll see in a moment.
So here we have two different classes of shapes, stinging nettles and false
nettles. Let's say you want to classify these, tell them apart. One problem
actually is they look very, very similar at a global scale, and the problem is that
they also can have insect damage, like you have here. So any kind of global
measure tends to work very, very badly.
So the idea of shapelets actually is to say, let's ignore the global measures. Let's
zoom in and find local patterns that might tell these things apart. So how are we
going to do this?
First of all, we take the shape and convert it to a time series. There are many
ways of doing that. We have a way of doing it, but it doesn't really matter too
much. The point actually is that this is a one-to-one mapping, and I can go back from this to
the original shape if I wanted to.
And, again, this is actually a global pattern for a leaf now; but small subsections
of it, like the subsection here, might be all it takes to distinguish these two
classes.
So I'm actually going to look for all possible subsequences to find the best such
pattern. And it happens to be, in fact, this one right here. What you actually see
is that for false nettles, the pattern looks like this. But the closest possible
pattern in the true nettles actually looks radically different. And the reason now is
obvious. For true nettles, the leaf joins at 90 degrees, essentially 90 degrees.
But for the false nettles, the angle is much shallower. And once you know the rule, if
you look back, you see, oh yeah, that kind of makes sense.
So you are going to use this fact to make a decision tree very easily, right? The
decision tree works like this. You simply get a new leaf to classify. You find all
the subsequences of the right length and you compare to this one here. If one of
those shapelets -- if one of those subsequences is less than 5.1 from this, you
say it is a false nettle. Otherwise, you say it is a true nettle. And actually as it
happens, of course, in this case it is very robust to leaf bites, especially around
here.
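A minimal sketch of that decision rule, using the 5.1 split point from the slide; the subsequence distance is plain Euclidean distance, which matches the measure discussed later in the talk:

```python
import numpy as np

def distance_to_shapelet(ts, shapelet):
    """Minimum Euclidean distance between the shapelet and any same-length subsequence of ts."""
    ts, shapelet = np.asarray(ts, float), np.asarray(shapelet, float)
    m = len(shapelet)
    return min(np.linalg.norm(ts[i:i + m] - shapelet) for i in range(len(ts) - m + 1))

def classify_nettle(ts, shapelet, split=5.1):
    """The rule described in the talk: closer than the split point means false nettle."""
    return "false nettle" if distance_to_shapelet(ts, shapelet) < split else "stinging nettle"
```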
One cool thing about this actually is not only can it classify very accurately in
many cases, but unlike other classifiers, it tells you why, right? So we get
accurate results here, but we can also go back and brush this on to the shape
and say, the reason why these things are different has something to do with
whatever happens around here. And we actually figure out why the difference
exists, which is very useful in some domains.
So briefly, how do we actually decide which shapelet is the one to use? So for
this subsequence here, I have to test every possible subsequence from
everywhere of every possible length from tiny to very, very long, from every
single shape in my database.
So how do I actually choose this particular subsequence in the shapelet? Well,
for every subsequence candidate, what I do is I put it here and I sort all the
objects in my database based upon the distance to that candidate. And what I
hope to find is that on my number line, all of one class -- see, the blue class is on
this side. All of the red class is on this side, and I can separate them with a clean
split point here.
In this example actually, I have a pretty good example, not a perfect example.
One thing here actually is out of order. Maybe a different shapelet would actually
pull this blue thing back here, this red thing back here and I would have a perfect
separation.
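As a sketch of how one might score a candidate (the talk does not name the exact quality measure; information gain over the sorted number line is one natural choice that matches the clean-split intuition):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_split_gain(distances, labels):
    """Sort the objects along the number line of distances to a candidate
    shapelet and return the information gain of the best split point.
    A candidate that cleanly separates the classes scores highest."""
    order = np.argsort(distances)
    y = np.asarray(labels)[order]
    base, best = entropy(y), 0.0
    for i in range(1, len(y)):
        gain = base - (i * entropy(y[:i]) + (len(y) - i) * entropy(y[i:])) / len(y)
        best = max(best, gain)
    return best
```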
Now, one small problem with this actually is that if I do this naively, it would take
a long time. Even for a tiny total dataset of length 200 with 200 objects in it,
there is about 8 1/2 million of these calculations to do. Each one of these
calculations actually isn't just moving dots around. Each time you place a dot,
you have to do a lot of distance computations and other things. So naively, it could
take a long time. As you may guess, we actually have ways of speeding this up,
which I will briefly talk about in a moment.
This makes it actually useful for visual domains. On my campus, actually we
have 1 million of these things, mostly in covered boxes. They actually have been
photographed. We have actually classified them, not only classified them but
classified them by being robust to things like broken points and classified them
with some explanation of why they're classified that way.
So we've done this, and here's the answer. So, again, we can take these
[inaudible] points, we can build a decision tree which actually is quite accurate,
as it happens. But it also tells you why it made the decision in some sense.
So the first split here is based upon this subsequence here which corresponds, if
I brush it back, to this section here. So what it basically says is if you have this
kind of deep, concave dish at the base, you are a Clovis. That's what defines a
Clovis. It isn't this point here, because that is actually very common in all kinds of
things, but this is unique to Clovis.
And, likewise, the second subtree here has its decision based upon this shape
here, which you can brush back to this. And what it basically says is if you have
a side notch here, you're an Avonlea. But if you don't, you're not. That makes a split
right here. So, once again, we're actually more accurate. We're a lot faster to
right here. So, once again, we're actually more accurate. We're a lot faster to
actually classify, which is kind of not that important in this domain. In some
domains it can be. But the real cool thing actually is, it is telling you why it made
a decision.
One last example before I move on. This is a classic problem called Gun/No Gun.
The young lady in question is either pointing over there at the wall or she is
pointing a gun at the wall, and I'm going to classify this. We can do this quite
accurately, a little more accurately than previous people have done. But more
importantly, it kind of tells you why it made the decision. So the shapelet is
actually here, which I can brush back into the video, and it turns
out that in this case the young lady, who is quite small, has a
very heavy gun. When she puts it back in the holster, there is basically an
overshoot: the inertia carries her hand past the holster, and then she puts it
back in again. It is a subtle thing, but it only occurs in the gun examples, and as we can
guess, that is the difference between the two classes.
So just a brief one slide to show you the scalability and the accuracy on a classic
benchmark. So finding the shapelets is the slow part. Once you find them,
classification with the decision tree is incredibly fast. But finding them can be
quite slow. So here, in this classic dataset, finding the best shapelet with
a brute force algorithm can take us about five days.
With some clever [inaudible] ideas, we are actually going to find the exact answer
in a few minutes instead. So we're actually going to find these things really fast
for large datasets.
More interestingly, we are a lot more accurate than many other approaches in many
domains. So here, in this problem, there are actually 2,000 things in the
training set. If we just use ten of those things, a tiny fraction of the training set,
we're not quite as good as the best known approach in the world. But if we use
20 things, which is only 1% of the entire dataset, we're already better than the
best approach. And as we add a few more things in, we get better and better
again.
And so why are we so much better than everything else? The trick is basically
that the shapelet in this case finds that the pattern right here is the key
difference. Shapelets can actually ignore most of the data. And as it happens,
for many problems, throwing away most of the data is the key thing. The
difference is only in a small subtle place. And shapelets can do that.
All the other approaches are basically forced to account for all the data. You are
going to find some noise. You are going to overfit. You are going to cause
problems.
Okay. So the last thing I'm going to talk about are time series discords, which
again are simply just some more subsequences with special properties. So
what's the property in this case?
So discords are the subsequences which are maximally far from the nearest
neighbor. So, for example, if I had this subsection here, it actually looks like that
one there. Or if I have this subsection here, it looks pretty much like that
subsection there. But as it happens for this subsection here, its nearest neighbor
somewhere in here is very, very far away, and that's what the definition of
a discord actually is.
As it happened in this case, it does correspond to a known phenomenon in
this particular dataset. It has found a true anomaly.
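As with motifs, the definition itself is easy to write down in brute-force form (the fast, disk-friendly algorithm mentioned later is something else entirely); a sketch:

```python
import numpy as np

def brute_force_discord(subs, window):
    """Return (nearest-neighbor distance, index) of the subsequence whose
    nearest non-overlapping neighbor is farthest away, i.e. the discord.
    `subs` comes from the sliding-window sketch; quadratic, so demo-sized data only."""
    best = (-float("inf"), -1)
    for i in range(len(subs)):
        nn = float("inf")
        for j in range(len(subs)):
            if abs(i - j) >= window:               # exclude trivial self-matches
                nn = min(nn, float(np.linalg.norm(subs[i] - subs[j])))
        if nn > best[0]:
            best = (nn, i)
    return best
```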
So here's some examples of discords. Here's a Web log we have from you guys
a couple years ago. And you can see that most things have kind of classic
patterns. So many things have this daily periodicity. I guess it is because people
simply go to work and have more Internet access at work. And they actually
have well-known patterns. Here's actually a pattern that you can see which is
called an anticipated burst, for movies or new books, whatever it is.
What you see it actually is a big buildup in excitement. The movie is released
and then people get bored and the excitement falls off. Of course, what is this
bump here? DVDs, right?
Actually, you see kind of a similar pattern, but a little bit different, for dead
celebrities. So for dead celebrities, you see a small interest. He is found dead in
Bangkok or whatever it is. You see a big spike in interest, which falls off again,
right? So the point actually is that most things are similar to something else.
So Germany might be similar to Austria. Stock market might be similar to
finance. Spiderman is similar to Star Wars and so on and so forth.
So given this, what's the most unusual thing you can find in all the English
words? It is a tough puzzle, right? It is not easy to guess what it is. The answer actually
is full moon. And why is that? Well, most things have a periodicity of a day.
Some things have a periodicity of one year. Some things have a periodicity of a
month. At the end of the month, your insurance expires. People look for
insurance at the end of the month for some reason, right?
But full moon is the only thing that has a periodicity of exactly that of a full moon.
And so its periodicity is unlike everything else in the world. It is the only thing you
can observe from anywhere on the planet and people apparently go out for a
walk, see the full moon, it looks pretty. They go home they hit the search engine,
they type in "full moon." So the periodicity is exactly that of a lunar month. Kind
of an interesting little puzzle there.
I don't want to beat this to death, but the discords actually work not only in one-dimensional space but also in two-dimensional and three-dimensional and other
spaces. The cool thing actually is they work incredibly well. So we've actually
compared this to many other approaches, and the problem is that many approaches
that find anomalies in time series typically have four or five or six parameters.
And you have to tune them and set them and you can make them work well for
one dataset, but you rarely generalize to other datasets.
The cool thing about discords actually is they have exactly one parameter, which
is the length of them. Once you set that, there is nothing else you can tune. You
walk away. It gives you the answer and, surprisingly, it often is the right answer.
I think actually simplicity here is not a weakness, it's a strength because if you
have lots of parameters, you are going to overfit. It is almost impossible to avoid
that.
I won't go into this in great detail but actually one question is: How do you know
if the discord you find is really unusual? Because, once again, there has to be
some kind of discord in every dataset, even if it is not very meaningful.
So one trick you can do is -- in this case, there is a dataset that has two real
anomalies which are known by external sources. If you simply sort them based
upon the discord distance, what you actually find is that the background normal
stuff is relatively low and there is a big jump, a discontinuity, up to the discords. So by
looking at this plot here, a knee plot or an elbow plot, you can probably kind of
guess that this actually corresponds to the threshold for true anomalies in this
dataset.
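A minimal sketch of that thresholding heuristic, placing the cut at the largest jump in the sorted nearest-neighbor distances (this simple gap rule is my reading of the knee/elbow argument, not a procedure spelled out in the talk):

```python
import numpy as np

def threshold_from_knee(nn_distances):
    """Sort each subsequence's nearest-neighbor distance and put a threshold
    at the largest jump; values above it are candidate true anomalies."""
    d = np.sort(np.asarray(nn_distances, dtype=float))
    gaps = np.diff(d)
    k = int(np.argmax(gaps))
    return (d[k] + d[k + 1]) / 2.0
```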
Just to go back to the example we have with this young lady with the gun. So in
the entire video sequence, which is quite long, we actually look for discords in
two-dimensional space. And we found actually it exists right here. So why is this
unusual? Once again, we go back to the original video and find the answer. So
normally the girl is very diligent and she points the gun, returns the gun, points
the gun multiple times for this video. But in one sequence beginning right here,
she returns the gun and she actually misses the holster or she fumbles around a
little bit. She gets embarrassed. She looks at the cameraman and she begins to
double over and laughs and jokes around, then realizes she's wasting time and
gets back into character and returns to normal. So here the discord finds, in
two-dimensional space, the interesting, unusual anomaly for this dataset.
Once again, if you do this naively, it could be really, really nasty, because the discord
definition requires you to compare every subsequence to every
other subsequence, and that's quadratic. And quadratic algorithms can be nasty,
especially if the data lives on disk.
As it happens, we have a disk-aware algorithm that can actually do this very, very
fast. For disk it basically takes two scans of the data and you can find the right
answer.
So we can do this actually for 100 million objects, which is about a third of a
terabyte, in about 90 hours. This is actually a few years old. Now it can probably
do a little bit better than that. But again, it is actually very impressive. A brute
force algorithm for this would take thousands of years.
So, again, I know you might not be that impressed. You guys are at
Microsoft with 100 million objects. But by most academic standards, that's a
really, really, really big dataset.
Just to give you an idea of how big it is, the classic thing you say is a needle in a haystack.
Suppose that each of the time series in the hundred million examples I gave
you was a straw in a haystack. How big is the haystack? Well, the haystack
would be about this size. This is actually to scale, 262 meters.
As it happens, there's actually a much harder problem than that needle in the
haystack because when you find the needle, you know you found the needle and
you're done. What I'm really asking you here is find the one piece of straw that's
least like all the other pieces of straw. That's a much harder problem. And,
again, it is kind of surprising you can find the exact answer in tens of hours and
not thousands of years.
So, again, I've been selling these discord ideas for a while. They are very, very
simple, almost insultingly simple, but they work very well. And recently we got some nice
confirmation of that. So Vipin Kumar actually had some students test this. So
they tested on 19 very different kinds of datasets from all kinds of domains, and
they tested the nine most famous techniques of anomaly detection out there.
And they actually found that discords win virtually every time. I think once or
twice they came in second place; but essentially discords, even though they're
insultingly simple, work incredibly well.
And I could make the same claim for shapelets and motifs. They are very, very
simple ideas, but simple ideas tend to work very well in my experience. And
certainly in every domain we've tried, this has been the case.
And, again, the reason why I think they actually do work so well is because
there are basically very few parameters to mess with. The only
parameter really is the length, and even that, in some cases, you can actually
remove and have no parameters at all.
And, finally, they are actually scalable and parallelizable. You can actually do all
kinds of clever tricks and make this really, really fast.
Now what I would like to do -- we haven't done yet actually -- is to port this to
streaming data. So imagine, instead of asking, in this batch dataset, what's the
discord or motif, asking, in the last one hour of a window that moves forever,
what's the most unusual discord you've seen? What's the best motif you've
seen? These are tricky problems.
In the motif case, we can maybe find the answer; Mueen is working
on that. For the discord case it is actually quite difficult, and we are not sure the
answer can be computed exactly.
Okay. So the overall conclusions, motifs, motif joins, shapelets, discords are
really very simple but very effective tools that we can use for understanding
massive, massive datasets, at least in a batch case and potentially also in a
streaming case. My personal philosophy is that parameters are
bad. Every time you add a parameter, you halve the utility of your idea. So if you
have a reasonably good idea and you have five parameters, you halve how good it
is, halve it again, halve it again, halve it again. It is not that good of an idea, basically.
And motifs and all these other things are great because they have very few to no
parameters.
And as always, if you have cool datasets, cool problems, we're very interested in
those.
Before I go, a quick plug. I'm giving a talk next week in Paris, a tutorial, on how
to do good research. It's basically designed for young faculty and grad students
who are maybe not from a big powerhouse like CMU or MIT who are trying to do
good research and actually get it published.
And I've gotten lots of great ideas from people all over the world. If you have any
interesting ideas for helping these people, these grad students and faculty, I
would love to hear them. So, again, it might be simply that you have reviewed
papers recently and you said, "These guys had a good paper but they did this
and it condemned the paper." So what is the "this" that condemned the paper? If you
tell me, I will try to summarize it and actually give that information back out to
the community.
Great. I'm all done at this point. Any questions? Comments?
Sir?
>> Question: So your one parameter is the sliding window size?
>> Eamonn Keogh: Yes.
>> Question: How do you handle that, say, in your motifs?
>> Eamonn Keogh: Actually, it is surprisingly easy in many cases. So, like, for
cardiology, the doctors will tell you that the interesting stuff happens at about one
second. What happens in five minutes is kind of irrelevant because it will drift in
that range. It makes no difference. What happens in a millisecond in a heartbeat
has almost no interest, too, right? So they kind of know the natural scale for
interesting stuff is about that.
So for the entomology thing again, the entomologists suggest the right scale
is actually about two or three seconds, right? Beyond that it is kind of random;
less than that, and you miss things.
One thing is simply to ask the expert. Sometimes, because it is
efficient enough, you can basically search over some range. So essentially you would
pick, say, half a second, double that a few times, and with some statistical
measure check the significance of what you find, and say, we think the real length
is actually 2.1 seconds, and here's what we find in that range.
So a combination of domain experts and a search over the parameter usually
solves the problem.
>> Question: Do you think there might be a way to automate that? Like, look at,
I don't know, [inaudible] or transforms or along those lines?
>> Eamonn Keogh: I think there is probably somebody doing that. I mean,
maybe someone statistically smarter than me, potentially, right, is probably
the true answer. My guess is that some kind of entropy measure would actually make
this work.
So one way of actually doing this is by looking at this problem
as a compression problem. So if you actually have motifs, you can compress
well, because you simply take the occurrences of the motif, give them a
letter A in your dictionary, and you kind of compress the data.
So if you basically say, well, the best motif length is the one that compresses
the data the most, which has some plausibility in some domains, then you could
actually simply do a hands-off thing: find the best compression and say, this is
the true structure of the data. We actually are working on that.
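A rough sketch of that compression idea, under heavy assumptions (a crude discretization and zlib standing in for a proper description-length model; the speaker's actual formulation is still work in progress, so this is only one plausible instantiation):

```python
import zlib
import numpy as np

def description_length(ts, motif_starts, window, n_bins=16):
    """Score one candidate motif length: discretize the series, replace each
    motif occurrence with a single dictionary symbol, and measure how small
    the result compresses. Smaller means the motif explains more of the data."""
    x = np.asarray(ts, dtype=float)
    x = (x - x.mean()) / (x.std() + 1e-12)
    symbols = list(np.digitize(x, np.linspace(-2, 2, n_bins - 1)).astype(np.uint8))
    for start in sorted(motif_starts, reverse=True):
        symbols[start:start + window] = [255]     # 255 stands for "motif A" in the dictionary
    return len(zlib.compress(bytes(symbols)))
```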
>> Question: So, for example, if you look at the motifs for various [inaudible]
sliding window lengths, the meaningfulness of the motif would probably give
some answer, right? If it is a too-short window, you could get a lot of false motifs.
If it is too long, no motifs at all.
>> Eamonn Keogh: Somebody could do that, right? There is kind of a sweet
spot, I think, if you have a very short thing, then almost everything matches. For
very, very long windows, nothing matches because of the curse of dimensionality; as you
might guess, everything becomes equally distant.
If you plot the statistics, you see this nice clean bump here and it is typically what
you expect to be the right answer for most domains, yes.
So, like, for dance and martial arts, if you try it on motion capture, it will be about
half a second. That seems to be kind of a plausible length; a move in a dance
might be about that length, and it repeats 11 times in a dance. And even the
best dancer, dancing to very synchronized music, can't exactly repeat
themselves with great fidelity much longer than a few seconds, because they will go
out of phase with themselves eventually.
Sir?
>> Question: What's your decision criteria? How do you decide [inaudible].
>> Eamonn Keogh: Sorry, once more.
>> Question: How do you decide [inaudible]?
>> Eamonn Keogh: This is Euclidean distance. Euclidean distance is a very
simple measure, and as it happens, for classification problems, Euclidean distance
works incredibly well. There are many other kinds of measures you could use,
like dynamic time warping or longest common subsequence. Actually, there are at
least 50 different measures out there for time series.
We actually do tests on classification problems, and Euclidean distance
basically wins every single time. It is kind of unfortunate, because you would like
the clever idea to work, but the simplest thing you can imagine, Euclidean
distance, works very, very well.
These are all based upon Euclidean distance.
>> Question: So the Euclidean distance becomes a threshold?
>> Eamonn Keogh: In the motif case, you minimize the Euclidean distance;
you are simply finding the pair with the minimum Euclidean distance. Once you
know that minimum value, you can actually maybe make a threshold out of it.
So it is true actually that for small datasets, non-Euclidean things actually work a
little bit better. So the intuition is -- let's say you have a face like my friend Kaushik
here. If I have a million people in a room, there is a good chance I will find a
similar face to his and there would be very little difference or warp in between
them.
But in a smaller room, I can't find someone that looks like him so much. I have to
warp or change his face more to match someone else's face. So for very small
datasets, Euclidean distance actually can't handle the irregularities and the warp
and the changes. And so dynamic time warping or some other kind of measure can
work very well.
Once the dataset is reasonably large, as it turns out, Euclidean distance works
beautifully.
Anyone else?
>> Kaushik Chakrabarti: Okay. Thank you.
>> Eamonn Keogh: Thank you.
[applause]