Fast Calculations of Simple Primitives in Time Series

Dennis Shasha
Department of Computer Science
Courant Institute of Mathematical Sciences
New York University

Joint work with Richard Cole, Xiaojian Zhao (correlation), Zhihua Wang (humming), Yunyue Zhu (both), and Tyler Neylon (SVDs, trajectories)

Roadmap

Section 1: Motivation
Section 2: StatStream: A Fast Sliding-Window-Based Correlation Detector
  Problem Statement
  Cooperative and Uncooperative Time Series
  Algorithmic Framework
  DFT-Based Scheme and Random Projection
  Combinatorial Design and Bootstrapping
  Empirical Study
Section 3: Elastic Burst Detection
  Problem Statement
  Challenge
  Shifted Binary Tree
  Astrophysical Application

Overall Motivation

Financial time series streams are watched closely by millions of traders. What exactly do they look for, and how can we help them do it faster? Typical query: "Which pairs of stocks had highly correlated returns over the last three hours?"

Physicists study the time series emerging from their sensors. Typical query: "Do there exist bursts of gamma rays in windows of any size from 8 milliseconds to 4 hours?"

Musicians produce time series. Typical query: "Even though I can't hum well, please find this song. I want the CD."

Why Speed Is Important

As processors speed up, algorithmic efficiency no longer matters ... one might think. True if problem sizes stay the same, but they don't. As processors speed up, sensors improve: satellites spew out a terabyte a day, magnetic resonance imagers give higher-resolution images, and so on. And users want real-time responses to queries.

Surprise, Surprise

More data, real-time response, and the increasing importance of correlation IMPLY that efficient algorithms and data management are more important than ever!

Section 2: StatStream: A Fast Sliding-Window-Based Correlation Detector

Scenario

Stock price streams: the New York Stock Exchange (NYSE) has 50,000 securities (streams) and 100,000 ticks (trade and quote).

Pairs trading, a.k.a. correlation trading. Query: "Which pairs of stocks were correlated with a value of over 0.9 for the last three hours?"

XYZ and ABC have been correlated with a correlation of 0.95 for the last three hours. Now XYZ and ABC become less correlated as XYZ goes up and ABC goes down. They should converge back later. I will sell XYZ and buy ABC ...

Motivation: Online Detection of High Correlation

[Figure: pairs of streams flagged "Correlated!" as they move together.]

Problem Statement

Synchronous time series window correlation: given Ns streams, a start time t_start, and a window size w, find, for each time window W of size w, all pairs of streams S1 and S2 such that S1 during time window W is highly correlated with S2 during the same time window. (Possible time windows are [t_start, t_start + w - 1], [t_start + 1, t_start + w], ..., where t_start is some start time.)

Asynchronous correlation: allow shifts in time. That is, given Ns streams and a window size w, find all time windows W1 and W2 with |W1| = |W2| = w and all pairs of streams S1 and S2 such that S1 during W1 is highly correlated with S2 during W2.
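To make the synchronous problem concrete, here is a minimal brute-force baseline (a sketch of mine, not from the talk; the function name, the use of NumPy's corrcoef, and the 0.9 default are my choices). It performs exactly the all-pairs scan over the latest window that StatStream's digests are designed to avoid.

```python
import numpy as np

def correlated_pairs(streams, w, threshold=0.9):
    """Brute-force synchronous window correlation.

    streams: Ns x T array, one row per stream, one column per timepoint.
    Returns all pairs (i, j) whose Pearson correlation over the most
    recent window of size w exceeds the threshold.
    """
    window = streams[:, -w:]
    corr = np.corrcoef(window)  # Ns x Ns correlation matrix
    ns = len(streams)
    return [(i, j, corr[i, j])
            for i in range(ns) for j in range(i + 1, ns)
            if corr[i, j] > threshold]
```

Running this every two minutes over tens of thousands of streams is quadratic in the number of streams; the digest-plus-grid scheme below makes the comparison step near-linear in practice.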
Cooperative and Uncooperative Time Series

Cooperative time series exhibit a fundamental degree of regularity, at least over the short term. They allow long time series to be compressed to a few coefficients with little loss of information using data-reduction techniques such as Fourier transforms and wavelet transforms. Example: stock price time series.

Uncooperative time series lack such regularities and resemble noise. Example: stock return time series (difference in price / average price).

Algorithmic Framework

Basic definitions:
Timepoint: the smallest unit of time over which the system collects data, e.g., a second.
Basic window: a consecutive subsequence of timepoints over which the system maintains a digest (i.e., a compressed representation) and returns results, e.g., two minutes.
Sliding window: a user-defined consecutive subsequence of basic windows over which the user wants statistics, e.g., an hour. The user might ask, "Which pairs of streams were correlated with a value of over 0.9 for the last hour?" and then ask again two minutes later.

Definitions: Sliding Window and Basic Window

[Figure: n stock streams along a time axis, divided into basic windows of size 2 within a sliding window of size 8.]

Algorithmic Strategy (cooperative case)

Each of the n time series is reduced to a digest by dimensionality reduction (DFT, DWT, SVD). The digests are placed in a grid structure, from which correlated pairs are reported.

GEMINI Framework (Faloutsos et al.)

The transformation ideally has the lower-bounding property.

DFT-Based Scheme*

[Figure: a sliding window split into basic windows; each basic window keeps a digest consisting of its sum and its first DFT coefficients.]

*D. Shasha and Y. Zhu. High Performance Discovery in Time Series: Techniques and Case Studies. Springer, 2004.

Incremental Processing

Compute the DFT one basic window at a time, then add (with angular shifts) to get a DFT for the whole sliding window. The time is just the DFT time for a basic window plus time proportional to the number of DFT components we need. Using the first few DFT coefficients for the whole sliding window, represent the sliding window by a point in a grid structure. We end up having to compare very few time windows, so a potentially quadratic comparison problem becomes linear in practice.

Grid Structure

Each window maps to a point x = (x1, x2, ..., xk) in digest space; nearby points fall into the same or adjacent grid cells.

Problem: Doesn't Always Work

DFT approximates the price-like data type very well. However, it is poor for stock returns ((today's price - yesterday's price) / yesterday's price). Return is more like white noise, which contains all frequency components. DFT uses the first n (e.g., 10) coefficients to approximate data, which is insufficient in the case of white noise.

DFT on Random Walk (works well) and White Noise (works badly)

[Figure: ratio of captured energy to total energy vs. the number of DFT coefficients. For a random walk, the ratio approaches 1 with a handful of coefficients; for white noise, it climbs only gradually as coefficients are added.]

Random Projection: Intuition

You are walking in a sparse forest and you are lost. You have an outdated cell phone without a GPS. You want to know if you are close to your friend. You identify yourself as 100 meters from the pointy rock, 200 meters from the giant oak, etc. If your friend is at similar distances from several of these landmarks, you might be close to one another. Random projections are analogous to these distances to landmarks.

How to Compute a Random Projection*

Random vector pool: a list of random vectors drawn from a stable distribution (like the landmarks). Project the time series into the space spanned by these random vectors. The Euclidean distance (correlation) between time series is approximated by the distance between their sketches, with a probabilistic guarantee. Note: sketches do not provide approximations of individual time series windows; they only help make comparisons.

*W. B. Johnson and J. Lindenstrauss. "Extensions of Lipschitz mappings into a Hilbert space". Contemp. Math., 26:189-206, 1984.
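A minimal sketch of the projection step, assuming Gaussian random vectors (one common stable distribution); the 1/sqrt(k) scaling and all names are my choices, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_sketches(windows, k):
    """Project each length-w window onto k shared random vectors.

    windows: n x w array. The scaled inner products form the sketches;
    distances between sketches approximate distances between windows.
    """
    w = windows.shape[1]
    R = rng.standard_normal((w, k))  # the "landmarks"
    return windows @ R / np.sqrt(k)

# Sketch distance tracks the true Euclidean distance:
x = rng.standard_normal(256)
y = x + 0.1 * rng.standard_normal(256)
sketches = make_sketches(np.vstack([x, y]), k=64)
print(np.linalg.norm(x - y), np.linalg.norm(sketches[0] - sketches[1]))
```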
Random Projection

[Figure: raw time series x = (x1, x2, ..., xw) and y = (y1, y2, ..., yw) are projected, by inner products with random vectors R1 = (r11, ..., r1w), ..., R4 = (r41, ..., r4w) (the rocks and buildings), into sketches (xsk1, xsk2, xsk3, xsk4) and (ysk1, ysk2, ysk3, ysk4); the sketches play the role of relative distances to the landmarks.]

Sketch Guarantees

Johnson-Lindenstrauss Lemma: for any 0 < epsilon < 1 and any integer n, let k be a positive integer such that

    k >= 4 (epsilon^2/2 - epsilon^3/3)^(-1) ln n.

Then for any set V of n points in R^d, there is a map f: R^d -> R^k such that for all u, v in V,

    (1 - epsilon) ||u - v||^2 <= ||f(u) - f(v)||^2 <= (1 + epsilon) ||u - v||^2.

Further, this map can be found in randomized polynomial time.

Empirical Study: Sketch Distance vs. Real Distance

[Figure: distributions of the factor (real distance / sketch distance) for sketch sizes 30, 80, and 1000; the distribution tightens around 1 as the sketch size grows.]

Empirical Comparison: DFT, DWT and Sketch

[Figure: on stock return data, DFT, DWT, and sketch distances plotted against the true distance over 25 data points; the sketch tracks the true distance far more closely.]

Algorithm Overview Using Random Projections/Sketches

Partition each sketch vector s of size N into groups of some size g. The ith group of each sketch vector s is placed in the ith grid structure (of dimension g). If two sketch vectors s1 and s2 are within distance c*d, where d is the target distance, in more than a fraction f of the groups, then the corresponding windows are candidate highly correlated windows and should be checked exactly.

Optimization in Parameter Space

Next, how do we choose the parameters N, g, c, f?
Size of sketch (N): 30, 36, 48, 60
Group size (g): 1, 2, 3, 4
Distance multiplier (c): 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 1.1, 1.2, 1.3
Fraction (f): 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1

Essentially, we prepare several groups of good parameter candidates and choose the best one to apply to the given data. But how do we select the good candidates? Combinatorial design and bootstrapping.

Combinatorial Design*

Pairwise combinations of all the parameters. Informally: each value of parameter X is combined with each value of parameter Y in at least one experiment, for all X, Y. Example: if there are four parameters having respectively 4, 4, 13, and 10 values, exhaustive search requires 2080 experiments vs. 130 for a pairwise combinatorial design (see the sketch below). sketchparam.csv

*http://www.cs.nyu.edu/cs/faculty/shasha/papers/comb.html

Exploring the Neighborhood Around the Best Values

Because combinatorial design is NOT exhaustive, we may not find the optimal combination of parameters at first. Solution: when good parameter values are found, search their local neighbors for better solutions.
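The slides point to a combinatorial-design generator (see the URL above). As a hedged illustration only, here is a simple greedy construction of a pairwise covering design over the four parameter grids listed earlier; a greedy cover is typically within a small factor of the hand-built 130-experiment design, not identical to it.

```python
from itertools import combinations, product

# Parameter grids from the slides: N, g, c, f.
params = {
    "N": [30, 36, 48, 60],
    "g": [1, 2, 3, 4],
    "c": [round(0.1 * i, 1) for i in range(1, 14)],   # 0.1 .. 1.3
    "f": [round(0.1 * i, 1) for i in range(1, 11)],   # 0.1 .. 1.0
}

def pairwise_design(params):
    """Greedy covering design: every value pair of every two parameters
    appears in at least one chosen experiment. A sketch of the idea,
    not the construction used on the slides."""
    names = list(params)
    uncovered = {(a, va, b, vb)
                 for a, b in combinations(names, 2)
                 for va, vb in product(params[a], params[b])}
    candidates = [dict(zip(names, vals)) for vals in product(*params.values())]
    chosen = []
    while uncovered:
        # Pick the experiment covering the most still-uncovered pairs.
        best = max(candidates, key=lambda e: sum(
            (a, e[a], b, e[b]) in uncovered for a, b in combinations(names, 2)))
        chosen.append(best)
        for a, b in combinations(names, 2):
            uncovered.discard((a, best[a], b, best[b]))
    return chosen

design = pairwise_design(params)
print(len(design), "experiments instead of", 4 * 4 * 13 * 10)
```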
How Bootstrapping Is Used

Goal: test the robustness of a conclusion on a sample data set by creating new samples from the initial sample with replacement.

Procedure: start from a sample set with 1,000,000 pairs of time series windows. Among them, choose 20,000 sample points with replacement. Compute the recall and precision each time. Repeat many times (e.g., 100 or more).

Testing for Stability

Bootstrap 100 times. Compute the mean and standard deviation of the recalls and precisions. What we want from good parameters:
mean(recall) - std(recall) > threshold(recall)
mean(precision) - std(precision) > threshold(precision)
If there are no such parameters, enlarge the replacement sample size.

[Figure: streams X, Y, Z are sketched by inner products with random vectors r1, ..., r6; each sketch, e.g. (xsk1, ..., xsk6), is partitioned into pairs (xsk1, xsk2), (xsk3, xsk4), (xsk5, xsk6), and each pair is placed in its own grid structure.]

Experiments

[Figures over practical data sets (buoy_sensor, evaporator, foetal_ecg, exrate, steamgen, winding, price, return, eeg, cstr, spot_exrate): mean and standard deviation of DFT distance / real distance vs. sketch distance / real distance, and a comparison of wall-clock processing time for sketch, DFT, and sequential scan.]

Section 3: Elastic Burst Detection

Elastic Burst Detection: Problem Statement

Problem: given a time series of positive numbers x1, x2, ..., xn, and a threshold function f(w), w = 1, 2, ..., n, find the subsequences of any size whose sums are above the thresholds: all w, m with 0 < w < n and 0 < m < n - w such that

    x_m + x_(m+1) + ... + x_(m+w-1) >= f(w).

Brute-force search takes O(n^2) time. Our shifted binary tree (SBT) takes O(n + k) time, where k is the size of the output, i.e., the number of windows with bursts.

Burst Detection: Challenge

This is a single-stream problem. What makes it hard is that we are looking at multiple window sizes at the same time. The naive approach is to handle one window size at a time.

Astrophysical Application

Motivation: in astrophysics, the sky is constantly observed for high-energy particles. When a particular astrophysical event happens, a shower of high-energy particles arrives in addition to the background noise. An unusual burst may signal an event interesting to physicists.

Technical overview:
1. The sky is partitioned into 1800 x 900 buckets.
2. 14 sliding window lengths are monitored, from 0.1 s to 39.81 s.
3. The original code implements the naive window-at-a-time algorithm and cannot handle more windows.

Bursts Across Different Window Sizes in Gamma Rays

Challenge: discover not only the time of the burst, but also the duration of the burst.

Shifted Binary Tree (SBT)

Define the threshold for a node of size 2^k to be the threshold for a window of size 1 + 2^(k-1).

Burst Detection Using SBT

Any window of size w, with 2^(i-1) + 2 <= w <= 2^i + 1, is included in one of the windows at level i+1. For a non-negative data stream and a monotonic aggregation function, if a node at level i+1 does not exceed the threshold for window size 2^(i-1) + 2, then none of the windows of sizes between 2^(i-1) + 2 and 2^i + 1 can contain a burst; otherwise a detailed search is needed to test for real bursts. This filters many windows, reducing the CPU time dramatically. Shortcoming: the structure is fixed, so it can do badly if bursts are very unlikely or relatively likely.
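A hedged sketch of the SBT filtering idea (using prefix sums rather than an explicit tree, checking sizes 1 and 2 directly, and with function names of my own; not the paper's implementation):

```python
import numpy as np

def sbt_bursts(x, f, max_w):
    """Elastic burst detection with an SBT-style filter.

    f is a monotonically increasing threshold function. Returns every
    (start, w) with x[start] + ... + x[start+w-1] >= f(w), w <= max_w.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    prefix = np.concatenate(([0.0], np.cumsum(x)))
    wsum = lambda s, w: prefix[s + w] - prefix[s]  # sum of x[s:s+w]
    # Window sizes 1 and 2 are checked directly.
    bursts = {(s, w) for w in (1, 2) for s in range(n - w + 1)
              if wsum(s, w) >= f(w)}
    i = 1
    while 2 ** (i - 1) + 2 <= max_w:
        lo = 2 ** (i - 1) + 2            # sizes lo..hi are covered by
        hi = min(2 ** i + 1, max_w)      # level i+1 nodes of size 2^(i+1)
        size, shift = 2 ** (i + 1), 2 ** i
        for s in range(0, n, shift):
            node = wsum(s, min(size, n - s))
            if node < f(lo):             # filter: monotonic f means no
                continue                 # burst of size lo..hi fits here
            for w in range(lo, hi + 1):  # detailed search inside the node
                for t in range(s, min(s + size - w, n - w) + 1):
                    if wsum(t, w) >= f(w):
                        bursts.add((t, w))
        i += 1
    return sorted(bursts)

# Example: bursts = sbt_bursts(counts, f=lambda w: 5.0 * w + 20.0, max_w=64)
```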
Shifted Aggregation Tree

A hierarchical tree structure in which each node is an aggregate. It differs from the SBT in two ways:
Parent-child structure: defines the topological relationship between a node and its children.
Shifting pattern: defines how many timepoints apart two neighboring nodes at the same level are.

Aggregation Pyramid (AP)

An N-level, isosceles-triangle-shaped data structure built on a sliding window of length N. Level 0 has a one-to-one correspondence with the input time series. Level h stores the aggregates of h+1 consecutive elements, i.e., a sliding window of length h+1. The AP stores every aggregate for every window size starting at every timepoint.

Aggregation Pyramid Property

45-degree diagonal: same starting time. 135-degree diagonal: same ending time.
Shadow of cell(t,h): the sliding window starting at time t and ending at time t+h.
Coverage of cell(t,h): all the cells in the sub-pyramid rooted at cell(t,h).
Overlap of cell(t1,h1) and cell(t2,h2): the cell at the intersection of the 135-degree diagonal touching cell(t1,h1) and the 45-degree diagonal touching cell(t2,h2).
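A minimal sketch of materializing an aggregation pyramid with sum as the aggregate (names are mine; in practice the structure is conceptual and need not be fully materialized):

```python
import numpy as np

def aggregation_pyramid(window):
    """Build the full aggregation pyramid over one sliding window.

    Level 0 is the data itself; level h holds sums of h+1 consecutive
    elements, so cell (t, h) aggregates window[t : t + h + 1].
    """
    levels = [np.asarray(window, dtype=float)]
    for h in range(1, len(window)):
        prev = levels[-1]
        # A sum over h+1 elements extends a sum over h elements by one.
        levels.append(prev[:-1] + levels[0][h:])
    return levels

ap = aggregation_pyramid([1, 3, 2, 5])
print(ap[1])  # sums of pairs: [4. 5. 7.]
print(ap[3])  # sum of the whole window: [11.]
```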
Embed Shifted Binary Tree in Aggregation Pyramid

Aggregation Pyramid as a Host Data Structure

Many structures besides the Shifted Binary Tree embed in an Aggregation Pyramid. The update-filter-search framework guarantees detection of all the bursts as long as the structure includes the level-0 cells and the top-level cell. What kinds of structures are good for burst detection?

Which Shifted Aggregation Tree Should Be Used?

Many Shifted Aggregation Trees are available, and all of them guarantee detection of all the bursts. Which structure should be used? Intuitively, the denser a structure, the more updating time and the less detailed-search time, and vice versa. We want the structure that minimizes the total CPU running time, given the input.

State-Space Algorithm

View a Shifted Aggregation Tree (SAT) as a state, and view the growth from one SAT to another as a transformation between states.
Initial state: a SAT containing only level 0.
Transformation rule: if adding one level to the top of SAT B yields SAT A, state B is transformed to state A.
Final state: a SAT that can cover the maximum window size of interest.
Traversing strategy: best-first search. Associate each state with a cost, and prune in order to explore more reasonable structures.

Results

The Shifted Aggregation Tree outperforms the Shifted Binary Tree, with a factor of 35 speedup in some experiments, and it can adjust its structure to adapt to different inputs. [Figures: CPU time vs. Poisson arrival rate lambda, and CPU time vs. threshold (burst probability p = 10^-k), for SAT, SBT, and the naive algorithm.]

Greedy Dynamic Burst Detection

Real-world data keeps changing, and a static structure is poor if the training data differs significantly from the data to be detected: 10%-250% more detection time. [Figure: CPU time for training sets with different lambda.]

Ideas

Basic idea: change the structure whenever a change helps reduce the overall cost.
Be greedy when making a structure denser: if the saved detailed-search cost is greater than the added updating/filtering cost, add the level.
Be delayed-greedy when making a structure sparser: alarms tend to occur in clusters, across multiple sizes and multiple neighboring windows, and a lower level may support a higher level, so don't remove a level if an alarm occurred recently.

Algorithm Sketch

Start with a structure trained using the state-space algorithm.
If an alarm is raised at level k:
  If adding a level between level k and level k-1 would save some cost, add it.
  If the addition is blocked because a lower level needed to support the new level is not in the structure, add that lower level.
  If the addition is blocked because the shift does not satisfy the Shifted Aggregation Tree property, legally narrow the shifts.
Else, if the aggregate at level k does not exceed the minimum threshold for level k-1:
  If no alarm occurred recently:
    If it is legal to remove level k-1, remove it; otherwise legally double the shift.

Results

Different training sets: the dynamic algorithm overcomes the discrepancy resulting from biased training data.
Different sets of window sizes: when the number of window sizes is small, the dynamic algorithm performs slightly less well than the static algorithm.
[Figures: CPU time for static vs. dynamic over training sets with different lambda, and CPU time for different numbers of window sizes (10-100), Poisson data.]

Volume Spike Detection in Stock Trading

Trading volume indicates buying/selling interest, the underlying force for price movements. A volume spike is a burst of trading activity, a signal in program trading. The rate is high: more than 100 updates per second per stock (from marketvolume.com).

Setup: tick-by-tick trading activity of the IBM stock, Jan. 2001 - May 2004; 23 million timepoints; exponential distribution.

Results: real-time response, 0.01 ms per timepoint on average; 3-5 times speedup over the Shifted Binary Tree; the dynamic algorithm performs slightly better than the static algorithm. [Figures: CPU time vs. threshold (burst probability p = 10^-k), vs. maximum window size of interest, and vs. number of window sizes, for SBT, static SAT, and dynamic SAT.]
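The experiments above parameterize thresholds by a burst probability p = 10^-k. As an illustration of how such a threshold function f(w) can be derived for Poisson-distributed counts, here is one plausible reading, not necessarily the papers' exact procedure; lam and p are placeholder values.

```python
from scipy.stats import poisson

def threshold(w, lam=10.0, p=1e-6):
    """f(w) such that, under a Poisson(lam)-per-timepoint model, a
    window of length w reaches f(w) with probability about p
    (p = 10**-k on the plots' axes)."""
    # The sum of w independent Poisson(lam) counts is Poisson(w * lam).
    return poisson.ppf(1.0 - p, w * lam) + 1

for w in (1, 10, 100):
    print(w, threshold(w))
```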
Click Fraud Detection in Website Traffic Monitoring

Setup: the 2003 Sloan Digital Sky Survey web traffic log, the same type of data as click data; the number of requests within each second; 31 million timepoints; Poisson distribution.

Results: 2-5 times speedup over the Shifted Binary Tree; the dynamic algorithm performs better than the static algorithm. [Figures: CPU time vs. threshold (burst probability p = 10^-k), vs. maximum window size of interest, and vs. number of window sizes, for SBT, static SAT, and dynamic SAT.]

Fast and Accurate Time Series Matching with Time-Warping

Outline

Problem statement
Related-work review
Case study: query by humming
Future work

Goal of This Work

Goal: build fast and accurate similarity-search algorithms for large-scale time series systems that allow complex time shifting in the query. Two major challenges: query ambiguity and the large size of the database.

Related Work Review

GEMINI framework: introduced by C. Faloutsos, M. Ranganathan and Y. Manolopoulos to avoid linear-scan comparison.
Dynamic Time Warping (DTW): introduced by D. Berndt and J. Clifford to allow time shifting in time series comparison.
Envelope and envelope transforms: introduced by E. Keogh to index the DTW distance; generalized into the GEMINI framework by our group.

Dynamic Time Warping

The DTW distance between two time series x and y equals the cost of an optimal path: each path from (1,1) to (m,n) is an alignment, where (i,j) represents aligning x(i) with y(j), cost(i,j) equals |x(i) - y(j)|, and the optimal path has minimum total cost.

Problem of DTW Distance

The DTW distance does not obey the triangle inequality, which most standard indexing methods require.
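A textbook dynamic-programming implementation of the DTW distance just defined (a sketch; real systems usually add a warping-width constraint, like the 0.05 * N band used in the experiments later):

```python
import numpy as np

def dtw(x, y):
    """Classic O(mn) DTW: cost(i, j) = |x[i] - y[j]|, and the distance
    is the minimum total cost over all alignment paths from the first
    pair of points to the last."""
    m, n = len(x), len(y)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # Extend the cheapest of the three predecessor alignments.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[m, n]

print(dtw([1, 2, 3, 4], [1, 2, 2, 3, 4]))  # 0.0: a pure time shift
print(dtw([1, 2, 3, 4], [1, 2, 3, 5]))     # 1.0
```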
Envelope and Envelope Transform

Filter out bad candidates at lower computing cost while guaranteeing no false negatives. In feature space: an envelope filter and a transformed envelope filter.

Example of Envelope Transform

[Figure: Piecewise Aggregate Approximation (PAA) of the original time series, with upper and lower envelopes and their transformed versions U_new and L_new.]

Case Study: Query by Humming

A humming query is matched against a music database, and the system returns similar music segments. Topics: application-specific challenges, related work, proposed framework, experiments.

Challenges

People may hum out of tune: a different base key, inaccurate relative pitch, unstable pitch. They hum with a different tempo, and with varying tempo. And it is hard to segment the notes in the humming.

Flourishing Literature

String-represented note-sequence matching [A. Uitdenbogerd et al. Matching techniques for large music databases. ACM Multimedia 99]
Data-reduction transformations (STFT) [N. Kosugi et al. A practical query-by-humming system for a large music database. ACM Multimedia 2000]
Melody slope matching [Yongwei Zhu et al. Pitch tracking and melody slope matching for song retrieval. Advances in Multimedia Information Processing, PCM 2001]
Dynamic Time Warping (DTW) on pitch contour [Y. Zhu and D. Shasha. Warping indexes with envelope transforms for query by humming. ACM SIGMOD 2003]
String editing on note sequences combined with rhythm alignment [W. A. Arentz. Methods for retrieving musical information based on rhythm and pitch correlation. CSGSC 2003]

Problems with Related Work

Performance comparison is difficult: there is no standard for evaluation, data sets and test sets differ, and the definitions of accuracy are not reliable. General conclusions are nevertheless possible: warped distance measures are more robust, and warped distance matching needs to scale up.

Experiment on Scaling Up

Top-K hit rate = (# recognized in the top-K list) / (# recognized by humans).

Test set: 41 hummed tunes from Beatles songs, collected from both amateurs and professionals, each recognizable by more than three persons.
Two data sets: A. 53 Beatles songs (included in B); B. 1032 songs, including 123 Beatles songs, 908 American rock-and-pop songs, and one piece of Chinese game music.
*Both systems are optimized using another test set of 17 hummings.

Framework Proposal

[Pipeline diagram: a 'ta'-style humming is segmented into notes/durations; a keyword filter and boosted statistics-based feature filters screen the database; a nearest-N search on DTW distance with the transformed envelope filter produces a top-N list; an alignment verifier using rhythm (duration) and melody (note) information yields the final top-N' matches.]

Important Heuristic: 'ta'-Based Humming*

Humming each note as 'ta' solves the problem of note segmentation in most cases (compare humming with 'la' vs. with 'ta').

*Idea from N. Kosugi et al. "A practical query-by-humming system for a large music database". ACM Multimedia 2000.

Benefits of 'ta'-Style Humming

It decreases the size of the time series by orders of magnitude, and thus reduces the computation of the DTW distance.

Important Heuristic: Statistics-Based Filters*

Low-dimensional statistical features cost much less to compute than the DTW distance and quickly filter out true negatives. Example: filter out candidates whose note length is much larger or smaller than the query's note length. More: standard deviation of note values, zero-crossing rate of note values, number of local minima/maxima of note values, histogram of note values. (A sketch follows below.)

*Intuition from Erling Wold et al. "Content-based classification, search and retrieval of audio". IEEE Multimedia 1996. http://www.musclefish.com
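A hedged sketch of such filters (my selection of features and placeholder thresholds, not the system's actual settings):

```python
import numpy as np

def stats_features(notes):
    """A few of the cheap filter features listed above, each O(n) to
    compute, versus the much costlier DTW distance."""
    notes = np.asarray(notes, dtype=float)
    d = np.diff(notes)
    centered = notes - notes.mean()
    return {
        "length": len(notes),
        "std": notes.std(),
        "zero_crossings": int(np.sum(np.diff(np.sign(centered)) != 0)),
        "local_extrema": int(np.sum(d[:-1] * d[1:] < 0)),
    }

def passes_filter(query, candidate, length_ratio=2.0, std_slack=1.0):
    """Hypothetical screening rule: drop candidates whose note length or
    spread differs wildly from the query's. Thresholds are placeholders
    (in the real system they would be tuned, e.g., by boosting)."""
    q, c = stats_features(query), stats_features(candidate)
    if not (1 / length_ratio <= c["length"] / q["length"] <= length_ratio):
        return False
    return abs(c["std"] - q["std"]) <= std_slack
```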
Important Heuristic: Boosting

Characteristics of statistics-based filters: they are quick, but they do not guarantee no false negatives, and they become weak classifiers under bad parameter settings, which makes them ideal candidates for boosting. Boosting* is "an algorithm for constructing a 'strong' classifier using only a training set and a set of 'weak' classification algorithms"; "a particular linear combination of these weak classifiers is used as the final classifier, which has a much smaller probability of misclassification".

*Cynthia Rudin et al. "On the Dynamics of Boosting". In Advances in Neural Information Processing Systems, 2004.

Important Heuristic: Alignment Verification

Nearest-N search uses only melody information, which does not guarantee that the rhythm will match. Will A. Arentz et al. suggest combining rhythm and melody information in the similarity comparison; the results are generally better than using melody information alone, but the approach is not appropriate when the sum of several notes' durations in the query may correspond to the duration of one note in the candidate. Our method:
1. Use melody information for the DTW distance computation.
2. Reject candidates that have bad local note alignments.
3. Merge durations appropriately based on the note alignment.
4. Reject candidates that have bad duration alignments.

Experiments

Data set: 1,032 songs divided and organized into 73,051 melodic segments.
Computing environment: Pentium IV 2.4 GHz with 768 MB of memory (all the data can be loaded); K, an array-processing language.
Query criteria: return a top-15 list of candidates with DTW distance less than 0.5; allow 0.05 * N local time shifting for a query with N notes.

Experiment One: Human Humming

Query set: 109 'ta'-style hummings: the previous 41 hummings of Beatles songs, 65 hummings of American rock-and-pop songs, and 3 hummings of the Chinese game music; each recognizable by at least two persons; the number of notes varies from 6 to 24.

Future Work

Model and estimate the error more accurately; analyze the relationship between the algorithm's performance and observed humming errors. Build a standard benchmark to evaluate and compare different QbH systems. Investigate more lower-bounding filters at lower levels, and more classifiers to boost. Build an intelligent system that improves itself by adjusting to a particular user's humming patterns.

Themes and Approaches

Our approach: take simple problems and make them fast. Correlation and related primitives (e.g., matching pursuit) are simple, but we want to compute them for many thousands of time series, incrementally and in near-linear time. Burst detection over many window sizes in near-linear time. Query by humming: large-scale and accurate. Coming to a store near you?