Learn to Compress and Restore Sequential Data∗

Yi Wang and Jianhua Feng
Department of Computer Science, Tsinghua University, Beijing, 100084, China.

Shixia Liu
Information Visualization and Interactive Visual Analytics, IBM China Research Lab, Beijing, 100094, China.

∗ This work was finished during YW's visit to IBM China Research Lab. YW wishes to thank the NSF of China (No. 60573094), the 973 Program of China (No. 2006CB303103), and the City University of Hong Kong (CityU 118205 and 1211/04E) for supporting critical preliminary research.

Copyright © 2007, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Introduction

Data compression methods can be classified into two groups: lossless and lossy. Usually the latter achieves a higher compression ratio than the former. However, to develop a lossy compression method, we have to know, for a given type of data, what information can be discarded without significant degradation of data quality. A usual way to obtain such knowledge is by experiment. For example, from user studies we know that human eyes are insensitive to certain frequency channels of the light signal. Thus we can compress image data by decomposing them into frequency channels using a DCT transform and neglecting the coefficients of the channels to which human eyes are insensitive. However, it is complex and expensive for human analysts to conduct and study so many experiments.

Alternatively, we propose to learn such knowledge automatically by using machine learning techniques. Under the framework of Bayesian learning, general prior knowledge is expressed by the design of the statistical model, and the refined posterior knowledge is learned automatically from the data to be compressed. More particularly, we consider the compression of input data as learning a statistical model from the data, and the restoration of the data as sampling from the learned model. Therefore, only the estimated model parameters are saved as the compressed version. The key to this idea is to design a statistical model that can describe the data accurately (so the data can be recovered precisely) and that is defined by a compact set of parameters (so as to achieve a high compression ratio). For the general task of compressing sequential data, we designed the Variable-Length Hidden Markov Model (VLHMM), whose learning algorithm automatically learns a minimal set of parameters (by optimizing a Minimum-Entropy criterion) that accurately models the sequential data (by optimizing a Maximum-Likelihood criterion). The self-adaptation ability of the learning algorithm makes VLHMM able to accurately model highly varied sequential data. Moreover, as a hidden Markovian model, VLHMM is generally applicable to all kinds of sequences, whether discrete or continuous, univariate or multivariate.

The VLHMM Approach

The VLHMM combines the advantages of the hidden Markov model (HMM), the high-order HMM (n-th-order HMM, or n-HMM), and the variable-length Markov model (VLMM) (Rissanen 1983). Like HMM and n-HMM, VLHMM models a hidden Markovian process switching over a specified number, say S, of states. We call the several previous states used to determine the current state a context. At each state transition, the model moves from the current context to a new state, which, together with the current context, forms the new context. For HMM the new context is truncated to the fixed length 1; for n-HMM all contexts have the fixed length n; whereas for VLHMM the contexts have variable lengths that are learned from data. Thus the model can be represented by a directed graph whose nodes correspond to the contexts and whose edges represent the context transitions. Under this graph representation, simulating a VLHMM, like simulating an HMM or an n-HMM, can be regarded as a random walk over the graph.
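For illustration, the following minimal sketch (hypothetical code, not the authors' implementation; the states, contexts, and probabilities are invented) stores the context graph as a map from variable-length state tuples to next-state distributions. One step of the random walk samples a next state from the current context's distribution, appends it to the state history, and truncates the history to the longest learned context:

    import random

    # Hypothetical context graph: each learned context, a variable-length
    # tuple of past states (oldest first), maps to a distribution over
    # the next state. Numbers are invented for illustration.
    CONTEXTS = {
        (0,):   {0: 0.8, 1: 0.2},   # a short context suffices here
        (1,):   {0: 0.5, 2: 0.5},
        (2,):   {1: 1.0},
        (0, 1): {2: 1.0},           # a longer one where dynamics vary
    }

    def truncate(history):
        """Return the longest learned context that is a suffix of `history`."""
        for k in range(len(history), 0, -1):
            if tuple(history[-k:]) in CONTEXTS:
                return tuple(history[-k:])
        raise KeyError("history matches no learned context")

    def walk(start, steps, seed=0):
        """Simulate the model as a random walk over the context graph:
        sample a next state, append it to the history, and truncate the
        history to obtain the new context."""
        rng = random.Random(seed)
        states = list(start)
        context = start
        for _ in range(steps):
            dist = CONTEXTS[context]
            nxt = rng.choices(list(dist), weights=list(dist.values()))[0]
            states.append(nxt)
            context = truncate(states)
        return states

    print(walk((0,), 10))   # the start state plus 10 sampled states

The truncation step is what distinguishes the three models: an HMM always keeps only the last state, an n-HMM keeps the last n states, and a VLHMM keeps whatever suffix its learning algorithm has retained.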
Because an output pdf is defined on each context, and after each context transition an observable is sampled from the output pdf of the destination context, simulating the model generates a sequence of observables. Because the output pdfs can take rather flexible forms, the observables can be discrete or continuous, scalar or vector. This makes VLHMM applicable to modeling various kinds of sequential data.

Using VLHMM, the idea of "learning to compress" becomes: learn a VLHMM, and use the Viterbi algorithm to align the training sequence to a path in the learned context graph, namely the path from which the training sequence is most likely to have been generated. Only the transition probabilities and the output pdfs along that path need to be saved as the compressed data. We restore the training sequence by simulating the saved path. (A toy sketch of this compress-and-restore pipeline appears at the end of this section.)

Although HMM and n-HMM can also be learned and simulated, they are not suitable for compression. The HMM, because of its well-known restriction to single-state contexts, cannot accurately model sequential data that is highly varied over time, so simulating the model may fail to restore the training sequence correctly. The n-HMM with long contexts is accurate, but since there is a total of S^n potential contexts, the number of model parameters grows exponentially with n, and learning them would require an intractable amount of training data and learning time. VLHMM, by contrast, is an accurate high-order model that learns a minimum set of contexts to achieve a high compression ratio.

VLHMM exploits the fact that it is not necessary to extend all contexts to a fixed large length n. Instead, it learns longer contexts to model parts of the training sequence that are highly varied over time, and shorter ones for parts with plain dynamics. The effectiveness of VLHMM rests on the observation that the dynamical complexity of most sequential data changes over time. For example, the dynamical complexity of digital voice sequences varies with changes of topic, mood, and so on.

VLMM also learns a minimum set of contexts with variable lengths. But as an "observable" model without output pdfs, it can only model sequences of univariate, discrete values. Although continuous or multivariate sequences can be discretized before learning a VLMM, discretization that ignores the temporal coherence of the sequences usually introduces significant error that is hard to control (Wang et al. 2006). VLHMM, being a hidden Markov model, is generally applicable to all kinds of sequential data. As in VLMM, its contexts have variable lengths and are the shortest ones that are still long enough to accurately determine the next state. Shortening the contexts exponentially reduces the total number of contexts and model parameters, and thus ensures a high compression ratio, whereas the sufficient lengths ensure modeling accuracy and thus precise data restoration.

However, the variable lengths also leave the total number of contexts of a VLHMM unknown prior to learning, even when the number of states S is given. This is different from HMM and n-HMM. In Bayesian learning, this is referred to as the problem of unknown model structure, and it is well known to be difficult. Our solution is to derive a structural-EM algorithm, which provably converges to a Maximum-Likelihood estimate (Friedman 1997). Consistent with the proof, our algorithm invokes a context-tree growing/pruning procedure in its M-step to update the estimated set of contexts by optimizing a Minimum-Entropy criterion.
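This abstract does not spell out the growing/pruning criterion, so the following sketch only conveys its flavor under an assumed rule: a lengthened context is kept only if conditioning on its extra past state meaningfully reduces the entropy of the next-state distribution. The function names, the weighting, and the threshold are all hypothetical:

    import math

    def entropy(dist):
        """Shannon entropy (in bits) of a next-state distribution."""
        return -sum(p * math.log2(p) for p in dist.values() if p > 0)

    def prune(tree, threshold=0.01):
        """One assumed pruning pass over a context tree.

        `tree` maps a context tuple to (weight, dist): the probability
        mass of the context in the training data and its next-state
        distribution. A lengthened context is dropped when conditioning
        on its extra (oldest) state reduces the next-state entropy by
        less than `threshold` bits, i.e. the shorter parent context
        already predicts the next state about as well.
        """
        pruned = dict(tree)
        for ctx in sorted(tree, key=len, reverse=True):   # deepest first
            parent = ctx[1:]                              # drop the oldest state
            if len(ctx) > 1 and parent in pruned:
                weight, dist = tree[ctx]
                _, parent_dist = tree[parent]
                gain = entropy(parent_dist) - entropy(dist)
                if weight * gain < threshold:
                    del pruned[ctx]
        return pruned

Growing would be the mirror image: tentatively extend a context and keep the extension only when the weighted entropy reduction exceeds the threshold.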
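The toy sketch of the compress-and-restore pipeline promised above follows. To keep the Viterbi recursion readable, it uses just two single-state contexts with 1-D Gaussian output pdfs and invented numbers; in the actual method the same alignment would run over the variable-length context graph:

    import math

    # Hypothetical toy model: two single-state contexts with 1-D
    # Gaussian output pdfs. All numbers are invented for illustration.
    TRANS = {"A": {"A": 0.6, "B": 0.4},      # context -> next-context probs
             "B": {"A": 0.5, "B": 0.5}}
    OUT = {"A": (0.0, 1.0), "B": (5.0, 1.0)}  # context -> (mean, stddev)

    def log_gauss(x, mu, sigma):
        """Log-density of a 1-D Gaussian."""
        return (-0.5 * ((x - mu) / sigma) ** 2
                - math.log(sigma) - 0.5 * math.log(2 * math.pi))

    def viterbi(seq):
        """Align the sequence to its most likely context path."""
        score = {c: log_gauss(seq[0], *OUT[c]) for c in TRANS}
        back = []
        for x in seq[1:]:
            prev, score, step = score, {}, {}
            for c in TRANS:
                best = max(prev, key=lambda p: prev[p] + math.log(TRANS[p][c]))
                step[c] = best
                score[c] = (prev[best] + math.log(TRANS[best][c])
                            + log_gauss(x, *OUT[c]))
            back.append(step)
        path = [max(score, key=score.get)]
        for step in reversed(back):       # backtrack from the best end
            path.append(step[path[-1]])
        return path[::-1]

    def compress(seq):
        """Save only what lies along the Viterbi path: the path, the
        output pdfs of its contexts, and its transition probabilities."""
        path = viterbi(seq)
        pdfs = {c: OUT[c] for c in set(path)}
        trans = {(p, c): TRANS[p][c] for p, c in zip(path, path[1:])}
        return path, pdfs, trans

    def restore(path, pdfs):
        """Replay the saved path, emitting each context's mean (sampling
        the pdf instead would give a stochastic reconstruction)."""
        return [pdfs[c][0] for c in path]

    path, pdfs, trans = compress([0.1, -0.3, 4.8, 5.2, 0.2])
    print(restore(path, pdfs))   # [0.0, 0.0, 5.0, 5.0, 0.0], close to the input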
Experiments and Progress

Although VLHMM is a general compressor for various types of sequential data, our primary experiments are conducted on motion capture (MoCap) data. Because MoCap data record the 3D body movements of human performers, they are typical high-dimensional data with complex dynamics (Sigal & Black 2006). With the recent development of MoCap technology, researchers have started to consider the compression of MoCap data (Arikan 2006; Chattopadhyay et al. 2007). Like most lossy compression methods introduced in the section "Introduction", these two are based on carefully collected expertise: (Arikan 2006) approximates short clips of motion using Bezier curves and clustered principal component analysis, and (Chattopadhyay et al. 2007) is based on indexing the MoCap data by exploiting structural information derived from the skeletal virtual human model. Both papers claim high compression ratios with low visual-quality degradation. We compare the VLHMM method with both of these methods, as well as with other potential methods mentioned and tested in these two papers. Experiments show that the VLHMM method achieves a comparable compression ratio with comparatively little degradation (measured by the metrics proposed in (Arikan 2006)). In addition, a noticeable advantage of VLHMM over these methods is that VLHMM is not restricted to MoCap data. Experiment details can be found online at http://dbgroup.cs.tsinghua.edu.cn/wangyi/VLHMM, and more experiments on other types of sequential data, such as digital movies and audio, are ongoing.

Figure 1: Trajectories of compressed ballet motion. The red trajectories of the two hands show a short segment of MoCap ballet data. Its two compressed-and-restored versions are shown in green and blue, respectively. The green one is learned with S set to 60 and is compressed to 11.77% of the original motion; its restored trajectories are very close to those of the original motion. The blue one is compressed to 2.14% with S = 7. At this extreme compression ratio, the restored ending frame differs visibly from that of the original version, but the restored trajectories still preserve the general structure of the original motion.

References

Arikan, O. 2006. Compression of motion capture databases. In Proc. ACM SIGGRAPH.

Chattopadhyay, S., et al. 2007. Human motion capture data compression by model-based indexing: A power aware approach. IEEE Trans. on Visualization and Computer Graphics 13(1):5-14.

Friedman, N. 1997. Learning Bayesian networks in the presence of missing values and hidden variables. In Proc. Uncertainty in AI.

Rissanen, J. 1983. A universal data compression system. IEEE Trans. on Information Theory 29:656-664.

Sigal, L., and Black, M. J. 2006. HumanEva: Synchronized video and motion capture dataset for evaluation of articulated human motion. Technical report, Department of Computer Science, Brown University.

Wang, Y., et al. 2006. Mining complex time-series data by learning Markovian models. In Proc. IEEE ICDM.