Video Tonal Stabilization via Color States Smoothing

Yinting Wang, Dacheng Tao, Senior Member, IEEE, Xiang Li, Mingli Song, Senior Member, IEEE, Jiajun Bu, Member, IEEE, and Ping Tan, Member, IEEE

Abstract— We address the problem of removing video color tone jitter that is common in amateur videos recorded with hand-held devices. To achieve this, we introduce the color state to represent the exposure and white balance state of a frame. The color state of each frame can be computed by accumulating the color transformations of neighboring frame pairs. The tonal changes of the video can then be represented by a time-varying trajectory in color state space. To remove the tone jitter, we smooth the original color state trajectory by solving an L1 optimization problem with PCA dimensionality reduction. In addition, we propose a novel selective strategy to remove small tone jitter while retaining extreme exposure and white balance changes to avoid serious artifacts. Quantitative evaluation and visual comparison with previous work demonstrate the effectiveness of our tonal stabilization method. This system can also be used as a preprocessing tool for other video editing methods.

Index Terms— Tonal stabilization, color state, L1 optimization, selective strategy.

Manuscript received September 24, 2013; revised March 23, 2014 and July 23, 2014; accepted September 5, 2014. Date of publication September 17, 2014; date of current version September 30, 2014. This work was supported in part by the National Natural Science Foundation of China under Grant 61170142, in part by the Program of International Science and Technology Cooperation under Grant 2013DFG12840, in part by the National High Technology Research and Development Program of China under Grant 2013AA040601, and in part by the Australian Research Council under Projects FT-130101457 and DP-120103730. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Joseph P. Havlicek. (Corresponding author: Mingli Song.)

Y. Wang, X. Li, M. Song, and J. Bu are with the College of Computer Science, Zhejiang University, Hangzhou 310027, China (e-mail: brooksong@ieee.org).

D. Tao is with the Centre for Quantum Computation and Intelligent Systems, Faculty of Engineering and Information Technology, University of Technology, Sydney, NSW 2007, Australia (e-mail: dacheng.tao@uts.edu.au).

P. Tan is with the School of Computing Science, Simon Fraser University, Burnaby, BC V5A 1S6, Canada (e-mail: pingtan@sfu.ca).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIP.2014.2358880

I. INTRODUCTION

A VIDEO captured with a hand-held device, such as a cell-phone or a portable camcorder, often suffers from undesirable exposure and white balance changes between successive frames. This is caused mainly by the continuous automatic exposure and white balance control of the device in response to illumination and content changes of the scene. We use "tone jitter" to describe these undesirable exposure and white balance changes. The first row of Fig. 1 shows an example of a video with tone jitter; it can be seen that some surfaces (e.g., leaves, chairs and glass windows) in frames extracted from the video have different exposures and white balances. It is of great importance to create a tonally stabilized video by removing tone jitter for online sharing or further processing.
In this paper, we therefore address the video tonal stabilization problem, i.e., removing undesirable tone jitter from a video.

Fig. 1. Five still frames extracted from a video with unstable tone and the results of tonal stabilization. Top: the original frames. Middle: the result of removing all tone jitter. Bottom: the result using our tonal stabilization strategy.

Farbman and Lischinski [1] have proposed a method to stabilize the tone of a video. One or more frames of the input video are first designated as anchors, and then an adjustment map is computed for each frame to make all frames appear to be filmed with the same exposure and white balance settings as the corresponding anchor. The adjustment map is propagated from one frame to its neighbor, based on the assumption that a large number of the pixel grid points from the two neighboring frames sample the same scene surfaces. However, this assumption is often violated, especially when the camera undergoes sudden motion or the scene has complex textures. In this case, a very small corresponding pixel set is produced, and erratic color changes occur in some regions of the final output. The performance of this method also depends on the anchor selection. It is therefore tedious for users to carefully examine the entire video and select several frames as anchors following strict rules. If we simply set one anchor to the middle frame, or two anchors to the first and last frames, the resulting video may suffer from over-exposure artifacts or contrast loss, especially for videos of a scene with high dynamic range.

Exposure and white balance changes in an image sequence had been studied in panoramic image construction before the work of Farbman and Lischinski on tonal stabilization. To compensate for these changes, earlier approaches compute a linear model that matches the averages of each channel over the overlapping area in the RGB [2] or YCbCr color space [3], while Zhang et al. [4] constructed a mapping function between the color histograms in the overlapping area. However, these models are not sufficiently accurate to represent tonal changes between frames and may produce unwanted compensation results. Other methods perform color correction using non-linear models, such as a polynomial mapping function [5] or a linear correction for chrominance combined with a gamma correction for luminance [6]. However, these models suffer from large accumulation errors and high computational complexity when adapted to video tonal stabilization.

If the camera response function is known, the video tone can be stabilized by applying the camera response function inversely to each frame. Several attempts have been made to model the camera response function by utilizing a gamma curve [7], a polynomial [8], a semi-parametric model [9] or a PCA model [10]. However, most of these methods require perfect pixel-level image alignment, which is unrealistic in practice for amateur videos. The work of Kim et al. [11], [12] jointly tracks features and estimates the radiometric response function of the camera as well as the exposure differences between frames. Grundmann et al. [13] employed the KLT feature tracker to find the pixel correspondences for alignment.
After alignment, they locally computed the response curves for key frames and then interpolated these curves to generate the pixel-to-irradiance mapping. These two methods adjust all frames to have the same exposure and white balance according to the estimated response curves, without taking into account any changes in the illumination and content of the scene; this leads to artifacts of over-exposure, contrast loss or erratic color in the results.

Color transfer is a topic that is highly related to this paper. It is possible to stabilize a video using a good color transformation model to make all frames have a tone similar to a selected reference frame. Typical global color transformation models are based on a Gaussian distribution [14], [15] or histogram matching [16]. An and Pellacini [17] proposed a joint model that utilizes an affine model for chrominance and a mapping curve for luminance. Chakrabarti et al. [18] extended the six-parameter color model introduced in [19] to three nonlinear models (independent exponentiation, independent polynomial and general polynomial) and showed that the general polynomial model has the smallest RMS error. Local model-based methods [20]–[23] either segment the images and then compute a local model for each corresponding segment pair, or estimate the local mapping between a small region of the source image and the target image and then propagate the adjustment to the whole image. While these global and local models are powerful for color transfer between a pair of images, stabilizing the tone of a video by frame-to-frame color transfer is still impractical because of error accumulation. Furthermore, they cannot handle the large exposure and white balance changes contained in some videos.

Commercial video editing tools, such as Adobe Premiere or After Effects, can be used to remove tone jitter. However, many user interactions are required to manually select the key frames and edit their exposure and white balance.

In summary, there are two major difficulties in stabilizing the tone of a video:

• How to represent the tone jitter? A robust model is required to describe the tonal change between frames. Because the video contains camera motion and the exposure and white balance settings are not constant, it is very challenging to model the exposure and white balance changes accurately.

• How can the tone jitter be removed selectively? A good strategy is needed for tonal stabilization. It should remove tonal jitter caused by imperfect automatic exposure and white balance control, while preserving necessary tonal changes due to illumination and content changes of the scene. Videos captured in complex lighting conditions may have a wide exposure and color range, and neighboring frame pairs from such videos may exhibit very sharp color or exposure changes. Removing these sharp changes will produce artifacts of over-exposure, contrast loss or erratic colors (refer to the second row of Fig. 1). A perfect tonal stabilization strategy will eliminate small tone jitter while preserving sharp exposure and white balance changes, as in the result shown in the last row of Fig. 1.

To overcome these two difficulties, a novel video tonal stabilization framework is proposed in this paper. We introduce a new concept, the color state, which is a parametric representation of the exposure and white balance of a frame. The tone jitter can then be represented by the change of the color states between two successive frames.
To remove the tone jitter, a smoothing technique is applied to the original color states to obtain the optimal color states. We then adjust each frame to its new color state and generate the final output video. In this way, our method stabilizes the tone of the input video and increases its visual quality. Additionally, the proposed method can also serve as a pre-processing step for other video processing and computer vision applications, such as video segmentation [24], object tracking [25], etc.

Inspired by the camera shake removal work of Grundmann et al. [26], in which the camera pose of each frame in an input video is first recovered and then smoothed to produce a stabilized result, our method further extends the framework and applies it to this new video tonal stabilization problem. Specifically, the contributions of our work are as follows:

• We use the color state, a parametric representation, to describe the exposure and white balance of an image. With this representation, the tone of a frame is described as a point in a high-dimensional space, and the video tonal stabilization problem can be modeled as a smoothing process of the original color states;

• For the first time, we propose a selective strategy to remove undesirable tone jitter while preserving exposure and white balance changes due to sharp illumination and scene content changes. This strategy can help to avoid the artifacts of over-exposure, contrast loss or erratic color when processing videos with high dynamic ranges of tone;

• To achieve tonal stabilization, we combine PCA dimensionality reduction with linear programming for color state smoothing. This not only significantly improves the stabilization results but also greatly reduces the computational cost.

II. OVERVIEW

In this paper, we use the color state to represent the exposure and white balance of each frame in the input video. With this representation, the tonal changes between successive frames form a time-varying trajectory in the space of color states. Undesirable tone jitter in the video can then be removed by adaptively smoothing the trajectory.

Fig. 2 shows the flowchart of our method. We first conduct a spatial alignment and find the corresponding pixel pairs between successive frames. This helps us estimate the original color states, denoted as S. The path of S is then smoothed by an L1 optimization with PCA dimensionality reduction to obtain the stabilized color states P. An update matrix Bt is then estimated, and by applying it, each frame t is transferred from the original state St to the new color state Pt to generate the final output video.

Fig. 2. Flowchart of our tonal stabilization method. (a) The input frames. (b) The aligned frames. (c) The correspondence masks. (d) The original color states St. (e) Mt. (f) The new color states Pt. (g) The update matrices Bt. (h) The output frames.

We propose a selective strategy to implement video tonal stabilization. Because some videos have sharp exposure and white balance changes, transferring all of the frames to have the same tone will result in serious artifacts. Our goal is to keep the color states constant in the sections of the video with little tone jitter and give the color states a smooth transition between the sections with sharp tone changes.
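Before detailing each stage, the flow of Fig. 2 can be summarized in a short sketch. This is only an illustrative outline under our own assumptions: the helper names (estimate_pairwise_transform, smooth_color_states, apply_color_transform) are hypothetical placeholders for the steps described in Sections III and IV, not functions from the authors' C++ system.

```python
import numpy as np

def tonal_stabilize(frames, omega_s=0.1):
    """Illustrative outline of the pipeline in Fig. 2 (helper names are hypothetical)."""
    # (a)-(c) Align neighboring frames and estimate a 4x4 affine color
    # transformation A_t^{t-1} for every successive pair (Section III-B).
    A = [estimate_pairwise_transform(frames[t - 1], frames[t])
         for t in range(1, len(frames))]

    # (d) Accumulate the transformations into per-frame color states S_t,
    # with S_0 fixed to the identity (Eq. (3)).
    S = [np.eye(4)]
    for A_t in A:
        S.append(A_t @ S[-1])

    # (e)-(f) Smooth the color-state trajectory with the L1/PCA scheme of
    # Section IV to obtain the new states P_t.
    P = smooth_color_states(S, omega_s)

    # (g)-(h) Map every frame from S_t to P_t via the update matrix
    # B_t = P_t S_t^{-1} (Eq. (11)) and re-render the output.
    return [apply_color_transform(P[t] @ np.linalg.inv(S[t]), frames[t])
            for t in range(len(frames))]
```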
To smooth the path of color states, we adopt the idea in [26]: the smoothed path should consist of three types of motion, corresponding to different situations of exposure and white balance change:

• Static: A static path means the final color states stay unchanged, i.e., D_t^1(P) = 0, where D_t^n(·) is the n-th derivative at frame t.

• Constant speed: A constant rate of change allows the tone of the video to change uniformly from one color state to another, i.e., D_t^2(P) = 0.

• Constant acceleration: The segments of static and constant-rate motion are both stable; constant acceleration in color state space is needed to connect two discrete stable segments. A transition with constant acceleration from one stable segment to another makes the video tone change smoothly, i.e., D_t^3(P) = 0.

To obtain an optimal path composed of distinct constant, linear and parabolic segments instead of a superposition of them, we use L1 optimization to minimize the derivatives of the color states. Our main reason for choosing L1 rather than L2 optimization is that the solution induced by the L1 cost function is sparse, i.e., it attempts to satisfy many of the above motions along the path exactly. The computed path therefore has derivatives that are exactly zero for most segments, which is very suitable for our selective strategy. On the other hand, L2 optimization satisfies the above motions on average (in the least-squares sense), which results in small but non-zero gradients. Qualitatively, the L2-optimized color state path always has some small non-zero motion (most likely in the direction of the original color state motion), while the L1-optimized path is composed only of segments resembling static, constant speed and constant acceleration motion.

The rest of this paper is organized as follows. In Section III, we give a clear definition of the color state and show how to estimate it. A color state smoothing method is presented in Section IV to stabilize the path of color states. We show our experimental results in Section V and conclude the paper in Section VI.

III. DEFINITION OF COLOR STATE

A. Frame Color State

In this paper, we use the term "color state" to represent the exposure and white balance of an image. Let St denote the color state of frame t of the video. The change from color state St−1 to St is considered to be the exposure and white balance change between these two frames. We use the following affine transformation to model the color state change between two successive frames,

A = \begin{bmatrix} a_{00} & a_{01} & a_{02} & b_0 \\ a_{10} & a_{11} & a_{12} & b_1 \\ a_{20} & a_{21} & a_{22} & b_2 \\ 0 & 0 & 0 & 1 \end{bmatrix}.    (1)

An affine transformation subsumes a series of simpler transformations, such as translation, scaling, rotation or similarity transformations, which can model exposure and white balance changes well. An affine model has been successfully applied to user-controllable color transfer [17] and color correction [27]. In practice, although most cameras contain non-linear processing components, we find in our experiments that an affine model approximates the non-linear transformation well and produces results with negligible errors.

Given a pair of images I and J of different tones, a color transformation function A_i^j can be applied to transfer the pixels in I so that they have the same exposure and white balance as their corresponding pixels in J.
Let x and x′ denote a pair of corresponding pixels in I and J, respectively, and let Ix = [Ix^R, Ix^G, Ix^B]^T and Jx′ = [Jx′^R, Jx′^G, Jx′^B]^T represent the colors of these two pixels. Then Jx′ = A_i^j(Ix), i.e.,

\begin{bmatrix} J_{x'}^R \\ J_{x'}^G \\ J_{x'}^B \end{bmatrix} = \begin{bmatrix} a_{00} & a_{01} & a_{02} \\ a_{10} & a_{11} & a_{12} \\ a_{20} & a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} I_x^R \\ I_x^G \\ I_x^B \end{bmatrix} + \begin{bmatrix} b_0 \\ b_1 \\ b_2 \end{bmatrix}.    (2)

Note that the color transfer process in all our experiments is performed in the log domain because of the gamma correction applied to input videos.

Let A_t^{t−1} denote the color transformation between frames t−1 and t. Then the color state of frame t can be represented by St = A_t^{t−1} St−1. The color transformation between S0 and St can be computed by accumulating all of the transformation matrices from frame 0 to frame t, i.e.,

S_t = A_t^{t-1} \cdots A_2^1 A_1^0 S_0.    (3)

Fig. 3. Color transfer results by our affine model. (a) The source image I. (b) The target image J. (c) The aligned image. (d) The correspondence mask. The white pixels are the corresponding pixels used to estimate A, and the black ones are the expelled outliers. (e) The result without the constraint term. (f) The result with the identity constraint and a uniform weight ωc = 2 × 10^3.

Thus St is a 4 × 4 affine matrix and has 12 degrees of freedom. We can further set the color state of the first frame, S0, to be the identity matrix. Therefore, to compute the color state of each video frame, we only need to estimate the color transformation matrices for all neighboring frame pairs. In the next subsection, we present how to estimate the color transformation model.
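To make Eqs. (2) and (3) concrete, the following minimal numpy sketch applies a 4 × 4 affine color transform to a frame in the log domain and chains pairwise transforms into color states. The function names are ours and the correspondence search described next is omitted; this is an illustration under our assumptions, not the authors' implementation.

```python
import numpy as np

def apply_color_transform(A, frame, eps=1e-4):
    """Apply a 4x4 affine color transform (Eqs. (1)-(2)) to an HxWx3 frame.

    Following the paper, the transform acts on log-RGB values to account
    for the gamma correction of the input video.
    """
    log_rgb = np.log(np.clip(frame.astype(np.float64), eps, None))
    h, w, _ = log_rgb.shape
    hom = np.concatenate([log_rgb.reshape(-1, 3),
                          np.ones((h * w, 1))], axis=1)       # homogeneous colors
    out = hom @ A[:3, :].T                                    # top 3x4 block of A
    return np.exp(out).reshape(h, w, 3)

def accumulate_color_states(pairwise_A):
    """Chain pairwise transforms A_t^{t-1} into color states S_t (Eq. (3)).

    S_0 is the identity; S_t = A_t^{t-1} ... A_1^0 S_0.
    """
    states = [np.eye(4)]
    for A in pairwise_A:
        states.append(A @ states[-1])
    return states
```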
B. Color Transformation Model Estimation

For a common scene point projected into two different frames I and J at pixel locations x and x′, respectively, the colors of these two pixels, Ix and Jx′, should be the same. Therefore, if the corresponding pixels can be found, the matrix A_i^j describing the color transformation between frames I and J can be estimated by minimizing

\sum_{x} \left\| J_{x'} - A_i^j(I_x) \right\|^2.    (4)

To estimate the color transformation matrix, we first need to find the pixel correspondences. However, we cannot use the sparse matching of local feature points directly to estimate the transformation, because local feature points are usually located at corners, where the surface color is not well defined. On the positive side, local feature descriptors (such as SIFT [28] and LBP [29]) are robust to tone differences. Thus, we can use the sparse matching of local feature points to align frames. To achieve that, we track the local feature points using pyramidal Lucas-Kanade [30] and then compute a homography between two successive frames to align them. Fig. 3 shows a challenging alignment case in which (c) is the result of aligning (a) to (b). After alignment, two pixels having the same coordinates in the two images can be treated as a candidate corresponding pixel pair.

To estimate the color transformation more accurately, we further process the candidate correspondence set to remove outliers. Firstly, the video frames may contain noise, which affects the model computation; to avoid that, we employ a bilateral filter [31] to smooth the video frames. Secondly, the colors of pixels on edges are not well defined and cannot be used to estimate the color transformation; therefore, we conduct edge detection and exclude pixels around edges. Thirdly, because under- and over-exposure truncation may affect the model estimation, we discard all under- and over-exposed pixels from the candidate correspondence set. Fourthly, we adopt the RANSAC algorithm to further expel outliers during the model estimation, avoiding the effect caused mainly by noise and dynamic objects in the scene (such as moving persons, cars or trees). Fig. 3(d) shows the correspondence mask. Note that in our implementation, frames are downsampled to a small size (shorter side of 180 pixels), so that the computational cost is reduced while an accurate color transformation can still be estimated.

We notice that if we estimate the color transformation by directly minimizing Equation (4), the affine model tends to over-fit the data and accumulate errors at each stage, especially for scenes that contain large regions of similar color. To avoid this over-fitting problem, we add a regularization term to Equation (4). Because the color tone of a natural video is unlikely to change much between two successive frames, the color transformation of two successive frames should be very close to an identity matrix. Based on this observation, the estimation of the color transformation is re-formulated as

\sum_{x} \left\| J_{x'} - A_i^j(I_x) \right\|^2 + \frac{\omega_c}{|X|} \left\| A_i^j - I_{4\times4} \right\|^2.    (5)

Here, I_{4×4} is a 4 × 4 identity matrix, ωc/|X| is the weight used to combine the two terms, |X| denotes the number of corresponding pixel pairs, and ωc was set to 2 × 10^3 in our experiments. The identity constraint helps to choose the solution closest to the identity matrix when several solutions yield similarly small errors for the first term of Equation (5); this reduces the over-fitting problem and improves the estimation accuracy. Taking the scene in Fig. 3 as an example, the model estimated without the regularizer over-fits to the creamy-white pixels (table, bookshelf and wall) and causes large errors in the highlighted regions.

To provide a numerical comparison, we placed a color checker chart into this scene. The accuracy of color transfer is measured by the angular error [32] in RGB color space, which is the angle between the mean color c of the pixels inside a color quad in the transfer result and their mean ĉ in the target image,

e_{ANG} = \arccos\left( \frac{c^{T}\hat{c}}{\|c\|\,\|\hat{c}\|} \right).

The average angular errors over the color quads in (e) and (f) are 5.943° and 1.897°, respectively. The Red quad (Row 3, Col 3) in (e) has the highest error, 22.03°. For comparison, the highest error of the color checker chart in (f) is 4.476°, for Light Blue (Row 3, Col 6). Note that the color checker chart is not used to aid the model estimation.
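Under our reading, Eq. (5) is a ridge regression toward the identity and admits a closed-form solution. The sketch below (function names are ours) fits the affine model from already matched and filtered pixel pairs and also implements the angular-error metric used above; the bilateral filtering, edge/exposure masking and RANSAC steps of the paper are omitted for brevity.

```python
import numpy as np

def estimate_color_transform(src_rgb, dst_rgb, omega_c=2e3, eps=1e-4):
    """Least-squares fit of the affine color model with the identity
    regularizer of Eq. (5).

    src_rgb, dst_rgb: (N, 3) arrays of corresponding pixel colors (after
    alignment and outlier removal). Returns a 4x4 affine matrix.
    """
    X = np.log(np.clip(src_rgb.astype(np.float64), eps, None))
    Y = np.log(np.clip(dst_rgb.astype(np.float64), eps, None))
    n = len(X)
    Xh = np.concatenate([X, np.ones((n, 1))], axis=1)           # (N, 4)
    lam = omega_c / n                                           # omega_c / |X|
    M0 = np.concatenate([np.eye(3), np.zeros((1, 3))], axis=0)  # identity prior (4, 3)
    # Ridge regression toward the identity:
    # (Xh^T Xh + lam I) M = Xh^T Y + lam M0  (last row of A is fixed to [0 0 0 1])
    M = np.linalg.solve(Xh.T @ Xh + lam * np.eye(4), Xh.T @ Y + lam * M0)
    A = np.eye(4)
    A[:3, :] = M.T                                              # rows are [a | b]
    return A

def angular_error_deg(c, c_hat):
    """Angular error of Section III-B: angle between two mean RGB vectors."""
    cos = np.dot(c, c_hat) / (np.linalg.norm(c) * np.linalg.norm(c_hat))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
```

Applied to every neighboring pair, the returned matrices play the role of the A_t^{t−1} that are accumulated into color states by Eq. (3).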
IV. COLOR STATES SMOOTHING

To remove tone jitter, we could transfer all video frames to a constant color state. However, as shown in Section I, large tone changes need to be retained in videos that contain large illumination or scene content changes; otherwise, artifacts such as over-exposure, contrast loss or erratic color may arise after removing the tone jitter. In this paper, we propose a selective strategy for video tonal stabilization. Under this selective strategy, we generate a sequence of frames with smooth color states while constraining the new color states to stay close to the original ones. Here, 'smooth' means that the color states remain static or are in uniform or uniformly accelerated motion. We utilize an L1 optimization to find the smooth color states by minimizing the energy function

E = \omega_s E(P, S) + E(P).    (6)

E(P, S) reflects our selective strategy; it is a soft constraint that keeps the new color states P from deviating from the original states S,

E(P, S) = |P - S|_1.    (7)

E(P) smoothes the frame color states. As mentioned in Section II, the first, second and third derivatives of Pt should be minimized so that the path of Pt consists of segments of static, constant speed or constant acceleration motion,

E(P) = \omega_1 |D^1(P)|_1 + \omega_2 |D^2(P)|_1 + \omega_3 |D^3(P)|_1,    (8)

where D^n(P) represents the n-th derivative of the new color states P. Minimizing |D^1(P)|_1 causes the color states to tend to be static. Likewise, |D^2(P)|_1 constrains the color states to uniform motion, and |D^3(P)|_1 relates to the acceleration of the color state motion. The weights ω1, ω2 and ω3 balance these three derivatives. In our experiments, ω1, ω2 and ω3 were set to 50, 1 and 100, respectively.

The weight ωs combining the two terms is the key parameter in our method. It makes the new color states either follow an ideally smooth path or remain very close to the original states. We conducted many experiments to analyze ωs and found that [0.01, 0.3] is its useful tuning range. When ωs = 0.01, the new color states remain constant. In contrast, when ωs = 0.3, the new color state paths retain part of the initial motion. A detailed discussion of parameter setting is presented in Section V.
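For concreteness, here is a small numpy sketch (ours, under the assumption that the color states are stored as T × 12 arrays of their free elements) that builds the forward-difference operators D^n and evaluates the energy of Eqs. (6)–(8) for a candidate path.

```python
import numpy as np

def forward_diff(T, order):
    """Forward-difference operator D^n as a (T - n) x T matrix."""
    D = np.eye(T)
    for _ in range(order):
        D = D[1:, :] - D[:-1, :]
    return D

def l1_energy(P, S, w_s=0.1, w1=50.0, w2=1.0, w3=100.0):
    """Evaluate E = w_s * E(P, S) + E(P) of Eqs. (6)-(8).

    P, S: (T, 12) arrays holding the 12 free elements of the new and
    original color states for each frame.
    """
    T = len(P)
    data = np.abs(P - S).sum()                                  # Eq. (7)
    smooth = sum(w * np.abs(forward_diff(T, n) @ P).sum()       # Eq. (8)
                 for n, w in ((1, w1), (2, w2), (3, w3)))
    return w_s * data + smooth
```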
Here we discuss the smoothness of the color states across frames. We could optimize Pt similarly to the L1 camera shake removal method [26], using forward differencing to derive |D^n(P)|_1 and minimizing the residuals. However, this approach has limitations. In its optimization objective, the 12 elements of the color state are smoothed independently, with the result that the 12 elements do not change synchronously. Fig. 4 shows the curves of the new states generated by an L1 optimization-based method as in [26].

Fig. 4. The curves of the 12 color state elements before and after the L1 optimization without PCA. Green curves: original color state elements. Red curves: new color state elements. From top to bottom, left to right, each curve corresponds to an element of the color state. The vertical axis plots the value of the corresponding element, and the horizontal axis plots the frame index.

To relieve this problem in camera shake removal, an inclusion constraint is used in [26]: the four corners of the crop rectangle must reside inside the frame rectangle. However, there are no such corners or boundaries for color states, so optimizing the new color states directly by the L1 optimization leaves some new color states outside the clusters of the original color states, and the corresponding output frames have erratic colors (shown in the middle row of Fig. 5).

Fig. 5. The comparison of stabilization results generated by the L1 optimization methods without and with PCA. Top: the input video. Middle: the result generated by the L1 optimization without PCA, which contains erratic color in some output frames. Bottom: the result using the L1 optimization with PCA.

We therefore seek to improve the smoothing method so that all 12 elements of the color state change in the same way. The path of the original color states is a curve in a 12-D space; if we can constrain all of the new color states to lie along a straight line, the above problem is solved. We employ Principal Component Analysis (PCA) [33], [34] to find a linear subspace near the original color state path. Using PCA, the color states can be represented by S_t = \bar{S} + \sum_i \beta_t^{c_i} S^{c_i}, where \bar{S} denotes the mean color state over t, and S^{c_i} and \beta_t^{c_i} denote the eigenvector and the coefficient of the i-th component, respectively. \bar{S} and S^{c_i} are 4 × 4 matrices, and \beta_t^{c_i} is a scalar. The mean color state \bar{S} and the eigenvector of the first component S^{c_1} are used to build this linear subspace, and the new color states are encoded as

P_t = \bar{S} + M_t S^{c_1}.    (9)

Fig. 6. An example of the paths of the original color states (Red curve) and the linear subspace (Green curve) generated by PCA. Note that the plotted dots are not the real color states but simulated values. We choose a point in the scene and use its color curve over time to simulate the color state path.

Here, Mt denotes the new coefficient, which is a scalar. Because S^{c_1} is the first principal component, corresponding to the largest eigenvalue, the line of the new color states will not deviate much from the original color state path, as shown in Fig. 6. In this way, we limit the degrees of freedom of the solution to a first-order approximation. Then, in the smoothing method we only need to minimize the magnitude of the velocity and acceleration of the color state path in the L1 optimization and do not need to consider their direction changes. Our method differs from dimensionality-reduction-based smoothing methods that directly smooth the coefficients of the first or several major components with a low-pass filter; we instead find a smoothly changing function of t (a smooth curve), and then all 12 elements of the new color states share the same motion as the curve of Mt.

After PCA, our L1 optimization objective is re-derived and minimized based on Equation (9).

Minimizing E(P, S): The new formulation of Pt is substituted into Equation (7),

E(P, S) = \sum_t \left| \bar{S} + M_t S^{c_1} - S_t \right|_1.    (10)

Minimizing E(P): Forward differencing is used to compute the derivatives of P. Then

|D^1(P)| = \sum_t |P_{t+1} - P_t| = \sum_t |(\bar{S} + M_{t+1} S^{c_1}) - (\bar{S} + M_t S^{c_1})| \le |S^{c_1}| \sum_t |M_{t+1} - M_t|.

Because |S^{c_1}| is known, we only need to minimize \sum_t |M_{t+1} - M_t|, i.e., |D^1(M)|. Similarly, we can show that minimizing |D^2(P)| and |D^3(P)| is equivalent to minimizing |D^2(M)| and |D^3(M)|, respectively. Our goal is then to find a curve Mt such that (1) it changes smoothly and (2) after mapping along this curve, the new color states are close to the original states. Algorithm 1 (LP for Color States Optimization) summarizes the entire process of our L1 optimization. To minimize Equation (6), we introduce non-negative slack variables to bound each dimension of the color state derivatives and solve a linear programming problem as described in [26]. Using Mt to represent the color states, only a single unknown has to be estimated for each frame, which greatly reduces the size of the linear program. In our implementation, we employ COIN CLP, an open-source linear programming solver (freely available at http://www.coin-or.org/download/source/Clp/), to minimize the energy function and generate the sequence of stable states.

Fig. 7. The curves of the 12 color state elements before and after our L1 optimization with PCA. Green curves: original color state elements. Red curves: new color state elements.
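The linear program itself can be sketched as follows. This is a simplified reading of Algorithm 1: the data term is applied to the first-component coefficients βt rather than to the full 12-dimensional residual of Eq. (10), and SciPy's linprog is used in place of the COIN CLP solver employed by the authors; all names are ours.

```python
import numpy as np
from scipy.optimize import linprog

def smooth_pca_coefficients(beta, w_s=0.1, w1=50.0, w2=1.0, w3=100.0):
    """L1-smooth the first-component coefficients beta_t into M_t.

    Simplified sketch of Algorithm 1: |M_t - beta_t| replaces the full
    residual of Eq. (10); slack variables turn every absolute value into a
    pair of linear inequalities, as in [26].
    """
    T = len(beta)
    beta = np.asarray(beta, dtype=float)
    # Forward-difference operators D^1, D^2, D^3 as (T - n) x T matrices.
    D = {n: np.diff(np.eye(T), n=n, axis=0) for n in (1, 2, 3)}

    # Variable layout: [M (T) | e0 (T) | e1 (T-1) | e2 (T-2) | e3 (T-3)].
    sizes = [T, T, T - 1, T - 2, T - 3]
    starts = np.cumsum([0] + sizes)
    n_var = int(starts[-1])

    c = np.zeros(n_var)                      # objective: weighted sum of slacks
    for (s, size), w in zip(zip(starts[1:], sizes[1:]), (w_s, w1, w2, w3)):
        c[s:s + size] = w

    A_ub, b_ub = [], []
    def add_abs_bound(G, rhs, block):
        """Encode -e_block <= G M - rhs <= e_block as two inequality blocks."""
        m = G.shape[0]
        row = np.zeros((m, n_var))
        row[:, :T] = G
        row[:, starts[block]:starts[block] + m] = -np.eye(m)
        A_ub.append(row)
        b_ub.append(rhs)                     #  G M - e <=  rhs
        neg = row.copy()
        neg[:, :T] = -G
        A_ub.append(neg)
        b_ub.append(-rhs)                    # -G M - e <= -rhs
    add_abs_bound(np.eye(T), beta, 1)                     # data term, Eq. (7)
    for n in (1, 2, 3):
        add_abs_bound(D[n], np.zeros(T - n), n + 1)       # smoothness, Eq. (8)

    bounds = [(None, None)] * T + [(0, None)] * (n_var - T)
    res = linprog(c, A_ub=np.vstack(A_ub), b_ub=np.concatenate(b_ub),
                  bounds=bounds, method="highs")
    return res.x[:T]
```

The new color states then follow from Eq. (9) as Pt = S̄ + Mt S^c1, and each output frame is produced with the update matrix Bt = Pt St^{-1} described next.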
In contrast to the result of the L1 optimization without PCA, all 12 elements of the color state now obey the same motion. The last row of Fig. 5 shows examples of the output frames generated by our L1 optimization with PCA; the problem of unusual color is avoided.

After obtaining the new color states, an update matrix Bt is calculated to transfer each frame from its original color state to the new color state. From the definition of the color state, Pt = Bt St. The update matrix Bt can then be computed as

B_t = P_t S_t^{-1}.    (11)

It is applied to the original frame t pixel by pixel to generate the new frame.

V. EXPERIMENTS AND EVALUATION

A. Parameter Setting

The weight ωs that balances the constraint and smoothness terms is the most important parameter in our system. As this weight is varied, the system generates different results. If ωs is small (such as 0.01), Mt tends to be a straight line, which maps all of the frames to the same color state and keeps the exposure and white balance unchanged throughout. For videos whose exposure and color ranges are very wide, such a straight-line mapping causes some frames to have artifacts, such as over-exposure, contrast loss or erratic color. If ωs is large (such as 1.0), Pt will be very close to St, and most color state changes will be retained. The weight ωs balances these two aspects, and it is difficult to find a single value suitable for all types of videos. We leave this parameter to be tuned by users and suggest tuning ωs among three levels, 0.01, 0.1 and 0.3, which were widely used in our experiments. Various weights were tried to stabilize the video in Fig. 1; the curves of Mt are shown in Fig. 8. A comparison of the output videos with different parameters is given in our supplementary material, which can be found on our project page (http://eagle.zju.edu.cn/~wangyinting/TonalStabilization/).

Fig. 8. The curves of Mt with different ωs for the video in Fig. 1. Red curves: Mt. Green curves: βt^c1. ω1 = 50, ω2 = 1 and ω3 = 100. (a) ωs = 0.01. (b) ωs = 0.1. (c) ωs = 0.3. (d) ωs = 1.0.

Three other parameters affecting the color state trajectory are ω1, ω2 and ω3. We explored different weights for stabilizing the video in Fig. 1 and plotted the curves of Mt in Fig. 9. If we set one of the three parameters to a large value and depress the other two, the new color state path tends to be (a) constant but discontinuous, (b) linear with sudden changes or (c) a smooth parabola, but it is rarely static. A more agreeable viewing experience is produced by setting ω1 and ω3 larger than ω2, because we hope that the optimal path can sustain longer static segments and remain thoroughly smooth. For this paper, we set ω1 = 50, ω2 = 1 and ω3 = 100; the corresponding curve of Mt is shown in Fig. 9(d).

Fig. 9. The curves of Mt with different ω1, ω2 and ω3 for the video in Fig. 1. Red curves: Mt. Green curves: βt^c1. In this experiment, ωs = 0.3. (a) ω1 = 100, ω2 = ω3 = 1. (b) ω2 = 100, ω1 = ω3 = 1. (c) ω3 = 100, ω1 = ω2 = 1. (d) ω1 = 50, ω2 = 1, ω3 = 100.
Fig. 10. The stabilization results generated by keeping the exposure and white balance of the 'preferred frame' unchanged. Top: the original video. Middle: the stabilization result with the first frame fixed. Bottom: the stabilization result with the last frame fixed. ωs,t is set to 100 for the preferred frame and 0.01 for the others.

In practical situations, users may prefer the exposure and white balance of some frames and hope to keep the tone of these frames unchanged. Our system can provide this function by using a non-uniform weight ωs,t instead of ωs. We ask the users to point out one or several frames as 'preferred frames' and set a higher ωs,t for these frames; the weights ωs,t for the other frames are chosen by our selective strategy. The new optimal color states for the 'preferred frames' will then be very close to the original ones. Fig. 10 gives an example in which the second row is the result generated by setting the first frame as the 'preferred frame', and the third row is the result with the last frame fixed. In this example, we set ωs,t to 100 for the 'preferred frame' and 0.01 for the others.

In a similar way, the weight ωc in Equation (5) can be changed to a spatially varying one, ωc,x, so that the estimated affine model is more accurate for the pixels with larger weights. We can extract the salient regions [35], [36] of each frame, or ask the users to mark their 'preferred regions' and track these regions in each frame. In this way, our system generates more satisfying results for the regions to which users pay more attention. A numerical comparison experiment is described in the next subsection.

B. Quantitative Evaluation

To illustrate the effectiveness of our tonal stabilization method, we employed a low-reflectance grey card for quantitative evaluation, as in [1]. We placed an 18% reflectance grey card into a simple indoor scene and recorded a video with obvious tone jitter using an iPhone 4S; five example frames from the video are shown in the first row of Fig. 11. This video was then processed by our tonal stabilization method. Note that the grey card was not used to aid or improve the stabilization.

We compared our results with Farbman and Lischinski's pixel-wise color alignment method (PWCA) [1]. For PWCA, we set the first frame as the anchor. To reach a similar processing result with our method, we set ωs,t to 100 for the first frame and 0.01 for the other frames, so that the exposure and white balance of the first frame were fixed and propagated to the others. Both the uniform weight ωc and the non-uniform weight ωc,x were tried in this experiment. For the uniform weight, we set ωc = 2 × 10^3. For the non-uniform weight, we set ωc,x to 10^4 for all of the corresponding pixel pairs inside the grey card and 2 × 10^3 for the others.

We measured the angular error [32] in RGB color space between the mean color of the pixels inside the grey card of each frame and that of the first frame. The plot in Fig. 11(a) shows the angular errors from the first frame for the original video (Red curve), the video stabilized by PWCA (Blue curve), and the results generated by our method with the uniform weight (Green curve) and the non-uniform weight (Dark Green curve). The second column of Table I lists the average errors over all frames for the original video and the results generated by PWCA and our method. Both PWCA and our method performed well, and our method with the non-uniform weight came out slightly ahead.

To assess the benefits of tonal stabilization for white balance estimation, we conducted an experiment similar to that presented in [1]: we applied the Grey-Edge family of algorithms [37] to a video and its stabilization results and compared the performance of white balance estimation.
The two white balance estimation methods chosen assume that some property of the scene, such as the average reflectance (Grey-World) or the average derivative (Grey-Edge), is achromatic. We computed the angular error [32] between the mean color inside the grey card of every frame and the ground truth; the plots of the estimation error are shown in Figs. 11(b) and (c). The third and fourth columns of Table I list the average angular errors after white balance estimation by Grey-World and Grey-Edge, respectively.

Fig. 11. Quantitative evaluation of tonal stabilization and consequent white balance correction. The first row shows several still frames from the input video. The second row compares the numerical errors of the original video and the results generated by PWCA and our method. (a) The angular errors in RGB color space of the average color within the grey card of each frame with respect to the first frame. (b) and (c) The angular errors from the ground truth after white balance estimation by Grey-World and Grey-Edge. Red curves: original video. Blue curves: the result of PWCA. Green curves: the result of our method with uniform ωc = 2 × 10^3. Dark Green curves: the result of our method with non-uniform ωc,x, 10^4 for the pixels within the grey card and 2 × 10^3 for the other pixels.

TABLE I. Mean angular errors over all frames. The second column gives the errors with respect to the first frame for the original video and its stabilization results. The third and fourth columns give the errors from the ground truth after white balance estimation by Grey-World and Grey-Edge, respectively.

The grey card requires that the camera motion during shooting not be large and that the camera not be too far from the scene; thus, the video used for evaluation is relatively simple. On the other hand, PWCA performs extremely well on large homogeneous regions (such as the grey card). These two factors explain why our method leads PWCA by only a small margin in the quantitative evaluations.

C. Comparison

From the discussion above, we find that both PWCA and our method can generate good results for simple videos. However, PWCA sometimes does not work well for videos that include scenes with complex textures or that contain sudden camera shake. In these cases, it produces a very small robust pixel set, and the final output is not completely stable. Our method aligns the successive frames and detects the corresponding pixels in a more robust way. Fig. 12 compares the result of PWCA with the anchor set to the central frame and our result with ωs = 0.01; our result is more stable.

Fig. 12. Comparison of the stabilization results generated by PWCA and our method. Top: the original video. Middle: the result of PWCA, which still contains tone jitter. Bottom: the result of our method.

Another advantage of our method is the selective stabilization strategy. It allows us to adaptively smooth the path of color states according to the original color states. Users need only tune the parameter ωs to generate results with different degrees of stability and choose the best of them. Even for a video with a large dynamic range, our method is still very convenient to use. Benefitting from the selective strategy, the stabilized result removes the small tone jitter and retains the sharp tone changes.
In contrast, to generate a comparable result, PWCA requires the user to choose the anchors carefully. If one or two anchors are set automatically, the result may include artifacts. The second row of Fig. 13 shows the result of PWCA with two anchors set to the first and last frames; some frames of the output video are clearly over-exposed.

Fig. 13. Another comparison of PWCA and our method. The right two frames of the result generated by PWCA are over-exposed.

D. Challenging Cases

A video containing noise or dynamic objects is very challenging for affine model estimation because of the outliers contributed by the noise and the dynamic objects. Our robust correspondence selection helps to handle these challenging cases by discarding most of the outliers. Figs. 14 and 15 show two examples of such challenging cases and their stabilization results from our method with a uniform ωc (2 × 10^3) and a small ωs (0.01). Because noise and dynamic objects do not affect PWCA, we choose its stabilization results as a baseline; they are shown in the second row of each figure. These two examples demonstrate that our method performs well for videos containing noise and dynamic objects, and our results are even slightly better than those generated by PWCA in terms of exposure and color consistency (refer to the third and fourth columns of Fig. 14 and the second column of Fig. 15).

Fig. 14. A video containing a dynamic object and its stabilization results. Top: the original frames from the video. Middle: the result of PWCA. Some tone jitter can be found in the 3rd and 4th columns. Bottom: the result of our method.

Fig. 15. A noisy video and its stabilization results with PWCA and our method. The exposure and color of the 2nd column are a little different from the other columns in the PWCA result.

E. Computational Overhead

The running time of our system depends on the length of the input video. When we compute the transformation matrix for two neighboring frames, the images are resized to a fixed small size, so the frame size does not significantly affect the running time. For the 540-frame video (1920 × 1080 in size) shown in Fig. 1, it took about 511 seconds to complete the entire stabilization process. Computing the color transformation matrix for each pair of successive frames took approximately 0.88 seconds, and the system spent 7.28 seconds on optimizing the color states. Our system is implemented in C++ without code optimization, running on a Dell Vostro 420 PC with an Intel Core2 Quad 2.40GHz CPU and 8GB of memory.

The running time of our method can be shortened by parallelization. Approximately 90% of the running time is spent on computing the color transformation matrices between neighboring frames, which is easily parallelizable because the frame registration and affine model estimation for each pair of neighboring frames can be carried out independently, as sketched below. Moreover, when processing a long video, we can first cut the video into several short sub-videos and set the frame shared by two sub-videos as a 'preferred frame'; these sub-videos can then be stabilized in parallel. In future work, we plan to accelerate our method further by using GPUs.
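A minimal Python sketch of this parallelization (our illustration, not the authors' C++ system): the per-pair work is distributed over worker processes, while the accumulation of Eq. (3) stays sequential. estimate_pairwise_transform is a hypothetical placeholder for the registration and affine fit of Section III-B.

```python
from concurrent.futures import ProcessPoolExecutor

def _estimate_pair(pair):
    # Alignment plus the regularized affine fit of Eq. (5) for one frame pair;
    # estimate_pairwise_transform is a placeholder for that per-pair routine.
    src, dst = pair
    return estimate_pairwise_transform(src, dst)

def estimate_all_transforms(frames, workers=4):
    """Estimate the pairwise color transformations A_t^{t-1} in parallel.

    Each (frame_{t-1}, frame_t) pair is independent, so registration and the
    affine-model fit can run concurrently; only the accumulation into color
    states (Eq. (3)) remains sequential.
    """
    pairs = list(zip(frames[:-1], frames[1:]))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(_estimate_pair, pairs))
```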
F. Discussion

Our method depends on the alignment of consecutive frames, so feature tracking failures will affect our stabilization results. Several situations may cause feature tracking to fail.

a) The neighboring frames are very homogeneous (e.g., wall, sky) and have too few matching feature points. In this situation, we assume the two source frames are already aligned. Because the frames are homogeneous, this does not result in large errors from misalignment, and our correspondence selection algorithm helps to discard the outliers from noise, edges, etc. Therefore, our method also performs well in this situation.

b) Very sharp brightness changes between consecutive frames may cause feature tracking to fail. We can adopt an iterative method to solve this problem. Denote the two neighboring frames by I and J. We first assume the two images are aligned, i.e., H_i^j = I_{3×3}. We estimate the color transfer model A_i^j, which is applied to frame I to bring the exposure and white balance of I and J closer. A new homography matrix H_i^j is then estimated to align the modified frame I with J. We repeat the computation of A_i^j and H_i^j; usually two or three iterations are sufficient to generate a good result. Most neighboring frames in natural videos do not have brightness changes sharp enough to affect feature tracking; we tested approximately 150 videos and never encountered this problem.

c) Most local features are located on non-rigid moving objects. This is the most challenging case, and the vast majority of feature-tracking-based vision algorithms cannot handle it, e.g., [38]. Because our method needs to track a feature for only two adjacent frames, serious artifacts do not result if the dynamic objects do not move too quickly; otherwise, our method will fail.

In addition to feature tracking, our method has another limitation. If the camera is moved from one scene to a new scene and then returned to the former, the color of the same surface in different sections of a video stabilized by our method may not be coherent. An example is shown in Fig. 16, in which a large stone appears, is passed by, and then reappears. We can see from the figure that the color of the stone has changed slightly in our output frames. This is caused by the error arising during color state computation: when we estimate the color transformation model for two neighboring images, if a surface with a particular color exists in only one frame, the computed model may be unsuitable for the region of this surface. Because the color state is the accumulation of color transformation matrices, the error of the color transformation for two frames is propagated to all later images. Another possible reason for this artifact is that our trajectory smoothing method cannot ensure that two similar original color states remain similar after stabilization. We leave these unsolved problems for future work.

Fig. 16. Failure case. These four frames are extracted from our stabilization result. The stone appears in two different sections of the video, and its colors in our result are not coherent.

VI. CONCLUSION

In this paper, we utilize a 4 × 4 matrix to model the exposure and white balance state of a video frame, which we refer to as the color state. PCA dimensionality reduction is applied to find a linear subspace of color states, and an L1 optimization-based method is proposed to generate smooth color states in this linear subspace and thereby achieve video tonal stabilization. Our experimental results and quantitative evaluation show the effectiveness of our stabilization method.
Compared to previous work, our method is better at finding pixel correspondences between two neighboring frames. In addition, we use a new stabilization strategy that retains some tone changes due to sharp illumination and scene content changes, so our method can handle videos with an extreme dynamic range of exposure and color. Our system removes tone jitter effectively and thus increases the visual quality of amateur videos. It can also be used as a pre-processing tool for other video editing methods.

REFERENCES

[1] Z. Farbman and D. Lischinski, "Tonal stabilization of video," ACM Trans. Graph., vol. 30, no. 4, pp. 1–89, 2011.
[2] G. Y. Tian, D. Gledhill, D. Taylor, and D. Clarke, "Colour correction for panoramic imaging," in Proc. 6th Int. Conf. Inf. Visualisat., 2002, pp. 483–488.
[3] S. J. Ha, H. I. Koo, S. H. Lee, N. I. Cho, and S. K. Kim, "Panorama mosaic optimization for mobile camera systems," IEEE Trans. Consum. Electron., vol. 53, no. 4, pp. 1217–1225, Nov. 2007.
[4] Z. Maojun, X. Jingni, L. Yunhao, and W. Defeng, "Color histogram correction for panoramic images," in Proc. 7th Int. Conf. Virtual Syst. Multimedia, Oct. 2001, pp. 328–331.
[5] B. Pham and G. Pringle, "Color correction for an image sequence," IEEE Comput. Graph. Appl., vol. 15, no. 3, pp. 38–42, May 1995.
[6] Y. Xiong and K. Pulli, "Color matching of image sequences with combined gamma and linear corrections," in Proc. Int. Conf. ACM Multimedia, 2010, pp. 261–270.
[7] S. Mann and R. W. Picard, "On being 'undigital' with digital cameras: Extending dynamic range by combining differently exposed pictures," in Proc. IS&T, 1995, pp. 442–448.
[8] T. Mitsunaga and S. K. Nayar, "Radiometric self calibration," in Proc. Comput. Vis. Pattern Recognit., vol. 1, Jun. 1999, pp. 374–380.
[9] F. M. Candocia and D. A. Mandarino, "A semiparametric model for accurate camera response function modeling and exposure estimation from comparametric data," IEEE Trans. Image Process., vol. 14, no. 8, pp. 1138–1150, Aug. 2005.
[10] M. D. Grossberg and S. K. Nayar, "Modeling the space of camera response functions," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 10, pp. 1272–1282, Oct. 2004.
[11] S. J. Kim, J.-M. Frahm, and M. Pollefeys, "Joint feature tracking and radiometric calibration from auto-exposure video," in Proc. IEEE 11th Int. Conf. Comput. Vis., Oct. 2007, pp. 1–8.
[12] S. J. Kim and M. Pollefeys, "Robust radiometric calibration and vignetting correction," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 4, pp. 562–576, Apr. 2008.
[13] M. Grundmann, C. McClanahan, S. B. Kang, and I. Essa, "Postprocessing approach for radiometric self-calibration of video," in Proc. IEEE Int. Conf. Comput. Photography, Apr. 2013, pp. 1–9.
[14] E. Reinhard, M. Adhikhmin, B. Gooch, and P. Shirley, "Color transfer between images," IEEE Comput. Graph. Appl., vol. 21, no. 5, pp. 34–41, Sep./Oct. 2001.
[15] F. Pitié, A. C. Kokaram, and R. Dahyot, "Automated colour grading using colour distribution transfer," Comput. Vis. Image Understand., vol. 107, nos. 1–2, pp. 123–137, Jul. 2007.
[16] T. Oskam, A. Hornung, R. W. Sumner, and M. Gross, "Fast and stable color balancing for images and augmented reality," in Proc. 2nd Int. Conf. 3D Imag., Modeling, Process., Visualizat. Transmiss., Oct. 2012, pp. 49–56.
[17] X. An and F. Pellacini, "User-controllable color transfer," Comput. Graph. Forum, vol. 29, no. 2, pp. 263–271, May 2010.
[18] A. Chakrabarti, D. Scharstein, and T. Zickler, "An empirical camera model for internet color vision," in Proc. Brit. Mach. Vis. Conf., vol. 1, 2009, pp. 1–4.
[19] G. Finlayson and R. Xu, "Illuminant and gamma comprehensive normalisation in log RGB space," Pattern Recognit. Lett., vol. 24, no. 11, pp. 1679–1690, Jul. 2003.
[20] S. Kagarlitsky, Y. Moses, and Y. Hel-Or, "Piecewise-consistent color mappings of images acquired under various conditions," in Proc. IEEE 12th Int. Conf. Comput. Vis., Sep./Oct. 2009, pp. 2311–2318.
[21] Y.-W. Tai, J. Jia, and C.-K. Tang, "Local color transfer via probabilistic segmentation by expectation-maximization," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., vol. 1, Jun. 2005, pp. 747–754.
[22] D. Lischinski, Z. Farbman, M. Uyttendaele, and R. Szeliski, "Interactive local adjustment of tonal values," ACM Trans. Graph., vol. 25, no. 3, pp. 646–653, Jul. 2006.
[23] Y. Li, E. Adelson, and A. Agarwala, "ScribbleBoost: Adding classification to edge-aware interpolation of local image and video adjustments," Comput. Graph. Forum, vol. 27, no. 4, pp. 1255–1264, 2008.
[24] Q. Zhu, Z. Song, Y. Xie, and L. Wang, "A novel recursive Bayesian learning-based method for the efficient and accurate segmentation of video with dynamic background," IEEE Trans. Image Process., vol. 21, no. 9, pp. 3865–3876, Sep. 2012.
[25] S. Das, A. Kale, and N. Vaswani, "Particle filter with a mode tracker for visual tracking across illumination changes," IEEE Trans. Image Process., vol. 21, no. 4, pp. 2340–2346, Apr. 2012.
[26] M. Grundmann, V. Kwatra, and I. Essa, "Auto-directed video stabilization with robust L1 optimal camera paths," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2011, pp. 225–232.
[27] H. Siddiqui and C. A. Bouman, "Hierarchical color correction for camera cell phone images," IEEE Trans. Image Process., vol. 17, no. 11, pp. 2138–2155, Nov. 2008.
[28] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.
[29] T. Ojala, M. Pietikainen, and T. Maenpaa, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 7, pp. 971–987, Jul. 2002.
[30] J. Shi and C. Tomasi, "Good features to track," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 1994, pp. 593–600.
[31] C. Tomasi and R. Manduchi, "Bilateral filtering for gray and color images," in Proc. 6th Int. Conf. Comput. Vis., Jan. 1998, pp. 839–846.
[32] S. D. Hordley, "Scene illuminant estimation: Past, present, and future," Color Res. Appl., vol. 31, no. 4, pp. 303–314, Aug. 2006.
[33] I. T. Jolliffe, Principal Component Analysis. New York, NY, USA: Springer-Verlag, 1986, p. 487.
[34] B.-K. Bao, G. Liu, C. Xu, and S. Yan, "Inductive robust principal component analysis," IEEE Trans. Image Process., vol. 21, no. 8, pp. 3794–3800, Aug. 2012.
[35] M.-M. Cheng, G.-X. Zhang, N. J. Mitra, X. Huang, and S.-M. Hu, "Global contrast based salient region detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2011, pp. 409–416.
[36] C. Jung and C. Kim, "A unified spectral-domain approach for saliency detection and its application to automatic object segmentation," IEEE Trans. Image Process., vol. 21, no. 3, pp. 1272–1283, Mar. 2012.
[37] J. van de Weijer, T. Gevers, and A. Gijsenij, "Edge-based color constancy," IEEE Trans. Image Process., vol. 16, no. 9, pp. 2207–2214, Sep. 2007.
[38] F. Liu, M. Gleicher, J. Wang, H. Jin, and A. Agarwala, "Subspace video stabilization," ACM Trans. Graph., vol. 30, no. 1, pp. 1–4, 2011.
Yinting Wang received the B.E. degree in software engineering from Zhejiang University, Hangzhou, China, in 2008, where he is currently pursuing the Ph.D. degree in computer science. His research interests include computer vision and image/video enhancement.

Dacheng Tao (M'07–SM'12) is currently a Professor of Computer Science with the Centre for Quantum Computation and Intelligent Systems and the Faculty of Engineering and Information Technology, University of Technology, Sydney, NSW, Australia. He mainly applies statistics and mathematics to data analysis problems in data mining, computer vision, machine learning, multimedia, and video surveillance. He has authored over 100 scientific articles at top venues, including the IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, the IEEE TRANSACTIONS ON IMAGE PROCESSING, the Conference on Neural Information Processing Systems, the International Conference on Machine Learning, the International Conference on Artificial Intelligence and Statistics, the IEEE International Conference on Data Mining (ICDM), the IEEE Conference on Computer Vision and Pattern Recognition, the International Conference on Computer Vision, the European Conference on Computer Vision, the ACM Transactions on Knowledge Discovery from Data, the ACM Multimedia Conference, and the ACM Conference on Knowledge Discovery and Data Mining. He received the Best Theory/Algorithm Paper Runner-Up Award at IEEE ICDM'07 and the Best Student Paper Award at IEEE ICDM'13.

Xiang Li received the B.E. degree in computer science from Zhejiang University, Hangzhou, China, in 2013. He is currently pursuing the M.S. degree in information technology with Carnegie Mellon University, Pittsburgh, PA, USA. His research interests include machine learning and computer vision.

Mingli Song (M'06–SM'13) received the Ph.D. degree in computer science from Zhejiang University, Hangzhou, China, in 2006. He is currently an Associate Professor with the Microsoft Visual Perception Laboratory, Zhejiang University. He was a recipient of the Microsoft Research Fellowship in 2004. His research interests include face modeling and facial expression analysis.

Jiajun Bu is currently a Professor with the College of Computer Science, Zhejiang University, Hangzhou, China. His research interests include computer vision, computer graphics, and embedded technology.

Ping Tan received the bachelor's and master's degrees from Shanghai Jiao Tong University, Shanghai, China, in 2000 and 2003, respectively, and the Ph.D. degree in computer science and engineering from the Hong Kong University of Science and Technology, Hong Kong, in 2007. He is currently an Assistant Professor with the School of Computing Science, Simon Fraser University, Burnaby, BC, Canada, and was previously an Associate Professor with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore. He has served as an Editorial Board Member of the International Journal of Computer Vision and Machine Vision and Applications, and has served on the Program Committees of SIGGRAPH and SIGGRAPH Asia. He was a recipient of the inaugural MIT TR35@Singapore Award in 2012, and of the Image and Vision Computing Outstanding Young Researcher Award Honorable Mention in 2012.