Assessing Gesture Subtlety Using Accelerometer and Gyroscope

Introduction

We consider a gesture to be subtle if it is not perceived as out of the norm by other people, so that it does not disrupt the flow of a social event. In the context of wearable computing, subtle hand gestures play an important role: they could be used to unobtrusively perform tasks such as reviewing past notes, checking whether a new message has arrived, or looking up the meaning of a word one has just heard. More broadly, a well-designed subtle gesture would be useful for controlling a variety of computing devices. In the related works we describe research on designing and identifying such gestures and their application areas.

Such gestures must also be socially acceptable and comfortable to use. Loud and obtrusive gestures have the obvious flaw of being socially inappropriate and uncomfortable for the user. In addition, the user may be actively interacting with people around her and may find speech commands inefficient and disturbing. The literature addresses this problem from an HCI point of view and finds that small, quiet gestures are the most acceptable and comfortable. We approached the problem from an AI perspective in our first project. In this paper, we revisit the problem of designing a system that can assign a subtlety score to any given hand gesture. Because user studies are cumbersome and expensive, the ability to assign such scores automatically makes it convenient to quickly identify gestures suited to different social situations.

Related Works

Kern et al. [1] discuss the utility of being able to selectively retrieve segments of meetings recorded by a wearable computer. They address the problem by automatically annotating the recordings according to the user's state as well as the situation. We believe their technique could be augmented by annotating with subtle gestures, as this gives the user the choice to annotate according to her own preferences; in video conferences, such gestures could be used without causing visual distractions. Julie and Stephen [3] chose a wrist-worn accelerometer due to its wide availability and low cost, even though their system is designed to accommodate a variety of sensors. Noticeably less work has been done on identifying the subtlety of gestures, and we believe this is where the literature is deficient. There is evidence that small, unobtrusive gestures are more socially acceptable and comfortable for controlling portable computing devices [2], but training a computer to score gestures for subtlety remains an open problem.

We continue from our previous project, in which we recorded accelerometer data with an Android phone using the app funfinabox [4]. Previously, we had 15 gestures in our set and 20 participants in the survey. Our gesture analysis applied one-nearest-neighbor (1-NN) with Dynamic Time Warping (DTW), and k-NN with k from 1 to 10 without it. Our feature vector consisted of the magnitude of the acceleration vector. We believe those results can be improved with a larger training set, regular sampling, a multidimensional feature vector, and time-independent features for classification.

Approach

Data Collection: Ground truth subtlety scores were assigned to our gesture set by conducting a web survey with 32 participants in all. Participants were shown ten 5-second videos of the two authors and, after each viewing, were asked: "Did anyone make a gesture that he could be using to control a computer in the video above?" The detectability score was then defined as the ratio of the number of users who identified the gesture to the number of users who were shown a video with the gesture performed. The gestures used and the corresponding scores are provided in Table 1.

Gesture               Detectability Score
Triangle              0.778
Window Open           0.625
Window Close          0.5
Fire On               0.636
Fire Off              0.667
Throw Money           0.818
Slap Phone            0.9
W                     0.556
Swing Phone           0.909
Multi Finger Snap     0.625
Flick Air             0.636
Door Close            0.538
Door Open             0.889

Table 1 – Gestures and user scores

The accelerometer data was collected by attaching a phidget to the wrist of the performer (each of the two authors), who then performed each gesture 10 times. We had learned that constant-rate sampling is an important factor in reducing noise, and the phidget handled this well; we collected 125 samples/s. To mark the beginning of a gesture we used a window-mean based approach: we started storing the gesture when |a_norm,i − μ| > tσ, where a_norm,i is the acceleration norm at window i, μ is the running mean of the norm, σ is its standard deviation, and t is a threshold between 0.5 and 3. A minimum gesture length of 3-4 s was enforced, and after this period the gesture was said to end whenever the above condition was violated.
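As a concrete illustration, the sketch below implements this thresholding rule on a stream of acceleration norms. It is a minimal sketch under assumptions of ours: the baseline mean and standard deviation are estimated from an initial quiet window, the default of t = 1.5 is just an example value from the 0.5-3 range, and the function and constant names are not from our actual recording code.

```python
import numpy as np

SAMPLE_RATE = 125          # samples per second from the wrist-worn phidget
MIN_GESTURE_SEC = 3.0      # minimum enforced gesture length (3-4 s in our setup)

def segment_gesture(norms, t=1.5, baseline_len=SAMPLE_RATE):
    """Return (start, end) indices of the first gesture found in a stream of
    acceleration norms, or None if nothing crosses the threshold.

    A sample starts/extends a gesture when |a_norm - mu| > t * sigma, where mu and
    sigma are estimated from an initial quiet window (an assumption of this sketch)."""
    norms = np.asarray(norms, dtype=float)
    mu = norms[:baseline_len].mean()
    sigma = norms[:baseline_len].std()
    min_len = int(MIN_GESTURE_SEC * SAMPLE_RATE)

    active = np.abs(norms - mu) > t * sigma
    candidates = np.flatnonzero(active)
    if candidates.size == 0:
        return None
    start = int(candidates[0])

    # Enforce the minimum length, then end the gesture at the first quiet sample after it.
    end = start + min_len
    while end < len(norms) and active[end]:
        end += 1
    return start, min(end, len(norms))
```

In practice the threshold t and the choice of baseline window trade off sensitivity against false starts; we tuned t by hand within the 0.5-3 range.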
Analysis: We contrast two approaches for assigning a subtlety score. Broadly, they can be classified as either time-dependent or time-independent. In the time-dependent approach, each gesture is treated as a time-series vector of acceleration norms, and we evaluate the Euclidean and DTW distances between a query vector and every vector in the training set. Since the gestures may be performed at varying speeds, DTW was expected to perform better: it stretches or shrinks the two signals in time so that the closest distance after these transformations can be computed. A detectability score is then assigned to the query as the average of the ground-truth scores of its k nearest training instances. This was done separately for the DTW and Euclidean distances. k was then varied from 1 to 100, and the optimal k was picked according to the average absolute error. Each gesture had multiple recorded instances, so the gesture's score was assigned as the average of the scores predicted for its instances.
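As an illustration of the time-dependent pipeline, the sketch below computes a standard dynamic-programming DTW distance between two acceleration-norm sequences and predicts a detectability score as the average over the k nearest training instances. It is a minimal sketch, not our exact implementation; the function names and the train_set layout (a list of (sequence, score) pairs) are assumptions made for this example.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(n*m) dynamic-programming DTW between two 1-D acceleration-norm sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

def predict_detectability(query, train_set, k=9):
    """k-NN regression: average the ground-truth scores of the k nearest training instances.
    train_set is a list of (norm_sequence, detectability_score) pairs (assumed layout)."""
    dists = [(dtw_distance(query, seq), score) for seq, score in train_set]
    dists.sort(key=lambda pair: pair[0])
    return float(np.mean([score for _, score in dists[:k]]))
```

Replacing dtw_distance with a plain Euclidean distance (after resampling the two sequences to a common length) yields the second time-dependent variant compared below.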
In the second approach, we calculated a set of time-independent features for each gesture instance (Table 2). These features were judged by the authors to be the most relevant to subtlety. The gestures were now described by time-independent feature vectors, and we could use regression to predict the score for a query vector.

Max and avg accelerations (x, y, z, and norm)
Max and avg velocities (x, y, z, and norm)
Max and avg jerk
Maximum vertical displacement
Total energy spent
Total distance covered in the gesture

Table 2 – Time-independent features. The accelerations are with gravity removed; this was done by rotating the gravitational vector at each time step using the gyroscope velocities from the previous step and subtracting its components from the accelerometer readings. The velocities were calculated by integrating the resulting acceleration vectors. The jerk is the derivative of acceleration with respect to time. The energy spent was calculated by first computing the displacement at each step by integrating the velocities, then taking the dot product with the acceleration vector and integrating over the gesture duration.
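The notes under Table 2 can be turned into a small feature-extraction sketch. The code below is illustrative only and rests on several assumptions not stated above: samples are treated as equally spaced at 125 Hz, gravity is initialised from the first accelerometer sample and tracked with a first-order (small-angle) rotation by the previous step's gyroscope rates, integration is a simple cumulative sum, and the z axis is taken as vertical; all function and variable names are ours.

```python
import numpy as np

DT = 1.0 / 125.0  # sample period; the phidget was read at 125 samples/s

def rotate_small_angle(v, omega, dt):
    """First-order rotation of vector v as the sensor frame turns by omega*dt:
    world-fixed vectors (like gravity) appear rotated by the opposite amount."""
    theta = -np.asarray(omega) * dt
    return v + np.cross(theta, v)

def extract_features(accel, gyro, dt=DT):
    """accel, gyro: (N, 3) arrays of accelerometer (m/s^2) and gyroscope (rad/s) readings.
    Returns one time-independent feature vector for a gesture instance (assumed layout)."""
    accel = np.asarray(accel, dtype=float)
    gyro = np.asarray(gyro, dtype=float)

    # Track gravity: start from the first sample (device assumed roughly still),
    # rotate the estimate with the previous step's gyroscope rates, then subtract it.
    g = accel[0].copy()
    linear = np.zeros_like(accel)
    for i in range(len(accel)):
        if i > 0:
            g = rotate_small_angle(g, gyro[i - 1], dt)
        linear[i] = accel[i] - g                   # gravity-free acceleration

    vel = np.cumsum(linear, axis=0) * dt           # velocity by integrating acceleration
    disp = np.cumsum(vel, axis=0) * dt             # displacement by integrating velocity
    jerk = np.diff(linear, axis=0) / dt            # jerk = d(acceleration)/dt
    norm = np.linalg.norm(linear, axis=1)
    vnorm = np.linalg.norm(vel, axis=1)
    jerk_norm = np.linalg.norm(jerk, axis=1)

    # "Energy": dot product of each per-step displacement with the acceleration,
    # summed over the gesture (the integral of a . dx, up to the constant mass).
    step_disp = np.diff(disp, axis=0, prepend=disp[:1])
    energy = float(np.einsum('ij,ij->', linear, step_disp))

    return np.concatenate([
        np.abs(linear).max(axis=0), [norm.max()],    # max acceleration (x, y, z, norm)
        np.abs(linear).mean(axis=0), [norm.mean()],  # avg acceleration (x, y, z, norm)
        np.abs(vel).max(axis=0), [vnorm.max()],      # max velocity (x, y, z, norm)
        np.abs(vel).mean(axis=0), [vnorm.mean()],    # avg velocity (x, y, z, norm)
        [jerk_norm.max(), jerk_norm.mean()],         # max and avg jerk
        [np.abs(disp[:, 2]).max()],                  # maximum vertical displacement (z assumed up)
        [energy],                                    # total energy spent
        [np.sum(np.linalg.norm(np.diff(disp, axis=0), axis=1))],  # total distance covered
    ])
```

One such vector per gesture instance is then fed to the regression step and evaluated with the leave-one-out procedure described next.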
To test the performance of the two approaches, we used the leave-one-out cross-validation (LOOCV) technique. LOOCV is a cross-validation technique that uses a single observation as the validation set and the remaining observations as the training set. This is done for every observation in the data, and the validation error is calculated as the average of the error at each step. In our case, all instances of a particular gesture were used for testing at any step of LOOCV.

Evaluation

Figure 1 shows the average absolute error in prediction by the DTW approach against the value of k.

Figure 1 – Avg. abs. error vs. k for DTW

We picked k = 9 as optimal, as it produced the lowest error; for the Euclidean approach this was k = 3. The prediction results for these approaches are shown in Figure 2.

Figure 2 – User scores vs. predicted scores. For perfect correlation, all points must lie on the straight line y = x (also shown)

Similar results for the time-independent approach are shown in Figure 3. Also shown are predictions when the only feature used was energy; the purpose was to test whether the total energy spent while performing a gesture is by itself sufficient to predict its subtlety.

Figure 3 – User scores vs. predicted scores (regression)

Table 3 compares the four approaches using mean absolute error, RMSE, and Pearson's correlation coefficient between the ground truth and predicted scores.

Approach               Abs. error   RMSE     Pearson coeff.
DTW                    0.1009       0.1292    0.3896
Euclidean              0.1325       0.1689   -0.0841
Regression (All)       0.0959       0.1098    0.6177
Regression (Energy)    0.1159       0.1344    0.2243

Table 3 – Average error rates. Detectability scores range from 0 to 1.

Discussion

The results clearly show the time-independent approach to be the superior one. The absolute error for the DTW approach is comparable to that of the regression approach, but the Pearson correlation coefficient is much higher for the latter. Of the time-dependent approaches, the Euclidean way of measuring distance is not an appropriate one: it has the highest error rate and the lowest correlation with the ground truth. This is as expected, since it does not take distortion of the gesture signal into account. Regression performed with only energy gives a moderately high error and a low correlation with the ground truth. The difference between the approaches can also be seen in Figures 2 and 3: the closer the data points are to the line y = x, the better the model predictions. Regression with time-independent features gives the best fit to the ground truth data, as seen in Figure 3(a); this is also reflected in its high correlation coefficient of 0.6177.

We see a large improvement in our prediction ability compared to our previous project, which we attribute to two reasons. First, we were able to sample our gestures much better by using the phidget; we had noted last time that frequent, uniform sampling was important, and that prediction appears to have been correct. Second, we implemented the time-independent feature approach, which gave much better predictions; we had listed this as future work previously.

We also realize that we are still far from building an algorithm that can predict gesture subtlety in general. We have learned throughout the project that the biggest hurdle is having enough data. Our findings are limited to the range of gestures we picked for this study, and even in this range a gesture set of 13 leads to very coarse buckets of subtlety scores; we need to perform a larger user study with more gestures (20-25) to reduce this granularity. We have also learned that it is important to come up with a precise definition of subtlety, one that can be agreed upon by a set of experts and that a computer can interpret. We can simultaneously sharpen the definition and test the performance of our algorithm by calculating user and algorithmic correlation to the ground truth and refining the definition until the coefficients exceed a threshold. We can also try better approaches to time-series and time-independent analysis, such as HMMs, SMO classification, or decision trees, and we can try boosting our existing approaches to achieve better results. Given our limited time and data set, we believe we have achieved good results and provided a proof of concept that, within the given range of gestures, subtlety scores can be assigned to an acceptable error using data from an accelerometer worn on the wrist of the user.

References

[1] Kern N., Schiele B., Junker H., Lukowicz P., & Tröster G. (2003). Wearable sensing to annotate meeting recordings. Personal and Ubiquitous Computing.
[2] Lyons K., Skeels C., Starner T., Snoeck C. M., Wong B. A., & Ashbrook D. (2004). Augmenting conversations using dual-purpose speech. Proceedings of the 17th Annual ACM Symposium on User Interface Software and Technology.
[3] Julie R. & Stephen B. (2010). Usable gestures for mobile interfaces: evaluating social acceptability. Proceedings of the 28th International Conference on Human Factors in Computing Systems (CHI '10), 887-896.
[4] Aharony N., Pan W., Ip C., Khayal I., & Pentland A. (2011). Social fMRI: Investigating and shaping social mechanisms in the real world. Pervasive and Mobile Computing.
[5] Starner T., Auxier J., Ashbrook D., & Gandy M. (2000). The Gesture Pendant: A Self-illuminating, Wearable, Infrared Computer Vision System for Home Automation Control and Medical Monitoring. IEEE International Symposium on Wearable Computers.
[6] Ashbrook D. & Starner T. (2010). MAGIC: a motion gesture design tool. Proceedings of the 28th International Conference on Human Factors in Computing Systems.
[7] Bailly G., Müller J., Rohs M., Wigdor D., & Kratz S. (2012). ShoeSense: A New Perspective on Hand Gestures and Wearable Applications. Proceedings of the 30th International Conference on Human Factors in Computing Systems.