
Assessing Gesture Subtlety Using Accelerometer and Gyroscope
We consider a gesture subtle if other people do not perceive it
as out of the ordinary, so that it does not disrupt the flow of
a social event. In the context of wearable computing, subtle
hand gestures play an important role. They could be used to
easily perform tasks such as looking at past notes, checking
for a new message, or looking up the meaning of a word that
somebody has just said. More broadly, a well-configured subtle
gesture would be useful for controlling a variety of computing
devices. In the related works we describe some research on
designing and identifying such gestures and their application
areas. Such gestures must also be socially acceptable and
comfortable to use. Loud and obtrusive gestures have the
obvious flaw of being socially inappropriate and uncomfortable
for the user. In addition, the user may be actively interacting
with people around her and may find speech commands
inefficient and disturbing. This problem has been addressed in
the literature from an HCI point of view, where it is found that
small, quiet gestures are the most acceptable and comfortable.
We approached the problem from an AI perspective in our first
project. In this paper, we revisit the problem of designing a
system that can assign a subtlety score to any given hand
gesture. As user studies are cumbersome and expensive, being
able to assign such scores automatically makes it convenient to
quickly identify gestures suited to different social situations.
Related Works
Kern et al. [1] discuss the utility of being able to selectively
retrieve segments of meetings recorded by a wearable computer.
They deal with the problem by automatically annotating the
recordings according to the user's state as well as the
situation. We believe that their technique could be augmented
by annotating with subtle gestures, as this lets a user annotate
according to her own preferences. In video conferences such
gestures could be used without causing visual distractions.
Although the system of Julie and Stephen [3] is designed to
accommodate a variety of sensors, they chose a wrist-worn
accelerometer for its wide availability and low cost.
Noticeably less work has been done on identifying the subtlety
of gestures, and we believe this is where the literature is
deficient. There is work to support that small, unobtrusive
gestures are more socially acceptable and comfortable for
controlling portable computing devices [2], but training a
computer to score gestures for subtlety remains an open
problem.
We continue from our previous project, where we recorded
accelerometer data using the app funfinabox [4] on an Android
phone. Previously, we had 15 gestures in our set and 20
participants for the survey. With Dynamic Time Warping (DTW)
we applied one-nearest-neighbor (1-NN); without it, our
approach to gesture analysis used k-NN with k from 1 to 10. Our
feature vector consisted of the magnitude of the acceleration
vector. We think our results can be improved with a larger
training set, regular sampling, a multidimensional feature
vector, and time-independent features for classification.
Data Collection: Ground truth subtlety scores were assigned
to our gesture set by conducting a web survey with 32
participants in all. Participants were shown ten 5 s videos of
the two authors and, after each viewing, were asked this
question: "Did anyone make a gesture that he could be using
to control a computer in the video above?" The ‘detectability
score’ was then defined as the ratio of the number of users who
identified the gesture to the number of users who were shown
a video with the gesture performed. The gestures used and the
corresponding scores are provided in Table 1.
[Table 1 gestures: Window Open, Window Close, Fire On, Fire Off,
Throw Money, Slap Phone, Swing Phone, Multi Finger Snap, Flick
Air, Door Close, Door Open; column: Detectability Score]
Table 1 – Gestures and user scores
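As a small illustration of the detectability score defined above, the ratio can be computed as in the following sketch. The function name and the example answers are hypothetical and are not survey data.

```python
def detectability_score(responses):
    """Fraction of viewers who reported seeing a control gesture.

    `responses` holds one boolean per participant who was shown a video
    containing the gesture: True if they answered "yes" to the question.
    """
    if not responses:
        raise ValueError("no participant was shown this gesture")
    return sum(1 for answered_yes in responses if answered_yes) / len(responses)


# Hypothetical answers from 8 viewers of one gesture video (not survey data).
example_responses = [True, False, True, True, False, False, False, True]
print(detectability_score(example_responses))
```

A score near 0 means almost nobody noticed the gesture (subtle); a score near 1 means almost everyone did.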
The accelerometer data was collected by attaching a phidget to
the wrist of the performer (each of the two authors), who then
performed each gesture 10 times. We had learned that
constant-rate sampling is an important factor in reducing noise,
and the phidget handled this well. We collected 125 samples/s.
To mark the beginning of a gesture we used a window-mean-based
approach: we started storing the gesture when a threshold
condition on the windowed acceleration norm was met. In that
condition, a_norm,i is the acceleration norm at window i, σ is
the standard deviation for the mean μ, and t is a threshold that
was between 0.5 and 3. A minimum gesture length of 3–4 s was
enforced, and after this period the gesture was said to end as
soon as the condition was no longer satisfied.
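The segmentation rule can be sketched as follows. This is only one plausible reading: we assume the start condition has the form |a_norm,i − μ| > t·σ, with μ and σ estimated from a resting baseline at the start of the recording. The exact condition, the window size and the baseline length are assumptions, and all function and parameter names are hypothetical.

```python
import statistics


def window_means(norms, window):
    """Mean acceleration norm over consecutive, non-overlapping windows."""
    return [statistics.fmean(norms[i:i + window])
            for i in range(0, len(norms) - window + 1, window)]


def segment_gesture(norms, window, baseline_windows, t, min_windows):
    """Return (start, end) sample indices of the gesture, or None.

    Assumed rule: window i is active when |a_norm_i - mu| > t * sigma,
    with mu and sigma taken from the first `baseline_windows` windows.
    """
    means = window_means(norms, window)
    baseline = means[:baseline_windows]
    mu = statistics.fmean(baseline)
    sigma = statistics.pstdev(baseline)

    def is_active(i):
        return abs(means[i] - mu) > t * sigma

    start = next((i for i in range(baseline_windows, len(means))
                  if is_active(i)), None)
    if start is None:
        return None
    # Enforce the minimum length, then stop at the first inactive window.
    end = min(start + min_windows, len(means))
    while end < len(means) and is_active(end):
        end += 1
    return start * window, end * window


# Hypothetical trace: a resting baseline, a burst of motion, then rest again.
trace = [1.0, 1.0, 1.1, 1.1, 0.9, 0.9, 1.0, 1.0,
         3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 1.0, 1.0, 1.0, 1.0]
print(segment_gesture(trace, window=2, baseline_windows=4, t=2.0, min_windows=2))
```

At 125 samples/s, a minimum length of 3 s would correspond to `min_windows = 375 // window`; the tiny values above are chosen only to keep the example readable.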
Analysis: We contrast two approaches for assigning a subtlety
score. Broadly, they can be classified as either time-dependent
or time-independent. In the time-dependent approach, each
gesture is considered a time series of acceleration norms. We
then evaluate the Euclidean and DTW distances between a query
vector and every gesture in the training set. Because gestures
may be performed at varying speeds, DTW was expected to perform
better: it stretches or shrinks two signals in time, so that the
smallest distance after these transformations can be computed.
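To make the contrast between the two distances concrete, here is a minimal sketch of the textbook dynamic-programming form of DTW next to a plain Euclidean distance. It is not necessarily the exact variant used in this study. The absolute difference as local cost and the equal-length requirement for the Euclidean case are assumptions.

```python
import math


def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j]: best cumulative cost aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # stretch b
                                     cost[i][j - 1],      # stretch a
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def euclidean_distance(a, b):
    """Point-by-point distance; only defined for equal-length series."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# The same hypothetical gesture shape performed at two speeds.
fast = [1.0, 2.0, 3.0]
slow = [1.0, 2.0, 2.0, 3.0]
print(dtw_distance(fast, slow))
```

Because DTW may repeat samples of either series, the slower performance aligns perfectly with the faster one, whereas a point-by-point comparison cannot even be applied to series of different lengths without resampling.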
A detectability score is then assigned to the query from the
scores of its k nearest neighbors in the training set. This was
done separately for the DTW and Euclidean distances. k was then
varied from 1 to 100 and the optimal k was picked according to
the average absolute error. Each gesture was performed multiple
times, so the score of a gesture was taken as the average of the
scores of its instances.
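The scoring step can be sketched as below. We assume the query score is the unweighted mean of the scores of the k nearest training instances; the exact combination rule is an assumption, and a distance-weighted mean would be an equally plausible choice. The function names and the toy data are hypothetical.

```python
import math
from statistics import fmean


def knn_score(query, training, k, distance):
    """Mean score of the k training instances closest to `query`.

    `training` is a list of (series, score) pairs and `distance` is any
    callable taking two series, e.g. a DTW or Euclidean distance.
    """
    ranked = sorted(training, key=lambda item: distance(query, item[0]))
    return fmean([score for _, score in ranked[:k]])


def gesture_score(instances, training, k, distance):
    """Average the per-instance predictions of one gesture."""
    return fmean([knn_score(q, training, k, distance) for q in instances])


# Toy data: 2-sample "series" with made-up detectability scores.
training = [([0.0, 0.0], 0.0), ([1.0, 1.0], 0.5), ([4.0, 4.0], 1.0)]
print(knn_score([0.9, 0.9], training, k=2, distance=math.dist))
print(gesture_score([[0.9, 0.9], [3.9, 3.9]], training, k=1, distance=math.dist))
```

Choosing k then amounts to repeating this prediction for each candidate k and keeping the value with the lowest average absolute error.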
In the second approach, we calculated a set of
time-independent features for each gesture instance (Table 2).
These features were judged by the authors to be the most
relevant to subtlety. The gestures were now described by
time-independent feature vectors, and we could use regression
to predict the score for a query vector.
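The feature computation summarized in the caption of Table 2 can be sketched roughly as follows. Several details are assumptions and do not come from our implementation: the sensor is taken to be at rest at the first sample, gravity is rotated with a first-order update, integration is rectangular, and the energy is work per unit mass. Only a subset of the features is computed, and all names are hypothetical.

```python
import math


def cross(u, v):
    """Cross product of two 3-vectors."""
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def norm(v):
    return math.sqrt(sum(c * c for c in v))


def remove_gravity(acc, gyro, dt):
    """Subtract a gyroscope-tracked gravity estimate from each reading.

    Assumption: the sensor is at rest at sample 0, so acc[0] is gravity.
    In the sensor frame a world-fixed vector g obeys dg/dt = -(w x g);
    one first-order step per sample is applied, then g is rescaled to
    its initial magnitude to limit drift. gyro is in rad/s.
    """
    g = tuple(acc[0])
    g_magnitude = norm(g)
    linear = []
    for a, w in zip(acc, gyro):
        linear.append(tuple(ai - gi for ai, gi in zip(a, g)))
        # Rotate gravity for the next sample using this sample's rates.
        w_cross_g = cross(w, g)
        g = tuple(gi - dt * ci for gi, ci in zip(g, w_cross_g))
        scale = g_magnitude / norm(g)
        g = tuple(gi * scale for gi in g)
    return linear


def gesture_features(acc, gyro, dt):
    """A subset of time-independent features for one gesture instance."""
    linear = remove_gravity(acc, gyro, dt)
    velocity, v, energy = [], (0.0, 0.0, 0.0), 0.0
    for a in linear:
        v = tuple(vi + ai * dt for vi, ai in zip(v, a))  # integrate acceleration
        velocity.append(v)
        step = tuple(vi * dt for vi in v)                # displacement this step
        energy += sum(ai * si for ai, si in zip(a, step))  # a . dx, per unit mass
    jerk = [tuple((q - p) / dt for p, q in zip(prev, nxt))
            for prev, nxt in zip(linear, linear[1:])]
    acc_norms = [norm(a) for a in linear]
    vel_norms = [norm(u) for u in velocity]
    jerk_norms = [norm(j) for j in jerk] or [0.0]
    return {
        "max_acc": max(acc_norms), "avg_acc": sum(acc_norms) / len(acc_norms),
        "max_vel": max(vel_norms), "avg_vel": sum(vel_norms) / len(vel_norms),
        "max_jerk": max(jerk_norms), "avg_jerk": sum(jerk_norms) / len(jerk_norms),
        "energy": energy,
    }


# Hypothetical recording: at rest, then a constant push along x, no rotation.
acc = [(0.0, 0.0, 9.81)] + [(1.0, 0.0, 9.81)] * 4
gyro = [(0.0, 0.0, 0.0)] * 5
print(gesture_features(acc, gyro, dt=0.5))
```

The first-order gravity update is only reasonable for small rotation per sample, which is one reason a high, uniform sampling rate matters for this approach.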
[Table 2 features: Max and avg accelerations (x, y, z and Norm);
Max and avg velocities (x, y, z and Norm); Max and avg jerk;
Total energy spent; Maximum vertical ...; Total distance covered
in ...]
Table 2 – The accelerations were with gravity removed. This was
done by rotating the gravitational vector at each time step using the
gyroscope velocities at the previous step and subtracting its
components from the accelerometer readings. The velocities were
calculated by integrating the resulting acceleration vectors. The
jerk is the derivative of acceleration w.r.t. time. The energy spent was
calculated by first computing the displacement at each step by
integrating the velocities, then taking its dot product with the
acceleration vector and integrating over the gesture duration.
To test the performance of the two approaches, we used
leave-one-out cross-validation (LOOCV). LOOCV is a
cross-validation technique that uses a single observation as the
validation set and the remaining observations as the training
set. This is done for every observation in the data, and the
validation error is calculated as the average of the error at
each step. In our case, all instances of one gesture were held
out for testing at each step of LOOCV.
Figure 1 shows the average absolute error in prediction by the
DTW approach against the value of k.
Figure 1 – avg. abs. error vs. k for DTW
We picked k = 9 as optimal because it produced the lowest error.
For the Euclidean approach the optimal k was 3. The prediction
results for these approaches are shown in figure 2.
Figure 2 – User scores vs. predicted scores. For perfect correlation,
all points must lie on the straight line y = x (also shown)
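The gesture-level leave-one-out protocol can be sketched as below. For brevity the model is ordinary least squares on a single feature (for example, energy), which is an assumption for illustration. The sample data and all names are hypothetical.

```python
from statistics import fmean


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def leave_one_gesture_out(samples):
    """samples: list of (gesture, feature, score) tuples, one per instance.

    Each fold holds out every instance of one gesture, fits on the rest,
    and scores the held-out gesture as the mean of its instance
    predictions. Returns {gesture: absolute error}.
    """
    errors = {}
    for held_out in sorted({g for g, _, _ in samples}):
        train = [(x, y) for g, x, y in samples if g != held_out]
        test = [(x, y) for g, x, y in samples if g == held_out]
        slope, intercept = fit_line([x for x, _ in train],
                                    [y for _, y in train])
        predicted = fmean([slope * x + intercept for x, _ in test])
        truth = test[0][1]  # all instances of a gesture share one survey score
        errors[held_out] = abs(predicted - truth)
    return errors


# Made-up (gesture, energy, detectability score) instances.
samples = [("a", 1.0, 0.1), ("a", 1.2, 0.1), ("b", 2.0, 0.3),
           ("b", 2.2, 0.3), ("c", 3.0, 0.5), ("c", 2.8, 0.5),
           ("d", 4.0, 0.7), ("d", 4.1, 0.7)]
fold_errors = leave_one_gesture_out(samples)
print(fmean(fold_errors.values()))
```

Holding out whole gestures rather than single instances keeps near-duplicate repetitions of the test gesture out of the training set, so the error reflects how well the model generalizes to an unseen gesture.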
Similar results for the time-independent approach are shown in
figure 3. Also shown are predictions when the only feature
used was energy. The purpose was to question whether the total
energy spent while performing a gesture was by itself
sufficient to predict its subtlety.
Table 3 compares the four approaches using mean absolute
error, RMSE and Pearson’s correlation coefficient between the
ground truth and predicted scores.
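For reference, the three comparison metrics can be computed as in this short sketch. The score lists are invented for illustration and are not our results.

```python
import math


def mean_absolute_error(truth, predicted):
    return sum(abs(t - p) for t, p in zip(truth, predicted)) / len(truth)


def rmse(truth, predicted):
    """Root mean squared error; penalizes large misses more than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, predicted))
                     / len(truth))


def pearson(truth, predicted):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(truth)
    mean_t, mean_p = sum(truth) / n, sum(predicted) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(truth, predicted))
    var_t = sum((t - mean_t) ** 2 for t in truth)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_t * var_p)


# Invented ground-truth and predicted detectability scores.
truth = [0.1, 0.5, 0.9]
predicted = [0.2, 0.5, 0.8]
print(mean_absolute_error(truth, predicted),
      rmse(truth, predicted),
      pearson(truth, predicted))
```

The two kinds of metric answer different questions: a model can have a low absolute error while still ordering gestures poorly, which is why the correlation is reported alongside the error.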
Figure 3 – User scores vs. predicted scores (regression)
[Table 3 labels: Regression (All), Regression (Energy); Abs.
error, Pearson coeff.]
Table 3 – Average error rates. Detectability scores range from 0 to 1.
The results show the time-independent approach to be the
superior one. The absolute error for the DTW approach is
comparable to that of regression, but the Pearson correlation
coefficient is much higher for the latter. Of the time-dependent
approaches, the Euclidean way of measuring distance is not an
appropriate one. It has the highest error rate and the lowest
correlation with the ground truth. This is as expected, as it
does not take the distortion of the gesture signal in time into
account. Regression performed with only energy gives a
moderately high error and a low correlation with the ground
truth. The difference between the approaches can be seen from
figures 2 and 3 as well. The closer the data points are to the
line y = x, the better the model predictions. Regression with
time-independent features seems to give the ‘best fit’ to the
ground truth data, as seen in Figure 3 (a). This is also
reflected in its high correlation coefficient of 0.6177.
We have a big improvement in our prediction ability from last
time. We attribute this to two reasons. Firstly, we were able to
sample our gestures much better by using the phidget. We had
noted last time that frequent, uniform sampling was important,
and that prediction seems to be correct. Secondly, we
implemented a time-independent features approach, which
gave much better predictions. We had listed this in future
works previously. We also realize that we are still far from
building an algorithm that can predict gesture subtlety. We
have learnt throughout the project that the biggest hurdle is
having enough data. Our findings here are limited to the range
of gestures we have picked for this study. Even in this range, a
gesture set of 13 leads to very coarse buckets of subtlety
scores. We need to perform a larger user study with more
gestures (20-25) to reduce this granularity. We have also
learned that it is important to come up with a precise definition
of subtlety, one that can be agreed upon by a set of experts and
that is also possible for a computer to interpret. We can
simultaneously arrive at a precise definition of subtlety and
test the performance of our algorithm by calculating user and
algorithmic correlation to the ground truth, refining our
definition until we achieve coefficients higher than a threshold.
We can also try better approaches to time-series and
time-independent analysis such as HMMs, SMO classification,
decision trees, etc. We can also try boosting our existing
approaches to achieve better results. Given our limited time
and data set, we believe we have achieved good results and
given a proof of concept that, in the given range of gestures,
subtlety scores can be assigned to within an acceptable error
using data from an accelerometer worn on the wrist of the user.
[1] Kern N., Schiele B., Junker H., Lukowicz P., & Troster G. (2003).
Wearable sensing to annotate meeting recordings. Personal and
Ubiquitous Computing.
[2] Lyons K., Skeels C., Starner T., Snoeck C. M., Wong B. A., &
Ashbrook D. (2004). Augmenting conversations using dual-purpose
speech. Proceedings of the 17th annual ACM symposium on User
interface software and technology.
[3] Julie R. & Stephen B. (2010). Usable gestures for mobile
interfaces: evaluating social acceptability. Proceedings of the 28th
international conference on Human factors in computing systems
(CHI '10), 887-896.
[4] Aharony N., Pan W., Ip C., Khayal I., & Pentland A. (2011).
Social fMRI: Investigating and shaping social mechanisms in the
real world. Pervasive and Mobile Computing.
[5] Starner T., Auxier J., Ashbrook D., & Gandy M. (2000). The
Gesture Pendant: A Self-illuminating, Wearable, Infrared Computer
Vision System for Home Automation Control and Medical
Monitoring. In IEEE International Symposium on Wearable
[6] Ashbrook D. & Starner T. (2010). MAGIC: a motion gesture design
tool. Proceedings of the 28th international conference on Human
factors in computing systems.
[7] Bailly G., Müller J., Rohs M., Wigdor D., & Kratz S. (2012).
ShoeSense: A New Perspective on Hand Gestures and Wearable
Applications. Proceedings of the 30th international conference on
Human factors in computing systems.