Zero-Latency Tapping: Using Hover Information to Predict
Touch Locations and Eliminate Touchdown Latency
Haijun Xia1, Ricardo Jota1,2, Benjamin McCanny1, Zhe Yu1, Clifton Forlines1,2, Karan Singh1, Daniel Wigdor1
1University of Toronto, 2Tactual Labs
{haijunxia | jotacosta | bmccanny | zheyu | cforlines | karan | daniel}@dgp.toronto.edu
ABSTRACT
A method of reducing the perceived latency of touch input
by employing a model to predict touch events before the
finger reaches the touch surface is proposed. A corpus of 3D
finger movement data was collected, and used to develop a
model that makes predictions at three granularities, at different phases of movement: initial direction, final touch location, and time of touchdown. The model is validated for target distances of at least 25.5cm, and demonstrated to have a mean accuracy of 1.05cm, 128ms before the user touches the screen. A user study of different levels of latency reveals a strong preference for touchdown feedback with unperceivable latency. A
form of ‘soft’ feedback is proposed, as well as other
performance-enhancing uses for this prediction model.
Figure 1: The model predicts the location and time of a touch.
Parameters of the model are tuned to the latency of the device
to maximize accuracy while guaranteeing performance.
INTRODUCTION
The time delay between user input and corresponding
graphical feedback, here classified as interaction latency,
has long been studied in computer science. Early latency
research indicated that the visual “response to input should
be immediate and perceived as part of the mechanical
action induced by the operator. Time delay: No more than
0.1 second (100ms)” [25]. More recent work has found that
this threshold is, in fact, too high, as humans are able to
perceive even lower levels of latency - for direct touch
systems, it has been measured as low as 24ms when tapping
the screen [20], and 6ms when dragging [27]. Furthermore,
input latencies well below 100ms have been shown to
impair a user’s ability to perform basic tasks [20, 27].
While the touchdown latency of current commercial touch devices can be as low as 75ms, this latency is still perceptible to users. Eliminating latency, or at least reducing it beyond the limits of human perception and performance impairment, is highly desirable. Both Leigh et al. and Ng et al. demonstrated direct-touch systems capable of less than 1ms of latency [22, 27]. While compelling, these are not commercially viable for most applications: an FPGA replaced a general-purpose processor and software, they employ a high-speed projector rather than a display panel, and each is capable of displaying only simple geometry.
This paper investigates methods for eliminating the
apparent latency of tapping actions on a large touchscreen
through the development and use of a model of finger
movement. We track the path of a user’s finger as it
approaches the display and predict the location and time of
its landing. We then signal the impending touch to the application so that it can pre-buffer its response to the touchdown
event. In our demonstration system, we trigger a visual
response to the touch at the predicted point before the finger
lands on the screen. The timing of the trigger is tuned to the
system’s processing and display latency, so the feedback is
shown to the user at the moment they touch the display. The
result is an improvement in the apparent latency as touch
and feedback occur simultaneously.
While completely eliminating latency from traditional form
factors may ultimately prove to be impossible, we believe
that it is possible to reduce the apparent latency of an
interactive system. We define apparent latency as the time
between an input and the system’s soft feedback to that
input, which serves only to show a quick response to the
user (e.g.: pointer movement, UI buttons being depressed),
as distinct from the time required to show the hard feedback
of an application actually responding to that same input.
In order to predict the user’s landing point, we must first
understand the 3D spatial dynamics of how users perform
touch actions. To this end, we augmented a Samsung
SUR40 tabletop with a high fidelity 3D tracking system to
record the paths of user finger movements through space as
they performed basic touchscreen tasks. We collected data
on input paths by asking 15 participants to perform repeated
tapping tasks. We then analyzed this data using various
numerical and qualitative observations to develop a
prediction model of 3D finger motion for touch-table device
interaction. This model, which was validated by a
subsequent study for targets at least 25.5cm distant, enables
us to predict the movement direction, touch location, and
touch time prior to finger-device contact. Using our model,
we can achieve a touch-point prediction accuracy of 1.05cm
on average 128ms before the user touches the display. This
accuracy and prediction time horizon are sufficient to reduce the time between the finger touching down and the system's apparent response to beneath the 24ms lower bound of human perception described by Jota et al. [20].
Most touch sensors employed today are based on projective
capacitance. Fundamentally, the technique is capable of
sensing the user’s presence centimeters away from the
digitizer, as is done with the Theremin [31]. Such sensors
employed today are augmented with a ground plane,
purposefully added to eliminate their ability to detect a
user’s finger prior to touch [6]. More recently, sensors have
been further augmented to include the ability to not only
detect the user’s finger above the device, but also to detect
its distance from the digitizer [2, 14, 18, 34].
In this paper, we first describe relevant prior art in the areas of
hover sensing, input latency, and touch prediction. We then
describe a pair of studies that we used to formulate and then
validate our predictive model. Next, we describe a third study
in which participants’ preferences for low-latency touch input
were investigated. Finally, we describe a number of uses for
our model beyond simple feedback and outline future work
that continues the exploration of touch prediction.
RELATED WORK
We draw from several areas of related work in our present research: the detection and use of hovering information in HCI, the psychophysics of latency, the use of predictive models in HCI, and the modeling of human motion in three dimensions.
Hover Sensing
A number of sensing techniques have been employed to detect the position of the user prior to touching a display. In HCI research, hover sensing is often simulated using optical tracking tools such as the Vicon motion capture system, as we have done in this work. This approach requires the user to wear or hold objects augmented with markers, as well as the deployment of stationary cameras. A more practical approach for commercial products, markerless hover sensing, has been demonstrated using optical techniques, including through the use of an array of time-of-flight based range finders [3] as well as stereo and optical cameras [35].
Non-optical tracking has also been demonstrated using a number of technologies. One example is the use of acoustic-based sensors, such as the "Flock of Birds" tracking employed by Fitzmaurice et al. [8], which enables six degrees of freedom (DOF) position and orientation sensing of physical handheld objects. Although popular in research applications, widespread application of this sensor has been elusive. More common are 5-DOF tools using electro-magnetic resonance (EMR). EMR is commonly used to track the position and orientation of styli in relation to a digitizer, and is employed in creating pen-based user input. Although typically limited to a small range beyond the digitizer in commercial applications, tracking with EMR has been used in much larger volumes [12].
Use of Hover
Prior work has explored the use of sensing hover to enable intentional user input. Our work, in contrast, effectively hides the system's ability to detect hover from the user, using it only for prediction of touch location and timing, and elimination of apparent latency.
Hover has long been the domain of pen-operated devices [9, 19]. Subramanian et al. suggest that the 3D position of a pointing device affects the interaction on the surface [30]. The authors propose a multi-layer application, with an active usage of the space above the display, where users purposefully distance the pen from the display to activate actions. Grossman et al. present a technique that utilizes the hover state of pen-based systems to navigate through a hover-only command layer [15]. Spindler et al. [28] propose that the space above the surface be divided into stacked layers, with layer-specific interactions – this is echoed by Grossman et al. [16], who divided the space around a volumetric display into two spherical 'layers' with subtly differentiated interaction. This is distinct from Wigdor et al., who argued for the use of the hover area as a 'preview' space for touch gestures [33], similar to Yang et al., who used hover sensing to zoom on-screen targets [37]. In contrast, Marquardt et al. recommend that the space above the touch surface and the touch surface be considered one continuous space, and not separate interaction spaces [24].
These projects focused on differentiating the space around the display, and using it as an explicit interaction volume. Our approach is more similar to that taken by Hachisu and Kajimoto [17], who demonstrated the use of a pair of photo-sensing layers to measure finger velocity and predict the time of contact with the touch surface. We build on this work through the addition of a model of motion that allows the prediction of not only time, but also early indication of direction, as well as later prediction of the location of the user's touch, enabling low-latency visual feedback in addition to the audio feedback they provide.
Latency
Ng et al. studied the user perception of latency for touch
input. For dragging actions with a direct touch device, users
were able to detect latency levels as low as 6ms [27]. Jota et
al. studied the effect of latency on user performance for touch input
and found that dragging task performance is affected if
latency levels are above 25ms [20]. In the present work, we
focus on eliminating latency of the touchdown moment
when the user first touches the screen. Jota et al. found that
users are unable to perceive latency of responses to tapping
that occur in less than 24ms [20] – we use prediction of
touch location to provide soft touchdown feedback within
this critical time, effectively eliminating perceptible latency.
Predicting Input
Predicting users' actions has been an active area of research in the field of HCI. MacKenzie proposes the application of Fitts's law to predict movement time for standard touch interfaces [23]. By building a Fitts's-law model for a particular device, the movement time can be predicted given a known target and cursor position. Wobbrock et al. complement this approach with a model to predict pointing accuracy [36]. Instead of predicting movement time, a given movement time is used to predict error. In many pointing experiments, the input device is manipulated by in-air gestures, including Fitts's original stylus-based apparatus [7]. Murata proposes a method for predicting the intended target based on the current mouse cursor trajectory [26]. The author reports movement time reductions when using the predictive algorithm, but notes limited returns for dense target regions. Baudisch et al. adapted this approach: instead of jumping the cursor close to the target, their technique brings eligible targets to the cursor [4].
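For context, the Fitts's-law movement-time prediction mentioned above is commonly expressed in its Shannon formulation, where $a$ and $b$ are empirically fitted, device-specific constants, $A$ is the distance to the target, and $W$ is the target width:
$$MT = a + b \log_2\!\left(\frac{A}{W} + 1\right)$$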
We sought to build on these projects by developing a model of hand motion while performing touch-input tapping tasks, and apply this model to reducing apparent latency.
Models of Hand Motion
Biomechanists [32] and neuroscientists [10, 13] are actively engaged in the capture and analysis of 3D human hand motion. Their interest lies primarily in the understanding of various kinematic features, such as muscle actuation and joint torques, as well as cognitive planning during the hand movement. Flash [10] modeled unconstrained point-to-point arm movement by defining an objective function and running an optimization algorithm. They found that minimization of hand jerk generates an acceptable trajectory. Following the same approach, Uno [32] optimizes for another kinematic feature, torque, to generate the hand trajectory. While informative, these models are unsuitable for our goal of reducing latency, as they are computationally intensive and cannot be computed in real time (for our purposes, in as little as 30ms).
We propose a generic model focusing on the prediction of landing location and touch time based on the pre-touch movement to reduce the time between the finger landing on the screen and the system's apparent response.
Having examined this related work, we turned our attention to the development of our predictive model of hand motion when performing pointing tasks on a touchscreen display. To that end, we first performed a data collection experiment. The data from this experiment were then used to develop our model.
DATA COLLECTION
To form our predictive model of tap time and location, we began by collecting data of tap actions on a touchscreen display. Participants performed tap gestures with varying target distance and direction of gesture. The data were then used to build our model, which we subsequently validated with a study we will later describe.
Participants
We recruited 15 right-handed participants (6 female) aged 22-30 from the local community. Participants reported owning two touch devices (mean) and spending 2-4 hours a day using them. Participants were paid $20 for a half-hour session.
Apparatus
The study was implemented using two different sensors: to sense touch, a Microsoft Surface table 2.0 was used (Samsung SUR40 with PixelSense). Pre-touch data was captured using a Vicon tracking system. Participants wore a motion capture marker-instrumented ring on their index fingertip, which was tracked in 3D at 120Hz.
The flow of the experiment was controlled by a separate PC, which received sensing information from both the Surface touch system and the Vicon tracking system, while triggering visual feedback on the Surface display. The experiment was implemented in Python and shown to the user on the Surface table. It was designed to (1) present instructions and apparatus to the participant, (2) record the position and rotation of the tracked finger, (3) receive current touch events from the Surface, (4) issue commands to the display, and (5) log all of the data.
Task
Participants performed a series of target selection tasks, modeled after traditional pointing experiments, with some modifications made to ensure they knew their target before beginning the gesture, thus avoiding contamination of the collected data with corrective movements. Target location was randomized, rather than presented in a sequential circular order. Further, to begin each trial, participants were required to touch and hold a visible starting point (r=2.3cm), immediately after the target location was shown. They were required to hold the starting point until an audio cue was played (randomly between 0.7 and 1.0 seconds after touch). If the participant anticipated the beginning of the trial and moved their finger early, the trial would be marked as an error.
Immediately after the participant touched the starting point, a target point would appear at the opposite side of the circular arrangement for them to tap. The target size of 2.3cm was selected as a trade-off between our need to specify the end position and our desire to minimize corrective movements. Once a successful trial was completed, participants were instructed to return to another starting point for the next trial. Erroneous tasks were indicated with feedback on the Surface display and repeated.
Procedure
Participants were asked to complete a consent form and a
questionnaire to collect demographic information. They
then received instruction on how to interact with the
apparatus and successfully completed 30 training trials.
After the execution of each trial, a text block at the top right
corner of the screen would update the cumulative error rate
(shown as %). Participants were instructed to slow down if
the error rate was above 5%, but were not given any
instructions regarding their pre-touch movement.
Design
Tasks were designed according to two independent variables: target direction (8 cardinal directions) and target distance (20.8cm and 30.1cm). The combination of these two variables produces 16 unique gestures. There were four repetitions for each combination of direction and distance. Therefore, a session included a total of 64 actions. The ordering of the trials was randomized within each session. Participants completed 3 sessions and were given a 5-minute break between sessions. In summary, 15 participants performed 192 trials each, for a total of 2880 trials.
Measures and Analysis Methodology
For each successful trial we captured the total completion time; the finger position, rotation, and timestamp for every point in the finger trajectory; as well as the time participants touched the screen. Tracking data were analyzed for significant tracking errors, with less than 0.3% of the trials removed due to excessive noise in the tracking data. Based on the frequency of the tracking system (120Hz) and the speed of the gestures, any tracking event that was more than 3.5cm away from its previous neighbor was considered an outlier and filtered (0.6%). The raw data (including outliers) for a particular target location are shown in Figure 2.
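As a minimal sketch of this filtering step (illustrative only; the 3.5cm threshold is taken from the text above, while the data layout and use of NumPy are assumptions), the outlier removal could be implemented along these lines:

import numpy as np

def filter_outliers(points, max_jump_cm=3.5):
    """Drop tracking samples that are more than max_jump_cm from their
    previous neighbor; points is an (N, 3) array of finger positions in cm."""
    points = np.asarray(points, dtype=float)
    keep = np.ones(len(points), dtype=bool)
    # Distance from each sample to its previous neighbor in the raw stream.
    jumps = np.linalg.norm(np.diff(points, axis=0), axis=1)
    keep[1:] = jumps <= max_jump_cm
    return points[keep]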
After removing 8 trials due to tracking noise, we had 2872 trials available for the development of our predictive model.
ANALYSIS & PREDICTING TOUCH
Having collected these tapping gestures, we turned our attention to modeling the trajectories with the primary goal of predicting the time and location of the final finger touch. Here we describe our approach, beginning with a discussion of the attributes of the touch trajectories, followed by the model we derived to describe them.
Note that our three-dimensional coordinate system is right-handed: x and y represent the Surface screen, with the origin at the bottom-left corner of the Surface display, and z is the normal to the display.
Numerical and Qualitative Observations
Time & Goals: participants completed each trial with a mean movement time of 416ms (std.: 121ms). Our system had an average end-to-end latency of 80ms: 70ms from the Vicon system, 8ms from the display, and 2ms of processing. Thus, to drop touch-down latency below the 24ms threshold, our goal was to remove at least ~56ms via prediction. Applying our work to other systems will require additional tuning.
Movement phases: Figure 3 shows that all the trajectories have one peak, with a constant climb before, and a constant decline after. However, we did not find the peak to be at the same place across trajectories. Instead, the majority of trajectories are asymmetrical: 2.2% have a peak before 30% of the total path, 47.9% have a peak between 30-50% of the total path, 47.1% have a peak between 50-70% of the total path, and 2.8% have a peak after 70% of the total path.
We have found it useful to divide the movement into three phases: lift-off, which is characterized by a positive change in height; continuation, which begins as the user's finger starts to dip vertically; and drop-down, the final plunge towards the screen. Each of the lift-off and drop-down phases has interesting characteristics, which we will examine.
Figure 2: Overlay of all the pre-touch approaches to a
northwest target. The blue rectangle represents the
interactive surface used in the study.
Figure 3: Side view overlay of all trials, normalized to start
and end positions.
Figure 4: Trajectory for the eight directions of movement,
normalized to start at the same location (center). The blue lines
represent the straight-line approach to each target.
Figure 6: Trajectory prediction for line, parabola, circle and
vertical fits. Future points of the actual trajectory (black dots)
fit a parabola best.
Lift-off direction: As might be expected, the direction of
movement of the user’s hand above the plane of the
screen is roughly co-linear to the target direction, as
shown in Figure 4. Fitting a straight line to this
movement, the angle of that line to a straight line from
starting point to the target is, on average, 4.78°, with a
standard deviation of 4.51°. Depending on the desired
degree of certainty, this information alone is sufficient to
eliminate several potential touch targets.
Drop-down direction: Figure 5 and Figure 7 show the
trajectory of final approach towards the screen. As can be
seen, the direction of movement in the drop-down phase
roughly fits a vertical drop to the screen. We also note that,
as can be seen in Figure 7, the final approach when viewed
from the side is roughly parabolic. It is clear when
examining Figure 7 that a curve, constrained to intersect on
a normal to the plane, will provide a rough fit. We
examined several options, shown in Figure 6, and found
that a parabola, constrained to intersect the screen at a
normal, and fit to the hover path, would provide the best fit.
Predictive Touch Model
Based on these observations, we present a prediction model, which makes three different predictions at three different stages in the user's gesture: initial direction, final touch location, and final touch time. Making predictions at three different moments allows our model to provide progressively more accurate information, allowing the UI to react as early as possible.
Prediction 1: Direction of Movement
Lift-off begins with a user lifting a finger off the touch
surface and ends at the highest point of the trajectory
(peak). As discussed above, this often ends before the
user has reached the halfway point towards their desired
target. As is also described, the direction of movement
along the plane of the screen can be used to coarsely
predict a line along which their intended target is likely to
fall. At this early stage, our model provides this line,
allowing elimination of targets outside of its bounds.
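A minimal sketch of this early prediction (assuming finger samples are given as (x, y, z) positions in display coordinates; this is an illustration, not the exact fitting procedure used):

import numpy as np

def predict_direction(samples):
    """Prediction 1 (sketch): fit a straight line to the lift-off movement
    projected onto the display (x, y) plane and return the start point and a
    unit direction vector along which the target is expected to lie."""
    xy = np.asarray(samples, dtype=float)[:, :2]   # ignore height (z)
    start = xy[0]
    # Principal direction of the projected motion via SVD (least squares line).
    centered = xy - xy.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    direction = vt[0]
    # Orient the direction away from the starting point of the gesture.
    if np.dot(xy[-1] - start, direction) < 0:
        direction = -direction
    return start, direction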
Figure 5: Final finger approach, as seen from the approaching
direction
Figure 7: Final finger approach, as seen from the side of the approaching direction
The timing of the final touch location prediction (Prediction 2, below) is tuned based on the overall
latency of the system, including that of the hover sensor:
the later the prediction is made, the more accurate it will be,
but the less time will be available for the system to respond.
The goal is to tune the system so that the prediction arrives
at the application so that it can respond immediately, and
have its response shown on the screen at the precise
moment the user touches. Through iterative testing, we
found that, for the latency of our system (display + Vicon,
approximately 80ms) setting thresholds of 4cm (distance to display) and 23° (angle to plane) yielded the best results. Given these unusually high latency values, a more typical system would see even better results.
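A sketch of this trigger logic (the 4cm and 23° values are from the text above; the sample format of (x, y, z) positions in cm is an assumption):

import math

def dropdown_started(prev, curr, max_height_cm=4.0, min_angle_deg=23.0):
    """Return True once the finger is within max_height_cm of the display and
    its motion makes at least min_angle_deg with the display plane, i.e. the
    drop-down phase has begun and the parabolic fit should be made."""
    if curr[2] > max_height_cm:                        # condition 1: proximity to the screen
        return False
    dx, dy, dz = (curr[i] - prev[i] for i in range(3))
    horizontal = math.hypot(dx, dy)
    angle = math.degrees(math.atan2(-dz, horizontal))  # angle below the xy plane
    return angle >= min_angle_deg                      # condition 2: steep enough approach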
With these thresholds, our model predicts a touchdown
location with an average error (distance to actual touch point)
of 1.18cm and standard deviation of 1.09cm, on average, 91
milliseconds (std.: 72ms) before touchdown and at an
average distance of 3.22cm (std.: 1.30cm) above the display.
For the same set of trials, the errors for other curves (see
Figure 6): circular fit (avg.: 1.72cm, std.: 1.62cm), vertical
drop (avg.: 2.43cm, std.: 2.04cm) and a linear fit (avg.:
9.3cm, std.: 4.83cm) are larger than the parabolic fit.
The visual results and statistics indicate that pre-touch data
has the potential to predict touch location long before the
user touches the display. We validate the parabolic
prediction model in a secondary study by using it to predict
touch location in real time.
Figure 8: The parabola is fitted in the drop-down plane using (1) an initial point, (2) the angle of movement, and (3) an intersection orthogonal to the display.
Prediction 2: Final Touch Location
A prediction of the final location of the touch, represented
as an x/y point, is computed by fitting a parabola to the
approach trajectory. This parabola (Figure 8) is constrained
as follows: (1) the plane is fit to the (nearly planar) drop-down trajectory of the touch; (2) the position of the finger
at the time of the fit is on the parabola; (3) the angle of
movement at the time of the fit is made a tangent to the
parabola; (4) the angle of intersection with the display is
orthogonal. Once the parabola is fit to the data, and
constrained by these parameters, its intersection with the
display comprises the predicted touch point. The fit is made
when the drop-down phase begins. This is characterized by
two conditions: (1) the finger’s proximity to the screen; and
(2) the angle to the xy plane exceeding a threshold.
Prediction 3: Final Touch Time
Given that the timing of the prediction of final touch
location is tuned to the latency of the system on which it is
running, the time that it is delivered ahead of the actual
touch is reliable. The goal of this final step is to provide a
highly-accurate prediction of the time the user will touch,
which necessitates waiting until the final approach to the
display. We observed that the final 'drop' action, within the final 1.8cm of a touch gesture, experiences almost no
deceleration. Thus, when the finger reaches 1.8cm from the
display, a simple linear extrapolation is applied assuming a
constant velocity.
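A minimal sketch of this extrapolation (illustrative only; timestamps in seconds and heights in cm are assumed):

def predict_touch_time(t, z, trigger_height_cm=1.8):
    """Prediction 3 (sketch): once the finger is within trigger_height_cm of
    the display, assume constant vertical velocity and extrapolate the time
    at which the height reaches zero. t and z hold the two latest timestamps
    and finger heights, most recent last."""
    (t0, t1), (z0, z1) = t, z
    if z1 > trigger_height_cm or z1 >= z0:
        return None                       # not yet in the final drop, or not descending
    velocity = (z1 - z0) / (t1 - t0)      # cm per second, negative while descending
    return t1 + (0.0 - z1) / velocity     # time at which height reaches zero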
We are able to predict within 2.0ms (mean; std.: 19.5ms),
51ms (mean; std.: 42ms) before touchdown. Note that, due
to the 80ms latency of our Vicon sensor, this prediction is
typically generated after the user has actually touched. We
include it here for use with systems not based on computer
vision and subject to network latency.
For each new point i, when the conditions are satisfied, the tapping location is predicted. To calculate the tapping location, we first fit a vertical plane to the trajectory. Within this plane, let $x$ be measured along the horizontal direction of motion and $z$ be the height above the display. Given the approach angle $d$ and the current point $(x_i, z_i)$, we predict the landing point $(x_e, z_e)$ by fitting a parabola
$$x = az^2 + bz + c.$$
Based on the derivatives at $(x_i, z_i)$ and $(x_e, z_e)$,
$$x'_i = \frac{-1}{\tan(d)}, \qquad x'_e = 0,$$
we calculate $a$, $b$, and $c$ as follows:
$$a = \frac{x'_i - x'_e}{2(z_i - z_e)}, \qquad b = x'_i - 2az_i, \qquad c = x_i - az_i^2 - bz_i.$$
The landing point in this plane is defined as
$$(x_e, z_e) = (c, 0).$$
Converting $(x_e, z_e)$ back to the original 3D Vicon tracking coordinate system yields the landing position.
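The fit above reduces to a few arithmetic steps. The following is a minimal sketch (assuming the sign convention defined above, with the approach angle in radians; it is an illustration, not a complete implementation):

import math

def predict_landing_x(x_i, z_i, approach_angle_rad):
    """Prediction 2 (sketch): fit the constrained parabola x = a*z^2 + b*z + c
    in the drop-down plane and return the in-plane landing coordinate x_e,
    i.e. where the parabola meets the display (z = 0) orthogonally."""
    x_e_slope, z_e = 0.0, 0.0                         # constraints at the landing point
    x_i_slope = -1.0 / math.tan(approach_angle_rad)   # slope dx/dz at the current point
    a = (x_i_slope - x_e_slope) / (2.0 * (z_i - z_e))
    b = x_i_slope - 2.0 * a * z_i
    c = x_i - a * z_i ** 2 - b * z_i
    return c                                          # the parabola crosses z = 0 at x = c

The returned in-plane coordinate is then mapped back through the fitted vertical plane to obtain the predicted touch location on the display.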
MODEL EVALUATION
Having developed our model using the collected data, we sought to validate the model outside the conditions of the first study. We recruited 15 new right-handed participants from the local community (7 female) who had not participated in the first study, with ages ranging from 20 to 30. On average, our participants own two touch devices and spend two to four hours a day using them. Participants were paid $10 for a half-hour session.
From the first study we observed that arm joint movement
skews the trajectory. The longer the distance, the more
skewed the trajectory becomes. Secondly, people
dynamically correct the trajectory. The smaller the target,
the more corrections were observed. To further study these
effects, we included target distance and size as independent
variables. Therefore, our validation study was designed
according to three different independent variables: target
direction (8 cardinal directions), target distance (25.5cm,
32.4cm, and 39.4cm), and target size (1.6cm, 2.1cm, and
2.6cm). The combination of these three variables produces
72 unique tasks. The order of target size and distance was
randomized, with target direction always starting with the
south position, and going clockwise for each combination
of target size and distance. Participants completed 3
sessions and were given a break after each session.
The procedure and apparatus were identical to the first study, with the exception of the prediction model running in the background in real time. The prediction model did not provide any feedback to the participants. For each trial we captured the trajectories and logged the prediction results.
Results
Prediction 1: On average, the final touch point was within 4.25° of the straight-line prediction provided by our model (std.: 4.61°). On average, this prediction was made available 186ms (std.: 77ms) before the user touched the display. We found no significant effect of target size, direction, or distance on prediction accuracy.
Prediction 2: On average, our model predicted a touch location with an accuracy of 1.05cm (std.: 0.81cm). The finger was, on average, 2.87cm (std.: 1.37cm) away from the display when the prediction was made. The model is able to predict, on average, 128ms (std.: 63ms) before touching the display, allowing us to significantly reduce latency. We found no significant effect of target size, direction, or distance on prediction accuracy.
Prediction 3: On average, our model predicted the time of the touch within 1.6ms (std.: 20.7ms). This prediction was made, on average, 49ms before the touch was made (std.: 38ms). We found no significant effect of target size, direction, or distance on prediction accuracy.
These results indicate that our prediction model can be generalized to different target distances, sizes, and directions, with an average drift from the touchdown location of 1.05cm, 128ms prior to the finger touching the device. To provide context, given that our mean trial completion time for the experiment was approximately 447ms, this means that we were able to predict the location of the final touch with 29% of the approach action still remaining.
PREFERRED LATENCY LEVEL
Armed with our prediction model, we are able to provide tapping feedback with a latency range from -100ms to 100ms. From previous work, we know that latencies below 24ms are unperceivable by humans [20]; however, we have no understanding of whether an unperceivable-latency UI is, indeed, preferred by users. Using our predictive model, we generated widgets with different levels of latency and evaluated what amount of latency participants prefer. We were particularly curious about participants' responses to negative latency – that is, having a UI element respond before they finish reaching for it.
Participants
We recruited 16 right-handed participants from the local community (8 male, 8 female) with ages ranging from 20 to 31. On average, our participants own two touch devices and spend three to four hours a day using them. We paid participants $10 for a half-hour session.
Task
The participants were shown a screen with two buttons, each with a different response latency. Before tapping each button once, they were asked to touch and hold a visible starting point until audio feedback, which would occur randomly between 0.7 and 1.0 seconds later, was given. They were then asked to indicate which button they preferred.
Design
Tasks were designed with one independent variable, response latency. To limit combinatorial explosion, we decided to provide widget feedback under five different conditions: immediately as a finger prediction is made (0ms after prediction) and with artificially added latencies of 40, 80, 120, and 160ms to the predicted time, resulting in 10 unique pairs of latency. To remove any possible preference for buttons placed to the left or right, we also flipped the order of the buttons, resulting in 20 total pairs. The ordering of the 20 pairs was randomized within each session. Latency level was also randomly generated. Participants completed 7 sessions of 20 pairs and were given a 1-minute break between sessions, for a total of 2240 trials.
Methodology
To calculate the effective latency we first calculate the response time and the touch time. The response time is calculated by artificially adding some latency (between 0 and 160ms) to the time of prediction. For touch time, we consider when the Surface detected the touch and subtract a known Surface latency of 137ms, measured using the methodology described in [27]. The effective latency is the difference between the response time and the touch time.
Results
After pressing both buttons in one trial, participants indicated which button they preferred. Each trial resulted in 2 points in Figure 9 (individual points not shown): one at (L1, 1) for the preferred latency L1, and one at (L2, 0) for the other latency L2. For each participant, a curve is fit to 280 data points. Three possible curves emerged: increasing, decreasing, and peaked. During debriefing, we questioned participants regarding how they selected the preferred latency, and identified three strategies (Faster is Always Better, On Touch, Visible Latency), aligned with the curve of each participant. The three corresponding curves in Figure 9 were generated from the participants in each of these three groups. The dotted line is a curve fit to all data points, indicating that overall participants preferred latencies around 40ms.
Faster is Always Better. Four participants that preferred negative latency were aware that the system was providing feedback before the actual touch, but were confident that the prediction was always accurate and that, therefore, the system should respond as soon as a prediction is possible.
On Touch. Eight participants preferred a system where effective latency is between 0ms and 40ms. Participants commented that they liked that the system reacted exactly when their finger touched, but not before. When asked why they did not prefer negative latency, participants mentioned loss of control and lack of trust regarding the predictive accuracy of the system as reasons for this preference.
Visible Latency. Four participants preferred visible latency. When asked about the feeling of immediate response, they expressed that they were not yet confident regarding the predictive model and felt that an immediate response wasn't indicative of a successful recognition. Visible latency gave them a feeling of being in control of the system and, therefore, they preferred it to an immediate response. This was true even for trials where prediction was employed.
Our results show that there is a strong preference for latencies that are only achievable through the use of prediction. Overall, our participants indicated that they preferred the lower-latency button in 62% of the study's trials. We ran a Wilcoxon Signed-Rank test comparing the percent of trials where the lower latency was preferred to the percent of trials where the higher latency was preferred, and found a significant difference between the two percentages (Z = 2.78, p = 0.003). 12 out of 16 participants preferred effective latencies below 40ms, a level concluded to be unperceivable for 85% of participants [20].
Figure 9: Preference curve for each observed trend and average latency preference for all participants.
NEW OPPORTUNITIES AND CONSIDERATIONS
In this section, we detail a number of new interaction
opportunities that our prediction model provides and
discuss some of the considerations that system designers
must address when employing these techniques.
Reducing Apparent Latency
Our motivating use case is the reduction of visual latency in
order to provide the user with a more reactive touch-input
experience. Based on our validation study, our model can
predict touch location accurately enough at a sufficient time
horizon to support simultaneous touch and visual response.
A prediction 128ms prior to the finger touching the device
is sufficient to pre-buffer and display the visual response to
the input action. We believe that this work validates the
assertion that computer systems can be made to provide
immediate, real-world-like responses to touch input.
Beyond accelerating traditional visual feedback, our
approach enables a new model of feedback based on
predicted and actual input. With the prediction data from
this model, soft feedback can be designed to provide an
immediate response to tapping, eliminating the perception
of latency. After the touch sensor captures the touch event,
a transition from the previous soft feedback to the next
user interface (UI) state can be designed to provide a
responsive and fluent experience, instead of showing the
corresponding UI state directly.
Reducing Programmatic Latency
Beyond changes to the visual appearance of GUI elements,
touch-controlled applications execute arbitrary application
logic in response to input. A 128-200ms prediction horizon
provides system designers with the intriguing possibility of
kicking off time-consuming programmatic responses to
input before the input occurs.
As an example, consider the widely adopted practice of pre-caching web content based on the hyperlinks present in the
page being currently viewed. Pre-caching has been shown
to significantly reduce page-loading times. However, it
comes at the expense of increasing both bandwidth usage
and the loads on the web servers themselves, as content is
often cached but not always consumed. Additionally, with
the potential for many referenced URLs on any one page, it
is not always clear to algorithm designers which links to
pre-fetch, meaning that clicked-on links may not have
already been cached.
Figure 10: Transitions between 3 states of touch
input that model the starting and stopping of actions, based on
prediction input.
A web-browser coupled with our input prediction model
would gain a 128-200ms head-start on loading linked
pages. Recent analysis has suggested that the median web page loading time for desktop systems is 2.45s [1]. As such,
a head-start could represent a 5-8% improvement in page
loading time, without increasing bandwidth usage or server
burden. Similar examples include the loading of launched
applications and the caching of the contents of a directory.
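As a hypothetical illustration of this idea (the data structures, threading approach, and 2-second timeout are our own assumptions, not part of any particular browser):

import threading
import urllib.request

def prefetch_on_prediction(predicted_xy, link_regions, cache):
    """Start fetching the page whose on-screen link region contains the
    predicted touch point, so the content is (partially) loaded by the time
    the tap is confirmed. link_regions maps (x0, y0, x1, y1) -> URL."""
    x, y = predicted_xy
    for (x0, y0, x1, y1), url in link_regions.items():
        if x0 <= x <= x1 and y0 <= y <= y1 and url not in cache:
            def fetch(u=url):
                try:
                    cache[u] = urllib.request.urlopen(u, timeout=2).read()
                except OSError:
                    pass                  # abandon the head start on failure
            threading.Thread(target=fetch, daemon=True).start()
            break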
The model relies on a high fidelity 3D tracking system,
currently unavailable in most commercial products. Here we discuss how comparable sensing might be enabled in everyday devices. We used a Vicon tracking system, running at
120Hz, to capture the pre-touch data. As this high
frequency tracking is not realistic for most commercial
products, we tested the model at 60Hz, slower than most
commercial sensors. Although prediction is delayed 8ms on
average, the later fit has the benefit of increasing prediction
accuracy, because the finger is closer to the display.
To fully take advantage of predicted input, we propose a
modification to the traditional 3-state model of graphical
input, proposed by Buxton [5], that allows for
programmatic responses to be started and aborted as
appropriate as the input system updates its understanding of
the user’s intent. Figure 10 shows this model: in State 1,
related actions can be issued by the input system as
predictions (direction, location, and time) of a possible
action are received. When no actual input is performed (e.g., the user retracts their hand), the input system will stop all actions. When the actual touch target turns out not to be the predicted one, the system may also stop all actions, but this will not add extra latency compared to the traditional 3-state model. On the other hand, if the touch sensor confirms the predicted action, the latency of the touch sensor, network, rendering, and all related processing will be reduced.
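A minimal sketch of this augmented handling (callback names and structure are assumptions made for illustration):

class PredictiveInputStates:
    """Sketch of the augmented 3-state handling: speculative actions are
    started on a prediction and either confirmed by the touch sensor,
    cancelled when the prediction changes, or stopped if the hand retracts."""
    def __init__(self, start_action, stop_action, commit_action):
        self.start_action = start_action      # begin speculative work for a target
        self.stop_action = stop_action        # abort speculative work
        self.commit_action = commit_action    # run the real response
        self.predicted = None

    def on_prediction(self, target):
        if target != self.predicted:
            if self.predicted is not None:
                self.stop_action(self.predicted)
            self.predicted = target
            self.start_action(target)

    def on_hand_retracted(self):
        if self.predicted is not None:
            self.stop_action(self.predicted)
            self.predicted = None

    def on_touch(self, target):
        if self.predicted is not None and target != self.predicted:
            self.stop_action(self.predicted)  # mispredicted: no worse than baseline
        self.commit_action(target)
        self.predicted = None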
Some commercial products already include accurate hover sensing, such as the Wacom Intuos (EMR-based sensor) and the Leap Motion (vision-based sensor); both are
able to run at 200Hz, with sub-millimeter accuracy.
Moreover, the model predicts tapping location when the
finger is 2.87cm and 3.22cm away from the screen in our
studies; these results are within capabilities of EMR [12] and
vision. Additionally, a number of plausible technologies for
achieving hover sensing appeared recently in HCI research.
HACHIStack [17] has a sensing height of 1.05cm above a
screen with 31µs latency. Retrodepth [21] can track hand
motion in a large 3D physical input space of 30x30x30cm.
Therefore, we believe accurate, low-latency hover sensing is on its way. We also envision that, when faster touch sensors and CPUs eventually bring tapping latency to nearly zero, this model will remain useful for achieving negative latency, which is impossible even for a zero-latency touch sensor.
Recognizing Unintended Input
Another possible application of our prediction model is the
reduction of accidental input by masking unintended areas.
Based on our data analysis, the lift-off itself affords a
coarse prediction of target direction, as the majority of
touches we recorded were roughly planar. In addition, as
the prediction target is updated, the potential area for
touchdown will shrink. Therefore, the input system can
label the touch events in the areas where touchdown is not
likely as accidental events and ignore them.
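A minimal sketch of such masking (the 3cm radius is an illustrative value we chose, not one reported here):

def is_likely_unintended(touch_xy, predicted_xy, radius_cm=3.0):
    """Label a touchdown as accidental when it lands far from the most
    recent predicted location; accept it when no prediction is available."""
    if predicted_xy is None:
        return False
    dx = touch_xy[0] - predicted_xy[0]
    dy = touch_xy[1] - predicted_xy[1]
    return (dx * dx + dy * dy) ** 0.5 > radius_cm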
In this paper, we built a prediction model and evaluated it on long, ballistic pointing tasks. However, in realistic tasks, the finger motion will be much more complex, with pauses, hesitation, and shorter travel distances. To make the model robust to these variations, we propose the fine-tuning of two variables that determine when the system starts predicting: the vertical distance, tuned at 4cm (in Z) to avoid direction changes normal to touch approaches, and the approach angle, tuned at 23° (for our system) to confirm that the finger has entered a drop-down phase. With this tuning, the model predicts location and time in the last 29% of the entire trajectory. Other kinematic features, such as the approach velocity and direction, can also be integrated into the model to make it more robust. Still, there is no doubt that the model would benefit from evaluation with realistic tasks, and we encourage further work toward making the model robust in real-world use.
DISCUSSION
Our results indicate that solving the problem of latency has clear implications for how users perceive system performance. If the predicted touchdown point is not accurate, users can detect the difference, and do not always react favorably, especially when presented with negative latency. On the other hand, it seems that if we are capable of eliminating perceived latency, then, with time, users will adapt and come to expect an immediate response from their interactive systems.
CONCLUSION
Our prediction model is not constrained to only solving
latency. The approach is rich in motion data and can be
used to enrich many UIs. For example, the velocity of a
finger can be mapped to pressure, or the approach direction
can be mapped to different gestures. Equally important, perhaps, is the ability to predict that a finger leaving the display will not land again inside the interaction surface, effectively indicating that the user is stopping
interaction. This can be useful, for example, to remove UI
elements from a video application when the user is leaving
the interaction region.
We present a prediction model for direction, location, and
contact time of a tapping action on touch devices. With this
model, the feedback is shown to the user at the moment
they touch the display, eliminating the touchdown latency.
Results from the user study reveal a strong preference for feedback with unperceivable latency. Also, predicting the touch input long before the actual touch brings the opportunity to reduce not only the visual latency, but also the latency of the various parts of a system involved in responding to the predicted touch input.
ACKNOWLEDGEMENTS
We would like to thank members of DGP and Tactual Labs
for their support of the project. We also thank Kate Dowd
for her writing assistance.
REFERENCES
1. http://news.softpedia.com/news/The-Average-Web-PageLoads-in-2-45-Seconds-Google-Reveals-265446.shtml
2. Samsung Galaxy S5. http://www.samsung.com/global/microsite/galaxys5/
3. Annett, M., et al. 2011. Medusa: a proximity-aware multi-touch tabletop. UIST '11, 337-346.
4. Baudisch, P., et al. 2003. Drag-and-Pop and Drag-and-Pick: techniques for accessing remote screen content on touch- and pen-operated systems. INTERACT '03, 57-64.
5. Buxton, W. 1990. A three-state model of graphical input. INTERACT '90, 449-456.
6. Dietz, P., and Leigh, D. 2001. DiamondTouch: a multi-user touch technology. UIST '01, 219-226.
7. Fitts, P. M. 1954. The information capacity of the human motor system in controlling the amplitude of movement. J. of Experimental Psychology, Vol. 47, No. 6, 381-391.
8. Fitzmaurice, G., Ishii, H., and Buxton, W. 1995. Bricks:
laying the foundations for graspable user interfaces.
CHI’95, 442-449.
9. Fitzmaurice, G., Khan, A., Pieké, R., Buxton, B.,
Kurtenbach, G. 2003. Tracking Menus. UIST '03, 71-79.
10. Flash, T., and Hogan, N. 1985. The coordination of arm movements: an experimentally confirmed mathematical model. Journal of Neuroscience, Vol. 5, No. 7, 1688-1703.
11. Forlines, C. and Balakrishnan, R.. 2008. Evaluating tactile
feedback and direct vs. indirect stylus input in pointing and
crossing selection tasks. CHI '08. 1563-1572.
12. Funahashi, T., Toshiaki S., and Tsuguya Y.. 1989. Position
detecting apparatus. U.S. Patent No. 4,878,553.
13. Galloway, J. & Koshland, G.. 2002. General coordination
of shoulder, elbow and wrist dynamics during multijoint
arm movements. Exp. Brain Rsch, Vol 142,163-180.
14. Grosse-Puppendahl, T. et al. 2013. Swiss-cheese extended:
an object recognition method for ubiquitous interfaces
based on capacitive proximity sensing. CHI '13.
15. Grossman, T., Hinckley, K., Baudisch, P., Agrawala, M., and Balakrishnan, R. 2006. Hover widgets: using the tracking state to extend the capabilities of pen-operated devices. CHI '06, 861-870.
16. Grossman, T., Wigdor, D., and Balakrishnan, R. 2004. Multi-finger gestural interaction with 3D volumetric displays. UIST '04, 61-70.
17. Hachisu, T. and Kajimoto, H. 2013. HACHIStack: dual-layer photo touch sensing for haptic and auditory tapping interaction. CHI '13, 1411-1420.
18. Hilliges, O. et al. 2009. Interactions in the air: adding further depth to interactive tabletops. UIST '09, 139-148.
19. Hinckley, K. et al. 1998. Interaction and modeling techniques for desktop two-handed input. UIST '98, 49-58.
20. Jota, R. et al. 2013. How fast is fast enough? A study of the effects of latency in direct-touch pointing tasks. CHI '13.
21. Kim, D. et al. 2014. RetroDepth: 3D silhouette sensing for high-precision input on and above physical surfaces. CHI '14, 1377-1386.
22. Leigh, D. et al. 2014. High-rate, low-latency multi-touch sensing with simultaneous orthogonal multiplexing. UIST '14.
23. MacKenzie, I. S. 1995. Movement time prediction in human-computer interfaces. Readings in HCI, 483-493.
24. Marquardt, N. et al. 2011. The continuous interaction space: interaction techniques unifying touch and gesture on and above a digital surface. INTERACT '11, 461-476.
25. Miller, R. B. 1968. Response time in man-computer conversational transactions. Fall Joint Computer Conference 1968, 267-277.
26. Murata, A. 1998. Improvement of pointing time by predicting targets in pointing with a PC mouse. IJHCI 10, 1, 23-32.
27. Ng, A., Lepinski, J., Wigdor, D., Sanders, S., and Dietz, P. 2012. Designing for low-latency direct-touch input. UIST '12, 453-464.
28. Spindler, M., Martsch, M., and Dachselt, R. 2012. Going beyond the surface: studying multi-layer interaction above the tabletop. CHI '12, 1277-1286.
29. Steed, A. 2008. A simple method for estimating the latency of interactive, real-time graphics simulations. VRST '08, 123-129.
30. Subramanian, S., et al. 2006. Multi-layer interaction for digital tables. UIST '06, 269-272.
31. Theremin, L. 1928. Method of and apparatus for the generation of sounds. U.S. Patent No. 1,661,058.
32. Uno, Y., Kawato, M., and Suzuki, R. 1989. Formation and control of optimal trajectory in human multijoint arm movement. Biological Cybernetics, 61, 89-101.
33. Wigdor, D., and Wixon, D. 2011. Brave NUI World: Designing Natural User Interfaces for Touch and Gesture, 1st ed. Morgan Kaufmann.
34. Wilson, A. 2010. Using a depth camera as a touch sensor. ITS '10, 69-72.
35. Wilson, A. 2004. TouchLight: an imaging touch screen and display for gesture-based interaction. ICMI '04, 69-76.
36. Wobbrock, J., Cutrell, E., Harada, S., and MacKenzie, I. S. 2008. An error model for pointing based on Fitts' law. CHI '08, 1613-1622.
37. Yang, X., Grossman, T., Irani, P., and Fitzmaurice, G. 2011. TouchCuts and TouchZoom: enhanced target selection for touch displays using finger proximity sensing. CHI '11, 2585-2594.