Match-Up & Conquer: A Two-Step Technique for Recognizing Unconstrained Bimanual and Multi-Finger Touch Input

Yosra Rekik (Inria Lille, University Lille 1, LIFL), yosra.rekik@inria.fr
Radu-Daniel Vatavu (University Stefan cel Mare of Suceava), vatavu@eed.usv.ro
Laurent Grisoni (University Lille 1, LIFL, Inria Lille), laurent.grisoni@lifl.fr

ABSTRACT
We present a simple, two-step technique for recognizing multi-touch gesture input independently of how users articulate gestures, i.e., using one or two hands, one or multiple fingers, and synchronous or asynchronous stroke input. To this end, and for the first time in the gesture literature, we introduce a preprocessing step specifically designed for multi-touch gestures (Match-Up) that clusters together similar strokes produced by different fingers before running a gesture recognizer (Conquer). We report gains in recognition accuracy of up to 10% leveraged by our new preprocessing step, which constructs a more adequate representation of multi-touch gestures in terms of key strokes. It is our hope that the Match-Up technique will add to the practitioners' toolkit of gesture preprocessing techniques, as a first step toward filling today's lack of algorithmic knowledge for processing multi-touch input and leading toward the design of more efficient and accurate recognizers for touch surfaces.

Categories and Subject Descriptors
H.5.2 [Information Interfaces and Presentation]: Information Interfaces and Presentation; I.5.2 [Pattern Recognition]: Design Methodology - Classifier design and evaluation

General Terms
Algorithms; Human Factors; Design; Experimentation

Keywords
Multi-touch gestures; structuring finger movements; gesture recognition; $P recognizer; unconstrained multi-touch input.

1. INTRODUCTION
Multi-touch interfaces are increasingly popular, with applications ranging from personal devices [22] to interactive displays installed in public settings [10] and, lately, applications that span these two interactive spaces [14]. The broad range of interaction styles offered by multi-touch, as well as the large number of degrees of freedom available to users to articulate gestures, have contributed to the success and adoption of this technology. However, the versatility of multi-touch input makes prototyping multi-touch gesture recognizers a difficult task that requires dedicated effort because, in many cases, "the programming of these [multi-touch] gestures remains an art" [16] (p. 1). To date, there are no simple techniques in the style of the $-family [3,27,31] that designers can employ to recognize multi-touch gestures under the large variety of users' articulation behaviors [19] (see Figure 1 for some examples). Consequently, designers have resorted to multi-touch declarative formalisms [13] that require dealing with sometimes complex regular expressions; to toolkits that may be limited to specific platforms only [16]; and to adapting single-stroke recognizers for multi-touch [11], in which case the expressibility of multi-touch input might be affected.

Such a lack of simple techniques for multi-touch gesture recognition has many causes: the complexity of multi-touch input with its many degrees of freedom, our currently limited understanding of how multi-touch gestures are articulated, and, we believe, a lack of algorithmic knowledge for preprocessing multi-touch input by considering its specific characteristics. For example, there are many useful techniques available to preprocess raw gestures, such as scale normalization, motion resampling, translation to origin, and rotation to the indicative angle, which make stroke gesture recognizers invariant to translation, scale, and rotation (see the pseudocode available in [3,4,15,27,31]). Although these techniques are general enough and may be applied to multi-touch gestures as well, they cannot address the specific variability that occurs during multi-touch articulation, such as the use of different numbers of fingers, strokes, and bimanual input [19], leaving multi-touch gesture recognizers non-invariant to these aspects of articulation.

Figure 1: Various multi-touch articulation patterns for the "square" symbol produced with different numbers of strokes (b-h), fingers (b, d-h), sequential (c, d), and parallel movements (e-h). NOTE: Numbers on strokes indicate stroke ordering. The same number on top of multiple strokes indicates that all the strokes were produced at the same time by different fingers.

To overcome this lack of algorithmic knowledge for recognizing multi-touch input, we propose in this work a new preprocessing step that is specific to multi-touch gesture articulation. We demonstrate the usefulness of our new preprocessing technique for recognizing multi-touch gestures independently of how they were articulated. To deal with the complex problem of handling users' variations in articulating multi-touch input [2,19], we draw inspiration from the "Divide & Conquer" paradigm of algorithm design [7] (p. 28), in which a complex problem is successively broken down into two or more subproblems that are expected to be easier to solve. We group together strokes produced by different fingers with similar articulation patterns and identify key strokes as a new representation for the multi-touch gesture (the "Match-Up" step), which we found to improve the accuracy of subsequent recognition procedures (the "Conquer" step). Match-Up & Conquer (M&C) is thus a two-step technique able to recognize multi-touch gestures independently of how they are articulated, i.e., using one or two hands, one or multiple fingers, and synchronous or asynchronous stroke input.
The contributions of this work include: (1) a new preprocessing step, Match-Up, specific to multi-touch gestures (a first in the gesture literature) that structures finger movements consistently into clusters of similar strokes, which we add to the practitioners' toolkit of gesture processing techniques, next to scale normalization, resampling, and rotation to the indicative angle [3,4,15,27,31]; (2) an application of Match-Up to recognize multi-touch input under unconstrained articulation (Match-Up & Conquer), for which we show an improvement in recognition accuracy of up to 10% over an existing technique; and (3) pseudocode for assisting practitioners in implementing Match-Up into their gesture interface prototypes.

2. RELATED WORK
We review previous work on gesture articulation and relate it to existing gesture recognition techniques. We also present recognition results for the $P gesture recognizer [27] applied to multi-touch gestures and point to reasons explaining its performance.

2.1 Variability of gesture articulation
Supporting users' wide range of ways of articulating multi-touch gestures has previously been noted as an important design criterion for delivering increased flexibility and high-quality, pleasurable interaction experiences [10]. This fact has led to a number of orthogonal design implications [9,18,30]. For example, principled approaches attempt to provide basic building blocks from which gesture commands can be derived, such as gesture relaxation and reuse [32], fluidity of interaction techniques [20], and cooperative gesture work [17]. Other researchers have advocated assisting users in the process of learning multi-touch gestures during actual interaction and application usage by proposing gesture visualizations [9], dynamic guides [6], and multi-touch menus [5,12]. At the same time, other user-centric approaches advocate enrolling users from the early stages of gesture set design [14,22,30]. Such participatory studies have led to interesting findings on users' behaviors in articulating gestures as well as on users' conceptual models of the interaction. These works also recommend flexible design of gesture commands to accommodate variations in how users articulate gestures. Oh et al. [18] go further and describe an investigation of the feasibility of user-customizable gesture commands, with findings showing users focusing on familiar gestures and being influenced by misconceptions about the performance of gesture recognizers. To deal with users' variations in multi-touch input, Rekik et al. [19] introduced a taxonomy capturing symbolic multi-touch gestural variations at the mental, physical, and movement levels, as the result of a user-centric study.

2.2 Gesture recognition techniques
For symbolic gestures, the popular $-family of gesture recognizers [3,4,27,31] delivers robust recognition accuracy with a set of simple techniques, in contrast with more complex recognizers such as Hidden Markov Models [23] and statistical classifiers [21]. However, the $-family was mainly designed and validated for single-touch gestures. Jiang et al. [11] extended $1 to recognize multi-finger gestures by aggregating the touch paths into a single stroke. However, the reduction developed in [11] to transform all touch points into a single stroke is incompatible with the fact that multiple strokes can interleave in time, such as drawing a square with two symmetric movements in parallel (Figure 1e).
Several tools have been proposed to help designers create gestures more easily, such as specifying multi-touch gestures as regular expressions [13] or supporting gesture programming by demonstration [16]. However, despite skillful design, these tools do not deal well with input variability. For example, the number of fingers and strokes, as well as their space-time combination, need to be predefined by the designer, and the only variation allowed to users is to move fingers along the same direction in an infinite loop. "Concepture" [8] is such a framework, based on regular language grammars, for authoring and recognizing sketched gestures with infinitely varying and repetitive patterns.

2.3 Multi-touch gestures and $P
We employ in this work the $P recognizer [27] because previous works found it accurate for classifying multi-stroke gesture input under various conditions [1,27]. $P employs point clouds to represent gestures and, therefore, it manages to ignore users' articulation variations in terms of number of strokes, stroke types, and stroke ordering. $P has been validated so far on gestures articulated with single-touch strokes, which is typically the case for pen and single-finger input [1,27]. However, multi-touch gestures exhibit considerably more degrees of freedom, with users articulating gestures with one or both hands, a variable number of fingers, and following synchronous and asynchronous gesture production mechanisms. For example, users frequently employ multiple fingers to produce gestures, and even use different fingers to simultaneously articulate strokes with different shapes [19]. Figure 1 provides an illustration of various articulation patterns captured from participants in our study when asked to produce a square. For such articulations, extracting the key strokes (e.g., the four strokes that make up the square) is a challenging task, and techniques like those discussed in [11] for reusing the $1 recognizer [31] on multi-touch gestures are not applicable when fingers move in parallel (e.g., each finger drawing half the square, as in Figure 1e).

As $P does not depend on the notion of stroke [27], a potential approach to recognizing multi-touch gestures would be to directly apply the matching technique of $P to the point clouds resulting from sampling the multi-touch input. However, we argue that such an approach has limitations rooted in the very nature of multi-touch gesture articulation. For example, Figure 2 shows a specific situation in which two distinct gesture types, "spiral" and "circle", have similar point cloud representations because of the additional points introduced in the cloud when employing multiple fingers. This limitation was confirmed by running the $P recognizer on a set of 22 multi-touch gesture types, for which the recognition accuracy, between 82% and 98%, was lower than in previous works that evaluated $P on similar, but single-touch, gestures [27] (we provide complete details on the evaluation procedure in the Results section).

Figure 2: Adopting a direct point cloud representation for gestures produced with multiple fingers can lead to situations in which different classes of gestures have similar point clouds, such as the "spiral" (a) and the two-finger "circle" (c), which was not a problem for single-touch input (b). More fingers lead to point clouds that are even more difficult to discriminate (d vs. a).

The accuracy problem could be partially handled by sampling more points from the multi-touch input, which would provide more resolution for representing gestures and, potentially, more opportunity for $P to discriminate between different gesture types. However, previous works have shown that the performance of many gesture metrics does not necessarily improve when more sampling resolution is available [25,27]. (As $P was not covered by these works [25,26], we cannot estimate its behavior with increased sampling resolution. However, the peaking phenomenon has often been observed in the pattern recognition community, according to which adding more features beyond a certain point does not improve, but actually increases, classification error [24].) Also, increasing the size of the point cloud increases the time required to compute the $P cost function, which grows as O(n^2.5) with the number of sampled points [27] (p. 278). Consequently, such an approach may prove problematic for low-resource devices, which need particular attention in terms of dimensionality representation in order to be able to sense and recognize gestures with real-time performance [26].
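To make the discussion of $P concrete, the sketch below illustrates, in simplified form, the greedy point-cloud matching that $P applies once a gesture has been flattened into a single cloud. It is a didactic reconstruction based on the published description in [27], not the authors' code: it omits the resampling and normalization steps and the bidirectional matching of the full recognizer, and assumes both clouds already have the same size.

    import math

    def greedy_cloud_distance(points, template, start):
        """One directional pass of a $P-style greedy cloud match: every point in
        `points` is paired with its nearest unmatched point in `template`, with
        earlier pairings weighted more heavily."""
        n = len(points)
        matched = [False] * n
        total = 0.0
        for i in range(n):
            p = points[(start + i) % n]
            best_j, best_d = -1, float("inf")
            for j, q in enumerate(template):
                if not matched[j]:
                    d = math.hypot(p[0] - q[0], p[1] - q[1])
                    if d < best_d:
                        best_j, best_d = j, d
            matched[best_j] = True
            total += (1.0 - i / n) * best_d   # decreasing weights, as in [27]
        return total

    def cloud_match(points, template):
        """Approximate matching cost: minimum over a subset of starting points
        (the full $P also matches in the reverse direction)."""
        n = len(points)
        step = max(1, int(n ** 0.5))
        return min(greedy_cloud_distance(points, template, s)
                   for s in range(0, n, step))

Note that nothing in this matching step knows which finger produced which point; this is exactly the ambiguity illustrated in Figure 2 and the motivation for the Match-Up step introduced next.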
3. MATCH-UP: PREPROCESSING FOR MULTI-TOUCH INPUT
The first step of the M&C technique consists in running a clustering procedure that groups touch points belonging to finger movements that are similar in direction and path shape. Two finger movements are considered similar if they are produced simultaneously and are relatively "close" to each other. The goal of this procedure is to identify key strokes, which are uniquely identifiable movements in the multi-touch gesture (see Figure 3). A key stroke may be composed of one stroke only (i.e., only one finger touches the surface), or it may consist of multiple strokes when multiple fingers touch the surface simultaneously and they all move in the same direction and follow the same path shape.

We describe a touch point p by its 2-D coordinates and timestamp, p = (x_p, y_p, t_p) ∈ R^3. A multi-touch gesture is represented by a set of points, P = { p_i = (x_p^i, y_p^i, t_p^i) | i = 1..n }. The displacement vector of point p between two consecutive timestamps t_{i-1} < t_i is defined as D_p^i = (x_p^i - x_p^{i-1}, y_p^i - y_p^{i-1}), and the angle between the displacement vectors D_p^i and D_q^i of points p and q is given by:

    \theta_{p,q} = \arccos\left( \frac{ \vec{D}^i_p \cdot \vec{D}^i_q }{ \| \vec{D}^i_p \| \, \| \vec{D}^i_q \| } \right)        (1)

We consider two points p and q as part of two similar finger movements (p ≈ q) if their displacement vectors are approximately collinear and p and q are sufficiently close together:

    p \approx q \;\Leftrightarrow\; \theta_{p,q} \le \varepsilon_\theta \;\text{ and }\; \| p - q \| \le \varepsilon_d        (2)

where ‖p - q‖ represents the Euclidean distance between points p and q, and ε_θ and ε_d are two thresholds.

With these considerations, we describe our clustering procedure that groups together the touch points of similar strokes at each timestamp of the articulation timeline. The procedure is based on the agglomerative hierarchical classification algorithm [29] (p. 363):

1. Construct a cluster for each touch point p available at timestamp t, C_j = {p ∈ P | t_p = t}. Initially, all |C_j| = 1. If a touch is detected for the first time, delay its cluster assignment until the next timestamp.
2. For each pair of clusters (C_j, C_k), compute their minimum angle θ_{j,k} = min{θ_{p,q} | p ∈ C_j, q ∈ C_k} and minimum distance δ_{j,k} = min{‖p - q‖ | p ∈ C_j, q ∈ C_k}.
3. Find the pair of clusters (C_j, C_k) for which θ_{j,k} and δ_{j,k} satisfy equation 2 and θ_{j,k} is minimized.
4. If no such pair exists, stop. Otherwise, merge C_j and C_k.
5. If there is only one cluster left, stop. Otherwise, go to 2.

The result of the clustering process clearly depends on ε_θ and ε_d, for which we provide a detailed analysis later in the paper. Once clusters are computed for each timestamp t_i, they are analyzed to derive the key strokes of the multi-touch gesture.
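As an illustration, the following sketch implements the per-timestamp grouping described by steps 1-5 above. It is a minimal reading of the procedure, not the authors' implementation: the Point structure, the degree-based angle helper, and the input_size argument that turns ε_d (given as a fraction of the input area) into the same units as the coordinates are assumptions of this sketch, and points that appeared for the first time (no displacement yet) are assumed to be filtered out by the caller, mirroring step 1.

    import math
    from dataclasses import dataclass

    @dataclass
    class Point:
        x: float
        y: float
        t: float
        dx: float = 0.0   # displacement since the previous timestamp
        dy: float = 0.0

    def angle(p, q):
        """Angle in degrees between the displacement vectors of p and q, equation (1)."""
        dot = p.dx * q.dx + p.dy * q.dy
        norms = math.hypot(p.dx, p.dy) * math.hypot(q.dx, q.dy)
        if norms == 0.0:
            return 180.0   # undefined displacement: never considered similar
        return math.degrees(math.acos(max(-1.0, min(1.0, dot / norms))))

    def distance(p, q):
        return math.hypot(p.x - q.x, p.y - q.y)

    def cluster_timestamp(points, eps_theta=30.0, eps_d=0.125, input_size=1.0):
        """Agglomerative grouping of the touch points of one timestamp (steps 1-5)."""
        clusters = [[p] for p in points]              # step 1: one singleton per point
        max_dist = eps_d * input_size                 # eps_d is a fraction of the input size
        while len(clusters) > 1:
            best = None                               # (theta, j, k) of the best mergeable pair
            for j in range(len(clusters)):            # step 2: pairwise minimum angle/distance
                for k in range(j + 1, len(clusters)):
                    theta = min(angle(p, q) for p in clusters[j] for q in clusters[k])
                    delta = min(distance(p, q) for p in clusters[j] for q in clusters[k])
                    if theta <= eps_theta and delta <= max_dist:   # equation (2)
                        if best is None or theta < best[0]:        # step 3
                            best = (theta, j, k)
            if best is None:                          # step 4: no pair satisfies equation (2)
                break
            _, j, k = best
            clusters[j].extend(clusters.pop(k))       # merge and repeat (step 5)
        return clusters

The default thresholds are the values analyzed in Section 7.1 (ε_θ=30°, ε_d=12.5% of the input size).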
To consistently group clusters into key strokes, we track cluster evolution over time and assign the clusters of time t_i the same stroke identifiers used for the previous timestamp t_{i-1} < t_i. If no clusters exist at t_{i-1}, all the clusters are assigned new stroke identifiers, which corresponds to the case in which users touch the surface for the first time or momentarily release the finger from the surface. Otherwise, we examine the touch points of every cluster and compare their point structure between times t_{i-1} and t_i. Each cluster C_i at moment t_i takes the identifier of a previous cluster C_{i-1} at time t_{i-1} < t_i if the following conditions are met: (i) there exists a subset of points from C_i that also appears in C_{i-1}, i.e., C_i ∩ C_{i-1} ≠ ∅; (ii) all the other points of C_i appeared for the first time at moment t_i; and (iii) all the other points from C_{i-1} were released from the surface. Otherwise, a new stroke identifier is assigned to C_i. The result of this process is a set of clusters {C_i^k | k = 1..K} that reflects the key strokes of the multi-touch gesture (e.g., K=2 for the example in Figure 3).

Figure 3: A multi-touch gesture articulated with two hands and six fingers (left) that has two key strokes (right).

4. CONQUER: RECOGNIZING MULTI-TOUCH GESTURE INPUT
By computing clusters of points that belong to similar strokes, we extract a representative set of points for each key stroke. To do that, we use the trail of centroids across all timestamps t for the clusters that were assigned the same stroke identifier k ∈ {1..K}:

    c_{k,t} = \frac{1}{|C_k|} \sum_{p \in C_k} p        (3)

with key stroke k represented by the set {c_{k,t}}. Key strokes are normalized and resampled with standard gesture preprocessing procedures [3,27,31]. The resulting key stroke representation is then fed into a gesture recognizer, which is the $P recognizer [27] in this work. Figure 4 shows the result of the point resampling procedure before and after running the Match-Up technique for three multi-touch gestures from our set: "circle", "N", and "asterisk".

Figure 4: Gesture resampling with the Match-Up step (b) compared to direct resampling (c) for several symbols in our set (a).
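The Conquer step thus reduces each key stroke to the trail of its cluster centroids (equation 3) before the usual normalization, resampling, and classification. The snippet below is a minimal sketch of that reduction only; the (timestamp, stroke_id, points) input format is an assumption of this sketch, and the subsequent resampling and the call into $P are left to whichever recognizer implementation is used.

    from collections import defaultdict

    def key_stroke_trails(labeled_clusters):
        """Equation (3): one centroid per (timestamp, key stroke identifier).
        `labeled_clusters` is a list of (timestamp, stroke_id, [(x, y), ...]) tuples,
        i.e., the output of Match-Up after stroke identifiers have been assigned."""
        trails = defaultdict(list)
        for t, stroke_id, cluster in sorted(labeled_clusters, key=lambda item: item[0]):
            cx = sum(x for x, _ in cluster) / len(cluster)
            cy = sum(y for _, y in cluster) / len(cluster)
            trails[stroke_id].append((cx, cy))
        return dict(trails)

    # Example: two fingers tracing the same path produce a single key stroke trail.
    clusters = [
        (0, "k1", [(0.0, 0.0), (0.0, 1.0)]),
        (1, "k1", [(1.0, 0.0), (1.0, 1.0)]),
        (2, "k1", [(2.0, 0.0), (2.0, 1.0)]),
    ]
    print(key_stroke_trails(clusters))   # {'k1': [(0.0, 0.5), (1.0, 0.5), (2.0, 0.5)]}

The resulting trails are then treated like ordinary strokes: normalized, resampled, and passed to the point-cloud matcher of $P.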
5. EVALUATION
We conducted an experiment to collect multi-touch gestures in unconstrained conditions in terms of number of strokes, fingers, and single-handed and bimanual input. Our goal was to collect as many variations as possible for articulating multi-touch gestures in order to test the effectiveness of the Match-Up technique in making recognizers invariant to such aspects of multi-touch articulation.

5.1 Participants
Sixteen participants (5 female) took part in the experiment (mean age 27.5 years, SD=4.1 years, all right-handed). Half of the participants were regular users of smartphones and tablet devices, and two were regular users of an interactive tabletop.

5.2 Apparatus
Gestures were collected on a 32-inch (81.3 cm) multi-touch display, 3M C3266PW, supporting up to 40 simultaneous touches. The display was connected to a computer running Windows and our custom data collection software.

The interface of the experiment application showed a gesture creation area covering approximately the entire screen and the name of the gesture symbol to articulate displayed at the top of the screen. Three buttons were available to control each trial: Start, Save, and Next. Before pressing Start, participants could experiment with multi-touch input with their fingers in the gesture creation area. Once Start was pressed, the Save button was activated, enabling participants to save their most recent gesture. The application logged touch coordinates with associated timestamps and identification numbers. Once a gesture was saved, participants could proceed to the next trial in the experiment.

5.3 Procedure
The experiment application asked participants to enter gestures from a dataset containing 22 different symbols (see Figure 5): letters, geometric shapes (triangle, square, horizontal line, circle), symbols (five-point star, spiral, heart, zig-zag), and algebra signs (step-down, asterisk, null, infinite). The gesture set is based on those found in other interactive systems [3,19,28,31]. Gestures were also selected to be general enough so that participants could reproduce them without a visual representation, thus encouraging unconstrained articulation behavior.

Figure 5: The set of 22 gestures used in our experiment.

The application presented trials in a random order, and only the gesture name was shown to participants. For each symbol, participants were asked to create as many different articulation variations as they were able to, given the requirement that executions be realistic for practical scenarios, i.e., easy to produce and reproduce later. We asked for five repetitions of each proposed variation of each gesture type in order to have a sufficient number of training samples to assess the recognition accuracy of M&C. To reduce bias [18], there was no recognition feedback. Also, to prevent any visual content from influencing participants in how they articulate gestures, no visual feedback was provided other than light red circles shown under each finger to acknowledge surface contact.

6. RESULTS
To validate the effectiveness of Match-Up, we conducted a recognition experiment in which we compared the M&C technique, employing the $P recognizer in the second step (Conquer), with the unmodified $P [27]. We chose $P as previous works found it accurate for classifying multi-stroke gesture input under various conditions [1,27]. In this work we follow the same experiment methodology as in [27] (p. 275) by running both user-dependent and user-independent tests on our dataset of 22 distinct gesture types and 5,155 total samples. As the results of Match-Up depend on the values of ε_θ and ε_d (see equation 2), we conducted a preliminary study to compute the optimal values of these parameters, which we found to be ε_θ=30° and ε_d=12.5%. Later in the paper we discuss in detail the impact of these two parameters on the performance of M&C.

6.1 User-dependent training
Recognition rates were computed individually for each participant with the same methodology as in [27] (p. 275). We report results from ≈2.5×10^6 recognition tests obtained by controlling the following factors: (1) the recognition condition, M&C with $P versus unmodified $P, (2) the number of training samples per gesture type, T=1..9, and (3) the size of the cloud, n = 8, 16, 32, and 64 points. Results showed an average recognition accuracy of 90.7% for M&C, significantly larger than the 86.1% delivered by the $P recognizer running without the Match-Up step (Z=5.23, p<.001). For n=32 points (recommended in [27]), M&C delivered 95.5% accuracy with 4 training samples per gesture type and reached 98.4% with 9 samples, while $P delivered 93.5% and 97.4% under the same conditions. Figure 6 illustrates the influence of both the number of training samples T per gesture type and the size of the cloud n on recognition accuracy.

Figure 6: Recognition rates for M&C versus unmodified $P.
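For readers wishing to reproduce the user-dependent protocol, the sketch below shows one way to organize the sampling loop: T randomly chosen training samples and one test sample per gesture type, repeated many times, for a single participant. It is an illustration in the spirit of the methodology of [27], not our evaluation harness; the classify callable, the dictionary-based sample format, and the 100-repetition default are assumptions of this sketch.

    import random

    def user_dependent_accuracy(samples_by_type, classify, T, trials=100):
        """samples_by_type: {gesture_type: [sample, ...]} for one participant,
        with at least T+1 samples per gesture type.
        classify(training, candidate) -> predicted gesture type.
        Returns the mean accuracy over `trials` random splits."""
        correct = 0
        for _ in range(trials):
            training, tests = {}, []
            for g_type, samples in samples_by_type.items():
                chosen = random.sample(samples, T + 1)
                training[g_type] = chosen[:T]          # T random training samples
                tests.append((g_type, chosen[T]))      # 1 random test sample
            for g_type, candidate in tests:
                if classify(training, candidate) == g_type:
                    correct += 1
        return correct / (trials * len(samples_by_type))

The same loop is run once with M&C and once with the unmodified $P as the classify argument, for every combination of T and cloud size n.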
M&C was significantly more accurate than $P for all T ∈ {1..9} and n ∈ {8, 16, 32, 64} (p < .05), except for T=9 and n=64. The influence of the Match-Up step was larger for smaller point clouds (9.4% for n=8) and smaller for fine-grained representations of the point clouds (0.9% for n=64). To investigate this phenomenon further, we divided gestures into four categories according to (1) the number of fingers employed by participants (i.e., one or multiple) and (2) the relative order of stroke articulation (sequential or parallel), as per the taxonomy of Rekik et al. [19]. We took this approach as we hypothesized that the large number of points resulting from multi-finger articulations causes $P to deliver decreased performance simply because the point clouds are not representative enough (see the "circle" and "spiral" in Figure 2). We identified four categories: SS (Single-touch Sequential), SP (Single-touch Parallel), MS (Multi-touch Sequential), and MP (Multi-touch Parallel). Sequential means that strokes are produced one after another. Parallel means that some strokes (typically two) are drawn by different fingers at the same time. Recognition rates were computed for each gesture category under two conditions:

1. Gesture-category-dependent testing: training samples for each gesture type were selected independently of category, while testing samples came from one category at a time, e.g., we test with single-finger and multi-finger gestures separately.
2. Gesture-category-independent testing: training samples for each gesture type were selected from one category at a time, while testing samples were selected independently of category, e.g., we train with single-finger articulations, but test with both single- and multi-finger articulations.

For these tests we report recognition rates for T in {1, 2, 3, 4} because not all participants produced gestures in all categories for all symbols. However, participants did produce at least five samples for each variation they proposed. Also, some symbols were omitted if there were not enough samples to support this requirement (e.g., gestures from the parallel categories were rarely produced for symbols not symmetrical in shape, such as "spiral", "S", "P", etc.). Figures 7 and 8 show the recognition performance of M&C and $P under both conditions. For category-dependent testing, results showed M&C more accurate than $P for cases in which participants articulated gestures with multiple fingers: 93.4% vs. 89.2% for the multi-touch sequential category (Z=6.23, p<.001), and 89.9% vs. 84.9% for multi-touch parallel (Z=5.35, p<.001). Results were no longer significant for single-finger articulations: 92.7% vs. 92.9% for single-touch sequential and 93.1% vs. 92.6% for single-touch parallel, n.s. For the gesture-category-independent scenario, results showed M&C more accurate than the unmodified $P for three out of four categories: 87.4% vs. 84.5% for SS (Z=5.98, p<.001), 87.1% vs. 86.5% for SP (n.s.), 87.8% vs. 82.5% for MS (Z=6.07, p<.001), and 85.6% vs. 81.1% for MP (Z=2.94, p<.005).

Figure 7: Recognition rates for gesture-category-dependent testing. NOTE: Rates are averaged over all gesture types and participants; n=32 points; error bars show 95% CI.

Figure 8: Recognition rates for gesture-category-independent testing. NOTE: Rates are averaged over all gesture types and participants; n=32 points; error bars show 95% CI.

As we found that M&C scored higher for participants who articulated gestures with more fingers, we investigated this finding further. Figure 9 illustrates the percentage of gesture categories per participant, as well as the corresponding recognition rates computed for T=4 and n=32. The figure reveals that recognition rates for M&C are higher than those delivered by the unmodified $P for those participants who employed more fingers during gesture articulation (i.e., a higher percentage of the MS and MP categories). A Spearman test confirmed this observation, showing a significant correlation between the difference in recognition rates of M&C and $P (for all T) and the percentage of the MS and MP categories employed by participants (ρ=.630, p<.001). These results confirm our hypothesis that multi-finger gestures (with more points from more fingers) significantly influence the point cloud resampling of $P, which affects recognition accuracy. Match-Up, however, manages to group similar strokes together and constructs a more representative cloud for multi-touch gestures.

Figure 9: Distribution of gesture categories (left) and recognition rates per participant (right). Note how the recognition rates for M&C are higher for participants employing multiple fingers. NOTE: An asterisk (*) next to a participant number shows the difference is significant at p<.05; error bars show 95% CI.

6.2 User-independent training
In this scenario, we computed recognition rates for each gesture type using data from different participants, as per the methodology in [27] (p. 275). We report results from ≈2.6×10^7 recognition tests obtained by controlling the following factors: (1) the recognition condition, M&C with $P versus unmodified $P, (2) the number of training participants, P=1..15, and (3) the number of training samples per gesture type, T=1..4. The size of the cloud was n=32 points. Results show M&C significantly outperforming $P without the Match-Up step, with 88.9% versus 82.5% (Z=3.400, p<.001). For the maximum number of tested participants (P=15) and training samples (T=4), M&C achieved 94.2%, significantly higher than the 90.8% delivered by $P (Z=8.14, p<.001). The number of training participants had a significant effect on accuracy (χ²(14)=55.82, p<.001), with M&C delivering higher recognition rates than $P (Figure 10, left). As expected, the performance of both M&C and $P improved as the number of training participants increased, from 71.9% and 65.5% with one participant to 93.1% and 89.1%, respectively, with 15 training participants. The number of training samples also had a significant effect on recognition accuracy (χ²(3)=45, p<.001), with both M&C and $P delivering higher rates with more samples: from 85.0% and 78.3% with one sample to 90.8% and 85.2% with 4 samples per gesture type (Figure 10, right).

Figure 10: Recognition rates for the user-independent scenario. NOTE: n=32 points; error bars show 95% CI.

7. DISCUSSION

7.1 Effect of Match-Up parameters
We mentioned before that the key stroke extraction results of the Match-Up technique depend on the values of the ε_θ and ε_d parameters of equation 2. The recognition experiment reported in the previous section employed optimal values for these parameters that we found to maximize recognition accuracy for our set of gestures, i.e., ε_θ=30° and ε_d=12.5%. In this section we present an in-depth analysis of the effect of these parameters on the recognition performance of M&C, for which we re-ran the user-dependent recognition experiment while controlling the values of ε_θ and ε_d (T=4 training samples per gesture type and n=32 points were used this time to keep the running time manageable). We varied the angle ε_θ in the range [0°, 90°] in steps of 5°. For ε_d, we normalized the Euclidean distance to the input area and varied ε_d in the range [0%, 50%] in steps of 2.5%. Overall, we report results for the user-dependent recognition accuracy of M&C under 19 × 21 = 399 different combinations of ε_θ and ε_d (see Figure 11).

Figure 11: Effect of parameters ε_θ and ε_d on the recognition accuracy of M&C. Combined (top) and individual effects (bottom).

As expected, accuracy is low when the angle parameter ε_θ is small (≤10°), which causes nearby points to be incorrectly assigned to distinct clusters. As ε_θ increases, points are clustered correctly, even when they do not follow perfectly collinear stroke paths. For ε_θ ≥ 20°, recognition accuracy is impacted by the distance parameter ε_d. For example, when ε_d is low (≤5%), points are incorrectly treated as individual clusters. As ε_d increases, points that are both collinear and close together are more likely to be clustered correctly. For ε_d ≥ 15%, recognition rates start to decrease significantly (p<.001). This result is explained by the fact that distant points moving in approximately the same direction can be incorrectly clustered together for large ε_d values. The highest recognition accuracy was obtained for ε_θ ≥ 30° and ε_d ∈ [10%, 15%]. For ε_θ=30° and ε_d=12.5%, we observed that stroke extraction matched the structure of the articulated gesture. These values were therefore selected to evaluate the M&C technique in the previous section. Note, however, that the optimal values might need slight adjustment for other datasets; we estimate [10, 45] and [10, 15] as safe intervals for ε_θ and ε_d.

7.2 Confusable gestures
We found gesture type to have a significant effect on recognition accuracy (χ²(21)=113.81, p<.001), with some gestures exhibiting lower recognition rates. Remember that we did not constrain participants to mind the number of fingers, nor did participants follow any training procedure before entering gestures. As a result, our multi-touch dataset includes versatile gestures that expose the intrinsic variability of articulating multi-touch input and, consequently, some of the articulations are challenging to recognize. For example, Figure 12 illustrates several ways in which participants articulated circles. Table 1 shows the top-5 most frequently misclassified gesture pairs for M&C. As the stroke structure of these gestures is similar, small deviations in their articulation likely lead to recognition conflicts. However, maximum error rates were 0.56% for confused pairs and 0.65% for individual gestures.

Figure 12: Several articulations for the "circle" symbol captured in our dataset.

Table 1: Top-5 most confusable pairs (left) and gestures (right). NOTE: user-dependent testing, T=4 samples, and n=32 points.

    Confused pairs     Occurrence  |  Gestures  Occurrence
    Circle × Square    0.56%       |  Square    0.65%
    N × H              0.28%       |  Circle    0.58%
    Circle × Heart     0.28%       |  N         0.32%
    D × Circle         0.24%       |  D         0.31%
    X × Asterisk       0.23%       |  X         0.28%

7.3 Execution time
The execution time of M&C is composed of the time required to identify key strokes (Match-Up) and the time to run the recognizer (Conquer). In practice, the execution time of the Match-Up step, averaged over all participants and gesture types in our dataset, was 22.8 ms (measured on an Intel Xeon CPU at 2.67 GHz), which makes the Match-Up technique suitable for real-time processing.

7.4 On-line computation of Match-Up
By the definition of the clustering procedure, the Match-Up technique may run on-line during actual articulation. In fact, the procedure only needs to keep track of the points detected at the previous timestamp and their corresponding cluster IDs to compute identifiers for the current points. The on-line computability of Match-Up has interesting implications. First, key strokes may be computed in real time and, thus, the articulation type the user is performing can be determined as it happens (e.g., Match-Up can detect that two strokes are being performed at the same time when the user articulates a gesture with two parallel movements). This goes beyond gesture recognition toward discriminating between gesture variation categories. Second, computing key strokes at runtime opens new interaction possibilities. For instance, Match-Up makes it possible to deliver user feedback by displaying the key stroke movements made by fingers as they occur. Such feedback may serve to help users learn gestures in a flexible and consistent manner.
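To illustrate the on-line formulation, the sketch below keeps only the previous frame's clusters and their key stroke identifiers, as described above. It is a simplified version of the Appendix pseudocode: cluster_fn stands for any per-timestamp clustering (e.g., the earlier cluster_timestamp sketch), and identifier matching is reduced to sharing at least one touch id with a previous cluster, which is weaker than the full MATCH-IDS conditions.

    import itertools

    class OnlineMatchUp:
        """Incremental Match-Up: keeps only the previous frame's clusters and their
        key stroke identifiers (Section 7.4)."""

        def __init__(self, cluster_fn):
            self.cluster_fn = cluster_fn      # per-timestamp clustering function
            self.prev = []                    # list of (stroke_id, {touch ids})
            self.new_id = itertools.count(1)

        def update(self, frame):
            """frame: list of (touch_id, point) pairs for the current timestamp.
            Returns a list of (stroke_id, [touch_id, ...]) for this frame."""
            clusters = self.cluster_fn([p for _, p in frame])
            ids_of = {id(p): tid for tid, p in frame}
            labeled = []
            for cluster in clusters:
                touch_ids = {ids_of[id(p)] for p in cluster}
                stroke_id = next((sid for sid, prev_ids in self.prev
                                  if touch_ids & prev_ids), None)
                if stroke_id is None:
                    stroke_id = next(self.new_id)   # new finger(s): new key stroke
                labeled.append((stroke_id, sorted(touch_ids)))
            self.prev = [(sid, set(tids)) for sid, tids in labeled]
            return labeled

Called once per touch frame, update() can drive real-time feedback that highlights the key strokes being formed, as discussed above.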
7.5 Any gesture recognizer
In this work we employed the $P recognizer [27] in the second step (Conquer) of M&C and showed how Match-Up improves its classification accuracy for multi-finger gestures. We note, however, that any other recognizer may be used in conjunction with Match-Up. As a result, we deliver Match-Up as a new add-on to the practitioners' toolkit of gesture processing techniques, contributing new algorithmic knowledge for processing multi-touch gestures.

8. CONCLUSION
We introduced in this work, for the first time in the gesture literature, a preprocessing technique specific to multi-touch gestures that clusters similar finger strokes. We used Match-Up in the M&C technique and showed improved recognition accuracy for multi-touch gestures under unconstrained articulation. Following the practice of previous works in the literature, we make our gesture dataset freely available to other practitioners at http://anonymous. Finally, it is our hope that this work will advance our algorithmic knowledge for multi-touch gestures, leading to the design of more efficient and accurate recognizers for touch-sensitive surfaces.

9. REFERENCES
[1] L. Anthony, Q. Brown, B. Tate, J. Nias, R. Brewer, and G. Irwin. Designing smarter touch-based interfaces for educational contexts. Journal of Personal and Ubiquitous Computing, November 2013.
[2] L. Anthony, R.-D. Vatavu, and J. O. Wobbrock. Understanding the consistency of users' pen and finger stroke gesture articulation. In Proc. of GI '13, pages 87–94.
[3] L. Anthony and J. O. Wobbrock. A lightweight multistroke recognizer for user interface prototypes. In Proc. of GI '10, pages 245–252. Canadian Information Processing Society.
[4] L. Anthony and J. O. Wobbrock. $N-Protractor: a fast and accurate multistroke recognizer. In Proc. of GI '12, pages 117–120.
[5] G. Bailly, J. Müller, and E. Lecolinet. Design and evaluation of finger-count interaction: Combining multitouch gestures and menus. Int. J. Hum.-Comput. Stud., 70(10):673–689.
[6] O. Bau and W. E. Mackay. OctoPocus: a dynamic guide for learning gesture-based command sets. In Proc. of UIST '08, pages 37–46.
[7] T. Cormen, C. Leiserson, R. Rivest, and C. Stein. Introduction to Algorithms, 2nd Ed. MIT Press.
[8] N. Donmez and K. Singh. Concepture: a regular language based framework for recognizing gestures with varying and repetitive patterns. In Proc. of SBIM '12, pages 29–37.
[9] D. Freeman, H. Benko, M. R. Morris, and D. Wigdor. ShadowGuides: visualizations for in-situ learning of multi-touch and whole-hand gestures. In Proc. of ITS '09, pages 165–172. ACM Press.
[10] U. Hinrichs and S. Carpendale. Gestures in the wild: studying multi-touch gesture sequences on interactive tabletop exhibits. In Proc. of CHI '11, pages 3023–3032.
[11] Y. Jiang, F. Tian, X. Zhang, W. Liu, G. Dai, and H. Wang. Unistroke gestures on multi-touch interaction: supporting flexible touches with key stroke extraction. In Proc. of IUI '12, pages 85–88. ACM Press.
[12] K. Kin, B. Hartmann, and M. Agrawala. Two-handed marking menus for multitouch devices. ACM Trans. Comput.-Hum. Interact., 18(3):16:1–16:23.
[13] K. Kin, B. Hartmann, T. DeRose, and M. Agrawala. Proton: multitouch gestures as regular expressions. In Proc. of CHI '12, pages 2885–2894. ACM Press.
[14] C. Kray, D. Nesbitt, J. Dawson, and M. Rohs. User-defined gestures for connecting mobile phones, public displays, and tabletops. In Proc. of MobileHCI '10, pages 239–248.
[15] Y. Li. Protractor: A fast and accurate gesture recognizer. In Proc. of CHI '10, pages 2169–2172. ACM Press.
[16] H. Lü and Y. Li. Gesture Coder: a tool for programming multi-touch gestures by demonstration. In Proc. of CHI '12, pages 2875–2884.
[17] M. R. Morris, A. Huang, A. Paepcke, and T. Winograd. Cooperative gestures: Multi-user gestural interactions for co-located groupware. In Proc. of CHI '06, pages 1201–1210. ACM Press.
[18] U. Oh and L. Findlater. The challenges and potential of end-user gesture customization. In Proc. of CHI '13, pages 1129–1138.
[19] Y. Rekik, L. Grisoni, and N. Roussel. Towards many gestures to one command: A user study for tabletops. In Proc. of INTERACT '13. Springer-Verlag.
[20] M. Ringel, K. Ryall, C. Shen, C. Forlines, and F. Vernier. Release, relocate, reorient, resize: fluid techniques for document sharing on multi-user interactive tables. In CHI EA '04, pages 1441–1444.
[21] D. Rubine. Specifying gestures by example. In Proc. of SIGGRAPH '91, pages 329–337. ACM Press.
[22] J. Ruiz, Y. Li, and E. Lank. User-defined motion gestures for mobile interaction. In Proc. of CHI '11, pages 197–206. ACM Press.
[23] T. M. Sezgin and R. Davis. HMM-based efficient sketch recognition. In Proc. of IUI '05, pages 281–283. ACM Press.
[24] C. Sima and E. Dougherty. The peaking phenomenon in the presence of feature-selection. Pattern Recognition Letters, 29:1667–1674.
[25] R.-D. Vatavu. The effect of sampling rate on the performance of template-based gesture recognizers. In Proc. of ICMI '11, pages 271–278. ACM Press.
[26] R.-D. Vatavu. The impact of motion dimensionality and bit cardinality on the design of 3D gesture recognizers. Int. J. of Human-Computer Studies, 71(4):387–409.
[27] R.-D. Vatavu, L. Anthony, and J. O. Wobbrock. Gestures as point clouds: a $P recognizer for user interface prototypes. In Proc. of ICMI '12, pages 273–280. ACM Press.
[28] R.-D. Vatavu, D. Vogel, G. Casiez, and L. Grisoni. Estimating the perceived difficulty of pen gestures. In Proc. of INTERACT '11, pages 89–106. Springer-Verlag.
[29] A. Webb. Statistical Pattern Recognition. John Wiley & Sons, Ltd.
[30] J. O. Wobbrock, M. R. Morris, and A. D. Wilson. User-defined gestures for surface computing. In Proc. of CHI '09, pages 1083–1092. ACM Press.
[31] J. O. Wobbrock, A. D. Wilson, and Y. Li. Gestures without libraries, toolkits or training: a $1 recognizer for user interface prototypes. In Proc. of UIST '07, pages 159–168. ACM Press.
[32] M. Wu, C. Shen, K. Ryall, C. Forlines, and R. Balakrishnan. Gesture registration, relaxation, and reuse for multi-point direct-touch surfaces. In Proc. of TABLETOP '06, pages 185–192.

APPENDIX
We provide pseudocode for the Match-Up technique, which runs at every timestamp t during the entire duration of articulation of the multi-touch gesture and delivers its representation as key strokes. A key stroke is represented by a cluster of points resulting from grouping together points that belong to similar strokes.

POINT is a structure that defines a touch point with position coordinates (x, y), the two previous positions dp and ddp, and an identification number id. POINTS is a list of points. CLUSTER is a structure that contains a list of points and an id. CLUSTERS is a list of clusters.

MATCH-UP(POINTS P, CLUSTERS previous)
    clusters ← new CLUSTERS
    for each p ∈ P such that dp ≠ null do
        INSERT(clusters, new CLUSTER(p))
    while |clusters| ≥ 2 do
        (A, B) ← MOST-SIMILAR-CLUSTERS(clusters)
        if (A, B) == null then break
        A.points ← A.points ∪ B.points
        REMOVE(clusters, B)
    MATCH-IDS(clusters, previous)
    return clusters

MOST-SIMILAR-CLUSTERS(CLUSTERS clusters)
    ε_θ ← 30, ε_d ← 0.125 · INPUT-SIZE
    θ_min ← ∞, (A_min, B_min) ← null
    for each A ∈ clusters do
        for each B ∈ clusters, B ≠ A do
            θ ← MINIMUM-ANGLE(A, B)
            δ ← MINIMUM-DISTANCE(A, B)
            if (θ ≤ ε_θ and δ ≤ ε_d) then
                if (θ < θ_min) then θ_min ← θ, A_min ← A, B_min ← B
    return (A_min, B_min)

MINIMUM-ANGLE(CLUSTER A, CLUSTER B)
    θ_min ← ∞
    for each p ∈ A.points do
        for each q ∈ B.points do
            a ← SCALAR-PRODUCT(p − dp, q − dq)
            b ← NORM(p − dp) · NORM(q − dq)
            θ ← ACOS(a / b)
            if (θ < θ_min) then θ_min ← θ
    return θ_min

MINIMUM-DISTANCE(CLUSTER A, CLUSTER B)
    δ_min ← ∞
    for each p ∈ A.points do
        for each q ∈ B.points do
            δ ← EUCLIDEAN-DISTANCE(p, q)
            if (δ < δ_min) then δ_min ← δ
    return δ_min

MATCH-IDS(CLUSTERS current, CLUSTERS previous)
    for each C ∈ current do
        matched ← false
        for each K ∈ previous do
            copy ← COPY-POINTS(K.points)
            for each p ∈ C.points such that ddp ≠ null do
                matchedPt ← false
                for each q ∈ K.points do
                    if (p.id == q.id) then
                        REMOVE(copy, q)
                        matchedPt ← true
                        break
                if (not matchedPt) then continue with the next K
            if (SIZE(copy) ≠ 0) then continue with the next K
            C.id ← K.id
            matched ← true
            break
        if (not matched) then C.id ← new id
    return current
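For practitioners who prefer working code over pseudocode, the following is a near-direct transliteration of MATCH-IDS into Python, under the assumption that clusters and points are plain dictionaries ('id' and 'points' for clusters; 'id' and 'ddp' for points). It is offered as a starting point rather than a reference implementation.

    import itertools

    _fresh_id = itertools.count(1)

    def match_ids(current, previous):
        """Assign each current cluster the identifier of a previous cluster whose point
        structure matches (conditions (i)-(iii) in Section 3), otherwise a fresh id."""
        for C in current:
            matched = False
            for K in previous:
                leftover = list(K["points"])          # copy of K's points (COPY-POINTS)
                ok = True
                for p in C["points"]:
                    if p["ddp"] is None:              # point appeared for the first time: skip
                        continue
                    q = next((q for q in leftover if q["id"] == p["id"]), None)
                    if q is None:                     # an old point of C is missing from K
                        ok = False
                        break
                    leftover.remove(q)
                if ok and not leftover:               # no unexplained points remain in K
                    C["id"] = K["id"]
                    matched = True
                    break
            if not matched:
                C["id"] = next(_fresh_id)
        return current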