Combining Motion Sensors and Ultrasonic Hands Tracking for Continuous Activity Recognition in a Maintenance Scenario

Thomas Stiefmeier1, Georg Ogris2, Holger Junker1, Paul Lukowicz1,2, Gerhard Tröster1
1 Wearable Computing Lab, ETH Zürich, Switzerland
{stiefmeier,junker,troester}@ife.ee.ethz.ch
2 Institute for Computer Systems and Networks, UMIT Innsbruck, Austria
{georg.ogris,paul.lukowicz}@umit.at

Abstract

We present a novel method for continuous activity recognition based on ultrasonic hand tracking and motion sensors attached to the user's arms. It builds on previous work in which we have shown such a sensor combination to be effective for isolated recognition in manually segmented data. We describe the hand tracking based segmentation, show how classification is done on both the ultrasonic and the motion data, and discuss different classifier fusion methods. The performance of our method is investigated in a large scale experiment in which typical bicycle repair actions are performed by 6 different subjects. The experiment contains a test set with 1008 activities from 21 classes encompassing 115 minutes, randomly mixed with 252 minutes of 'NULL' class. To come as close as possible to a real life continuous scenario, we have ensured a diverse and complex 'NULL' class, diverse and often similar activities, inter person training/testing and an additional data set used only for training (299 extra minutes of data). A key result of the paper is that our method can handle user independent testing (testing on users that were not seen in training) nearly as well as the user dependent case.

1 Introduction

Activity recognition is well established as a key functionality of wearable systems. It enables a variety of applications such as 'just in time' proactive information delivery, contextual annotation of video recordings, simplification of the user interface, and assisted living systems (e.g. [5]). In a large industrial project (WearIT@Work, sponsored by the European Union under contract EC IP 004216), our groups for example deal with activity recognition based support of assembly and maintenance tasks. The aim is to track the progress of an assembly or maintenance task using simple body-worn sensors and to provide targeted assistance in the form of manuals, warnings and advice. The specific domains addressed are aircraft maintenance and car assembly; however, similar applications exist in many other areas.

The work described in this paper is part of an ongoing effort to develop reliable methods to track and recognize relevant parts of such assembly or maintenance tasks. It focuses on continuous recognition from an unsegmented data stream using a combination of arm mounted motion sensors and ultrasonic hand tracking. The above sensor combination is motivated by the fact that maintenance and assembly tasks are largely determined by the interaction between the user's hands and specific parts of the machinery that is being maintained or assembled. Thus we combine motion sensors that monitor the gestures performed by the user's hands with a method for tracking the hands with respect to the machinery. In a previous publication, we have described the results of an initial experiment showing that such a sensor combination is indeed effective in classifying maintenance activities [11]. That experiment was performed on isolated gestures, where hand partitioned segments containing the relevant activities were presented to the classifier.
In this paper we extend the previous work to include automatic spotting of relevant segments in a continuous data stream. We validate our methods on a new, much larger, multi person data set consisting of a total of 115 minutes of sequences from 6 subjects, each with 168 relevant activities, and about 252 minutes of non relevant ('NULL' class) activity. The sequences are all used for validation; a separate set of 120 instances of each relevant activity (20 from each subject) is used for training.

1.1 Related Work and Design Choices

The three main approaches to activity recognition are video analysis (e.g. [17, 18, 21]), augmentation of the environment (e.g. [1, 12]), and the use of wearable sensors (e.g. [2, 13, 14]). The above are neither mutually exclusive, nor can one be said to be superior in general. Instead, the choice of a method or method combination depends on the specific application. Due to computing power limitations, varying light conditions and issues with occlusion and clutter, we have decided against the use of video analysis.

Interaction with Objects
The use of environment augmentation is also problematic. Instrumenting each single part of an aircraft with RFIDs (as done in [12] for interaction with household objects), switches (as demonstrated in [1] for furniture assembly) or other sensors is not applicable in every scenario. On the other hand, purely wearable sensors have only a limited capability of detecting which part of an object the user is interacting with. As shown in our previous work [8, 16], analysis of the sound made by the interaction with the object using a wearable microphone is an exception and provides a considerable amount of information. However, it has a number of problems. In particular, it only provides information about those tasks that actually cause a characteristic sound, and it does not work in noisy environments. As a consequence, tracking the hands' position with respect to the object of the maintenance/assembly task is a promising approach. Assuming that plans of the object exist in an electronic format, all that needs to be done is to tie the frame of reference of the tracking system to the object in question.

Hands Tracking
Different approaches can be taken to tracking body parts. In biomechanics applications such as high performance sports or rehabilitation, magnetic systems (e.g. from Ascension, http://ascension-tech.com) are widely used. Such systems use a stationary source of a predefined magnetic field to track body mounted magnetic sensors. The main problem with magnetic tracking is that it is easily disturbed by metal objects, which are common in our application domain. Another alternative is the use of optical (often IR) markers together with appropriate cameras. Here, problems like background lighting (especially for IR systems) and occlusion need to be dealt with. The main disadvantage of both magnetic and optical tracking systems is that they are optimized for ultra high spatial resolution and are thus expensive and bulky. Based on the above considerations we have opted for an ultrasonic tracking system. Such systems are widely used for indoor location [10, 15] and relative positioning [4]. They are relatively cheap and require only little infrastructure. In general, placing three or four beacons at predefined locations in the environment is sufficient. Due to the physical properties of ultrasound (see also 2.1), it has a number of problems when used for hand tracking. In particular, it is subject to reflections and occlusions and has limited (1 to 5 Hz) sampling rates.
However, in previous work we have been able to show that despite those problems it is a useful source of information for the classification of maintenance activities [11].

Continuous Recognition
Building on these results, we now demonstrate recognition from a continuous, unsegmented data stream. Independently of the sensor modalities used, this is known to be a hard problem. It is particularly difficult in the so called spotting scenario, where the relevant activities are mixed with a large number of arbitrary other actions. In our case this means that in between activities related to the maintenance task the worker might do things like scratching his head, taking a phone call, drinking or searching for tools. Much work on the spotting problem comes from the gesture recognition area. In [3] different variants of HMMs were used. In [7] a two level approach is proposed. Our groups have also investigated different approaches such as novel segmentation methods for motion sensors [6] and sound based segmentation methods [8]. One of the most successful continuous recognition results is [12], where RFIDs were used to track which household objects the user interacted with.

Paper Contributions
Despite the progress made by the above work, spotting of activities in a continuous data stream is still an open problem. We present an approach, based on a novel sensor combination, that represents a significant step towards a solution. We provide a detailed description of our method, including an in-depth evaluation of different classifier fusion methods. We evaluate the performance of our method in a realistic setting with 21 diverse, often very similar classes of activities and a rich, randomly inserted 'NULL' class. The experiment involves a total of nearly 10 hours of data with around 3500 instances of relevant activities (test and training set) performed by 6 subjects. One of the most significant results is the fact that our method can handle user independent training nearly as well as the user dependent case.

2 Approach

As described in the introduction, the basic idea behind our approach to continuous recognition is to correlate arm gestures with the hand location with respect to the object being maintained/assembled. The assumption is that the probability that a gesture resembling a certain maintenance activity is accidentally performed at the location corresponding to this activity is very low. Figure 1 gives an overview of our implementation of this idea. We use the ultrasonic position information to select data segments containing potentially interesting activities. In each segment we then separately perform one classification based on the position information and one based on the motion signals. The resulting classifications are then combined using an appropriate classifier fusion method.

Figure 1. Recognition Architecture (location based segmentation, location based classification, motion based classification, classification fusion)

2.1 Ultrasonic Analysis

Positioning
Ultrasonic positioning systems rely on time of flight measurements between a mobile device and at least three reference devices fixed at known positions in the environment. We are using the same hardware platform (Hexamite, http://www.hexamite.com) as in [11]. The main difference in this work concerning the position acquisition is that we are using 4 fixed devices instead of 3 to be able to adopt a Least Squares Optimization (LSQ), more precisely the Levenberg-Marquardt algorithm [9]. Since an LSQ is not able to deal well with asynchronous distance readings, we used ultrasonic transmitters instead of listeners at the user's hands. During the experiments, the transmitters are therefore body-worn and the listeners are fixed devices.
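To illustrate the position acquisition step, the sketch below estimates a 3D transmitter position from its measured distances to the four fixed listeners with a Levenberg-Marquardt least squares fit, in the spirit of the description above. It is a minimal sketch, not the implementation used in the experiments: the listener coordinates, the example distances and the helper name estimate_position are hypothetical, and SciPy's least_squares routine merely stands in for whatever LSQ implementation was actually employed.

```python
import numpy as np
from scipy.optimize import least_squares

# Known positions of the four fixed ultrasonic listeners (hypothetical values, in metres).
LISTENERS = np.array([
    [0.0, 0.0, 2.5],
    [4.0, 0.0, 2.5],
    [4.0, 3.0, 2.5],
    [0.0, 3.0, 2.5],
])

def estimate_position(distances, initial_guess=(2.0, 1.5, 1.0)):
    """Estimate the 3D position of a wrist-worn transmitter from its measured
    distances (time of flight times the speed of sound) to the four listeners."""
    distances = np.asarray(distances, dtype=float)

    def residuals(p):
        # Difference between the predicted and the measured range for each listener.
        return np.linalg.norm(LISTENERS - p, axis=1) - distances

    # method='lm' selects Levenberg-Marquardt; four range measurements for three
    # unknowns give the over-determined system that the least squares fit needs.
    result = least_squares(residuals, x0=np.asarray(initial_guess, dtype=float), method="lm")
    return result.x

if __name__ == "__main__":
    # Example call with made-up range readings (metres).
    print(estimate_position([2.9, 2.6, 2.8, 3.1]))
```

Using four listeners for three unknown coordinates leaves one redundant range measurement, which is what makes the least squares formulation worthwhile in the first place.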
Segmentation
A position based segmentation seems promising because the user's location, and even more so the location of the hands, is a strong indicator for the start or end of specific gestures. The positions of interest have to be defined in advance. This is done in a semiautomatic way during the training. The gestures are manually grouped into a set of 11 locations as defined in Table 1, column 4. For both hands, means and variances are modeled for these locations according to the training data. During the gesture spotting task, the Mahalanobis distance is used to estimate the probability of each sample being part of a specific location. This distance measure has been chosen because it takes the variances of the location with respect to the particular dimensions into account. For each location i, a separate threshold is trained as θi = µi + f · σi, where µi is the mean value of all Mahalanobis distances calculated during the training of location i and σi is the corresponding standard deviation. The constant factor f is optimized during the spotting task itself by applying the evaluation metric defined by Ward et al. in [19]. The Mahalanobis distance di is then calculated for each sample and each position of interest. In case di < θi, the sample is assumed to be close to location i. Groups of samples with di < θi which are longer than a certain threshold θlength are assumed to be connected segments containing a possible gesture candidate. θlength is defined and trained in an analogous manner to θ. In the end this results in a parallel segmentation for each of the 11 positions of interest.

Classification
As a next step the position data is classified. Possible approaches to position based gesture classification are shown in the previous paper [11]. For now we decided to go for a Mahalanobis classification similar to the position segmentation itself, except that all gestures are trained separately. For each presegmented subsequence, each sample is classified using the Mahalanobis distance. A majority vote over all samples then assigns a final gesture class to the subsequence. The feature vector consists of the x, y and z coordinates of both wrist-worn ultrasonic devices, as depicted by the gray boxes on the right in Figure 2.
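The following sketch condenses the segmentation and position classification steps just described: a Gaussian model per location (or per gesture) provides Mahalanobis distances, the threshold is derived as θi = µi + f · σi from the training distances, runs of below-threshold samples longer than a minimum length become candidate segments, and a per-sample nearest-model majority vote labels a segment. It is a simplified, hypothetical rendering (the class and function names are ours, and details such as the per-hand models and the training of θlength are omitted), not the authors' code.

```python
import numpy as np

class MahalanobisModel:
    """Gaussian model (mean, covariance) with a trained Mahalanobis distance threshold."""

    def __init__(self, training_samples, f=1.0):
        X = np.asarray(training_samples, dtype=float)          # shape (n_samples, n_dims)
        self.mean = X.mean(axis=0)
        self.cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))
        d_train = np.array([self.distance(x) for x in X])
        # Threshold theta = mu + f * sigma over the training distances (cf. Section 2.1).
        self.theta = d_train.mean() + f * d_train.std()

    def distance(self, x):
        diff = np.asarray(x, dtype=float) - self.mean
        return float(np.sqrt(diff @ self.cov_inv @ diff))

def segments_for_location(samples, location_model, min_length):
    """Return (start, end) index pairs of runs whose distance to the location model
    stays below its threshold and which last at least min_length samples."""
    below = np.array([location_model.distance(x) < location_model.theta for x in samples])
    segments, start = [], None
    for i, flag in enumerate(np.append(below, False)):          # sentinel closes a trailing run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_length:
                segments.append((start, i))
            start = None
    return segments

def classify_segment(samples, gesture_models):
    """Majority vote: every sample votes for the gesture model with the smallest
    Mahalanobis distance; the most frequent vote labels the whole segment."""
    votes = [min(gesture_models, key=lambda g: gesture_models[g].distance(x)) for x in samples]
    return max(set(votes), key=votes.count)
```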
2.2 Motion Analysis

For the motion classification, Hidden Markov Models (HMMs) have been chosen, which had proven to be a good approach for time sequential motion modeling in previous experiments and studies. Each manipulative gesture in our experiment corresponds to an individually trained HMM. A thorough analysis and evaluation of the number of states per model, ranging from 5 to 12, resulted in models with 7 to 9 states. The number of states reflects the complexity of the respective manipulative gesture. We exclusively used so called left-right models. As features for the HMMs, we used raw inertial sensor data on the one hand. On the other hand, we derived orientation information from the set of inertial sensors in the form of Euler angles to complement the raw sensor data features. The deployed set of features comprises the following subset of available sensor signals and derived quantities: two acceleration and one gyroscope signal from the right hand, pitch angles from the right lower and upper arm, two acceleration signals from the left hand and the pitch angle of the left upper arm. The observations of the HMMs correspond to the raw sensor signals or derived angle features. Their continuous nature is modeled by a single Gaussian distribution for each state in all models.

2.3 Fused Classification

Plausibility Analysis (PA)
The most obvious fusion method is the use of the wrist position information to constrain the search space of the motion based classifier. Both the frame based and the HMM classifier result in a ranking for either the whole set of gestures (HMM) or a subset of gestures (frame based). For the HMM classifier we chose this subset manually by taking the three most likely gesture classes. Beginning with the most likely gesture according to the motion result, we analyze the plausibility of this gesture class with respect to the position. If the result fits with the position, it is assumed to be the correct class; otherwise the next candidate is tested. If the whole set of possible candidates has been tested and we end up with no plausible class concerning the position, the current subsequence is classified as a 'NULL' event. Whether gesture i is plausible according to the position readings is decided by calculating the Mahalanobis distances to the trained means and variances for gesture i for all frames of the subsequence. If the median of these distances is below a certain threshold level, it is assumed to be a plausible gesture result. The threshold is defined and trained in an analogous manner to θ (see 2.1).

Classifier Fusion
The most complex fusion method is a true classifier fusion, where a separate classification is performed on the ultrasonic and the motion signals. In general, both classifications produce a ranking starting with the most likely and ending with the least likely class. The final classification is then based on a combination of those two rankings and the associated probabilities. In more advanced schemes, confusion matrices from a training set are taken into account. To compare the plausibility analysis with other fusion methods, the position based Mahalanobis classifier was fused with the HMM motion classifier. This was done using two approaches: (1) comparing the average ranking of the top choices of both classifiers, which we will refer to as AVG, and (2) considering the confusion matrix (CM) that is produced when testing the training sets with the classifier. From this confusion matrix we get an estimate of the probability that the classifier recognizes class Gi although class Gj is true. Given class GA as the result of classifier CA and class GB as the result of classifier CB, we consider the probabilities P(CA = GA | GB) and P(CB = GB | GA) to be estimates of the reliability of the classifiers CA and CB. The most reliable result is believed to be true. Another possibility is to do a plausibility analysis of the motion classification and then fuse the position classification with the remaining motion candidates using either the AVG or the CM fusion method.
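As a concrete reading of the plausibility analysis, the sketch below walks down the motion classifier's ranking and accepts the first candidate whose median Mahalanobis distance to that gesture's trained position model stays below the trained threshold, returning a 'NULL' label otherwise. It reuses the hypothetical MahalanobisModel from the previous sketch and is only an interpretation of the rule described above, not the actual implementation.

```python
import numpy as np

NULL_CLASS = "NULL"

def plausibility_fusion(motion_ranking, position_samples, position_models, top_n=3):
    """Plausibility analysis (PA): test the top-n motion candidates against the
    position data and return the first plausible one, or the 'NULL' class.

    motion_ranking   -- gesture classes ordered from most to least likely (HMM output)
    position_samples -- hand coordinate samples of the current subsequence
    position_models  -- dict mapping gesture class -> trained MahalanobisModel (see above)
    """
    for gesture in motion_ranking[:top_n]:
        model = position_models[gesture]
        distances = [model.distance(x) for x in position_samples]
        # Plausible if the median distance stays below the threshold trained for
        # this gesture's position model (threshold trained analogously to theta).
        if np.median(distances) < model.theta:
            return gesture
    return NULL_CLASS
```

The AVG and CM variants described above would instead combine the position and motion rankings directly (by average rank, or weighted by confusion matrix based reliability estimates) rather than walking down a single ranking.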
3 Experimental Setup

The setup for this work is based on the bicycle repair experiment described in [11]. As mentioned in the introduction, it has been extended to reflect as realistically as possible a real life continuous task tracking scenario. This includes frequent, random insertion of complex 'NULL' events (68.7% of the total data length), the recording of a very large data set (367 min with 1008 gestures), a separate 'non in sequence' training set (an additional 258 min with 2520 gestures), and the consideration of inter subject training and recognition (6 subjects). The details of the experiment are summarized below.

3.1 Experimental Environment

No instrumentation or sensors of any kind have been attached to the bike. It has been mounted on a special repair stand for ease of reaching the different parts. In order to use the ultrasonic system, the room has been equipped with four ultrasonic listeners. They have been placed at exactly predefined places and serve as the reference for the distance measurement using the ultrasonic sensors. Two types of sensors have been placed on the user. First, ultrasonic transmitters (Hexamite HX900SIO) are mounted on both wrists to track the hands' position with respect to the bicycle. Second, a set of 5 inertial sensor modules (MT9 from Xsens, containing accelerometers, gyroscopes and magnetic field sensors) has been attached to the user's hands, lower and upper arms as depicted in Figure 2.

Figure 2. Sensor placement

3.2 The Task

We adapted and extended the gesture set used in [11]. The result is a set of 21 manipulative gestures that are part of a regular bicycle repair task. As described below, they have been chosen to provide as much information as possible about the suitability of our approach to the recognition of different types of activities. There are gestures that contain very characteristic motions as well as ones that are highly unstructured. Similarly, there are activities that take place at different, well defined locations as well as ones that are performed at (nearly) the same locations or are associated with vague locations only. Table 1 gives a full overview of the used gestures. The key properties in terms of recognition challenges can be summarized as follows.

pumping (gestures 1 and 2)
Pumping begins with unscrewing the valve. Thus, it consists of more than just the characteristic periodic motion. Pumping the front and the back wheel differs in terms of location; however, depending on where the valve is during pumping, the location is rather vaguely defined. People tend to use different valve positions for the front and the back wheel, which means that statistically there is a difference in the acceleration signal as well.

screws (gestures 3 to 8)
The sequence contains the screwing and unscrewing of three screws at different, clearly separable locations. Screws B and C require a screwdriver and screw A requires a special wrench. Combined with the different arm positions required to handle each screw, this provides some acceleration information to distinguish between the screws (in addition to the location information).

pedals (gestures 9 to 12)
The set contains four pedal related gestures: just turning the pedals, turning the pedals and oiling the chain, switching gears (with the other hand) while turning the pedals, and turning the pedals while marking unbalances of the back wheel with chalk. The pedal turning is a reasonably well defined gesture.

wheel spinning (gestures 13 and 14)
The wheel spinning gestures involve turning the front or the back wheel by hand. The gestures contain a reasonably well defined motion (the actual spinning). However, there is also a considerable amount of freedom in terms of the overall gesture. Front and back can be easily distinguished by location. In most cases different hand positions were used for turning the front and the back wheel.
bell (gesture 15)
Another challenging gesture is the testing of the bell. The time for ringing the bell up to 5 times is so short that only a few location samples are available.

seat (gestures 16 and 17)
These gestures alter the position of the seat. The first increases the seating position by twisting the seat within its mounting using both hands. In addition to the twisting, the degrading gesture requires pounding with a fist to drive the seat into its mounting.

(dis)assembly (gestures 18 to 21)
Among the most difficult to recognize gestures in the set are the assembly and disassembly of the pedals and the back light. All of them can be performed in many different ways, while the hand seldom remains at the same location for a significant time.

ID  Description                             Location Class  Reduced Gesture Set
 1  Pumping at front wheel                  a               A
 2  Pumping at back wheel                   b               B
 3  Unscrew screw A                         c               C
 4  Tighten screw A                         c               C
 5  Unscrew screw B                         d               D
 6  Tighten screw B                         d               D
 7  Unscrew screw C                         e               E
 8  Tighten screw C                         e               E
 9  Turning pedals                          f               F
10  Turning pedals and oiling               f               G
11  Turning pedals and switching gears      f               H
12  Turning pedals and marking unbalances   f               I
13  Turn front wheel                        g               L
14  Turn back wheel                         h               M
15  Test bell                               i               N
16  Increase seating position               j               O
17  Degrade seating position                j               O
18  Disassemble pedal                       f               P
19  Assemble pedal                          f               P
20  Remove bulb                             k               Q
21  Insert bulb                             k               Q

Table 1. Set of Manipulative Gestures

3.3 Reduced Gesture Set

The above gesture set contains many pairs that differ only in a small detail. This includes fastening and unfastening a given screw, assembling and disassembling the pedal/light, and lowering and raising the seat. In every pair, both gestures are performed at the same location. The motions differ only slightly. Thus, both fastening and unfastening a screw involve rotational motion in both directions. The difference is that while turning in one direction the screwdriver is engaged with the screw, while in the other it is not. We have labeled such pairs as two distinct gestures, since the objective of this work is to test the limits of recognition performance. However, it is also interesting to see the overall system performance without such extreme cases. To this end, we have defined a second, so called reduced gesture set, in which such nearly identical pairs are treated as a single activity (see Table 1, column 5).

3.4 The NULL Class

The difficulty of continuous recognition depends on the complexity of the 'NULL' events that separate the task related gestures. Recognition would be fairly easy if the user could be relied on to start and finish each gesture in a well defined position. It would also be easier if the relevant activities were performed immediately after each other with little in between but moving between the different locations. Unfortunately, neither of the above is likely in a real life maintenance scenario. As a consequence, we have put great emphasis on having complex, random 'NULL' events in our tests. To this end, the following events were randomly included in the stream of manipulative gestures used to test our method.

a. Assembling and disassembling the front wheel. This is a very complex activity that contains gestures very similar to our gesture set (e.g. unscrewing the wheel mount).

b. Walking over to a notebook placed about 3 meters away from the bike to type a few characters.
c. Cleaning a user selected part of the bike. Here the 'NULL' class could potentially be close to some relevant gestures both in terms of location and motion.

d. Holding on to a user selected part of the frame for a user selected period of time (a few seconds). Here we often had a relevant location but no motion.

In addition to the above random gestures, the user had to pick up and put away tools. No instructions were given to the users on what to do or not to do in between gestures. Overall, the 'NULL' class amounted to 68.7% of the recording time; no 'NULL' class event interrupted a manipulative gesture. The random number generator was set to generate approximately as many 'NULL' class events as there were real events in the sequence.

3.5 Data Recording

We recorded two types of data sets. The first type comprises 20 repetitions of each of the 21 available manipulative gestures. Here, the repetitions are separated by a few seconds, in which the test subject returned to a defined 'home' position. This data type is called train data. The second type of recording involves all 21 manipulative gestures in a randomly generated order. We refer to this data type as sequence data. This ensures that even gestures with little complexity are carried out with a certain variability, giving the data real-life conditions. The gestures in the sequence data are separated by randomly inserted 'NULL' class events as described above. The experiment has been performed by one female and five male test subjects. For each of the six test subjects, we recorded 20 train repetitions of each gesture and eight sequences containing all of the 21 gestures. This results in 2520 different gesture instances of type train (299 minutes) and 1260 gesture instances of type sequence (115 min).

Training Set Rationale
Splitting the data into a train and a test set in the above way is motivated by practical considerations related to the envisioned real-life deployment of such systems. On any given piece of machinery, the set of possible individual actions (manipulative gestures) that can be taken is likely to be by far smaller than the set of all possible maintenance sequences. In fact, since the maintenance sequences are permutations of individual actions, in theory there can be exponentially more sequences than individual actions. As a consequence, in any practical system we will have to train gestures individually and not as part of a certain sequence. This also makes labeling of the training sequence much easier, since predefined start and stop positions can be used. The downside of this method is that if the gestures are trained separately as described above, the onset and the end phase of the gestures will likely be different than in a real maintenance sequence. Also, as a person is likely to repeat the same gesture a large number of times, the repetitions are likely to get 'sloppy'. Thus the training will be less effective. However, since the objective of the work was to get as close as possible to a real-life scenario, we have used this training strategy despite the above drawbacks.

To evaluate and fine-tune the proposed segmentation method, the error categorization approach from [19] was applied. It extends the standard insertion, deletion and substitution error types with three additional categories:

Timing errors. This refers to instances where an event was correctly recognized, however, the timing of the event is not entirely right. Thus the segmentation output might be slightly shorter (underfill), slightly longer (overfill) or shifted (underfill on one side and overfill on the other). In terms of the quality of the segmentation as a basis for event based recognition, timing errors are obviously much less important than 'true' insertions or deletions.

Merges. This class refers to instances when two ground truth events belonging to the same class have been merged into a single event by the segmentation algorithm. This type of error tends to be ignored by the conventional error description, since strictly speaking there is neither a deletion nor an insertion.

Fragmentations. This is in a way the opposite of merges: a single ground truth event is split into two or more.

The above error categories are illustrated in Figure 3, which shows a typical segmentation output from our experiment.

Figure 3. Segmentation example for a sequence (ground truth and segmentation output plotted per location class a-k over time, with insertions, deletions, merges, fragmentations and timing errors marked)

The evaluation of the segmentation is based on a comparison with the ground truth as described in [19]. It contains both a frame by frame and an event based analysis. A similar evaluation is applied to the classification and fusion results.
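As a rough illustration of the frame by frame part of this evaluation, the sketch below scores a sequence given per-frame ground truth and segmentation labels. It is a heavily simplified stand-in for the metric of [19]: it only covers the categories that can be decided frame-locally (correct, underfill, overfill, insertion, deletion, correct NULL), it encodes events as hypothetical integer segment ids, and it ignores merges, fragmentations and substitutions, which require event level matching.

```python
import numpy as np

def frame_categories(gt, pred):
    """Heavily simplified frame by frame scoring in the spirit of [19].

    gt, pred -- one integer per frame: 0 for 'NULL', otherwise a segment id
                (all frames of one ground truth event / one detected segment
                share the same id).
    Returns frame counts for the categories that can be decided per frame;
    merges, fragmentations and substitutions need event level matching and
    are not handled here.
    """
    gt, pred = np.asarray(gt), np.asarray(pred)
    # Ground truth events that overlap at least one detected segment, and the
    # detected segments that overlap at least one ground truth event.
    detected_gt = {g for g, p in zip(gt, pred) if g != 0 and p != 0}
    matched_pred = {p for g, p in zip(gt, pred) if g != 0 and p != 0}

    counts = {"correct positive": 0, "underfill": 0, "deletion": 0,
              "overfill": 0, "insertion": 0, "correct NULL": 0}
    for g, p in zip(gt, pred):
        if g != 0 and p != 0:
            counts["correct positive"] += 1
        elif g != 0:   # event frame the segmentation missed
            counts["underfill" if g in detected_gt else "deletion"] += 1
        elif p != 0:   # segmentation output outside any ground truth event
            counts["overfill" if p in matched_pred else "insertion"] += 1
        else:
            counts["correct NULL"] += 1
    return counts
```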
4 Results

In this section we discuss the results of applying the approaches described in Section 2 to the data set obtained from the experiment as specified in Section 3. To evaluate the influence of different users on the segmentation and classification results, we carried out three testing modalities:

a. intra: The algorithms are trained on data originating from a single user and the segmentation and classification are applied to sequences of the same user.

b. inter: Training is performed using data from a mixed set of subjects. Testing is done for all involved subjects.

c. external: The algorithms are trained on data from a set of subjects which excludes the subjects used for testing. This can be referred to as user-independent.

Segmentation Results

                   f=1     f=2     f=3     f=4
correct positive   74.26   84.71   87.54   88.63
overfill           27.79   40.24   54.58   67.47
underfill          11.34    5.67    2.67    1.42
insertion          17.80   33.75   52.57   69.51
substitution        8.77    8.24    8.87    9.36
merge               1.05    4.98    8.40   13.12
fragmentation       1.05    0.56    0.49    0.43
deletion            4.58    0.82    0.46    0.20
correct NULL       55.51   45.70   34.60   24.12

Table 2. Segmentation results for different values of f; frame based results, i.e. results are given in percentage of overall event frames (overall 'NULL' event frames for correct NULL).

Table 2 summarizes the frame by frame segmentation results in the intra modality for different values of the f parameter. Although for f = 1 only 74.26% of the ground truth event frames are correctly recognized, a mere 4.58% are deleted. The missing percentage is divided between underfill (11.34%), fragmentation (1.05%) and substitution (8.77%). Considering the fact that the segmentation is just an initial stage of the classification, this is a reasonable result. Except for the deletions, all the other errors can potentially be corrected by the motion classification and classifier fusion. Higher values of f (a less restrictive threshold θ) can reduce the number of deletions even further, however at the cost of a corresponding decrease in the number of correct 'NULL' class frames and an increase in the number of insertions and overfills. The inter and external modalities produced very similar results.
Separate Classification
The results of the event based classification for both the location and the motion classifier are presented in the first two rows of each modality block in Table 3. In each case the corresponding classification methods have been applied to the segments retrieved by the segmentation stage. The parameters of the preceding segmentation have been adjusted to retrieve as many actual events as possible in order not to generate deletions. Looking at Table 3, the following main observations can be made:

1. There is not much difference between the inter, intra and external testing modalities. In fact the intra results are the worst ones, due to the smallest training set. Position based external (tested on a person not trained for) results are just 2% worse than inter results. This is a very positive and relevant result.

2. Position classification is consistently much better than motion classification (mostly by around 15%, at most 25%).

3. The number of insertions is intolerably high. On average, out of 4 events reported by the system, between 3 and 3.5 are insertions.

4. The number of correctly recognized events is low in the full data set (55% to 73%). It is much better (65% to 91%) in the reduced set. In both sets the errors are mostly due not to deletions but to substitutions.

5. The performance gets dramatically better if we look at the first two picks of the classifier. There we are in the nineties even for the full set. Looking at the two top picks of the motion and position classification combined, we get over 95% for all testing modalities (nearly 99% for inter), even for the full set.

Fusion
The remaining rows of Table 3 present the results for the four classification fusion schemes as described in 2.3. For the plausibility analysis based schemes, the values for correctly classified events based on the first two ranks are given.

1. As with the separate classification, the results are fairly independent of the testing modality, with the intra user results often even the worst due to the smallest training set.

2. The CM fusion method performs worse than pure location classification on all counts.

3. Plausibility analysis (PA) drastically reduces the number of insertions (nearly by half). In combination with the Average fusion method (PA-Avg) we are now down to about 1 in 3 events being an insertion (a bit more for the full, a bit less for the reduced set). PA, in particular in combination with Average fusion, reduces substitutions by around 10%. The price for the reduction in insertions is between 5% and 6% deletions.

4. Looking at the first two choices of the classifiers, both the pure PA and PA-Avg produce results between about 87.6% (for intra, full set) and 97.3% (inter, reduced set). This is still not perfect; however, as will be argued later, together with the reduced insertion rate it is sufficient for many applications. It is also an excellent starting point for further optimizations.
intra            insertions       fragment.    deletions      substitutions   correct         correct in 2 ranked
motion           176.94 (162.82)  1.99 (1.99)   0.00 (0.00)   39.76 (26.24)   58.25 (70.58)   76.14 (82.21)
position         115.41 (107.55)  1.39 (1.89)   0.00 (0.00)   33.00 (13.22)   65.71 (84.10)   91.55 (93.44)
mot&pos binary     9.64 (6.76)    0.89 (1.09)  51.09 (51.09)   8.85 (2.09)    -               96.62 (97.61)
CM               154.57 (146.62)  1.59 (1.99)   0.00 (0.00)   39.56 (22.96)   39.36 (45.63)   58.75 (73.66)
PA                64.71 (50.99)   1.69 (1.69)   6.96 (6.96)   25.35 (9.15)    66.40 (81.41)   90.76 (92.35)
PA-Avg            58.65 (44.83)   1.79 (1.89)   6.96 (6.96)   24.16 (8.55)    67.50 (82.11)   87.57 (93.24)

inter            insertions       fragment.    deletions      substitutions   correct         correct in 2 ranked
motion           190.36 (175.75)  2.29 (2.19)   0.00 (0.00)   37.77 (25.25)   60.54 (71.97)   76.64 (84.59)
position         113.72 (102.58)  1.79 (1.99)   0.00 (0.00)   25.25 (5.47)    73.46 (91.85)   97.81 (98.11)
mot&pos binary    22.86 (18.59)   1.39 (1.49)  44.14 (44.14)   9.64 (5.47)    -               98.81 (98.91)
CM               149.01 (138.27)  1.59 (2.09)   0.00 (0.00)   34.29 (17.59)   45.33 (52.19)   64.81 (79.22)
PA                78.33 (60.93)   1.79 (1.99)   5.27 (5.17)   20.87 (7.06)    72.37 (85.19)   93.74 (96.22)
PA-Avg            69.68 (51.39)   1.69 (1.99)   5.17 (5.17)   17.89 (4.87)    75.35 (87.38)   91.35 (97.32)

external         insertions       fragment.    deletions      substitutions   correct         correct in 2 ranked
motion           193.64 (178.43)  2.19 (2.58)   0.00 (0.00)   42.45 (31.51)   55.57 (65.71)   72.56 (80.22)
position         118.99 (108.85)  1.59 (2.29)   0.00 (0.00)   27.14 (6.36)    71.77 (90.76)   96.82 (97.22)
mot&pos binary    21.67 (17.00)   1.09 (1.49)  46.72 (46.72)  10.54 (1.89)    -               97.91 (98.41)
CM               156.76 (145.6)   1.59 (2.39)   0.00 (0.00)   36.18 (18.89)   42.05 (49.11)   62.33 (77.73)
PA                73.86 (56.86)   1.69 (2.19)   4.97 (4.97)   24.06 (10.24)   69.38 (82.01)   92.74 (94.63)
PA-Avg            65.61 (48.41)   1.79 (2.19)   4.97 (4.97)   21.27 (8.05)    71.87 (84.10)   89.36 (96.02)

Table 3. Event based results given in % of ground truth events. The numbers in brackets correspond to the evaluation of the reduced gesture set. The last column gives the percentage of cases where the correct result was one of the first two ranked classes.

5 Conclusion and Future Work

Results Significance
By the standards of established domains such as speech recognition or modes of locomotion analysis, one might be tempted to dismiss the results, in particular for the full set of 21 activities, as overly inaccurate. However, for the following reasons, we argue that the results presented in this paper are indeed quite significant.

1. This type of real-life activity spotting with wearable sensors is known to be a hard and so far unsolved problem. Thus even the comparatively low accuracy is significant progress.

2. The above experiment has been set up to be realistic, containing hard to distinguish gesture pairs and a complex 'NULL' class.

3. The deletion rates are very low and the correct answer is mostly in the two top picks. This is a very good basis for further work, in particular the addition of further sensors and high level modeling.

4. The above has been demonstrated for the user independent case (testing on users on which there has been no training). This is an essential condition for such systems to find wide scale acceptance.

5. With the correct class being contained in the top two picks in well over 90% of the cases, the system could already be used in some applications. As an example, consider the automatic selection of appropriate manual pages shown on an HMD (head mounted display). Having the user select from two choices is not a problem. Since the user does not care about the displayed page when not doing anything significant, the insertions are also not a grave issue.
In summary, while we have not presented a final solution to continuous task tracking, this work can be considered a significant step on the way towards such a solution.

Future Work
The next steps towards a better recognition performance are fairly obvious from the discussion above. For one, we will integrate our previous work on using sound information in activity recognition. This will provide us with a third sensor modality that should significantly help with those activities that are associated with a characteristic sound. Judging by the success of our previous work with a combination of sound and motion [20] in similar domains, this should significantly improve the accuracy. We will also investigate the integration of our previous results on purely motion based segmentation to reduce the number of insertions in the segmentation stage. Further improvements to be investigated include high level task modeling, using RFIDs for tool identification, and the pruning of overlapping segments. Finally, as part of the WearIT@Work project we are currently in the process of setting up an experiment in a real-life aircraft maintenance setting.

References

[1] S. Antifakos, F. Michahelles, and B. Schiele. Proactive instructions for furniture assembly. In 4th Intl. Symp. on Ubiquitous Computing, UbiComp 2002, page 351, Göteborg, Sweden, 2002.
[2] L. Bao and S. S. Intille. Activity recognition from user-annotated acceleration data. In Proceedings of the 2nd International Conference on Pervasive Computing, pages 1–17, April 2004.
[3] J. Deng and H. Tsui. An HMM-based approach for gesture segmentation and recognition. In 15th International Conference on Pattern Recognition, volume 2, pages 679–682, September 2000.
[4] M. Hazas, C. Kray, H. Gellersen, H. Agbota, G. Kortuem, and A. Krohn. A relative positioning system for co-located mobile devices. In Proceedings of MobiSys 2005: Third International Conference on Mobile Systems, Applications, and Services, pages 177–190, Seattle, USA, June 2005.
[5] S. Helal, B. Winkler, C. Lee, Y. Kaddourah, L. Ran, C. Giraldo, and W. Mann. Enabling location-aware pervasive computing applications for the elderly. In Proceedings of the First IEEE Pervasive Computing Conference, Fort Worth, Texas, June 2003.
[6] H. Junker, P. Lukowicz, and G. Tröster. Continuous recognition of arm activities with body-worn inertial sensors. In Proceedings of the International Symposium on Wearable Computers, Oct. 2004.
[7] C. Lee and X. Yangsheng. Online, interactive learning of gestures for human/robot interfaces. In IEEE International Conference on Robotics and Automation, volume 4, pages 2982–2987, April 1996.
[8] P. Lukowicz, J. Ward, H. Junker, G. Tröster, A. Atrash, and T. Starner. Recognizing workshop activity using body worn microphones and accelerometers. In Pervasive Computing, 2004.
[9] D. W. Marquardt. An Algorithm for Least-Squares Estimation of Nonlinear Parameters. Journal of the Society for Industrial and Applied Mathematics, 11(2):431–441, June 1963.
[10] H. Muller, M. McCarthy, and C. Randell. Particle filters for position sensing with asynchronous ultrasonic beacons. In Proceedings of LoCA 2006, LNCS 3987, pages 1–13. Springer Verlag, May 2006.
[11] G. Ogris, T. Stiefmeier, H. Junker, P. Lukowicz, and G. Tröster. Using ultrasonic hand tracking to augment motion analysis based recognition of manipulative gestures. In Proceedings of the IEEE International Symposium on Wearable Computing, pages 152–159, Oct. 2005.
[12] D. J. Patterson, D. Fox, H. Kautz, and M. Philipose. Fine-Grained Activity Recognition by Aggregating Abstract Object Usage. In Proceedings of ISWC 2005: IEEE 9th International Symposium on Wearable Computers, October 2005.
[13] C. Randell and H. Muller. Context awareness by analysing accelerometer data. In Proc. 4th International Symposium on Wearable Computers, pages 175–176, 2000.
[14] L. Seon-Woo and K. Mase. Activity and location recognition using wearable sensors. IEEE Pervasive Computing, 1(3):24–32, July 2002.
[15] A. Smith, H. Balakrishnan, M. Goraczko, and N. Priyantha. Tracking moving devices with the cricket location system. In Proc. 2nd USENIX/ACM MOBISYS Conference, Boston, MA, June 2004.
[16] M. Stäger, P. Lukowicz, N. Perera, T. von Büren, G. Tröster, and T. Starner. SoundButton: Design of a Low Power Wearable Audio Classification System. In ISWC 2003: Proc. of the 7th IEEE Int'l Symposium on Wearable Computers, pages 12–17, Oct. 2003.
[17] T. Starner, B. Schiele, and A. Pentland. Visual contextual awareness in wearable computing. In IEEE Intl. Symp. on Wearable Computers, pages 50–57, Pittsburgh, PA, 1998.
[18] C. Vogler and D. Metaxas. ASL recognition based on a coupling between HMMs and 3D motion analysis. In ICCV, Bombay, 1998.
[19] J. Ward, P. Lukowicz, and G. Tröster. Evaluating performance in continuous context recognition using event-driven error characterisation. In Proceedings of LoCA 2006, LNCS 3987. Springer Verlag, May 2006.
[20] J. A. Ward, P. Lukowicz, G. Tröster, and T. Starner. Activity recognition of assembly tasks using body-worn microphones and accelerometers. IEEE Transactions on Pattern Analysis and Machine Intelligence, accepted for publication 2006.
[21] J. Yamato, J. Ohya, and K. Ishii. Recognizing human action in time-sequential images using hidden Markov models. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 379–385, 1992.