Combining Motion Sensors and Ultrasonic Hands Tracking for Continuous
Activity Recognition in a Maintenance Scenario
Thomas Stiefmeier (1), Georg Ogris (2), Holger Junker (1), Paul Lukowicz (1,2), Gerhard Tröster (1)
(1) Wearable Computing Lab, ETH Zürich, Switzerland, {stiefmeier,junker,troester}@ife.ee.ethz.ch
(2) Institute for Computer Systems and Networks, UMIT Innsbruck, Austria, {georg.ogris,paul.lukowicz}@umit.at
Abstract
We present a novel method for continuous activity recognition based on ultrasonic hand tracking and motion sensors attached to the user’s arms. It builds on previous work in which
we have shown such a sensor combination to be effective for
isolated recognition in manually segmented data.
We describe the hand tracking based segmentation, show
how classification is done on both the ultrasonic and the motion data and discuss different classifier fusion methods.
The performance of our method is investigated in a large
scale experiment in which typical bicycle repair actions are
performed by 6 different subjects. The experiment contains a
test set with 1008 activities from 21 classes encompassing 115
minutes randomly mixed with 252 minutes of ’NULL’ class.
To come as close as possible to a real-life continuous scenario,
we have ensured a diverse and complex ’NULL’ class, diverse
and often similar activities, inter-person training/testing, and
an additional data set only for training (299 extra minutes of
data). A key result of the paper is that our method can handle
the user independent testing (testing on users that were not
seen in training) nearly as well as the user dependent case.
1 Introduction
Activity recognition is well established as a key functionality of wearable systems. It enables a variety of applications
such as ’just in time’ proactive information delivery, contextual annotation of video recordings, simplification of the user
interface, and assisted living systems (e.g. [5]).
In a large industrial project, WearIT@Work (sponsored by the European Union under contract EC IP 004216), our groups
for example deal with activity recognition based support of
assembly and maintenance tasks. The aim is to track the
progress of an assembly or maintenance task using simple
body-worn sensors and to provide targeted assistance in the form
of manuals, warnings and advice. The specific domains addressed are aircraft maintenance and car assembly; however,
similar applications exist in many other areas.
The work described in this paper is part of an ongoing effort to develop reliable methods to track and recognize relevant parts of such assembly or maintenance tasks. It focuses
on continuous recognition from an unsegmented data stream
using a combination of arm-mounted motion sensors and ultrasonic hand tracking. This sensor combination is motivated by the fact that maintenance and assembly tasks are
largely determined by the interaction between the user’s hands
and specific parts of the machinery that is being maintained or
assembled. We therefore combine motion sensors that monitor the gestures performed by the user’s hands with a method for tracking
the hands with respect to the machinery.
In a previous publication, we have described the results of
an initial experiment showing that such a sensor combination
is indeed effective in classifying maintenance activities [11].
The experiment was performed on isolated gestures, where
hand-partitioned segments containing the relevant activities
were presented to the classifier. In this paper we extend the
previous work to include automatic spotting of relevant segments in a continuous data stream. We validate our methods
on a new, much larger, multi-person data set consisting of a total of 115 minutes of sequences from 6 subjects, each with
168 relevant activities and about 252 minutes of non relevant
(’NULL’ class) activity. The sequences are all used for validation and a separate set of 120 instances of each relevant
activity (20 from each subject) is used for training.
1.1 Related Work and Design Choices
The three main approaches to activity recognition are video
analysis (e.g. [17, 18, 21]), augmentation of the environment
(e.g. [1, 12]), and the use of wearable sensors (e.g. [2, 13,
14]). The above are neither mutually exclusive nor can one be
said to be superior in general. Instead, the choice of a method or
method combination depends on the specific application. Due to
computing power limitations, varying light conditions and issues
with occlusion and clutter, we have decided against the use of
video analysis.
Interaction with Objects The use of environment augmentation is also problematic. Instrumenting each single part of an
aircraft with RFIDs (as done in [12] for interaction with household objects) or switches (as demonstrated in [1] for furniture
assembly) or other sensors is not applicable in every scenario.
On the other hand purely wearable sensors have only limited
capability of detecting which part of an object the user is interacting with. As shown in our previous work [8, 16] analysis
of the sound made by interaction with the object using a wearable microphone is an exception and provides a considerable
amount of information. However, it has a number of problems. In particular, it only provides information about those
tasks that actually cause a characteristic sound and does not
work in noisy environments.
As a consequence, tracking the hands’ position with respect to
the object of the maintenance/assembly task is a promising
approach. Assuming that plans of the object exist in an electronic format, all that needs to be done is to tie the frame of
reference of the tracking system to the object in question.
Hands Tracking Different approaches can be taken to
tracking body parts. In biomechanics applications such as
high performance sports or rehabilitation, magnetic systems
(e.g. from Ascension Technology, http://ascension-tech.com) are widely used. Such systems use a
stationary source of a predefined magnetic field to track body
mounted magnetic sensors. The main problem with magnetic
tracking is that it is easily disturbed by metal objects, which
are common in our application domain. Another alternative is
the use of optical (often IR) markers together with appropriate
cameras. Here problems like background lighting (especially
for IR systems) and occlusion need to be dealt with. The main
disadvantage of both magnetic and optical tracking systems
is that they are optimized for ultra high spatial resolution and
thus expensive and bulky.
Based on the above considerations we have opted for an
ultrasonic tracking system. Such systems are widely used for
indoor location [10, 15] and relative positioning [4]. They are
relatively cheap and require little infrastructure. In general, placing three or four beacons at predefined locations in
the environment is sufficient. Due to the physical properties of
ultrasound (see also 2.1), it has a number of problems when
used for hands tracking. In particular, it is subject to reflections and occlusions and has limited (1 to 5 Hz) sampling rates.
However, in previous work we have been able to show that
despite those problems it is a useful source of information for
the classification of maintenance activities [11].
Continuous Recognition Building on these results, we now
demonstrate recognition from a continuous, unsegmented
data stream. Independently of the sensor modalities used, this
is known to be a hard problem. It is particularly difficult in
the so-called spotting scenario, where the relevant activities
are mixed with a large number of arbitrary other actions. In
our case this means that in between activities related to the
maintenance task the worker might do things like scratching
his head, taking a phone call, drinking or searching for tools.
Much work on the spotting problem comes from the gesture
recognition area. In [3] different variants of HMMs were used.
In [7] a two level approach is proposed. Our groups have also
investigated different approaches such as novel segmentation
methods for motion sensors [6] and sound based segmentation
methods [8]. One of the most successful continuous recognition results is [12] where RFIDs were used to track which
household objects the user interacted with.
Paper Contributions Despite progress made by the above
work, spotting of activities in a continuous data stream is still
an open problem. We present a novel approach, based on a new sensor combination, that represents
a significant step towards a solution. We provide a detailed description
novel sensor combination. We provide a detailed description
of our method including an in-depth evaluation of different
classifier fusion methods. We evaluate the performance of our
method in a realistic setting with 21 diverse, often very similar classes of activities and a rich, randomly inserted ’NULL’
class. The experiment involves a total of nearly 10 hours of
data with around 3500 instances of relevant activities (test and
training set) performed by 6 subjects. One of the most significant results is the fact that our method can handle the user independent case nearly as well as the user dependent case.
2 Approach
As described in the introduction, the basic idea behind our
approach to continuous recognition is to correlate arm gestures with the location of the hands with respect to the object being maintained/assembled. The assumption is that the probability that a
gesture resembling a certain maintenance activity is accidentally performed at the location corresponding to this activity is
very low.
Figure 1 gives an overview of our implementation of this
idea. We use the ultrasonic position information to select data
segments containing potentially interesting activities. In each
segment we then separately perform one classification based
on the position information and one based on the motion signals. The resulting classifications are then combined using an
appropriate classifier fusion method.
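The following sketch illustrates this data flow; all names (Segment, segment_by_location, classify_position, classify_motion, fuse) are hypothetical placeholders for the components described in the remainder of this section rather than a reference implementation.

```python
# Illustrative pipeline sketch of the recognition architecture in Figure 1:
# location based segmentation, followed by parallel location based and
# motion based classification, combined by a classifier fusion step.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Segment:
    start: int          # first sample index of the candidate segment
    end: int            # last sample index (inclusive)
    location: int       # index of the position of interest that triggered it

def recognize(ultrasonic, motion,
              segment_by_location,      # ultrasonic -> List[Segment]
              classify_position,        # (ultrasonic, seg) -> ranked class list
              classify_motion,          # (motion, seg)     -> ranked class list
              fuse):                    # (rank_pos, rank_mot) -> final label
    """Spot and classify gestures in one continuous recording."""
    events: List[Tuple[Segment, str]] = []
    for seg in segment_by_location(ultrasonic):
        rank_pos = classify_position(ultrasonic, seg)
        rank_mot = classify_motion(motion, seg)
        label = fuse(rank_pos, rank_mot)      # may return 'NULL'
        if label != 'NULL':
            events.append((seg, label))
    return events
```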
2.1 Ultrasonic Analysis
Positioning Ultrasonic positioning systems rely on time of
flight measurements between a mobile device and at least three
reference devices fixed at known positions in the environment.
We are using the same hardware platform as in [11] (Hexamite, http://www.hexamite.com). The main difference in this work concerning the position acquisition is that we use four fixed devices instead of three, in order to be able
to adopt a least squares optimization (LSQ), more precisely
the Levenberg-Marquardt algorithm [9]. Since an LSQ is not
able to deal with asynchronous distance readings very well, we
used ultrasonic transmitters instead of listeners at the user’s
hands. During the experiments, the transmitters are therefore
body-worn and the listeners are fixed devices.

Figure 1. Recognition Architecture: location based segmentation of the ultrasonic data (Mahalanobis distance on the hand distance readings with trained thresholds), followed by location based classification (hand coordinates) and motion based classification in parallel, combined by classification fusion.
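As an illustration of this least squares position estimate, the sketch below recovers a transmitter position from four distance readings with SciPy's Levenberg-Marquardt solver; the listener coordinates and measured distances are made-up example values, not data from our setup.

```python
# Sketch: estimate a hand-worn transmitter position from distances to four
# fixed listeners using Levenberg-Marquardt least squares (illustrative
# beacon coordinates and distances; not values from the experiment).
import numpy as np
from scipy.optimize import least_squares

listeners = np.array([[0.0, 0.0, 2.0],     # known listener positions [m]
                      [3.0, 0.0, 2.0],
                      [0.0, 3.0, 2.0],
                      [3.0, 3.0, 0.5]])
measured = np.array([2.4, 2.9, 2.7, 3.1])  # measured distances [m]

def residuals(p):
    # difference between predicted and measured distances for position p
    return np.linalg.norm(listeners - p, axis=1) - measured

# method='lm' selects Levenberg-Marquardt; start from the room centre
sol = least_squares(residuals, x0=np.array([1.5, 1.5, 1.0]), method='lm')
print("estimated hand position:", sol.x)
```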
Segmentation A position based segmentation seems promising because the user location, and even more so the location
of the hands, is a strong indicator for the start or end of specific
gestures.
The positions of interest have to be defined in advance.
This is done in a semiautomatic way during training. The
gestures are manually grouped into a set of 11 locations, as defined by the location class column of Table 1. For both hands, means and variances
are modeled for these locations according to the training data.
During the gesture spotting task, the Mahalanobis distance
is used to estimate the probability of each sample being part
of a specific location. This distance measure has been chosen
because it takes the variances of the location with respect to
the particular dimensions into account. For each location i, a
separate threshold is trained as θi = µi + f · σi , where µi is
the mean value of all Mahalanobis distances calculated during the training of location i and σi is the corresponding standard
deviation. The constant factor f is optimized during the spotting task itself by applying the evaluation metric defined by
Ward et al. in [19].
The Mahalanobis distance is then calculated for each sample and each position of interest. In case di < θi , the sample
is assumed to be close to location i. Groups of samples with
di < θi which are longer than a certain threshold θlength are
assumed to be connected segments containing a possible gesture candidate. θlength is defined and trained in an analogous
manner to θ. In the end this results in a parallel segmentation
for each of the 11 positions of interest.
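A minimal sketch of this segmentation rule is given below, assuming diagonal covariances per location; the variable names and the way the per-location models are passed in are illustrative only.

```python
# Sketch of the location based segmentation: for one position of interest,
# threshold the per-sample Mahalanobis distance and keep runs that are long
# enough. Diagonal covariances and all parameter names are assumptions.
import numpy as np

def mahalanobis(samples, mean, var):
    # per-sample distance, using a diagonal covariance (variance per dimension)
    return np.sqrt((((samples - mean) ** 2) / var).sum(axis=1))

def segment_location(samples, mean, var, mu_i, sigma_i, f, min_len):
    """Return (start, end) index pairs of candidate segments for one location."""
    theta = mu_i + f * sigma_i            # trained threshold for this location
    close = mahalanobis(samples, mean, var) < theta
    segments, start = [], None
    for t, flag in enumerate(close):
        if flag and start is None:
            start = t                     # a run of 'close' samples begins
        elif not flag and start is not None:
            if t - start >= min_len:      # keep only sufficiently long runs
                segments.append((start, t - 1))
            start = None
    if start is not None and len(close) - start >= min_len:
        segments.append((start, len(close) - 1))
    return segments
```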
Classification As a next step, the position data is classified.
Possible approaches to position based gesture classification
are shown in our previous paper [11]. For now we decided to
use a Mahalanobis classification similar to the position based segmentation itself, except that all gestures are trained separately.
For each presegmented subsequence, each sample is classified
using the Mahalanobis distance. A majority vote over all samples then assigns a final gesture class to the subsequence. The
feature vector consists of the x, y and z coordinates of both
wrist-worn ultrasonic devices, as depicted by the gray boxes on
the right in Figure 2.
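The frame-wise classification and majority vote can be sketched as follows; the gesture models are assumed to be given as per-gesture means and variances over the six hand coordinates, which is an illustrative simplification.

```python
# Sketch of the position based classification: each sample of a presegmented
# subsequence is assigned to the gesture with the smallest Mahalanobis
# distance, and a majority vote over the samples yields the ranking.
import numpy as np
from collections import Counter

def classify_position(subseq, gesture_models):
    """subseq: (T, 6) array of x, y, z coordinates of both wrists.
    gesture_models: dict gesture -> (mean, variance) over the 6 coordinates."""
    votes = []
    for sample in subseq:
        dists = {g: np.sqrt((((sample - m) ** 2) / v).sum())
                 for g, (m, v) in gesture_models.items()}
        votes.append(min(dists, key=dists.get))   # nearest gesture per sample
    ranking = [g for g, _ in Counter(votes).most_common()]
    return ranking                                # most voted class first
```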
2.2 Motion Analysis

For the motion classification, Hidden Markov Models
(HMMs) have been chosen, which had proven to be a good
approach for time sequential motion modeling in previous experiments and studies.
Each manipulative gesture in our experiment corresponds
to an individually trained HMM. A thorough evaluation of models with 5 to 12 states resulted in using between 7 and 9 states per model, where the number of states reflects the complexity of the respective manipulative gesture. We exclusively used so-called
left-right models.
As features for the HMMs we used, on the one hand, raw inertial sensor data
and, on the other hand, orientation
information derived from the set of inertial sensors in the form of Euler
angles to complement the raw sensor data. The deployed feature set comprises the following subset of available sensor signals and derived quantities: two acceleration
and one gyroscope signal from the right hand, pitch angles
from right lower and upper arm, two acceleration signals from
the left hand and the pitch angle of the left upper arm. The
observations of the used HMMs correspond to the raw sensor
signals or derived angle features. Their continuous nature is
modeled by a single Gaussian distribution for each state in all
models.
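A possible realization of such per-gesture left-right HMMs is sketched below using the hmmlearn package; the toolkit, the 8-state default and the transition-matrix initialization are assumptions for illustration, not a description of our implementation. Zero entries in the initial transition matrix remain zero under Baum-Welch re-estimation, which is what enforces the left-right structure.

```python
# Sketch: one left-right HMM per gesture with a single Gaussian per state,
# built with the hmmlearn package (an assumed toolkit choice).
import numpy as np
from hmmlearn.hmm import GaussianHMM

def make_left_right_hmm(n_states=8):
    transmat = np.zeros((n_states, n_states))
    for i in range(n_states):
        transmat[i, i] = 0.5                           # stay in current state
        transmat[i, min(i + 1, n_states - 1)] += 0.5   # or advance one state
    model = GaussianHMM(n_components=n_states, covariance_type='diag',
                        init_params='mc', params='stmc', n_iter=20)
    model.startprob_ = np.eye(n_states)[0]             # always start in state 0
    model.transmat_ = transmat
    return model

def train_gesture_models(train_data):
    """train_data: dict gesture -> list of (T_i, n_features) feature arrays."""
    models = {}
    for gesture, sequences in train_data.items():
        X = np.vstack(sequences)
        lengths = [len(s) for s in sequences]
        models[gesture] = make_left_right_hmm().fit(X, lengths)
    return models

def classify_motion(subseq, models):
    # rank gestures by HMM log-likelihood of the observed feature sequence
    scores = {g: m.score(subseq) for g, m in models.items()}
    return sorted(scores, key=scores.get, reverse=True)
```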
2.3 Fused Classification
Plausibility Analysis (PA) The most obvious fusion method
is the use of wrist position information to constrain the search
space of the motion based classifier. Both the frame based and
the HMM classifier result in a ranking for either the whole set
of gestures (HMM) or for a subset (frame based) of gestures.
For the HMM classifier we limit this subset to the three most likely gesture classes. Beginning with the
most likely gesture according to the motion classifier, we analyze
the plausibility of this gesture class with respect to the position.
If the result is consistent with the position, it is assumed to be
the correct class; otherwise the next candidate is tested. If the
whole set of possible candidates has been tested and no class is plausible with respect to the position, the current subsequence is classified as a ’NULL’ event.
Whether gesture i is plausible according to the position
readings is decided by calculating the Mahalanobis distances
to the trained means and variances for gesture i for all frames
of the subsequence. If the median of these distances is below
a certain threshold level it is assumed to be a plausible gesture
result. The threshold is defined and trained in an analogous
manner to θ (see 2.1).
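Put together, the plausibility analysis can be sketched as follows; the containers for the per-gesture location models and thresholds are illustrative assumptions.

```python
# Sketch of the plausibility analysis (PA): walk down the motion classifier's
# ranking and accept the first gesture whose median Mahalanobis distance to
# its trained location model is below the trained threshold, otherwise 'NULL'.
import numpy as np

def plausible(subseq_positions, mean, var, threshold):
    # per-frame distance to the gesture's trained location model
    d = np.sqrt((((subseq_positions - mean) ** 2) / var).sum(axis=1))
    return np.median(d) < threshold

def plausibility_analysis(motion_ranking, subseq_positions, gesture_models,
                          thresholds, top_n=3):
    for gesture in motion_ranking[:top_n]:        # three most likely classes
        mean, var = gesture_models[gesture]
        if plausible(subseq_positions, mean, var, thresholds[gesture]):
            return gesture                        # position confirms the gesture
    return 'NULL'                                 # no candidate is plausible
```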
Classifier Fusion The most complex fusion method is a true
classifier fusion where a separate classification is performed
on the ultrasonic and the motion signals. In general, both classifications produce a ranking starting with the most likely and
ending with the least likely class. The final classification is
then based on a combination of those two rankings and the associated probabilities. In more advanced schemes, confusion
matrices from a training set are taken into account.
To compare the plausibility analysis with other fusion
methods, the position based Mahalanobis classifier was fused
with the HMM motion classifier. This was done using two approaches: (1) comparing the average ranking of the top choices
of both classifiers (we will refer to this as AVG), and (2) considering the confusion matrix (CM) that is produced when testing
the training sets with the classifier. From this confusion matrix we get an estimation for the probability that the classifier
recognizes class Gi although class Gj is true. Given class
GA as result of classifier CA and class GB as result of classifier CB , we consider the probabilities P (CA = GA |GB ) and
P (CB = GB |GA ) and consider this to be an estimation for
the reliability of the classifiers CA and CB . The most reliable
result is believed to be true.
Another possibility is to perform a plausibility analysis of the
motion classification and then fuse the position classification
with the remaining motion classification using either the AVG
or the CM fusion method.
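The two fusion rules can be sketched as below; how the two reliability estimates are turned into a final decision follows our reading of the description above, and the data layout (rank lists, confusion matrices indexed as predicted class by true class) is an assumption.

```python
# Sketch of the AVG and CM fusion rules. AVG picks, among the two top
# choices, the class with the better average rank over both rankings; CM
# trusts whichever classifier's top choice is estimated to be more reliable
# from its training confusion matrix (rows: predicted class, cols: true class).
import numpy as np

def fuse_avg(rank_pos, rank_mot):
    candidates = {rank_pos[0], rank_mot[0]}
    def avg_rank(g):
        # classes missing from a ranking get a rank one past its end
        rp = rank_pos.index(g) if g in rank_pos else len(rank_pos)
        rm = rank_mot.index(g) if g in rank_mot else len(rank_mot)
        return 0.5 * (rp + rm)
    return min(candidates, key=avg_rank)

def reliability(conf_matrix, predicted, other, classes):
    # estimate P(classifier outputs `predicted` | true class is `other`)
    i, j = classes.index(predicted), classes.index(other)
    column = conf_matrix[:, j].sum()
    return conf_matrix[i, j] / column if column else 0.0

def fuse_cm(rank_pos, rank_mot, cm_pos, cm_mot, classes):
    g_pos, g_mot = rank_pos[0], rank_mot[0]
    if g_pos == g_mot:
        return g_pos
    r_pos = reliability(cm_pos, g_pos, g_mot, classes)
    r_mot = reliability(cm_mot, g_mot, g_pos, classes)
    return g_pos if r_pos >= r_mot else g_mot     # trust the more reliable one
```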
3 Experimental Setup
The setup for this work is based on the bicycle repair experiment described in [11]. As mentioned in the introduction,
it has been extended to reflect as realistically as possible a real
life continuous task tracking scenario. This includes frequent,
random insertion of complex ’NULL’ events (68.7% of the total data length), the recording of a very large data set (367 min
with 1008 gestures), a separate ’non in sequence’ training set
(an additional 299 min with 2520 gestures), and consideration of
inter subject training and recognition (6 subjects).
The details of the experiment are summarized below.
3.1 Experimental Environment
No instrumentation or sensors of any kind have been attached to the bike. It has been mounted on a special repair
stand for ease of reaching the different parts.
In order to use the ultrasonic system, the room has been
equipped with four ultrasonic listeners. They have been placed
at exactly predefined places and serve as the reference for the
distance measurement using the ultrasonic sensors.
Two types of sensors have been placed on the user. First,
ultrasonic transmitters (Hexamite HX900SIO) are mounted on
both wrists to track the hands’ position with respect to the bicycle. Second, a set of 5 inertial sensor modules (MT9 from
Xsens containing accelerometers, gyroscopes and magnetic
field sensors) have been attached to the user’s hands, lower
and upper arms as depicted in Figure 2.
Figure 2. Sensor placement
3.2 The Task
We adapted and extended the gesture set used in [11]. The
result is a set of 21 manipulative gestures that are part of a regular bicycle repair task. As described below, they have been
chosen to provide as much information as possible about the
suitability of our approach to the recognition of different types
of activities. There are gestures that contain very characteristic
motions as well as ones that are highly unstructured. Similarly,
there are activities that take place at different, well defined locations as well as ones that are performed at (nearly) the same
locations or are associated with vague locations only. Table 1
gives a full overview of the used gestures. The key properties in terms of recognition challenges can be summarized as
follows.
pumping (gestures 1 and 2) Pumping begins with unscrewing the valve. Thus, it consists of more than just the characteristic periodic motion. Pumping the front and the back wheel
differs in terms of location; however, depending on where the
valve is during pumping, the location is rather vaguely defined.
People tend to use different valve positions for the front and
the back wheel, which means that statistically there is a difference in the acceleration signal as well.
screws (gestures 3 to 8) The sequence contains the screwing
and unscrewing of three screws at different, clearly separable
locations. Screws B and C require a screwdriver and screw
A requires a special wrench. Combined with different arm
positions required to handle each screw, this provides some
acceleration information to distinguish between the screws (in
addition to location information).
pedals (gestures 9 to 12) The set contains four pedal related gestures: just turning the pedals, turning the pedals and
oiling the chain, switching gears (with the other hand) while
turning the pedals and turning the pedals while marking unbalances of the back-wheel with chalk. The pedal turning is a
reasonably well defined gesture.
wheel spinning (gestures 13 and 14) The wheel spinning
gestures involve turning the front or the back wheel by hand. The
gestures contain a reasonably well defined motion (the actual
spinning). However there is also a considerable amount of
freedom in terms of overall gesture. Front and back can be
easily distinguished by location. In most cases different hand
positions were used for turning the front and the back wheel.
bell (gesture 15) Another challenging gesture is the testing
of the bell. The time for ringing the bell up to 5 times is so
short that only few location samples are available.
seat (gestures 16 and 17) These gestures alter the position
of the seat. The first increases the seating position by twisting
the seat within its mounting using both hands. In addition to
the twisting, the degrading gesture requires pounding with
a fist to drive the seat into its mounting.
(dis)assembly (gestures 18 to 21) Among the most difficult
to recognize gestures in the set are the assembly and disassembly of the pedals and the back light. All of them can be
performed in many different ways, while the hand seldom remains at the same location for a significant time.
ID  Description                             Location class  Reduced set
 1  Pumping at front wheel                  a               A
 2  Pumping at back wheel                   b               B
 3  Unscrew screw A                         c               C
 4  Tighten screw A                         c               C
 5  Unscrew screw B                         d               D
 6  Tighten screw B                         d               D
 7  Unscrew screw C                         e               E
 8  Tighten screw C                         e               E
 9  Turning pedals                          f               F
10  Turning pedals and oiling               f               G
11  Turning pedals and switching gears      f               H
12  Turning pedals and marking unbalances   f               I
13  Turn front wheel                        g               L
14  Turn back wheel                         h               M
15  Test bell                               i               N
16  Increase seating position               j               O
17  Degrade seating position                j               O
18  Disassemble pedal                       f               P
19  Assemble pedal                          f               P
20  Remove bulb                             k               Q
21  Insert bulb                             k               Q

Table 1. Set of manipulative gestures, with the location class and the reduced gesture set label of each gesture.
3.3 Reduced Gesture Set
The above gesture set contains many pairs that differ only
in a small detail. This includes fastening and unfastening
a given screw, assembling and disassembling the pedal/light
and lowering and raising the seat. In every pair, both gestures
are performed at the same location and the motions differ only
slightly. For example, both fastening and unfastening a screw involve
rotational motion in both directions. The difference is that
while turning in one direction the screwdriver is engaged with
the screw, while in the other it is not. We have labeled such
pairs as two distinct gestures, since the objective of this work
is to test the limits of recognition performance. However, it is
also interesting to see the overall system performance without
such extreme cases. To this end, we have defined a second, so-called reduced gesture set, in which such nearly identical pairs
are treated as a single activity (see the reduced gesture set column of Table 1).
3.4 The NULL Class
The difficulty of continuous recognition depends on the
complexity of the ’NULL’ events that separate the task related
gestures. Recognition would be fairly easy if the user could be
relied on to start and finish each gesture in a well defined position. It would also be easier if the relevant activities were performed immediately after each other, with little in between except
moving between the different locations. Unfortunately, neither
of the above is likely in a real life maintenance scenario. As a
consequence, we have put great emphasis on having complex,
random ’NULL’ events in our tests. To this end the following
events were randomly included in the stream of manipulative
gestures used to test our method.
a. Assembling and disassembling the front wheel. This is a
very complex activity that contains gestures very similar
to our gesture set (e.g. unscrewing the wheel mount).
b. Walking over to a notebook placed about 3 meters away
from the bike to type a few characters.
c. Cleaning a user selected part of the bike. Here the
’NULL’ class could be potentially close to some relevant
gestures both in terms of location and motion.
d. Holding on to a user selected part of the frame for a user
selected period of time (a few seconds). Here we often
had relevant location but no motions.
In addition to the above random gestures the user had to pick
up and put away tools. No instructions were given to the users
about what to do or not to do in between gestures. Overall, the ’NULL’
class amounted to 68.7% of the recording time, while no
’NULL’ class event interrupted a manipulative gesture. The
random number generator was set to generate approx. as many
’NULL’ class events as there were real events in the sequence.
3.5 Data Recording
We recorded two types of data sets. The first type comprises 20 repetitions of each of the 21 available manipulative
gestures. Here, the repetitions are separated by a few seconds,
in which the test subject returned into a defined ’home’ position. This data type is called train data. The second type
of recordings involves all 21 manipulative gestures in a randomly generated order. We refer to this data type as sequence
data. This ensures that even gestures with little complexity are
carried out with a certain variability, bringing the data closer to real-life
conditions. The gestures in the sequence data are separated by
randomly inserted ’NULL’ class events as described above.
The experiment has been performed by one female and five
male test subjects.
For each of the six test subjects, we recorded 20 train repetitions of each gesture and eight sequences containing all of
the 21 gestures. This results in 2520 gesture instances
of type train (299 minutes) and 1008 gesture instances of type
sequence (115 minutes).
Training Set Rationale Splitting the data into a train and
a test set in the above way is motivated by practical considerations
related to the envisioned real-life deployment of such systems. On
any given piece of machinery, the set of possible individual
actions (manipulative gestures) that can be taken is likely to
be far smaller than the set of all possible maintenance sequences. In fact, since maintenance sequences are permutations of individual actions, in theory there can be exponentially
more sequences than individual actions. As a consequence, in
any practical system we will have to train gestures individually and not as part of a certain sequence. This also makes
labeling of the training sequence much easier since predefined
start and stop positions can be used.
The downside of this method is that if the gestures are
trained separately as described above, the onset and end
phases of the gestures will likely differ from those in a real
maintenance sequence. Also, as a person repeats
the same gesture a large number of times, the repetitions are
likely to get ’sloppy’. Thus the training will be less effective.
However, since the objective of the work was to get as close
as possible to a real-life scenario, we have used this training
strategy despite the above drawbacks.
Ground Truth The evaluation of the segmentation is based on a comparison with the ground truth as described in [19]. It contains
both a frame by frame and an event based analysis. A similar
evaluation is applied to the classification and fusion results.

Segmentation To evaluate and fine-tune the proposed segmentation method, the error categorization approach from
[19] was applied. It extends the standard insertion, deletion and
substitution error types with three additional categories:

Timing errors. This refers to instances where an event was
correctly recognized but the timing of the event is
not entirely right. The segmentation output might be
slightly shorter (underfill), slightly longer (overfill) or
shifted (underfill on one side and overfill on the other). In
terms of the quality of the segmentation as a basis for event
based recognition, timing errors are obviously much less
important than ’true’ insertions or deletions.

Merges. This class refers to instances when two ground truth
events belonging to the same class have been merged into
a single event by the segmentation algorithm. This type
of error tends to be ignored by the conventional error description since, strictly speaking, there is neither a deletion
nor an insertion.

Fragmentations. This is in a way the opposite of merges: a single ground truth event is split into two or more.

The above error categories are illustrated in Figure 3, which
shows a typical segmentation output from our experiment.

Figure 3. Segmentation example for a sequence (ground truth and segmentation output over the location classes a-k versus time in seconds; insertions, merges, fragmentations, deletions and timing errors such as overfill are marked).

4 Results

In this section we discuss the results of applying the approaches described in Section 2 to the data set obtained from
the experiment specified in Section 3. To evaluate the influence of different users on the segmentation and classification
results, we carried out three testing modalities:

a. intra: The algorithms are trained on data originating from
a single user, and segmentation and classification are
applied to sequences of the same user.

b. inter: Training is performed using data from a mixed set
of subjects. Testing is done for all involved subjects.

c. external: The algorithms are trained on data from a set of
subjects which excludes the subjects used for
testing. This can be referred to as user-independent.
                   f = 1    f = 2    f = 3    f = 4
correct positive   74.26    84.71    87.54    88.63
overfill           27.79    40.24    54.58    67.47
underfill          11.34     5.67     2.67     1.42
insertion          17.8     33.75    52.57    69.51
substitution        8.77     8.24     8.87     9.36
merge               1.05     4.98     8.4     13.12
fragmentation       1.05     0.56     0.49     0.43
deletion            4.58     0.82     0.46     0.2
correct NULL       55.51    45.70    34.60    24.12

Table 2. Segmentation results for different values of f; frame based results, i.e. results are given as a percentage of overall event frames (overall NULL event frames for correct NULL).
Segmentation Results Table 2 summarizes the frame by frame segmentation results in the intra modality for different values of the f parameter. For f = 1, 74.26% of the ground truth event
frames are correctly recognized and only 4.58% are deleted. The
missing percentage is divided between underfill (11.34%),
fragmentation (1.05%) and substitution (8.77%). Considering that the segmentation is just an initial stage of the
classification, this is a reasonable result. Except for the deletions, all the other errors can potentially be corrected by motion
classification and classifier fusion. Higher values of f (less restrictive threshold θ) can reduce the number of deletions even
further, however, at the cost of a corresponding decrease in the
number of correct ’NULL’ class frames and an increase in the
number of insertions and overfills.
Inter and external modalities produced very similar results.
Separate Classification The results of the event based classification for both location and motion classifiers are presented
in the first two rows of each modality in Table 3. In each
case the corresponding classification methods have been applied to the segments retrieved by the segmentation stage. The
parameters of the preceding segmentation have been adjusted
to retrieve as many actual events as possible in order not to
generate deletions. Looking at Table 3, the following main
observations can be made:
1. There is not much difference between the inter, intra and
external testing modalities. In fact the inter results are
the worst ones, due to the smallest training set. Position
based external results (tested on persons not seen in training)
are just 2% worse than intra results. This is a very positive and relevant result.
2. Position classification is consistently much better than
motion (mostly by around 15% max. 25%).
3. The number of insertions is intolerably high. On average
out of 4 events reported by the system, between 3 and 3.5
are insertions.
4. The number of correctly recognized events is low in the
full data set (55% to 73%). It is much better (65% to
91%), in the reduced set. In both sets the errors are
mostly due not to deletions but to substitutions.
5. The performance gets dramatically better if we look at
the first two picks of the classifier. There we are in the
nineties even for the full set. Looking at the two top picks
of motion and position classification combined, we get
over 95% for all testing modalities (nearly 99 % for inter)
even for the full set.
intra      insertions       fragment.     deletions      substitutions   correct         correct in top 2
motion     176.94 (162.82)  1.99 (1.99)   0.00 (0.00)    39.76 (26.24)   58.25 (70.58)   76.14 (82.21)
position   115.41 (107.55)  1.39 (1.89)   0.00 (0.00)    33.00 (13.22)   65.71 (84.10)   91.55 (93.44)
mot&pos    -                -             -              -               -               96.62 (97.61)
binary     9.64 (6.76)      0.89 (1.09)   51.09 (51.09)  8.85 (2.09)     39.36 (45.63)   -
CM         154.57 (146.62)  1.59 (1.99)   0.00 (0.00)    39.56 (22.96)   58.75 (73.66)   -
PA         64.71 (50.99)    1.69 (1.69)   6.96 (6.96)    25.35 (9.15)    66.40 (81.41)   90.76 (92.35)
PA-Avg     58.65 (44.83)    1.79 (1.89)   6.96 (6.96)    24.16 (8.55)    67.50 (82.11)   87.57 (93.24)

inter      insertions       fragment.     deletions      substitutions   correct         correct in top 2
motion     190.36 (175.75)  2.29 (2.19)   0.00 (0.00)    37.77 (25.25)   60.54 (71.97)   76.64 (84.59)
position   113.72 (102.58)  1.79 (1.99)   0.00 (0.00)    25.25 (5.47)    73.46 (91.85)   97.81 (98.11)
mot&pos    -                -             -              -               -               98.81 (98.91)
binary     22.86 (18.59)    1.39 (1.49)   44.14 (44.14)  9.64 (5.47)     45.33 (52.19)   -
CM         149.01 (138.27)  1.59 (2.09)   0.00 (0.00)    34.29 (17.59)   64.81 (79.22)   -
PA         78.33 (60.93)    1.79 (1.99)   5.27 (5.17)    20.87 (7.06)    72.37 (85.19)   93.74 (96.22)
PA-Avg     69.68 (51.39)    1.69 (1.99)   5.17 (5.17)    17.89 (4.87)    75.35 (87.38)   91.35 (97.32)

external   insertions       fragment.     deletions      substitutions   correct         correct in top 2
motion     193.64 (178.43)  2.19 (2.58)   0.00 (0.00)    42.45 (31.51)   55.57 (65.71)   72.56 (80.22)
position   118.99 (108.85)  1.59 (2.29)   0.00 (0.00)    27.14 (6.36)    71.77 (90.76)   96.82 (97.22)
mot&pos    -                -             -              -               -               97.91 (98.41)
binary     21.67 (17.00)    1.09 (1.49)   46.72 (46.72)  10.54 (1.89)    42.05 (49.11)   -
CM         156.76 (145.6)   1.59 (2.39)   0.00 (0.00)    36.18 (18.89)   62.33 (77.73)   -
PA         73.86 (56.86)    1.69 (2.19)   4.97 (4.97)    24.06 (10.24)   69.38 (82.01)   92.74 (94.63)
PA-Avg     65.61 (48.41)    1.79 (2.19)   4.97 (4.97)    21.27 (8.05)    71.87 (84.10)   89.36 (96.02)

Table 3. Event based results given in % of ground truth events. The numbers in brackets correspond to the evaluation of the reduced gesture set. The last column gives the percentage of cases where the correct result was one of the first two ranked classes.

Fusion The remaining rows of Table 3 present the results for
the four classification fusion schemes as described in 2.3. For
the plausibility analysis based schemes, the values for correctly
classified events based on the first two ranks are given.

1. As with separate classification, the results are fairly independent of the testing modality, with intra user results
often even the worst due to the smallest training set.

2. The CM fusion method performs worse than pure location classification on all counts.

3. Plausibility analysis (PA) drastically reduces the number
of insertions (nearly by half). In combination with the
Average fusion method (PA-Avg) we are now down to
about 1 in 3 events being an insertion (a bit more for the
full, a bit less for the reduced set). PA, in particular in
combination with Average fusion, reduces substitutions
by around 10%. The price for the reduction in insertions
is between 5% and 6% deletions.
4. Looking at the first two choices of the classifiers, both
the pure PA and PA-Avg produce results between about
87.6% (for intra, full set) and 97.3% (inter, reduced set).
This is still not perfect; however, as will be argued below,
together with the reduced insertion rate it is sufficient for
many applications. It is also an excellent starting point
for further optimizations.
5 Conclusion and Future Work
Results Significance By the standards of established domains such as speech recognition or modes of locomotion analysis, one might be tempted to dismiss the results, in particular
for the full set of 21 activities, as overly inaccurate. However,
for the following reasons, we argue that the results presented
in this paper are indeed quite significant.
1. This type of real-life activity spotting with wearable sensors is known to be a hard and so far unsolved problem.
Thus even the comparatively low accuracy constitutes significant
progress.
2. The above experiment has been set up to be realistic,
contain hard to distinguish gesture pairs and a complex
’NULL’ class.
3. The deletion rates are very low and the correct answer is
mostly in the two top picks. This is a very good basis for
further work in particular the addition of further sensors
and high level modeling.
4. The above has been demonstrated for the user independent case (testing users on which there has been no train-
ing). This is an essential condition for such systems to
find wide scale acceptance.
5. With the correct class being contained in the top two
picks in well over 90% of the cases, the system could
already be used in some applications. As an example, consider the automatic selection of appropriate manual pages
on an HMD (head mounted display). Having the user
select from two choices is not a problem. Since the user
does not care about the displayed page when not doing
anything significant, the insertions are also not a grave
issue.
In summary, while we have not presented a final solution to
continuous task tracking, this work can be considered a significant step on the way towards such a solution.
Future Work The next steps towards a better recognition
performance are fairly obvious from the discussion above. For
one we will integrate our previous work on using sound information in activity recognition. This will provide us with
a third sensor modality that should significantly help with
those activities that are associated with a characteristic sound.
Judging by the success of our previous work with a combination of sound and motion [20] in similar domains, this should
significantly improve the accuracy. We will also investigate
the integration of our previous results in purely motion based
segmentation to reduce the number of insertions in the segmentation stage. Further improvements to be investigated include high level task modeling, using RFIDs for tools identification and pruning of overlapping segments. Finally, as part
of the WearIT@Work project we are currently in the process
of setting up an experiment in a real-life aircraft maintenance.
References
[1] S. Antifakos, F. Michahelles, and B. Schiele. Proactive instructions for furniture assembly. In 4th Intl. Symp. on Ubiquitous Computing. UbiComp 2002., page 351, Göteborg, Sweden, 2002.
[2] L. Bao and S. S. Intille. Activity recognition from user-annotated acceleration data. In Proceedings of the 2nd International Conference on Pervasive Computing, pages 1–17, April
2004.
[3] J. Deng and H. Tsui. An HMM-based approach for gesture segmentation and recognition. In 15th International Conference
on Pattern Recognition, volume 2, pages 679 – 682, September
2000.
[4] M. Hazas, C. Kray, H. Gellersen, H. Agbota, G. Kortuem, and
A. Krohn. A relative positioning system for co-located mobile
devices. In Proceedings of MobiSys 2005: Third International Conference on Mobile Systems, Applications, and Services, pages 177–190, Seattle, USA, June 2005.
[5] S. Helal, B. Winkler, C. Lee, Y. Kaddourah, L. Ran, C. Giraldo, and W. Mann. Enabling location-aware pervasive computing applications for the elderly. In Proceedings of the First
IEEE Pervasive Computing Conference. Fort Worth, Texas,
June 2003.
[6] H. Junker, P. Lukowicz, and G. Tröster. Continuous recognition
of arm activities with body-worn inertial sensors. In Proceedings of the International Symposium on Wearable Computers,
Oct. 2004.
[7] C. Lee and X. Yangsheng. Online, interactive learning of gestures for human/robot interfaces. In IEEE International Conference on Robotics and Automation, volume 4, pages 2982 –
2987, April 1996.
[8] P. Lukowicz, J. Ward, H. Junker, G. Tröster, A. Atrash, and
T. Starner. Recognizing workshop activity using body worn microphones and accelerometers. In Pervasive Computing, 2004.
[9] D. W. Marquardt. An Algorithm for Least-Squares Estimation
of Nonlinear Parameters. Journal of the Society for Industrial
and Applied Mathematics, 11(2):431–441, June 1963.
[10] H. Muller, M. McCarthy, and C. Randell. Particle filters for
position sensing with asynchronous ultrasonic beacons. In Proceedings of LoCA 2006, LNCS 3987, pages 1–13. Springer Verlag, May 2006.
[11] G. Ogris, T. Stiefmeier, H. Junker, P. Lukowicz, and G. Tröster.
Using ultrasonic hand tracking to augment motion analysis
based recognition of manipulative gestures. In Proceedings of
the IEEE International Symposium on Wearable Computing,
pages 152–159, Oct. 2005.
[12] D. J. Patterson, D. Fox, H. Kautz, and M. Philipose. Fine-Grained Activity Recognition by Aggregating Abstract Object
Usage. In Proceedings of ISWC 2005: IEEE 9th International
Symposium on Wearable Computers, October 2005.
[13] C. Randell and H. Muller. Context awareness by analysing
accelerometer data. In Proc. 4th International Symposium on
Wearable Computers, pages 175–176, 2000.
[14] L. Seon-Woo and K. Mase. Activity and location recognition
using wearable sensors. IEEE Pervasive Computing, 1(3):24–
32, July 2002.
[15] A. Smith, H. Balakrishnan, M. Goraczko, and N. Priyantha.
Tracking moving devices with the cricket location system. In
Proc. 2nd USENIX/ACM MOBISYS Conference, Boston, MA,
June 2004.
[16] M. Stäger, P. Lukowicz, N. Perera, T. von Büren, G. Tröster,
and T. Starner. SoundButton: Design of a Low Power Wearable
Audio Classification System. In ISWC 2003: Proc. of the 7th
IEEE Int’l Symposium on Wearable Computers, pages 12–17,
Oct. 2003.
[17] T. Starner, B. Schiele, and A. Pentland. Visual contextual
awareness in wearable computing. In IEEE Intl. Symp. on
Wearable Computers, pages 50–57, Pittsburgh, PA, 1998.
[18] C. Vogler and D. Metaxas. ASL recognition based on a coupling between HMMs and 3D motion analysis. In ICCV, Bombay, 1998.
[19] J. Ward, P. Lukowicz, and G. Tröster. Evaluating performance in continuous context recognition using event-driven
error characterisation. In Proceedings of LoCA 2006, LNCS
3987. Springer Verlag, May 2006.
[20] J. A. Ward, P. Lukowicz, G. Tröster, and T. Starner. Activity
recognition of assembly tasks using body-worn microphones
and accelerometers. IEEE Transactions on Pattern Analysis
and Machine Intelligence, accepted for publication 2006.
[21] J. Yamato, J. Ohya, and K. Ishii. Recognizing human action in
time-sequential images using hidden Markov models. In Proceedings of IEEE Conference on Computer Vision and Pattern
Recognition, pages 379–385, 1992.