
Quantifying Gymnast Performance using a 3D Camera
CSCI512 Spring 2015
Brian Reily
Colorado School of Mines
Golden, Colorado
breily@mines.edu
May 4, 2015
Contents

1 Introduction
  1.1 Motivation and Aims
  1.2 Problem Setting and Assumptions
2 Discussion of Previous Work
3 Approach
  3.1 Extracting the Gymnast Figure
  3.2 Fitting an Axis to the Gymnast’s Body
  3.3 Calculating Spin Extrema
4 Results
  4.1 Evaluation Method
  4.2 Accuracy
  4.3 Data Available to Coaches
  4.4 Performance
5 Discussion
  5.1 Achievements
  5.2 Limitations
  5.3 Future Work
  5.4 Conclusion
References
Code

List of Figures

1 Gymnastics Dataset
2 Kinect Skeleton
3 Mean Background
4 Isolating the Gymnast
5 Contour Around the Gymnast
6 Axis Fitted to the Gymnast
7 Fitting Cubic Splines
8 Error Histograms
9 Foot Position over Time
10 Bar Graphs of Spin Times
1 Introduction
The pommel horse is an event in men's gymnastics competitions, in which a gymnast swings around in a circular motion on all parts of the apparatus. The United States Olympic Committee is interested in using computer vision techniques to quantify their gymnasts’ performance on this event by calculating measurements that may be too minute for the coaches to directly perceive but that may influence the judges’ scores.
1.1 Motivation and Aims
A pommel horse routine is composed of different gymnastics moves. While some moves can involve just one leg (e.g. swinging a leg vertically), the majority of moves involve performing double leg circles, with difficulty increased by using just one hand or by performing circles on different parts of the horse.
An important part of the score can be determined by how consistently a
gymnast executes these spins - ideally one spin should take the exact same
amount of time as the next spin. Additionally, knowing tendencies about how
fast a gymnast is spinning can provide important information in training - if a
gymnast consistently slows down as he goes through his routine, conditioning
may be an issue. The Olympic Committee is interested in using computer vision to quantify these spin speeds, both to address these issues and possibly to find other ways to use the data.
The aim of my project is to develop an algorithm that can identify a spin and calculate the timing between spins. It uses a single depth camera and is developed in C++ with OpenCV, though testing was done with an identical algorithm implemented in Matlab. While I’ve continued past this objective and begun to identify an entire body pose, I won’t discuss that work here.
1.2 Problem Setting and Assumptions
The data set (Figure 1) I built for this problem was collected entirely at the
United States Olympic Committee’s training center in Colorado Springs. The
gymnasts captured in the data set are potential members of the Olympic team
that will compete in the 2016 Summer Olympic Games in Rio de Janeiro,
Brazil. The data set was captured using a Microsoft Kinect 2 camera, placed
on a tripod approximately 2.3 meters in front of the pommel horse. Because
the Kinect is a depth camera that uses infrared light to calculate the depth at
each pixel, the lighting is mostly irrelevant. The word ’mostly’ is used since
a second Kinect camera was placed to the side of the pommel horse in order
to capture footage of the gymnast from the side (not utilized in this experiment). Because this second Kinect also emits infrared light, there is possible
interference between the two cameras. While visible noise can be seen in the
Kinect footage, there is no evidence that it affected the results. The dataset consists of 39 different uses of the apparatus, separated by segments of other activity that wasn’t used (people walking back and forth, adjusting the pommel horse, or simply a blank scene). The footage was processed by hand to extract the 39 scenes from the Kinect 2 file format and store them as PNG image sequences.
These sequences are in the process of being annotated by hand, to mark the
position of the head, hands, and feet. Additionally, a start frame (when the
gymnast starts using the apparatus) and an end frame (when the gymnast
dismounts the apparatus) were specified.
Figure 1: Examples of the dataset generated for the project. (a) Pommel Horse Without Gymnast. (b) Gymnast on the Pommel Horse.
2 Discussion of Previous Work
When reviewing related work for this project, I read a variety of papers.
While creating an algorithm to track the gymnasts’ feet did not require a full estimated body pose, I felt that it would be good background as I worked on the project. Tracking and recognizing body parts is a widely researched area, as is estimating a pose from the resulting body part positions. For human bodies, pose estimation typically has a different meaning than traditional 3D pose estimation. While typical 3D
pose estimation attempts to reconstruct the 6 degrees of freedom of an object
in 3D space, for a human body it refers to the actual pose of the person - the
layout of the extremities, limbs, joints, etc. This process is also referred to
as skeletonization, as it results in a skeleton model fitted to the image. The
majority of this research has been done on RGB images, as affordable depth
cameras have only arrived recently. The arrival of the Kinect initiated a wide
variety of research using depth data, and in fact the device includes body
part tracking and pose estimation out of the box. However, this included
method does not work well for this problem setting.
The Kinect camera includes a pose estimation algorithm based on the
depth data it collects, developed by Shotton et al. [7] at Microsoft Research.
Their method classifies each of a body’s pixels individually as belonging to
one of 31 distinct parts. This classification is based on different features of the pixel - e.g. its relation to the depth of a pixel above it, or to two pixels below and to the side. These features - approximately 2000 in total - are used to train a decision forest, where each branch tests one of these features. Their method clusters the classified pixels into body parts, infers
joint locations from this, and connects these into a skeleton. The Kinect
algorithm is both very quick (approximately 5ms per frame on an Xbox
console) and accurate - for most cases. It fails, however, on positions like
those in my gymnastics dataset. Since it depends so heavily on training
data (approximately 1 million images), it incorrectly classifies images that
are not similar to the training set. For instance, it may correctly classify the
gymnast’s upper body as it is a standard upper body pose that may be used
in home gaming, but legs swung out to the side or an upside-down gymnast
cause it to fail dramatically, as seen in Figure 2.
In addition to Shotton’s work, a number of other pose estimation methods
have been created for depth imagery. One such work, from Zhu et al. [9], uses a model-based approach. Their method uses basic image features, not processed with an SVM or similar classifier, to fit a head-neck-torso model.
This model is constrained by detected features and the model position in
previous frames - taking time data into account, unlike the Kinect approach.
Figure 2: An example of Kinect generating an incorrect skeleton for a gymnast.
While they don’t state any quantitative results, they have good qualitative
demonstrations and state that their approach has been used to effectively
map human motion to a robot. However, their method is currently limited
to the upper body.
A few pose estimation methods for depth images have also been developed
that do not rely (extensively or at all) on training data. Two similar methods
focus on finding body extrema. Plagemann et al. [5] use the depth data to construct a surface mesh of the person. First, they find the point furthest from the centroid in terms of geodesic distance - the distance along a surface instead of direct Euclidean distance. Then they iteratively locate the next point furthest from all previously found points. This typically locates body extrema - head, hands, and feet - in a small number of iterations. These interest points are fed to a logistic regression classifier trained on image patches of body parts, resulting in significant speed improvements over a typical sliding window approach. While this method still requires training data, a similar
method developed by Schwarz et al. [6] uses the same approach of geodesic
distance. Schwarz’s approach does require the addition of RGB data, but with an initial pose and optical flow tracking it is accurate enough to generate full skeletons without training data.
I also investigated body part recognition algorithms based on RGB images. Most works in this area are based on two important papers. The first is the work on HOG features - Histograms of Oriented Gradients - by Dalal et al. [2]. This paper introduced HOG features as the accepted way to detect people in images. HOG features use a binning process for gradient
orientation, and the features are used to train a Support Vector Machine
(SVM). Then the test image is processed with a sliding window approach,
and using a multiresolution pyramid the window is classified at various sizes
by the SVM. Building on this is work done in the area of ’poselets’ by Bourdev et al. [1]. Poselets utilize HOG features, but instead of detecting an
entire body they are trained on different parts of it. Additionally, instead
of the researcher defining parts such as ’arm’ or ’leg’, the poselet approach
learns which portions of the body are significant, and often ends up learning
HOG features that describe areas such as ’torso and left arm’. These poselets
can be combined, as seen in Wang et al.’s [8] work on hierarchical poselets. Their method designs a body part detector that can look for traditional poselets - for instance, the ’torso and left arm’. But it is then hierarchically
broken down to detect ’torso’ and ’left arm’, which could be further broken
down to detect ’upper left arm’ and ’lower left arm’. This allows them to
find specific body parts, which they demonstrate briefly could be combined
using kinematic constraints. The most advanced work in this line of research
seems to be from Pishchulin, Andriluka, et al. [4], who extended the poselet idea into what they call ’Pictorial Structures’. Pictorial Structures model a body pose using a conditional random field (CRF). Their work uses poselets to provide a prior pose for an image, augmenting terms in their CRF-based model. This is used to estimate a variety of poses much more accurately than previous poselet work, nearing best-in-field performance and only being beaten by an approach that uses data from images across the dataset.
One work that is based on the poselet method but uses depth images was
published by Holt et al. [3]. Their work runs a multiscale sliding window
over the depth image, processing each window with a classifier trained on
poselets represented as a vector of pixel intensities over a 24x18 window. Holt
questions the use of HOG features on depth imagery but does not explain their reasoning. The classifier, built as a decision forest, is run over the entire image
and classifications are conglomerated into body parts. Holt shows very good
results for major body parts (head, shoulders), with reduced effectiveness
for upper and lower arms. Perhaps their biggest contribution is a dataset of
depth imagery poses from the waist up, with 10 body parts labeled as ground
truth.
3 Approach
My approach to this problem was based on finding the extrema of a gymnast’s
spin. If I could determine the exact time the gymnast’s feet were at their
furthest left point, then I could determine the amount of time it took for
them to reach the furthest left point again. This would be the exact time
that it took for the gymnast to complete one rotation around their center
axis (similarly for the furthest right point if they began spinning in that
direction).
In my explanation of the algorithm I will consider the case where the
gymnast began spinning from the camera’s left side, to the right side, and
back to the left side. I’ll use the term extremum to refer to the moment the gymnast’s feet reach their furthest extension - the moment when their circle of travel in (X, Y, Z) space intersects with the image plane in (x, y) space.
To determine the furthest point that the gymnast’s feet travel to the left, we
can take a number of approaches. The simplest is to use image processing
techniques to isolate the area belonging to the gymnast, and determine which
pixel is farthest left and farthest right. The problem with this is that it results
in extrema being marked at the position of the head, shoulders and arms of
the gymnast in addition to his feet. One can attempt to diminish this by
trying to restrict the extrema to be furthest from the center of the pommel
horse, but this does not work in a large number of situations where the
gymnast is not rotating around the center of the pommel horse. My eventual
solution to this problem was to fit a bent axis to the gymnast’s body, ideally
extending from his feet, bending at the body center, and extending to his
head.
3.1 Extracting the Gymnast Figure
To start, I segment the gymnast from the background and pommel horse.
I built a simple background subtractor in C++ that takes the average of N frames. A typical frame without a gymnast is seen in Figure 1a in the dataset description. This mean background is thresholded by its depth - as we know how far away the pommel horse is, it is simple to only keep the pixels in the depth range that a gymnast using the apparatus could appear in. Finally, the background is cleaned up with basic morphological processing, and one instance can be seen in Figure 3.
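As a rough illustration of this step, a minimal Matlab sketch is shown below. It assumes the empty-scene frames are stacked in a 3D array (bg_frames) and uses placeholder depth thresholds; the function name and constants are illustrative rather than the exact ones from my implementation.

% Build a mean background from N empty-scene depth frames and clean it up.
% bg_frames is H x W x N raw depth; depth_min/depth_max are placeholder thresholds.
function BG = build_background( bg_frames, depth_min, depth_max )
    BG = mean( double( bg_frames ), 3 );             % average the N frames
    BG = BG .* ( BG > depth_min & BG < depth_max );  % keep only the plausible depth band
    se = strel( 'disk', 3 );                         % small structuring element
    BG = imerode( BG, se );                          % erode then dilate to remove speckle
    BG = imdilate( BG, se );
end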
Figure 3: Mean Background
When processing the video of a gymnast’s performance, each frame (for
example, Figure 4a) is similarly processed - thresholded by depth and opened
with a small structuring element. An example of the depth thresholding can
be seen in Figure 4b. The background is then subtracted from the frame.
Pixels that are less than zero are discarded - this can occur due to noise from
the Kinect. Pixels less than a threshold are also discarded to reduce noise.
A typical resulting image can be seen in Figure 4c.
This resulting image is fed to a contour detection algorithm. In OpenCV,
this is as simple as one function call. In Matlab, the easiest way to do
this is to find the largest connected component, and then find the boundary
belonging to that region. An example can be seen in Figure 5. At the same
time, the centroid of this contour is found. In Matlab, this has already been
computed by the connected component detection. In OpenCV, this is done by calculating the moments of the contour, which are computed using Green’s Theorem - to simplify a lot, by integrating over the curve formed by the contour.
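In Matlab terms, the per-frame segmentation, contour, and centroid steps look roughly like the sketch below (in the C++ version, cv::findContours and cv::moments replace the connected-component and boundary calls). The helper name segment_gymnast and the threshold arguments are illustrative.

% Segment the gymnast in one depth frame and recover its contour and centroid.
% I is a raw depth frame, BG the mean background; thresholds are placeholders.
function [ contour, cen ] = segment_gymnast( I, BG, depth_min, depth_max, sub_thresh )
    I = double( I );
    I = I .* ( I > depth_min & I < depth_max );    % depth thresholding
    I = imopen( I, strel( 'disk', 3 ) );           % open with a small structuring element
    D = I - BG;                                    % subtract the mean background
    D( D < sub_thresh ) = 0;                       % drop negative and low, noisy responses
    mask = D > 0;
    CC = bwconncomp( mask );                       % connected components
    [ ~, biggest ] = max( cellfun( @numel, CC.PixelIdxList ) );
    mask( : ) = false;
    mask( CC.PixelIdxList{ biggest } ) = true;     % keep only the largest component
    B = bwboundaries( mask, 4, 'noholes' );        % boundary of that region
    contour = B{ 1 };                              % (row, col) points along the contour
    stats = regionprops( mask, 'Centroid' );       % centroid (moments in the OpenCV version)
    cen = stats( 1 ).Centroid;                     % [x y]
end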
Figure 4: The process of isolating a gymnast. (a) Gymnast on the Pommel Horse. (b) After Depth Thresholding. (c) After Subtraction.

3.2 Fitting an Axis to the Gymnast’s Body

The result of fitting an axis to the gymnast can be seen in Figure 6. The first step in my algorithm iterates through the contour, searching for two points - the furthest from the centroid, and the closest to the centroid. The reasoning
here is that the point furthest from the centroid is almost always the feet or
head, and especially so when the gymnast is extended to the side. I’ll refer
to the vector from the centroid to this point as Vector A, and it can be seen
in green in Figure 6. The point closest to the centroid is typically the waist,
especially in situations where the gymnast is bent at the waist. The vector
to this closest point can be seen in red in Figure 6 - I’ll refer to this vector
as Vector B. The angle between these vectors is important, and I’ll refer to
this as Angle C. Then the contour is searched again, this time looking for
the point that would form the other end of the axis. The vector is picked
based on the angles it forms to the two vectors found previously. Its angle to
Vector B should be the same as Angle C, and its angle to Vector A should
be twice as large as Angle C. This final vector can be seen in blue in Figure
6, and the relation between the angles should be clear.
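A condensed Matlab sketch of this search is below; contour is the (row, col) boundary from the previous step, cen is the centroid as [x y], and the names follow Vector A, Vector B, and Angle C from the text. It is a sketch of the idea, not the exact production code.

% Fit the bent axis: Vector A (furthest contour point), Vector B (closest point),
% then search for a third point whose angles best match the expected pattern.
function [ pA, pB, pC ] = fit_axis( contour, cen )
    pts = [ contour( :, 2 ), contour( :, 1 ) ];      % convert (row, col) to (x, y)
    d = sqrt( sum( ( pts - cen ) .^ 2, 2 ) );        % distance of each point to the centroid
    [ lenA, iA ] = max( d );  pA = pts( iA, : );     % end of Vector A
    [ lenB, iB ] = min( d );  pB = pts( iB, : );     % end of Vector B
    vA = pA - cen;  vB = pB - cen;
    angC = acos( dot( vA, vB ) / ( lenA * lenB ) );  % Angle C between A and B
    best = Inf;  pC = pA;
    for j = 1 : size( pts, 1 )
        v = pts( j, : ) - cen;
        len = norm( v );
        if len == 0, continue; end
        angA = acos( dot( v, vA ) / ( len * lenA ) );
        angB = acos( dot( v, vB ) / ( len * lenB ) );
        % the third vector should make Angle C with Vector B and 2*Angle C with Vector A
        score = abs( angB - angC ) + abs( angA - 2 * angC );
        if score < best
            best = score;
            pC = pts( j, : );
        end
    end
end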
Figure 5: Contour Around the Gymnast.
Figure 6: Axis Fitted to the Gymnast.
3.3 Calculating Spin Extrema
The x coordinate of the feet is tracked for each frame. To determine if frame i is an extremum, I compare its x coordinate to the values from two frames before and two frames after. If x_i is less than both x_{i-2} and x_{i+2}, then frame i is a left extremum. An identical method is used for right extrema, but checking if the value is greater than the other two.
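In code this test is just a comparison against the foot x coordinate two frames on either side; a minimal sketch, assuming foot_x holds the tracked x coordinate per frame, is:

% Mark frame i as a left or right extremum by comparing to frames i-2 and i+2.
% foot_x(i) is the gymnast's foot x coordinate in frame i (an assumed input).
is_left  = @( foot_x, i ) foot_x( i ) < foot_x( i-2 ) && foot_x( i ) < foot_x( i+2 );
is_right = @( foot_x, i ) foot_x( i ) > foot_x( i-2 ) && foot_x( i ) > foot_x( i+2 );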
The Kinect has a peak framerate of 30 frames per second, but often skips frames for unknown reasons. Thus some frames may be separated by 33ms, some by 66ms, and up to a peak of 133ms (that I have encountered). Since the gymnasts spin quickly and are very consistent between spins, ideally we want to produce the actual time of the extremum, and not simply the timestamp of the frame. To do this, I fit a curve to the points (time_{i-1}, x_{i-1}), (time_i, x_i), and (time_{i+1}, x_{i+1}) using a cubic spline - an example can be seen in Figure 7. This enables the algorithm to return the actual extreme x coordinate and its exact timestamp. An identical method can fit a curve to the points (frame_{i-1}, x_{i-1}), (frame_i, x_i), and (frame_{i+1}, x_{i+1}) to return the exact frame number (e.g. frame 144.7 instead of frame 145). The final step in the algorithm is to merge very close extrema. Depending on the positioning of the gymnast’s feet, multiple extrema may be recorded within a few frames of each other. My method merges these, averaging their timestamps and frame numbers.
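A minimal Matlab sketch of the refinement step is below; t3 and x3 are the three samples around a detected extremum (timestamps or frame numbers, and foot x coordinates), and is_right flags a right extremum. These names are illustrative.

% Refine an extremum by fitting a cubic spline through the three samples around it
% and evaluating on a fine grid; works for timestamps or for frame numbers.
function [ ex_x, ex_t ] = refine_extremum( t3, x3, is_right )
    tq = linspace( t3( 1 ), t3( 3 ), 200 );  % dense grid between the outer samples
    xq = spline( t3, x3, tq );               % cubic spline interpolation
    if is_right
        [ ex_x, idx ] = max( xq );           % right extremum: maximum x coordinate
    else
        [ ex_x, idx ] = min( xq );           % left extremum: minimum x coordinate
    end
    ex_t = tq( idx );                        % interpolated time (or fractional frame number)
end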
Figure 7: Cubic Splines Fitted to a Gymnast’s Path. Time (ms) is shown on the x axis and the x coordinate of the gymnast’s feet is shown on the y axis. (a) A Left Extremum. (b) A Right Extremum.
4 Results

4.1 Evaluation Method
As described above, the dataset was split into sections corresponding to activity on the pommel horse. For each frame, the timestamp from the Kinect is known. While positions of the head and hands were also annotated, for this evaluation I just used my annotations of the gymnasts’ feet. The position of these was put through the same cubic-spline-based interpolation method as above to create a ground truth result consisting of the timestamp and frame number of every extremum. The advantage of this over simply hand-identifying extrema frames and treating those as the ground truth is that interpolating in this manner allows for extrema that occur between recorded frames. For each segment, the image sequence of the gymnast performing was processed by the described detection algorithm. The resulting timestamps and frame numbers were compared against the ground truth for that sequence to compute the error for each extremum. Currently, I have results for 176 extrema.
4.2 Accuracy
This method, though simple, is remarkably accurate at this specific task. To show this, I computed the root mean squared error (RMSE) between the ground truth and detected timestamps, and between the ground truth and detected frame numbers. Additionally, I computed the average absolute error, taking the average error in both timestamps and frame numbers, but treating a detection 6ms too early the same as a detection 6ms too late (similarly for frame numbers). These results can be seen in
Table 1, showing that the method is able to detect extrema almost within a
hundredth of a second of ground truth data.
Error Metric                Error
RMSE (Time)                 12.9942 ms
RMSE (Frame Number)         0.2393
Average Absolute (Time)     7.8168 ms
Average Absolute (Frame)    0.1352

Table 1: Error over 176 Extrema
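For reference, both metrics in Table 1 can be computed from matched vectors of detected and ground truth extremum timestamps (the variable names here are illustrative); the same formulas apply to frame numbers.

% Error metrics used in Table 1, given matched vectors of timestamps in ms.
err  = detected_times - ground_truth_times;  % signed error, one entry per extremum
rmse = sqrt( mean( err .^ 2 ) );             % root mean squared error
mae  = mean( abs( err ) );                   % average absolute error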
Additionally, Figure 8 displays the error frequency graphically. These
histograms show the frequency of timestamp errors and frame number errors,
displaying counts for both signed and absolute errors.

Finally, the detection method does not detect any more or any fewer extrema than are in the ground truth dataset.
4.3 Data Available to Coaches
This timing data is accurate enough to provide the information that the
coaches were looking for - how consistent their gymnasts are.

Figure 8: Histograms of Errors. (a) Timestamp Errors. (b) Absolute Timestamp Errors. (c) Frame Number Errors. (d) Absolute Frame Number Errors.

One example sequence is displayed in Figure 9. This figure plots time along the x axis and
the left-right position of the gymnast’s feet along the y axis. Also annotated
are spin times - i.e. the time between one left extremum and the next left extremum, and the same for the right. This same sequence is displayed in Figure 10 with a different method - each bar is the time for that spin. Note the second graph in Figure 10, where the outlier spin at the end is removed (possibly this spin was slower due to a dismount). With all 22 spins, this gymnast’s average spin time was 973.86ms, with a standard deviation of 66.77ms. Removing the outlier, his average spin time was 960.62ms, with a standard deviation of only 25.07ms. In a vacuum this data isn’t useful, but it would allow for good comparisons between gymnasts - a low standard
deviation being a sign of consistency. Additionally, the graphically displayed
data could prove useful to coaches who are more visually oriented.
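The spin statistics quoted above come from simple differences between consecutive extrema on the same side. A short sketch, assuming left_times and right_times hold the interpolated timestamps (in ms) of the left and right extrema in order:

% Spin times are the gaps between consecutive extrema on the same side.
left_spins  = diff( left_times );   % ms between successive left extrema
right_spins = diff( right_times );  % ms between successive right extrema
spins = [ left_spins( : ); right_spins( : ) ];
fprintf( 'mean spin %.2f ms, std %.2f ms\n', mean( spins ), std( spins ) );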
Figure 9: Graph displaying the gymnast’s foot position over time.
Figure 10: Spin times for one gymnast’s performance on the pommel horse. (a) All Performed Spins. (b) Without the Outlier Spin.
4.4 Performance
In a real world implementation, little benefit would be had from running this
approach in real time. The coaches can’t give input to the gymnast while he
is performing, so it would make little difference if the application recorded the
gymnast and computed the extrema with a slight delay. That said, the current C++ implementation of this method easily works in real time on a mid-range modern laptop. Though the Kinect does not seem to reliably provide 30 frames per second, no delay is caused by the image processing described here. The Matlab implementation used for the evaluation section does not work in real time, however - on a four-year-old laptop, it achieves roughly 10
frames per second.
5 Discussion

5.1 Achievements
The method described here works remarkably well for its simplicity. Built on
basic image processing techniques and a basic concept of human kinematics,
it provides accuracy to nearly within a hundredth of a second of the ground truth. Additionally, the literature review and dataset construction
set this area up for more advanced pose estimation techniques.
5.2 Limitations
This approach does have some limitations. As it is currently built, it is only able to provide information about a limited set of pommel horse techniques. Advanced moves, such as the gymnast being upside-down or performing split leg scissors, do not adhere to the spin pattern it was built to detect and thus just produce nonsense data. Along the same lines, the method can pick up ’noise’ extrema from actions like the gymnast approaching the apparatus, as this again does not follow the pattern. This was not mentioned in the results section as the dataset was constructed to consist only of actual gymnastics sequences.
5.3 Future Work
These limitations lead directly into my planned future work. The next step is to
develop a pose estimation method that is aware of what body parts are where,
instead of this naive axis method. This would provide more information when
the gymnast is doing more atypical moves or simply entering and leaving the
pommel horse area, as well as allow extension to other sports. My current
approach to this is based on some work that estimates body part position
through geodesic distance from the body centroid, and I believe I have already
made some improvements over published methods. Following that, I will
work on developing an effective feature representation for body parts in depth
images, as no single standard has shown itself to be the best. While HOG
features are the standard for RGB images, they are not used in depth imagery.
In addition to that, I’ve begun working on skeleton fitting, to find a full body
pose based on recognized body parts. I see potential here to improve over
current methods. Finally, I will assist a team of undergraduate students in
building the method described here into a full application suitable for use by
gymnastics coaches.
5.4 Conclusion
In conclusion, I presented an effective method for foot tracking through the
major part of a pommel horse routine. This method is without question
accurate enough for real-world use, and runs much faster than real time.
References
[1] Lubomir Bourdev and Jitendra Malik. Poselets: Body Part Detectors
Trained Using 3D Human Pose Annotations. ICCV, 2009.
[2] Navneet Dalal and Bill Triggs. Histograms of Oriented Gradients for
Human Detection. CVPR, 2005.
[3] Brian Holt, Eng-Jon Ong, Helen Cooper, and Richard Bowden. Putting
the Pieces Together: Connected Poselets for Human Pose Estimation.
ICCV, 2011.
[4] Leonid Pishchulin, Mykhaylo Andriluka, Peter Gehler, and Bernt Schiele.
Poselet Conditioned Pictorial Structures. CVPR, 2013.
[5] Christian Plagemann and Daphne Koller. Real-time Identification and
Localization of Body Parts from Depth Images. ICRA, 2010.
[6] Loren Arthur Schwarz, Artashes Mkhitaryan, Diana Mateus, and Nassir
Navab. Estimating Human 3D Pose from Time-of-Flight Images Based on
Geodesic Distances and Optical Flow. IEEE Conf. Autom. Face Gesture
Recognit., 2011.
[7] Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark Finocchio, Richard Moore, Alex Kipman, and Andrew Blake. Real-Time Human Pose Recognition in Parts from Single Depth Images. CVPR, 2011.
[8] Yang Wang, Duan Tran, and Zicheng Liao. Learning Hierarchical Poselets
for Human Parsing. CVPR, 2011.
[9] Youding Zhu, Behzad Dariush, and Kikuo Fujimura. Controlled Human
Pose Estimation from Depth Image Streams. CVPR, 2008.
Code
Listing 1: Detection Approach
% Detection Method
% ----------------
%
% This is an implementation of my described approach in Matlab.
% There is a corresponding C++ version of this approach.
%
% Ground truth annotation and error checking are done elsewhere.

function [ EXTREMAS, FEET, TIMESTAMPS ] = gym_detect( images, start_frame, end_frame, BG )

    % minimum size of contour
    CONTOUR_LEN_THRESH = 50;
    % depth thresholds (8 bit * step)
    DEPTH_THRESH_MIN = 60 * 19;
    DEPTH_THRESH_MAX = 180 * 19;
    HEIGHT_THRESH = 370;
    % noise threshold for subtraction
    SUBTRACT_THRESH = 100;
    % minimum length for feet vector
    FEET_LEN_THRESH = 100;
    % operator used to clean up noise
    STREL3 = strel( 'disk', 3 );

    n = size( images, 3 );
    height = size( BG, 1 );
    width = size( BG, 2 );

    % do depth thresholding on background
    BG = BG .* ( BG > DEPTH_THRESH_MIN & BG < DEPTH_THRESH_MAX );
    for j = HEIGHT_THRESH : height
        BG( j, : ) = zeros( 1, width );
    end
    BG = BG .* ( BG > DEPTH_THRESH_MIN & BG < DEPTH_THRESH_MAX );
    BG = imerode( BG, STREL3 );
    BG = imdilate( BG, STREL3 );

    % data structures to track detections
    FRAME_TIMES = zeros( 1, 5 );
    FRAME_IDS = zeros( 1, 5 );
    EX_COORDS = zeros( 2, 5 );
    EX_LEN = zeros( 1, 5 );
    N_EXTREMAS = 0;
    EXTREMAS = [ 0 0 ];
    FEET = zeros( n, 2 );
    TIMESTAMPS = zeros( n, 1 );

    for i = start_frame-5 : end_frame+5
        I = images( :, :, i );
        % get_timestamp is a helper (defined elsewhere) that reads the Kinect
        % timestamp embedded in the frame
        t = get_timestamp( I );
        TIMESTAMPS( i ) = t;
        % if this timestamp isn't zeroed, image won't display properly
        I( 1, 2 ) = 0;

        % threshold frame
        for j = HEIGHT_THRESH : height
            I( j, : ) = zeros( 1, width );
        end
        I = I .* ( I > DEPTH_THRESH_MIN & I < DEPTH_THRESH_MAX );
        I = imopen( I, STREL3 );

        % subtract background
        I = abs( I - BG );
        I = I .* ( I > 0 );
        I = I .* ( I > SUBTRACT_THRESH );

        % find largest connected component
        % main_region is a helper (defined elsewhere) that returns regionprops-style
        % data for the main region of the image
        R = main_region( I );
        B = bwboundaries( R.FilledImage, 4, 'noholes' );
        largest = 0;
        for j = 1 : size( B, 1 )
            if size( B{j}, 1 ) > largest
                largest = size( B{j}, 1 );
                contour = B{j};
            end
        end
        if largest < CONTOUR_LEN_THRESH
            continue;
        end
        for j = 1 : size( contour, 1 )
            contour( j, 1 ) = contour( j, 1 ) + R.BoundingBox( 2 );
            contour( j, 2 ) = contour( j, 2 ) + R.BoundingBox( 1 );
        end

        % keep track of last 5 timestamps
        % this is useful in the very odd case that a contour can't be found
        FRAME_TIMES( 5 ) = FRAME_TIMES( 4 );
        FRAME_TIMES( 4 ) = FRAME_TIMES( 3 );
        FRAME_TIMES( 3 ) = FRAME_TIMES( 2 );
        FRAME_TIMES( 2 ) = FRAME_TIMES( 1 );
        FRAME_TIMES( 1 ) = t;
        FRAME_IDS( 5 ) = FRAME_IDS( 4 );
        FRAME_IDS( 4 ) = FRAME_IDS( 3 );
        FRAME_IDS( 3 ) = FRAME_IDS( 2 );
        FRAME_IDS( 2 ) = FRAME_IDS( 1 );
        FRAME_IDS( 1 ) = i;

        % find centroid
        cen_x = R.Centroid( 1 );
        cen_y = R.Centroid( 2 );

        % find point/vector on contour furthest from centroid
        % also find point/vector on contour closest to centroid
        distances = zeros( size( contour, 1 ), 1 );
        for j = 1 : size( contour, 1 )
            x = contour( j, 2 );
            y = contour( j, 1 );
            distances( j ) = norm( [ cen_x cen_y ] - [ x y ] );
        end

        % define the longest and shortest vectors
        [ long_len, long_i ] = max( distances );
        long_vec = [ cen_x cen_y ] - [ contour( long_i, 2 ) contour( long_i, 1 ) ];
        [ short_len, short_i ] = min( distances );
        short_vec = [ cen_x cen_y ] - [ contour( short_i, 2 ) contour( short_i, 1 ) ];

        % angle between longest vector and shortest vector
        long_ang = acos( dot( long_vec, short_vec ) / ( long_len * short_len ) );

        % find point/vector most opposite the short vector
        % find point/vector most opposite the long vector
        short_score = [ Inf 0 ];
        long_score = [ Inf 0 ];
        for j = 1 : size( contour, 1 )
            x = contour( j, 2 );
            y = contour( j, 1 );
            vec = [ cen_x cen_y ] - [ x y ];
            len = norm( vec );
            % search for vector opposite longest vector
            % but at the same angle to shortest vector
            maj_ang = acos( dot( vec, long_vec ) / ( len * long_len ) );
            min_ang = acos( dot( vec, short_vec ) / ( len * short_len ) );
            score = abs( min_ang - long_ang ) + abs( maj_ang - ( 2 * long_ang ) );
            if isinf( long_score( 1 ) ) || score < long_score( 1 )
                long_score( 1 ) = score;
                long_score( 2 ) = j;
            end
        end

        % define end coordinates for the two long vectors
        long_x = contour( long_i, 2 );
        long_y = contour( long_i, 1 );
        long2_x = contour( long_score( 2 ), 2 );
        long2_y = contour( long_score( 2 ), 1 );

        % see which long vector points down, assume that's the feet
        if long_y > long2_y
            feet_len = long_len;
            feet_x = long_x;
            feet_y = long_y;
        else
            feet_len = norm( [ cen_x cen_y ] - [ long2_x long2_y ] );
            feet_x = long2_x;
            feet_y = long2_y;
        end

        % track last 5 feet points
        EX_COORDS( :, 5 ) = EX_COORDS( :, 4 );
        EX_COORDS( :, 4 ) = EX_COORDS( :, 3 );
        EX_COORDS( :, 3 ) = EX_COORDS( :, 2 );
        EX_COORDS( :, 2 ) = EX_COORDS( :, 1 );
        EX_COORDS( :, 1 ) = [ feet_x ; feet_y ];
        EX_LEN( 5 ) = EX_LEN( 4 );
        EX_LEN( 4 ) = EX_LEN( 3 );
        EX_LEN( 3 ) = EX_LEN( 2 );
        EX_LEN( 2 ) = EX_LEN( 1 );
        EX_LEN( 1 ) = feet_len;

        % track all foot points for possible replay
        FEET( i, : ) = [ feet_x feet_y ];

        % we may want foot coordinates for the previous points,
        % but don't want to draw extrema
        if i < start_frame || i > end_frame
            continue;
        end

        % if point 3 is farther left than 1 and 5, left extrema
        % if point 3 is farther right than 1 and 5, right extrema
        extrema = false;
        [ ~, max_i ] = max( EX_COORDS( 1, : ) );
        [ ~, min_i ] = min( EX_COORDS( 1, : ) );
        if EX_LEN( 3 ) > FEET_LEN_THRESH && ( max_i == 3 || min_i == 3 )
            if EX_COORDS( 1, 3 ) < EX_COORDS( 1, 5 ) && EX_COORDS( 1, 3 ) < EX_COORDS( 1, 1 )
                extrema = true;
            elseif EX_COORDS( 1, 3 ) > EX_COORDS( 1, 5 ) && EX_COORDS( 1, 3 ) > EX_COORDS( 1, 1 )
                extrema = true;
            end
        end

        % if an extrema, interpolate its time and frame
        if extrema
            % only use the extrema and one point on either side
            x = [ EX_COORDS( 1, 4 ) EX_COORDS( 1, 3 ) EX_COORDS( 1, 2 ) ];
            % fit spline over those 3 timestamps
            t = [ FRAME_TIMES( 4 ) FRAME_TIMES( 3 ) FRAME_TIMES( 2 ) ];
            t_rg = FRAME_TIMES( 5 ) : FRAME_TIMES( 1 );
            % and those 3 frame numbers, down to 0.1
            f = [ FRAME_IDS( 4 ) FRAME_IDS( 3 ) FRAME_IDS( 2 ) ];
            f_rg = FRAME_IDS( 5 ) : 0.1 : FRAME_IDS( 1 );
            % fit spline to time, pick max point depending on left/right
            yy = spline( t, x, t_rg );
            if EX_COORDS( 1, 3 ) > EX_COORDS( 1, 5 )
                [ ~, idx ] = max( yy );
            else
                [ ~, idx ] = min( yy );
            end
            ex_time = t_rg( idx );
            % fit spline to frame #
            yy = spline( f, x, f_rg );
            if EX_COORDS( 1, 3 ) > EX_COORDS( 1, 5 )
                [ ~, idx ] = max( yy );
            else
                [ ~, idx ] = min( yy );
            end
            ex_frame = f_rg( idx );
            % record it
            N_EXTREMAS = N_EXTREMAS + 1;
            EXTREMAS( N_EXTREMAS, 1 ) = ex_frame;
            EXTREMAS( N_EXTREMAS, 2 ) = ex_time;
        end
    end

    % sometimes multiple extrema are detected at the end of a spin
    % combine nearby extrema within 4 frames
    removed = 0;
    for i = 1 : N_EXTREMAS
        if i == N_EXTREMAS
            continue;
        end
        if abs( EXTREMAS( i, 1 ) - EXTREMAS( i+1, 1 ) ) < 4
            EXTREMAS( i+1, 1 ) = ( EXTREMAS( i, 1 ) + EXTREMAS( i+1, 1 ) ) / 2;
            EXTREMAS( i+1, 2 ) = ( EXTREMAS( i, 2 ) + EXTREMAS( i+1, 2 ) ) / 2;
            EXTREMAS( i, 1 ) = 0;
            EXTREMAS( i, 2 ) = 0;
            removed = removed + 1;
        end
    end

    % only return the nonzero elements
    % (ALL_EXTREMAS is a temporary copy of the unfiltered list)
    ALL_EXTREMAS = EXTREMAS;
    EXTREMAS = zeros( nnz( ALL_EXTREMAS ) / 2, 2 );
    ct = 0;
    for i = 1 : N_EXTREMAS
        if ALL_EXTREMAS( i, 1 ) ~= 0
            ct = ct + 1;
            EXTREMAS( ct, 1 ) = ALL_EXTREMAS( i, 1 );
            EXTREMAS( ct, 2 ) = ALL_EXTREMAS( i, 2 );
        end
    end
end