>> Ganesh Ananthanarayanan: We are excited to have He Wang today. He's a PhD student at
UIUC who works with Romit Roy Choudhury, who is quite familiar to many in MSR. He
himself is no stranger to MSR. He's interned here twice. Today, he'll be talking about his work
on multi-modal sensing, gesture recognition and indoor localization.
>> He Wang: Thank you. It's my pleasure to be here to talk about my research, which is about
using mobile devices to sense a user's location. Let me start by saying a few words about the
background of this research. In a matter of 10 years, from 2005 to 2015, the mobile phone has
transformed from a basic communication device into a smart device that performs sensing,
computing and communication. Today's smartphones have around 14 sensors on them, more than
five communication interfaces and more CPU power than the Apollo guidance computer
that landed man on the Moon. Now, given that these devices are always on and always with us, we
can view them as general-purpose human sensors capable of zooming into our lives and
understanding our daily activities, preferences and behavior patterns. Silicon Valley has
been calling this the quantified self, essentially suggesting that the data from these devices can be
used to draw inferences about ourselves and ultimately enable a wide variety of applications.
So let me just talk about a few example applications that are already on the market.
Using the accelerometer data on the smartphone, it is possible to count the number of steps that a
user has walked, and many calorie-tracking applications have emerged. Newer smartphones
have a skin-conduction sensor that can measure heart rate and enable many mobile health
applications. And of course, GPS gives us driving directions, right? While these were really
cool and important a few years back, they now seem obvious, and the bar users set for
the next generation of mobile applications is much higher. If we want to deliver better
applications in the future, research is necessary toward robust, efficient and practical
inferencing techniques on humans. Here are just a few examples that are still quite hard
today: finding a user's indoor location, estimating a person's posture, tracking precise
hand gestures, and various forms of context awareness. My research focuses on developing
inferencing techniques using multi-modal sensing data from mobile devices, such as
smartphones and smartwatches, and I believe these inferences can enable new kinds of human-centric mobile applications. For example, consider physical business analytics: if we understand a
user's location, postures and gestures, then we can understand that user's shopping behavior in a
grocery store. Interestingly, this is already happening extensively on the web. Our clicking
patterns and our mouse movements are driving a billion-dollar business called web analytics. I
believe our footsteps in an indoor environment are like click streams online, and
when we look at a cereal box in a grocery store, it is like clicking on an online item. So
physical analytics and web analytics are really similar, but physical analytics is still not
available today, simply because we don't have the ability to understand the user's location, postures
and gestures. Suppose we have location, postures and gestures as building blocks. Then I
believe we can enable this physical business analytics, and many other applications open up -- for example, understanding the location and orientation of the phone can enable augmented
reality in indoor museums and -- sure.
>>: [Indiscernible] do you also consider using the infrastructure and cameras and things, which
might be a more direct indicator of where you are?
>> He Wang: That's a possible solution. If you run the analytics using a camera, that's
another possible solution, but there will be other challenges, such as [indiscernible] or
something else. So there are different tradeoffs. Yes. Also, gesture-based control can be done
with just a smartwatch. And finally, virtual gaming and virtual reality are around the corner and
emerging, and understanding the postures and gestures of a user's hand can be a key
building block for such immersive gaming experiences. If you look at these building
blocks across the different applications, the common denominator is really location: the location of
the human body for indoor localization, the location of the arm for posture estimation, and the
location of the hand and fingers for precise hand gestures. So my talk is really about how we can
leverage smart device sensors to do this macro-to-micro localization on the human body. So let
me start by talking about indoor localization. The question in indoor localization is
essentially: where am I in the indoor environment? There has been a huge amount of work
on this topic, so let me just quickly lay out the key ideas. The first question is, why don't
we just use outdoor localization technology, for example GPS? Well, the problem with GPS is
that GPS signals do not penetrate well into buildings, and even the signals that do get in bounce
around too much in the indoor environment, leaving the GPS receiver with poor accuracy, if it gets a
fix at all. Wi-Fi turns out to be a reasonable alternative, especially given its wide availability, and it's probably
the most popular approach thus far. The assumption is that we have a couple of Wi-Fi access points in the
environment, and when a phone runs Wi-Fi scans, it gets a list of Wi-Fi signals. As
the user walks around, the received signal strengths change, so from the signals we can infer
the user's location. But to make this happen, somebody has to go everywhere in the building and
construct the mapping between signal strength and location. That takes a lot of effort, and
even worse, this mapping changes over time, so to ensure high accuracy, we need to
periodically recalibrate the signals and rebuild this map.
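Just to make the fingerprinting idea concrete, here is a minimal sketch of the classic nearest-neighbor approach in Python; the access points, survey values and distance metric are made-up illustrations of the general technique, not the system or data discussed in this talk.

    # A minimal Wi-Fi fingerprinting sketch: nearest neighbors in signal-strength
    # space. The survey table and access-point names are made up for illustration;
    # a real deployment would collect thousands of such fingerprints.
    import math

    # Offline survey: location -> {access point: RSSI in dBm}
    SURVEY = {
        (0.0, 0.0):  {"ap1": -40, "ap2": -70, "ap3": -80},
        (5.0, 0.0):  {"ap1": -55, "ap2": -60, "ap3": -75},
        (5.0, 5.0):  {"ap1": -70, "ap2": -50, "ap3": -65},
        (10.0, 5.0): {"ap1": -80, "ap2": -45, "ap3": -55},
    }

    def rssi_distance(scan_a, scan_b, missing=-100):
        """Euclidean distance between two scans in RSSI space."""
        aps = set(scan_a) | set(scan_b)
        return math.sqrt(sum((scan_a.get(ap, missing) - scan_b.get(ap, missing)) ** 2
                             for ap in aps))

    def localize(scan, k=2):
        """Centroid of the k surveyed locations whose fingerprints best match."""
        nearest = sorted(SURVEY, key=lambda loc: rssi_distance(scan, SURVEY[loc]))[:k]
        return (sum(x for x, _ in nearest) / k, sum(y for _, y in nearest) / k)

    print(localize({"ap1": -58, "ap2": -57, "ap3": -72}))  # lands near (5.0, 2.5)

The hand-collected SURVEY table is exactly the calibration effort -- and the re-surveying as signals drift -- that makes this approach hard to scale.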
And Wi-Fi is not the only approach that people have considered. We have also looked at deploying beacons, for example Bluetooth beacons
and sound beacons. In the sound-beacon case, people deploy a couple of sound beacons in an
environment, and then the smartphone microphone can use the timing to calculate the distance
between the phone and each beacon; from that, they can triangulate and figure out the user's
location. Bluetooth beacons are another approach with a smaller range: when the user passes by,
the phone can hear the whisper of the Bluetooth beacon and figure out the phone's location. But
the problem with beacon-based approaches is that the deployment requires a lot of effort and cost,
so many companies like Google, Cisco, Intel and Samsung have been insisting on a system that is
software-based and scalable. So under this context, we started thinking, can we build something
that does not rely on the infrastructure? That seemed possible, because around the same time
smartphones had gained motion sensors like accelerometers, compasses and gyroscopes. So we
started thinking, can we use these motion sensors themselves to figure out the user's location, as
opposed to relying on Wi-Fi or acoustic signals from the environment? That seemed possible,
because in the accelerometer data, when the user walks, the phone bounces, right? With simple
filtering techniques, that should tell us the distance a user has walked, and the compass gives us
the direction in which the user is walking. If we combine the direction and the distance, we can
estimate the actual path that the user has walked.
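As a rough sketch of that combination -- counting steps from accelerometer peaks and advancing the position along the compass heading -- something like the following could work; the stride length and peak threshold are illustrative assumptions, not values from this work.

    # A minimal dead-reckoning sketch: detect steps as rising edges in the
    # accelerometer magnitude and advance the position along the compass heading.
    # Stride length and the peak threshold are illustrative assumptions.
    import math

    STRIDE_M = 0.7          # assumed average step length in meters
    PEAK_THRESHOLD = 11.0   # accel magnitude (m/s^2) above which we call it a step

    def dead_reckon(accel_mag, heading_rad, start=(0.0, 0.0)):
        """accel_mag[i] and heading_rad[i] are sampled at the same instants."""
        x, y = start
        path = [(x, y)]
        above = False
        for a, h in zip(accel_mag, heading_rad):
            if a > PEAK_THRESHOLD and not above:   # rising edge = one step
                x += STRIDE_M * math.cos(h)
                y += STRIDE_M * math.sin(h)
                path.append((x, y))
            above = a > PEAK_THRESHOLD
        return path

    # Two simulated steps heading east, then one heading north.
    accel = [9.8, 12.0, 9.8, 12.5, 9.8, 12.1, 9.8]
    head = [0.0, 0.0, 0.0, 0.0, math.pi / 2, math.pi / 2, math.pi / 2]
    print(dead_reckon(accel, head))  # roughly [(0, 0), (0.7, 0), (1.4, 0), (1.4, 0.7)]

In a loop like this, every error in the heading and in the assumed stride length accumulates into the position estimate, which is exactly the failure described next.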
So we tried this. We collected some data, and it turned out we failed miserably. We came back to
this later and found that this method, called dead reckoning, is essentially fundamentally difficult,
because our environment has metal all around. The metal causes the compass data to fluctuate,
and when the user plays with the phone, the phone also bounces. All of these errors together make
the dead-reckoning track deviate from the actual path over time.
But actually, in the past, when people did not have GPS, dead reckoning was routinely used by
ships and planes to figure out their location. For example, in 1927, Charles Lindbergh
landed in Paris after a nonstop flight from New York. At that time, there was no GPS, and
Charles Lindbergh just used dead reckoning. How could he do that on such a long trip with dead
reckoning alone? The trick was that he obtained fixes from the stars, and he used
that to guide himself. Once we knew this, we started thinking: hey, can we magically bring
some star-like landmarks into the indoor environment? The idea is that when a user walks, the
moment he hits a landmark, we can reset the user's location, and the user can dead reckon from
there. If we keep doing the same, then over time the new estimated path, this
green line, stays much closer to the actual path. That would be awesome, but the problem is that in the
indoor environment, we don't have stars to serve as landmarks. So we started looking at our
environment and thinking, can we find some landmarks? At first, we could not find any,
but then we collected some data, went back to the lab, and suddenly realized that if we look at
the environment through the eyes of the sensors, we can find many unique patterns in it.
For example, we found this unique magnetic fluctuation, and that could
potentially be used as a landmark. Out of curiosity, we also checked where this happens; it
turned out to be near a water fountain. We also observed other patterns, such as this one in the
accelerometer data, where we see an overweight phase followed by a weightless phase, and by the
way, our smartphones today have a barometer that can monitor pressure changes, so
we also observed pressure changes along with this accelerometer pattern. It turned out
these patterns are caused by the elevator. And at this particular spot, when a user passes by, the
cellular signal drops significantly and then comes back, so it can also be used as a
landmark, and there are many other examples, such as Wi-Fi fluctuation, turning around a
corner, taking stairs and so on and so forth. So we can use them as landmarks to help us
calibrate the user's location. But of course we cannot predefine all of these patterns, because we
want our system to work all over the world, in all kinds of buildings, and we don't know whether a
given building has a water fountain or elevators. So the idea is to automatically learn these
patterns from the environment. Yes. So using this cool idea, we developed a system called
UnLoc, unsupervised indoor localization, and we need to solve three main design questions.
One, how to automatically detect landmarks, and second, how to localize the landmarks, and
third, how to localize the users. Our solutions to these problems are interdependent and
recursive, but let me try to explain them one by one. So let me start by talking about how to
automatically detect landmarks. Consider a user walking in this building, and let's say
that for the next five minutes, we have a reasonable estimate of the user's location -- say,
from dead reckoning -- but it's a rough estimate. We will soon relax this assumption and
explain how we can bootstrap our system even when dead reckoning is bad. So let's say we have a basic
estimate, and at the same time we can collect the sensor data from the user. Therefore,
we will have location-to-sensor tuples. As more and more people walk in this environment, we
get more and more of such tuples. We can then extract features from the sensor data
to get location-to-feature tuples. Because we are interested in unique sensor
patterns in the feature domain, we run an unsupervised clustering algorithm on them. This is
just an illustration; in the real system, we have multiple features, so it is a higher-dimensional space.
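One way such landmark mining could look in Python is sketched below, assuming scikit-learn's DBSCAN for the unsupervised clustering; the feature scaling, eps, min_samples and the geographic-spread threshold are my own illustrative assumptions rather than UnLoc's parameters.

    # A minimal sketch of the landmark-mining step: cluster (feature, location)
    # tuples in feature space, then keep only clusters that are also tightly
    # confined in physical space. Thresholds are illustrative assumptions.
    import numpy as np
    from sklearn.cluster import DBSCAN

    def mine_landmarks(features, locations, geo_radius_m=4.0):
        """features: (n, d) sensor-feature vectors; locations: (n, 2) crude (x, y)
        estimates from dead reckoning. Returns candidate landmark positions."""
        labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(features)
        landmarks = []
        for label in set(labels) - {-1}:          # -1 is DBSCAN's noise label
            pts = locations[labels == label]
            center = pts.mean(axis=0)
            spread = np.linalg.norm(pts - center, axis=1).mean()
            # A distinctive sensor pattern that also recurs in one small physical
            # area (the magnetic spike near the water fountain) is a landmark;
            # a pattern seen all over the floor (plain walking) is not.
            if spread < geo_radius_m:
                landmarks.append(center)
        return landmarks

The Wi-Fi disambiguation mentioned next -- splitting a sensor cluster that shows up in several distinct areas by its ambient Wi-Fi signature -- is omitted from this sketch.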
The idea is that the points in the same cluster share the same feature pattern, but that's not
enough, because we want to know where this pattern actually happens in the real
world. Well, we can do that because we already have a crude location estimate, so we can
map the pattern back to the floor plan. Let's see one example, say this first cluster of points. It
turns out these points also fall in a small cluster in geographic space, which gives us
a sense that it can be used as a landmark. But not everything is as good as this. For example, if
you look at these blue patterns, they happen in multiple locations. The way to deal with that is
to leverage the Wi-Fi signal, but this is different from Wi-Fi-based indoor localization:
you don't need to manually label the location-to-Wi-Fi mapping. As long as you know that the
Wi-Fi signals of these two clusters are different, that's enough to distinguish them, so even when we use
Wi-Fi, it still comes for free. And some patterns, like this green one, just happen everywhere.
It turns out this pattern arises because walking itself creates a distinctive sensor pattern, but
since it happens everywhere, it's not helpful at all. So just because something is a cluster in the
sensor domain does not necessarily mean it's a landmark. And because we have this feature -- sure.
>>: So does this work best in narrowly constrained hallway type of environments? What about
like a mall, where the hall is much, much larger than some of the examples that you've shown?
>> He Wang: Yes, so the system will work better in narrow corridors than in open spaces, but the
system can still work in those areas, with a little worse accuracy.
>>: [Indiscernible] more people in the mall, for example.
>> He Wang: Yes, it's not really impacted by that, because we -- so when more people walk around,
the Wi-Fi signal can fluctuate a little bit, but the magnetic signal is still quite stable, and so is the
inertial pattern of the user walking past.
>>: So in which departments did you test on?
>> He Wang: We tested our system in an engineering building and in office buildings at different
places, and we also tested it in a shopping mall. Yes. So if you play with these
features, we can generate different kinds of landmarks. In our system, we use inertial
landmarks, meaning we use the gyroscope and accelerometer data, and we also have magnetic
landmarks and Wi-Fi landmarks. Because our environment has so much sensing data, these
landmarks are likely to be reasonably evenly distributed -- and we can always use the Wi-Fi signal
to divide the area into subzones -- so our hypothesis is that our indoor environment has
enough landmarks for localization. That just means we have found the stars in our
indoor environment, and we can use them to periodically calibrate the user's location and
provide high accuracy. Sure.
>>: Part of the assumption is that at the landmarks, the set of sensor signals that you see
doesn't change over time -- that it's invariant?
>> He Wang: So the signal doesn't change on a short time scale, but over a longer time, it
does change. So what we do is, after a period, we regenerate the landmarks, such that they are
tuned to the latest state of the environment. That is possible because the data is collected
automatically, so there is no manual effort to go there and label things again and
again. So potentially, what we can do is regenerate at the end of each day, and that should
hold for at least one day.
>>: So what's an example of something that changes over the long term versus the short term?
Do you have an intuition for why that is?
>> He Wang: For example, the Wi-Fi signal, it should work for at least one or two days, and
magnetic signal, that works even longer. I think it works for more than one or two months.
>>: That would be like if someone was installing a new drinking fountain or something like that,
then the magnetic signatures would change.
>> He Wang: Yes, in those cases. If the building structure changes, then that could change, but
typically, it's quite stable.
>>: [Indiscernible] coordinate system do?
>> He Wang: Yes, so we need to manually label two places to anchor the output to the building's
coordinate system.
>>: [Indiscernible] walk around the place manually? So how do you bootstrap --
>> He Wang: Right. Yes, I will talk about how to bootstrap in a minute, but we need to
manually label where the building entrance is and where one of the stairs or elevators is. But you
don't need to go there. Let's say I can stand here and label which one is the building entrance of
Building 42.
>>: [Indiscernible].
>> He Wang: That is the assumption, but the thing is that we don't need a detailed map labeling
where the walls are. You just need to tell me two pieces of information: where is
the entrance, and where is one of the elevators? That's our requirement.
>>: Could you -- maybe you'll get to it later, I don't know. Are there any applications to a home
environment?
>> He Wang: Yes, I think there will be applications for the home environment. For example, if
you leave the room and want to automatically turn off the lights, or you want to keep an eye on
elderly people who are home alone and how things are going, I think there will be many
applications.
>>: I was thinking you would get fewer Wi-Fi signals, most likely.
>> He Wang: Yes, I think for the home environment, you have fewer Wi-Fi, but we still can
rely on the sensors to track your steps. Yes, sure.
>>: Just curious, so what if many people are using a Wi-Fi hotspot, and they are moving, so will
it impact this accuracy?
>> He Wang: What if multiple people are using a hotspot and they are moving -- under this
scenario, what is the accuracy? In our system, because we can rely on step counting and
tracking, and we also rely on magnetic and inertial patterns, we don't rely heavily on Wi-Fi.
That means we only use the most stable Wi-Fi signals; we can afford to be highly selective about
which Wi-Fi signals we use and to use only coarse features, because we rely on the other sensors,
and that improves accuracy. Yes. So that sounds good, but wait, right? So far we have assumed that we have
reasonably good dead reckoning, but what if the dead reckoning is not that good, which may
well always be the case in the real world? What happens then? Our idea is this: let's say
the user walks past three sensor patterns, shown here in white, blue and green, and different users
walk the same path. Just because of tracking error, their estimated paths are going to diverge from
each other over time, but the initial portion shouldn't be too bad -- dead reckoning is reasonably
good near the starting point. So if we can use those traces to figure out that this first
pattern actually happens in a small area, it should be recognized as a landmark, and we can then
use the [indiscernible] of the landmark as its location estimate. This is just an initial estimate; it is
not accurate or perfect at all, but it's reasonable. And the errors from different devices --
different hardware errors, different phones, different users' step sizes -- are uncorrelated.
If you plot real-world data, and this is one example, this square
is the true landmark location, and each of the blue points shows an estimate
from one of the dead-reckoning paths. As you can see, they are really uncorrelated,
because of the hardware noise and the human step sizes and so on, so they can essentially
be averaged into an initial estimate. But the problem here is that the blue and
green patterns cannot yet be recognized as landmarks, because their estimates are scattered over
quite a large area -- you are not sure what is happening there, right? But since we know the first
one is a landmark, we can reset the user's location at the first one and dead reckon from
there, and now the second pattern, the blue one, collapses into a smaller area, so we can recognize
it as a landmark too. We can keep doing the same and finally recognize that the last one is
also a landmark. In other words, we can gradually grow the set of landmarks outward from the
origin and finally fill the whole building in a bootstrapping phase. But these estimates are
not perfect; they have errors. The good news is that we always have users walking in our
building, so we can take these new user traces, feed them to our system and
use them to improve the estimates of the landmark locations. Here's how we do this. Let's say a user
walks in the building; we can check whether there's a new landmark, or whether the user hit an
existing landmark, and then update the landmark list. As more
users walk in the building, the landmark location estimates become better, which in turn
improves the user location error, because users rely on the landmarks to reset themselves,
right? Better landmarks give better user estimates. As more and more users
walk and feed new data into the system, the landmark location error and the user location
error both keep decreasing.
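To make the recursion concrete, here is a rough sketch of that update loop; feature_similar, the matching radius and the running-average update are my own illustrative assumptions about the shape of the algorithm, not UnLoc's actual implementation.

    # A rough sketch of the bootstrapping loop: every new trace is corrected by
    # the landmarks found so far, and the corrected observations refine the
    # landmark positions in turn.
    import math

    landmarks = []   # each entry: {"feature": ..., "position": (x, y), "count": n}

    def feature_similar(a, b, tol=1.0):
        """Placeholder similarity test; a real system compares feature vectors."""
        return abs(a - b) < tol

    def match(feature, position, radius_m=5.0):
        """A known landmark whose signature and rough location both match, if any."""
        for lm in landmarks:
            if (math.dist(lm["position"], position) < radius_m
                    and feature_similar(feature, lm["feature"])):
                return lm
        return None

    def process_trace(sensor_events):
        """sensor_events: list of (feature, dead_reckoned_position) along one walk."""
        corrected, offset = [], (0.0, 0.0)
        for feature, pos in sensor_events:
            pos = (pos[0] + offset[0], pos[1] + offset[1])
            lm = match(feature, pos)
            if lm is not None:
                # Refine the landmark with this new, largely uncorrelated estimate.
                n = lm["count"]
                lm["position"] = ((lm["position"][0] * n + pos[0]) / (n + 1),
                                  (lm["position"][1] * n + pos[1]) / (n + 1))
                lm["count"] = n + 1
                # Reset the user onto the landmark (the indoor "fix from the stars")
                # and shift the rest of this trace accordingly.
                offset = (offset[0] + lm["position"][0] - pos[0],
                          offset[1] + lm["position"][1] - pos[1])
                pos = lm["position"]
            corrected.append((feature, pos))
        return corrected   # corrected traces are later mined for brand-new landmarks

    landmarks.append({"feature": 5.0, "position": (10.0, 0.0), "count": 3})
    print(process_trace([(0.0, (2.0, 0.0)), (5.2, (9.0, 0.5)), (0.0, (12.0, 0.5))]))

Each pass both snaps the user's track onto known landmarks and tightens the landmark estimates, which is the mutual improvement just described.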
And we have demonstrated this. We tested our system in more than six buildings, including
shopping malls, university buildings and office buildings, using five different Android models
and more than 20 users. Currently, it's running live in our lab, CSL, and even though we use
landmarks generated half a year ago, it still works robustly. To get a quantitative evaluation of
the system, we collected ground truth from the shopping mall and the ECE and CS buildings, and
this graph shows the result: the X-axis shows the error in meters, the Y-axis shows the CDF, and
the different lines show the system performance over time. As you can see, the performance
improves, and after around two hours of walking, we can achieve a median accuracy of 1.63
meters. So, in summary: nature had the diversity that guided Charles Lindbergh to find his way
to Paris, and our man-made indoor environment also has diversity that can be used to find
landmarks, which helps improve the accuracy of our indoor localization system and allows us to
achieve a median accuracy of 1.63 meters with no infrastructure cost and no manual calibration. Sure.
>>: This goes back to the question asked earlier. Your summary numbers are kind of averaged
over all the places that you tested, right? So in a place like this, what would you expect? In a
place like that mall?
>> He Wang: In a place like this, I think the accuracy will be worse than in narrow corridors.
We don't have an exact number for that right now. Yes. So after publishing the paper, we
continued working on this for another year, optimized different things, and deployed our system
in different buildings, and here is one example. The user is using our system, and what you can
see on the phone screen is shown on the left part of the video; when the user passes by Room
246, our UnLoc system can precisely capture this. We also have a back-end server that
automatically collects data from different users in the building, generates the landmarks,
localizes the traces and does all the localization management. And we have demonstrated our
work to different companies at different places, and TKE is an elevator company who is
interested in our system, because they want to use our system for their elevator scheduling. They
have elevators in Wall Street buildings -- again, super-tall buildings -- and people are tired of
waiting there for the elevator for a long time. They want to use indoor localization to schedule their
elevator, and during these demos, we assumed that the phone is in the user's hand and the user is
looking at the screen and walking like this. But in our demo to Intel, they tried our system at
different places. They tried to put the phone in the pant pocket, in the shirt pocket, and all kinds
of places and orientations, and the system just failed in those cases. This reminded us that
indoor localization is not just about navigation: you cannot always assume the phone is in the
user's hand, with the user looking at it, while you are tracking them. So the question is how we can still
estimate a user's walking direction even if the phone is in the user's pocket or in other
places and orientations, right? We solved this challenge in another paper; because of time
constraints, I cannot go into the details, but I will just show a very brief
demo of that. In this video, the green line shows the estimated walking direction of the
user, and while the user is walking, the user tries to hack the system by changing the
orientation of the phone. We can still estimate the pose of the phone and then quickly
recover the user's walking direction. We then integrated that system and UnLoc together,
and then Samsung purchased a research license of our work, and they are interested in pushing
this to their Android platforms. So with that, I will move to the next part, but before that, any
questions?
>>: [Indiscernible] you said your accuracy was 1.63 meters.
>> He Wang: Yes, in the median case.
>>: Yes, so again, I'm not totally familiar with this space, but I do see recent work on indoor
localization using Wi-Fi that talks about decimeter level -- decimeter, one-tenth of a meter. How
does your thing compare?
>> He Wang: Yes, so I am aware that many systems can achieve better accuracy, but typically,
they will require deploying infrastructure, deploying hardware or you need to [indiscernible] the
detailed Wi-Fi information, such that it can provide that accuracy.
>>: The hardware -- it has been known for ages that you can take one approach, which is to
deploy infrastructure and hardware, and that can get you a significant level of accuracy. In fact,
we [indiscernible] deployed the Cricket system a long time back, which was ultrasound-based
and it [indiscernible]. And then the other one with [Kyle] and [Suchi] and all these guys --
>>: Yes, this was [Suchi] and --
>>: Yes, they're deploying -- [Bridget's] deploying an extra piece of hardware at the access
points.
>> He Wang: Yes, so in these cases, it may not scale to all buildings easily.
>>: I actually missed -- I'm so sorry. I thought it was 10:30. I came in at exactly 10:30.
Anyway, what is the weakness of your system?
>> He Wang: Yes, so the weakness, I think, is that our system still relies in some sense on dead
reckoning -- on step counting and direction estimation. Even though we kind of
solved the problem of the phone changing orientation or being put in different places, user
behavior may not match our expectations. A user can just wave the phone around like this, like
I'm doing while talking, and then the system thinks they are walking. In some of those cases we
can still fall back on Wi-Fi as a lower bound, but the system needs to handle those cases
carefully. User behavior may not always be as expected.
>>: Did you try -- well, what were the challenges of doing it passively? So when you said that it
might be in your [indiscernible], but if the user is just -- they're not really trying to locate right
now, they've been walking around in the mall for 10 minutes, they can't find their store, and now
they want to start localizing. And so you'd have to keep all these sensors on, for example, when
it passes them, so people might not like the battery drain.
>> He Wang: So in those cases, we do have to keep the sensors on. The tracking itself is not really a
problem, because the phone is in a relatively stable place, even if it's in the pocket, right? But what
about energy? We don't have an energy number, but continuous sensing is becoming quite popular --
I know that many of you are working on continuous sensing projects -- and newer
smartphones have additional hardware, a lightweight co-processor, that can handle
easy tasks such as sampling and simple calculations. Using that will reduce the
energy cost. Yes. So let me move to this pose-estimation problem. So
here, what we want to do is understand the pose of the user -- this moves us to a smaller spatial
scale. Can we track the arm pose of the user using just a smartwatch? By
posture, I mean the 3D locations of the wrist and elbow, and there are a couple of challenges. First of
all, there is noise in the sensors; the accelerometer and gyroscope are not perfect, and over
time these errors accumulate, so you cannot just use double integration. Second, the
sensor data comes only from the user's wrist, since the smartwatch is on the wrist, so how can you
infer the elbow location? It seems we don't have enough data. And the third challenge is that we
don't have training data. We don't want to train the user on a specific set of gestures so that the
system only works for those gestures; we want to handle freeform gestures. Those are the
challenges we are facing. But this arm posture tracking problem is not new at all, and many
researchers have looked at it -- for example, people in the robotics, biology and signal-processing
domains have been looking at this problem. The work closest to ours is one that also tries to use
the smartwatch to figure out the user's pose. What they do is leverage a couple of
opportunities. First of all, our elbow always lies on a sphere centered at the shoulder,
and the wrist lies on another sphere centered at the elbow, so based on these two constraints,
we can actually narrow down the search space quite a lot. People also borrow information from
the biomedical domain: our arm has five degrees of freedom, and each of them
has a range constraint. For example, at this [indiscernible] here, the arm can only move
from 0 to 150 degrees; it cannot move in the negative direction. If we combine these constraints,
we can narrow down the search space further, but that's still not enough. That's why they also
train on a preset of 15 gestures, and then they can do a pretty good job there. But what we want
is to use only the smartwatch and also handle freeform gestures. So there are a
couple of opportunities we are going to leverage. First of all, once we know the orientation of
the watch, that is very valuable to us: it helps us infer the user's wrist and elbow locations.
Second, the acceleration from the watch carries information about the user's movement, and if
we combine that with a hidden Markov model, we can probably get a better estimate. And of
course, we can also leverage data structures to improve the speed of our tracking system. Let me
first explain what I mean by watch orientation: I mean the pointing directions of the three axes
of the smartwatch. For a given orientation, we basically iterate over all possible joint angles, and
from that we can find the subset of angle combinations that satisfies this particular orientation.
With this subset theta, we can map theta to the possible locations of the user's wrist and elbow. Sure.
>>: You said five degrees of freedom -- three here, two here. What is this? Isn't this another
one?
>> He Wang: Yes, yes, so there are --
>>: Wrist rotation?
>> He Wang: So here I think we have --
>>: Move the slides back.
>> He Wang: So here we have three -- one, two, three -- and there is one more. There's one here,
actually. So that's one that comes up to here; when you do this, it comes to the upper arm, so we
call it there. Yes. So basically, from these five angles, we can try all the combinations
and see which ones satisfy this orientation constraint, and from that we can map back to the
points -- these are the possible wrist and elbow locations that satisfy this orientation constraint.
Our key observation is that, given the three-axis orientation of the watch, the possible wrist
and elbow locations are quite limited. For example, in this case, when the orientation is like
this, the possible wrist and elbow locations will be like this, right? As shown by these green and
red dots, the set is quite limited.
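A rough sketch of that search, with simple forward kinematics over the five joint angles, is shown below; the segment lengths, axis conventions, grid resolution and matching tolerance are all my own illustrative assumptions rather than the actual kinematic model used in this work.

    # Enumerate the five arm angles on a coarse grid, compute the watch
    # orientation each combination would produce, and keep the elbow/wrist
    # positions consistent with the measured orientation. All constants are
    # illustrative assumptions; the shoulder sits at the origin.
    import numpy as np

    UPPER_ARM, FOREARM = 0.30, 0.25   # segment lengths in meters (assumed)

    def rot(axis, angle):
        """Rotation matrix about a principal axis ('x', 'y' or 'z')."""
        c, s = np.cos(angle), np.sin(angle)
        if axis == "x":
            return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])
        if axis == "y":
            return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
        return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

    def arm_pose(a1, a2, a3, a4, a5):
        """Elbow/wrist positions and watch axes for one set of joint angles."""
        r_shoulder = rot("z", a1) @ rot("y", a2) @ rot("x", a3)   # 3 shoulder angles
        elbow = r_shoulder @ np.array([UPPER_ARM, 0.0, 0.0])
        r_elbow = r_shoulder @ rot("z", a4) @ rot("x", a5)        # flexion + twist
        wrist = elbow + r_elbow @ np.array([FOREARM, 0.0, 0.0])
        watch_x = r_elbow @ np.array([1.0, 0.0, 0.0])             # along the forearm
        watch_z = r_elbow @ np.array([0.0, 0.0, 1.0])             # out of the watch face
        return elbow, wrist, watch_x, watch_z

    def feasible_points(meas_x, meas_z, steps=9, tol=0.3):
        """All elbow/wrist positions whose predicted orientation matches the watch."""
        grid = np.linspace(-np.pi, np.pi, steps)
        flexion = np.linspace(0.0, np.radians(150), steps)   # the 0-150 degree limit
        candidates = []
        for a1 in grid:
            for a2 in grid:
                for a3 in grid:
                    for a4 in flexion:
                        for a5 in grid:
                            elbow, wrist, wx, wz = arm_pose(a1, a2, a3, a4, a5)
                            if (np.linalg.norm(wx - meas_x) < tol
                                    and np.linalg.norm(wz - meas_z) < tol):
                                candidates.append((elbow, wrist))
        return candidates

    # Example: watch pointing forward along x with its face up along z.
    pts = feasible_points(np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0]))
    print(len(pts), "elbow/wrist candidates consistent with this orientation")

The candidate list is the green-and-red point cloud on the slide; how confined it is, relative to the whole sphere of possible positions, is what gets quantified next.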
So let's try to quantify how good this opportunity is. We can easily do this, because we can
always try different combinations and see how it goes, right? The metric we use is the area of the
feasible point cloud divided by the whole sphere area -- what fraction of the sphere the region
covers. If the ratio is low, that's good news for us, and we plot the CDF: the X-axis shows the
ratio and the Y-axis shows the CDF, and we found that in the median case we typically cover
about 9% of the sphere, so it's a really small area. If you just use the centroid, you can probably
do a reasonably good job. But how can we figure out where within this narrowed-down area the
user's arm actually is? So the idea --
>>: [Indiscernible] the graph like that? But your 80th or 90th percentile is, what, 30%, 40%?
It could be anywhere in that sphere.
>> He Wang: Yes, so in many cases, it's not that good, and that's true, and I will just explain
how we can improve this. Yes. So the opportunity is using the accelerometer data, and from
that, we can infer the user's wrist movement. Take one example: when
the user is punching like this, the orientation of the watch doesn't change at all, but your
elbow is going to move backward and forward. The naive method would think you don't
move -- a static point. But since we know the acceleration of the watch, we should be able to
do better than a static estimate. That's the intuition. So how can we leverage the
accelerometer data here? What we want to understand is the real sequence of locations, right --
something like location A, then B, then C. From that sequence we can actually infer the
acceleration: for any candidate trajectory, two adjacent locations give us a velocity, and two
adjacent velocities give us an acceleration. We also have the real-world acceleration measured
by the user's watch. So basically, the question is how to combine the acceleration inferred from
a candidate trajectory with the acceleration the smartwatch gives us; we can bind these
together, consider the noise model, and then figure out the best possible estimate. The
straightforward thing to do is to use a third-order hidden Markov model, where each observation
depends on three consecutive location states, because three consecutive locations are what
determine an acceleration, and from that we can build the arm trajectory. But the problem is that
this is slow -- a third-order hidden Markov model cannot be solved efficiently -- so we need to
reorganize things so that an efficient algorithm applies. If we combine three consecutive
locations into one state, the model becomes a first-order hidden Markov model, and then we can
use the Viterbi decoding algorithm, which is efficient for first-order hidden Markov models. So
the idea is to bundle three adjacent locations into one state; because two adjacent locations give
you the velocity, each state then embeds the acceleration information. If you also have the
acceleration measurement from the watch, we can combine these, take the noise model into
consideration, and from that infer the trajectory. We also enforce continuity, because the
trajectory is going to be smooth, so the overlapping locations of adjacent states must be the
same, and we also enforce the point-cloud limitation: given the orientation constraint at a
particular time, the possible candidates must lie within that region.
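Here is a toy sketch of that bundled-state Viterbi idea, assuming the per-time-step point clouds come from the orientation filter; the time step, the Gaussian noise sigma, the finite-difference details and the toy example at the end are my own illustrative assumptions, and this is the slow triple-state formulation that the discussion below then reduces.

    # A toy bundled-state Viterbi: a state is three consecutive candidate wrist
    # positions, its implied acceleration is scored against the watch's measured
    # acceleration, and transitions are only allowed between states that overlap
    # in two positions (the continuity constraint).
    import itertools

    import numpy as np

    DT = 0.1      # seconds between samples (assumed)
    SIGMA = 1.0   # accelerometer noise std dev in m/s^2 (assumed)

    def implied_accel(p0, p1, p2):
        """Acceleration implied by three consecutive positions (finite differences)."""
        v1 = (np.array(p1) - np.array(p0)) / DT
        v2 = (np.array(p2) - np.array(p1)) / DT
        return (v2 - v1) / DT

    def log_emission(state, measured_accel):
        """Log-likelihood of the measurement under a Gaussian noise model."""
        err = np.linalg.norm(implied_accel(*state) - np.array(measured_accel))
        return -0.5 * (err / SIGMA) ** 2

    def viterbi(point_clouds, accels):
        """point_clouds[t]: feasible wrist positions (tuples) at time t, from the
        orientation filter; accels[t]: measured acceleration. Returns a position path."""
        states = [list(itertools.product(point_clouds[t - 2], point_clouds[t - 1],
                                         point_clouds[t]))
                  for t in range(2, len(point_clouds))]
        score = {s: log_emission(s, accels[1]) for s in states[0]}
        back = [{}]
        for t in range(1, len(states)):
            new_score, back_t = {}, {}
            for s in states[t]:
                # Continuity: the new triple must overlap the previous one in two points.
                prevs = [p for p in score if p[1:] == s[:2]]
                if not prevs:
                    continue
                best = max(prevs, key=lambda p: score[p])
                new_score[s] = score[best] + log_emission(s, accels[t + 1])
                back_t[s] = best
            score, back = new_score, back + [back_t]
        s = max(score, key=score.get)          # best final triple
        path = list(s)
        for back_t in reversed(back[1:]):      # walk the back-pointers to the start
            s = back_t[s]
            path.insert(0, s[0])
        return path

    # Toy example: two candidate positions per time step, zero measured acceleration.
    clouds = [[(0.0, 0.0, 0.0), (0.3, 0.0, 0.0)]] * 5
    print(viterbi(clouds, [(0.0, 0.0, 0.0)] * 5))

With N candidate points per step this formulation has on the order of N cubed states, which is exactly the cost that the questions and the state-reduction discussion below are about.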
>>: [Indiscernible] Viterbi?
>> He Wang: Yeah.
>>: Viterbi is already N-squared in the number of states, and this is a state explosion -- it's like an
[indiscernible] in the number of states over which you're doing the Viterbi. How does that help? So
why does this benefit in terms of speed?
>> He Wang: So I think what you are saying is exactly right, because --
>>: Something like this is usually not done with full Viterbi but with some kind of beam search, so
you don't actually look at all the Viterbi paths.
>>: And why would you use -- you're not using multiple states here, right? You're first order,
essentially --
>> He Wang: Right. So previously, if you consider the three locations separately, that would be a
third-order model, so we merge them together so that we can use Viterbi. If you treat it as three
separate states, you would probably have to search in exponential time; now at least we can do it
in polynomial time.
>>: Are you actually doing full Viterbi over all the states?
>> He Wang: I try to reduce that in the next slides. On this slide, let's say we have N possible
locations, right, and T time steps. Then our state count will be N cubed, and the running time
will be on the order of N to the sixth times T, and N will be a huge number, like 1,000
possible locations on a sphere, so that's not acceptable. So how can we reduce the number
of states? We reduce the number of states by looking at only two adjacent
locations. From the two locations, each state now encodes the velocity -- not
the acceleration -- but we can build the acceleration into the state transition instead, so the
transition from one state to another now captures the acceleration. In other words, we move the
acceleration from the observation to the transition, and at that point we can run the forward pass.
Of course, we still have the continuity and the orientation point-cloud constraints, but we can
reduce the state count to N squared, and the running time to roughly N to the fourth times T.
And we can do even better: because of the continuity constraint -- the trajectory has to be
smooth -- many entries in the Viterbi transition matrix will be zero, and if we reorganize things
so that the nonzero entries fall in contiguous chunks and record their start and end points, we can
further reduce the complexity to something like N cubed times T.
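As a quick sanity check on those orders of magnitude, here is a tiny back-of-the-envelope calculation; the N of 1,000 points and the one minute of data at 10 Hz are illustrative assumptions, not measurements from the system.

    # Rough state-space and work estimates for the reductions just described.
    N, T = 1_000, 600   # candidate points per step; one minute at 10 Hz (assumed)

    for label, states in [("triples (naive)", N ** 3),
                          ("pairs (acceleration moved to transitions)", N ** 2)]:
        # Plain first-order Viterbi over S states and T steps costs about S^2 * T.
        print(f"{label}: {states:.1e} states, ~{states ** 2 * T:.1e} operations")
    # Continuity means each pair can only follow pairs that overlap in one
    # location, cutting the pair version down to roughly N^3 * T transitions.
    print(f"with continuity pruning: ~{N ** 3 * T:.1e} operations")

Even the pruned figure helps explain why the decoding runs offline on a server rather than on the watch.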
>>: [Indiscernible] large-scale instrument tracking, you don't try to run full Viterbi, right? You run
some kind of beam search -- you just keep the most promising K paths or something like that. So
even N cubed seems large, and you're running an N-cubed algorithm with a large N on a phone --
is that what's happening? You're running this on a phone?
>> He Wang: No, not on the phone. We run that on the server.
>>: The model there now, I thought it was running on the watch.
>> He Wang: On the watch, right. So there are two system design points, two tradeoffs. One is
that you can achieve real time by using a simple method: you just calculate the centroid of the
point cloud, and that can be done in real time. The other is that you offload the data to the cloud,
and that gives you an offline result for other purposes -- say, understanding your activity over
the day or precise hand-activity recognition -- but that cannot be done in real time.
>>: The transition probabilities?
>> He Wang: I'm sorry?
>>: How do you determine the transition probabilities to begin with for your --
>>: How do you train the model?
>> He Wang: Right, so the transition probability is actually here, right? It basically depends on
the noise model, and for the noise model, we put the watch on a table, look at the variance of the
readings, and generate the parameters from that. Yes.
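Concretely, that calibration step could look something like the sketch below, assuming the watch lies face-up so gravity appears on one axis; the sampling rate and the simulated noise level are made-up values.

    # Estimate the accelerometer noise model from a static recording: with the
    # watch at rest on a table, any variation in the readings is sensor noise.
    import numpy as np

    def noise_params(static_samples):
        """static_samples: (n, 3) accelerometer readings (m/s^2) from a watch at rest."""
        samples = np.asarray(static_samples)
        bias = samples.mean(axis=0) - np.array([0.0, 0.0, 9.81])  # face-up: gravity on z
        sigma = samples.std(axis=0)                               # per-axis noise std dev
        return bias, sigma

    # Ten simulated seconds of a resting watch sampled at 100 Hz.
    rng = np.random.default_rng(0)
    rest = rng.normal([0.0, 0.0, 9.81], 0.05, size=(1000, 3))
    bias, sigma = noise_params(rest)
    print(bias, sigma)   # bias near zero, sigma near the simulated 0.05 m/s^2

The resulting sigma is what would parameterize the Gaussian scoring in the HMM sketch above.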
And then this figure shows the accuracy: the X-axis shows the error, the Y-axis shows the CDF,
the black line shows the real-time accuracy -- where we just use the centroid, which does not
always work well -- and the red line shows what we can do offline using the hidden Markov
model. I think what you are saying is right: it may not be the fastest method, but at
least in theory it gives you the best estimate. It is not real time yet.
>>: There's been a bunch of work in the graphics community and the motion capture world,
where they've actually done similar work with accelerometers, as well, and the way they solve
the scaling problem there is you know how in Viterbi, you look at all the paths using dynamic
programming and you get the N-squared. Instead of using all the paths at every point, you can
just pick the K most promising paths. It's a standard thing called a beam search, so you focus on
that, so then you can do the HMM much faster, so you might want to try that.
>> He Wang: I think that's a good suggestion.
>>: So this notion of getting point estimates from accelerometers -- the motion capture
community has done that, and they use similar insights there, so you might want to
look it over.
>>: But will the speedup be enough to run it on the watch?
>>: Well, I don't know about the watch, but it's certainly not N-squared at that point. So N-squared is what you're starting with.
>>: It might be [indiscernible].
>>: It's probably implemented at this point, right?
>> He Wang: We implemented this. Currently, for a one-minute trace, we take 10 minutes to
process -- it's like 10x real time.
>>: On the watch or on the server?
>> He Wang: On the server, on the server. But suppose we have more computing power, we can
probably do better, or you can try a different decoding for the hidden Markov model, like beam
search; that could also speed it up. But in the near future, I don't think this is really
possible on a watch -- offline, though, it's fine. And we have -- yes, sure.
>>: How did you measure the ground truth here?
>> He Wang: We used Kinect. We used Kinect.
>>: The main use of Kinect.
>>: Going back to the question about determining transition probabilities, you could put people
in front of a Kinect and also have the watch on and have them move, and then record data that
way. Did you guys do that?
>> He Wang: So if you do that, there are many problems, because the acceleration tracking
error also depends on the orientation estimation, and with all of this mixed together it's hard to
record a precise watch orientation, which also has an impact on how good your acceleration is, right?
>>: At least you can record elbow position, right?
>> He Wang: First of all, your watch is on the wrist, and even if you put it here and try to
estimate this, Kinect is not good enough yet to give you a precise acceleration-level model. It's not
that good yet, right? It can do --
>>: Its frame rate is sufficient for that.
>> He Wang: So you can see the acceleration is roughly similar, but that model is not good
enough, because it's not precise enough. Say you have a two-centimeter error; that has a large
impact on your acceleration estimate, and Kinect cannot give two-centimeter
accuracy today, especially for this kind of tracking.
>>: We found its accuracy to be quite good, centimeter for sure.
>> He Wang: Maybe you are using the latest Kinect.
>>: We're using actually an old one, but that's okay. We can take it offline.
>> He Wang: If that's good, then we can do that, too, right? We can build a better model if
possible, right? And we have this demo: the user is wearing the smartwatch and trying to write
in the air, writing ABCDE. This red line shows the Kinect result, and this dot
shows our system's tracking result. Even though here we have the user write ABCDE, our
system also handles freeform gestures, and there are all kinds of evaluations in the paper.
>>: Do you do this on the watch, or is the one that's done --
>> He Wang: This is a version that runs offline, so this is an offline result.
>>: What would it look like if you ran it on the watch?
>> He Wang: We have that video online, too, and the [indiscernible]. I think I linked that, too.
We have a side-by-side comparison of real time, offline and ground truth. Yeah. So the offline
accuracy here is around 8 to 9 centimeters, and the online real-time tracking is about 5 to 13
centimeters median accuracy. And with that, I want to very quickly talk about the last piece of work, which
is about hand tracking -- hand gestures. I'll spend only one minute on this. The question here is,
can we use the smartwatch to understand what the user is typing on a keyboard? That may
be a security problem, a privacy leakage, because we wear the watch to track steps or count
calories -- what if someone can use that smartwatch data to understand what you're typing? We
have a couple of challenges. First of all, as before, the sensor data on the watch is noisy.
Secondly, your watch location is not necessarily your finger location: if you type ASDF, your
wrist doesn't move at all, right? Third, we don't have the right-hand data -- the watch is only on
one hand. And also, of course, we don't have training data: when we attack somebody, that
person won't provide typing training data for us. So we solve these challenges
using [indiscernible], Kalman filtering, and we also borrow the structure of English words from the
dictionary; we combine these together, and we have detailed results in the paper. In
short, when you type a word longer than six characters, our system can on average shortlist 10
words that include the word you have just typed. For example, if you type 'confident',
out of 5,000 words in our dictionary we can rank 'confident' second.
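As a toy illustration of the dictionary step only: suppose the watch on the left hand let us recover the left-hand keystrokes of a word while the right-hand keystrokes remain unknown; then the dictionary prunes the candidates as sketched below. The keyboard split, the tiny word list and the assumption of exact left-hand recovery are all illustrative simplifications -- the actual system works with noisy estimates and Kalman filtering rather than exact keys.

    # Shortlist dictionary words consistent with an observed left-hand keystroke
    # pattern, where right-hand letters are unknown ('.'). Everything here is a
    # toy illustration of the dictionary constraint, not the real attack.
    LEFT_HAND = set("qwertasdfgzxcvb")

    def left_pattern(word):
        """Keep left-hand letters; replace right-hand letters with '.'."""
        return "".join(ch if ch in LEFT_HAND else "." for ch in word.lower())

    def shortlist(observed_pattern, dictionary):
        """Words whose left-hand letter pattern matches the observation."""
        return [w for w in dictionary
                if len(w) == len(observed_pattern)
                and left_pattern(w) == observed_pattern]

    words = ["confident", "confluent", "reference", "wonderful", "computers"]
    print(shortlist(left_pattern("confident"), words))  # only 'confident' survives here

With a realistic dictionary and noisy key estimates the surviving list is of course longer -- on the order of the roughly 10-word shortlist quoted above.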
So in summary, our devices have this computation and sensing power, and we can use them as a
sensing and computing lens on society. By using motion data precisely, we can build interesting
inferences about humans, and my research focuses on designing these inferencing techniques --
using multi-modal sensor data from mobile devices to do location, posture and gesture
estimation. Our ultimate research goal is building systems that have impact. Beyond the work I
have talked about, I also work on outdoor localization, context awareness, search, mobile
security and augmented reality. Moving forward, I plan to spend several years on these human
inferencing techniques, but I will broaden beyond just the motion sensors to other sensors and
sensing dimensions, and I will do both bottom-up and top-down research. For the bottom-up
part, I will not only do indoor localization and postures; I also want to understand finer-grained
finger gestures, crowdsourcing and other behavioral analyses. And I also look at the privacy
problem, because as that last project showed, even though the sensing is good, it's a
double-edged sword -- it can raise privacy concerns. At the top-down level, I will try to leverage
these underlying system building blocks to build systems and applications, such as augmented
reality, smart home, vehicular analytics and mobile health applications. With that, I would like
to thank you for your patience, and I am happy to take any questions and comments. Thank you.
>>: Do you [indiscernible]. What do you think is the single biggest opportunity here in that
space that you talked about? You talked about many, many things. What do you think is --
>>: Can you show us your slide?
>>: What do you think is the single biggest bet or opportunity that we can make in this space?
Because to some extent, things like accelerometers and these sensors, it seems like everybody's
doing something or the other, and lots of stuff has already happened, so what's the biggest
possibility in your mind?
>> He Wang: I think I would focus on mobile health and physical analytics, which will require
many underlying techniques that are still not ready today. And also, we now have these wearable
devices and HoloLens, right, all these devices together -- how can we leverage them together
to build detailed, fine-grained inferences?
>>: How do you use HoloLens for mobile health, for example?
>> He Wang: For mobile health, I'm not really sure, but if you have some sensors on your head,
I think that could maybe -- I'm not fully sure, of course, and have never played with that, but I
think maybe you can use the sensor data to do detailed activity recognition, and you can
combine those sensors with your smartwatch and mobile phone together to do something
interesting.
>>: And do you think we are -- you grayed out indoor localization. Are we done?
>> He Wang: Not yet, I think. It's still not there, of course, and yeah.
>>: Okay. Are you done?
>> He Wang: For indoor localization? I think there are still interesting ideas, and I'm definitely
open-minded about collaborating on that, but I'm not saying I will necessarily work on this
for the next couple of years. I'm not sure. Yeah. Okay. Thank you so much.