>> Jie Liu: All right. Let's get started. It's a great
pleasure to have Shahriar Nirjon to give a talk today. Nirjon,
as he really goes by, is graduating from the University of
Virginia, from [indiscernible] group, and his Ph.D. research has
been primarily on mobile sensing, using mobile signals to derive
interesting contextual information. And last summer, he also
did an internship with us in Microsoft Research. He did some
indoor GPS work with me, also very exciting.
So today, I think he's primarily going to talk about the audio
piece of his work. And he will have time to meet with people for
the rest of today and tomorrow. All right.
>> Shahriar Nirjon: Hello, everyone. So my name is Shahriar
Nirjon. I'm from the University of Virginia. Thanks, Jie, for
introducing me. It's my great pleasure to be here once again at
Microsoft Research. So the title that I picked for my talk today
is a Smartphone as your Third Ear. Technically, it is about
general purpose acoustic event detection on a mobile device. So
let's get started.
So every day around us, we hear hundreds and thousands of different
types of acoustic events, even short, transient sounds that are
happening. And we human beings are capable of remembering them,
recognizing them, and acting upon them.
Now, having said that, I would like to do a small experiment with
you. I want to play a sound clip, a 30-second sound clip, and
you have to describe the sound. At the end of this, I will be
asking whoever wants to volunteer to describe the sound. So
let's wait a second for it. Okay. So your sound clip starts
now. [music].
Okay. So can anyone describe what it was, using any words or any
tag you would like to attach to it?
>>: Music.
>> Shahriar Nirjon: Music, we've got one, music. Anything else?
>>: Melody.
>> Shahriar Nirjon: Melody, fine.
>>: Piano.
>> Shahriar Nirjon: Piano, the instrument, he's trying to sense
the instrument, okay. Piano. Anything else?
>>:
It sounded like a movie sound track or something.
>> Shahriar Nirjon: It's a soundtrack from something. A movie.
Not a movie, but a TV drama series. Okay. So this is what it
was. It was from Sherlock, a famous, recent TV series shown on
BBC. So I asked the same question at several of my previous
talks and this is what I got. So everyone could say it's music;
it's easy to guess.
>>:
[inaudible].
>> Shahriar Nirjon: Yeah, one out of three persons could
actually say it was from Sherlock, because they are mostly
students, so they know what it is. They watch a lot of TV. And
some people could actually guess the right instrument; there are
like ten percent of them.
Now, for those who were guessing and want to know what it was,
it's nothing fancy like a piano. It was a simple DIY music box. So this
is how it was made.
[music].
Okay. So the point that I was trying to make is that the same
sound clip can be described by different people differently.
And it really depends upon our experiences. Someone who
heard this music for the first time has probably
learned it as Sherlock. Someone else has learned it as a music
box, the instrument. Or some people just guess from their
experience that it is something like music.
So it really depends on how we teach ourselves this song or this
sound the first time. And in the future, when we hear the same
thing, these are the tags that pop up in our mind, like music
or Sherlock or the instrument.
So today's talk is about making smartphones do exactly the same
thing. Every smartphone comes with a microphone. It listens
to the surroundings. It should be able to do exactly the same
thing, just like our third ear. So it should be able to classify
sounds and generic events.
So in today's talk, I first want to talk about why I want to achieve
that. That is the motivation of my research. Then I'm going to
show exactly how I did it, what my approach was, and describe
that approach. And then I'm going to show some results that I
got. And then there will be one more thing, an interesting
thing. It's a surprise. It's a gift for all of you who
have come here today. I'm going to give you a free mobile phone
app that you can use, share with your friends, and
enjoy. I'm sure you're going to love it, so wait for my last
slide.
Now I'm going to start with the motivation of my work. So there
are several different types of web services available for
smartphones to use. So we have seen search, we have video
sharing, we have email, maps, music related services, translator,
weather, and so on. But if you think of what are the services
that are doing any kind of acoustic processing, sound processing,
you'll primarily see two types.
Number one, speech. Number two, music. Now, inside these two
categories, you have several different types of services.
Speech related services can detect the speaker, so speaker ID;
speech recognition is there; some services can actually detect
your emotion or your stress level. On the other hand, music
related services are like, for example, Shazam, a very nice app
that can identify the music. So music ID is there. There are
some services for sharing and streaming music.
But I'm interested in a wide variety of different types
of sounds. What about those sounds, for example, the sound of
your heartbeat or your cough or your snoring, like physiological
sounds? What about the sound of your bus or car to detect the mode of
transportation, or your pet cat, or some machine? So there should
also be a general purpose acoustic event recognition service that
developers can use.
So that is the main motivation of my work. Now imagine the
possibilities that we could actually turn into realities. By
detecting sounds like snoring or mumbling or talking during
sleep, we could create a nice sleep quality measurement app. Or
we could create a nice baby monitoring app by detecting the baby
crying or whining or baby talk.
Similarly, we could create a nice social interaction monitoring
application by detecting who you were talking to, how much you
were talking, whether you are crying too much these days, and so
on.
So these are different types of possibilities that could become real
apps if we have a service that can detect all these different
types of sound events.
Now, we asked this question: why is this so hard? So we came up
with these challenges. The first one is that applications are diverse.
So creating one single platform is like creating one shoe that
fits all, and this is not going to work. So this app diversity
is the first challenge, because the apps require
different amounts of processing, different types of classifiers,
different types of features. So they are really diverse.
The second challenge is a common challenge for any mobile phone
app, that is, battery life. Creating an application that is
energy efficient is in general a hard problem. Now if we add
acoustic processing on top of it, it becomes harder. So battery
life is our second challenge.
And the third thing is user context. Now, many applications that
work indoors might not work outdoors. I'm talking about sound
recognition applications, because the characteristics of the room
and the environment, the background noise, everything changes
from context to context. So user context is the third thing that
we need to consider.
Now let's see what the state of the art is doing. I picked
three applications from the state of the art, and I see they have
some common structure: they are all doing some preprocessing of
the data from the microphone, then extracting features. Now, the
number of features and the type of features could be different,
but in general, they have some way of extracting features.
And there's a classification stage. Now, again, the classifiers
might be different, but they have this classification mechanism.
Now, these applications differ in implementation too. Some
applications, for example, the first column, implemented everything
inside the phone. So it does not require talking to the cloud;
it does everything inside the phone. I call that the in-phone
implementation style. Whereas the other two, like the middle
column, perform the classification in the cloud, and the
third column does both feature extraction and classification in
the cloud. So they require assistance from the cloud. That's why I
call them cloud assisted or in-cloud implementations.
Now let's see their limitations. First, the common limitation in both
approaches is that people are reinventing the
wheel. Every time a sound classification problem comes, we do
the same thing again and again. We collect data, we extract a
bunch of features, we train classifiers, we try to optimize
that for energy, we try to make it context sensitive. And then,
when the second problem comes, we do the same thing again. So
I say this is a waste of time and therefore we should not be
doing that.
And number two is that developers are not researchers, so if we can
separate these two roles, researchers should be able to do the
research and provide the developers with a very easy-to-use API
so that a lot of problems can be solved. Developers can
imagine new applications, whereas the actual research is already
done by the researchers and they're just using the outcome of
that research.
There are some implementation specific limitations too. For
example, the cloud-assisted ones require network connectivity, which
in turn has an energy cost and it costs money. Whereas the in-phone
classifiers are highly rigid: they're fixed, and they are
trained on different people, so personalization is not there and
changing them does not happen very often. That's why they're
very rigid and they do not generalize well.
Now let's see what my research goal is. My research goal is to
create one single platform that should be able to generate many
applications automatically. Now, here I'm not really talking
about creating one shoe that fits all; I'm talking about
one shoe factory that people can borrow and use to create their
own customized shoes. So the platform that I envision is a
mobile-cloud platform, so it has a mobile component and it has some
cloud component. The input to this system is a bunch of sound
clips with some tags to describe the sound clips. And in
addition to that, you can specify some constraints, like an energy
constraint. So the developer can say, this is a long
running application, it wants to monitor my sleep, so I want this
application to run for eight hours or more. Or the
developer can say, I don't care about energy; the user is
going to use it for five minutes or so. So this is the energy
constraint that the developer can specify.
And the output of this system is energy and context aware
acoustic event classification application. So that is our goal.
Now let's see our approach. So we divide the computation between
the cloud and the phone, and it is like this. The phone contains
the mechanism. That is, it has all the preprocessing you need,
it has all the feature extractors, it has all the classifiers
that may or may not be needed to solve an acoustic recognition
problem. So the phone already contains all the software pieces.
Now, which pieces are required to run to detect a type of
sound is determined by the cloud. So the cloud is like the
brain. It creates the plan. It decides which units to use,
what the parameters of those units are, and how these
units will be connected together. This communication with the
cloud happens during the creation of the application, or the first
run. It's not continuous communication with the cloud;
it happens one time.
And once we have the plan, we start running the application, then
there is no communication to the cloud. Everything runs locally.
I picked an example. So let's say we want to detect Alice's
voice at her office. Now, this is a three step process. Step
one happens inside the phone and is pretty simple.
We record Alice's voice at her office. We tag it with some
words, like: it's voice, it's Alice. Some of these tags are
automatically generated using the sensors inside the phone, like:
it is an indoor environment, and when it was recorded, the
phone was in Alice's hand, things like that.
And then we can upload that sound to the cloud. The more sounds
you upload, the better, and this is a crowd contribution approach,
so people all over the world will be uploading sounds. Alice
will be uploading hers; someone else will be uploading theirs.
Depending on what type of application it is, we will be selecting a
subset of these sounds.
In step two, which happens in the cloud, the phone can request
a plan from the cloud. So the phone can send a request like: I
want to detect Alice's voice, and there are some other sounds, like
phone and printer, that might happen in the environment. And upon
receiving such a request, what the cloud service does is create
a training set by selecting a bunch of sound [indiscernible] that
are relevant to the problem, train the classifier, optimize it,
and what it creates we call a classification plan that can be
downloaded to the phone and executed.
This whole process is automated. And step three is the
simplest step. You just download the classification plan and run
it on the phone. So that's it. These are the three steps to
use our platform.
And now we'll see in a little bit of detail how we implemented
this platform. We call this platform Auditeur. Auditeur is a
French word which means a listener. So in our Auditeur platform,
we have some components running in the phone and some components
running in the cloud.
So I'm going to explain the in-phone processing components first and
then the in-cloud ones. Inside the phone, basically there are
three services running. The first one is the context service. So
this guy is sensing the location context and the phone position
context of the phone. A location context is like indoors or
outdoors. You can also name some contexts.
And the body position or phone position context is the
positioning of the phone with respect to your body. So it could
be in your hand, it could be on a stable position like a table,
or it could be inside your pocket.
The second service is the communication service, so this guy is
responsible for talking to the cloud: uploading sounds,
downloading the classification plan. The third and most important
service is the sound engine service. Inside this sound engine
service, we implemented all the acoustic processing components
that might be needed in solving our problems.
So together, they form a very standard five stage acoustic
processing pipeline. In stage one, we do some preprocessing
of the sensed microphone data. Then we create frames of
32 milliseconds length and extract some frame level acoustic
features. I'm going to explain those features in my next slide.
Then there is a frame admission controller. Most of the time,
our phones are listening to silence or uninteresting things,
so we want to throw those away before classifying. That frame
admission stage does that.
In stage four, we accumulate a bunch of frames, like three to
five seconds worth, and extract some statistics, what we call
window level acoustic features. And finally, that window gets
classified.
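As a rough sketch, the five stages described above could be wired together like this in code; the class and method names below are illustrative placeholders, not the actual Auditeur implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: class and method names are placeholders,
// not the actual Auditeur code.
public class AcousticPipelineSketch {

    static final int SAMPLE_RATE = 16000;                     // assumed sampling rate
    static final int FRAME_SAMPLES = SAMPLE_RATE * 32 / 1000; // 32 ms frames, as in the talk

    public String processWindow(double[] samples) {
        double[] clean = preprocess(samples);                     // stage 1: preprocessing
        List<double[]> admitted = new ArrayList<>();
        for (int s = 0; s + FRAME_SAMPLES <= clean.length; s += FRAME_SAMPLES) {
            double[] frame = Arrays.copyOfRange(clean, s, s + FRAME_SAMPLES);
            double[] frameFeatures = extractFrameFeatures(frame); // stage 2: frame-level features
            if (admitFrame(frameFeatures)) {                      // stage 3: frame admission control
                admitted.add(frameFeatures);                      // silence and uninteresting frames are dropped
            }
        }
        if (admitted.isEmpty()) return "silence";
        double[] windowFeatures = summarize(admitted);            // stage 4: window-level statistics
        return classify(windowFeatures);                          // stage 5: classification
    }

    // Placeholder unit bodies; a real plan would pick concrete implementations.
    double[] preprocess(double[] x) { return x; }
    double[] extractFrameFeatures(double[] frame) { return new double[] { frame[0] }; }
    boolean admitFrame(double[] features) { return true; }
    double[] summarize(List<double[]> frames) { return new double[] { frames.size() }; }
    String classify(double[] windowFeatures) { return "unknown"; }
}
```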
Let's see the current state of the implementation. So far, we
have implemented four preprocessing units. Yes?
>>: All of this is in the phone or in the cloud?
>> Shahriar Nirjon: In the phone. These are in-phone
components. So preprocessing units, there are four. We have
implemented so far 13 high level acoustic features. They include
temporal acoustic features as well as frequency domain acoustic
features. We have implemented 12 different statistics, so for
each feature, we could compute these 12 statistics.
And finally, we have created seven classifiers. Now, this is an
exhaustive list. Not all units will be used in the phone for a
given problem. This is an example of the classification plan
that we download from the cloud. So this is an XML file. It is
basically a description of a directed acyclic graph. So the
nodes are the acoustic processing units and the edges are like
the data flow.
Now, upon receiving such a file, what the phone does first is
parse that file and create an acoustic processing pipeline like
the one shown on the right. So in this particular pipeline
example, a frame is coming in on top. Then we are extracting some
features at the frame level, then we have a decision tree to
throw away some of these uninteresting frames, then
we extract some more statistics and create window level
features, and finally we classify that.
Now, one thing to mention here is that this pipeline will be
different for different problems. It might be different for the
same problem in different context. So this is just an instance
of a pipeline. This is not exactly the pipeline that we always
use.
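To illustrate what consuming such a plan might look like, here is a small sketch that parses a made-up plan XML into units and edges; the element and attribute names are invented, since the talk does not show the actual Auditeur schema.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.*;
import java.io.ByteArrayInputStream;
import java.util.*;

// Hypothetical plan format: "unit" elements are DAG nodes, "edge" elements are data flow.
public class PlanParserSketch {
    public static void main(String[] args) throws Exception {
        String planXml =
            "<plan>" +
            "  <unit id='mfcc'  type='FrameFeature' params='numCoeffs=13'/>" +
            "  <unit id='admit' type='DecisionTree'/>" +
            "  <unit id='stats' type='WindowStats'  params='mean,variance'/>" +
            "  <unit id='clf'   type='NaiveBayes'/>" +
            "  <edge from='mfcc' to='admit'/>" +
            "  <edge from='admit' to='stats'/>" +
            "  <edge from='stats' to='clf'/>" +
            "</plan>";

        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(planXml.getBytes("UTF-8")));

        // Collect the processing units (nodes of the directed acyclic graph).
        Map<String, String> units = new LinkedHashMap<>();
        NodeList unitNodes = doc.getElementsByTagName("unit");
        for (int i = 0; i < unitNodes.getLength(); i++) {
            Element e = (Element) unitNodes.item(i);
            units.put(e.getAttribute("id"), e.getAttribute("type"));
        }

        // Collect the data-flow edges.
        List<String[]> edges = new ArrayList<>();
        NodeList edgeNodes = doc.getElementsByTagName("edge");
        for (int i = 0; i < edgeNodes.getLength(); i++) {
            Element e = (Element) edgeNodes.item(i);
            edges.add(new String[] { e.getAttribute("from"), e.getAttribute("to") });
        }

        System.out.println("units: " + units + ", edges: " + edges.size());
        // A real implementation would now topologically sort the units and
        // instantiate the corresponding in-phone components in that order.
    }
}
```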
So that kind of concludes my in-phone processing, and I'm moving
on to more interesting stuff, which is in-cloud processing:
what's happening in the cloud, how this classification plan is
generated.
So I want to talk about three parts here. One thing is training,
how the classifiers are trained. Then the next is about energy
efficiency, and then about the user context. So let's start with
the training first.
So our design in Auditeur is inspired by the natural way a
developer thinks of an app. Say a person is thinking in his
mind: I want to create an application that detects Alice's
voice at her office, where other sounds could be other
people's voices, printers, and phones. Now, from this description,
three things are basically important. Number one,
Alice's voice. This is what we want to detect. Number two,
office. This is where the sound classification application will
be running. And the third thing is what other sounds might be in
the environment: voice, printer, and phones.
Now, from this description, writing code is very simple. All you
have to do is specify a [indiscernible] of strings. The first
one describes the sound you are looking for, which is Alice's
voice. The second one is the other sounds that may or may not be
present in the environment, like voice, printer, and phone; that
kind of limits the scope. And the third thing is the context,
that is, where exactly the application will be running.
Now, the amount of programming the developer
needs to know is just to make a simple API call, like
soundletManager.getConfigurationFile, and specify these three
sets of strings. That's how much programming you need to know to
use our system.
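A minimal sketch of what that developer-facing call could look like is below; the SoundletManager class and getConfigurationFile signature are my paraphrase of the spoken description, not a verified API, so the platform call is left as a comment.

```java
import java.util.Arrays;
import java.util.List;

public class AliceVoiceAppSketch {
    public static void main(String[] args) {
        List<String> target  = Arrays.asList("alice", "voice");            // what we want to detect
        List<String> within  = Arrays.asList("voice", "printer", "phone"); // other sounds in the scene
        List<String> context = Arrays.asList("office");                    // where the app will run

        // One call ships the three sets of tags to the cloud and gets back a
        // classification plan tailored to this problem and context. The names
        // below are stand-ins paraphrased from the talk:
        // ClassificationPlan plan =
        //         SoundletManager.getConfigurationFile(target, within, context);
        // plan.run(resultTag -> Log.d("Auditeur", "heard: " + resultTag));

        System.out.println("request: detect=" + target + " within=" + within + " at=" + context);
    }
}
```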
So this is kind of sending the request to the cloud, and now let's
see what happens behind the scenes. So the cloud identifies all
the positive and negative training examples from this set of
tags. At first, it takes all the sound clips that are tagged with
the context, Alice's office; that is the shaded blue rectangle.
Within that, sound clips that are tagged with either voice or
printer or phone are the violet circle. So they include both the
positive and the negative sounds.
And inside them, sound clips that are tagged with voice and Alice
are the positive ones. That is what we're looking for. So now we
have the positive examples, and the violet minus the green are the
negative examples. Now we are ready to create a training set.
This is what a training set might look like in our system. It
has over 200 columns. Each of these columns represents one
feature, and Y is the target class. Yes?
>>: How do you select the negative training data? You just
blindly go and get negative training data? For instance if you
want to detect Alice's voice, for instance, right, you compare
that to [indiscernible].
>> Shahriar Nirjon: So the question is: we really care about the
positive training data, because Alice's voice is the one that we
want to detect, so how do we get the negative training data? Now,
the developer specifies that using the second set of strings, that
is, the "within" strings: voice, printer, and phone.
This means that these are the sounds that might be present in
that environment. So all the sounds that are tagged with voice,
we take them all. All the sounds that are tagged with printer, we
take them all. Phone, we take them all. Within the context of the
office, Alice's office. And then we mark those which have the
tags Alice and voice as the positive ones. The rest are the
negative ones.
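The selection rule just described could be sketched roughly as follows; the clip repository, tags, and helper class here are hypothetical and only illustrate the filtering logic.

```java
import java.util.*;

public class TrainingSetSelectionSketch {
    static class Clip {
        final String id;
        final Set<String> tags;
        Clip(String id, String... tags) { this.id = id; this.tags = new HashSet<>(Arrays.asList(tags)); }
    }

    public static void main(String[] args) {
        // Made-up repository of tagged clips.
        List<Clip> repository = Arrays.asList(
                new Clip("c1", "office", "voice", "alice"),
                new Clip("c2", "office", "voice", "bob"),
                new Clip("c3", "office", "printer"),
                new Clip("c4", "outdoor", "voice", "alice"));   // wrong context, ignored

        Set<String> context  = new HashSet<>(Arrays.asList("office"));
        Set<String> within   = new HashSet<>(Arrays.asList("voice", "printer", "phone"));
        Set<String> positive = new HashSet<>(Arrays.asList("alice", "voice"));

        List<Clip> positives = new ArrayList<>(), negatives = new ArrayList<>();
        for (Clip c : repository) {
            if (!c.tags.containsAll(context)) continue;          // outside the requested context
            boolean inScope = false;
            for (String t : within) if (c.tags.contains(t)) { inScope = true; break; }
            if (!inScope) continue;                              // not one of the expected sounds
            if (c.tags.containsAll(positive)) positives.add(c);  // tagged with both alice and voice
            else negatives.add(c);
        }
        System.out.println("positives=" + positives.size() + " negatives=" + negatives.size());
    }
}
```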
>>: More precisely, who generates the tags? Who assigns the
tags?
>> Shahriar Nirjon: That's a very good question. We have two
models of usage. Here, the user, the concept of user is often
confusing in our system. I get this question again and again.
Who is the user of this system? There are two models. Number
one, where a developer is in the picture. So we are talking
about a scenario where there are three entities. The user, the
end user, the developer, and our service. So here, the developer
enables collection of data from the end user and uploads that to
the cloud. That might happen like during the first installation
of the application, during the first usage of the application.
So this is one way of doing that.
The second category is taking the developer out of the picture
completely. We, meaning the platform provider, the Auditeur
platform provider, have a generic application with which the end
user can actually record some sounds, add tags, and use that to
record nice sounds. So there the developer is not in the picture.
So let's get back to the training set again. We are creating
binary classifiers, so the class label Y is only plus one or
minus one. And the number of rows depends on the number of sound
clips that we have in our system.
Now, one thing to mention again here is that we will not be
using all 220 features, because most of the features will not be
relevant to our problem.
So how do we know which are relevant? That brings me to the
next few slides. Here we are talking about energy efficient
classifiers. And we asked ourselves this question: how do we
obtain a classifier which is accurate? Accuracy is our number one
priority. But it has to run within an energy budget,
because the developer also specified some energy constraint.
So we did some measurements. We implemented our units and
measured the energy consumption of each of these units, and we
found that the feature extractors are the ones responsible
for consuming over 75 percent, and up to 90 percent, of the
energy. If we can optimize that part, we can save a lot of
energy, and that's the part where we are flexible.
Now I formulate that as an optimization problem, and it reads
like this. Given a set of features X1, X2 up to XN, their
relative energy costs E1, E2 up to EN, and a total energy budget
B, how do we select a subset of features S so that the sum of the
energy costs of the subset is within the energy
bound, yet S contains enough information to classify the sound
with high accuracy?
So our approach is going towards an information [indiscernible]
approach.
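Written out, the problem statement above amounts to something like the following (my notation, assuming I(S; Y) denotes the mutual information between the selected feature subset and the class label):

```latex
\begin{aligned}
\max_{S \subseteq \{X_1,\dots,X_N\}} \quad & I(S;\, Y) \\
\text{subject to} \quad & \sum_{X_i \in S} E_i \le B
\end{aligned}
```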
>>: So by energy, you mean the energy actually runs on the
phone.
>> Shahriar Nirjon:
Yes.
>>: And how do you monitor that? How do you know from different
kind of phones how much you're going to spend on these?
>> Shahriar Nirjon: It is going to be different, but we have a
range. You cannot specify, I want my application to run within
100 joules of energy, but you can say, I want my application to
run for a long time, like four to eight hours. So we translate
that into an energy number. That's how we do it. And for
different phones, the exact energy [indiscernible] can be mapped
to different numbers within the four to eight hours. So it's like
a soft guarantee, not exactly a hard guarantee.
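As a back-of-the-envelope sketch of that translation, one could map a requested runtime to an average power budget; the battery capacity and the share of the battery the app is allowed to spend are assumptions for illustration only.

```java
public class EnergyBudgetSketch {
    public static void main(String[] args) {
        double batteryWh = 5.0;        // assumed battery capacity in watt-hours
        double appShare = 0.25;        // assume the app may spend 25% of the battery
        double targetHours = 8.0;      // developer asks for eight hours of monitoring

        double budgetJoules = batteryWh * 3600.0 * appShare;           // Wh to joules
        double powerBudgetWatts = budgetJoules / (targetHours * 3600.0);
        System.out.printf("total budget: %.0f J, average power budget: %.3f W%n",
                budgetJoules, powerBudgetWatts);
    }
}
```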
>>: For speech recognition, the one thing that doesn't take any
time is actually generating the features, so I'm a little bit
surprised that 98 percent of the time is going into the features
here. What kind of features are these? Isn't it all just
basically doing an FFT?
>> Shahriar Nirjon: It is because of the phone, actually. I
mean, inside the phone, if we are extracting, like, 200 features in
real time, it's consuming a lot of energy. It's because of
the computation.
>>: Aren't they all just sort of based on doing an FFT of the
signal, though, and doing a little bit of this and a little bit
of that?
>> Shahriar Nirjon: Yes, it is. Now, out of the total energy,
75 percent is used by these feature extractors. Compared to the
overall energy consumption of the phone, the total could be less.
>>: Can you give me an example of how much time it takes for a
phone to extract 220 features?
>> Shahriar Nirjon: It won't be realtime. Say for 18 features,
it is realtime, like one second of data would take less than one
second. But if you go beyond 18, it becomes non realtime. I
really don't have the number off the top of my head. It could be
minutes.
Okay. So let's move on to the solution approach. Our solution
approach is constrained minimum redundancy, maximum relevance
selection. Minimum redundancy maximum relevance is a pretty
common approach for selecting features, but we do it in a
constrained setup. That's kind of the novel thing here.
So I'm going to explain what this minimum redundancy, maximum
relevance means, because not everyone is familiar with it.
These are two properties that are very desirable. The first one
is minimum redundancy, which says: the selected acoustic features
should have less correlation, or less mutual information, among
themselves.
Now, we're not going to take two acoustic features that are very
similar. We're going to use only one of them because we want to
save some energy. And maximum relevancy says selected features
should have high correlation; that is, more mutual information
with the target class.
This shows that the features that we select are actually
correlated to the target class. That means they're relevant.
We're not going to take an irrelevant feature that is not related
to our target class and select it.
Now, let's try an analogy to understand this metric even better.
Say H is a function that measures the uncertainty of something,
and inside the bracket, that's my talk. So to those people who
came to this talk today without any prior knowledge about who I
am and what I'm going to say, the uncertainty is like a hundred.
Uncertainty is measured in bits, so let's say a hundred units.
A second guy who came to this room, he knew my name, he Googled
me, Binged me and got my web page. And he knows that I work in
sensing, mobile computing, acoustics and things like that. So to
him, the uncertainty of my talk with that kind of knowledge is a
little less, like 80.
And the third person who came to this room today actually
read my paper on Auditeur and knows exactly how my algorithm
works and everything. So to him, the uncertainty of my talk will
be almost zero, like 15.
Now, what we do is take the difference between these
uncertainties. So by taking the difference of the first and second
rows, we get 20, which is the mutual information between my talk
and my web page. So this difference of uncertainties is a nice
measure of correlation too.
And the third row shows that the mutual information between my
talk and my paper is very high, like 85, which is true. So now we
want to replace these with acoustic features and variables. Say
Xn is the nth acoustic feature, S is the set of already selected
acoustic features, and Y is the target class. Now everything will
start making sense.
So we want the second row to be minimized, because Xn, the nth
acoustic feature that we are thinking about selecting, should have
less mutual information with the already selected ones in order to
satisfy our minimum redundancy criterion.
And we also want to maximize the third row, that is, to maximize
the mutual information between the nth acoustic feature and the
target class, because that tends to maximize our relevance.
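In symbols, and using the notation just introduced (Xn the candidate feature, S the already selected features, Y the target class, I denoting mutual information), the two criteria read roughly as:

```latex
\begin{aligned}
\text{minimum redundancy:}\quad & \min_{X_n} \; I(X_n;\, S)\\
\text{maximum relevance:}\quad  & \max_{X_n} \; I(X_n;\, Y)
\end{aligned}
```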
Now that we are clear about the definitions, for the first time
we are plugging in the energy bound. So again, I'm going to
show you a simulation of this algorithm with an example,
because examples are easier to understand.
So let's say we have N minus 1 acoustic features, and again our
energy bound is B, and we magically know the best
solution for this scenario.
For example, if N equals 5, then for every bound from one up to B,
we magically know the best solution using the first four acoustic
features; N minus 1 is four. So these are the solutions we somehow
got.
Now, the question that we ask is: what happens when the next
feature, X5, comes? More precisely, what happens to the set for
bound B when X5 comes? If we can answer that question, we do the
same thing for X6, X7 up to XN, and that's how our iterative
algorithm advances.
So what we are really talking about here is the decision on X5.
Now, the decision on X5 could be either we take it or we do not
take it. If we take X5, X5 has its own energy cost; let's say it's
E5. Now, without crossing the energy bound B, we can insert X5
into one of the sets up to bound B minus E5. These are the
candidate sets where we can add X5 without crossing the energy
bound B.
Now, within those top B minus E5 sets, we do a linear search to
find the set with which X5 has the minimum mutual information.
That's one candidate solution. But if we think about not taking
X5, if somehow we decide that we do not want X5, then by the
definition of magic, we already know the solution from our
previous step.
So now, between these two candidate solutions, we pick the one that
has maximum mutual information with the target class. This
algorithm is iterative, and this algorithm is kind of greedy. It
does not necessarily give you the optimal solution, but it is
way better than a regular [indiscernible] algorithm where
you would probably rank those features and take the top ones.
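A compact sketch of that iteration is given below. The mutual information estimates are left as placeholders (in practice they would be computed from the training set), and the structure, one best subset per budget level, follows the walk-through above rather than any published pseudocode.

```java
import java.util.*;

public class ConstrainedMrmrSketch {

    // best.get(b) holds the currently best feature subset whose total energy cost <= b.
    public static List<Set<Integer>> selectFeatures(int numFeatures, int[] cost, int budgetB) {
        List<Set<Integer>> best = new ArrayList<>();
        for (int b = 0; b <= budgetB; b++) best.add(new HashSet<>());

        for (int n = 0; n < numFeatures; n++) {
            List<Set<Integer>> next = new ArrayList<>();
            for (int b = 0; b <= budgetB; b++) {
                Set<Integer> skip = best.get(b);                 // candidate 1: do not take feature n
                Set<Integer> take = null;
                if (cost[n] <= b) {
                    // Candidate 2: linear search over the previous sets that still leave
                    // room for feature n; pick the one it is least redundant with.
                    double bestRedundancy = Double.MAX_VALUE;
                    for (int c = 0; c <= b - cost[n]; c++) {
                        double r = redundancy(n, best.get(c));
                        if (r < bestRedundancy) {
                            bestRedundancy = r;
                            take = new HashSet<>(best.get(c));
                        }
                    }
                    if (take != null) take.add(n);
                }
                // Keep whichever candidate carries more information about the class label.
                if (take != null && relevance(take) > relevance(skip)) next.add(take);
                else next.add(new HashSet<>(skip));
            }
            best = next;
        }
        return best;   // best.get(budgetB) is the selected subset for the full budget
    }

    // Placeholder: estimated mutual information I(X_n ; S) from the training data.
    static double redundancy(int feature, Set<Integer> selected) { return selected.size(); }

    // Placeholder: estimated mutual information I(S ; Y) from the training data.
    static double relevance(Set<Integer> selected) { return selected.size(); }

    public static void main(String[] args) {
        List<Set<Integer>> best = selectFeatures(5, new int[] { 2, 1, 3, 2, 1 }, 4);
        System.out.println("selected for budget 4: " + best.get(4));
    }
}
```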
This brings me to the third part of my talk, which is achieving
context sensitive classification. At this moment, we consider
two types of context. One is the location context; the other is
the body position context. Some examples of the location
context are on the left. We have this... yes?
>>: I'm sorry. Can you compare that previous one with methods
like training a max-ent classifier with regularization and
throwing away the features that have low weight?
>> Shahriar Nirjon: The one that you particularly mentioned, we
did not do that. You are probably talking about an embedded way
of creating the classifier while doing feature selection. We did
not do that.
>>: Exactly. Or, in other words, looking directly at the
classifier performance in the decision
>> Shahriar Nirjon:
Yeah, we did not do that.
>>: Select the features. The big classifier with all the
features and throw away the ones that have low weight.
>> Shahriar Nirjon: Yeah.
>>: Build it up. There's a lot of variations on that.
>> Shahriar Nirjon: Exactly. So we did not compare with that
one. We compared with an easier one, I have to admit, which
is the ranking method. A filtering method, actually, which filters
out some of the features after ranking them.
>>: Based on...
>> Shahriar Nirjon: Based on, say...
>>: Performance?
>> Shahriar Nirjon: If we use classifier performance, then it
becomes the [indiscernible] method, which is different, because we
do not actually do the actual classification first. There are so
many different possibilities, we cannot do that, because of the
exponential number of possible states. Instead, we use this
information theoretic approach just to estimate how good the
features could be.
Ideally, I totally agree with you that we should be doing that:
[indiscernible] all possible good sets, doing the actual
classification, and taking the one that actually performs the
best. But we did not go through that process, because it takes
long.
Okay. Let's get back to the location context. So we are
considering several different types of location context, like
indoors versus outdoors versus traveling. You can also name some
of these based on GPS coordinates.
On the other hand, we have fixed three body position contexts.
Number one is whether you are holding the phone in your hand,
that is, interacting with the phone while recording or using
the application. Number two is whether the phone is inside your
pocket. And number three is whether the phone is on a table.
Now, the way we infer these contexts is using on-board sensors.
WiFi and GPS can be used for the first one; light sensors and
accelerometers for the second; and these days Android phones
actually come with an API to detect all three of those things.
So there is no novelty in terms of research here; this is a
solved problem for this particular set of contexts, so we just
used that. But the interesting thing here is the way we use it in
our system: for each location context and for each body position
context, the cloud creates a different plan.
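As one concrete example of how cheaply such context can be sensed, here is a toy accelerometer-based phone-position guesser; the thresholds are made up, and a real system would rely on trained detectors or the Android activity APIs mentioned above.

```java
import android.hardware.Sensor;
import android.hardware.SensorEvent;
import android.hardware.SensorEventListener;

public class PhonePositionSketch implements SensorEventListener {

    @Override
    public void onSensorChanged(SensorEvent event) {
        if (event.sensor.getType() != Sensor.TYPE_ACCELEROMETER) return;
        float x = event.values[0], y = event.values[1], z = event.values[2];
        double magnitude = Math.sqrt(x * x + y * y + z * z);   // ~9.8 m/s^2 when the phone is still
        double motion = Math.abs(magnitude - 9.8);

        String position;
        if (motion < 0.3 && Math.abs(z) > 9.0) {
            position = "on a table";     // flat and nearly motionless (assumed rule)
        } else if (motion < 1.5) {
            position = "in a pocket";    // little motion, arbitrary orientation (assumed rule)
        } else {
            position = "in hand";        // noticeable movement (assumed rule)
        }
        // In a system like Auditeur, this label would select which plan to run.
        System.out.println("phone position guess: " + position);
    }

    @Override
    public void onAccuracyChanged(Sensor sensor, int accuracy) { }
}
```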
Now, having said that, not all plans will necessarily be used
or be relevant to a particular problem. Say you're trying to
monitor your sleep quality. You're not going to sleep
outside, and you're not going to sleep while driving, hopefully.
So only the first row is valid.
Again, within that, you are not going to sleep while using the
phone, and the phone is not likely to be inside your pocket, so
probably only the third plan is relevant to this problem. So the
[indiscernible] can specify which plans he wants instead of getting
all of them computed.
So now let's see some results. We have implemented several
applications. Seven of them were deployed and tested really
well. These are the ones. The first two are related to speech,
to voice. It's not about the content of the speech; it is more
about speaker identification, male or female, or whether it is not
speech.
The next two relate to music. And the last three relate to the
sound of vehicles, the sound of your kitchen, and sounds that
happen during your sleep.
Let's see some results. I picked only a couple of results; more
are in my paper and in my backup slides if you are interested.
So the top result is about power consumption, which is the main
concern. For each of these applications, we have implemented
three versions of it. One is our automated system, which is
Auditeur, the first bar.
And the other two are in-phone and in-cloud implementations. Let
me explain what those are. Out of these seven, four are actually
published papers, so there we picked the features that they used
and the classifier that they used and their setup. And then we
evaluated all three implementations together.
The difference between in-phone and in-cloud is basically the
communication, and where the classification happens: whether
inside the phone, or whether we are sending the data and
classifying in the cloud.
So based on that, we see that on average, our automatically
generated application classifiers are actually 11 percent up to
441 percent less power hungry than the baselines.
Accuracy is also important. An app can be energy efficient, but
it should not be poor in accuracy. So we brought this graph too.
So here, once again, I'm comparing the accuracy of all three
implementations. We are pretty much the winner most of the time.
Yes.
>>: My question is how did you measure power? Did you actually
implement this on a phone and measure the overall power of the
phone?
>> Shahriar Nirjon: No. So this is more like the number of
hours that they run, and Android has an API [indiscernible] that
we can get. So for the long running application, this is not
exactly like the previous power measurements during the
construction of the model, where we really used a power meter to
measure.
So yeah, these two numbers can be different.
>>: So just to clarify that, that means you have an energy
estimate for every feature, and then you create a model of
combining those features, there will be that much energy, and
then you just run it through a simulator type of thing to give you
the number of days that you would have your...
>> Shahriar Nirjon:
Yes, yes.
>>: And if you go to the next slide, for sound, you seem to be
close to 100 percent while everybody else seems to be close to
85, 90 percent, right? While you're actually reducing the number
of [indiscernible] you're using to save energy, right?
>> Shahriar Nirjon:
Yes.
>>: Isn't this a bit counterintuitive? I mean, like why is
this happening?
>> Shahriar Nirjon: One main reason here is that the context is
not considered in most of these applications. So they're not
switching context. So for example, SoundSense and SpeakerSense,
here we are dealing with voice. And we are creating
several different classifiers for different scenarios. Like my
office room in my university has its own acoustic signature and I
pick data from there and I train the classifier from there.
And then, when we come back home, I collect data from my drawing
room and train another classifier. And my system knows this
difference. But the others do not switch classifiers. When I train
the other ones, I kind of use all the data and create one classifier.
So this is the difference between a specialization and
generalization. Yes.
>>: So do you take into account the cost of detecting these
different contexts when you compute the power consumption?
>> Shahriar Nirjon:
No.
>>: Because now you [indiscernible] light sensor, all these are
running on the phone.
>> Shahriar Nirjon: Yes, we did not consider that. We did not
consider that; I'm being transparent here. But in most cases, it
is about WiFi to infer indoors, and I think WiFi will be always
on; I think people have it turned on. The accelerometer is very
cheap for detecting the body position context. The light sensor is
also cheap. So unless we use GPS, I think it's not going to be a
big deal.
>>: I think I'm saying the same thing as the previous two
people. But if you do the same operations on the phone, you'll
get the same answer. If you do the same operations in the cloud,
you'll get the same answer. So the differences in accuracy here
aren't about in phone or in the cloud
>> Shahriar Nirjon: It is more about communications.
>>: What [indiscernible] was used or what features were used.
>> Shahriar Nirjon: Exactly. So let's see, for example, the
vehicle application, the fifth one. Here the in-cloud performance
is really, really poor. Why? Because we were losing connectivity
to the cloud, so many of our frames and windows were dropped.
Most of the time we're switching between outdoors and indoors,
this context switching was happening, and the in-cloud version was
performing badly.
So this is one important lesson that we learned during our study,
which was intuitive. Yes?
>>: So you are dealing with realtime. You want the answer right
now.
>> Shahriar Nirjon: Absolutely.
>>: Because [indiscernible] you'll come back in a half a second,
you'll say, I don't know.
>> Shahriar Nirjon: No. It's realtime classification, yeah.
>>: Okay. How important is that?
>> Shahriar Nirjon: For some applications, people want realtime
answers. For example, if we're trying to detect the vehicle mode
of transportation and based on that you want to present some
advertisement, say, it may not be realtime, but it is not
offline. We distinguish between offline versus immediate results,
let's say.
Realtime is a different issue, where a deadline is there. But
let's say offline versus online. Some of these applications might
require online results, say mode of transportation detection, or
we might want to detect the genre of music and play the song, so
we don't want to wait that long. So some applications might be
online.
So anyway, the red circle shows the cases where we did not
perform as well as the baselines. This happens in some cases
where the application uses a very sophisticated, highly designed,
specialized algorithm. For example, this particular application,
Musical Heart, is an application that inserts microphones
into people's ears to detect heart rate, and they have a very
sophisticated algorithm to count heartbeats. We could not beat
that, but we are pretty close.
So our solution is not necessarily always the best. We also
did some interesting user studies. We gave our API to a bunch
of people at the University of Virginia, and we also got some
Microsoft Research interns who participated in this study. We
gave them our API and told them to implement whatever app they
could envision, and then we asked them questions, like how long
it took to learn and how hard the coding was.
So this is what we got. On average, it actually takes less
than 30 minutes for a person to learn the API and actually code
the core logic. So this is how easy the system is to use.
These are some interesting app ideas that I thought I should
bring here that came out of our study. So five of them actually
are more interesting than others. So these are the five ones.
The first one is detecting the snoring sound and vibrating the
phone until the person stops snoring; this is one I thought was
interesting. The Honk app detects the honking sound of a car and
alerts an inattentive user when he's listening to music or
something.
Then there is asthma sound detection, an asthma monitoring
application where the goal is to detect crackling and wheezing
sounds automatically with the phone and keep records. We became
so interested in that, we are now actively pursuing this idea in a
much broader and more realistic way, with more sensor devices and
a holistic approach.
The fourth application is subways, like the Honk application, but
it detects subways. And the fifth one is a dog app that detects
the dog barking sound and alerts the user when he is outside home.
The idea is that probably some intruder came into the room and is
trying to steal something.
So in summary, I presented a platform, Auditeur, which is
basically a developer platform providing an acoustic event
classification service. It's very easy to use, it provides
context sensitive and energy efficient classification of acoustic
events, and its potential is enormous. I would like to either
make it publicly available so people can use it, or Microsoft
might be interested in making it part of their [indiscernible]
platform so that Windows Phone developers can build interesting
applications with it.
So here's a list of other work I've done beyond acoustic sensing
and processing; I do a lot of other stuff too. The first set of
work that I have done during my Ph.D. is smartphone acoustic
sensing. It includes the platform, it includes some interesting
applications, and it also includes some computational efficiency
stuff. I also have some work in realtime systems; it also
involves mobile computing, but it overlaps a little bit with
networking and energy research. That one got a best paper award
at RTAS 2012.
Other than these two, I have also done some work that is
completely different, using Kinect sensors, but it is still
sensing. These two works, Kintense and Kinsight, got a lot of
attention in electronic and print media, and they are really
excited about Kinect and how we can use it in building nice
[indiscernible] computing systems.
And the fourth and most recent one is the work I did here last
year with Jie as an intern. We built the first indoor GPS that
works. People think that GPS doesn't work indoors, and we kind of
broke that belief and built the first GPS that works indoors.
That got published at MobiSys 2014. So this is remarkable,
groundbreaking work that I did with Jie.
So let me talk a little bit about my future interests, what I'm
going to work on in the future. I'm a mobile computing and sensing
person, and in the future, I want to keep continuing my research
in different but related areas. The first one is more aligned
with my current work, which is about smart devices and
wearable computing. Wearable computing has a huge market; it is
estimated to be 30 billion dollars by the year 2018.
So there the problems could be like signal processing problems,
building the sensor itself, integrating, networking, location
sensing. These are the technical challenges there.
There are some interesting applications too that I am interested
in. Some examples could be health and wellness monitoring, and
I'm also interested in connected vehicles, or monitoring people,
the drivers, while commuting.
Another interest of mine is a more recent trend in computer
science, which is that we're collecting a lot of data in the cloud.
We're creating enormous databases, big data, so I want to explore
the idea of creating scalable back-end services that
are specific to big sensor data. There the problems could be
detecting patterns and detecting anomalies.
The fourth thing, which in my understanding is the next big thing
after big data, is [indiscernible] computing, where people, humans
and machines, will be making decisions together, so the mobile
phone may play a very vital role there. So this is another
interesting area that I'd like to pursue.
So finally, I promised one thing, and this is the time. This is
an application that I built during my Ph.D.; it got published
in the top conference of our field. It is a hardware-software
system. It reads like: a biofeedback based, context aware,
automatic music recommendation system.
The way it works, you have to wear special [indiscernible]
wearable ear plugs just like this; this is the prototype
that we built. This work was again done in collaboration with
Microsoft Research Asia.
So inside these ear buds, we have added a lot of sensors: IR
sensors, microphones, accelerometers. And when you wear
them, the sensor board, running the TinyOS operating system, reads
the sensor data and sends it to the phone, so we can sense your
heart rate. We can sense your activity level, like what you are
doing and at what activity level you are, and based on that, we
play the right music for the right moment.
And not only that, the music actually acts as an intervention. So
you can set your target heart rate, and then the music is played
based on your physiological parameters, like heart rate and
activity, and your heart rate is controlled by the music.
These are the sensors; I already explained what they are: IMU,
microphone, IR thermometer. Communication happens in two ways,
either through the audio jack, using HiJack technology, or over
Bluetooth. And it has a battery for power. But the problem with
this version is that only three copies of this hardware exist, and
unless we mass produce it, this application is not going to be used
by people. So that's why we thought of creating a better, newer
version, Musical Heart version 2, that actually works with any
off-the-shelf heart rate monitoring device. Like currently, I am
wearing one of those devices. This is a smart watch. It detects
my heart rate and sends the data over Bluetooth 4. This
application also works with chest straps and things like that.
So I'm going to share this application with you, first of all.
Go to my website/musicalheart.html. You'll get this application
and you can use it. If you are lazy like me, you can take a
picture, and that's the download link in QR code format.
Now I'm going to show a live demo of Musical Heart, and then I'll
be taking questions. So I need to switch from the PC to this
camera. I've already turned my wrist watch on. I'm going to turn
on the Musical Heart application. So this is what the Musical
Heart application user interface looks like. At first, I'm going
to scan, and it should get my device. Smile global is my device.
I connect to it. And in a moment, it should be showing my heart
rate. There it is, 115. I'm just talking here; I mean, I
should not be that high. But it happens. It always happens.
It's always over 100.
So that graph is showing my heart rate; it is getting the data
from my wrist watch. It's also showing my intensity level, what
activity zone I'm in, et cetera. And the bottom is actually
showing my activity level. Previously, we used data from the
earphones; now basically we're using the in-phone accelerometers.
So if I shake it like this, the activity will go high and then
low again.
So these are the two sensing modes that we have, heart rate and
activity level. Based on that, the music player on the next tab
is actually going to play the music. So this is the tab. On the
top panel, you set how long you want to run this application. Say
you want to do a 20 minute cardio exercise; we set the time, you
set it to cardio, and then we know what the target heart rate
should be. Or you can actually specify your target heart rate.
Let's say we set my target heart rate to 62. And the list on the
bottom is showing the top recommended songs. So if I hit the
refresh button, based on my current heart rate, it's calculating
what songs I should be listening to now in order to make my heart
rate go down.
So I can play one for you like, say, Africa. [music]. This is a
very soft, melodious song. That's the idea. I can do the
opposite too. Let's try to make my heart rate 160 or something.
If I hit refresh again, it's going to suggest to me a different
set of songs. So, Down For Whatever. [music]. This is a very
aggressive, high-beat, high-tempo song that is supposed to
energize me while I should be working out hard.
The third tab is my play list, the song library. The way you
should be using it: just copy English MP3 files onto the SD card
of this phone, and we'll be extracting all these features using
our website automatically.
For example, let's pick any one, say Roar. What the phone is
doing is talking to a remote server and sending a piece of the
song to the server, and these are the acoustic features that it is
computing in the cloud and sending back. So we have tempo,
energy, danceability, liveness, [indiscernible], and based on
all of these acoustic features and my previous history of songs
and heart rate regulation, we are playing the right music for the
right moment.
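The recommendation idea can be illustrated with a toy scorer like the one below; the feature values and the scoring rule are my own invention and stand in for the actual Musical Heart model, which also uses listening history.

```java
import java.util.Arrays;
import java.util.List;

public class SongPickerSketch {
    static class Song {
        final String title; final double tempo; final double energy;
        Song(String title, double tempo, double energy) {
            this.title = title; this.tempo = tempo; this.energy = energy;
        }
    }

    public static void main(String[] args) {
        // Made-up tempo (BPM) and energy values for illustration only.
        List<Song> library = Arrays.asList(
                new Song("Africa", 93, 0.35),
                new Song("Down For Whatever", 128, 0.90),
                new Song("Roar", 90, 0.77));

        double currentHr = 115, targetHr = 62;
        // Positive when we need to calm down, negative when we need to push harder.
        double calmNeeded = (currentHr - targetHr) / 100.0;

        Song best = null; double bestScore = Double.NEGATIVE_INFINITY;
        for (Song s : library) {
            // Low tempo/energy scores high when calming is needed, and vice versa.
            double intensity = (s.tempo / 180.0 + s.energy) / 2.0;
            double score = -calmNeeded * intensity;
            if (score > bestScore) { bestScore = score; best = s; }
        }
        System.out.println("suggest: " + best.title);
    }
}
```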
So I'm going to keep this running and now I'll be taking
questions. Ask me questions and see how my heart rate changes on
that or if I am telling the truth.
>> Jie Liu:
Thank you.
Questions?
>>: Question about how [indiscernible] user is responsible for
this input [indiscernible].
>> Shahriar Nirjon: Yes, a very good question. I mean, how do
we get the samples? Who tags them and when are they tagged? Now,
end users are like our grandmothers. They do not have a good
sense of what they should be doing, and we should treat them as
our grandmothers, even if they are younger than us. It is the
responsibility of the developers to utilize the end user and make
them collect that sound and upload it.
In the settings of an application, we envision it this way. We
have a demo application that is actually given to the developers
to see how it should be programmed. So there, the idea is that in
the settings, during the first usage of the application, the user
will be prompted to record this and that kind of sound. And we
provide the API to the developers to be able to do it the way they
want their application to collect data.
So the thing is, the developer enables it; the end user does it.
Another approach could be, for some applications, we do not
necessarily need to engage the end user. Say we are trying to
detect snoring. It's not possible to [indiscernible] the user,
because he is sleeping. He cannot record his snoring sound
unless there is a recorder running overnight.
So there you can actually use data collected from other sources,
and the developer will be the one who is uploading it, and then
the end user is completely out of the picture. That may also
happen. So there are a lot of flexible ways of using this system.
>>: So will your optimization framework work if you want to do
multiple things from the same audio clip? For example, you
want to detect whether it's Alice, you want to detect whether
she's alone or talking to somebody. If you want to do multiple
things at the same time with the same clip, will your
optimization framework still work?
>> Shahriar Nirjon: No. The way it is now, for Alice, you have
to create one classifier, one pipeline. If you have Bob, you
have another pipeline. And if you run both of them...
>>:
Multiple pipelines in parallel?
>> Shahriar Nirjon: Yes, it will run both pipelines in parallel.
And as a [indiscernible] Alice plus Bob. Because when creating
Bob's pipeline, we did not know Alice was there. So the
general answer is no. But if the developer is clever, he can
make that happen. During the creation of the second
application, estimating the first one, he can constrain energy.
He can say, like, now the energy constraint is even tighter.
Now, another way it can happen, the third and most appropriate
use case for creating this Alice and Bob detector, would be to
have both Alice and Bob in your dataset as your positive examples
and all others as negative ones. But the problem here is that
there's one classification result, Alice or Bob. It's not Alice,
Bob, or something else; it will be Alice or Bob. So that has that
problem. But there are multiple ways of dealing with that.
The fourth and better approach could be to [indiscernible] the
pipelines. Pipelines for similar types of data, like Alice's
voice versus Bob's voice, could be similar; they could have the
same set of acoustic features. We do not do that, but it is in
our future plan that we can combine multiple pipelines and use
them together. That's another possibility.
So there are several possibilities. When we created this, we
were really trying to prove the hypothesis that it works, that
it's a useful platform, that we can use it. And now that it is
working, we'll be looking to optimize it, of course, in different
ways.
>>: So in the Alice example, how many samples did you need to
train the classifier?
>> Shahriar Nirjon: About five minutes of sound is just fine.
If you are talking about...
>>: You need one context or multiple samples from each context?
>> Shahriar Nirjon: For each context.
>>: So at the beginning of your talk, you showed a recording
device but said the piano music was recorded that way.
>> Shahriar Nirjon: No, actually, I was saying it was not a piano
that was used to create this music. It was the music box that
was used to create that music.
>>:
[indiscernible].
>> Shahriar Nirjon:
Oh, no, that's from YouTube.
>>: So your system heavily relies on sort of contributed data,
right? Most developers would intentionally know what to do and
others, right? [indiscernible] examples. How do you think about
collecting them, or what's the incentive for people to contribute
other signals to your system to bootstrap it?
>> Shahriar Nirjon: We really did not think about that at the
time, because we had 35 participants who were trusted
participants, and they were told to do that, so they did it. But
in a real system, we could, first of all, create a crowd
contributed system by providing some incentives like, I don't
know, maybe money or maybe some service. We can make people do
that. This could be one approach.
Developers can collect some data with people's permission, without
the people actively managing it, like whether they are conversing
over the phone. They can turn this feature on, like: you can
collect my data while I'm talking on the phone. Not private data,
maybe some acoustic features, and upload it like that. So you can
opt in for that, and the offer could be that using that will
update your classifier, and in the future you will get better
classification. That could be one incentive.
So say you are using an application and you are prompting the
user: if you are not satisfied or if you want to improve the
classification, you can go and collect these four more data sound
clips, a few minutes long.
>>:
You would also use [indiscernible], right?
>> Shahriar Nirjon: Yes, of course.
>>: So that's how you can [indiscernible].
>> Shahriar Nirjon: Yeah. For example, when you are
[indiscernible] from one context to another context, we could play
the same game in the cloud. Like, the user can provide some tags,
and from that tag we can understand that it belongs to some other
problem, yeah. That is a very good [indiscernible], yeah.
>>: [indiscernible] because I think that user would be more
interested in
they don't really care about the context, right,
when they [indiscernible] recognizing [indiscernible]. So have
you been through this? Like [indiscernible].
>> Shahriar Nirjon: Let me try to rephrase what you are trying
to ask. It is like, what happens if you do not use context at
all? Is that what you are asking? Or...
>>:
Yeah, context independent.
>> Shahriar Nirjon: Well, one thing is that sound sensing is
context dependent, fundamentally, because the environment has a
lot of effect. Even the room size, even if it is indoors, has
some effect; the reverberation model differs from big rooms to
small rooms. So it is a context sensitive problem. So I would
vote for context sensitivity rather than trying to fix the
problem without considering context.
>>: Because the user [indiscernible] recognizing different
contexts causes him to record data for all these different
contexts and this is pretty, you know...
>> Shahriar Nirjon: Yes.
>>: [indiscernible] for the user.
>> Shahriar Nirjon: Yes. I totally agree with that. I mean,
this is a trade off. So if your user wants better performance,
which can [indiscernible] them, you should do this. If he does
that, we are giving him [indiscernible] so he has some
incentives. But it is
I agree that it is [indiscernible].
>>: [indiscernible] tries to mitigate that context problem
[indiscernible].
>> Shahriar Nirjon: So I would comment on that. First of all,
mobile phones come with all the sensors; we can now easily and
cheaply infer context. So maybe the reason why the speech
recognition community did not do that is the absence of all the
sensors with which they could detect the context and detect speech
specific to a context, so they attacked the problem in a different
way, assuming that context is not available. So we
[indiscernible]. But in our case, in mobile computing, we are
constantly thinking about that; we are constantly working on
detecting context. And context sensing algorithms are becoming
better and cheaper.
So how can we leverage that is a nice trend in many mobile
sensing problems. So my solution is built on top of that.
>> Jie Liu:
Let's thank Nirjon again.