>> Jie Liu: All right. Let's get started. It's a great pleasure to have Shahriar Nirjon give a talk today. Nirjon, as he goes by, is graduating from the University of Virginia, from [indiscernible] group, and his Ph.D. research has been primarily on mobile sensing, using mobile signals to derive interesting contextual information. Last summer he also did an internship with us at Microsoft Research; he did some indoor GPS work with me, also very exciting. So today I think he's primarily going to talk about the audio piece of his work, and he will have time to meet with people for the rest of today and tomorrow. All right.

>> Shahriar Nirjon: Hello, everyone. My name is Shahriar Nirjon. I'm from the University of Virginia. Thanks, Jie, for introducing me. It's my great pleasure to be here once again at Microsoft Research. The title I picked for my talk today is "A Smartphone as Your Third Ear." Technically, it is about general-purpose acoustic event detection on a mobile device. So let's get started.

Every day, we hear hundreds of thousands of different types of acoustic events around us, even short, transient sounds, and we human beings are capable of remembering them, recognizing them, and acting upon them. Having said that, I would like to do a small experiment with you. I'm going to play a 30-second sound clip, and at the end I will ask for a volunteer to describe the sound. So let's wait a second for it. Okay, your sound clip starts now. [music]. Okay. Can anyone describe what it was, using any words or any tags you would like to attach to it?

>>: Music.

>> Shahriar Nirjon: Music, we've got one, music. Anything else?

>>: Melody.

>> Shahriar Nirjon: Melody, fine.

>>: Piano.

>> Shahriar Nirjon: Piano, the instrument; he's trying to sense the instrument, okay. Piano. Anything else?

>>: It sounded like a movie soundtrack or something.

>> Shahriar Nirjon: It's a soundtrack from something. A movie? Not a movie, but a TV drama series. Okay, so this is what it was: it was from Sherlock, a famous recent TV series shown on BBC. I have asked the same question at several of my previous talks, and this is what I got. Everyone could say it's music; that's easy to guess.

>>: [inaudible].

>> Shahriar Nirjon: Yeah, about one out of three people could actually say it was from Sherlock, because they were mostly students, so they know what it is; they watch a lot of TV. And some people could actually guess the right instrument, about ten percent of them. Now, for those who were guessing and want to know what it was: it's nothing fancy like a piano. It was a simple DIY music box, and this is how it was made. [music].

Okay. The point I was trying to make is that the same sound clip can be described differently by different people, and it really depends on our experiences. Someone who heard this music for the first time in the show has probably learned it as Sherlock. Someone else has learned it as a music box, the instrument. And some people just guess from experience that it is something like music. So it really depends on how we taught ourselves that song or that sound the first time; in the future, when we hear the same thing, those are the tags that pop up in our minds, like music, or Sherlock, or the instrument. Today's talk is about making smartphones do exactly the same thing. Every smartphone comes with a microphone. It listens to its surroundings.
It should be able to do exactly the same thing, just like our third ear: it should be able to classify sounds and generic acoustic events. In today's talk, I first want to explain why I want to achieve that — the motivation of my research. Then I'm going to show exactly how I did it: my approach and its description. Then I'm going to show some results that I got. And then there will be one more thing, an interesting thing. It's a surprise — a gift for all of you who have come here today. I'm going to give you a free mobile phone app that you can use, share with your friends, and enjoy. I'm sure you're going to love it, so wait for my last slide.

Now I'm going to start with the motivation of my work. There are several different types of web services available for smartphones to use. We have seen search, video sharing, email, maps, music-related services, translators, weather, and so on. But if you think about which services are doing any kind of acoustic or sound processing, you'll primarily see two types: number one, speech, and number two, music. Inside these two categories you have several different types of services. Speech-related services can detect the speaker — speaker ID — and speech recognition is there; some services can actually detect your emotion or your stress level. On the other hand, music-related services are like, for example, Shazam, a very nice app that can identify music — so music ID is there — and there are services for sharing and streaming music. But I'm interested in a wide variety of other sounds. What about sounds like your heartbeat, your cough, or your snoring — physiological sounds? What about the sound of your bus or car, to detect your mode of transportation, or your pet cat, or some machine? So there should also be a general-purpose acoustic event recognition service that developers can use. That is the main motivation of my work.

Now imagine the possibilities that we could actually turn into realities. By detecting sounds like snoring, mumbling, or talking during sleep, we could create a nice sleep quality measurement app. We could create a nice baby monitoring app by detecting the baby crying, whining, or talking. Similarly, we could create a nice social interaction monitoring application by detecting who you were talking to, how much you were talking, whether you have been crying too much these days, and so on. These different possibilities can become real apps if we have a service that can detect all these different types of sound events.

Now, we asked this question: why is this so hard? We came up with these challenges. The first one is that applications are diverse. Creating one single platform is like creating one shoe that fits all, and that is not going to work. This app diversity is the first challenge, because applications require different amounts of processing, different types of classifiers, and different types of features — they are really diverse. The second challenge is a common challenge for any mobile phone app: battery life. Creating an application that is energy efficient is in general a hard problem, and if we add acoustic processing on top of it, it becomes harder. So battery life is our second challenge. And the third thing is user context. Many applications that work indoors might not work outdoors.
I'm talking about sound recognition applications, because the characteristics of the room, the environment, and the background noise all change from context to context. So user context is the third thing we need to consider.

Now let's see what the state of the art is doing. I picked three applications from the state of the art, and I see they have some common structure: they all do some preprocessing of the data from the microphone, then they extract features. The number and type of features may differ, but in general they have some way of extracting features. And there is a classification stage. Again, the classifiers might be different, but they all have this classification mechanism. Now, these applications differ in implementation, too. Some applications — for example, the first column — implement everything inside the phone. They do not need to talk to the cloud; they do everything inside the phone. I call that the in-phone implementation style. The other two are different: the middle column performs classification in the cloud, and the third column does both feature extraction and classification in the cloud. They require assistance from the cloud, so I call them cloud-assisted or in-cloud implementations.

Now let's see their limitations. The first, common limitation of both approaches is that people are reinventing the wheel. Every time a sound classification problem comes up, we do the same thing again and again: we collect data, we extract a bunch of features, we train classifiers, we try to optimize for energy, we try to make it context sensitive. And when the next problem comes, we do the same thing again. I say this is a waste of time, and we should not be doing that. Number two, developers are not researchers. If we can separate these two roles, researchers can do the research and provide developers with an easy-to-use API, and a lot of problems can be solved: developers can imagine new applications, while the actual research has already been done by the researchers, and the developers just use the outcome of that research. There are some implementation-specific limitations, too. For example, cloud-assisted implementations require network connectivity, which in turn has an energy cost and costs money. In-phone classifiers, on the other hand, are highly rigid: they are fixed and trained on different people, so there is no personalization, and changing them does not happen very often. That's why they are very rigid and do not generalize well.

Now let's see what my research goal is. My research goal is to create one single platform that can generate many applications automatically. Here I'm not really talking about creating one shoe that fits all; I'm talking about one shoe factory that people can borrow and use to create their own customized shoes. The platform that I envision is a mobile-cloud platform, so it has a mobile component and a cloud component. The input to this system is a bunch of sound clips with some tags to describe them, and in addition you can specify some constraints, like an energy constraint. The developer can say, this is a long-running application, it is going to monitor my sleep, so I want this application to run for eight hours or more. Or the developer can say, I don't care about energy; the user is going to use it for five minutes or so.
So that is the energy constraint the developer can specify. The output of this system is an energy- and context-aware acoustic event classification application. That is our goal.

Now let's see our approach. We divide the computation between the cloud and the phone like this. The phone contains the mechanism: it has all the preprocessing units, all the feature extractors, and all the classifiers that may or may not be needed to solve an acoustic recognition problem. So the phone already contains all the software pieces. Which pieces are actually required to run, to detect a given type of sound, is determined by the cloud. The cloud is like the brain; it creates the plan. It decides which units to use, what the parameters of those units are, and how the units will be connected together. This communication with the cloud happens during the creation of the application, or at the first run. It is not continuous communication with the cloud; it is one-time. Once we have the plan and start running the application, there is no further communication with the cloud — everything runs locally.

I picked an example. Let's say we want to detect Alice's voice at her office. This is a three-step process. Step one happens inside the phone and is pretty simple: we record Alice's voice at her office and tag it with some words, like "voice" and "Alice." Some of these tags are automatically generated using the sensors inside the phone — for example, that it is an indoor environment, or that the phone was in Alice's hand when it was recorded, things like that. Then we upload that sound to the cloud. The more sound you upload, the better. This is a crowd-contribution approach: people all over the world will be uploading sounds, Alice will be uploading hers, someone else will be uploading theirs, and depending on the type of application, we will be selecting a subset of these sounds. In step two, which happens in the cloud, the phone can request a plan from the cloud. The phone can send a request like: I want to detect Alice's voice, and there are some other sounds, like phone and printer, that might happen in the environment. Upon receiving such a request, the cloud service creates a training set by selecting a bunch of sound [indiscernible] that are relevant to the problem, trains the classifier, optimizes it, and produces what we call a classification plan that can be downloaded to the phone and executed. This whole process is automated. Step three is the simplest step: you just download the classification plan and run it on the phone. That's it. Those are the three steps to use our platform.

Now we'll look in a little more detail at how we implemented this platform. We call it Auditeur; Auditeur is a French word that means a listener. In our Auditeur platform, some components run on the phone and some components run in the cloud. I'm going to explain the in-phone processing components first and then the in-cloud ones. Inside the phone, there are basically three services running. The first one is the context service. This guy is sensing the location context and the phone position context. The location context is something like indoors or outdoors; you can also name your own contexts. The body position or phone position context is the positioning of the phone with respect to your body. It could be in your hand.
It could be in a stable position, like on a table, or it could be inside your pocket. The second service is the communication service; this guy is responsible for talking to the cloud, uploading sounds and downloading the classification plan. The third and most important service is the sound engine service. Inside the sound engine service, we implemented all the acoustic processing components that might be needed to solve our problems. Together, they form a fairly standard five-stage acoustic processing pipeline. In stage one, we do some preprocessing of the sensed microphone data. Then we create frames of 32 milliseconds in length and extract some frame-level acoustic features; I'm going to explain those features on my next slide. Then there is a frame admission controller. Most of the time our phones are listening to silence or uninteresting things, so we want to throw those frames away before classifying, and the frame admission stage does that. In stage four, we accumulate a bunch of frames — three to five seconds' worth — and extract some statistics over them, which we call window-level acoustic features. And finally, that window gets classified.

Let's see the current state of the implementation. So far, we have implemented four preprocessing units. Yes?

>>: All of this is in the phone or in the cloud?

>> Shahriar Nirjon: In the phone. These are in-phone components. So there are four preprocessing units. We have implemented 13 high-level acoustic features so far; they include temporal acoustic features as well as frequency-domain acoustic features. We have implemented 12 different statistics, so for each feature we can compute these 12 statistics. And finally, we have created seven classifiers. Now, this is an exhaustive list; not all units will be used in the phone for a given problem.

This is an example of the classification plan that we download from the cloud. It is an XML file, and it is basically a description of a directed acyclic graph: the nodes are acoustic processing units and the edges are the data flow. Upon receiving such a file, the phone first parses it and creates an acoustic processing pipeline like the one shown on the right. In this particular pipeline example, a frame comes in at the top, then we extract some features at the frame level, then we have a decision tree to throw away some of the uninteresting frames, then we extract some more statistics to create window-level features, and finally we classify the window. One thing to mention here is that this pipeline will be different for different problems, and it might even be different for the same problem in a different context. So this is just one instance of a pipeline, not the pipeline we always use.
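To make this concrete, here is a minimal, hypothetical sketch of such an in-phone pipeline: a sequence of processing units covering frame-level features, frame admission, window-level statistics, and classification. The class names (Unit, FrameEnergy, SilenceGate, and so on), thresholds, and the trivial classifier are illustrative assumptions, not the actual Auditeur code.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hedged sketch of an in-phone acoustic processing pipeline in the spirit of
 * Auditeur's classification plan. All names and numbers are assumptions.
 */
public class PipelineSketch {

    /** One node of the plan's DAG; returns null to drop the input (admission control / buffering). */
    interface Unit {
        double[] process(double[] input);
    }

    /** Stage 2 (sketch): one frame-level feature, e.g. short-time energy of a 32 ms frame. */
    static class FrameEnergy implements Unit {
        public double[] process(double[] frame) {
            double e = 0;
            for (double s : frame) e += s * s;
            return new double[] { e / frame.length };
        }
    }

    /** Stage 3 (sketch): frame admission controller — drop frames that look like silence. */
    static class SilenceGate implements Unit {
        private final double threshold;
        SilenceGate(double threshold) { this.threshold = threshold; }
        public double[] process(double[] features) {
            return features[0] < threshold ? null : features;
        }
    }

    /** Stage 4 (sketch): accumulate admitted frames into a window and emit statistics. */
    static class WindowStats implements Unit {
        private final int windowSize;                    // 100 x 32 ms frames ~ 3.2 s window
        private final List<Double> buffer = new ArrayList<>();
        WindowStats(int windowSize) { this.windowSize = windowSize; }
        public double[] process(double[] features) {
            buffer.add(features[0]);
            if (buffer.size() < windowSize) return null; // window not full yet
            double mean = 0, max = Double.NEGATIVE_INFINITY;
            for (double v : buffer) { mean += v; max = Math.max(max, v); }
            mean /= buffer.size();
            buffer.clear();
            return new double[] { mean, max };
        }
    }

    /** Stage 5 (sketch): a trivial two-class "classifier" standing in for the real one. */
    static class ThresholdClassifier implements Unit {
        public double[] process(double[] window) {
            return new double[] { window[0] > 0.01 ? 1.0 : -1.0 };  // +1 = target sound
        }
    }

    public static void main(String[] args) {
        // A "plan": an ordered list of units, as a downloaded XML plan might describe.
        List<Unit> plan = List.of(
                new FrameEnergy(), new SilenceGate(1e-4), new WindowStats(100),
                new ThresholdClassifier());

        // Feed synthetic 32 ms frames (512 samples at 16 kHz) through the pipeline.
        java.util.Random rng = new java.util.Random(1);
        for (int f = 0; f < 500; f++) {
            double[] data = new double[512];
            for (int i = 0; i < data.length; i++) data[i] = 0.05 * rng.nextGaussian();
            for (Unit u : plan) {
                data = u.process(data);
                if (data == null) break;                 // dropped or still buffering
            }
            if (data != null) System.out.println("window label: " + data[0]);
        }
    }
}
```

In the real system, the choice of features, the admission logic, and the classifier are not hard-coded like this; they come from the classification plan generated in the cloud, as described next.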
That kind of concludes the in-phone processing, and I'm moving on to more interesting stuff, which is the in-cloud processing: what is happening in the cloud, and how the classification plan is generated. I want to talk about three parts here. One is training — how the classifiers are trained. The next is energy efficiency, and then user context. Let's start with training. Our design in Auditeur is inspired by the natural way a developer thinks about an app. Say a person is thinking: I want to create an application that detects Alice's voice at her office, where the other sounds could be other people's voices, printers, and phones.

Now, from this description, three things are important. Number one, Alice's voice — this is what we want to detect. Number two, office — this is where the sound classification application will be running. And the third thing is the other sounds that might be in the environment: voice, printer, and phone. From this description, writing code is very simple. All you have to do is specify three sets of strings: the first one describes the sound you are looking for, which is Alice's voice; the second one is the other sounds that may or may not be present in the environment, like voice, printer, and phone — that limits the scope; and the third one is the context, that is, where the application will be running. The amount of programming the developer needs to know is just a simple API call, like SoundletManager.getConfigurationFile, with these three sets of strings. That is how much programming you need to know to use our system.
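As a concrete illustration of that call, here is a hedged sketch. The class name SoundletManager, the Configuration type, and the exact signature are assumptions reconstructed from the talk, not the verbatim Auditeur API.

```java
import java.util.Set;

/**
 * Hypothetical sketch of how a developer might request a classification plan
 * from the Auditeur cloud using the three sets of strings described in the talk.
 * SoundletManager, Configuration, and the method signature are illustrative only.
 */
public class AliceVoiceApp {

    /** Stand-in for the downloaded classification plan / configuration. */
    static class Configuration {
        final String planXml;
        Configuration(String planXml) { this.planXml = planXml; }
    }

    /** Stand-in for the platform's entry point. */
    static class SoundletManager {
        static Configuration getConfigurationFile(Set<String> target,
                                                  Set<String> within,
                                                  Set<String> context) {
            // In the real system this would contact the cloud, which selects
            // positive/negative training clips by tag, trains and optimizes a
            // classifier, and returns the classification plan to execute locally.
            return new Configuration("<plan/>");
        }
    }

    public static void main(String[] args) {
        Configuration cfg = SoundletManager.getConfigurationFile(
                Set.of("alice", "voice"),              // what we want to detect
                Set.of("voice", "printer", "phone"),   // other sounds in the environment
                Set.of("office"));                     // where the app will run
        System.out.println("downloaded plan: " + cfg.planXml);
    }
}
```

In practice, the returned configuration is the classification plan XML that the phone parses into a pipeline, as described earlier.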
So that is basically sending the request to the cloud; now let's see what happens behind the scenes. The cloud identifies all the positive and negative training examples from this set of tags. First, it takes all the sound clips that are tagged with the context, Alice's office — that is the shaded blue rectangle. Within that, sound clips that are tagged with either voice or printer or phone form the violet circle; they include both the positive and the negative sounds. And inside them, the sound clips that are tagged with both voice and Alice are the positive ones — the ones we are looking for. So now we have the positive examples, and violet minus green gives us the negative examples. Now we are ready to create a training set. This is what a training set in our system might look like. It has over 200 columns; each column represents one feature, and Y is the target class. Yes?

>>: How do you select the negative training data? Do you just blindly go and get negative training data? For instance, if you want to detect Alice's voice, you compare that to [indiscernible].

>> Shahriar Nirjon: So the question is: we really care about the positive training data, because Alice's voice is what we want to detect, so how do we get the negative training data? The developer specifies that using the second set of strings, the "within" set: voice, printer, and phone. This means these are the sounds that might be present in that environment. So we take all the sounds tagged with voice, all the sounds tagged with printer, and all the sounds tagged with phone, within the context of Alice's office. Then we mark those that have the tags Alice and voice as the positive ones; the rest are negative.

>>: More precisely, who generates the tags? Who assigns the tags?

>> Shahriar Nirjon: That's a very good question. We have two models of usage. The concept of "user" is often confusing in our system — I get this question again and again: who is the user of this system? There are two models. Number one, a developer is in the picture. So we are talking about a scenario with three entities: the end user, the developer, and our service. Here, the developer enables collection of data from the end user and uploads it to the cloud. That might happen during the first installation of the application, or during its first usage. So that is one way of doing it. The second model takes the developer out of the picture completely: we — we meaning the platform provider, the Auditeur platform provider — have a generic application with which the end user can record some sounds, add tags, and upload them, so the developer is not involved.

So let's get back to the training set. We are creating binary classifiers, so the class label Y is only plus one or minus one, and the number of rows depends on the number of sound clips we have in our system. One thing to mention again is that we will not be using all 220 features, because most of the features will not be relevant to our problem. So how do we know which ones are relevant? That brings me to the next few slides.

Here we are talking about energy-efficient classifiers. We asked ourselves this question: how do we obtain a classifier that is accurate — accuracy is our number one priority — but that runs within an energy budget, because the developer also specified an energy constraint? We did some measurements. We implemented our units and measured the energy consumption of each of them, and we found that the feature extractors are the ones responsible for consuming over 75 percent — it could be up to 90 percent — of the energy. If we can optimize that part, we can save a lot of energy, and that is the part where we are flexible. So I formulate it as an optimization problem, which reads like this: given a set of features X1, X2, up to XN, their relative energy costs E1, E2, up to EN, and a total energy budget B, how do we select a subset of features S so that the sum of the energy costs of the subset is within the energy bound, yet S contains enough information to classify the sound with high accuracy? Our approach is an information-[indiscernible] approach.
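For reference, the selection problem just stated can be written as a constrained optimization. This is a hedged reconstruction from the wording above, not a formula quoted from the paper; I(S; Y) denotes the mutual information between the selected feature subset S and the target class Y.

```latex
% Hedged reconstruction of the stated problem: pick the feature subset that is
% maximally informative about the class, subject to the energy budget B.
\max_{S \subseteq \{X_1, \dots, X_N\}} \; I(S; Y)
\qquad \text{subject to} \qquad \sum_{X_i \in S} E_i \;\le\; B
```

The constrained minimum-redundancy, maximum-relevance procedure described in the next slides can be read as a greedy, approximate way of solving this under the budget constraint.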
>>: So by energy, you mean the energy it actually uses running on the phone?

>> Shahriar Nirjon: Yes.

>>: And how do you monitor that? How do you know, for different kinds of phones, how much you're going to spend on these?

>> Shahriar Nirjon: It is going to be different, but we have a range. You cannot specify, "I want my application to run within 100 joules of energy," but you can say, "I want my application to run for a long time, like four to eight hours." We translate that into an energy number; that's how we do it. And for different phones, the exact energy [indiscernible] can be mapped to different numbers within that four-to-eight-hour range. So it's a soft guarantee, not a hard guarantee.

>>: For speech recognition, the one thing that doesn't take any time is actually generating the features, so I'm a little bit surprised that 98 percent of the time is going into the features here. What kind of features are these? Isn't it all just basically doing an FFT?

>> Shahriar Nirjon: It is because of the phone, actually. Inside the phone, if we are extracting, like, 200 features in real time, it consumes a lot of energy. It's because of the computation.

>>: Aren't they all just sort of based on doing an FFT of the signal, though, and doing a little bit of this and a little bit of that?

>> Shahriar Nirjon: Yes, it is. But out of the total energy, 75 percent is used by these feature extractors; compared to the overall energy consumption of the phone, the total might be less.

>>: Can you give me an example of how much time it takes for a phone to extract 220 features?

>> Shahriar Nirjon: It won't be real time. For, say, 18 features it is real time — one second of data takes less than one second — but if you go beyond 18, it becomes non-real-time. I don't have the number off the top of my head; it could be minutes.

Okay, let's move on to the solution approach. Our solution is a constrained minimum-redundancy, maximum-relevance selection. Minimum redundancy, maximum relevance is a pretty common approach for selecting features, but we do it in a constrained setup, and that is the novel thing here. I'm going to explain what minimum redundancy and maximum relevance mean, because not everyone is familiar with them. These are two very desirable properties. The first one, minimum redundancy, says: the selected acoustic features should have low correlation — little mutual information — among themselves. We are not going to take two acoustic features that are very similar; we are going to use only one of them, because we want to save some energy. And maximum relevance says: the selected features should have high correlation — more mutual information — with the target class. This ensures that the features we select are actually correlated with the target class, meaning they are relevant; we are not going to select an irrelevant feature that is unrelated to our target class.

Now, let's try an analogy to understand these metrics better. Say H is a function that measures the uncertainty of something, and inside the brackets is my talk. Those people who came to this talk today without any prior knowledge about who I am and what I'm going to say — to them the uncertainty is, say, a hundred units; uncertainty is measured in bits, so let's say a hundred. A second guy who came to this room knew my name; he Googled me, Binged me, got my web page, and knows that I work in sensing, mobile computing, acoustics, and things like that. To him, the uncertainty of my talk, with that kind of knowledge, is a little less, like 80. And a third person who came to this room today actually read my paper on Auditeur and knows exactly how my algorithm works. To him, the uncertainty of my talk will be almost zero, like 15. Now, what we do is take the difference between these uncertainties. By taking the difference of the first and second rows, we get 20, which is the mutual information between my talk and my web page. So this difference of uncertainties is a nice measure of correlation, too. And the third row shows that the mutual information between my talk and my paper is very high, like 85, which is true.

Now we replace these with acoustic features and variables. Say XN is the Nth acoustic feature, S1 is the set of already selected acoustic features, and Y is the target class. Now everything starts making sense. We want the second row — the mutual information between XN, the acoustic feature we are thinking about selecting, and the already selected features — to be minimized, in order to satisfy our minimum redundancy criterion.
And we also want to maximize the third row, that is, the mutual information between the Nth acoustic feature and the target class, because that maximizes our relevance. Now that we are clear about the definitions, for the first time we plug in the energy bound. I'm going to show you a simulation of this algorithm with an example, because examples are easier to understand. Let's say we have N minus 1 acoustic features, our energy bound is B, and we magically know the best solution for this scenario: for example, if N equals 5, then for every bound from one up to B we magically know the best solution using the first four acoustic features — N minus 1 is four. So these are the solutions we somehow got. Now, the question we ask is: what happens when the next feature, X5, comes? More precisely, what happens to the set pointed to by the bound B when X5 comes? If we can answer that question, we can do the same thing for X6, X7, up to XN, and that is how our iterative algorithm advances.

So what we are really talking about here is the decision on X5. The decision on X5 is either we take it or we do not take it. If we take X5, X5 has its own energy cost, say E5. Without crossing the energy bound B, we can insert X5 into one of the sets up to bound B minus E5; those are the candidate sets where we can add X5 without crossing the energy bound B. Within those top B minus E5 sets, we do a linear search to find the set with which X5 has the minimum mutual information. That is one candidate solution. If instead we decide not to take X5, then by the definition of magic we already know the solution from our previous step. Now, between these two candidate solutions, we pick the one that has the maximum mutual information with the target class. This algorithm is iterative, and it is kind of greedy. It does not give you the optimal solution, but it is way better than a regular [indiscernible] algorithm where you would simply rank the features and take the top ones.

This brings me to the third part of my talk, which is achieving context-sensitive classification. At this moment we consider two types of context: one is the location context, the other is the body position context. Some examples of the location context are on the left. We have this — yes?

>>: I'm sorry. Can you compare that previous one with methods like training a maxent classifier with regularization and throwing away the features that have low weight?

>> Shahriar Nirjon: The one you mentioned in particular, we did not do. You are probably talking about an embedded way of building the classifier while doing the feature selection. We did not do that.

>>: Exactly. Or, in other words, looking directly at the classifier performance in the decision —

>> Shahriar Nirjon: Yeah, we did not do that.

>>: — to select the features. Build the big classifier with all the features and throw away the ones that have low weight.

>> Shahriar Nirjon: Or build it up, yeah.

>>: There's a lot of variations on that.

>> Shahriar Nirjon: Exactly. We did not compare with that one. We compared with an easier one, I have to admit — the ranking method. A filtering method, actually, which filters out some of the features after ranking them.

>>: Based on —

>> Shahriar Nirjon: Based on, say —

>>: Performance?
>> Shahriar Nirjon: If we use classifier performance, then it becomes the [indiscernible] method, which is different, because we do not actually do the classification first. There are so many possibilities — an exponential number of possible states — that we cannot do that. Instead, we use this information-theoretic approach just to estimate how good the features could be. Ideally, I totally agree with you that we should be doing that: [indiscernible] all possible feature sets, doing the actual classification, and taking the one that actually performs best. But we did not go through that process, because it takes too long.

Okay, let's get back to the location context. We consider several different types of location context, like indoors versus outdoors versus traveling, and you can also name your own contexts based on GPS coordinates. On the other hand, we have fixed three body position contexts: number one, you are holding the phone in your hand, that is, interacting with the phone while recording or using the application; number two, the phone is inside your pocket; and number three, the phone is on a table. The way we infer these contexts is with on-board sensors: WiFi and GPS can be used for the first one, plus light sensors and accelerometers — and these days Android phones actually come with an API to detect all three. So there is no research novelty here; this is a solved problem for these particular contexts, and we just use that. But the interesting thing is the way we use it in our system: for each location context and for each body position context, the cloud creates a different plan. Now, having said that, not all plans will necessarily be used or be relevant to a particular problem. Say you are trying to monitor your sleep quality: you are not going to sleep outside, and hopefully you are not going to sleep while driving, so only the first row is valid. Within that, you are not going to sleep while using the phone, and the phone is not likely to be inside your pocket, so probably only the third plan is relevant to this problem. So the [indiscernible] can specify which plans he wants instead of getting all of them computed.
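To illustrate the per-context plan switching just described, here is a minimal, hypothetical sketch of a plan lookup keyed by (location context, body position context). The enum values, class names, and the sleep-monitoring example registration are assumptions for illustration, not the actual Auditeur code.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hedged sketch of context-based plan selection: the cloud produces one
 * classification plan per (location, body position) pair, and the phone
 * switches plans when its sensed context changes. Names are illustrative.
 */
public class ContextPlanSwitcher {

    enum Location { INDOORS, OUTDOORS, TRAVELING }
    enum BodyPosition { IN_HAND, IN_POCKET, ON_TABLE }

    /** Stand-in for a downloaded classification plan. */
    static class Plan {
        final String description;
        Plan(String description) { this.description = description; }
    }

    private final Map<String, Plan> plans = new HashMap<>();

    void registerPlan(Location loc, BodyPosition pos, Plan plan) {
        plans.put(loc + "/" + pos, plan);
    }

    /** Returns the plan for the current context, or null if none was requested for it. */
    Plan planFor(Location loc, BodyPosition pos) {
        return plans.get(loc + "/" + pos);
    }

    public static void main(String[] args) {
        ContextPlanSwitcher switcher = new ContextPlanSwitcher();
        // A sleep-monitoring app would only request the plans that make sense,
        // e.g. indoors with the phone lying on a table.
        switcher.registerPlan(Location.INDOORS, BodyPosition.ON_TABLE,
                new Plan("sleep-sound detector, indoor, phone-on-table"));

        Plan p = switcher.planFor(Location.INDOORS, BodyPosition.ON_TABLE);
        System.out.println(p != null ? "Using plan: " + p.description
                                     : "No plan for this context");
    }
}
```

The context inference itself uses the on-board sensors and Android APIs mentioned above; the sketch only shows the lookup that picks which plan to run.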
Now let's see some results. We have implemented several applications; seven of them were deployed and tested really well. These are the ones. The first two are related to speech and voice — not the content of the speech, but speaker identification: male, female, or not speech. The next two relate to music. And the last three relate to the sound of vehicles, the sound of your kitchen, and the sounds that happen during your sleep. Let's see some results; I picked only a couple, and more are in my paper and in my backup slides if you are interested. The top result is about power consumption, which is the main concern. For each of these applications we implemented three versions: one is our automated system, Auditeur — the first bar — and the other two are in-phone and in-cloud implementations. Let me explain what those are. Out of these seven, four are actually from published papers, so we picked the features they used, the classifier they used, and their setup, and then we evaluated all three implementations together.

The difference between in-phone and in-cloud is basically the communication — where the classification happens, whether inside the phone or whether we are sending the data and classifying in the cloud. Based on that, we see that on average our automatically generated classifiers are 11 percent up to 441 percent less power hungry than the baselines. Accuracy is also important: an app can be energy efficient, but it should not be poor in accuracy. So we brought this graph too. Here, once again, I'm comparing the accuracy of all three implementations, and we are pretty much the winner most of the time. Yes?

>>: My question is, how did you measure power? Did you actually implement this on a phone and measure the overall power of the phone?

>> Shahriar Nirjon: No. This is more like the number of hours they run, which Android has an API [indiscernible] that we can get. So this comes from the long-running application; it is not exactly like the earlier power measurements during the construction of the model, where we really used a power meter. So yes, these two numbers can be different.

>>: So just to clarify: that means you have an energy estimation pass for every feature, then you create a model combining those features saying there will be that much energy, and then you run it through a simulator type of thing to give you the number of days that you would have.

>> Shahriar Nirjon: Yes, yes.

>>: And if you go to the next slide, for sound, you seem to be close to 100 percent while everybody else seems to be close to 85, 90 percent, right? While you're actually reducing the number of [indiscernible] you're using to save energy, right?

>> Shahriar Nirjon: Yes.

>>: Isn't this a bit counterintuitive? I mean, why is this happening?

>> Shahriar Nirjon: One main reason is that context is not considered in most of these applications; they are not switching context. For example, SoundSense and SpeakerSense — here we are dealing with voice, and we are creating several different classifiers for different scenarios. My office room at my university has its own acoustic signature, so I collect data there and train a classifier there. Then I come back home, collect data in my drawing room, and train another classifier, and my system knows the difference. The others do not switch classifiers; when I train those, I use all the data to create one classifier. So this is the difference between specialization and generalization. Yes?

>>: So do you take into account the cost of detecting these different contexts when you compute the power consumption?

>> Shahriar Nirjon: No.

>>: Because now you have [indiscernible], the light sensor — all these are running on the phone.

>> Shahriar Nirjon: Yes, we did not consider that. We did not consider that; I'm being transparent here. But in most cases it is about WiFi to infer indoors, and I think WiFi will always be on — people have it turned on. The accelerometer is very cheap for detecting the body position context, and the light sensor is also cheap. Unless we use GPS, I think it's not going to be a big deal.

>>: I think I'm saying the same thing as the previous two people. But if you do the same operations on the phone, you'll get the same answer. If you do the same operations in the cloud, you'll get the same answer. So the differences in accuracy here aren't about in-phone versus in-the-cloud —

>> Shahriar Nirjon: It is more about communications.
>>: — it's about what [indiscernible] was used or what features were used.

>> Shahriar Nirjon: Exactly. So let's look, for example, at the vehicle application, the fifth one. Here the in-cloud performance is really, really poor. Why? Because we were losing connectivity to the cloud, so many of our frames and windows were dropped. Most of the time we were outdoors rather than indoors, and as this context switching was happening, the in-cloud version performed badly. So this is one important lesson we learned during our study, even though it was intuitive. Yes?

>>: So you are dealing with real time — you want the answer right now.

>> Shahriar Nirjon: Absolutely.

>>: Because [indiscernible] you'll come back in half a second and say, I don't know.

>> Shahriar Nirjon: No. It's real-time classification, yeah.

>>: Okay. How important is that?

>> Shahriar Nirjon: For some applications, people want real-time answers. For example, if we're trying to detect the mode of transportation and, based on that, you want to present some advertisement, it may not need to be strictly real time, but it is not offline either. We distinguish between offline and immediate results, let's say. Real time is a different issue, where a deadline is involved; so let's say offline versus online. Some of these applications might require online results, say mode-of-transportation detection, or we might want to detect the genre of music and play a song, so we don't want to wait that long. So some applications might be online.

Anyway, the red circles show the cases where we did not perform as well as the baselines. This happens in some cases where the application uses a very sophisticated, highly specialized algorithm. For example, this particular application, Musical Heart, inserts microphones into people's ears to detect heart rate, and it has a very sophisticated algorithm to count heartbeats. We could not beat that, but we are pretty close. So our solution is not necessarily always the best.

We also did some interesting user studies. We gave our API to a bunch of people at the University of Virginia — we also had some Microsoft Research interns participate in the study — and told them to implement whatever app they could envision, and then we asked them questions like how long it took to learn and how hard the coding was. This is what we got: on average, it takes a person less than 30 minutes to learn the API and code the core logic. That is how easy the system is to use.

These are some interesting app ideas that came out of our study and that I thought I should bring up here. Five of them are more interesting than the others, and these are the five. The first one detects snoring and vibrates the phone until the person stops snoring — this is one I thought was interesting. The honk app detects the honking sound of a car and alerts an inattentive user who is listening to music or something. The asthma app is an asthma monitoring application whose goal is to detect crackling and wheezing sounds automatically with the phone and keep records; we became so interested in this that we are now actively pursuing the idea in a much broader, more realistic way, with more sensor devices and a holistic approach. The fourth application, subway, is like the honk application, but it detects subways. And the fifth one is a dog app that detects the sound of a dog barking and alerts the user when he is away from home.
The idea is that probably an intruder came into the house and is trying to steal something.

So, in summary, I presented a platform, Auditeur, which is basically a developer platform providing an acoustic event classification service. It's very easy to use, it provides context-sensitive and energy-efficient classification of acoustic events, and its potential is enormous. I would like to either make it publicly available so people can use it, or Microsoft might be interested in making it part of their [indiscernible] platform, so that Windows Phone developers can build interesting applications with it.

Here is a list of my other work; besides acoustic sensing and processing, I do a lot of other stuff too. The first set of work I did during my Ph.D. is smartphone acoustic sensing. It includes the platform, some interesting applications, and some computational efficiency stuff. I also have some work in real-time systems; it involves mobile computing, a little bit of networking, and energy research, and their overlaps. That one got a best paper award at RTAS 2012. Other than these two, I have also done some work that is completely different, using Kinect sensors — but it is still sensing. These two works, Kintense and Kinsight, got a lot of attention in electronic and print media, and people are really excited about Kinect and how we can use it to build nice [indiscernible] computing systems. And the fourth and most recent one is the work I did here last year with Jie as an intern: we built the first indoor GPS that works. People think GPS doesn't work indoors; we kind of broke that belief and built the first GPS that works indoors, and that got published at MOBISYS 2014. So this is remarkable, groundbreaking work that I did with Jie.

Let me talk a little bit about my future interests — what I'm going to work on in the future. I'm a mobile computing and sensing person, and in the future I want to continue my research in different but related areas. The first one is more aligned with my current work: smart devices and wearable computing. Wearable computing has a huge market; it is estimated to be 30 billion dollars by the year 2018. There, the problems could be signal processing, building the sensors themselves, integration, networking, and location sensing — those are the technical challenges. There are some interesting applications too: some examples could be health and wellness monitoring, and I'm also interested in connected vehicles, or monitoring drivers while they commute. Another interest of mine is a more recent trend in computer science, which is that we're collecting a lot of data into the cloud and creating enormous databases — big data. So I want to explore the idea of creating scalable back-end services that are specific to big sensor data; there, the problems could be detecting patterns and detecting anomalies. The fourth thing, which in my understanding is the next big thing after big data, is [indiscernible] computing, where humans and machines will be making decisions together, and mobile phones may play a very vital role there. So this is another interesting area that I'd like to pursue.

So finally, I promised one thing, and this is the time. This is an application that I built during my Ph.D.; it got published in our top conference. So this is a hardware-software system.
It is described as a biofeedback-based, context-aware, automatic music recommendation system. The way it works, you have to wear special [indiscernible] wearable earplugs just like these — this is the prototype that we built. This work was done in collaboration with Microsoft Research Asia. Inside these earbuds we have added a lot of sensors: IR sensors, microphones, accelerometers. When you wear them, the sensor board, running the TinyOS operating system, collects the sensor data and sends it to the phone, so we can sense your heart rate and your activity level — what you are doing and at what intensity — and based on that, we play the right music for the right moment. Not only that, the music actually acts as an intervention: you can set your target heart rate, the music is chosen based on your physiological parameters, like heart rate and activity, and your heart rate is regulated by the music. These are the sensors; I already explained what they are: IMU, microphone, IR thermometer. Communication happens in two ways, either through the audio jack using HiJack technology, or over Bluetooth, and it has a battery for power. But the problem with this previous version is that only three copies of the hardware exist, and unless we mass-produce it, this application is not going to be used by people. That's why we built a better, newer version, Musical Heart version 2, that works with any off-the-shelf heart rate monitoring device. Currently I am wearing one of those devices — this smart watch. It detects my heart rate and sends the data over Bluetooth 4. The application also works with chest straps and things like that.

So first of all, I'm going to share this application with you. Go to my website, /musicalheart.html; you'll get the application and you can use it. If you are lazy like me, you can take a picture — that's the download link in QR code format. Now I'm going to show a live demo of Musical Heart, and then I'll take questions. I need to switch from the PC to this camera. I've already turned my wristwatch on, and I'm going to turn on the Musical Heart application. This is what the Musical Heart user interface looks like. At first I'm going to scan, and it should find my device — Smile Global is my device. I connect to it, and in a moment it should be showing my heart rate. There it is: 115. I'm just talking here; I should not be that high, but it happens — it always happens, it's always over 100. That graph is showing my heart rate, getting the data from my wristwatch. It's also showing my intensity level, what activity zone I'm in, et cetera. And the bottom is showing my activity level. Previously we used data from the earphones; now we're basically using the in-phone accelerometers, so if I shake it like this, the activity goes high and then low again. So these are the two sensing modes we have: heart rate and activity level. Based on that, the music player on the next tab is actually going to play the music. So this is the tab. On the top panel, you set how long you want to run this application — say you want to do a 20-minute cardio exercise, so we set the time and set it to cardio, and then we know what the target heart rate should be. Or you can specify your target heart rate directly; let's say we set my target heart rate to 62. And the list at the bottom is showing the top recommended songs.
So if I hit the refresh button, then based on my current heart rate it calculates which song I should be listening to now in order to bring my heart rate down. I can play one for you — say, Africa. [music]. This is a very soft, melodious song; that's the idea. I can do the opposite, too. Let's try to make my target heart rate 160 or something. If I hit refresh again, it's going to suggest a different set of songs. So, Down For Whatever. [music]. This is a very aggressive, high-beat, high-tempo song that is supposed to energize me while I'm working out hard. The third tab is my playlist, the song library. The way you should use it, just copy English MP3 files onto the SD card of this phone, and we extract all the features automatically using our website. For example, let's pick one, say Roar. What the phone is doing is talking to a remote server and sending a piece of the song, and these are the acoustic features being computed in the cloud and sent back: tempo, energy, danceability, liveness, [indiscernible]. Based on all of these acoustic features, plus my previous history of songs and heart-rate regulation, we play the right music for the right moment. I'm going to keep this running, and now I'll take questions. Ask me questions and see how my heart rate changes — or whether I am telling the truth.

>> Jie Liu: Thank you. Questions?

>>: A question about how [indiscernible] the user is responsible for this input [indiscernible].

>> Shahriar Nirjon: Yes, a very good question: how do we get the samples, who tags them, and when? Now, end users are like our grandmothers. They do not have a good sense of what they should be doing, and we should treat them like our grandmothers, even if they are younger than us. It is the responsibility of the developers, who should guide the end user to collect that sound and upload it. We envision it happening in the settings of an application. We have a demo application that is given to developers to show how it should be programmed. There, the idea is that during setup — during the first usage of the application — the user is prompted to record this or that kind of sound. And we provide the API to the developers so they can do it the way they want their application to collect data. So the developer enables it; the end user does it. Another approach is that for some applications we do not necessarily need to engage the end user at all. Say we are trying to detect snoring. It's not possible to [indiscernible] the user, because he is sleeping; he cannot record his own snoring unless a recorder runs overnight. There, you can actually use data collected from other sources, the developer is the one who uploads it, and the end user is completely out of the picture. That may also happen. So there are a lot of flexible ways of using this system.

>>: So will your optimization framework work if you want to do multiple things with the same audio clip? For example, you want to detect whether it's Alice, and you want to detect whether she's alone or talking to somebody. If you want to do multiple things at the same time with the same clip, will your optimization framework still work?

>> Shahriar Nirjon: No. The way it is now, for Alice you have to create one classifier, one pipeline, and if you have Bob, you have another pipeline.
And if you run both of them —

>>: Multiple pipelines in parallel?

>> Shahriar Nirjon: Yes, it will run both pipelines in parallel, and the cost is [indiscernible] Alice plus Bob, because when creating Bob's pipeline we did not know Alice was there. So the general answer is no. But if the developer is clever, he can make it happen: during the creation of the second application, accounting for the first one, he can constrain the energy — he can say the energy constraint is now even tighter. Another way — the third and most appropriate use case for this Alice-and-Bob scenario — would be to have both Alice and Bob in your dataset as your positive examples and everything else as negative. But the problem there is that you get one classification result, "Alice or Bob"; it's not Alice, Bob, or neither — it will just be Alice-or-Bob. So that has that problem, but there are multiple ways of dealing with it. The fourth and better approach could be to [indiscernible] the pipelines. Pipelines for similar types of data — like Alice's voice versus Bob's voice — are almost always similar and could use the same set of acoustic features. We do not do that, but it is in our future plan to combine multiple pipelines and use them together. That's another possibility. So there are several possibilities. When we created this, we were really trying to prove the hypothesis that it works, that it's a useful platform we can use. Now that it is working, we will of course look at optimizing it in different ways.

>>: So in the Alice example, how many samples did you need to train the classifier?

>> Shahriar Nirjon: About five minutes' worth is just fine, if you are talking about —

>>: You need one context, or multiple samples from —

>> Shahriar Nirjon: Each context. For each context.

>>: So at the beginning of your talk, you showed a recording device and said the piano music was recorded that way.

>> Shahriar Nirjon: No, actually, I was saying it was not a piano that was used to create that music; it was the music box that was used to create it.

>>: [indiscernible].

>> Shahriar Nirjon: Oh, no, that's from YouTube.

>>: So your system relies heavily on contributed data, right? Most developers would not necessarily know what to do, and others, right? [indiscernible] examples. How do you think about collecting them, or what's the incentive for people to contribute other signals to your system to bootstrap it?

>> Shahriar Nirjon: We really did not think about that at the time, because we had 35 trusted participants, and they were told to do it, so they did it. But in a real system, first of all, we could create a crowd-contributed system by providing some incentives — I don't know, maybe money, or maybe some service — and we could get people to do it. That could be one approach. Developers can also collect some data with people's permission, without the people actively managing it, like while they are talking on the phone. They can turn this feature on: you can collect my data while I'm talking on the phone — not private data, maybe just some acoustic features — and upload it like that. So you can opt in to that, and the offer could be that using it updates your classifier, and in the future you will get better classification. That could be one incentive.
Or say you are using an application, and you prompt the user: if you are not satisfied, or if you want to improve the classification, you can go and collect a few more sound clips, a few minutes long.

>>: You would also use [indiscernible], right?

>> Shahriar Nirjon: Yes, of course.

>>: So that's how you can [indiscernible].

>> Shahriar Nirjon: Yeah. For example, if you are [indiscernible] from one context to another context, we could play the same game in the cloud: the user can provide some tags, and from those tags we can understand that it belongs to some other problem, yeah. That is a very good [indiscernible], yeah.

>>: [indiscernible], because I think the user would be more interested in — they don't really care about the context, right, when they [indiscernible] recognizing [indiscernible]. So have you been through this? Like [indiscernible].

>> Shahriar Nirjon: Let me try to rephrase what you are trying to ask. It is: what happens if you do not use context at all? Is that what you are asking? Or —

>>: Yeah, context independent.

>> Shahriar Nirjon: Well, one thing is that sound sensing is fundamentally context dependent, because the environment has a lot of effect. Even the room size, even if it is indoors, has some effect — the reverberation differs from big rooms to small rooms. So it is a context-sensitive problem, and I would vote for context sensitivity rather than trying to solve the problem without considering context.

>>: Because the user [indiscernible] recognizing different contexts requires him to record data for all these different contexts, and this is pretty, you know —

>> Shahriar Nirjon: Yes.

>>: — [indiscernible] for the user.

>> Shahriar Nirjon: Yes, I totally agree with that. I mean, this is a trade-off. If your user wants better performance, which can [indiscernible] them, you should do this. And if he does that, we are giving him [indiscernible], so he has some incentive. But I agree that it is [indiscernible].

>>: [indiscernible] tries to mitigate that context problem [indiscernible].

>> Shahriar Nirjon: I would comment on that as follows. First of all, mobile phones come with all these sensors, so we can now easily and cheaply infer context. Maybe the reason the speech recognition community did not do that is the absence of sensors with which they could detect the context and then recognize speech specific to that context, so they attacked the problem in a different way, assuming context is not available. So we [indiscernible]. But in our case, in mobile computing, we are constantly working on detecting context, and context sensing algorithms are becoming better and cheaper. How we can leverage that is a nice trend in many mobile sensing problems, and my solution is built on top of that.

>> Jie Liu: Let's thank Nirjon again.