>> Dah Ming Chiu: Thank you. I'm very glad to visit Microsoft and thank you, Jin, for inviting me. In today's talk I will cover a couple of topics. One is straight from the paper I will be talking about, which is a collaboration with PPLive on their P2P VoD system. After that I will try to introduce some of the other work we have been doing. In the academic world people want to study more of the modeling: the capacity of P2P networks, the algorithms, and so on. So I will talk a little bit about that if there's time. The case for P2P VoD, I think, has already been well established, last year at the SIGCOMM conference, in fact by people sitting here: you wrote a good paper studying several VoD systems, including Microsoft's, and analyzed how you might use P2P technology to significantly reduce the server loading and make it more scalable. You call it peer-assisted VoD; I think P2P VoD means the same thing. So the key challenge is clear. We have had P2P streaming systems working for quite a few years. I think that started originally from the CoolStreaming paper in INFOCOM several years ago, and since then several platforms have been built, so people already feel that P2P streaming is mature: we know how to do it and it can be done quite well. But P2P VoD is a different story. For P2P streaming you have a lot of users viewing the same content at the same time, so intuitively it's easy to make them help each other share the content they are watching. They have more or less similar content in their buffers and can easily relieve the server by serving each other. But for P2P VoD you can have many other peers that may be viewing different movies -- there can be many movies -- and even when they are viewing the same movie they could be looking at different parts of it, so how to make them help each other is a much more challenging problem. So the question is how to build such a system, whether it can be built. I think the paper last year was more of a paper study, describing how you might use prefetching for the peers to fetch the content; under optimal conditions you can cut the server load down to around 10%. I think you predicted that through the analysis. So the question is how to build a system that will actually deliver that. Apparently, while you were working on that paper, the people doing P2P streaming had already started building these kinds of systems. In the case of PPLive, they built such a system maybe last summer, they were testing it, and then by last fall they deployed it. And it reached a scale of several hundred thousand subscribers and many thousands of simultaneous users. I don't want to quote exact numbers because I don't know exactly what the scale is, but I think that is the kind of scale that was reached. Yeah? >> Question: This number is much smaller than the live streaming. >> Dah Ming Chiu: Yes. >> Question: I mean, essentially I think VoD is a very popular form of service. >> Dah Ming Chiu: Right. >> Question: And that is, from my personal view, the one that VoD is here for. >> Dah Ming Chiu: Right. >> Question: They use that function. (Inaudible) -- >> Dah Ming Chiu: Okay.
So I think part of the reason is when they first started the deployment they were trying to be cautious. They started it in a low-profile way: they just created a button on their streaming client, so some people may not even notice the service. And partly -- some of these things I'm just conjecturing, okay, because our collaboration with PPLive was not such that I was running their system, nor was I involved in building it. Mainly we were working with them to bring out the insights of the design of the system, and also, through the measurement study, some of the user behavior and how to measure the system in order to make sure it works well. So to answer your question, my conjecture is they were probably trying to be cautious, because P2P VoD actually does take a lot more server load. If they tried to grow very fast, the performance might not be so good. The other thing they can control is that in P2P VoD you can control the server loading, and how well you do, by limiting the number of movies. If you restrict the catalog to only the popular movies you can probably reduce the load initially, though you may lose some users if you have only a few popular movies rather than many. So they are going through a period of building up. I think this is not surprising compared to their P2P streaming, which is very mature. In fact, the numbers are: for P2P streaming the server loading is probably 1% or less -- the server doesn't need to do much -- but for P2P VoD, initially when they deployed it, as I will mention later on, the server loading was 20 to 30% compared to what the server would have to serve by itself. Only earlier this year did they cut it down to about the 10% level. So they are still going through this, still trying to optimize and make it work better. But what happened is they did succeed in demonstrating that a relatively large scale system can be built and made to work. And also, through the measurement, we can see that the system delivers reasonable user satisfaction, and we'll look at how to measure that. Obviously they also do some subjective measurement by having friendly users look at the video and judge whether it's delivering reasonable quality. So this is just a little animation to show the point I made earlier: for streaming, essentially all the users are synchronized, looking at more or less the same video at the same time, whereas for P2P VoD you can have different users looking at different movies or different parts of a movie, and that's why the problem is more challenging. So what is the secret in their system? I don't know if it is how everyone builds it, but the way they make it work is that, in addition to making all the peers share what is sitting in their memory buffers, they also make the peers contribute some storage. This is a significant difference from P2P streaming.
So as you watch movies, after you watch one you just leave some of the movie in your storage, and when you're online you may be watching one movie, but if there's another user who wants to watch a movie that is sitting in your storage, you can also serve that user. That is how they manage to make P2P VoD work. Each peer is contributing about 1 gigabyte of hard disk, so the key problem of P2P VoD is how to manage this storage so that users are leaving the right content on their hard disks; then when another user comes online there is sufficient content sitting on the hard disks of different peers who are online, and they can relieve the video server. That is the challenging problem. This is new -- it is in addition to whatever you already have to do in P2P streaming. Some of the other things I observed in building this system: in contrast to file-sharing protocols like BitTorrent and so on, the system PPLive built is really more like a distributed system. They are not letting the peers control so many things. In fact, if a peer wants to use the system to watch a movie, they have to stay online and essentially contribute; they have to show that they are still connected through heartbeat messages and so on. If they are not contributing, they cannot watch. So this, I think, also solves the problem of free riding. I think there are many other, less technical factors that are important to make such a system successful. For example, working with ISPs -- I know they work quite closely with ISPs, and in fact ISPs are happy to help them provide some content in many situations. Obviously you also need to get good content to draw the eyeballs and get commercials and so on, which I think is very important for the success of such systems. So the problem we're looking at is really a problem of how to do content replication. What they are doing is what is known as multiple-video replication. In traditional P2P streaming essentially everyone is watching the same thing, so there is only one video you're concerned with; there's a tracker that keeps track of all the peers watching that video and so on. In their system each peer is storing multiple movies, so the tracker system has to know, given a movie, which peers are storing that movie and can provide help. You need to build a tracker system to provide information at that level, and you also need to make sure that the I/O system is fast enough to bring the movie into memory when you need to serve other peers. So the replication of content is at two levels. One is at the movie level: you need to store multiple movies to serve other peers. The other is at the chunk level: once you have a movie and are serving another peer, you record which chunks of that movie you have in a bitmap. This is just like traditional P2P streaming; there's no big secret there. In their case, the size of a chunk is about 2 megabytes, so the bitmap for a movie has a resolution of about 100 bits. So here's -- okay. >> Question: So does the tracker know which chunks a peer has, or does the tracker just know that it has something of this movie (inaudible)? >> Dah Ming Chiu: That's a good question.
Actually later on I will clarify this point, because the tracker actually does keep track of the chunks -- which chunks a peer has. Let's see -- I know the tracker keeps track of which chunks a peer has, but I've forgotten why it has to do that. In terms of streaming, you mainly find out which chunks your peers have through gossiping, because that is more up to date. But the tracker does keep a lot of statistics. So you have the movies divided into different pieces. You have what is called a chunk, which is a bigger piece, about 2 megabytes; this is the unit used for advertising, via the bitmap you exchange with your neighbors. Then you have what is called a piece, which is the minimum viewing unit, about 16 kilobytes. And then you have the sub-piece, which is about one kilobyte, I think; this is the unit used for transmission. So you need to schedule transmission from different neighbors -- I will mention this point later on in the algorithms. So there are three important algorithms. One is the piece selection algorithm; this is the streaming part. You have to work with different neighbors to find out which pieces to get from which neighbors. The second is the replication algorithm, which is what I mentioned earlier. This is the important new aspect of P2P VoD: deciding which movies you will keep on your hard disk. This is like managing a cache. In order to do this, the tracker collects information from different peers and then gives peers a piece of information that is essentially a supply-to-demand ratio: it tells you which movies are over-supplied and which movies are in demand, and given this information you can decide better which movie to store. The third algorithm that's very important is the transmission scheduling algorithm. In their system, when a peer is trying to get a chunk, it actually tries to get it from several different peers at once, and it needs to schedule how much to request from each neighbor to achieve load balancing. This is actually a very tricky algorithm; if you don't do it right, you will not make good use of the uplinks of your neighbors. So these three algorithms are all interesting algorithms worthy of study as research. In their system they just built something ad hoc -- they tried different ways of doing it and saw how well it worked. In terms of piece selection, this part is similar to P2P streaming. We all know there are essentially two kinds of algorithms for pulling data. By the way, their system is more or less a mesh-based system, not a tree-based system, so the peers basically pull the content from neighbors. One approach is sequential pulling: you are a little bit short-sighted and try to get what you need for playback, so the most urgent content gets higher priority and you fetch it first. The other algorithm is called rarest first -- the name comes from BitTorrent -- and essentially you try to get the freshest content first: whatever the server is sending out, you try to get that first. This strategy helps propagate the content and so provides scalability. So that's the other one.
And a third algorithm is called anchor-based: you select certain anchor points in the video, randomly pick an anchor point, and then pull sequentially from that anchor point. This achieves two things. One is that if users tend to drag to a certain area of the movie, you can use an anchor point to approximate that: a user tries to move the viewing position to a certain point, and while you cannot jump exactly to that point, you can at least move to an anchor point close to it -- sort of a slightly cheating way of approximating random seeking. The other thing anchor points help with is that they get different peers to collect different parts of the movie, so that they can help each other. So that's also how anchor points can be used. In PPLive's system they actually use a mixture of these. In fact they experimented with the anchor point approach and found that, at least in the currently deployed system, they don't need it, because users are not jumping around too much; you will see from the measurement data that this is the case. So the second class of algorithms that's very important in the overall system design is the replication algorithm. In their system they are not doing any prefetching as in the SIGCOMM paper last year; essentially they use this cache to store movies for other peers to use. So the algorithm is basically a cache replacement algorithm. The user would be watching some movie, and after the viewing you can decide to store that movie, but after a while the cache will fill up. Each movie of one or two hours, at the resolution they are providing, is about 200 megabytes, somewhere in that neighborhood, so one gigabyte can store four or five movies at full length; if you are storing just fractions of movies then you can store more. So the question is which movies you keep in your cache. You could use a traditional cache management algorithm such as least recently used or least frequently used, but in the system they built they did a lot of experimentation and found that a weight-based approach is better than the traditional ones. The weighting is based on two factors. One is how complete the movie is: some users may just be browsing, watching part of the movie and then going away, and that copy is not so useful, whereas if you watched almost the entire movie the copy is more useful. So the fraction of the movie you have is one factor. The other factor is the availability-to-demand ratio. This is something the tracker collects from all the peers: which movies they are watching and how many peers are storing each movie. The ratio is computed from that, and you can get this information from the tracker before you make the decision. These two things multiplied together give the weight. For all the cached movies you look at this weight and then decide which one to throw away. Once you start throwing away a movie you throw away the whole thing; you don't try to keep partial movies. So that's the algorithm.
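To make the replacement idea above concrete, here is a minimal sketch in Python. It only illustrates the two weighting factors described in the talk (how complete the local copy is, and the availability-to-demand ratio from the tracker); the class names, the exact way the two factors are combined, and the eviction loop are my own assumptions, not PPLive's actual algorithm.

```python
from dataclasses import dataclass

@dataclass
class CachedMovie:
    movie_id: str
    bytes_cached: int   # how much of this movie the peer holds locally
    movie_size: int     # full size of the movie in bytes
    availability: int   # peers currently storing the movie (tracker statistic)
    demand: int         # peers currently watching the movie (tracker statistic)

def keep_weight(m: CachedMovie) -> float:
    """Rank a cached movie by the two factors described in the talk.

    Assumption: a copy is more worth keeping when it is more complete and
    when the movie is scarce relative to demand (low availability-to-demand
    ratio).  The real weighting formula is PPLive's and is not public.
    """
    completeness = m.bytes_cached / m.movie_size
    atd = m.availability / max(m.demand, 1)      # availability-to-demand ratio
    return completeness / max(atd, 1e-6)

def evict_to_fit(cache: list[CachedMovie], capacity_bytes: int,
                 incoming_bytes: int) -> list[CachedMovie]:
    """Throw away whole movies (never partial ones) with the lowest weight
    until the newly viewed movie fits within the ~1 GB cache budget."""
    cache = sorted(cache, key=keep_weight, reverse=True)
    used = sum(m.bytes_cached for m in cache)
    while cache and used + incoming_bytes > capacity_bytes:
        victim = cache.pop()                     # lowest-weight movie
        used -= victim.bytes_cached
    return cache
```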
The third algorithm, which I think is very important in their system -- in many systems, P2P streaming or P2P VoD -- is the transmission strategy. You are trying to get a chunk or a piece that many other peers have, and you can ask any one of them, or simultaneously ask multiple peers, to get it. So how do you do this? Again, in their case they have a somewhat ad hoc way of doing it; they experimented many times. But this is something I am actually studying with my students, trying to model it more formally as an algorithm. The idea is: you have all these peers out there, and each peer is receiving requests from many other peers, so they can be overloaded. You want to more or less use up all the peers' uplinks in order to offload the server. So how do you schedule? If a peer just goes to one neighbor to get the content, it's very risky, because that neighbor may be gone; you have to wait for a timeout, you wait a long time, and you get nothing. The strategy they're using is to go to multiple peers simultaneously, even for the same piece or chunk: you ask different peers for different sub-pieces -- they don't request the same sub-piece from more than one peer -- and then dynamically adjust how much to ask from each neighbor you are requesting from. This algorithm reminds me of the work I've done in the past on TCP congestion control; this is very similar. How do you adjust the window size and the timeout to make good use of the whole set of neighbors you have identified? The algorithm is more complicated than TCP congestion control because you're working with multiple destinations, not only one, and it's very challenging. So this is something we are actively working on as a research problem. >> Question: Let me ask a question about this, basically. >> Dah Ming Chiu: Okay. >> Question: The transmission strategy. >> Dah Ming Chiu: Okay. >> Question: When you talk about it, it is the receiving strategy, right? >> Dah Ming Chiu: Uh-huh. >> Question: On the sender side -- >> Dah Ming Chiu: Uh-huh. >> Question: What if it's trying to accommodate different requests? I mean, I may have, let's say, 10 or 20 peers asking me for content -- >> Dah Ming Chiu: That's true. That's true. >> Question: -- rerouting them or accommodating them (inaudible), or, I mean, consider for an ISP -- >> Dah Ming Chiu: That's true. I understand. Unfortunately, with the PPLive system I didn't get enough detail from them; in fact, they have a patent on this -- at the time we were writing the paper they were applying for the patent -- so we didn't get the details. But as I'm studying this problem I know the question you ask is very relevant, because the serving peer probably doesn't want to queue up requests from all users, since that would increase the delay it adds. There are many variations to this. The sender probably wants to keep only certain requests, and that will determine how the users set their timeouts and all kinds of things. This is an algorithm that can be quite complicated in reality, and that's why it is quite interesting. I know everyone building a P2P streaming system has to work on this, and it affects the real performance -- the efficiency of the system as well as user-perceived performance.
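As an illustration of the kind of scheduling just described, here is a toy sketch in Python: the wanted sub-pieces are spread across several neighbors in proportion to a per-neighbor request window, and the window is adjusted AIMD-style according to whether a neighbor answered in time. This is my own sketch of the general idea, not PPLive's patented algorithm; the class name, window sizes, and adjustment rule are all assumptions.

```python
class NeighborScheduler:
    """Toy multi-neighbor request scheduler (illustrative only)."""

    def __init__(self, neighbors):
        # Start every neighbor with a small request window, in sub-pieces.
        self.window = {n: 2 for n in neighbors}

    def assign(self, subpieces):
        """Split the wanted sub-pieces over neighbors in proportion to their
        current windows; each sub-piece is requested from exactly one peer."""
        total = sum(self.window.values())
        it = iter(subpieces)
        plan = {}
        for n, w in self.window.items():
            share = max(1, round(len(subpieces) * w / total))
            plan[n] = [sp for _, sp in zip(range(share), it)]
        leftovers = list(it)                 # anything lost to rounding
        if leftovers:
            best = max(self.window, key=self.window.get)
            plan[best].extend(leftovers)
        return plan

    def feedback(self, neighbor, delivered_in_time):
        """Grow the window of a responsive neighbor; halve it on a timeout."""
        if delivered_in_time:
            self.window[neighbor] += 1
        else:
            self.window[neighbor] = max(1, self.window[neighbor] // 2)


# Example: spread 10 sub-pieces of one chunk over three neighbors, then
# penalize a neighbor that timed out before scheduling the next chunk.
sched = NeighborScheduler(["peerA", "peerB", "peerC"])
print(sched.assign(list(range(10))))
sched.feedback("peerB", delivered_in_time=False)
```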
>> Question: Very unique thing about basically the P2P congestion comparable, I think that is a very great direction, although I don't know anyone has solved that yet. I mean, basically how we can combine P2P congestion control algorithm into this framework. >> Dah Ming Chiu: Yeah, what we need is actually a new string of congestion control. This is something that I think by the time somebody work out the exact model this is going to be receiving so much interest and probably surpass the P2P congestion control. Okay. So the next part I will talk about is look at some of their measurement data. The -- in this area we look at several things. User behavior, that is how -- sort of how much the user view a movie and whether it moves around and how often the users come into the system and these kind of things. Okay. The second part is about replication, how to measure supply and demand. So this I touched on earlier so we go to see some real data. The third part is how to measure user satisfaction. This is a very important problem. Okay. In fact, this is also a problem we are actively studying as a research problem. If you want to deliver a P2P VoD system or streaming system as a platform or as a content provider you better care about user satisfaction. In fact, in IPTV world people study this very seriously. They have a term called a QOE, okay, Quality of Experience. And the whole issue is how to measure some simple parameters and then predict user satisfaction. Because you are just broadcasting some content. You don't know whether the users are happy or not. And you cannot afford to ask them one by one and if they are unhappy -- if you are the operator, you are in big trouble because they are going to phone call and just a lot of problems. So you want to have simple ways to predict and monitor how the whole system is doing. Okay. So we'll show some of the results in this area, as well. And the last part is I show some results about what kind of uplink bandwidth users have, what kind of -- whether they are behind firewalls, some statistics. So the -- sorry. So the data that we got from PPLive to do this study is based on just traces and each trace you can think of it as just a -- if you have user MS, something similar to that. You have essentially a sequence of records. Right. Each record contains basically -- this is a simplified version of the record. You have a user I.D., unique I.D. of the user, identifying a user. And then a movie I.D. identify which video the user is watching. And then you have a start time and end time and a stop position. Okay. So each viewing record could be just viewing part of a movie and then if you jump to another point in the movie that is a new viewing record. Okay. >> Question: -- as long as they are collected by the -- I'm sorry, as long as they are collected on the client side basis and then uploaded to the server? >> Dah Ming Chiu: These logs are collected, yes, there's some kind of a log server, okay, which is maybe the tracker is doing that job or something. And then the kind where peer periodically send message to this. Okay. Yeah. And collecting this kind of information, as I said, is also an interesting problem. How do you -- I mean probably you cannot afford to collect everything. You probably are doing some kind of sampling and how do you design it so you collect as much information correctly as possible so that's another challenge. 
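For concreteness, here is roughly what one of the simplified trace records described above might look like, along with the unique-viewer heuristic described right after this sketch (counting records that begin at the start of the movie). The field names and types, including the choice of a start position, are my own illustration, not PPLive's actual log format.

```python
from dataclasses import dataclass

@dataclass
class ViewingRecord:
    """One simplified trace entry: a contiguous stretch of viewing.
    Dragging to another point in the movie ends the current record and
    starts a new one.  Names and units here are assumptions."""
    user_id: str          # unique ID identifying the user
    movie_id: str         # which video was being watched
    start_time: float     # wall-clock time the stretch began (epoch seconds)
    end_time: float       # wall-clock time the stretch ended (epoch seconds)
    start_position: int   # offset into the movie where playback began (seconds)

def approx_unique_viewers(records):
    """Early versions of the client had no reliable viewer ID, so unique
    viewers were approximated by counting records whose viewing stretch
    starts at the beginning of the movie (position 0)."""
    return sum(1 for r in records if r.start_position == 0)
```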
So this is -- so after we got some traces from them we look at a bunch of movies and here just they all similar in some sense and here are three typical movies. And you can see that they are -- there's maybe around each one to two hours long and then they -- we show here how many chunks there are and how many viewing records we collected which shows. And we use the fact that if the viewing records start from the beginning of the movie, you know, that sort of identifies a unique viewer. Actually the unique -- the viewer, the I.D. can also identify the viewers, but I think in the beginning -- in the first version of their software they don't have the viewer I.D., a unique viewer I.D., okay, so we have to rely on this sort of position, viewing position to identify the viewers. And you can see the average number of chunks is not that high, okay, it's just one or two or three chunks when people are viewing. And one thing that is very interesting is how much does the user view and what -most of the time the user viewing the whole movie or are they doing a lot of browsing? So as you can see from this, they're actually -- I mean the amount of movie they view looks like the average is very low. I mean it's just a few percent. But if you look at the distribution, what you find out is a lot of times the users are -- because there is a view in your system they are browsing. They look at the first few minutes of a movie and if it's not interesting they go to the next movie and so on. So a lot of the viewing records are short, but there are significant people who actually finish viewing entire movie. For example, the movie "To Here" you can see that there are actually a few thousand users which sort of saw the entire movie because there's a big chunk at the particular point, okay, here. So this figure shows the -- whether there are specific points that people tend to jump to. Okay. So this one actually shows for most of these viewing records people start from the beginning, which is here, significant, the starting point at the beginning and then after that, there is no specific point in the movie that people tend to jump to. They are just random jump to different points. So that's what this figure shows. >> Question: (Inaudible) -- interface doesn't really give you any kind of chaptering information, right? You get into the DVD menus. >> Dah Ming Chiu: Um ->> Question: -- obviously you just have the slider bar. >> Dah Ming Chiu: Yeah, they just are using the slider bar, I think. I think that's why there's no specific place they tend to jump to. If they're a chapter, maybe that's true, yeah. If there's ->> Question: So you could cause their DVD to be, to jump to more predictable (inaudible) change in (inaudible) in a way they probably link. >> Dah Ming Chiu: That's true. That's true. >> Question: (Talking over each other) -- algorithm may become even more interesting in a chapter. >> Dah Ming Chiu: Right. Right. Exactly. Exactly. In their system I think there's no chapter. >> Question: You can experiment and then try and force anchoring points. >> Dah Ming Chiu: You had a question? Okay. Okay. So this one shows the -- you know, how long the peers tend to stay in the system. This is very important if they tend to stay there longer then they can help the server more. So we look at server base of trace records and it shows that the users -- this is the distribution for each day that users staying for greater than two hours or a few minutes. 
More or less pretty flat, and so I think the encouraging thing is that users tend to stay for 15 minutes or more, so that they can actually provide some help. >> Question: So I mean this is the number of unique views, right? >> Dah Ming Chiu: Yeah. >> Question: They basically are in -- >> Dah Ming Chiu: Right. >> Question: -- the system. And I mean essentially you have something like 300 falling in this per day. >> Dah Ming Chiu: Right. Right. Right. This is -- I think this data may be -- >> Question: Something around Christmas time. >> Dah Ming Chiu: Christmas time. Right, right, right. So this is actually the data that was originally submitted to SIGCOMM, and later on we adjusted some of the numbers when we sent in the final version, because some additional data came after the paper was accepted. Yeah. >> Question: You mean the (inaudible) -- describing this -- >> Dah Ming Chiu: No, no, we had some data which was measured in May of 2008 or whatever, for different things. You will see a table later on. >> Question: Higher basically (inaudible) -- saw something. >> Dah Ming Chiu: Um, no, I didn't get any new number-of-users data. I think this is probably quite difficult. I don't know whether the numbers are significantly higher or not; that is more like marketing, and I don't have the latest information on that. >> Question: I think basically, I mean, so look at this VoD, right? >> Dah Ming Chiu: Mmm. >> Question: You are in the (inaudible) column; a number of users watch a bit or part of the movie and they don't finish. >> Dah Ming Chiu: Right. >> Question: How can we interpret this behavior? Presumably you have something like 20 or 30% of the users watching from 15 minutes to an hour; they don't (inaudible) the movie, but they don't finish watching it. >> Dah Ming Chiu: Right, right. I think that is interesting. I think for a VoD system maybe a lot of the time people are just browsing; they just want to kill time. I'm just guessing, I don't know. They are just looking at some movies, not interested, and it is easy to go to the next one. This may be typical behavior, or this may be just the behavior in China, I don't know. If you have better quality and better content maybe you see different behavior, so this measure of behavior is kind of dangerous to interpret, depending on what content you have there and what -- yeah. So... >> Question: One thing of interest is, when you terminate the application, does it just stop, or does it stay in the system as a service that hangs around and still serves other peers? >> Dah Ming Chiu: Of the user, right? >> Question: Yes. Does it actually terminate or (inaudible) continue serving peers? >> Dah Ming Chiu: This part I don't know. I think the way you design your software could cause either one to happen. >> Question: (Talking over each other) -- quit. >> Question: I'm sorry. >> Question: I think they actually quit. >> Question: They actually quit. >> Question: Even so, I know there is a bit of performance difference -- >> Question: Okay. >> Question: -- when this application is running. >> Question: Uh-huh. >> Question: Versus not running, with regard to other applications. >> Question: All right. >> Question: Whether I'm doing web browsing. >> Question: So my question really is more, you have stronger results here for user behavior if the (inaudible) actually quit, right? >> Dah Ming Chiu: Yeah.
>> Question: Then if you quit then it just sneakily goes off and still does stuff in the background, in which case... >> Dah Ming Chiu: No, I think in this case I'm showing you is how users staying in a system watching because this is what is locked, okay. The time they actually hang around, which they may still be helping, I don't know that part. >> Question: Okay. >> Dah Ming Chiu: This is how much they are actually watching and then you can make sure they helped. Okay. So ->> Question: This is not dead time? >> Dah Ming Chiu: Yeah. >> Question: And this is the VoD ->> Dah Ming Chiu: The VoD -- (talking over each other) -- and the other thing on the right is we can see that definitely there's rush hours in a day or prime time. There's certain time of the day you have a lot more users than other time of the day. So this is again for the whole week and you can see that I think for them it's like lunchtime there's a lot of people, you know, and in the evening and then there's not too much after midnight or something. >> Question: Question. >> Dah Ming Chiu: Yeah? >> Question: I mean, what is the vertical access? Doesn't that fall (inaudible) I mean ->> Dah Ming Chiu: Number of continuous users watching a particular movie. So these are just ->> Question: So 200 to 250? >> Dah Ming Chiu: Yeah. >> Question: Okay. And the total number of users is on the order of 300K? >> Dah Ming Chiu: Right. >> Question: I think they have something like 300 channels or something like that, right? >> Dah Ming Chiu: Yeah, yeah. They have something like 1 or 200 popular movies. I think they adjust that number. Sometimes they have 500 movies or whatever. By the way, I must admit that I haven't -- I'm not an avid user of their system. I've probably tried with my students watch part of a movie once, but -so I'm not too familiar with a lot of the reel system myself. >> Question: These statements are only available for ->> Dah Ming Chiu: Huh? >> Question: -- inside a (inaudible) normal users and outside their PPLive compliment you are not going to get distinction because users are -- this is based on (inaudible). Here people try to crawl that data, right? (Inaudible) ->> Question: You can get something not as accurate ->> Dah Ming Chiu: Not as accurate. >> Question: No. So I think this data is released from the company rather than your students doing that a lot, that is impossible. >> Dah Ming Chiu: (Inaudible) definitely. This is given to us by the (inaudible). Yes, yes. The other thing is the, you know, sort of measurement to help to do the replication job. Okay. So we can look at the movie level what is the supply. You know, how many peers have -- all peers who are viewing some other movie who are storing these three movies that we are interested in. We can see that this is actually, you know, out of 200,000 users you have several thousand users sometimes storing the movies that we are looking at. Sometimes a few hundred. And then you look at which chunk ->> Question: Let me ask you a question. >> Dah Ming Chiu: Okay. >> Question: I mean, I assume the date is similar to the previous slide. I mean, you are studying ->> Dah Ming Chiu: Yeah, yeah. This is from the same data set, yeah, the same traces. >> Question: I mean, we notice some basic issue (inaudible) look at the (inaudible). >> Dah Ming Chiu: Uh-huh. >> Question: -- (inaudible) increased by something like (inaudible). You have something like 5 (inaudible). >> Dah Ming Chiu: Uh-huh. Uh-huh. 
>> Question: Supposedly this should mean you should at least have 5000 more users watching this. If you look at previous slide, I mean, the top is something like 300. >> Dah Ming Chiu: Uh-huh. >> Question: How do you -- I mean, why is this discrepancy? >> Dah Ming Chiu: I think, well, I think this number must be dependent on the resolution. I mean if you're looking at any particular ->> Question: I mean if the -- if it's a lot dependent on the revenue it means you have lobbied the content (inaudible). >> Dah Ming Chiu: Right. Right. I think that that's certainly a lot of users that are browsing. And then they could -- so in this case they could be storing just a small fraction of the movie. As long as they have one chunk of the movie they're storing, they show up in this figure. >> Question: (Inaudible) -- versions of the movie (inaudible), right? >> Dah Ming Chiu: Right. You can be storing just portion of the movie and then as far as the tracker is concerned you are still storing the whole movie. Yeah. So this -- the second figure shows you which are the chunks that actually get stored. You can see that the first few minutes get stored a lot more. And however, the sufficient number of copies of all the chunks are there. I mean essentially you get about 30%, always covered, of the peers storing that movie. >> Question: If a movie is a very popular movie, right ->> Dah Ming Chiu: Uh-huh. >> Question: -- at least the number of users (inaudible). >> Dah Ming Chiu: Yeah. >> Question: I mean, after 71(phonetic) chunk you don't have any users storing that chunk. >> Dah Ming Chiu: This is -- this figure actually is because these movies are different length. >> Question: Oh. >> Dah Ming Chiu: So this movie is much shorter so this is always get to the -yeah. >> Question: Okay. >> Dah Ming Chiu: Okay. So this one -- this one is called the availability-to-demand ratio. So this is computing the dividing the how many people are storing the movie sort of storing movie versus how many people are watching it. So -- let's see, availability to demand. I'm not sure whether this demand for availability or availability to demand. Anyway, this is trying to capture the ratio of people watching the movie versus storing the movie. Okay. >> Question: All the study is on the basically storage cycle, basically where I have cash in a portion of the movie versus, I mean ->> Dah Ming Chiu: Right. >> Question: Users asking for that. >> Dah Ming Chiu: Right. >> Question: I saw in the peer (inaudible) it is bandwidth, which is more important in their storage, right? Let's say if I allow the user to catch (inaudible) gigabyte ->> Dah Ming Chiu: Uh-huh. >> Question: I can catch something like 10 (inaudible). I think even today I mean when they catch basic (inaudible) they can catch five minutes. So storage basically amount is not that small. >> Dah Ming Chiu: Uh-huh. >> Question: (Inaudible) -- storage in that (inaudible). >> Dah Ming Chiu: Right. >> Question: More important piece here is how much bandwidth is available. So ->> Dah Ming Chiu: Well, I think the bandwidth part is similar to P2P streaming. I think that the new problem for P2P VoD is the storage. How can you manage the storage so that when you have a user coming in to watch a particular movie, at the same time there may not be other people watching the same movie, but you want to make sure that there are other people storing that same movie that these peers want to watch. >> Question: We can talk this offline ->> Dah Ming Chiu: Yeah, yeah, yeah. 
>> Question: -- but I don't think the storage is that critical. >> Dah Ming Chiu: Okay. Okay. So the next problem, as I said, is the measurement of user satisfaction. This is mainly measured in terms of something called fluency, which is the percentage of time you are actually viewing out of the total time, including the time spent buffering and frozen and all these other times (written out as a formula below). The next figure shows -- so this information is sent from every peer to the logging server at the end of the viewing. The left figure shows how many reports you get; you can see that when there are more users you get more reports, basically. And the right figure is the interesting one: it is the distribution of the fluency you actually get. You want to have them all up around here; then it's good. You can see that a sufficient number of users are happy, which is when the fluency is in the 0.9-to-1 range, so it's high fluency. But there's also a big -- I think this kind of measurement is always bimodal. You have a bunch of users who have a hard time even starting; they are probably doing a lot of buffering and so on and then just give up. That's why you have some users with very low fluency. And then for the rest of the users you should have this kind of increasing curve. So this is a very typical measurement of satisfaction. >> Question: (Inaudible) -- viewing time plus buffering and freezes. >> Dah Ming Chiu: Pardon? >> Question: Fluency is viewing time divided by total time. >> Dah Ming Chiu: Total time, right. >> Question: Total time -- you don't mean the length of the movie, right? >> Dah Ming Chiu: No, no, no. >> Question: You mean viewing time plus buffering plus -- >> Dah Ming Chiu: Yeah, exactly, exactly. Yeah, yeah. >> Question: -- by 50% fluency ratio. >> Question: Right. >> Question: Meaning, I mean, during that time, let's say you watch a 100-minute movie, 80% of the time you are viewing. The other -- (talking over each other) -- >> Dah Ming Chiu: You are watching a commercial or something. They make you watch commercials sometimes when you are -- yeah, yeah. >> Question: This actually is pretty bad performance. I mean -- >> Dah Ming Chiu: Yeah, yeah. >> Question: Yeah, I would think the number of interruptions would be crucial, too, because being interrupted every three frames for one frame would be much worse than if there was a big gap at the beginning. >> Dah Ming Chiu: Right. >> Question: I get the impression, if that's the case, some of its competitors are doing better in the VoD case than this one. >> Dah Ming Chiu: I must say their performance is not that great, but probably passable. Again, this is just a snapshot of performance at a particular point in the deployment. The whole idea of the paper is not to establish an exact benchmark of their performance; it's more about describing the design issues, what the important problems are, and what things to measure and how to measure them. These are the things I'm trying to deliver here. Okay. >> Question: This is -- (inaudible) -- >> Dah Ming Chiu: Yeah. >> Question: And -- >> Dah Ming Chiu: Right. This is (inaudible). >> Question: Something like the May timeframe? >> Dah Ming Chiu: The May timeframe, no, I don't think I have that. This is just showing -- these are typical enough; this may not be related to the other figures.
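To summarize the fluency metric discussed in the exchange above in one line (the symbols are my own choice):

```latex
\[
\text{fluency} \;=\; \frac{T_{\text{viewing}}}{T_{\text{viewing}} + T_{\text{buffering}} + T_{\text{frozen}}}
\]
```

A fluency near 1 means the session played back almost without interruption, while a value near 0 corresponds to the users who never really got started.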
This is just sort of another snapshot, looking at the server and how it is delivering during the day: what is their CPU usage, memory usage, and what kind of machines they're using for the server. So this is not that interesting. Now, this may be interesting to you. Huh? >> Question: A question on this number. Does that include the tracker and the content provider? So, I mean -- >> Dah Ming Chiu: No, I think they have different servers for the tracker and different ones for providing the source. >> Question: So this is just a content provider -- >> Dah Ming Chiu: Yeah, yeah, yeah. -- >> Question: -- server. >> Question: Will they exhibit? >> Question: I think it's actually interesting because it shows that the server is pretty much maxed out at about 70% utilization. If you're actually running a server, you don't want it to hit 70%; it's pretty much maxed out. You're doing it at these two times, the peaks that you see. >> Dah Ming Chiu: Mmm. >> Question: Corresponding to maximum upload rate. >> Dah Ming Chiu: Yeah. Yeah. Whatever you can make out of this, yeah. So these are some new things collected after the paper was accepted. One of the reviewers said, why don't you show us what the typical uplink contributions from different peers are, and downlink and so on. So this was measured in May of 2008. You can see a distribution of what the peers are contributing and how much the server is contributing. This one actually shows the server doing very well -- it's only 8% or something -- and the rest is the distribution over different kinds of peer contributions, so I think this is kind of interesting. >> Question: (Inaudible) -- so the download rate of the peers, I mean, has a (inaudible) distribution, right? So it is really something like 360 kilobits per second. >> Dah Ming Chiu: Right. >> Question: For this movie, right? >> Dah Ming Chiu: Right. >> Question: You have peers downloading -- you have something like 10% of peers downloading above 600 (inaudible). >> Dah Ming Chiu: I don't know why that is, sorry. That's what I don't know. >> Question: Because that usually indicates -- >> Dah Ming Chiu: That's a very small percentage, right? >> Question: 10% of the (inaudible). >> Dah Ming Chiu: Yeah. Yeah. >> Question: And if you look at 360 to 600, depending on how many of the peers are actually close to the 360, you may have a lot of peers -- >> Dah Ming Chiu: Yeah. I think this one maybe should be broken down at a finer granularity. If it's closer to 360, it could be that those peers are seeing a lot of losses or just have to do a lot of retransmission or whatever. That's possible, as well. Yeah. So then the issue is how to measure the server loading. It turns out they define the server loading during prime time. The prime time is determined by looking at the day and taking the busiest two or three hours, because outside prime time you don't care so much: the server is already deployed, and if it serves a higher percentage then, it's okay. It's during prime time that you want to make sure the server is not loaded too much. So that's how it is defined. And as I mentioned earlier, for P2P streaming they told me they can achieve very low server loading, maybe less than 1%. Okay.
I forgot the exact number, but for P2P VoD, initially when the paper was written, the server loading was 20 to 30 percent, and by this time it's around 10%. Yes? >> Question: A lot of servers to go through -- >> Dah Ming Chiu: Yeah. >> Question: Are they geographically distributed, and do they dynamically provision servers -- >> Dah Ming Chiu: That is what I don't know. I don't know the details. I know they have probably placed servers in different ISP networks. They actually work with ISPs to decide where to put the servers. The ISP may even provide them a place to put the server that has high uplink bandwidth and these kinds of things. >> Question: -- (inaudible) manage that (inaudible) -- >> Dah Ming Chiu: This slide has some information about NAT traversal: about 80% of the nodes are behind NAT boxes, and there are maybe three kinds of NAT boxes. They use something like the STUN protocol to measure that. So, the concluding remarks. The main message, as I said earlier, is that this is like a systems paper: we are looking at a relatively large scale P2P VoD deployment at this stage -- later on we'll probably see even larger scale -- and we look at the design and the insight we get from the deployment in the PPLive case. We look at important research problems to study, and we discuss how to do measurement: what kinds of metrics to measure and how to measure them, both for the replication algorithm and for user satisfaction. And that's it. And I think -- >> Question: Let me ask a question about when you track this data. >> Dah Ming Chiu: Okay. >> Question: They tell you within the core of this system (inaudible). It seems to me that, I mean, currently the system has issues if you really want to put all VoD movies on to the system. What I mean is, this may be okay for, let's say, 100 to 500 popular movies. But let's say you (inaudible) website into a VoD server. >> Dah Ming Chiu: Uh-huh. >> Question: Then the number of movies may be pretty large. >> Dah Ming Chiu: Right. >> Question: Looking at the current algorithms, you basically need to track each chunk -- where are the peers holding these chunks, right? >> Dah Ming Chiu: No, no, no. The tracker is only, I think, responsible for telling peers, if you come in and you want to watch a particular channel, a particular movie, which other peers have stored that movie -- not at the chunk level. >> Question: So it's basically on the movie level, not the chunk level. >> Dah Ming Chiu: Yeah, yeah. The chunk level is probably just additional information; they're not necessarily keeping it up to date. >> Question: Okay. >> Dah Ming Chiu: At the chunk level we use gossip. >> Question: Okay. Okay. Okay. >> Dah Ming Chiu: Yeah. The tracker is working at the movie level. >> Question: So the (inaudible) actually beyond the server pieces, so it's basically chunk -- >> Dah Ming Chiu: Right, right. Because you exchange the bitmap, yeah. >> Question: -- (inaudible) a difficult case to handle, but a common one the UI brings out, is people who watch a movie and fast-forward, so they actually only want one frame out of every chunk, or you know -- >> Dah Ming Chiu: Yeah, they don't have this feature. They don't support this. This may be difficult, yeah. >> Question: Well, that works. (Laughter) >> Dah Ming Chiu: They only allow you to jump to a particular point, but not to fast-forward. Yeah. >> Question: Okay. Let me ask one question.
I've heard some speculation that the PPLive system performs well because the -- a lot of vendors subsidize by large open pipes that are like people in universities. All right. Having -- >> Dah Ming Chiu: I think this is true for all the P2P systems. I think especially if you are in China, a lot of the ADSL users, they don't have a lot of uplink bandwidth. >> Question: -- the measurements could either support or disproves this particular speculation. So I know it's speculation, but once it is mentioned I say quantify it. >> Dah Ming Chiu: So this one I think you can see some ->> Question: If you look at this study, I mean the number of kind of peers in the (inaudible). >> Question: Yep. >> Question: So the peers (inaudible), I mean ->> Question: 60%. >> Dah Ming Chiu: About 50%. Having, you know, less than the playback rate roughly. >> Question: Yeah. I ->> Dah Ming Chiu: It's not ->> Question: Okay. Great. >> Question: According to live screening designs ->> Dah Ming Chiu: Uh-huh. >> Question: In which they talk (inaudible) Sitcom Ultra ->> Dah Ming Chiu: Uh-huh. >> Question: They majorly do use basically (inaudible) peers. >> Dah Ming Chiu: Right. >> Question: University peers. Those peer's bandwidth is actually much higher than the one (inaudible). It is almost like 100 megahertz. >> Question: So interesting question is what happens when university network administrators decide this is not going to work and turn it off. >> Dah Ming Chiu: Yeah. I think for this, I mean -- >> Question: (Inaudible) study, you need to inject those studies (inaudible). (Laughter) And also there's difference when there is a (inaudible) downloading because they follow traversal or that traversal (inaudible) or something ->> Question: Usually for the university network we (inaudible) basically (inaudible). They don't have power of traversal (inaudible). >> Question: Well, in any case it's (inaudible). Yes. Okay. Many of the universities can (inaudible). >> Dah Ming Chiu: So do you think we should still go through this part two that -- quickly, maybe about 25. I mean, more about modeling work we're doing. I mean probably some things you already know. >> Question: We have something like 15 minutes. >> Dah Ming Chiu: 15 minutes, I'll quickly flip through them for this. Okay. So I think there's also great academic interest in this P2P area. So the talk I just gave is more like during the system. So during the academic community people are studying basically these two kind of questions. One is: What is the limit? I mean, in terms of theoretically you can do. Second question is sort of what -- how to model the algorithms so you can achieve these limits? I think for people working this field, the secret, as you already know, is for this P2P to work essentially you have to use multiple trees to distribute information and you can build these trees, you know, using this tree-based method of sort of mash, but essentially you provide multiple paths from the server to other peers. >> Question: That's basically (inaudible) -- like in my streaming verse us VoD? >> Dah Ming Chiu: This is streaming. Live streaming. Yeah, yeah. So this is different than when people were doing multi casts. They were focusing on just efficiency, rather than the maximum throughput. The capacity limit basically, you know, you can -- if you don't make these two assumptions it is kind of complex problem because you have to study, you know, how to pack different trees given a physical network. But people typically make these two assumptions. 
One is what is called the uplink-sharing problem. This was mentioned in Mundinger's thesis in 2005. Essentially you assume that the network is not a bottleneck, so only the uplinks of the peers are the bottleneck. You also make a fluid assumption, which is that the content can be divided into as many small pieces as you like. In the limit you can derive the result that the maximum throughput you can achieve is bounded by -- and in fact you can pretty much achieve -- this formula: the minimum of the server's uplink, or the total uplink including the server divided by the number of receivers (written out as a formula below). Some other useful results in the theoretical arena: you can have a stochastic model of the peer population, which is the Qiu and Srikant kind of modeling from 2004. I just want to mention that we also did some work in TyseonP(phonetic) 2006 trying to study, in the theoretical setting, the tradeoff between throughput and fairness of contribution; you can see how different ways of operating achieve different levels of throughput. So for the question of the theoretical capacity limit there is already a rich literature. But what is beginning to happen is that more and more papers focus on modeling the algorithms, because we see a lot of really successful deployments such as PPLive's, and we can ask whether we can have a more rigorous study of the algorithms. For example, basically all the algorithms people are studying are for the mesh network case, which is probably the more challenging case; the tree case is more predictable, so the algorithms tend to be simpler. I think one important result concerns the question of push- versus pull-based algorithms. There is a nice paper on this topic by three gentlemen, Sanghavi, Bruce Hajek, and Massoulié, in the Transactions on Information Theory. Essentially they look at pull-type algorithms versus push. The insight is that push is important in the beginning, for the fresh chunks the server is sending out, because it helps the scalability of the distribution; once there is already a significant number of pieces out there among the peers, the pull method is better. So that paper is very nice in discussing these issues. We also did a paper in ICNP 2007 modeling the streaming case -- that other paper was more about file downloading, looking at the maximum throughput and the delay and so on -- and in the ICNP 2007 paper we created a model to compute continuity. The intuition we got from it is this: you would think that the greedy algorithm, which gets everything more or less sequentially, is the right thing to do, but it turns out that if you want the system to scale it's important to do rarest first. There is a model in the paper, and the insight is very similar to the previous work I just mentioned, because rarest first is essentially very similar to push: if every peer, when pulling, selects rarest first, it's like pushing the new pieces first, giving priority to the new pieces.
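For reference, the uplink-sharing bound stated above can be written out as follows (the symbols are my own choice): with server uplink capacity $u_s$, peer uplink capacities $u_1, \dots, u_n$, and $n$ receivers,

```latex
\[
r_{\max} \;=\; \min\!\left( u_s,\; \frac{u_s + \sum_{i=1}^{n} u_i}{n} \right)
\]
```

That is, the achievable streaming rate is limited either by the server's own uplink or by the average uplink resource available per receiver, whichever is smaller.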
And the important result we have in this paper is that we show a mixed strategy is actually the best: if you use a mixed strategy you can do well in both dimensions. So this is our model. You have n peers, and the server is pushing to the different peers. After every time slot essentially everything shifts over like a sliding window and you play back one piece. The assumption we make, of course -- a simplifying assumption -- is that you have a set of peers which are viewing at the same time, with the same playback buffer size, and a bunch of other assumptions. But if you take these assumptions you can essentially compute the continuity, which is the probability of having a particular piece at the time when you need to play it back, and you can compute this for different piece selection algorithms. So this is showing the sliding window: each peer's buffer is like a sliding window, and in each time slot each peer randomly selects a neighbor and gets a piece. Which piece you get is the piece selection algorithm, and then you can compute, for each time slot, the probability that a particular piece is obtained from your neighbor. Essentially, the probability of having the piece at buffer position i+1 is equal to the probability you had the content at buffer position i in the previous time slot, plus the probability you get that piece in the current time slot. Through the model you can get each of these quantities, and then it becomes a matter of solving a difference equation to get the continuity. Using this method we can study the greedy algorithm and rarest first -- greedy gets the piece closest to playback, rarest first gets the newest piece -- and you can set up the difference equation for each case. The derivation of these difference equations is in the paper, and then we solve them. I won't go through the derivation, but you get numerical results comparing the two piece selection strategies, and you can see that in this case rarest first is doing better. This is focused on the probability p(i) when i equals 40, which is the playback position. Then we show that we can set up the same thing for the mixed strategy, which again reduces to a difference equation, and you can study the mixed strategy. After we published the paper we were actually able to prove theoretically that the mixed strategy is always better than both greedy and rarest first. The mixed strategy basically says we use part of the buffer for rarest first and the other part for greedy. And this is a closer look at how these three strategies do over a period of time; you can see -- it's not very clear, but the mixed one is almost one. Now, one issue in the mixed strategy is that you have to decide how big a portion of the buffer to use for rarest first versus greedy, which is this parameter M. So how do you set this parameter?
So you don't want to have to set this parameter by hand; the right value depends on the population size. You can show that you can actually let it adapt: just take some M in the middle and try to make sure this p(M) achieves some target probability, say 0.3. Because the system is not very sensitive to the target probability you pick, the adaptation finds the right M when the population size changes. More recent work continuing along this line is to relax the many assumptions we made in that model: let different peers have unsynchronized playback, let them use different kinds of start-up algorithms (with a model for the start-up algorithm), and vary the piece selection as well. So that's one area we're working on. The second area, which I already mentioned, is to come up with a model for transmission scheduling, something like the next-generation congestion control for P2P. The third area we're working on is ISP-friendly content distribution. So these are the three areas. Now the concluding remarks; I think I'm wrapping up just in time. There are too many algorithms and variations to study, so this area is still very fertile for researchers. Maybe there needs to be some kind of common simulation platform, like ns but for P2P. We are actually thinking about doing something in this area as well, if there are enough students. As I said, this resource allocation problem is very interesting compared to the congestion control we've been studying for a long time. So these are just some thoughts. Yeah? >> Question: Has anybody looked at sort of blurring the line between this and peer-to-peer file sharing, where I can say what it is that I expect to be watching and the thing can proactively try to get ahead to avoid the stalls? It seems like one of the benefits to the network would be that it would encourage people to leave these applications running whenever their machine is idle. >> Dah Ming Chiu: That's a very interesting question. Yeah. I think you are saying essentially maybe you don't need to study the streaming; it's just file sharing, download the file. >> Question: Well, it seems like you need the on-demand stuff, but it seems like there's also the application for people who just want to download the full movie. >> Dah Ming Chiu: Right. >> Question: And it seems like by mixing the two you might do better than you can with either of the algorithms separately. >> Dah Ming Chiu: Yeah. Yeah. Interesting thought. I -- >> Question: The thing is, while the performance differs, I mean, also requiring a movie, usually a different (inaudible). You want the application to have an idea (inaudible) basically required to, I mean, allow the user to be comfortable using the application. My observation is it's quite informal in China. Actually the (inaudible) here (inaudible) actually a five-peer communication. A lot of things share (inaudible) -- so the majority (inaudible) download. >> Dah Ming Chiu: Uh-huh. >> Question: And then these peer-to-peer streaming applications (inaudible) -- live streaming path. (Inaudible) basically streaming (inaudible) open up something like 420 channels and the viewer can watch each channel. Each of the channels usually you have something like (inaudible).
(Inaudible) in terms of the time they are watching this movie. (Inaudible) actually very light (inaudible) -- >> Dah Ming Chiu: Uh-huh. >> Question: That's basically the live streaming. More recently we are starting to offer a video-on-demand service. That's a new service being offered. >> Question: Right. Yeah, it seems like if you put the multiple services on the same server you might get some benefits; like, file sharing would get a lot slower at peak video-on-demand times (inaudible). The interesting thing is basically you sort of balance. >> Question: Right. >> Question: And I think (inaudible) even today isn't working that well. (Inaudible) actually using the live streaming service (inaudible) simply because, I mean -- >> Question: Well, the things you are trying to do on the machine at the same time. >> Question: It's basically sucking up all the bandwidth (inaudible). >> Question: Right. But if you're about to go to dinner, you'd be happy to let it get ahead while you're gone. >> Question: Yes. >> Dah Ming Chiu: Any other questions? >> Question: I want to go back to the PPLive replication strategy. >> Dah Ming Chiu: Yes. >> Question: So is there a limitation of the replication strategy particular to the fact that you're serving movies, which are kind of large content, as opposed to things like YouTube videos, which might be something like 30 seconds? Do you know to what degree it depends on the length of the movie, and also on the fact that you need all the data for file replication and file (inaudible), whereas for movies you don't (inaudible)? Do you have any sense of how the (inaudible)? >> Dah Ming Chiu: Interesting. I think from the user behavior we see, a lot of the users were actually browsing. So I think at least the browsing part is similar to the short (inaudible). In contrast to their system, if you want to do something like a YouTube system, I think the challenge is more that there are so many videos versus just having only a few hundred (inaudible). So the design, I think, would be quite different. There is a tracker (inaudible). So maybe you could think about doing something that helps you scale up or discover different things. That is very important. >> Jin Li: Any other questions? Thank you. Very, very interesting questions. (Applause)