ShadowStream: Performance Experimentation as a Capability in Production Internet Live Streaming Networks
Presented by: Chen Alexandre Tian (HUST), Richard Alimi (Google), Richard Yang (Yale), David Zhang (PPLive)

Live Streaming is Widely Used
• Many recent major events have been live streamed on the Internet
• Many daily events are streamed as well
• Justin.tv, Livestream, …

State of the Art of Live Streaming Systems
• Hybrid systems (e.g., Adobe Flash 10.1 and later): CDN seeding plus P2P with BitTorrent-like protocols

Performance of Live Streaming Systems Becomes Difficult to Understand/Predict
• System software is becoming more complex

Internet Environment Complexity
• ADSL modem buffers, PowerBoost, inter-ISP throttling, …
• Results are misleading if real network features are not considered

Need Evaluation at the Right Scale
• Results are misleading if the target scale is not considered

Key Idea of ShadowStream
• The production system provides an ideal evaluation platform: real users, real networks, at scale

Starting Point: Use the Experiment Algorithm on Real Users
• First challenge: how to achieve both accuracy and user protection?
• [Figure: the Experiment records a miss at its virtual playpoint; the CDN Protection engine injects the missing pieces before the user playpoint reaches them two seconds later, so the user sees no miss]

Issues of CDN Protection
• Scale: 100,000 clients @ 1 Mbps -> 100 Gbps, with more demand from concurrent test channels
• Network bottlenecks: there can be bottlenecks from CDN edge servers to streaming clients

New Idea: Scaling Up with Stable Protection
• Observation: there already exists a stable version with reasonable performance
• Issue: loss of experiment accuracy

Why the Loss of Accuracy?

Convergence to a Balance Point
• We should observe m(θ0), but instead we actually observe m(θ')

Putting It Together: Cascading Protection for Accuracy and Scalability
• Q: any remaining challenge?

Real User Behaviors Differ from Testing Behaviors
• Idea: transparently orchestrate experimental scenarios from existing, already playing clients
• Virtual arrivals / virtual departures: test specification, triggering, virtual arrival control, virtual departure control

Independent Arrivals Achieving a Global Arrival Pattern
• Peers generate arrival times by drawing random numbers independently according to the same cumulative distribution function (a small sketch of this sampling scheme follows the evaluation slides below)

From Idea to System
• Challenge: how to minimize developers' engineering effort?

Streaming Hypervisor
• Hypervisor API needed by each streaming engine: getSysTime(); getLagRange(), getMaxStartupDelay(); writePiece(), getPieceMap()

Computing Window Bounds
• The Hypervisor calls getLagRange()

Sharing and Information Flow Control

Compositional Software Framework
• Example: adding an admission control component

Evaluation: Experiment Accuracy & Protection
• Compared: only CDN as the protection vs. cascaded protection

Evaluation: Experimental Opportunities
• SH Sports channel and HN Satellite channel, PPLive, September 6, 2010

Evaluation: Accuracy of Distributed Arrivals
• Arrival function from: Sachin Agarwal, Jatinder Pal Singh, Aditya Mavlankar, Pierpaolo Bacchichet, and Bernd Girod, "Performance and Quality-of-Service Analysis of a Live P2P Video Multicast Session on the Internet", in Proceedings of IWQoS 2008, Springer, June 2008
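To make the independent-arrivals mechanism concrete, here is a minimal sketch, assuming Python, a made-up arrival-rate function, and invented helper names (none of this is ShadowStream's actual code): each client tabulates the shared cumulative arrival function F(t) from the class-wide rate λ(t) and draws its own virtual arrival time by inverse-transform sampling, so the aggregate of all independent local draws approximates the global pattern without any coordination.

```python
# Minimal sketch, not ShadowStream's implementation: each client independently draws
# its virtual arrival time from the same cumulative arrival function, so the aggregate
# of all draws approximates the class-wide rate lambda(t).  The example rate function,
# the step size, and all helper names are illustrative assumptions.
import bisect
import random

def build_cdf(rate, horizon, step=1.0):
    """Tabulate F(t) = Lambda(t) / Lambda(horizon) for an arrival-rate function."""
    ts, cum, total = [], [], 0.0
    t = 0.0
    while t <= horizon:
        total += rate(t) * step              # Lambda(t) via a simple Riemann sum
        ts.append(t)
        cum.append(total)
        t += step
    return ts, [c / total for c in cum]      # normalized CDF samples

def draw_arrival(ts, cdf):
    """Inverse-transform sampling: map u ~ Uniform(0,1) through F^-1."""
    u = random.random()
    i = bisect.bisect_left(cdf, u)
    return ts[min(i, len(ts) - 1)]

if __name__ == "__main__":
    # Illustrative flash-crowd-like rate: ramps up for a minute, then decays.
    rate = lambda t: 5.0 + 95.0 * min(t / 60.0, 1.0) * 0.5 ** (max(t - 60.0, 0.0) / 120.0)
    ts, cdf = build_cdf(rate, horizon=600.0)

    # Each already-connected client would run draw_arrival() locally; no coordination
    # is needed to reproduce the specified global arrival pattern.
    arrivals = sorted(draw_arrival(ts, cdf) for _ in range(100_000))
    print("first virtual arrival at %.1f s, last at %.1f s" % (arrivals[0], arrivals[-1]))
```

The design point this illustrates is the one on the orchestration slides: each client only needs the test specification (the rate function), not a central scheduler.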
Take-Home Idea
• Many Internet-scale systems are unique systems that are difficult to build and test
• The ShadowStream scheme consists of the following key ideas:
  - Conduct shadow experiments using the real system and real users
  - Protection and accuracy present dual challenges
  - Use Stable for scalable protection
  - Introduce external resources (CDN) to remove interference on competing resources
  - Create shadow behaviors from real users

Thanks for coming! Questions?

Metric of Live Streaming Performance
• Piece missing ratio

Backup Slides

Streaming on the Internet

Virtual Sliding Window
• A streaming engine has two sliding windows: an upload window (P2P) and a download window (CDN and P2P)
• Each engine calls getSysTime() on the Hypervisor; based on the real system time and the engine's time-shift value, the Hypervisor assigns a virtual system time to each engine
• Each engine calculates x_left and x_right of its download window
• Each engine advances its sliding window at the channel rate μ pieces per second

The Reasoning Behind
• The CDN sees the original miss-ratio/supply-ratio curve
• P2P Protection sees the curve minus δ

Specification
• Define multiple classes of clients (e.g., cable or DSL, estimated upload capacity class, or network location)
• A class-wide arrival rate function λj(t)
• A client's lifetime is determined by the distribution Lx

Local Replacement for Uncontrolled Early Departures
• Capturing client state
• Substitution

Triggering Condition
• Predict(t): an autoregressive integrated moving average (ARIMA) method that uses both recent test channel states and the past history of the same program

Independent Arrivals Algorithm

CDN Capacity and Window Length
• CDN window set to 4 seconds: the TCP retransmission timeout is 3 seconds for a piece loss, plus 1 extra second to wait for the retransmitted piece

Starting Up the Engine
• When starting a streaming engine x, the Streaming Hypervisor gives x pointers to its download and upload windows
• At time a(s), the client joins the test channel and the Stable engine starts
• At time a(e) > a(s), the client joins the testing, and the Experiment Engine and CDN Protection Engine start
• After starting, an engine begins to download pieces from the target playpoint to the end of its download window
• Pieces needed before startup completes should be protected by the CDN; this is counted in the CDN capacity calculation

ShadowStream Outline
• Motivation and Challenge
• Experiment Protection and Accuracy
• Experiment Orchestration
• Implementation
• Evaluation

Client Substitution
• Client substitution delay with client dynamics

Backup Slides

Sec. 8: Limitation Discussion (do we really need this?)
• If the Experiment Engine consumes resources while no piece is received at all (give priority to Protection?)
• Download links are the bottleneck

Modeling P2P Protection
• Given experiment engine e and target rate R, the miss ratio is m_{R,e}(θ), or m_e(θ)
• Given a protection engine p, its target rate is m_e(θ), and the required rescue bandwidth is η(e, p, θ) = Θ_k(m_e(θ), p) · m_e(θ)

P2P Protection Gives No Accurate Result
• If P1 is the protection, there exist balance point(s)
• If P2 is the protection, there is a negative feedback loop
• In either case, there is no accuracy at all (an illustrative fixed-point sketch follows below)
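As a toy illustration of the balance-point argument above (why an experiment repaired by the co-located stable engine is measured at m(θ') rather than m(θ0)), the sketch below iterates the feedback loop in which repairs consume the same upload supply the experiment depends on. The miss-ratio curve, the repair_cost coupling, and all names are invented for illustration; this is not the paper's analytical model.

```python
# Illustrative sketch only: the miss-ratio curve and coupling constant are made up,
# to show why stable (P2P) protection drives the measurement to a balance point
# m(theta') instead of the quantity of interest m(theta0).
def miss_ratio(theta):
    """Hypothetical monotone curve: more available upload supply -> fewer misses."""
    return max(0.0, 0.30 - 0.25 * theta)     # theta = normalized supply ratio

def balance_point(theta0, repair_cost=0.8, iters=50):
    """Fixed-point iteration: repairs by the stable engine consume the same upload
    resources the experiment needs, shrinking theta until the loop settles."""
    theta = theta0
    for _ in range(iters):
        m = miss_ratio(theta)                # misses the stable engine must repair
        theta = theta0 - repair_cost * m     # repairs eat into the shared supply
    return theta, miss_ratio(theta)

if __name__ == "__main__":
    theta0 = 1.0
    true_m = miss_ratio(theta0)              # m(theta0): what we want to measure
    theta_prime, observed_m = balance_point(theta0)
    print("m(theta0) = %.4f, observed m(theta') = %.4f" % (true_m, observed_m))
    # With CDN protection the repairs come from external bandwidth, so theta stays at
    # theta0 and the experiment still observes m(theta0); this is the cascading idea.
```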
Live Streaming
• Live streaming on the Internet: live audio/video content distribution, e.g., NBC Winter Olympics 2010, streamed live using Microsoft Silverlight®
• P2P live streaming

Example: PPLive
• From PPLive's presentation
• Not yet!

PPLive
• Founded by graduate students from Huazhong University of Science & Technology
• PPLive is an online video broadcasting and advertising network that provides an online viewing experience comparable to TV
• An efficient P2P technology platform and test bench
• Estimated global installed base: 75 million
• Monthly active users*: 20 million
• Daily active users: 3.5 million
• Peak concurrent users: 2.2 million
• Monthly average concurrent users: 1.5 million
• Weekly average usage time: 11 hours

Challenges
• How to achieve both experiment accuracy and user protection?
• How to produce the desired experiment pattern?
• How to minimize developers' engineering effort?

Starting Point: Use the Experiment Algorithm on Real Users

A Simple Example
• No user-visible piece misses
• Missing piece 91 is recorded
• Piece download assignment is adaptive

Three Issues
• Information flow control: although piece 91 is downloaded by the Protection Engine, it should not be labeled as downloaded in the Experiment Engine
• Duplicate avoidance: since both the Experiment Engine and the Protection Engine are running, if their download windows overlap they may download the same piece
• Experiment feasibility: the lag from real time is determined when client i joins the test channel with the Protection Engine, to make both experiment and protection feasible
(A small sketch of how the first two issues could be handled behind the hypervisor API follows below.)
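As a rough sketch of how the first two issues could be handled behind the hypervisor API: a shared playback buffer plus per-engine piece maps keeps a repair by the Protection Engine playable without it ever appearing in the Experiment Engine's view. Only the primitive names writePiece() and getPieceMap() come from the slides; the class layout, already_buffered(), and all behavior details are assumptions rather than ShadowStream's actual implementation.

```python
# Minimal sketch, not ShadowStream's actual code: one possible behavior for the
# hypervisor's writePiece()/getPieceMap() primitives.  Class and method bodies are
# illustrative assumptions; only the two primitive names come from the slides.
class StreamingHypervisor:
    def __init__(self):
        self.playback_buffer = {}            # piece index -> data, shared by all engines
        self.piece_maps = {}                 # engine name -> set of piece indices

    def register(self, engine):
        self.piece_maps[engine] = set()

    def write_piece(self, engine, index, data):
        """Store the piece for playback, but record it only in the writer's own map,
        so a repair by the Protection Engine is never labeled as an Experiment
        download (information flow control)."""
        self.playback_buffer.setdefault(index, data)
        self.piece_maps[engine].add(index)

    def get_piece_map(self, engine):
        """Each engine sees only the pieces it downloaded itself."""
        return frozenset(self.piece_maps[engine])

    def already_buffered(self, index):
        """Duplicate-avoidance hook: an engine could consult the shared buffer before
        scheduling a download when its window overlaps another engine's window."""
        return index in self.playback_buffer

if __name__ == "__main__":
    hv = StreamingHypervisor()
    for name in ("experiment", "protection"):
        hv.register(name)

    # Piece 91 is missed by the Experiment Engine and repaired by the Protection Engine.
    hv.write_piece("protection", 91, b"...")
    assert 91 not in hv.get_piece_map("experiment")  # still counted as an experiment miss
    assert hv.already_buffered(91)                   # but playable, so no user-visible miss
    print("piece 91: repaired for the user, recorded as a miss for the experiment")
```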