Cloud Streaming
Jingwen Wang
Video content distribution
90% of all the consumer IP traffic is expected
to consist of video content distribution
 Web video like YouTube, P2P video like BitTorrent
Content distribution requirements:
 Scalable and secure media storage, processing and distribution
 Anytime, anywhere, any device consumption
 Low latency, global distribution
Cloud Provides a Better way
File Transfer
IT Costs
 Current solution for delivering videos: progressive download via CDN
Non-adaptive codec
Video freezes
 WANT: an SVC-based (Scalable Video Coding) video proxy that delivers high-quality
Internet streaming adapting to variable conditions
Video transcoding from original formats to SVC
Video streaming to different users under Internet dynamics
on one processor:
 Video transcoding to SVC is highly complex and transcoding speed
is relatively slow
a long duration before a user can access the transcoded video
video freezes because of unavailability of transcoded video data
enable real-time transcoding and allow scalable
support for multiple concurrent videos:
 Use Cloud: CloudStream
Partition a video into clips and map them to different compute nodes in
order to achieve encoding parallelization
Encoding parallelization:
 Multiple video clips can be mapped to compute nodes at different
 First-task first-server scheme can introduce unbalanced
computation load -> transcoding jitter
 The transcoding component should not speed up video encoding at
the expense of degrading the encoded video quality
Streaming jitter:
 Video clips arrive at the streaming component in batches
 A demand surge for network resources causes some data not to arrive
at the user by the expected arrival time
Metrics affecting Streaming Quality
 Access time
Transcoding and streaming latencies
 Video freezes
Transcoding and streaming jitters
 The temporal motion metric TM
 The spatial detail metric SD
Encoding Parallelization
coding structure:
A video -> non-overlapping coding-independent GOPs
A picture -> layers
A layer -> coding-independent slices
A slice -> macro-blocks
 Across different compute nodes: inter-node parallelism
 Shared-memory address parallelism inside one compute node:
intra-node parallelism
Multi-level parallelization Scheme
encoding parallelization:
 GOPs: have the largest work granularity
 Inter-node parallelism!
 Slices: independent, with a relatively large amount of work
 Intra-node parallelism!
 Each slice on a different CPU
Intra-node Parallelism
 Limit the average computation time spent on a GOP to an
upper bound Tth
 Shorten the access time!
 The minimum number of slices encoded in parallel: Mmin
M: number of slices encoded in parallel in a picture
NMB,i, Nslice,i: number of MBs or slices in the i-th layer of a picture
TMB,i(M), Tslice,i(M): average encoding time of one MB or slice in the i-th layer
with M parallel slices
Tpic,i(M): average encoding time of the i-th layer of a picture
Tpic(M): average encoding time of a picture
Average encoding time of a GOP
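One way to read the definitions above: Mmin is the smallest M whose average GOP encoding time stays at or below Tth. The following is a minimal sketch under that reading. The function name and the measurement table are hypothetical, and the timing values are made up for illustration.

```python
def min_parallel_slices(avg_gop_time, t_th):
    """Return the smallest M whose average GOP encoding time is <= t_th.

    avg_gop_time maps M (slices encoded in parallel) to a measured
    average GOP encoding time in seconds; returns None if no M qualifies.
    """
    for m in sorted(avg_gop_time):
        if avg_gop_time[m] <= t_th:
            return m
    return None


# Illustrative (made-up) measurements: seconds per GOP for M = 1..4.
measured = {1: 4.0, 2: 2.2, 3: 1.6, 4: 1.3}
m_min = min_parallel_slices(measured, t_th=2.0)
print(m_min)
```

If no tested M meets the bound, the sketch returns None, which would signal that intra-node parallelism alone cannot reach Tth.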
Inter-node Parallelism
 Achieve real-time transcoding
 Transcoding jitters introduced by variation of GOP encoding time
 Goal:
 Minimize transcoding jitters
 Minimize the number of compute nodes
Estimation of GOP’s Encoding Time
A multi-variable
regression model
 At a given encoding configuration
 Train on videos with different content characteristics (TM and
SD) to build the regression model
 90% of the predicted values for the test data fall within
10% error
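The slides do not give the form of the regression model. As a minimal sketch, assume a linear model t = a + b*TM + c*SD fitted by ordinary least squares. The linear form, the function names, and the sample values below are all assumptions made for illustration.

```python
def fit_linear(samples):
    """Least-squares fit of t = a + b*TM + c*SD.

    samples is a list of (tm, sd, t) tuples; returns [a, b, c].
    Solves the 3x3 normal equations with Gauss-Jordan elimination.
    """
    rows = [(1.0, tm, sd) for tm, sd, _ in samples]
    ys = [t for _, _, t in samples]
    # Augmented matrix [X^T X | X^T y].
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(3)]
           + [sum(r[i] * y for r, y in zip(rows, ys))]
           for i in range(3)]
    for col in range(3):
        # Partial pivoting keeps the elimination numerically stable.
        pivot = max(range(col, 3), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(3):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * p for x, p in zip(aug[r], aug[col])]
    return [aug[i][3] / aug[i][i] for i in range(3)]


def predict(coef, tm, sd):
    a, b, c = coef
    return a + b * tm + c * sd


def within_error(coef, test_samples, tol=0.10):
    """Fraction of test samples predicted within tol relative error."""
    hits = sum(1 for tm, sd, t in test_samples
               if abs(predict(coef, tm, sd) - t) <= tol * t)
    return hits / len(test_samples)


# Synthetic (TM, SD, encoding time) samples, not measured data.
training = [(10, 20, 0.9), (30, 10, 1.2), (20, 40, 1.3), (50, 30, 1.8)]
coef = fit_linear(training)
accuracy = within_error(coef, training)
```

The within_error helper mirrors the accuracy measure quoted above: the share of predictions that land within 10% of the observed encoding time.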
Problem Formulation
Based on the approximation of each GOP’s encoding time
Given Q jobs
Each job i has a deadline di and a processing time pi
Multiple nodes in parallel; each job is processed without
preemption on its machine until completion
 Lateness li can be computed as ci (actual completion time) – di
 Upper bound of lateness: τ
 WANT: bound the lateness of these jobs -> find the
minimal number of machines N and minimize τ
 HallSh-based Mapping
 Lateness-first Mapping
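To make the lateness definition concrete, here is a minimal sketch that evaluates the maximum lateness of a given job-to-machine mapping. It assumes all jobs are available at time 0 and run back to back on their machine. The function name and the job values are hypothetical.

```python
def max_lateness(schedule, processing, deadline):
    """Return the largest lateness l_i = c_i - d_i over all jobs.

    schedule is a list of per-machine job-id lists; jobs on a machine
    run back to back, without preemption, starting at time 0.
    """
    worst = float("-inf")
    for jobs in schedule:
        clock = 0.0
        for job in jobs:
            clock += processing[job]              # completion time c_i
            worst = max(worst, clock - deadline[job])
    return worst


# Made-up processing times p_i and deadlines d_i for four jobs.
processing = {0: 3.0, 1: 2.0, 2: 4.0, 3: 1.0}
deadline = {0: 3.0, 1: 4.0, 2: 5.0, 3: 6.0}
one_machine = max_lateness([[0, 1, 2, 3]], processing, deadline)
two_machines = max_lateness([[0, 2], [1, 3]], processing, deadline)
```

With these values, one machine gives a maximum lateness of 4.0 and two machines give 2.0, which illustrates the trade-off between N and τ.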
HallSh-based Mapping
 Set an upper bound τ and find the minimal N that satisfies it
 Use the HallSh machine scheduling algorithm as a black box
minMS2approx algorithm
Pick ε = min_i{(di - pi)/τ}
Run HallSh, increasing the number of machines until
the maximum lateness among all jobs is < (1 + ε) * τ,
and set the machine number at this point to be K
HallSh returns the scheduling results of all jobs. For
a job with lateness over the upper bound on a particular
machine j, move it along with all later jobs on machine j
to a new machine K + j. Then compute the new
completion time for all jobs on this new machine
N is the number of used machines
Lateness-first Mapping
 Compute the minimal N based on the deadline of each
job and minimize τ for the given N
 Deciding the minimum N:
Tpic(M) * R < SG * N
 Minimizing τ given N:
For the i-th job in every N jobs, compute its adjusted processing time
p’i = pi – (di – d1)
Sort the N jobs in decreasing order of p’i
Schedule the job with the largest p’i to the first available compute node,
the second largest one to the second available node, and so on
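The τ-minimizing step can be sketched as follows. This is a minimal sketch, not the authors' implementation. It assumes jobs are listed in deadline order, d1 is the deadline of the first job in each batch of N, and a job can start as soon as its node is idle. The function name and the job values are hypothetical.

```python
def lateness_first_mapping(jobs, n_nodes):
    """Map jobs to nodes batch by batch; return (assignment, max lateness).

    jobs is a list of (processing_time, deadline) tuples ordered by deadline.
    """
    free_at = [0.0] * n_nodes            # time at which each node becomes idle
    assignment = {}                      # job index -> node index
    worst = float("-inf")
    for start in range(0, len(jobs), n_nodes):
        batch = list(range(start, min(start + n_nodes, len(jobs))))
        d1 = jobs[batch[0]][1]
        # Adjusted processing time p'_i = p_i - (d_i - d_1).
        adjusted = {i: jobs[i][0] - (jobs[i][1] - d1) for i in batch}
        by_urgency = sorted(batch, key=lambda i: adjusted[i], reverse=True)
        by_idle = sorted(range(n_nodes), key=lambda k: free_at[k])
        # Largest p'_i goes to the earliest-available node, and so on.
        for job, node in zip(by_urgency, by_idle):
            p, d = jobs[job]
            free_at[node] += p
            assignment[job] = node
            worst = max(worst, free_at[node] - d)
    return assignment, worst


# Made-up (processing time, deadline) pairs for four GOP jobs.
jobs = [(3.0, 3.0), (2.0, 4.0), (4.0, 5.0), (1.0, 6.0)]
assignment, tau = lateness_first_mapping(jobs, n_nodes=2)
```

For this input the sketch yields a maximum lateness of 1.0 on two nodes, because the long third job is placed on the node that frees up first.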
Input: 64 480p video GOPs
GOP: 8 pictures
Picture: 4 temporal layers, 2 spatial layers, 1 quality layer
Up to 4 cores on each compute node
Number of slices corresponds to the number of cores
Average encoding time and speedup using up to 4 cores in intra-node parallelism
Comparing LFM & HM
can successfully decide the appropriate compute
node number and limit the transcoding jitters
HM may require a larger N than LFM in order to achieve the
same lateness constraint
Cloud Download
Cloud Utilities to achieve high-quality content
distribution for unpopular videos
 Video content distribution dominates Internet traffic
 High-quality video content distribution is of great significance
-1. high data health
-2. high data transfer rate
Motivation of Cloud Download
data health
 Data health: number of available full copies of the shared file in a
BitTorrent swarm
 Data health < 1.0 is unhealthy
 Use data health to represent data redundancy level of a video file
data transfer rate
 Enables online video streaming
 Live & VoD
State-of-the-art Techniques: CDN
CDN (Content Distribution Network)
 Strategically deploying edge servers
 Cooperate to replicate or move data according to data popularity
and server load
 User obtains copy from a nearby edge server
limited storage and bandwidth
 Not cost-effective for a CDN to replicate unpopular videos to the edge
 Charged facility only serving the content providers who have paid
State-of-the-art Techniques: P2P
 End users forming P2P data swarms
 Data directly exchanged between peers
 Real strength shows for popular file sharing
poor performance for unpopular videos
 Too few peers
Low data health
Low data transfer rate
Neither CDN nor P2P works well in distributing
unpopular videos, due to low data health or low data
transfer rate
The deployment of cloud utilities provides a
novel perspective to solve the problem:
Cloud Download
High data rate!
Cloud Download
A user sends a video request to the cloud
The cloud downloads the requested
video from the file link and stores it in the cloud
The user retrieves the requested video from the cloud at a
high data rate via intra-cloud data transfer
User-side energy Efficiency
download an unpopular video
 A common user keeps his computer (& NIC) powered-on for long
 Much energy is wasted while waiting
download an unpopular video
 The user can just be “offline”
 When the video is ready, retrieve it quickly
 User-side energy efficient!
Cloud Download: View Startup Delay
The only drawback of Cloud Download:
 For some videos, the user must wait for the cloud to download it:
 View startup delay
This drawback is effectively alleviated
 By the implicit and secure data reuse among users
 The cloud only downloads a video when it is requested for the first time
 Cloud cache!
 Subsequent requests directly satisfied
 Secure because oblivious to users
 Data reuse rate -> 87%
System Architecture
[Figure: system architecture. Labels: video request, check cache, data download, data store/cache, data transfer (high data rate)]
Component Function
Proxy: receive & restrict requests in each ISP
Manager: check cache
Dispatcher: load balance
download data
Cache: store and upload data
Hardware Composition
Building Block; # of servers
ISP Proxy: 8 GB; 250 GB; 1 Gbps (Intranet), 0.3 Gbps (Internet)
Task Manager: 8 GB; 250 GB; 1 Gbps (Intranet)
Task Dispatcher: 8 GB; 460 GB; 1 Gbps (Intranet)
460 GB; 1 Gbps (Intranet), 0.325 Gbps
Cloud Cache: 400 chunk servers, 93 upload servers, 3 index servers; 8 GB; 8 GB; 4 TB (chunk server), GB (upload server); 1 Gbps (Intranet), 0.3 Gbps (Internet)
Cache Capacity Planning & Replacement Strategy
0.22M daily requests
 Average video size: 379MB
 Video cache duration: <7 days
 Thus, C = 379MB * 0.22M * 7 = 584TB
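The capacity estimate can be checked with a few lines of arithmetic. This sketch uses the 379 MB average video size and decimal units (1 TB = 10^6 MB). The variable names are chosen here for illustration. The estimate treats every request as storing a separate video, so it is an upper bound.

```python
daily_requests = 0.22e6      # requests per day
avg_video_mb = 379           # average video size in MB
cache_days = 7               # upper bound on cache duration in days

capacity_mb = daily_requests * avg_video_mb * cache_days
capacity_tb = capacity_mb / 1e6      # decimal units: 1 TB = 10^6 MB
print(round(capacity_tb))
```

The product is 583.66 TB, which rounds to the 584 TB planning figure.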
Cache replacement strategies
 17-day trace-driven simulations
 FIFO vs. LRU vs. LFU
 FIFO worst, LFU best!
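A trace-driven comparison of the three policies can be sketched in a few lines. This is a toy replay, not the simulator behind the reported result. The function name and the trace are made up, all items are assumed to have equal size, and LFU here counts every request seen so far, including requests for items not currently cached.

```python
from collections import Counter, OrderedDict


def hit_rate(trace, capacity, policy):
    """Replay a request trace against a cache holding `capacity` items.

    policy is "FIFO", "LRU" or "LFU"; returns the fraction of hits.
    """
    cache = OrderedDict()    # insertion/recency order, oldest first
    freq = Counter()         # request counts, used by LFU only
    hits = 0
    for video in trace:
        freq[video] += 1
        if video in cache:
            hits += 1
            if policy == "LRU":
                cache.move_to_end(video)     # mark as most recently used
            continue
        if len(cache) >= capacity:
            if policy == "LFU":
                # Evict the least requested item (oldest wins ties).
                victim = min(cache, key=lambda v: freq[v])
                del cache[victim]
            else:
                cache.popitem(last=False)    # evict the oldest entry
        cache[video] = True
    return hits / len(trace)


# Made-up request trace in which video "a" is popular.
trace = ["a", "b", "a", "c", "a", "d", "a", "b"]
fifo = hit_rate(trace, 2, "FIFO")
lru = hit_rate(trace, 2, "LRU")
lfu = hit_rate(trace, 2, "LFU")
```

On this tiny trace FIFO evicts the popular video and scores lowest, while LRU and LFU tie. The toy only shows the mechanics of each policy and does not reproduce the full ranking observed on the real trace.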
Performance Evaluation
 Complete running log of the VideoCloud system over 17 days:
Jan. 1, 2011 – Jan. 17, 2011
 3.87M video requests, around 1.0M unique videos
 Data transfer rate
 View startup delay
 Energy efficiency
Data transfer rate & View startup delay
Energy Efficiency
User-side energy efficiency
 E1: users’ energy consumption using common download
 Eu: users’ energy consumption using cloud download
 User-side energy efficiency =(E1 - Eu)/E1 = 92%
Overall energy efficiency
 Ec: the cloud’s energy consumption
 E2: the total energy consumption of the cloud and users, so E2 = Ec + Eu
 Overall energy efficiency = (E1 – E2)/E1 = 86%
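The two ratios imply how the energy splits. With E1 normalized to 100 units, a 92% user-side saving means Eu = 8, and an 86% overall saving means E2 = 14, so Ec = 6. The sketch below checks this arithmetic. The normalization and the function name are choices made here for illustration.

```python
def efficiency(baseline, alternative):
    """Relative energy saving of `alternative` over `baseline`."""
    return (baseline - alternative) / baseline


e1 = 100.0           # common download, normalized to 100 units
eu = 8.0             # users' consumption with cloud download
ec = 6.0             # cloud's consumption
e2 = ec + eu         # total consumption with cloud download

user_side = efficiency(e1, eu)
overall = efficiency(e1, e2)
```

Under this normalization the cloud's own consumption accounts for the 6-point gap between the user-side and overall figures.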
Cloud Download application
Transcoding for mobile users
 Mobile user submits a video link and the transcoding parameters
to the cloud
 The cloud downloads the video from Internet via cloud download
 The cloud transcodes the downloaded video and transfers the
transcoded video back to user
The QQCyclone platform.
