An Open Architecture for Transport-level Protocol Coordination in Distributed Multimedia Applications

DAVID E. OTT and KETAN MAYER-PATEL
Department of Computer Science, University of North Carolina at Chapel Hill

We consider the problem of flow coordination in distributed multimedia applications. Most transport-level protocols are designed to operate independently and lack mechanisms for sharing information with other flows and coordinating data transport in various ways. This limitation becomes problematic in distributed applications that employ numerous flows between two computing clusters sharing the same intermediary forwarding path across the Internet. In this paper, we propose an open architecture that supports the sharing of network state information, peer flow information, and application-specific information. Called simply the Coordination Protocol (CP), the scheme facilitates coordination of network resource usage across flows belonging to the same application, as well as aiding other types of coordination. The effectiveness of our approach is illustrated in the context of multi-streaming in 3D tele-immersion, where consistency of network information across flows both greatly improves frame transport synchrony and minimizes buffering delay.

Categories and Subject Descriptors: C.2.2 [Computer Communication Networks]: Network Protocols—applications; C.2.4 [Computer Communication Networks]: Distributed Systems—distributed applications

General Terms: Design, Algorithms, Performance, Experimentation

Additional Key Words and Phrases: Network protocols, distributed applications, flow coordination

This work is supported by the National Science Foundation ITR Program (Award #ANI-0219780).

1. INTRODUCTION

Future distributed multimedia applications will have increasingly sophisticated data transport requirements and place complex demands on network resources. Where one or two data streams were sufficient in the past, future applications will require many streams to handle an ever-growing number of media types and modes of interactivity. Where the endpoints of communication were once single computing hosts, future endpoints will be collections of communication and computing devices.

Consider, for example, a complex application known as 3D Tele-immersion, or simply 3DTI [Kum et al. 2003]. In this application, a scene acquisition subsystem is comprised of an array of digital cameras and computing hosts set up to capture a remote physical scene from a wide variety of camera angles. Synchronously captured images are multi-streamed to a distributed 3D reconstruction subsystem at a remote location. The subsystem uses pixel correspondence and camera calibration information to extract depth values on a per-pixel basis.
The resulting view-independent depth streams are used to render a view-dependent scene on a stereoscopic display in real time using head-tracking information from the user. Overall, the application allows two remote participants to interact within a shared 3D space such that each feels a strong mutual sense of presence.

3DTI is significant because it represents a more general class of distributed multimedia applications that we call cluster-to-cluster (C-to-C) applications. We define a cluster simply as a collection of computing and communication devices, or endpoints, that share the same local environment. In a C-to-C application, the endpoints of one cluster communicate with the endpoints of a second remote cluster over a common forwarding path, called the cluster-to-cluster data path. Flows from each cluster have a natural aggregation point (AP) (usually a first-hop router) where data converges to the same forwarding agent on the egress path and diverges to individual endpoints on the ingress path. Figure 1 illustrates this model.

Fig. 1. C-to-C application model.

Clusters in a C-to-C application are typically under local administrative control and thus can be provisioned to comfortably support the communication needs of the application. In contrast, the C-to-C data path is shared with other Internet flows and typically cannot be provisioned end-to-end. Hence, it represents a significant source of network delay and congestion for application flows. (Wireless environments have somewhat different assumptions and are not treated here.)

C-to-C application flows exhibit a number of interesting and significant characteristics. These include:

—Independent, but semantically related flows of data. An application may need to prioritize its many streams in a particular way, or divide complex media objects into multiple streams with specific temporal or spatial relationships.

—Transport-level heterogeneity. UDP- or RTP-based protocols, for example, might be used for streaming media while TCP is used for control data.

—Complex adaptation requirements. Changes in available bandwidth require coordinated adaptation decisions that reflect the global objectives of the application, its current state, the nature of various flows, and relationships between flows.

Flows in 3DTI, for instance, stream video frames taken from the same immersive display environment. As such, the data from each flow share strong temporal and geometric relationships that are exploited by the application in reconstructing 3-dimensional space in real time. User interaction may furthermore create additional relationships among flows as the user's head orientation and position change their region of interest. Media streams within that region may transport video data at higher resolutions or with more reliability than streams outside that region.

A central problem facing C-to-C applications is that of flow coordination. Our work in this area has led us to view this problem in two separate but complementary domains: coordinated bandwidth distribution across flows, and context-specific coordination whereby an application defines the semantics of flow coordination according to the specifics of the problem it's trying to solve.
Flows within a C-to-C application share a common intermediary path between clusters. As such, patterns of bandwidth usage and congestion response behavior in individual flows impact directly the performance of peer flows within the same application. In 3DTI, for instance, increases in send rate by one flow may cause congestion events experienced by another flow. Congestion response behavior by the second flow may then result in frame transport asynchrony as streaming rates become unequal and/or measures are taken to retransmit the missing data. (See Section 5.) Ideally, an application would like to make controlled adjustments to some or all of its flows to compensate for changing network conditions and application state. In practice, however, current transport-level protocols available to application designers (e.g., UDP, UDP Lite, UDT, TCP, SCTP, TFRC, RTP-based protocols like RTSP, etc.) operate in isolation from one another, share no consistent view of network conditions, and provide no application control over congestion response behavior. Our goal with respect to the first coordination domain above, then, is to provide support for transport-level protocol coordination such that —Bandwidth is utilized by participant flows in application-controlled ways. —Aggregate traffic is responsive to network congestion and delay. —End-to-end transport-level protocol semantics for individual flows remain intact. The need for coordination, however, goes beyond simply distributing bandwidth among flows in controlled ways. Peer flows of the same application may wish to trade hints to synchronize their transport of semantically related data, coordinate the capture or encoding of media data units, propagate information on application state across all flows, etc. All such cases necessarily rely on context-specific details of the application and the problem it’s trying to solve, including the data types involved and the nature of operations or events to be coordinated. With respect to this second coordination domain, then, we wish to provide a multi-flow application with support for context-specific coordination such that: ACM Transactions on Multimedia Computing, Communications and Applications, Vol. 2, No. 3, 01 2005. An Open Architecture for Transport-level Protocol Coordination · 103 —Flows can be aware of peer flows in the same application. —Self-organizing hierarchies can easily be established among flows. —Information can be exchanged among flows with application-defined semantics. —Context-specific coordination can be achieved in a lightweight and decentralized manner. In both coordination problem domains, we emphasize the need for an open architecture solution. Indeed, our interest is not in providing coordination for a specific problem or application flow scenario as much as providing a toolset by which any arbitrary C-to-C application can implement its own coordination scheme in a way that is specific to its data transport and adaptation requirements. In this paper, we present our solution to the problem of flow coordination in cluster-to-cluster applications, known simply as the Coordination Protocol (CP). Our solution is one that exploits the architectural features of the C-to-C problem scenario, using cluster aggregation points as a natural mechanism for information exchange among flows and application packets as carriers of network probe and state information along the cluster-to-cluster data path. 
We believe that our solution has the power and flexibility to enable a new generation of transport-level protocols that are both network and peer aware, thus providing the tools needed to implement a coordinated response among flows to changing network conditions and application state. The resulting communication performance for the application as a whole can thus far exceed that which is currently possible with today’s uncoordinated transport-level protocols operating in isolation. The organization of this paper is as follows. In Section 2, we discuss related work. In Section 3, we provide a broad overview of the Coordination Protocol (CP). In Section 4, we discuss the issue of aggregate congestion control, including mechanisms for probing the C-to-C data path, estimating available bandwidth, and extending bandwidth estimations for a single flowshare to multiple flowshares. In Section 5, we describe how CP can be applied to solve the problem of synchronized multi-streaming in 3D Tele-immersion (3DTI). Section 6 summarizes our contributions and discusses some future directions. 2. RELATED WORK In this section, we present related work in the areas of aggregate congestion control and state sharing among flows. Our discussion, in part, highlights the void we believe exists in research addressing the issue of flow coordination in distributed multimedia applications. 2.1 Aggregate Congestion Control Several well-known approaches to handling flow aggregates could potentially be useful in the C-to-C application context. One is that of Quality of Service (QoS) for provisioning network resources between clusters [Floyd and Jacobson 1995; Zhang 1995; Georgiadis et al. 1996; Parris et al. 1999]. In particular, consider differentiated services [Black et al. 1998] which associates packets with a particular pre-established service level agreement (SLA). The SLA can take many forms, but generally provides some sort of bandwidth allocation characterized by the parameters of a leaky token bucket. ACM Transactions on Multimedia Computing, Communications and Applications, Vol. 2, No. 3, 01 2005. 104 · David E. Ott and Ketan Mayer-Patel Use of QoS for provisioning aggregate C-to-C application traffic could potentially eliminate the need for aggregate congestion control. Purchasing a service agreement adequate for the peak bandwidth is likely to be prohibitively expensive, however, since a complex C-to-C application will employ many flows and require considerable network resources. A more economical solution would be to use diffserv to provision for the minimum bandwidth required for the lowest acceptable application performance level, and then make use of best effort, shared bandwidth whenever possible. In this case, coordinated congestion control remains an important problem. Furthermore, limitations in bandwidth of any sort imply the need for coordinated bandwidth allocation across flows. QoS by itself provides no framework for accomplishing this. Another approach that might be considered is that of traffic shaping. In this approach, traffic entering the network is modified to conform to a particular specification or profile. In the C-to-C application context, this shaping would most logically be done at the APs. The traffic shaper is charged with estimating an appropriate congestion controlled rate, buffering packets from individual flows, and transmitting packets that conform to the estimated rate and desired traffic shape profile. 
The problem of determining the appropriate aggregate rate remains unsolved in this approach, with our proposed mechanisms described in Section 3 being directly applicable. A more serious problem, however, is that traffic shaping is intended to operate in a transparent manner with respect to individual flows. While potentially a feature when flows are unrelated and relative priorities static, C-to-C application flows require information on network performance (i.e., available bandwidth, loss rates, etc.) to make dynamic adaptation decisions that also take into account semantic relationships among flows, changing priority levels, and salient aspects of application state. For example, a flow may adjust its media encoding strategy at key points given changes in available bandwidth and a particular user event. Another approach to handling flow aggregates is TCP trunking as presented by [Kung and Wang 1999]. In this approach, individual flows sending data along the same intermediary path are multiplexed into a single “management connection” in order to apply TCP congestion control over the shared path. The common connection, or trunk as the authors refer to it, provides aggregate congestion control without restricting the participating transport-level protocols used by individual flows to TCP. The drawbacks to this particular approach for the C-to-C application context are numerous. As with traffic shaping, TCP trunking is transparent and thus fails to inform application endpoints of network performance (available bandwidth, loss rates, etc.). This prevents smart adaptation. Second, TCP-trunking reduces aggregate bandwidth to a single flowshare over the C-to-C data path, something we argue in Section 4 unfairly restricts bandwidth for a multi-flow application sharing a bottleneck link with other Internet flows. Third, the approach increases network delay as application packets are buffered at the trunk source waiting to be forwarded in a congestion controlled manner. Finally, the approach once again fails to provide a framework for coordinated bandwidth allocation across flows. Finally, the congestion manager (CM) architecture, proposed in [Balakrishnan ACM Transactions on Multimedia Computing, Communications and Applications, Vol. 2, No. 3, 01 2005. An Open Architecture for Transport-level Protocol Coordination · 105 et al. 1999], provides a compelling solution to the problem of applying congestion control to aggregate traffic where flows share the same end-to-end path. The CM architecture consists of a sender and a receiver. At the sender, a congestion controller adjusts the aggregate transmission rate based on its estimate of network congestion, a prober sends periodic probes to the receiver, and a flow scheduler divides available bandwidth among flows and notifies applications when they are permitted to send data. At the receiver, a loss detector maintains loss statistics, a responder maintains statistics on bytes received and responds to CM probes, and a hints dispatcher sends information to the sender informing them of congestion conditions and available bandwidth. An API is presented which allows an application to request information on round trip time and current sending rate, and to set up a callback mechanism to regulate send events according to its apportioned bandwidth. In many ways, the work presented in this paper represents our proposal for applying CM concepts to the C-to-C application model. 
We agree with CM’s philosophy of putting the application in control, though for CM this means allowing unrelated flows to know the individual bandwidth that is available to them, while for C-to-C applications it means allowing endpoints to know the aggregate bandwidth available to the application. Furthermore, we believe CM’s notion of using additional packet headers for detecting loss and identifying flows is a good one, and this is reflected in our own architecture as described in Section 3. On the other hand, applying the CM architecture to the C-to-C application context is not without its problems and issues. First, CM’s use of a flow scheduler to apportion bandwidth among flows is problematic in the C-to-C context for many of the same reasons given in our discussion of traffic shaping. Likewise, CM’s callback structure for handling application send events is difficult to implement in the Cto-C application context. This is because in CM, flows share the entire end-to-end path. That is, individual flows comprising an aggregate in CM share the same endpoint hosts. For senders on the same host, a callback architecture is reasonably implemented as a simple system call provided by the OS. In contrast, individual flow endpoints of a C-to-C application commonly reside on different computing and communication devices. A callback scheme using send notification messages from the AP to various application endpoints would result in too much communication overhead, making it impractical. Finally, CM is designed to multiplex a single congestion responsive flowshare among application flows sharing the same endto-end path. Again, as in the multiplexing approach, it may be undesirable to constrain a C-to-C application which commonly employs a large number of flows to a single flowshare. 2.2 State Sharing Among Flows Active networking, first proposed by [Tennenhouse and Wetherall 1996] allows custom programs to be installed within the network. Since their original conception, a variety of active networking systems have been built ([Wetherall 1999; Alexander et al. 1997; Decasper et al. 1998], for instance). They are often thought of as a way to implement new and customized network services. In fact, state sharing among C-to-C application flows could be implemented within an active networking framework. Active networking, however, mandates changes to routers along the ACM Transactions on Multimedia Computing, Communications and Applications, Vol. 2, No. 3, 01 2005. 106 · David E. Ott and Ketan Mayer-Patel entire network path which severely hinders deployment. In contrast, the solution described in Section 3 requires changes only at the aggregation points which are under local administrative control. [Calvert et al. 2002] describes a lightweight scheme that allows IP packets to manipulate small amounts of temporary state at routers using an ephemeral state store (ESS). State in this scheme is stored and retrieved using (tag, value) pairs where tags are randomly chosen from a very large tag space, thus minimizing the chance of collisions. An instruction set is provided that allows packets to operate on temporary state in predefined ways and using additional parameter values. Some example operations include counting, comparing, and tree-structured computations. Ephemeral state processing (ESP) provides a flexible scheme for solving such problems as multicast feedback thinning, data aggregation across participant flows, and network topology discovery. 
The scheme presented in Section 3 shares much in common with the ESS approach. Both present open architectures and support the exchange of soft state between flows with arbitrary, application-defined semantics. Furthermore, both provide operations that allow state to be aggregated across flows in various ways. Unlike ESS, however, the approach of Section 3 relies on enhanced forwarding services only at first- and last-hop routers. In addition, it extends the state table notion by providing state information in addition to storing/retrieving deposited state. This information includes both shared network path state (obtained through probing mechanisms) and application flow state. While ESS presents a general infrastructure spread throughout the network, our approach is more tightly coupled with the cluster-to-cluster application architecture and more deployable in that cluster aggregation points are under local administrative control and no additional support is required within the network.

3. COORDINATION PROTOCOL

In this section, we outline our proposed solution to the problem of flow coordination in C-to-C applications. Our focus here will be on giving a broad overview. Mechanisms related to aggregate congestion control are treated in greater detail in Section 4.

3.1 Overview

The CP protocol architecture is designed with several goals in mind:

—To inform endpoints of network conditions over the cluster-to-cluster data path, including aggregate bandwidth available to the application as a whole,

—To provide an infrastructure for exchanging state among flows and allowing an application to implement its own flow coordination scheme, and

—To avoid the problems of centralized adaptation by relying on individual endpoints rather than scheduling or policing mechanisms at aggregation points.

To realize these goals, CP makes use of a shim header inserted by application endpoints into each data packet. Ideally, this header is positioned between the network layer (IP) header and transport layer (TCP, UDP, etc.) header, thus making it transparent to IP routers on the forwarding path and, at the same time, preserving end-to-end transport-level protocol semantics. A UDP-based implementation, however, is also possible, placing the CP header in the first several bytes of UDP application data. This obviates the need for endpoint OS changes and makes the protocol more deployable.

Fig. 2. CP packet header format (shown from endpoint to AP, from AP to AP, and from AP to endpoint).

CP mechanisms are largely implemented at each aggregation point (AP) where there is a natural convergence of flow data to the same forwarding host. This may be the cluster's first-hop router, or a forwarding agent in front of the first-hop router. As mentioned in Section 1, an AP is part of each cluster's local computing environment and, as such, is under local administrative control. The CP headers of packets belonging to the same C-to-C application are processed by the AP during packet forwarding.
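To make the shim header concrete, the following is a minimal sketch of how an endpoint might assemble the outbound (endpoint-to-AP) CP header. The exact field widths, ordering, and flag semantics are not spelled out in the text, so the layout below is an assumption for illustration; only the four operation fields (an 8-bit address paired with a 24-bit value, described in Section 3.2.1) follow the paper directly.

```python
# Illustrative sketch only: field widths and ordering are assumed, not the
# actual CP wire format. Python's struct module handles the byte packing.
import struct

def pack_cp_header(app_id, flow_id, proto_id, flags, operations):
    """Build an outbound CP header carrying up to four (address, value)
    operation fields: an 8-bit state table address and a 24-bit value each."""
    if len(operations) > 4:
        raise ValueError("CP carries at most four operation fields")
    # Assumed fixed-size preamble: 32-bit application id, 8-bit flow id,
    # 8-bit protocol id, 16-bit flags.
    header = struct.pack("!IBBH", app_id, flow_id, proto_id, flags)
    ops = list(operations) + [(0, 0)] * (4 - len(operations))  # zero-fill unused slots (assumed no-op encoding)
    for addr, val in ops:
        header += struct.pack("!I", ((addr & 0xFF) << 24) | (val & 0xFFFFFF))
    return header

# Example: flow 7 of application 1 deposits the value 42 at general purpose
# address 10 (so the AP will set cell 10.7 to 42).
hdr = pack_cp_header(app_id=1, flow_id=7, proto_id=17, flags=0,
                     operations=[(10, 42)])
```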
Essentially, the AP uses information in the CP header to maintain a per-application state table. Flows deposit information into the state table of their local AP as packets traverse from an application endpoint through the AP and on toward the remote cluster. Packets traveling in the reverse direction pick up entries from the state table and report them to the transport-level protocol layered above CP and/or to the application.

In addition, the two APs conspire to measure characteristics of the C-to-C data path such as round trip time, packet loss rate, available bandwidth, etc. These measurements are made by exchanging probe information via the CP headers available from application packets traversing the data path in each direction. Measurements use all packets from all flows belonging to the same C-to-C application and thus monitor network conditions in a fine-grained manner. Resulting values are inserted into the state table.

Report information is received by an application endpoint on a per-packet basis. This information can take several forms, including information on current network conditions on the C-to-C data path (round trip time, loss, available bandwidth), information on peer flows (number of flows, aggregate bandwidth usage), and/or application-specific information exchanged among flows using a format and semantics defined by the application. An endpoint uses a subset of available information to make send rate and other adjustments (e.g., encoding strategy) to meet application-defined goals for network resource allocation and other coordination tasks.

It's important to emphasize that CP is an open architecture. Its role is to provide information “hints” useful to application endpoints in implementing their own self-designed coordination schemes. In a sense, it is merely an information service piggybacked on packets that already traverse the cluster-to-cluster data path. As such, aggregation points do no buffering, scheduling, shaping, or policing of application flows. Instead, coordination is implemented by the application, which must configure endpoints to respond to CP information with appropriate send rate and other adjustments that reflect the higher objectives of the application.

Figure 2 illustrates the header and its contents at different points on the network path. Figure 3 summarizes CP operation by tracing a packet traversing the path between source and destination endpoints.

Fig. 3. CP operation: (1) the source endpoint writes information into the CP header identifying the C-to-C application and flow, and specifying any state it wishes to deposit at the local AP; (2) the local AP deposits incoming state into the application state table and then overwrites the CP header with network probe information; (3) the remote AP uses incoming probe information to measure delay and loss before overwriting the CP header with report information from the state table; (4) the destination endpoint uses incoming report information to make adaptation decisions using a coordination scheme defined by the application.

3.2 AP State Tables

An AP creates a state table for each C-to-C application currently in service that acts as a repository for network and flow information, as well as application-specific information shared between flows in the C-to-C application.
The organization of a state table is as follows:

—The table is a two-dimensional grid of cells, each of which can be addressed by an address and an offset. (We will use the notation address.offset when referring to particular cells.)

—There are 256 addresses divided into four types: report pointers, network statistics, flow statistics, and general purpose addresses.

—For each address, 256 offsets are defined. The value and semantics of the particular cell located by the offset depend on the address context.

Each cell in the table contains a 24-bit value. Our current implementation uses four bytes per cell to align memory access with word boundaries, making the state table a total of 256 KB in size. Even with a number of concurrent C-to-C applications, tables can easily fit into AP memory.

An endpoint may read any location (address.offset) in the table by using the report address mechanism described below. In contrast, an endpoint may only write specific offsets of the report and general purpose addresses; network and flow statistic addresses are assigned by the AP and are read-only. The state table is illustrated in Figure 4.

Fig. 4. CP state table maintained at each AP (addresses GP1–GP250, R1–R4, NET, and FLOW; offsets 0–255).

3.2.1 Setting Cells of the State Table. The CP header of outgoing packets can be used to set the value of up to 4 cells in the state table. When an outbound packet (i.e., a packet leaving a cluster on its way toward the other cluster) arrives at the AP, the CP header includes the following information:

—The flow id (fid) of the specific flow to which this packet belongs. Each flow of the application is assigned a unique fid in the range [0, 63]. Assigning fids to flows may be handled in a number of different ways and is an issue orthogonal to our concerns here.

—Four “operation” fields which are used to set the value of specific cells in the state table. Each operation field is comprised of two parts. The first is an 8-bit address (Addr_i) and the second is a 24-bit value (Val_i). The i subscript is in the range [0, 3] and simply corresponds to the index of the 4 operation fields in the header. Figure 2 illustrates this structure.

When an AP receives an outbound packet, each operation field is interpreted in the following way. The cell to be assigned is uniquely identified by Addr_i.fid. The value of that cell is assigned Val_i. In this manner, each flow is uniquely able to assign one of the first 64 cells associated with that address.

Although the address specified in the operation field is in the range [0, 255], not all of these addresses are valid for writing (i.e., some of the addresses are read-only). Similarly, since a flow id is restricted to the range [0, 63], in fact only 64 of the offsets associated with a particular writable address can be set. As previously mentioned, the address space is divided into four address types. The mapping between address range and type is illustrated in Figure 4. The semantics of a cell value at a particular offset depends on the specific address type and is described in the following subsections.
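The write path just described can be sketched as follows: a per-application table of 256 × 256 cells of 24-bit values, where each operation field (Addr_i, Val_i) carried by an outbound packet of flow fid assigns the cell Addr_i.fid. The numeric positions of the writable addresses are not given in the text and are assumed here purely for illustration.

```python
# Sketch of the AP-side state table write path (Section 3.2.1). The numeric
# ranges assigned to each address type are assumptions; the text only names
# the four types (report pointers, NET, FLOW, general purpose).
NUM_ADDRESSES = 256
NUM_OFFSETS = 256
MAX_FID = 63

# Assumed address layout: GP1..GP250 at 0..249, R1..R4 at 250..253,
# NET at 254, FLOW at 255.
WRITABLE_ADDRESSES = set(range(0, 254))   # general purpose addresses and report pointers
NET_ADDR, FLOW_ADDR = 254, 255            # read-only, maintained by the AP itself

class StateTable:
    """One table per C-to-C application, held at the aggregation point."""
    def __init__(self):
        self.cells = [[0] * NUM_OFFSETS for _ in range(NUM_ADDRESSES)]

    def apply_outbound_ops(self, fid, operations):
        """Apply the up-to-four (addr, val) operation fields from an outbound
        packet of flow fid: cell addr.fid is assigned val."""
        if not 0 <= fid <= MAX_FID:
            return
        for addr, val in operations:
            if addr in WRITABLE_ADDRESSES:    # NET and FLOW cannot be written by flows
                self.cells[addr][fid] = val & 0xFFFFFF

# Example: flow 7 deposits 42 at general purpose address 10.
table = StateTable()
table.apply_outbound_ops(fid=7, operations=[(10, 42)])
assert table.cells[10][7] == 42
```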
3.2.2 Report Pointers. Four of the writable addresses are known as report pointers. Using the mechanism described above, each flow is able to write a 24-bit value into Rj.fid, where Rj is one of the 4 report pointers (i.e., R1, R2, R3, and R4 in Figure 4) and fid is the flow id. The value of these 4 cells (per flow) controls how the AP processes inbound packets (i.e., packets arriving from the remote cluster) of a particular flow.

When an inbound packet arrives, the AP looks up the value of Rj.fid for each of the four report addresses. The 24-bit value of the cell is interpreted in the following way. The first 8 bits are interpreted as a state table address (addr). The second 8 bits are interpreted as an offset (off) for that address. The final 8 bits are interpreted as a validation token (vid). The AP then copies into the CP header the 24-bit value located at addr.off concatenated with the 8-bit validation token vid. This is done for each of the four report fields.

Thus, outbound packets of a flow are used to write a value into each of four report pointers, R1 through R4. These configure the AP to report values in the state table using inbound packets. The validation token has no meaning to the AP per se, but can be used by the application to help disambiguate between different reports.
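Continuing the state table sketch above, report processing for an inbound packet might look as follows. The decoding of a 24-bit pointer into an 8-bit address, 8-bit offset, and 8-bit validation token follows the text; the numeric addresses assumed for R1 through R4, and the offset assumed for NET.bw, carry over from the previous sketch and remain assumptions.

```python
# Sketch of inbound report processing (Section 3.2.2), reusing the StateTable
# sketch above. The numeric addresses of R1..R4 are assumed.
R_ADDRS = (250, 251, 252, 253)            # assumed addresses of R1..R4

def fill_reports(table, fid):
    """For an inbound packet of flow fid, return the four (value, vid) report
    fields that the AP copies into the CP header."""
    reports = []
    for r in R_ADDRS:
        pointer = table.cells[r][fid]     # 24-bit report pointer set earlier by the flow
        addr = (pointer >> 16) & 0xFF     # first 8 bits: state table address
        off = (pointer >> 8) & 0xFF       # next 8 bits: offset within that address
        vid = pointer & 0xFF              # final 8 bits: validation token
        reports.append((table.cells[addr][off], vid))
    return reports

# Example (continuing from the previous sketch): flow 7 asks R1 to report
# NET.bw, assumed here to live at NET offset 2, with validation token 0x5A.
table.apply_outbound_ops(fid=7, operations=[(250, (254 << 16) | (2 << 8) | 0x5A)])
print(fill_reports(table, fid=7)[0])      # -> (current NET.bw cell value, 0x5A)
```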
3.2.3 Network Statistics. One of the addresses in the table is known as the network statistics address (NET). This is a read-only address. The offsets of this address correspond to different network statistics about the C-to-C data path as measured by the APs across the aggregate of all flows in the C-to-C application, including:

—Round trip time (NET.rtt)

—Loss (NET.loss)

—Bandwidth available (NET.bw)

This is merely a partial list. In fact, up to 256 network-related statistics are potentially available using the NET address. NET.bw provides an estimate of the bandwidth available to a single TCP-compatible flow given the current round trip time, packet loss rate, average packet size, etc. How this estimate is calculated, and how the value can be scaled to n application flows, is described in Section 4.

3.2.4 Flow Statistics. While statistics characterizing the C-to-C data path are available through the NET address, statistics characterizing the application flows themselves are provided by offsets of the flow statistics address FLOW. Offsets of this address include information about:

—Number of active flows (FLOW.num)

—Throughput (FLOW.tput)

—Average packet size (FLOW.pktsize)

Again, this is merely a partial list and up to 256 different statistics can be provided.

3.2.5 General Purpose Addresses. The general purpose addresses (i.e., GP1 through GP250) in Figure 4 give a cluster-to-cluster application a set of tools for sharing information in a way that facilitates coordination among flows. For example, general purpose addresses may be used to implement floor control, dynamic priorities, consensus algorithms, dynamic bandwidth allocation, etc. General purpose addresses may also be useful in implementing coordination tasks among endpoints not directly related to networking.

Offsets for each general purpose address are divided into two groups: assignable flow offsets and read-only aggregate function offsets. We have already discussed how the offsets equal to each flow id can be written by outbound packets of the corresponding flow. These are the flow offsets. While this accounts for the first 64 offsets of each general purpose address, the remaining 192 offsets are used to report aggregate functions of these first 64 flow offsets. Some examples are:

—Statistical offsets for functions such as sum, min, max, range, mean, and standard deviation.

—Logical offsets for functions such as AND, OR, and XOR.

—Pointer offsets. For example, the offset of the minimum value, the offset of the maximum value, etc.

—Usage offsets. For example, the number of assigned flow offsets or the most recently assigned offset.

Operations are implemented using lazy evaluation for efficiency. Which operations to include is an area of active research. Flow offsets are treated as soft state and time out if not refreshed.
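As a rough illustration of how the read-only aggregate function offsets could be realized, the sketch below computes a few of the listed functions over the assigned flow offsets of one general purpose address, evaluating them only when read (the lazy evaluation mentioned above). The specific offset numbers chosen for the sum, min, max, and usage-count functions are assumptions; the paper does not fix which offset holds which function.

```python
# Sketch of aggregate function offsets for one general purpose address
# (Section 3.2.5). Offset assignments above 63 are assumed.
SUM_OFF, MIN_OFF, MAX_OFF, COUNT_OFF = 64, 65, 66, 67

def read_gp_offset(cells, offset, assigned_fids):
    """cells: the 256 offsets of one general purpose address.
    assigned_fids: flow ids whose soft-state flow offsets have not timed out."""
    if offset <= 63:                       # plain (per-flow) offset
        return cells[offset]
    values = [cells[fid] for fid in assigned_fids]
    if offset == SUM_OFF:
        return sum(values) & 0xFFFFFF      # assume aggregates wrap to 24 bits
    if offset == MIN_OFF:
        return min(values, default=0)
    if offset == MAX_OFF:
        return max(values, default=0)
    if offset == COUNT_OFF:                # usage offset: number of assigned flow offsets
        return len(values)
    return 0                               # remaining aggregate offsets omitted here

# Example: three flows deposited 5, 9, and 2; a reader of MAX_OFF sees 9.
gp = [0] * 256
for fid, v in [(0, 5), (1, 9), (2, 2)]:
    gp[fid] = v
assert read_gp_offset(gp, MAX_OFF, {0, 1, 2}) == 9
```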
3.3 Implementing Flow Coordination

While CP provides network and flow information, as well as facilities for exchanging information, it is up to the cluster-to-cluster application to exploit these services to achieve coordination among flows. The details of how an application goes about this may vary widely since much depends on the specifics of the problem an application is trying to solve. Most, however, will want to employ some type of CP-enabled transport protocols that can be configured to participate in one or more application-specific coordination schemes.

3.3.1 CP-enabled Transport Protocols. A CP-enabled transport-level protocol provides data transport services to an application such that the flow management is

—Network aware.

—Peer flow aware.

—Coordination policy aware.

By “coordination policy aware” we mean that individual adaptation decisions made by the transport-level protocol reflect the larger context of flow coordination as defined by the application. In general, we believe that the expanded operational and informational context of transport-level protocols in the CP problem domain represents a rich frontier for future research.

Whether a CP-enabled transport-level protocol is implemented as an application library or an operating system service depends on the implementation details of the CP header. In Section 3.1, we noted that the CP header logically fits between the network (IP) and transport (TCP, UDP, etc.) layers, but that a UDP-based implementation is likewise possible. A transport-level service API could provide a fairly seamless substitute for the current TCP/IP socket interface, providing additional options for setting cluster and flow id values. Or, it could be designed to pass various types of state table information directly to the application, for example, to help regulate media encoding adaptation. Still other transport-level protocols might provide simply a thin layer of mediation for an application to both read and write values from a local AP's state table, for example, using the information to coordinate media capture events across endpoints.

A principal function of a CP-enabled transport-level protocol is to use both CP information and application configuration to regulate a flow's sending rate as network conditions change on the shared C-to-C data path. How this configuration is accomplished and the degree of transparency to the application are both left to the protocol designer.

3.3.2 Coordination Schemes. A flow “coordination scheme” is an abstraction used by C-to-C application designers to specify the objectives of coordination and the dynamic behavior of individual flows in realizing that objective. The details of a given coordination scheme depend entirely on the problem an application is trying to solve. For example, some applications may employ a centralized control process which interprets changing network information and periodically sends configuration messages to each endpoint using CP state sharing mechanisms. Still others may employ a decentralized approach in which endpoints independently evaluate application and network state information and make appropriate adjustments.

Much of our work thus far has focused on coordination schemes that apportion bandwidth across flows in a decentralized way. An important point to note here is that aggregate bandwidth available to the application as a whole (equal to CP's bandwidth available estimate for a single flow times the number of active flows in the application) may be distributed across endpoints in any manner. That is, it is not necessarily the case that a given application flow receives exactly 1/n of the aggregate bandwidth in an n-flow application. In fact, an application may apportion bandwidth across endpoints in any manner as long as the aggregate bandwidth level (n ∗ NET.bw) is not exceeded. We believe this to be a powerful feature of our protocol architecture with the potential to dramatically enhance overall application performance in a wide variety of circumstances.

In addition to bandwidth distribution, an application may use CP mechanisms to perform one or more types of context-specific coordination as well. That is, an application may use CP state exchange mechanisms to achieve coordination for any arbitrary problem. Some examples include leader election, fault detection and repair, media capture synchronization, coordinated streaming of multiple data types, distributed floor control, dynamic priority assignment, and various types of group consensus.

3.3.3 Examples. Here we provide a couple of brief examples showing coordinated bandwidth distribution across flows. Examples are “miniature” in the sense that realistic C-to-C applications are likely to have many more flows and networking requirements that are more complex and change dynamically. Nonetheless, they serve to illustrate how CP information can be used to coordinate flows in a decentralized manner.

Example 1. Flows A, B, and C are always part of the same cluster-to-cluster application, but flows D and E join and leave intermittently. Each requests NET.bw reports to inform them of the estimated bandwidth available to a single application flow. In addition, they request FLOW.num reports that tell them how many flows are currently part of the application. Since the application is configured to run at no more than 3 Mbps, each flow sends at the rate R = min(3 Mbps/FLOW.num, NET.bw).

Example 2. Flow A is a control flow. Flows B and C are data flows. All flows request NET.bw and GP1.fid(A), which inform them of the value flow A has assigned to general purpose address 1 at the offset equal to its flow id. When running, the application has two states defined by the value flow A has assigned to GP1.fid(A): NORMAL (GP1.fid(A) = 0), which indicates normal running mode, and UPDATE (GP1.fid(A) = 1), which indicates that a large amount of control information is being exchanged to update the state of the application. During NORMAL, A sends at the rate R = (3 ∗ NET.bw) ∗ 0.1 while B and C each send at no more than R = (3 ∗ NET.bw) ∗ 0.45. During UPDATE, A sends at the rate R = (3 ∗ NET.bw) ∗ 0.9 while B and C each send at no more than R = (3 ∗ NET.bw) ∗ 0.05.
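A sketch of how an endpoint in Example 2 might turn its per-packet reports into a send rate is shown below. The rate fractions come directly from the example; the helper that extracts NET.bw and GP1.fid(A) from the most recent report is hypothetical, standing in for whatever CP-enabled transport interface the application uses.

```python
# Sketch of the decentralized controller in Example 2. latest_reports() is a
# hypothetical helper standing in for the CP-enabled transport's report API.
NORMAL, UPDATE = 0, 1
NUM_FLOWS = 3                              # flows A, B, and C

def target_rate(my_flow, net_bw, mode):
    """Return this endpoint's send rate given the current single-flowshare
    estimate net_bw and the mode flag published by control flow A."""
    aggregate = NUM_FLOWS * net_bw         # bandwidth available to the application as a whole
    if mode == NORMAL:
        fraction = 0.10 if my_flow == "A" else 0.45
    else:                                  # UPDATE: control flow A gets most of the aggregate
        fraction = 0.90 if my_flow == "A" else 0.05
    return aggregate * fraction

# Typical use inside flow B's send loop (both helpers are hypothetical):
# net_bw, mode = latest_reports()
# sender.set_rate(target_rate("B", net_bw, mode))
```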
These simple examples help to illustrate some of the advantages of the CP state table mechanism. Distributed local decisions can be made in informed ways that result in the appropriate global behavior, using AP state table information piggybacked on packets that are already being sent and received as part of the application. Aggregate measures of application performance that can be effectively gathered only at the AP, and not at any one end host, are made available to the application. AP performance is not a bottleneck because the amount of work done for each forwarded packet is limited to simple accounting and on-demand state table updates.

4. AGGREGATE CONGESTION CONTROL

An important issue for C-to-C applications is that of congestion control. While individual flows within the application may use a variety of transport-level protocols, including those without congestion control, it is essential that aggregate application traffic is congestion responsive [Floyd and Fall 1999]. In this section, we describe CP mechanisms for achieving aggregate congestion control. Our scheme provides the following benefits:

—Almost any rate-based, single-flow congestion control algorithm may be applied to make aggregate C-to-C traffic congestion responsive.

—C-to-C applications may use multiple flow bandwidth shares and still exhibit correct aggregate congestion responsiveness.

—C-to-C applications may implement complex, application-specific adaptation schemes in which the behavior of individual flows is decoupled from the behavior of the congestion responsive aggregate flow.

Bandwidth filtered loss detection (BFLD) is presented as a technique for making single-flow loss detection algorithms work when aggregate traffic uses multiple flowshares.

4.1 Measuring Network Conditions

As described in Section 3, all packets from all flows in a C-to-C application are used by CP to measure network conditions on the shared data path between APs. Probe information is written into the CP header by the AP as a packet is received from a local endpoint and then forwarded to the remote cluster. Likewise, probe information is processed by an AP as a packet is received from the remote cluster and then forwarded to a local endpoint. Since aggregate data flow is bi-directional and many packets are available for piggybacking probe information, APs can exchange probe information on a very fine-grained level.

To measure RTT, the APs use a timestamp-based mechanism. An AP inserts a timestamp into the CP header of each packet. The remote AP then echoes that value using the next available CP packet traversing the path in the reverse direction, along with a delay value representing the time between when the timestamp was received and when the echo packet became available.
When this information is received by the original AP, an RTT sample is constructed as RTT = current time − echoed timestamp − echo delay. The RTT sample is used to maintain a smoothed weighted average estimate of RTT and RTT variance. (See Figure 2.)

To detect loss, each AP inserts a monotonically increasing sequence number in the CP header. This sequence number bears no relationship to additional sequence numbers appearing in the end-to-end transport-level protocol header nested within. As with all CP probe mechanisms, the underlying transport-level protocol remains unaffected as CP operates in a transparent manner. At the receiving AP, losses are detected by observing gaps in the sequence number space. As with RTT, each loss sample is used to maintain a smoothed average estimate of loss and loss variance.

To estimate available bandwidth, we leverage previous work on equation-based congestion control [Floyd et al. 2000; Padhye et al. 1998]. In this approach, an analytical model of TCP behavior is derived that can be used to estimate the appropriate TCP-friendly rate given estimates of various channel properties including RTT, loss rate, and average packet size. While our recent work has made use of TFRC [Handley et al. 2003], we emphasize that almost any rate-based congestion control algorithm could be applied in order to achieve aggregate congestion responsiveness. This is because CP can provide the basic input parameters required for most such algorithms, for instance current RTT, mean packet size, and individual loss events or loss rates.

In CP, both loss rate and estimated available bandwidth are calculated by the receiving AP and reported back to the sending AP using the CP header in each packet. For example, in Figure 1, the AP for Cluster B maintains an estimate of available bandwidth from Cluster A to Cluster B and reports this estimate back to endpoints in Cluster A within the CP header of packets traveling back in the other direction. In the same manner, Cluster A maintains an estimate of available bandwidth from Cluster B to Cluster A.
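The following sketch captures the receiving-AP bookkeeping just described: an RTT sample formed from the echoed timestamp and echo delay, loss inferred from gaps in the CP sequence space, and smoothed averages of both. The smoothing weight and the exact estimator form are assumptions; the text specifies only that smoothed weighted averages of RTT, loss, and their variances are kept.

```python
# Sketch of per-direction probe processing at a receiving AP (Section 4.1).
# The smoothing constant and estimator form are assumed.
import time

class PathMonitor:
    def __init__(self, alpha=0.125):
        self.alpha = alpha                 # weight for the smoothed averages (assumed)
        self.srtt = None                   # smoothed RTT estimate
        self.next_seq = None               # next expected CP sequence number
        self.received = 0
        self.lost = 0

    def on_probe(self, echoed_timestamp, echo_delay, seq_no, now=None):
        """Process the probe fields of one inbound CP packet."""
        now = time.time() if now is None else now
        # RTT sample = current time - echoed timestamp - echo delay.
        sample = now - echoed_timestamp - echo_delay
        self.srtt = sample if self.srtt is None else \
            (1 - self.alpha) * self.srtt + self.alpha * sample
        # Losses show up as gaps in the AP-level sequence number space.
        if self.next_seq is not None and seq_no > self.next_seq:
            self.lost += seq_no - self.next_seq
        self.received += 1
        self.next_seq = seq_no + 1

    def loss_rate(self):
        total = self.received + self.lost
        return self.lost / total if total else 0.0
```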
4.2 Single Flowshares

In this section, we describe our implementation of CP in ns2 [Breslau et al. 2000] and discuss simulation results for a mock C-to-C application configured to send at an aggregate rate equivalent to a single flowshare. Our results show that CP performs well when compared to competing flows of the same protocol type.

Fig. 5. Simulation testbed in ns2: sending agents S1 through Sn connect through AP_S to a bottleneck link between routers I1 and I2, and then through AP_A to ACK agents A1 through An.

Fig. 6. Simulation parameter settings: packet size 1 KB; ACK size 40 B; bottleneck delay 50 ms; bottleneck bandwidth 15 Mb/s; bottleneck queue length 300; bottleneck queue type RED; simulation duration 180 sec.

4.2.1 CP-TFRC. We refer to our ns2 implementation of the TFRC congestion control algorithm in CP as CP-TFRC. (Full details of the TFRC algorithm can be found in [Handley et al. 2003].) For CP-TFRC, a loss rate is calculated by constructing a loss history and identifying loss events. These events are then converted to a loss event rate. Smoothed RTT, loss event rate, and various other values are then used as inputs into the equation

X = \frac{s}{R\sqrt{2bp/3} + t_{RTO}\left(3\sqrt{3bp/8}\right)p\left(1 + 32p^{2}\right)}    (1)

which calculates a TCP-compatible transmission rate X (bytes/sec), where s is the packet size (bytes), R is the round trip time (sec), p is the loss event rate, t_RTO is the TCP retransmission timeout (sec), and b is the number of packets acknowledged by a single TCP acknowledgment.

Updates in bandwidth availability are made at a frequency of once every RTT. Bandwidth availability is estimated at the remote AP. The resulting bandwidth availability value is placed in the CP header on the reverse path, and simply forwarded by the local AP to application endpoints.
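For reference, Equation 1 translates directly into code. The function below computes the single-flowshare rate from the quantities CP measures; the default choices for b and t_RTO are common TFRC conventions and are assumptions here rather than values taken from the paper.

```python
# Equation 1 (the TFRC throughput equation) as used by CP-TFRC. Defaults for
# b and t_RTO are common TFRC conventions, assumed here.
from math import sqrt

def tfrc_rate(s, rtt, p, t_rto=None, b=1):
    """s: packet size (bytes); rtt: round trip time (sec); p: loss event rate;
    t_rto: TCP retransmission timeout (sec); b: packets per TCP ACK.
    Returns a TCP-compatible rate in bytes/sec."""
    if p <= 0:
        return float("inf")                # no observed loss: the equation places no bound
    if t_rto is None:
        t_rto = 4 * rtt                    # a common TFRC simplification (assumed)
    denom = (rtt * sqrt(2 * b * p / 3) +
             t_rto * (3 * sqrt(3 * b * p / 8)) * p * (1 + 32 * p ** 2))
    return s / denom

# Example: 1 KB packets, 100 ms RTT, and a 1% loss event rate give roughly
# 110 KB/s for a single flowshare.
print(tfrc_rate(s=1000, rtt=0.1, p=0.01))
```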
4.2.2 Configuration. Figure 5 shows our ns2 simulation topology. Sending agents, labeled S1 through Sn, transmit data to AP_S where it is forwarded through a bottleneck link to the remote AP_A and ACK agents A1 through An. For any given simulation, the bottleneck link between I1 and I2 is shared by CP flows transmitting between clusters and competing (i.e., non-CP) TFRC flows. Figure 6 summarizes topology parameters. Links between ACK agents A1 through An are assigned delay values that vary in order to allow some variation in RTT for different end-to-end flows.

Flows in our simulated C-to-C application are configured to take an equal portion of the current bandwidth available to the application. That is, if n C-to-C endpoints share bandwidth flowshare B, then each endpoint sends at a rate of B/n. More complex configurations are possible, and the reader is referred to [Ott and Mayer-Patel 2002] for further illustrations.

4.2.3 Evaluation. Our goal in this section is to compare aggregate CP-TFRC traffic using a single flowshare with competing TFRC flows sharing the same C-to-C data path. Our concern is not evaluating the properties (e.g., TCP-compatibility) of the TFRC congestion control scheme, but rather examining how closely C-to-C aggregate traffic conforms to TFRC bandwidth usage patterns. The question of how well CP-TFRC performs with respect to competing TCP flows is left to Section 5.

Fig. 7. TFRC versus CP-TFRC normalized throughput as the number of competing TFRC flows is varied.

Fig. 8. TFRC versus CP-TFRC normalized throughput as the number of flows in the C-to-C aggregate is varied.

In Figure 7, a mock C-to-C application consisting of 24 flows competes with a varying number of TFRC flows sharing the same cluster-to-cluster data path. Throughput values have been normalized so that a value of 1.0 represents a fair throughput level for a single flow. The performance of TFRC flows is presented in two ways. First, normalized bandwidth of a single run is presented as a series of points representing the normalized bandwidth received by each competing flow. These points illustrate the range in values realized within a trial. Second, a line connects points representing the average (mean) bandwidth received by competing TFRC flows across 20 different trials of the same configuration. The CP-TFRC line connects points representing the aggregate bandwidth received by 24 CP flows averaged over 20 trials. For each trial, this aggregate flow competes as only a single flowshare within the simulation. We see from this plot that as the number of competing TFRC flows increases, C-to-C flows receive only slightly less than their fair share.

Figure 8 shows per-flow normalized throughput when the number of competing TFRC flows is held constant at 24, and the number of CP flows is increased while still sharing a single flowshare. Again, aggregate CP traffic received very close to its fair share of available bandwidth, with normalized values greater than 0.8 throughout.

4.3 Multiple Flowshares

In this section, we consider the problem of supporting multiple flowshares. While numerous approaches for applying aggregate congestion control using single flowshares have been suggested, as reviewed in Section 2, we are unaware of any approach that considers the multiple flowshare problem. The reason for this is that single-flow congestion control algorithms break when a sender fails to limit its sending rate to the rate calculated by the algorithm. Here we use simulation to show how this is the case for CP-TFRC. After discussing the problem in some detail, we present a new technique, bandwidth filtered loss detection (BFLD), and demonstrate its effectiveness in enabling multiple flowshares.

4.3.1 Naive Approach. Our goal in this section is to allow C-to-C applications to send the equivalent of m flowshares in aggregate traffic, where m is equal to the number of flows in the application. As mentioned in Section 1, we believe that limiting a C-to-C application to a single flowshare may unfairly limit bandwidth for an application that would otherwise employ multiple independent flows.

A naive approach for realizing multiple flowshares is simply to have each C-to-C application endpoint multiply the estimated bandwidth availability value B by a factor m. Thus, each endpoint behaves as if the bandwidth available to the application as a whole is mB. One could justify such an approach by arguing that probe information exchanges between APs maintain a closed feedback loop. That is, an increase in aggregate sending rate beyond appropriate levels will result in increases in network delay and loss. In turn, this will cause calculated values of B to decrease, thus responding to current network conditions. Ideally, B would settle on some new value which, when multiplied by m, results in the appropriate congestion-controlled level that would have otherwise been achieved by m independent flows.

Figure 9 shows that this is not the case. For each simulation, the number of CP-TFRC and competing TFRC flows is held constant at 24. The number of flowshares used by CP-TFRC traffic is then increased from k = 1 to m using the naive approach. The factor k is given by the x-axis. The normalized fair share ratio (with 1.0 representing perfect fairness) is given by the y-axis.

Fig. 9. Throughput for multiple flowshares (naive approach).

In Figure 9, increases in the number of flowshares cause the average bandwidth received by a competing TFRC flow to drop unacceptably low.
By k = 16, TFRC flows receive virtually no bandwidth, and beyond k = 16, growing loss rates eventually trigger the onset of congestion collapse. Additional simulation work with RAP [Rejaie et al. 1999] (not presented in this paper) likewise shows unacceptable results, although with a somewhat different pattern of behavior, suggesting that different congestion control schemes result in different types of failure.

4.3.2 The Packet Loss Problem. In the case of CP-TFRC, recall that RTT and loss event rate are the primary inputs to Equation 1. We note that increasing the C-to-C aggregate sending rate should have no marked effect on RTT measurements since APs simply use any available CP packets for the purpose of probe information exchanges. In fact, increasing the number of available packets should make RTT measurements even more accurate since more packets are available for probing.

On the other hand, we note that a large increase in C-to-C aggregate traffic has a drastic effect on loss event rate calculations in CP-TFRC. TFRC marks the beginning of a loss event when a packet loss Pi is detected. The loss event ends when, after a period of one RTT, another packet loss Pj is detected. An inter-loss event interval I is calculated as the difference in sequence numbers between the two lost packets (I = j − i) and, to simplify somewhat, a rate R is calculated by taking the inverse of this value (R = 1/I).

Here we note that the effect of drastically increasing the number of packets in the aggregate traffic flow is to increase the inter-loss event interval I; while the likelihood of encountering a packet drop soon after the RTT damping period has expired increases, the number of packet arrivals during the damping period also increases. The result is a larger interval, or a smaller loss event rate, and hence an inflated available bandwidth estimation. This situation is depicted in Figure 10.

In a sense, the algorithm suffers from the problem of inappropriate feedback. For CP-TFRC, too many packets received in the damping period used to calculate a loss event rate artificially inflates the inter-loss event interval. The algorithm has been tuned for the appropriate amount of feedback which would be generated by a packet source that is conformant to a single flowshare only.

4.3.3 BFLD. Our solution to the problem of loss detection in a multiple flowshare context is called bandwidth filtered loss detection (BFLD). BFLD works by sub-sampling the space of CP packets in the network, effectively reducing the amount of loss feedback to an appropriate level. Essentially, the congestion control algorithm is driven by a “virtual” packet stream which is stochastically sampled from the actual aggregate packet stream.

BFLD makes use of two different bandwidth calculations. First is the available bandwidth, or B_avail, which is calculated by the congestion control algorithm employed at the AP. This represents the congestion responsive sending rate for a single flowshare. Second is the arrival bandwidth, or B_arriv. The value B_arriv is an estimate of the bandwidth currently being generated by the C-to-C application. From these values, a sampling fraction F is calculated as F = B_avail / B_arriv. If B_avail > B_arriv, then F is set to 1.0.
Conceptually, this value represents the fraction of arriving packets and detected losses to sample in order to create the virtual packet stream that will drive the congestion control algorithm. We refer to this virtual packet stream as the filtered packet event stream.

Fig. 11. Virtual packet event stream construction by BFLD.

To determine whether a packet arrival or loss should be included in the filtered packet event stream, a simple stochastic technique is used. Whenever a packet event occurs (i.e., a packet arrives or a packet loss is detected), a random number r is generated in the interval 0 ≤ r ≤ 1.0. If r falls in the interval 0 ≤ r ≤ F, then an event is generated for the virtual packet event stream; otherwise, no virtual packet event is generated. Packets chosen by this filtering mechanism are given a virtual packet sequence number that will be used by the congestion control algorithm for loss detection, computing loss rates, updating loss histories, etc.

Figure 11 illustrates the effect of this process. In this figure, we see that a subset of the multiple flowshare packet event stream is stochastically chosen to generate a virtual packet event stream, and that virtual sequence numbers are assigned to the sampled packet events. As a result, the TFRC calculation for the loss event interval decreases from 12 to 7, remedying the problem illustrated in Figure 10. An interesting feature of this technique is that it can be applied regardless of the number of flowshares used by the C-to-C application, because the factor F adjusts with the amount of bandwidth used.

4.3.4 Evaluation. Figure 12 shows the results of applying BFLD to the simulations of Figure 9 in Section 4.3.1. As before, the number of CP-TFRC flows and competing TFRC flows are both held constant at 24, while the number of flowshares taken by CP-TFRC traffic as an aggregate is increased from k = 1 to m. The results show a dramatic improvement. Normalized throughput for CP-TFRC flowshares is consistently close to 0.9, while throughput levels achieved by competing TFRC flows are consistently close to 1.0.

Fig. 12. Throughput for multiple flowshares using BFLD.
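Before turning to the case study, a minimal sketch shows how the filtered event stream feeds the simplified loss-interval calculation of Section 4.3.2 (hypothetical names; real TFRC maintains a weighted history of loss intervals [Handley et al. 2003] rather than using only the most recent one):

```python
class VirtualLossHistory:
    """Loss-interval bookkeeping driven by the filtered (virtual) packet
    event stream produced by BFLD."""

    def __init__(self):
        self.loss_start_seq = None    # virtual seq. number starting the current loss event
        self.loss_start_time = None   # time the current loss event began
        self.intervals = []           # history of inter-loss event intervals

    def on_virtual_event(self, vseq, is_loss, now, rtt):
        if not is_loss:
            return
        if self.loss_start_seq is None:
            self.loss_start_seq, self.loss_start_time = vseq, now
            return
        # Losses within one RTT of the start of the current loss event
        # belong to that event and do not start a new one.
        if now - self.loss_start_time < rtt:
            return
        self.intervals.append(vseq - self.loss_start_seq)   # I = j - i
        self.loss_start_seq, self.loss_start_time = vseq, now

    def loss_event_rate(self):
        # Simplified: R = 1/I using the most recent interval only.
        return 1.0 / self.intervals[-1] if self.intervals else 0.0
```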
5. CASE STUDY: COORDINATED MULTI-STREAMING IN 3DTI

The 3D Tele-immersion (3DTI) system, jointly developed by the University of North Carolina at Chapel Hill and the University of Pennsylvania, is an ideal environment for exploring CP capabilities. The application comprises two multi-host environments, a scene acquisition subsystem and a reconstruction/rendering subsystem, that must exchange data in complex ways over a common Internet path. Data transport proved to be a difficult challenge for the original 3DTI design team, and our subsequent collaboration does much to showcase how CP can help. In this section, we explore ways in which CP was employed within this context and the resulting improvements in application performance.

5.1 Architecture

The scene acquisition subsystem in 3DTI (see Figure 14) is charged with capturing video frames simultaneously on multiple cameras and streaming them to the 3D reconstruction engine at a remote location. The problem of synchronized frame capture is solved using a single triggering mechanism across all cameras. Triggering can be handled periodically or in a synchronous blocking manner in which subsequent frames are triggered only when current frames have been consumed. The triggering mechanism itself can be hardware-based (e.g., a shared 1394 Firewire bus) or network-based using message passing. 3DTI uses synchronous blocking and message passing to trigger simultaneous frame capture across all hosts.

A master-slave configuration is used in which each camera is attached to a separate Linux host (i.e., a slave) that waits for a triggering message to be broadcast by a trigger host (i.e., the master). Once a message has been received, a frame is captured and written to the socket layer, which handles reliable streaming to an endpoint on the remote reconstruction subsystem. As soon as the write call returns (i.e., the frame can be accommodated in the socket-layer send buffer), a message is sent to the trigger host notifying it that the capture host is ready to capture again. When a message has been received from all hosts, the trigger host broadcasts a new trigger message and the process repeats.

Fig. 13. 3D Tele-immersion.

Fig. 14. 3D Tele-immersion architecture.

The reconstruction/rendering subsystem in 3DTI is essentially a cluster of data consumers. Using parallel processing, video frames taken from the same instant in time are compared with one another using complex pixel correspondence algorithms. The results, along with camera calibration information, are used to reconstruct depth information on a per-pixel basis, which is then assembled into view-independent depth streams. Information on user head position and orientation (obtained through head tracking) is then used to render these depth streams in real time as a view-dependent scene in 3D using a stereoscopic display.
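The synchronous-blocking trigger protocol can be summarized in a short sketch; the version below simulates it with in-process queues and threads, and capture_frame and stream_frame are hypothetical stand-ins for camera capture and the blocking socket write (this is an illustration, not the 3DTI code):

```python
import queue
import threading

def capture_host(host_id, trigger_q, ready_q, capture_frame, stream_frame):
    while trigger_q.get() is not None:        # wait for a trigger broadcast
        frame = capture_frame(host_id)        # grab a frame on trigger
        stream_frame(host_id, frame)          # returns once the send buffer accepts the frame
        ready_q.put(host_id)                  # notify the master we are ready again

def trigger_master(trigger_qs, ready_q, num_rounds):
    for _ in range(num_rounds):
        for q in trigger_qs:                  # broadcast a trigger message
            q.put("trigger")
        for _ in trigger_qs:                  # block until every capture host reports ready
            ready_q.get()
    for q in trigger_qs:                      # shut the capture hosts down
        q.put(None)

if __name__ == "__main__":
    ready = queue.Queue()
    triggers = [queue.Queue() for _ in range(3)]
    for i, tq in enumerate(triggers):
        threading.Thread(target=capture_host,
                         args=(i, tq, ready, lambda h: b"frame", lambda h, f: None),
                         daemon=True).start()
    trigger_master(triggers, ready, num_rounds=5)
```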
5.2 The Multi-streaming Problem

Our concern in this section is with the problem of coordinated multi-streaming between the scene acquisition and 3D reconstruction components of the 3DTI architecture. Specifically, we are interested in providing reliable transport of frame ensembles (a set of n video frames captured from n cameras at the same instant in time) such that aggregate streaming is
—Responsive to network congestion.
—Highly synchronous with respect to frame arrivals.

Congestion responsiveness is important not only to prevent unfairness to competing flows and the possibility of congestion collapse [Floyd and Fall 1999], but also to minimize unnecessary loss and retransmissions. 3D reconstruction places a high demand on data integrity to be effective, and hence it is a basic requirement in 3DTI that data transport be reliable. Frame synchrony is the notion that frames within the same ensemble are received by the reconstruction subsystem at the same time. A low degree of frame synchrony will result in stalling as the 3D reconstruction pipeline waits for remaining pixel data to arrive, a highly undesirable effect for 3DTI as a real-time, interactive application.

Some key issues in multi-streaming video frames in this context include
—Send buffer size,
—Choice of transport protocol,
—Aggregate responsiveness to network congestion,
—Bandwidth utilization, and
—Synchronization across flows.

In the original 3DTI design, TCP was chosen as the transport-level protocol for each video stream. TCP, while not typically known as a streaming protocol, was an attractive choice to the 3DTI developers for several reasons. First, it provides in-order, reliable data delivery semantics, which, as mentioned in Section 1, is an important requirement in this problem domain. Second, it is congestion responsive. Use of TCP for multi-streaming in 3DTI ensures that C-to-C traffic as an aggregate is congestion responsive by virtue of the fact that individual flows are congestion responsive. The original developers had hoped that by using relatively large capacity networks (e.g., Abilene), performance would not be an issue. The resulting application performance, however, was poor, but not necessarily because of bandwidth constraints. Instead, the uncoordinated operation of multiple TCP flows between the acquisition and reconstruction clusters resulted in large end-to-end latencies and asynchronous delivery of frames by different flows. By adding CP mechanisms to the architecture and developing a CP-based, reliable transport-level protocol, we demonstrate how a small amount of coordination between peer flows of a C-to-C application can go a long way toward achieving application-wide networking goals.

5.3 Multi-streaming with TCP

The major disadvantage of TCP in the multi-streaming problem context is that individual flows operate independently of peer flows within the same application. Each TCP stream independently detects congestion and responds to loss events using its well-known algorithms for increasing and decreasing congestion window size. While the result is a congestion-responsive aggregate, differences in congestion detection can easily result in a high degree of asynchrony as some flows detect multiple congestion events and respond accordingly, while other flows encounter fewer or no congestion events and maintain a congestion window that is, on average during the streaming interval, larger. The result, for equal-size frames across all capture endpoints, is that some flows may end up streaming frames belonging to the same ensemble more quickly at the expense of peer flows that gave up bandwidth in the process.

The problem is heightened when video frames are of unequal size. This might be the case when individual capture hosts apply compression as part of the capture process. A flow with more data to send might, in some cases, encounter more congestion events and, as a result, back off more than a flow with less data to send. The result is a high probability of stalling as some flows finish streaming their frame and wait for the remaining flows to complete before the next frame trigger can proceed.
The problem of stalling can be mitigated, of course, by increasing socket-level send buffering, but at the expense of increased end-to-end delay, which is highly undesirable since 3DTI is an interactive, real-time application. What is needed, we argue, is an appropriate amount of buffering: large enough to maintain a full data pipeline at all times, but small enough to minimize unnecessary end-to-end delay. Maintaining this balance, however, requires information about conditions on the C-to-C data path, something that TCP cannot provide.

5.4 Multi-streaming with CP-RUDP

To address these problems, we deployed CP-enabled software routers in front of each cluster to act as APs. We then developed a new UDP-based protocol called CP-RUDP and deployed it on each endpoint host in the application. CP-RUDP is an application-level transport protocol for experimenting with send rate modifications using CP information in the context of multi-stream coordination. Essentially, it provides the same in-order, reliable delivery semantics as TCP, but with the twist that reliability has been completely decoupled from congestion control. This is because the CP layer beneath can now provide the congestion control information needed for adjusting send rate in appropriate ways. In addition, CP-RUDP is a rate-based protocol, while TCP is a window-based protocol.

In the context of 3DTI, our work focused on two areas:
—Better bandwidth distribution for increased frame arrival synchrony.
—Adjusting sender-side buffering to maximize utilization but minimize delay.

To accomplish the first goal, we rely on an important property of the CP state table: consistency of information across endpoints. Because the APs are now in charge of measuring network characteristics of the C-to-C data path for the application as a whole, individual flows can employ that information and make rate adjustments confident that peer flows of the same application are getting the same information and are responding appropriately. In particular, endpoints see the same bandwidth availability estimates, round trip time measurements, loss rate statistics, and other network-based statistics.

With this in mind, we found the most effective coordination algorithm for 3DTI's multi-streaming problem to be the application of two relatively simple strategies, sketched in code below.
—Each endpoint in the application sends at exactly the rate given by the Net.bw report value. This value, as described in [Ott et al. 2004], incorporates loss and round trip time measurement information on the cluster-to-cluster data path and uses a TCP modeling equation to calculate the instantaneous congestion-responsive sending rate for a single flow [Floyd et al. 2000; Handley et al. 2003].
—Each endpoint uses an adaptive send buffer scheme in which the buffer size B is continually updated using the expression B = 1.5 ∗ (Net.bw ∗ Net.rtt). In other words, the send buffer size is set to be 1.5 times the bandwidth-delay product at all times. By using the bandwidth-delay product, we ensure that good network utilization is maintained at all times. The 1.5 multiplicative factor is simply a heuristic that ensures some additional buffer space for retransmission data and data that is waiting to be acknowledged.
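As referenced above, a minimal sketch of the two strategies, assuming the endpoint can read the AP-reported Net.bw (bytes per second) and Net.rtt (seconds) values (the function name and units are hypothetical):

```python
def coordinate_sender(net_bw, net_rtt):
    """Return (send_rate, send_buffer_bytes) for one endpoint.

    net_bw  -- Net.bw report value from the CP state table (bytes/sec)
    net_rtt -- Net.rtt report value from the CP state table (seconds)
    """
    send_rate = net_bw                          # strategy 1: send at exactly Net.bw
    send_buffer = int(1.5 * net_bw * net_rtt)   # strategy 2: 1.5 x bandwidth-delay product
    return send_rate, send_buffer

# Example: 4 MB/s of available bandwidth and a 50 ms RTT yield a 300 KB send buffer.
rate, buf = coordinate_sender(4_000_000, 0.05)
```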
Experimental results demonstrating the effectiveness of this scheme are presented in the following section.

5.5 Experimental Results

In this section, we present experimental results demonstrating the effectiveness of flow coordination for the problem of multi-streaming in 3DTI. Included is a description of our experimental setup and performance metrics. Our goal is to compare multi-streaming performance between TCP, a reliable, congestion-responsive but uncoordinated transport protocol, and CP-RUDP, an equivalently reliable, congestion-responsive transport protocol with the added feature that it supports flow coordination. Our results show a dramatic improvement in synchronization while maintaining a bandwidth utilization that does not exceed that of TCP. They also underscore the tremendous benefit of information consistency across flows as provided by the CP architecture.

Fig. 15. Experimental network setup.

5.5.1 Experimental Setup. Our experimental network setup is shown in Figure 15. CP hosts and their local AP on each side of the network represent two clusters that are part of the same C-to-C application and exchange data with one another. Each endpoint sends and receives data on a 100 Mb/s link to its local AP, a FreeBSD router that has been CP-enabled as described above. Aggregate C-to-C traffic leaves the AP on a 1 Gb/s uplink. At the center of our testbed are two routers connected using two 100 Mb/s Fast Ethernet links. This creates a bottleneck link and, by configuring traffic from opposite directions to use separate links, emulates the full-duplex behavior seen on wide-area network links.

In order to calibrate the fairness of application flows to TCP flows sharing the same bottleneck link, we use two sets of hosts (labeled "TCP hosts" in Figure 15) and the well-known utility iperf [Iperf]. Iperf flows are long-lived TCP flows that compete with application flows on the same bottleneck throughout our experiments. The normalized flowshare metric described in Section 5.5.2 then provides a way of quantifying the results. Also sharing the bottleneck link for many experiments are background flows between traffic hosts on each end of the network. These hosts are used to generate Web traffic at various load levels and their associated patterns of bursty packet loss. More is said about these flows in Section 5.5.4.

Finally, network monitoring during experiments is done in two ways. First, tcpdump is used to capture TCP/IP headers from packets traversing the bottleneck, which are later filtered and processed for detailed performance data. Second, a software tool is used in conjunction with ALTQ [Kenjiro 1998] extensions to FreeBSD to monitor queue size, packet forwarding events, and packet drop events on the outbound interface of the bottleneck routers. The resulting log information provides packet loss rates with great accuracy.

Fig. 16. Completion asynchrony.

5.5.2 Performance Metrics. In this section, we define several metrics for measuring multi-streaming performance in 3DTI. These include completion asynchrony, frame ensemble transfer rate, frame ensemble arrival jitter, normalized flowshare, end-to-end delay, and stall time.
First, define a frame ensemble to be a set of n frames captured by n different frame acquisition hosts at the same instant in time. A frame ensemble is generated after each triggering event as described in Section 5.1.

To compare the level of synchrony in frame arrivals within the same frame ensemble, we define the metric completion asynchrony for frame ensemble i as follows. Within any given frame ensemble i, there is some receiving host that receives frame i in its entirety first. Let us call this host Hf,i and its time of completion cf,i. There is another host that receives frame i in its entirety last (i.e., after all other hosts have already received frame i). Call this host Hl,i and its time of completion cl,i. Completion asynchrony Ci is defined as the time interval between frame completion events cl,i and cf,i. Intuitively, it reflects how staggered frame transfers are across all application flows in receiver-based terms. (See Figure 16.)

Ci = cl,i − cf,i    (2)

An important metric to the application as a whole is the frame ensemble transfer rate, which we define as the number of complete frame ensemble arrivals f over a time interval p. In general, higher frame ensemble transfer rates indicate better network utilization and the absence of stalling due to frame transfer asynchrony. A similarly important metric is frame ensemble arrival jitter, defined as the standard deviation of frame ensemble interarrival intervals ti over a longer run interval p. Small jitter values are important to prevent the reconstruction/rendering pipeline from backing up or starving as the application runs in real time.

To compare the bandwidth taken by flows in the application to that of TCP flows competing over the same bottleneck link, we define the average flowshare (F) to be the mean aggregate throughput divided by the number of flows. The normalized flowshare is then the average flowshare among a subset of flows, for example CP-RUDP flows (FCP-RUDP), divided by the average flowshare for all flows (Fall). (All flows here refers to CP-RUDP flows and competing TCP iperf flows, but not background traffic flows.)

Normalized flowshare = FCP-RUDP / Fall    (3)

A value of 1.0 represents an ideal fair share. A value greater than 1.0 indicates that CP-RUDP flows have, on average, received more than their fair share, while for values less than 1.0 the reverse is true.

The transmission time for each frame, including send and receive buffering, is averaged within each ensemble into a per-ensemble mean delay value. End-to-end delay, then, is defined as the mean delay value dmean,i across all frame ensembles of the run interval p. Delay values reflect a variety of factors including frame size, buffering at the sender, network queueing delay, and the number of retransmissions required to reliably transmit frame data in its entirety.

Finally, the time interval between frame send events by each flow is typically small unless stalling occurs. A flow is said to stall when it completes its transmission of the current frame and must wait for the next trigger event to begin sending a new frame. Each frame ensemble has a mean stall interval smean,i, measured simply as the average time between subsequent frame transfers for each flow. Stall time is defined as the mean stall interval smean,i across all frame ensembles of the run interval p.
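To make the definitions concrete, these metrics can be computed from per-ensemble completion times and per-flow throughputs roughly as follows (a minimal sketch with hypothetical helper names, not the analysis code used in our experiments):

```python
import statistics

def completion_asynchrony(completion_times):
    """Equation 2: C_i = c_l,i - c_f,i, the last minus the first frame
    completion time within one ensemble."""
    return max(completion_times) - min(completion_times)

def ensemble_transfer_rate(num_complete_ensembles, run_interval):
    """Number of complete frame ensemble arrivals f over run interval p."""
    return num_complete_ensembles / run_interval

def arrival_jitter(ensemble_arrival_times):
    """Standard deviation of frame ensemble interarrival intervals."""
    intervals = [b - a for a, b in
                 zip(ensemble_arrival_times, ensemble_arrival_times[1:])]
    return statistics.pstdev(intervals) if intervals else 0.0

def normalized_flowshare(cp_rudp_throughputs, iperf_throughputs):
    """Equation 3: average flowshare of CP-RUDP flows divided by the
    average flowshare of all (CP-RUDP plus competing iperf) flows."""
    f_cp = sum(cp_rudp_throughputs) / len(cp_rudp_throughputs)
    all_flows = cp_rudp_throughputs + iperf_throughputs
    f_all = sum(all_flows) / len(all_flows)
    return f_cp / f_all
```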
5.5.3 Frame Size Experiments. In this section, we compare the performance of TCP multi-streaming with that of CP-RUDP over a range of frame sizes. In each run, frames within each ensemble have the same fixed size, which remains fixed throughout the run. This size, however, is varied from run to run to determine its overall effect on multi-streaming performance. TCP runs are divided into two cases: a large send buffer (1 MB) and a small send buffer (64 KB). For convenience, we refer to these configurations as TCP-large and TCP-small, respectively. Send buffer configurations between these two extremes yield results that fall between those presented for any given metric.

To generate controlled loss, we used the dummynet [Rizzo] traffic shaping utility found in FreeBSD 4.5. Dummynet provides support for classifying packets and dividing them into flows. A pipe abstraction is then applied that emulates link characteristics including bandwidth, propagation delay, queue size, and packet loss rate. We enabled dummynet on the bottleneck routers and configured it to produce packet loss at a rate of one percent. Runs lasted 15 minutes, during which the initial 5 minutes were spent on ramp-up and stabilization and the subsequent 10 minutes were used to collect performance data. Trials with longer stabilization and run intervals did not show significantly different results.

Completion asynchrony results in Figure 17 (a) show a dramatic difference between TCP-large and CP-RUDP. This gap is drastically reduced by decreasing send buffer size, as seen in the much improved performance of TCP-small. But the frame ensemble rates given in Figure 17 (b) show that the tradeoff is in network utilization. While TCP-small shows lower completion asynchrony values than TCP-large, its overall frame ensemble rates are significantly lower than those of TCP-large.

Fig. 17. Frame size results. (a) Completion asynchrony, (b) frame ensemble transfer rate, and (c) frame ensemble arrival jitter versus frame size.

Fig. 18. Frame size results (cont'd). (a) Normalized flowshare, (b) end-to-end delay, and (c) stall time versus frame size.

This tradeoff is underscored in other results as well. In Figure 18 (a), we see that the larger send buffer size of TCP-large improves network utilization and fairness between TCP-large flows and competing TCP iperf flows. This is reinforced by the stall time results in Figure 18 (c): TCP-small shows larger stall time values than TCP-large flows, which buffer considerably more video data and are thus better at keeping the transmission pipe full at all times. End-to-end delay results in Figure 18 (b), however, show that this is achieved at the expense of end-to-end delay. That is, TCP-large results in significantly higher end-to-end delay compared to TCP-small. Furthermore, TCP-small reduces frame ensemble arrival jitter, which is important for maintaining a full reconstruction pipeline with minimal backlog.

By comparison, CP-RUDP offers the best of both worlds. On the one hand, it shows low completion asynchrony, low end-to-end delay, and low frame ensemble arrival jitter. On the other hand, it shows high network utilization, high frame ensemble rates, and very low stall times.
While TCP send buffer size can be tuned to achieve an average performance that improves on each extreme, CP-RUDP equals or beats the best that TCP can do on all fronts.

5.5.4 Load Experiments. While testing CP performance using dummynet is instructive, a random loss model is wholly unrealistic. In reality, losses induced by drop-tail queues in Internet routers are bursty and correlated. To better capture this dynamic, we tested TCP and CP-RUDP performance against various background traffic workloads using a Web traffic generator known as thttp. Thttp uses empirical distributions from [Smith et al. 2001] to emulate the behavior of Web browsers and the traffic that browsers and servers generate on the Internet. Distributions are sampled to determine the number and size of HTTP requests for a given page, the size of a response, the amount of "think time" before a new page is requested, etc. A single instance of thttp may be configured to emulate the behavior of hundreds of Web browsers and to generate significant levels of TCP traffic with real-world characteristics. Among these characteristics are heavy-tailed distributions of flow ON and OFF times and significant long-range dependence in packet arrival processes at network routers.

Fig. 19. Load results. (a) Completion asynchrony, (b) frame ensemble transfer rate, and (c) frame ensemble arrival jitter versus thttp background traffic load.

Fig. 20. Load results (cont'd). (a) Normalized flowshare, (b) end-to-end delay, and (c) stall time versus thttp background traffic load.

We ran four thttp servers and four clients on each set of traffic hosts seen in Figure 15. Emulated Web traffic was given a 20-minute ramp-up interval and competed with TCP and CP-RUDP flows on the bottleneck link in both directions. We varied the number of emulated browsers from 1000 to 6000. The resulting loss rates, as measured at the bottleneck router queues, are between 0.005 and 0.05.

Figure 19 (a) shows that as background TCP traffic increases, completion asynchrony remains consistently low for CP-RUDP. Furthermore, end-to-end delay values (Figure 20 (b)) and stall time values (Figure 20 (c)) are insensitive to traffic increases and remain low throughout. Frame ensemble arrival jitter (Figure 19 (c)) likewise is consistently lower than that of TCP-large and TCP-small, and remains insensitive to increases in TCP background traffic.

In contrast, completion asynchrony (Figure 19 (a)) increases for both TCP configurations, with TCP-large showing a stark increase and TCP-small only a slight increase. Similarly, end-to-end delay values (Figure 20 (b)) show TCP-large increasing markedly as background traffic load increases, while TCP-small shows only a small increase. Both show substantial increases in stall time values (Figure 20 (c)) and frame ensemble arrival jitter (Figure 19 (c)).
Once again, TCP-small is able to minimize completion asynchrony and end-to-end delay relative to TCP-large only at the expense of lower frame ensemble rates (Figure 19 (b)) and poor network utilization (Figure 20 (a)). In contrast, CP-RUDP achieves the best of both worlds, showing the best frame ensemble rate and normalized flowshare values of any configuration.

Fig. 21. Variable frame size results. (a) Completion asynchrony, (b) frame ensemble transfer rate, and (c) frame ensemble arrival jitter versus frame size.

Fig. 22. Variable frame size results (cont'd). (a) Normalized flowshare, (b) end-to-end delay, and (c) stall time versus frame size.

5.5.5 Variable Frame Size Experiments. Finally, we look at the effect of variable frame size on transfer asynchrony. Here, variable frame size refers to differences in frame size within the same frame ensemble. This situation might occur, for example, when frames from the user's region of interest are captured at a higher resolution than those outside this region. To generate variable-sized frames, we divide the flows in half, designating one half as flows that stream larger frames and the other half as flows that stream smaller frames. Frame size is determined by using a constant mean value µ (25 KB) and then varying a frame size dispersion factor f, which is applied as follows:

Framesize = µ ± (f ∗ µ)    (4)

CP-RUDP, which handles streaming for each flow, has been modified in this context so that each flow takes bandwidth proportional to its frame size. It was mentioned in Section 3 and Section 4 that an important feature of CP is that it supports the decoupling of individual flow behavior from aggregate congestion response behavior. In this scenario, each flow is configured to send at Net.bw ∗ (Si/Sall), where Si is the frame size for flow i and Sall is the aggregate amount of data to send for the entire frame ensemble. Note that frame sizes may change dynamically across flows using this scheme, and that the sum bandwidth used remains n ∗ Net.bw throughout (where n is the number of flows).

Looking at the results in Figure 21 and Figure 22, we note (1) that CP-RUDP values remain generally insensitive to increases in the frame size dispersion factor, and (2) that CP-RUDP significantly outperforms TCP (with a send buffer of 400 KB) in virtually every metric. In support of the latter point, we note the significantly lower completion asynchrony values (Figure 21 (a)), lower end-to-end delay (Figure 22 (b)), lower stall time (Figure 22 (c)), higher normalized flowshare values (Figure 22 (a)), higher frame ensemble rate (Figure 21 (b)), and lower frame ensemble arrival jitter (Figure 21 (c)).
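The proportional rate assignment used in these runs can be sketched as follows (hypothetical helper names; the 0.2 dispersion value and byte units are purely illustrative):

```python
def frame_sizes(n_flows, mean_size=25_000, dispersion=0.2):
    """Half the flows stream larger frames and half smaller ones:
    Framesize = mu +/- (f * mu), as in Equation 4."""
    large = mean_size + dispersion * mean_size
    small = mean_size - dispersion * mean_size
    return [large] * (n_flows // 2) + [small] * (n_flows - n_flows // 2)

def per_flow_rates(sizes, net_bw):
    """Each flow i sends at Net.bw * (S_i / S_all), where S_all is the
    total amount of data in the current frame ensemble."""
    s_all = sum(sizes)
    return [net_bw * (s / s_all) for s in sizes]

# Example: six flows with dispersion factor 0.2 and a 4 MB/s Net.bw report.
rates = per_flow_rates(frame_sizes(6), net_bw=4_000_000)
```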
6. CONCLUSIONS

In this paper, we motivate the need for coordination among peer flows in a broad class of future multimedia applications that we call cluster-to-cluster (C-to-C) applications. These applications involve multi-stream communication between clusters of computing resources. One such application at the focus of our attention is the 3D Tele-immersion (3DTI) system developed jointly by UNC and Penn. To address the transport-level coordination issues of C-to-C applications, we have developed a protocol architecture called the Coordination Protocol (CP). CP provides support for sharing information among flows, including network information, flow information, and application-defined state information. The result is an open architecture useful in addressing a wide variety of flow coordination problems, including coordinated bandwidth distribution.

CP exploits various features of the C-to-C problem architecture, using cluster aggregation points as a natural mechanism for information exchange among flows, and using application packets as carriers of network probe and state information along the cluster-to-cluster data path. We have shown how network probe information can be used to drive a bandwidth estimation algorithm, which can then be scaled for multiple flowshares to provide a flexible scheme for achieving aggregate congestion control. We have also used CP to develop a new reliable streaming protocol called CP-RUDP that addresses the synchronization requirements found in 3DTI. We present results showing how CP-RUDP is able to dramatically improve multi-stream synchronization within the context of this application, while at the same time minimizing end-to-end delay and maintaining high frame ensemble transfer rates.

6.1 Future Directions

CP infrastructure continues to evolve, and finding the right set of network and flow information to support flow coordination is no easy task. Future work will continue to examine state table content, including operations for aggregating application-defined state information in useful ways. CP-enabled transport-level protocols that are peer-aware and behave in coordinated ways are another rich area of work. Likewise, new applications with novel coordination requirements will naturally generate future work on coordination schemes that rely on CP support.

Finally, wireless cluster-to-cluster applications represent an interesting challenge to our protocol architecture. In this case, the assumption that endpoint-to-AP communication takes place with little loss or delay (due to provisioning) is no longer true. One idea is to design application endpoints and/or transport-level protocols that can use the CP framework to discriminate between local (i.e., wireless) and AP-to-AP sources of delay and loss. This can be done by comparing end-to-end network measurements with reported CP measurements and using discrepancies as an indication of conditions on the wireless portion of the path.

REFERENCES

Alexander, D. et al. 1997. Active bridging. Proceedings of SIGCOMM '97, 101–111.
Balakrishnan, H., Rahul, H. S., and Seshan, S. 1999. An integrated congestion management architecture for internet hosts. Proceedings of ACM SIGCOMM.
Black, D., Carlson, M., Davies, E., Wang, Z., and Weiss, W. 1998. RFC 2475: An Architecture for Differentiated Services. Internet Engineering Task Force.
Breslau, L., Estrin, D., Fall, K., Floyd, S., Heidemann, J., Helmy, A., Huang, P., McCanne, S., Varadhan, K., Xu, Y., and Yu, H. 2000. Advances in network simulation. IEEE Computer 33, 5 (May), 59–67.
Calvert, K. L., Griffioen, J., and Wen, S. 2002. Lightweight network support for scalable end-to-end services. In Proceedings of ACM SIGCOMM.
Decasper, D. et al. 1998. Router plugins: A software architecture for next generation routers. Proceedings of SIGCOMM '98, 229–240.
Floyd, S. and Fall, K. R. 1999. Promoting the use of end-to-end congestion control in the internet. IEEE/ACM Transactions on Networking 7, 4, 458–472.
Floyd, S., Handley, M., Padhye, J., and Widmer, J. 2000. Equation-based congestion control for unicast applications. Proceedings of ACM SIGCOMM, 43–56.
Floyd, S. and Jacobson, V. 1995. Link-sharing and resource management models for packet networks. IEEE/ACM Transactions on Networking 1, 4, 365–386.
Georgiadis, L., Guérin, R., Peris, V., and Sivarajan, K. 1996. Efficient network QoS provisioning based on per node traffic shaping. IEEE/ACM Transactions on Networking 4, 4, 482–501.
Handley, M., Floyd, S., Padhye, J., and Widmer, J. 2003. RFC 3448: TCP Friendly Rate Control (TFRC): Protocol Specification. Internet Engineering Task Force.
Iperf. http://dast.nlanr.net/Projects/Iperf.
Kenjiro, C. 1998. A framework for alternate queueing: Towards traffic management by PC-UNIX based routers. In USENIX 1998. 247–258.
Kum, S.-U., Mayer-Patel, K., and Fuchs, H. 2003. Real-time compression for dynamic 3D environments. ACM Multimedia 2003.
Kung, H. and Wang, S. 1999. TCP trunking: Design, implementation and performance. Proceedings of ICNP '99.
Ott, D. and Mayer-Patel, K. 2002. A mechanism for TCP-friendly transport-level protocol coordination. USENIX 2002.
Ott, D., Sparks, T., and Mayer-Patel, K. 2004. Aggregate congestion control for distributed multimedia applications. Proceedings of IEEE INFOCOM '04.
Padhye, J., Firoiu, V., Towsley, D., and Kurose, J. 1998. Modeling TCP throughput: A simple model and its empirical validation. Proceedings of ACM SIGCOMM.
Parris, M., Jeffay, K., and Smith, F. 1999. Lightweight active router-queue management for multimedia networking.
Rejaie, R., Handley, M., and Estrin, D. 1999. RAP: An end-to-end rate-based congestion control mechanism for realtime streams in the internet. Proceedings of IEEE INFOCOM.
Rizzo, L. http://info.iet.unipi.it/~luigi/ip_dummynet/.
Smith, F., Campos, F. H., Jeffay, K., and Ott, D. 2001. What TCP/IP protocol headers can tell us about the web. In ACM SIGMETRICS. 245–256.
Tennenhouse, D. L. and Wetherall, D. 1996. Towards an active network architecture. Multimedia Computing and Networking.
Wetherall, D. 1999. Active network vision and reality: Lessons from a capsule-based system. Operating Systems Review 34, 5 (December), 64–79.
Zhang, H. 1995. Service disciplines for guaranteed performance service in packet-switching networks. Proceedings of the IEEE 83, 10 (October), 1374–1396.