Performance Evaluation and Comparison of Middleware-Based QoS Support for Real-Time/Multimedia Applications versus Modified OS Kernels

By Saif Raza Hasan
University of Massachusetts, Amherst, MA, USA
2001

Table of Contents

1. Introduction
2. System Description
   2.1 QLinux Features
   2.2 TAO / CORBA Features
3. Experimental Setup
   3.1 Echo / Null RPC
   3.2 Simple HTTP Client-Server
   3.3 Concurrent HTTP Client-Server
   3.4 Compute Intensive HTTP Client-Server
   3.5 Long File / MPEG Download
   3.6 Video Streaming Without QoS Specification
   3.7 Video Streaming With QoS Specification
   3.8 Video Streaming Performance in the Presence of Another Application
   3.9 Industrial Control System Simulation
4. Summary & Discussion of Qualitative Comparisons
   4.1 QLinux
   4.2 TAO / CORBA
References

1. Introduction

Most common off-the-shelf (COTS) operating systems, such as Windows NT, Linux and most Unixes, are not designed to provide applications with QoS guarantees for access to system resources such as network bandwidth or CPU time. As a result, Real-Time/Multimedia applications, which have deadlines that must be met in a guaranteed fashion, perform poorly on these systems. Two approaches to addressing this problem are common. One is to modify the OS itself to incorporate mechanisms for guaranteeing QoS and to modify applications to use these mechanisms. The other is to use a middleware layer that runs on top of the unmodified operating system and provides an API that allows application programmers to specify QoS requirements, which the middleware then attempts to meet. In this project we have compared the performance and evaluated some of the trade-offs involved in these two approaches. We have used the QLinux kernel, which supports H-SFQ for process and packet scheduling, as an example of a modified COTS system, and the TAO ORB as the example of a real-time middleware.
TAO is an open-source implementation of the CORBA specification and contains optimizations and built-in support for providing QoS to its applications. We developed various applications and performed an experimental evaluation of their performance to obtain a quantitative and qualitative comparison of the two systems.

2. System Description

The systems that we are evaluating and comparing in this project are QLinux and TAO (a Real-Time CORBA implementation). The two systems represent two alternative approaches to providing QoS guarantees to soft Real-Time/Multimedia applications. The mechanisms and features provided by the two systems are different, and in this section we describe them in detail.

2.1 QLinux Features

QLinux is a Linux kernel that can provide quality of service guarantees. QLinux, based on the Linux 2.2.x kernel, combines some of the latest innovations in operating systems research. It includes the following features:

- Hierarchical Start Time Fair Queuing (H-SFQ) CPU scheduler
- Hierarchical Start Time Fair Queuing (H-SFQ) network packet scheduler

The H-SFQ CPU scheduler enables hierarchical scheduling of applications by fairly allocating CPU bandwidth to individual applications and application classes. The H-SFQ packet scheduler provides rate guarantees and fair allocation of bandwidth to packets from individual flows as well as flow aggregates (classes). The kernel also supports Lazy Receiver Processing (LRP) of packets; however, this feature was turned off in the kernel we used for our experiments.

In the QLinux framework for CPU scheduling[1], the hierarchical partitioning requirements are specified by a tree structure. Each thread in the system belongs to exactly one leaf node. Each non-leaf node represents an aggregation of threads and hence an application class in the system. Each node in the tree has a weight that determines the percentage of its parent node's bandwidth that should be allocated to it. Also, each node has a scheduler: whereas the scheduler of a leaf node schedules all the threads that belong to the leaf node, the scheduler of an intermediate node schedules its child nodes. Given such a scheduling structure, the scheduling of threads occurs hierarchically. The weights associated with the nodes determine the proportion of the CPU bandwidth available to the threads/application class represented by that node. Similarly, for the network packet scheduler[2], it is possible to create a hierarchical structure that represents the distribution of the available network bandwidth to the various flows in the system. This is done by attaching queues to leaf nodes in the scheduling hierarchy and associating sockets with the respective queues. These manipulations are all performed through special system calls made by an application running as root. This application can be the same as, or different from, the one requiring the QoS guarantees.

2.2 TAO / CORBA Features

TAO[5] is a freely available, open-source, and standards-compliant real-time implementation of CORBA that provides efficient, predictable, and scalable quality of service (QoS) end-to-end. Unlike conventional implementations of CORBA, which are inefficient, unpredictable, non-scalable, and often non-portable, TAO applies the best software practices and patterns to automate the delivery of high-performance and real-time QoS to distributed applications. Many types of real-time applications can benefit from the flexibility of the features provided by the TAO ORB and its CORBA services.
In general, these applications require predictable timing characteristics and robustness since they are used in mission-critical real-time systems. Other real-time applications require low development cost and fast time to market. Traditionally, the barrier to viable real-time CORBA has been that many real-time challenges are associated with end-to-end system design aspects that transcend the layering boundaries traditionally associated with CORBA. For this reason, TAO integrates the network interfaces, OS I/O subsystem, ORB, and middleware services in order to provide an end-to-end solution. For instance, consider the CORBA Event Service[6], which simplifies application software by supporting decoupled suppliers and consumers, asynchronous event delivery, and distributed group communication. TAO enhances the standard CORBA Event Service with important features such as real-time event dispatching and scheduling, periodic event processing, efficient event filtering and correlation mechanisms, and the multicast protocols required by real-time applications.

TAO is carefully designed using optimization principle patterns that substantially improve the efficiency, predictability, and scalability of the following characteristics, which are crucial to high-performance and real-time applications:

- TAO uses active demultiplexing and perfect hashing optimizations to dispatch requests to objects in constant, i.e., O(1), time regardless of the number of objects, IDL interface operations, or nested POAs.
- TAO's IDL compiler generates either compiled or interpreted stubs and skeletons, which enables applications to make fine-grained time/space tradeoffs in the presentation layer.
- TAO's concurrency model is carefully designed to minimize context switching, synchronization, dynamic memory allocation, and data movement. In particular, TAO's concurrency model can be configured to incur only one context switch, zero memory allocations, and zero locks in the "fast path" through the ORB.
- TAO uses a non-multiplexed connection model that avoids priority inversion and behaves predictably when used with multi-rate real-time applications.
- TAO's I/O subsystem is designed to minimize priority inversion and interrupt overhead over high-speed ATM networks and real-time interconnects such as VME. It also runs efficiently over standard TCP/IP protocols.

TAO provides all these optimizations within the standard CORBA 2.x reference model.

3. Experimental Setup

To evaluate the two systems, we developed and tested a set of applications for them and performed various performance measurements. For this purpose we used a fire-walled network of identical Linux workstations. The machines were all 300 MHz Pentium II systems with 192 MB RAM, connected to each other through a 10 Mbps Ethernet link and running Red Hat Linux 6.1. The applications were of the client-server type and were tested in two configurations: client running on the same host as the server, and client running on another host in the network (the remote client case). The following applications were developed for both systems (QLinux as well as CORBA) to compare various aspects of their performance. Since the two systems support slightly different application design frameworks, care was taken to keep the corresponding applications on both systems comparable.

3.1 Echo / Null RPC

This application was designed to measure the response time and the corresponding overhead for a single unit of synchronous communication.
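Throughout these experiments, per-request response times were obtained by reading the clock immediately before and after each request; the strace output later in this section shows gettimeofday() calls bracketing each request in exactly this way. The following is a minimal sketch of that timing pattern, not the exact code used: do_request() is a placeholder standing in for one synchronous exchange (a socket write/read or a CORBA invocation).

    #include <sys/time.h>
    #include <cstdio>

    // Placeholder for one synchronous request (socket write + read, or a
    // CORBA method invocation); the real experiments use their own code here.
    static void do_request() { /* ... */ }

    // Elapsed time between two gettimeofday() samples, in microseconds.
    static long elapsed_usecs(const timeval& start, const timeval& end)
    {
        return (end.tv_sec - start.tv_sec) * 1000000L + (end.tv_usec - start.tv_usec);
    }

    int main()
    {
        const int kRequests = 100;        // each experiment issues 100 sequential requests
        for (int i = 0; i < kRequests; ++i) {
            timeval start, end;
            gettimeofday(&start, 0);
            do_request();                 // one unit of synchronous communication
            gettimeofday(&end, 0);
            std::printf("%ld\n", elapsed_usecs(start, end));   // response time in usecs
        }
        return 0;
    }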
In this application, the client sends a small (usually empty) string to the server, which responds by sending the same string back. The metric in this application was the response time of a single request as observed from the client side.

In the QLinux version of the application, the client and server communicate through a simple, connection-oriented (TCP) socket. The server launches a new thread to service the request whenever a client connects to it. The server thread reads the string sent by the client from the socket and writes it back to the socket immediately. The client sends 100 sequential requests to the server, records the response time for each of these, and then exits. As this application does not involve any QoS-specific services, it was run on a plain Linux kernel.

In the CORBA-based application, we used the concept of a distributed object to implement the Echo server. The server implements the interface of the Echo object (the interface consists of a single function that takes a string argument and returns the same string). The server creates an instance of the Echo object, writes its unique identification string (IOR) to a disk file, and then waits for client requests. The client reads the server IOR string from the disk file to obtain a reference to the Echo object. It then performs the Echo function by means of an RPC-type mechanism. The actual sending and receiving of the data between the client and the server takes place through the ORB (Object Request Broker), and the details of this are completely transparent to the programmer. For a fair evaluation, we configured the server to operate in 'thread per request' mode. Other modes such as reactive, 'thread per connection' or a pool of threads are also available without any special coding effort from the developer. Also, this application does not use any QoS-related features of the TAO ORB, but it does benefit from any special optimizations built into the ORB.

The following table shows the average response times (usecs) for the Echo application for the two systems:

                                      Local Client           Remote Client
    Non-CORBA                         513.55 ± 7.01          767.45 ± 4.97
    CORBA (with first request)        648.99 ± 119.42        897.93 ± 119.65
    CORBA (without first request)     588.101 ± 4.31         836.899 ± 3.04

As can be seen from the table, on average the response time for the CORBA-based echo application was larger than for the simple socket-based one. This is because of the extra overhead which the middleware involves. An strace of the CORBA application revealed that although it too uses a TCP socket for its communication, the marshalling and demarshalling of arguments, the extra header information that needs to be sent, and the allocation and deallocation of the parameters at the server side were the cause of the larger response time for the CORBA application.

Figures 1 and 2 also show that in the CORBA case, the response time for the very first request made on the remote object is an order of magnitude larger than for subsequent requests to the same object. This difference can become significant if the application design is such that only one request is made per remote object (i.e., a remote object reference is never reused); in that case, every request would experience the extra overhead of the very first request.
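For reference, the non-CORBA echo client is essentially a small loop over a connected TCP socket. A sketch of its shape is given below; the server address and port are taken from the strace output shown shortly, the timing calls from the earlier sketch are omitted, and details such as error handling and logging differ in the actual code.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <cstring>
    #include <cstdio>

    int main()
    {
        // Connect once to the echo server (address and port are placeholders).
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        sockaddr_in addr;
        std::memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(4937);
        addr.sin_addr.s_addr = inet_addr("192.168.42.3");
        if (connect(fd, (sockaddr*)&addr, sizeof(addr)) < 0) {
            std::perror("connect");
            return 1;
        }

        const char msg[] = "Hello";
        char reply[64];
        for (int i = 0; i < 100; ++i) {
            // One echo exchange: send the string, read the same string back.
            // (Each exchange is bracketed by gettimeofday() calls in the real client.)
            write(fd, msg, sizeof(msg));
            read(fd, reply, sizeof(reply));
        }
        close(fd);
        return 0;
    }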
Figure 1: CORBA Null RPC response times (usecs) for 100 requests, local and remote clients
Figure 2: Socket Null RPC response times (usecs) for 100 requests, local and remote clients

The explanation for this can be seen from the following strace of the client application for the first two requests sent to the server:

gettimeofday({989986519, 979772}, NULL) = 0
brk(0x80e7000) = 0x80e7000
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 7
connect(7, {sin_family=AF_INET, sin_port=htons(4937), sin_addr=inet_addr("192.168.42.3")}, 16) = 0
fcntl(7, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(7, F_SETFL, O_RDWR) = 0
fcntl(7, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(7, F_SETFL, O_RDWR) = 0
setsockopt(7, SOL_SOCKET, SO_SNDBUF, [65535], 4) = 0
setsockopt(7, SOL_SOCKET, SO_RCVBUF, [65535], 4) = 0
setsockopt(7, IPPROTO_TCP1, [1], 4) = 0
getpid() = 20221
fcntl(7, F_SETFD, FD_CLOEXEC) = 0
getpeername(7, {sin_family=AF_INET, sin_port=htons(4937), sin_addr=inet_addr("192.168.42.3")}, [16]) = 0
rt_sigprocmask(SIG_BLOCK, ~[RT_1], [RT_0], 8) = 0
rt_sigprocmask(SIG_SETMASK, [RT_0], NULL, 8) = 0
writev(7, [{"GIOP\1\1\1\0F\0\0\0\0\0\0\0\0\0\0\0\1\237\0@\33\0\0\0\24"..., 82}], 1) = 82
rt_sigprocmask(SIG_BLOCK, ~[RT_1], [RT_0], 8) = 0
rt_sigprocmask(SIG_SETMASK, [RT_0], NULL, 8) = 0
select(8, [5 7], NULL, NULL, NULL) = 1 (in [7])
read(7, "GIOP\1\1\1\1\26\0\0\0", 12) = 12
read(7, "\0\0\0\0\0\0\0\0\0\0\0\0\6\0\0\0Hello\0", 22) = 22
gettimeofday({989986519, 987282}, NULL) = 0
fstat(1, {st_mode=S_IFREG|0640, st_size=20519, ...}) = 0
mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40923000
write(1, "7510\n", 5) = 5
gettimeofday({989986519, 988551}, NULL) = 0
writev(7, [{"GIOP\1\1\1\0F\0\0\0\0\0\0\0\1\0\0\0\1\237\0@\33\0\0\0\24"..., 82}], 1) = 82
rt_sigprocmask(SIG_BLOCK, ~[RT_1], [RT_0], 8) = 0
rt_sigprocmask(SIG_SETMASK, [RT_0], NULL, 8) = 0
select(8, [5 7], NULL, NULL, NULL) = 1 (in [7])
read(7, "GIOP\1\1\1\1\26\0\0\0", 12) = 12
read(7, "\0\0\0\0\1\0\0\0\0\0\0\0\6\0\0\0Hello\0", 22) = 22
gettimeofday({989986519, 990551}, NULL) = 0

The code between the two pairs of gettimeofday system calls shows the system calls made by the client for the first and the second requests to the same object. As can be seen, the client creates a socket and connects to the server only when the first request is made; for subsequent calls the connection is kept open and reused. Hence the first request involves a greater overhead. We expect similar behavior at the server side when requests are received, contributing to the same pattern; however, this could not be confirmed as strace failed to work correctly on the server (which is a multithreaded application).

3.2 Simple HTTP Client-Server

This application models a simple HTTP server that only receives requests for static HTML pages or objects. The client sends a properly formatted HTTP request to the server requesting a file. In our system, the client can request one of 8 possible files varying in size from 20 kB to 1.29 MB. The server receives the HTTP request, parses it, reads the requested file from its local disk, includes it in an HTTP response and sends it back to the client. For this experiment, there is only one client thread, which sequentially makes 100 requests for a file of the same size.
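The per-request client routine for this experiment simply writes a GET request for one of the test files and reads the response until the connection is closed. A function-level sketch follows; the file-naming scheme and request formatting are illustrative assumptions, not the exact ones used.

    #include <unistd.h>
    #include <cstdio>
    #include <cstring>

    // One HTTP request/response exchange on an already connected socket fd.
    // Returns the number of response bytes received (headers plus file body).
    long fetch_once(int fd, const char* path)
    {
        char request[256];
        std::snprintf(request, sizeof(request),
                      "GET %s HTTP/1.0\r\n\r\n", path);   // minimal HTTP/1.0 request
        write(fd, request, std::strlen(request));

        // Read until the server closes the connection.
        char buf[4096];
        long total = 0;
        ssize_t n;
        while ((n = read(fd, buf, sizeof(buf))) > 0)
            total += n;
        return total;
    }

    // Example (hypothetical file name): fetch_once(fd, "/files/640k_017.html");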
To avoid disk-caching effects, the server has 100 copies of the files of each possible size and reads a different one every time. The QLinux application again uses a simple TCP-based server that creates a new thread for every request. The new thread services the HTTP request and also logs the time taken for disk I/O and for processing the HTTP request. This can be used in conjunction with the response time measured at the client to determine the time spent by the request and response in the network. The CORBA version of the server uses a remote object with the 'thread per request' policy to perform the same operation as the QLinux server. The following plots indicate that the performance of the two systems in terms of response time is mostly comparable, though the QLinux application performs marginally better.

Figure 3: Local HTTP response times (usecs) vs. file size (kB), CORBA and non-CORBA
Figure 4: Remote HTTP response times (usecs) vs. file size (kB), CORBA and non-CORBA

3.3 Concurrent HTTP Client-Server

This experiment differs from the previous one in that, instead of having a single client thread, the client launches a number of threads (1-4), each of which makes 100 requests to the same HTTP server concurrently. This forces the server to service 1-4 requests simultaneously. In both applications the server remains unchanged from the previous experiment, and the clients were modified so that the client launches a certain number of threads on startup, each of which makes requests to the server. These experiments were done for a varying number of client threads (1-4), varying file sizes (20 kB to 1.28 MB), and with the client on a host local or remote to the server. Figures 5-8 summarize these results for the following representative cases:

- Number of client threads = 4, varying the file size
- File size = 640 kB, varying the number of client threads

Figure 5: Local HTTP response times, concurrency = 4 (response time in usecs vs. file size in kB)
Figure 6: Local concurrent HTTP response times, file size = 640 kB (response time in usecs vs. number of concurrent clients)
Figure 7: Remote HTTP response times, concurrency = 4 (response time in usecs vs. file size in kB)
Figure 8: Remote concurrent HTTP response times, file size = 640 kB (response time in usecs vs. number of concurrent clients)

3.4 Compute Intensive HTTP Client-Server

This again is a variation on the HTTP theme, simulating cgi-bin request processing. In this case the client sends an HTTP request to the server. In response, the server reads a small file from disk (40 kB), performs some computation (a complicated loop), and then sends the file it read from disk to the client. The server is designed so that the time taken for the computation constitutes most of the request processing time. In this case as well, the applications were created by simple modifications to the HTTP client-server described in the preceding experiments.
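In both the concurrent and compute-intensive experiments the load is generated by a multi-threaded client. The sketch below shows how such a client can launch a configurable number of request threads with POSIX threads; do_100_requests() is a placeholder for the per-thread request loop described above, and the argument handling is illustrative.

    #include <pthread.h>
    #include <cstdlib>

    // Placeholder for the per-thread work: connect to the server and issue
    // 100 requests, recording response times (as in the single-threaded client).
    static void* do_100_requests(void*)
    {
        /* ... */
        return 0;
    }

    int main(int argc, char** argv)
    {
        int nthreads = (argc > 1) ? std::atoi(argv[1]) : 4;   // 1-4 in these experiments
        pthread_t* tids = new pthread_t[nthreads];

        // Start all client threads so their requests overlap at the server.
        for (int i = 0; i < nthreads; ++i)
            pthread_create(&tids[i], 0, do_100_requests, 0);

        // Wait for every thread to finish before exiting.
        for (int i = 0; i < nthreads; ++i)
            pthread_join(tids[i], 0);

        delete[] tids;
        return 0;
    }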
Figures 9 and 10 show the variation of the average response time for a single request as the number of clients making concurrent requests to the server is increased from 1 to 15. As can be seen from the figures, the response time for these computation-intensive requests increases almost linearly as the number of simultaneous clients is increased. The CORBA application, because of the extra overheads mentioned earlier, performs slightly worse than the simple socket-based one. Also, since this application does not involve the transfer of large amounts of data over the network, the difference in response times between the local client and the remote client was not substantial.

Figure 9: Local compute-intensive HTTP response times with concurrency (response time in usecs vs. number of clients, 1-15)
Figure 10: Remote compute-intensive HTTP response times with concurrency (response time in usecs vs. number of clients, 1-15)

To simulate the computation at the server, we initially used a simple counting for-loop that counted to a large number. However, since TAO by default builds with the -O3 level of compiler optimization (g++), this simple loop got optimized out, so we had to make the loop more complicated and have it manipulate more dummy variables to prevent the compiler from optimizing it away.

3.5 Long File / MPEG Download

One of the models for serving a video stream is to assume that the client has sufficient buffers to accommodate the entire video file. In this model the server simply sends the video file to the client as fast as possible, letting the client buffer it and play it back whenever it wants. This model is not fundamentally different from the simple download of a large file. Since the video file is not being streamed at a real-time rate and there are no deadlines associated with the arrival of data, the only parameter of interest is the time taken to download the entire file (basically the throughput). In this experiment, we used this model and measured the time taken to download an MPEG-1 video file, with the server sending the video frames as fast as possible.

We used an MPEG-1 video file consisting of 5962 frames. Instead of actually using the video data, we generated a profile of the MPEG sequence that lists the size of each frame in the video. During initialization, the server reads this profile from disk and stores the frame sizes in an array. When transmitting, the server simply sends a blank block of data of the same size as the corresponding frame to the client. This approach eliminates the disk from the service loop and allows us to measure the pure network and application overhead of transmitting the data.

In the QLinux application, the server listens for client connections on a TCP socket. When a client connects to the server, the server spawns a new thread which starts sending the client data equivalent to the video frame sizes as fast as possible. The stream is preceded by a 4-byte header indicating the number of frames in the stream. Each frame in the stream is preceded by a 4-byte header specifying the size of the following frame in bytes. These headers are necessary so that the client knows how much data to receive and when to stop receiving. The client logs the arrival time of each frame once all the data for that frame has arrived.
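A sketch of the sending loop in the QLinux server thread, using the 4-byte headers described above, is shown below. frame_sizes[] stands for the array filled from the MPEG profile at startup; details such as byte ordering, buffer size and partial-write handling are illustrative assumptions rather than the exact code used.

    #include <netinet/in.h>
    #include <unistd.h>

    // Send the whole stream to a connected client as fast as possible.
    // frame_sizes[] holds the per-frame sizes read from the MPEG profile.
    void send_stream(int client_fd, const int* frame_sizes, int nframes)
    {
        static char blank[1 << 20];             // blank data standing in for frame contents

        // 4-byte header: total number of frames in the stream.
        int count = htonl(nframes);
        write(client_fd, &count, 4);

        for (int i = 0; i < nframes; ++i) {
            // 4-byte header: size of the following frame in bytes.
            int size = htonl(frame_sizes[i]);
            write(client_fd, &size, 4);
            // Frame body: a blank block of the same size as the real frame.
            write(client_fd, blank, frame_sizes[i]);
        }
    }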
For this kind of application, the synchronous RPC mechanism of the basic CORBA server is not suitable because of the amount of data sent from the server to the client and the manner in which it is sent. TAO includes an Audio/Video Streaming Service[3], which supports the semantics for setting up and tearing down streaming connections; however, that service focuses on the control/signaling involved during streaming and does not deal with the actual transport/QoS-related issues. The Real-Time Event Channel Service provided in the TAO implementation of CORBA is an ideal choice for the kind of communication required by this application.

The Event Channel service needs to be started before either the client or the server. The server starts up, creates an instance of a video supplier object, and listens for requests from clients through an RPC call. When a client wants to receive a video file from a server, it first launches a separate thread that connects to the Event Channel as a consumer. Since the Real-Time Event Service supports filtering of events at the consumer by event type, the consumer thread specifies to the event channel the ID of the events it wishes to receive. The consumer thread then waits for events to arrive. Meanwhile the other client thread obtains a reference to the video supplier object and performs an RPC on it, providing it with the event ID it is interested in receiving and telling the supplier to start sending the video frames. The supplier, on receiving the RPC from the client, connects to the event channel as a supplier of events with the ID specified by the client. The first event it sends to the client contains the number of frames it will be sending. The supplier then pushes events into the Event Channel containing one frame each.

In the general case, the Event Channel can run on any host in the distributed system and still be used as a communication channel between a supplier and a consumer on any other hosts. However, to allow a fairer comparison with the QLinux application, for our experiments the event channel was run on the same host as the supplier.

The following plots show the frame arrival times at the consumer for the QLinux and CORBA applications. It can be seen that the performance of CORBA is poorer than that of the QLinux application because of the extra overhead involved in doing the data transfer via an event channel. The difference between the two is much larger when the client and server are running on the same machine. The reason for this is that, in the remote client case, the delay incurred in the network (which is approximately the same for both applications) is much more substantial than the overhead due to the event channel.
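The frame arrival times plotted in Figures 11 and 12 below were logged at the client as each frame completed. For the socket-based client this amounts to a loop of roughly the following shape; this is a sketch assuming the length-prefixed framing described for the QLinux application, with names and error handling simplified.

    #include <netinet/in.h>
    #include <sys/time.h>
    #include <unistd.h>
    #include <cstdio>

    // Read exactly len bytes from fd (read() may return short counts).
    static void read_full(int fd, char* buf, int len)
    {
        int got = 0;
        while (got < len) {
            int n = read(fd, buf + got, len - got);
            if (n <= 0) break;
            got += n;
        }
    }

    void receive_stream(int fd)
    {
        static char buf[1 << 20];
        int header;

        read_full(fd, (char*)&header, 4);        // number of frames in the stream
        int nframes = ntohl(header);

        for (int i = 0; i < nframes; ++i) {
            read_full(fd, (char*)&header, 4);    // size of the next frame
            int size = ntohl(header);
            read_full(fd, buf, size);            // frame body (blank data)

            timeval now;                          // log the arrival time once the
            gettimeofday(&now, 0);                // whole frame has been received
            std::printf("%d %ld.%06ld\n", i, (long)now.tv_sec, (long)now.tv_usec);
        }
    }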
Figure 11: Frame arrival times (usecs) vs. frame number, local download, CORBA and non-CORBA
Figure 12: Frame arrival times (usecs) vs. frame number, remote download, CORBA and non-CORBA

To estimate the overhead of the event channel, we measured transmission delays for 1000-byte data blocks through direct socket transfers versus through an event channel. With the supplier and receiver on the same machine, we get the following numbers:

    CORBA (via Event Channel):   3134.22 ± 17.10 usecs
    Socket-based direct:         1279.32 ± 203.33 usecs

These numbers clearly indicate the extra overhead that the Event Channel adds to the data transmission.

3.6 Video Streaming Without QoS Specification

This experiment is a baseline evaluation of the performance of the two systems for streaming a video sequence at a real-time rate. The applications are mostly similar to those in the long MPEG download case, except that instead of transmitting all frames as soon as possible, the server transmits frames in rounds of 30 frames every second. Since the video sequence is a 30 fps one, this transmission rate corresponds to the frame rate. Although both systems allow specification of QoS parameters (in different ways), these were not enabled for this set of experiments. We experimented by increasing the number of concurrent streams until the server was unable to meet the frame rate.

In the QLinux application, apart from the rate at which the server sends the frames, there were no other changes from the previous experiment.

The CORBA Real-Time Event Service also needs the TAO Scheduling Service to be running in order to work. In addition, the supplier and consumer need to locate the Event Channel to be able to connect to it. In the previous experiments, this was done by using each service's unique identification string (the IOR). However, a real-life CORBA-based distributed system runs a "Naming Service" (similar to a DNS-style lookup). Each CORBA object (i.e., service-providing entity) registers itself with the Naming Service, which binds a name to it. Since the Naming Service runs at a well-known location, a client wishing to access a particular object can perform a lookup with the Naming Service and get a reference to the object. So for this and subsequent applications, we ran a Naming Service on the server machine. The Scheduling Service is then launched and registers itself as "SchedulingService". The Event Channel Service is launched next; it locates the Scheduling Service by doing a lookup on the Naming Service for "SchedulingService" and, on success, registers itself as "EventService". Once this is done, the supplier and the consumer can connect to the event channel by first locating it through a lookup for "EventService". This means that the server machine is running at least four CORBA applications at any time (Naming Service, Scheduling Service, Event Service and the supplier). Since these are large applications, they put a significant load on the CPU.
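For reference, resolving a registered service through the Naming Service uses the standard CosNaming interfaces. The sketch below shows the kind of lookup the supplier and consumer perform for "EventService"; the header names follow the TAO conventions of that era but may differ across versions, and exception handling is omitted.

    // Sketch of a Naming Service lookup for the "EventService" object.
    // Header names and error handling are abbreviated; details vary with the ORB version.
    #include "tao/corba.h"
    #include "orbsvcs/CosNamingC.h"

    CORBA::Object_ptr find_event_service(CORBA::ORB_ptr orb)
    {
        // The ORB knows where the Naming Service runs (its well-known location).
        CORBA::Object_var obj = orb->resolve_initial_references("NameService");
        CosNaming::NamingContext_var naming =
            CosNaming::NamingContext::_narrow(obj.in());

        // Build the name "EventService" and ask the Naming Service to resolve it.
        CosNaming::Name name;
        name.length(1);
        name[0].id = CORBA::string_dup("EventService");

        return naming->resolve(name);        // reference to the Event Channel service
    }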
Another important difference between this application and the MPEG download is that the supplier, instead of pushing a single event for each frame to be transmitted, creates an event set consisting of 30 events corresponding to the 30 frames of that round and pushes the whole event set at once. It is worth noting that although the application framework allows a consumer to receive multiple events (an event set) at a time, we found that even though we were pushing an event set of 30 events at the supplier end, the consumer was still receiving one event at a time.

We also ran these applications for the case of multiple concurrent streams being served to different clients through the same event channel. As described earlier, the event channel has support for filtering event delivery to consumers. Since the supplier is run in 'thread per request' mode, a new supplier thread is created for each client that wants to receive a video. These supplier threads connect to the same event channel and push events into it separately. Since each client (consumer) provides the supplier with a different event ID, the events being pushed into the channel by the various supplier threads are tagged with their respective IDs and are filtered by the event channel so that each consumer only gets the events belonging to its own stream. The event channel uses a pool of threads to do the dispatching of the events, and the size of the pool can be configured when the Event Service is started. For our experiments, we set the number of threads to be the same as the number of concurrent clients used in each experiment.

The metric: For comparing the performance of the two applications we used a metric based on the time taken to receive a round. When transmitting a stream, the server sends a round of 30 frames every second. If it completes transmitting the round before it is time to send the next round, the sending thread sleeps for the remaining duration. If, however, it is unable to send a round within one second, it starts sending the next round as soon as it finishes the current one. The client, which logs the arrival time of each frame, needs to receive a complete round (30 frames) every second to be able to play the video stream continuously. Therefore, at the client side we measure the amount of time taken to receive an entire round. This is the time from the arrival of the first frame to the arrival of the last (30th) frame of the round. If this time is less than 1 second, then the round was received in time for playback. If the time is greater than one second, this corresponds to a missed deadline. One effect that this metric does not capture is that if one of the earlier rounds is delayed and misses its deadline, it is likely to make subsequent rounds also miss their deadlines. By measuring the time to receive each round separately, we factor out this cumulative effect of deadline misses. Therefore, the two metrics we are interested in are:

1. Average Round Duration (as measured at the client) with an increasing number of streams
2. Distribution of Round Duration (around the deadline) and the effect of an increasing number of streams
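Given the per-frame arrival timestamps logged at the client, the round duration metric is computed by grouping frames into rounds of 30 and taking the time from the first to the last arrival in each round; rounds longer than one second count as missed deadlines. A sketch of that post-processing follows; the exact input format and reporting are illustrative.

    #include <vector>
    #include <cstdio>

    // arrival[i] is the arrival time of frame i in microseconds (from the client log).
    // Prints each round's duration and counts rounds that missed the 1-second deadline.
    void round_durations(const std::vector<long>& arrival, int round_size = 30,
                         long deadline_usecs = 1000000)
    {
        int missed = 0;
        for (size_t start = 0; start + round_size <= arrival.size(); start += round_size) {
            // Duration: first frame of the round to the last (30th) frame of the round.
            long duration = arrival[start + round_size - 1] - arrival[start];
            if (duration > deadline_usecs)
                ++missed;
            std::printf("%ld\n", duration);
        }
        std::printf("missed deadlines: %d\n", missed);
    }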
Figure 13: Average round durations (usecs) vs. number of streams, local client, no QoS
Figure 14: Average round durations (usecs) vs. number of streams, remote client, no QoS

As can be seen from Figure 13, when the client and server are running on the same host, the QLinux application not only outperforms the CORBA application but is also not affected significantly by increasing the number of simultaneous streams. The reason for this is that for local streaming the overhead due to the network is minimal, and since the QLinux applications are simpler and less memory- and CPU-intensive, they are not affected as much as the CORBA applications when a large number are running on the same system. However, Figure 14 shows that for the remote client the performance of the two systems is almost comparable. Both applications are unable to meet the frame rate (corresponding to an average round duration of 1 second) for 5 concurrent streams. Figures 15-18 show the distribution of round durations for the remote client for the cases of 4 and 5 concurrent streams.

We also experimented with changing the round size from 30 frames to 15 frames. In this case the server is required to send rounds of 15 frames every 500 msecs. We found that, because of the tighter deadline and possibly the 10 msec granularity of the Linux kernel clock, the server was unable to support even 4 concurrent streams for remote clients. The round durations for these cases are shown in Figures 19 and 20.

Figure 15: CORBA round duration distribution, 4 streams, remote, no QoS (fraction of rounds vs. round duration in usecs)
Figure 16: CORBA round duration distribution, 5 streams, remote, no QoS
Figure 17: Non-CORBA round duration distribution, 4 streams, remote, no QoS
Figure 18: Non-CORBA round duration distribution, 5 streams, remote, no QoS
Figure 19: CORBA round duration distribution, 4 streams, remote, round size = 15 frames, no QoS
Figure 20: Non-CORBA round duration distribution, 4 streams, remote, round size = 15 frames, no QoS

3.7 Video Streaming With QoS Specification
After testing the raw streaming performance of both systems, we repeated the experiments of the previous section, this time with the QoS features enabled. This also involved modifying the applications to specify the QoS guarantees they need from the system.

In the case of the QLinux application, the server modifies the process scheduling hierarchy to create an SFQ scheduling class with weight 1. A default best-effort class with weight 1 is automatically created when the kernel is booted. For each thread that the server creates to send a stream, it creates a new leaf node with weight 1 in the SFQ class and attaches the service thread to that node. Thus, in the server application, the thread that listens for client connections runs in the best-effort class and the threads responsible for streaming the video run in the SFQ class, all with the same weight. Similarly, the server also creates an SFQ class in the packet scheduling hierarchy and attaches the new socket created for a client connection to a newly created queue with weight 1 in the SFQ class.

The mechanism for providing a QoS specification is much simpler and more intuitive in the case of TAO. All the server program needs to do is associate the following QoS parameters with a certain type of event when connecting to an Event Channel as a supplier (or consumer):

- Worst-case time (to perform the task): set to 1 sec
- Typical time (to perform the task): set to 1 sec
- Cached time: set to 1 sec
- Period of the task: set to 1 sec
- Number of threads (of the event channel to use for this task): set to 1

The Real-Time Event Service uses these parameters to get a schedule from the Scheduling Service and schedules the Event Channel's pool of threads using that schedule.

The following figures show the variation in the average round duration as the number of concurrent clients increases, for a round size of 30 frames with a 1-second period.

Figure 21: Average round durations (usecs) vs. number of streams, local client, with QoS
Figure 22: Average round durations (usecs) vs. number of streams, remote client, with QoS

As can be seen from the above figures, the relative behavior of the two applications in terms of the average round duration does not change much from the case when QoS was not enabled. For local performance, QLinux still outperforms CORBA, and in the remote client case, although CORBA does better when the number of clients is low, the QLinux application scales better.

The following plots show the distributions of the round durations for these applications with 4 and 5 remote clients. It can be seen that QLinux with QoS enabled does manage to shorten the tail of the distribution noticeably, thus ensuring that a greater percentage of rounds arrive within the deadline.
Figure 23: CORBA round duration distribution, 4 streams, remote, with QoS (fraction of rounds vs. round duration in usecs)
Figure 24: CORBA round duration distribution, 5 streams, remote, with QoS
Figure 25: Non-CORBA round duration distribution, 4 streams, remote, with QoS
Figure 26: Non-CORBA round duration distribution, 5 streams, remote, with QoS

We also observed the effect of decreasing the round size to 15 frames transmitted every 500 msecs for 4 concurrent remote clients. Again we see that with QoS, the QLinux application does a better job of ensuring that a greater number of rounds are received within their deadlines. The improvement over the case without QoS is also more noticeable with QLinux.

Figure 27: CORBA round duration distribution, 4 streams, remote, round size = 15 frames, with QoS
Figure 28: Non-CORBA round duration distribution, 4 streams, remote, round size = 15 frames, with QoS

3.8 Video Streaming Performance in the Presence of Another Application

In this experiment, we attempted to measure how the performance of a time-critical application is affected by the presence of another, non-time-critical application. This was done by running the streaming server and the HTTP server on the same host. Four streaming clients and a certain number of HTTP clients make concurrent requests to their respective servers. We observe the effect of increasing the number of HTTP clients on the average round duration measured by the streaming clients. The results of some preliminary experiments are presented in Figures 29 and 30.

Figure 29: Average round durations (usecs) vs. number of HTTP clients, 4 streaming clients, no QoS
Figure 30: Average round durations (usecs) vs. number of HTTP clients, 4 streaming clients, with QoS

The experiments involving these applications are being continued by others and will be included in a later document.

3.9 Industrial Control System Simulation

This application simulates an industrial control console system[4]. The application consists of two distinct components.
The first component models a supervisor's command console, in which a client program from time to time issues a command to a remote actuator device (simulated by a remote server). The device performs the operation requested by the client (simulated by a brief busy-wait) and then sends an acknowledgement back to the client. The operator is simulated by automatically generated requests at random time intervals (100-300 msecs). The second component is a remote-sensing kind of application, in which a sensor (simulated by a supplier) periodically (every 20 msecs) sends an update of a measurement to a remote monitoring system (simulated by a consumer).

The QLinux version of this application consists of two client-server pairs, one for the command console and the other for the remote sensor. Both applications are non-concurrent and use a TCP-based socket for communication. In the command console application, the server, on receiving a connection, waits for the client to send an operation on the open socket, performs a brief busy-wait and sends back a single-byte acknowledgement. The client issues commands at random time intervals and records the response time for each request. In the remote-sensing application, the server, on getting a connection, periodically sends a byte of data to the client, which measures the arrival times of these updates.

For TAO, the command console application uses the RPC-like mechanism to send a request and receive a response. The remote-sensing application uses the Real-Time Event Channel for communication between the sensor and the monitoring system.

The metrics of interest in this application are the jitter in the arrival of sensor updates and the response times of the command console requests. These experiments were performed only for the remote client, and QoS was enabled for the remote-sensing application. The average response times (usecs) for the command console requests are summarized below:

                                  CORBA                   QLinux
    Remote Sensor NOT running     67974.33 ± 96.09        67617.63 ± 12.97
    Remote Sensor running         70960.69 ± 443.86       67853.9 ± 11.53

As can be seen from the above table, the performance of the command console application in the absence of the remote-sensing application is comparable for QLinux and CORBA; however, the degradation in performance when the remote-sensing application is added to the server load is more significant for CORBA than for QLinux, because of the extra overhead associated with CORBA applications.

For the remote-sensing applications, we measure the update sending times at the sender and the update receiving times at the client. The differences between successive update sending times give an indication of how well the threads of the sending application are scheduled; ideally these differences should be constant. These experiments were performed with the command console application also running simultaneously on the same hosts. The differences between update arrival times at the client also incorporate the jitter introduced by the communication medium between the two applications. These measurements are summarized in the following table:

    Avg. update differences (usecs)      CORBA                QLinux
    Sending side                         29996.37 ± 9.21      29997.65 ± 4.63
    Receiving side                       29992.54 ± 7.80      29997.63 ± 5.70

As can be seen, there is really not much to choose between the performance of the two systems; they both give an almost identical level of performance.
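A sketch of the socket-based remote-sensor pair is shown below: the sender writes one byte per period, and the receiver records the inter-arrival differences from which the averages in the table above were computed. The function names are illustrative, the period is passed in as a parameter, and details such as sleeping to an absolute deadline rather than for a fixed interval differ in the actual code.

    #include <sys/time.h>
    #include <unistd.h>
    #include <cstdio>

    // Sender side: write a 1-byte update on the connected socket every period_usecs.
    void run_sensor(int fd, long period_usecs, int nupdates)
    {
        char update = 0;
        for (int i = 0; i < nupdates; ++i) {
            write(fd, &update, 1);
            usleep(period_usecs);          // wait until the next update is due
        }
    }

    // Receiver side: read updates and print the difference between successive
    // arrival times; ideally these differences equal the sending period.
    void run_monitor(int fd, int nupdates)
    {
        long prev = 0;
        char update;
        for (int i = 0; i < nupdates; ++i) {
            read(fd, &update, 1);
            timeval now;
            gettimeofday(&now, 0);
            long t = now.tv_sec * 1000000L + now.tv_usec;
            if (i > 0)
                std::printf("%ld\n", t - prev);   // inter-arrival difference (usecs)
            prev = t;
        }
    }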
4. Summary & Discussion of Qualitative Comparisons

One of the objectives of this work was, apart from the quantitative comparison of the two systems presented above, to give a qualitative evaluation of the relative ease of use of the two systems and to identify any potential stumbling blocks. In our experience developing corresponding applications for both QLinux and TAO, we observed the following pros and cons of the two systems.

4.1 QLinux

One of the biggest advantages that QLinux offers its users is the ability to use the same API that most systems programmers are already familiar with (basic socket-based programming). Using the additional QoS features does not add significantly to application complexity. This makes porting already existing applications to use these features reasonably straightforward. The developer has to learn only a few new system calls, although the documentation (man pages) for these leaves some room for improvement. This is a significant advantage in the usability of the system.

On the other hand, the simplicity of its API is also a slight drawback, in that it does not provide the user with any high-level interfaces to simplify the task of developing networked/distributed applications. Also, since QLinux deals with QoS only in terms of H-SFQ scheduling parameters (i.e., bandwidth shares), the task of actually specifying the QoS for a given application is not straightforward. This is especially so since it is common to describe QoS parameters in terms of deadlines, end-to-end delays and periodicity, which do not fit directly into the QLinux framework.

The other noticeable drawback of the QLinux API is that it requires all QoS-related system calls to be executed in super-user mode, which makes simple, large-scale use impractical. A solution to this problem would be to develop run-time services which run as the super-user and provide an interface for other users' applications to request QoS for themselves; it would also be possible to perform some admission control in this way. Also, the error-reporting mechanisms for the CPU H-SFQ and link H-SFQ system calls are neither consistent nor convenient. The link H-SFQ system calls use system messages to report erroneous or exceptional conditions, which makes it hard to write clean and robust applications.

4.2 TAO / CORBA

Since TAO is first a Distributed Object Computing (DOC) middleware and then a Real-Time system, it suffers from the problem of being a large, complicated and very general-purpose application development framework. This implies that any person wishing to use TAO for a Real-Time application first has to overcome the significant hurdle of understanding the various concepts related to DOC and the large amount of detail associated with a specific DOC system (such as CORBA). For small, simple applications, this overhead can be prohibitive. Also, since the use of CORBA requires the developer to follow a certain application development framework, porting existing applications that use conventional socket-based communication to TAO is a very complicated and often impractical task.

However, once this initial hurdle is overcome, the middleware provides many features to make the task of developing distributed applications simpler and faster.
Since the middleware frees the developer from the lower-level implementation details (the socket programming), it substantially reduces development time for simple applications because of the time saved in writing hard-to-debug, error-prone networking code. This freedom, however, comes at the cost of flexibility. Thus, while it is extremely easy and quick to develop applications which conform well to the models commonly used in DOC, developing an application that has a unique or uncommon communication model might involve a lot of unnecessary complications.

Building TAO and TAO-based applications is surprisingly memory- and CPU-intensive. A minimal build of ACE, TAO and the TAO OrbServices took close to 4 hours on an otherwise lightly loaded 300 MHz Pentium II machine with 192 MB RAM. It takes 5-10 minutes to build the simplest TAO-based application from scratch, whereas a normal socket-based one takes only a few seconds.

For the purpose of specifying QoS guarantees, TAO provides a rather intuitive and easy-to-understand interface that involves directly specifying deadlines, periodicity of events, etc. Also, the ORB supports various concurrency models in servers and services, which can be selected simply by providing command-line parameters or configuration files at startup.

References

[1] P. Goyal, X. Guo and H. M. Vin, "A Hierarchical CPU Scheduler for Multimedia Operating Systems," Proceedings of the 2nd Symposium on Operating Systems Design and Implementation (OSDI '96), Seattle, WA, pages 107-122, October 1996.
[2] P. Goyal, H. M. Vin and H. Cheng, "Start-time Fair Queuing: A Scheduling Algorithm for Integrated Services Packet Switching Networks," Proceedings of ACM SIGCOMM '96, pages 157-168, August 1996.
[3] S. Mungee, N. Surendran and D. C. Schmidt, "The Design and Performance of a CORBA Audio/Video Streaming Service."
[4] K. Ramamritham, C. Shen, O. Gonzalez, S. Sen and S. Shirgurkar, "Using Windows NT for Real-Time Applications: Experimental Observations and Recommendations," Proceedings of IEEE RTAS '98.
[5] D. C. Schmidt, "TAO Overview," http://www.cs.wustl.edu/~schmidt/TAO-intro.html
[6] D. C. Schmidt, "Using the Real Time Event Service," http://www.cs.wustl.edu/~schmidt/events_tutorial.html