Variations in Performance and Scalability when Migrating n-Tier Applications to Different Clouds

Deepal Jayasinghe, Simon Malkowski, Qingyang Wang, Jack Li, Pengcheng Xiong, and Calton Pu
Center for Experimental Research in Computer Systems, Georgia Institute of Technology
266 Ferst Drive, Atlanta, GA 30332-0765, USA. {deepal, zmon, qywang, jyl, pxiong3, calton}@cc.gatech.edu

Abstract—The increasing popularity of computing clouds continues to drive both industry and research to provide answers to a large variety of new and challenging questions. We aim to answer some of these questions by evaluating performance and scalability when an n-tier application is migrated from a traditional datacenter environment to an IaaS cloud. We used a representative n-tier macro-benchmark (RUBBoS) and compared its performance and scalability in three different testbeds: Amazon EC2, Open Cirrus (an open scientific research cloud), and Emulab (an academic research testbed). Interestingly, we found that the best-performing configuration in Emulab can become the worst-performing configuration in EC2. Subsequently, we identified the bottleneck components, high context switch overhead and network driver processing overhead, to be at the system level. These overhead problems were confirmed at a finer granularity through micro-benchmark experiments that measure component performance directly. We describe concrete alternative approaches as practical solutions for resolving these problems.

Keywords—Benchmarking, Clouds, Datacenter, EC2, Emulab, IaaS, n-Tier, Open Cirrus, Performance, RUBBoS, Scalability.

I. INTRODUCTION

The flexibility and scalability of commercial cloud infrastructures make them an attractive application migration target; however, due to the associated complexity, it is difficult to make newly migrated applications run efficiently in clouds. For example, while clouds are good candidates for supplementary system platforms during occasional overload of Internet applications (e.g., electronic commerce), reports on Amazon EC2 consistently mention that network latency may affect overall system performance considerably [26], [28], [31]. Such issues are compounded by dependencies among system components as requests are passed among the tiers, which are characteristic of real n-tier applications. Despite some published best practices for the popular cloud environments (e.g., [9]), the tradeoff between guaranteed performance (e.g., bounded response time) and economical efficiency (e.g., high utilization for sustained loads) remains a serious challenge for mission-critical applications.

In this paper we analyze the performance and scalability observed when migrating n-tier applications from a traditional datacenter to an Infrastructure as a Service (IaaS) cloud. We use a representative n-tier macro-benchmark application (RUBBoS [7]) and perform a large-scale experimental study on three testbeds. We use Emulab [4] (a more traditional compute cluster environment) as our reference testbed and compare its performance and scalability characteristics to Open Cirrus (a scientific research cloud with high isolation) and to Amazon EC2 [8] (a popular commercial cloud). Our experiments cover scale-out scenarios under varying hardware and software configurations with up to 44 concurrent servers. These RUBBoS experiments are generated and executed automatically using the Elba toolkit [1], [21], [23].
Concretely, we use automated experiment management tools, which set up, execute, monitor, and analyze large-scale application deployment scenarios. For the detailed analysis of non-trivial performance issues and system bottlenecks, we use micro-benchmarks (both standard and custom) to zoom into individual system components and confirm our initial hypotheses at finer granularity.

In the course of our analysis, we found that configurations that work well in one environment may cause significant performance problems when deployed in a different cloud. In fact, we found that the best-performing RUBBoS configuration in Emulab can become the worst-performing configuration in EC2 due to a combination of factors. These factors, such as network driver overhead and thread context-switching overhead, are often subtle and not directly controllable by users. We provide a set of candidate solutions to overcome the observed performance problems. More generally, this study shows that clouds are a relatively immature technology and that significantly more experimental analysis will be necessary for public clouds to become truly suitable for mission-critical applications.

The remainder of this paper is structured as follows. Section II provides an overview of the benchmark application, the experimental setup, and the tools used during the experiments. In Section III we analyze performance and scalability on the three platforms. In Section IV we discuss the performance impact associated with multithreading, and Section V provides results on network overhead. Related work is summarized in Section VI, and Section VII concludes the paper.

II. EXPERIMENTAL SETTING

In this section we discuss our experimental approach. Section II-A introduces our experiment procedure and automated framework. In Section II-B and Section II-C we provide a brief overview of the macro-benchmark application and MySQL Cluster, respectively, and in Section II-D we present our experimental infrastructure, tools, and notations.

A. Experimental Automated Framework

To enable experimental research at the scale of thousands of experiments, we created a set of software tools in the Elba project [1], [21], [23] to automate the process of setting up, executing, and monitoring an experiment, in addition to logging results and performing statistical analysis. Our experimental cycle with the Elba tools consists of three stages, and each stage consists of a number of sub-steps.

For each new target platform (e.g., EC2), we start with the current Elba code generation templates, which have been successfully used in previous experiments and environments (e.g., Emulab). In the first stage, we enhance our code generator input templates with the API specifications of the new hardware and software platform on which the experiments are run. The last step of stage one uses the templates to generate the scripts that will create and execute the experiments.

In the second stage, the first step runs the initialization scripts to set up the experimental environment. These scripts take into account the dependencies among the system components, including hardware and hypervisor (usually invariant through the experiments), operating system configuration, and server configurations such as the database load on the database server. In the second step, we run the planned experiments according to the availability of hardware resources. For example, we usually run the experiments by increasing the workload. For each workload, we run the easily scalable (browse-only) scenario first, followed by read/write scenarios. To minimize cache inter-dependencies across experiments, after each batch of experiments we finish the data collection, ramp down the system, stop all servers, and start the next batch with sufficient ramp-up time. The iterations continue until all the experiments are finished.

The third stage consists of data analysis. In the first step of this stage, we copy all the collected data from the experimental environment to our data warehouse for analysis. In the second step, we run a number of statistical tools to analyze application behavior. If performance limitations are found, we run tools such as intervention and correlation analysis to find resource bottlenecks, including multi-bottlenecks where dependencies link multiple resources that limit their utilization. Sometimes additional data is needed, either due to data quality problems (e.g., interference during an experiment) or for a more refined analysis (e.g., previously uncovered configuration settings).
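For concreteness, the following minimal Python sketch illustrates the shape of the second-stage experiment loop described above: iterate over increasing workloads, run the browse-only scenario before the read/write scenario, collect data, and ramp down between batches. This is not the Elba toolkit itself; all function names and the workload range (taken from the Emulab runs described in Section III-A) are illustrative placeholders.

```python
# Illustrative skeleton of the Elba-style batch experiment loop (placeholders only).
import time

WORKLOADS = range(1000, 10001, 1000)       # concurrent clients, as in Section III-A
SCENARIOS = ["browse-only", "read-write"]  # the easily scalable scenario runs first

def setup_environment(config):  print("run init scripts for", config)   # placeholder
def run_scenario(c, s, w):      print("run", c, s, w)                   # placeholder
def collect_data(c, s, w):      print("collect", c, s, w)               # placeholder
def ramp_down(config):          time.sleep(0)                           # placeholder

def run_batch(config):
    setup_environment(config)              # stage 2, step 1: initialization scripts
    for workload in WORKLOADS:             # stage 2, step 2: increasing workloads
        for scenario in SCENARIOS:
            run_scenario(config, scenario, workload)
            collect_data(config, scenario, workload)  # later copied to the warehouse
        ramp_down(config)                  # stop all servers before the next batch

if __name__ == "__main__":
    run_batch("1-2-2-2-em")
```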
B. Macro-benchmark Application

RUBBoS [7] is an n-tier electronic commerce system modeled on bulletin board news sites such as Slashdot, and it is widely used in large-scale experimental studies. We modified the benchmark to suit our experiment workloads and to collect various statistics. The workload consists of 24 different interactions such as register user, view story, and post comments. The benchmark can be deployed as a three-tier or four-tier system and places a high load on the database tier. A typical RUBBoS deployment with MySQL Cluster is shown in Figure 1. The benchmark also includes two kinds of workload modes: browse-only and read/write interaction mixes. We have limited our discussion to browse-only workloads in this paper, mainly due to space constraints.

C. MySQL Cluster

MySQL Cluster [2] is an open source transactional database designed for scalable, high-performance access to data through both data partitioning and data replication. In our experiments we used an "in-memory" version of MySQL Cluster and the NDBCLUSTER storage engine. MySQL Cluster is implemented by a combination of three types of nodes (Management nodes, SQL nodes, and Data nodes). A Management node maintains the configuration of SQL nodes and Data nodes and handles the starting and stopping of those nodes. A Data node stores cluster data. An SQL node is the clustering middleware that handles the internal data routing for partitioning and the consistency among replicas for replication. Figure 1 illustrates the three types of nodes and how they are connected. The total number of Data nodes is the product of the replication factor and the number of partitions. For example, with a replication factor of 2 and the database divided into 2 partitions, the resulting MySQL Cluster will have four Data nodes.

Fig. 1. A typical RUBBoS deployment with MySQL Cluster.

TABLE I. HARDWARE CONFIGURATIONS IN THREE TESTBEDS
Platform     Type      Processor                Memory   Arch
Amazon EC2   Small     1 EC2 Compute Unit       1.7 GB   32-bit
Amazon EC2   Large     4 EC2 Compute Units      7.5 GB   64-bit
Amazon EC2   Ex Large  8 EC2 Compute Units      15 GB    64-bit
Amazon EC2   Cluster   33.5 EC2 Compute Units   23 GB    64-bit
Emulab       PC3000    3 GHz                    2 GB     64-bit
Open Cirrus  X3210     3.00 GHz (Quad Core)     4 GB     64-bit
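Recalling the Data-node sizing rule stated in Section II-C, the small Python helper below makes the arithmetic explicit; the function name and the example values are purely illustrative and not part of the MySQL Cluster API.

```python
def mysql_cluster_data_nodes(replication_factor: int, partitions: int) -> int:
    """Total Data nodes = replication factor x number of partitions (Section II-C)."""
    return replication_factor * partitions

# Example: replication factor 2 with the database split into 2 partitions
# yields the four Data nodes of a 1-2-2-4 style configuration.
assert mysql_cluster_data_nodes(2, 2) == 4
```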
D. Overview of Experiments

The experiments discussed in this paper were run in three testbeds (i.e., Emulab [4], EC2, and Open Cirrus). Table I shows the server types we used in the three platforms (one EC2 Compute Unit provides the equivalent CPU capacity of a 1.0–1.2 GHz 2007 Xeon processor). We evaluated both horizontal and vertical scalability on EC2; in contrast, we limited our analysis to horizontal scalability in Emulab and Open Cirrus. By horizontal scalability we mean increasing the number of nodes in the application (e.g., from 12 large instances to 16 large instances), and by vertical scalability we mean keeping the same number of nodes but with better hardware settings (e.g., from 12 large instances to 12 ex-large instances). The nodes were connected over 1 Gbps Ethernet in Emulab and InfiniBand in Open Cirrus; for EC2 the connection types were not exposed. To focus this study on the differences among the three platforms, we allocated each server to a dedicated physical node. Specifically, the sharing of physical resources (e.g., CPU) in a node is among the threads within each server, not between servers. For each concrete hardware configuration we determined the most appropriate software configuration (e.g., number of threads and thread pool sizes) by using the approach described in our previous work [34].

Each of our macro-benchmark experiment trials consisted of three periods: an 8-minute ramp-up, a 12-minute run period, and a 30-second ramp-down. Performance measurements (e.g., CPU or network bandwidth utilization) were taken during the run period using lightweight system monitoring utilities (e.g., dstat and sar) with a granularity of one second. While the macro-benchmark data gave us the interesting bottleneck phenomena described in this paper, we often need more detailed data to confirm concrete hypotheses on the non-trivial causes of such phenomena. We have used both existing tools and custom tools to obtain such micro-scale data. We employed a number of hardware monitoring and soft resource monitoring tools such as LMbench [5] and NetPIPE [6].

Notation: We use the notation #A-#T-#M-#D-ppp to represent each concrete configuration. #A denotes the number of Web servers, #T the number of Tomcat servers, #M the number of SQL nodes, and #D the number of Data nodes in the experiment. A short string (ppp) denotes the platform (ec2 for EC2, em for Emulab, and oc for Open Cirrus). For example, Figure 1 shows a 1-2-2-4-em configuration with 1 Web server, 2 Tomcat servers, 2 SQL nodes, and 4 Data nodes in Emulab. In EC2 we add the instance (node) type at the end (small, large, exlarge, or cluster).
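As a convenience when reading the result figures, the configuration labels can be unpacked mechanically. The short Python sketch below parses the #A-#T-#M-#D-ppp notation introduced above; it only illustrates the naming scheme (the dictionary keys are our own), and it is not a tool used in the experiments.

```python
def parse_config(label: str) -> dict:
    """Unpack a configuration label such as '1-2-2-4-em' or '1-2-2-2-large'."""
    a, t, m, d, platform = label.split("-", 4)
    return {
        "web_servers": int(a),      # #A: Apache web servers
        "tomcat_servers": int(t),   # #T: Tomcat application servers
        "sql_nodes": int(m),        # #M: MySQL Cluster SQL nodes
        "data_nodes": int(d),       # #D: MySQL Cluster Data nodes
        "platform": platform,       # em, oc, ec2, or an EC2 instance type suffix
    }

print(parse_config("1-2-2-4-em"))
print(parse_config("1-2-2-2-large"))   # EC2 label with the instance type appended
```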
III. CLOUD COMPARISON

In this section we present our experimental results. In Section III-A we present performance and scalability characteristics on the reference platforms (i.e., Emulab) and compare them with Open Cirrus, and Section III-B provides scalability characteristics and performance issues on EC2.

A. Performance on Reference Platforms

To compare and contrast the performance complications of commercial clouds against other alternative platforms it is essential to have a good reference; Emulab and Open Cirrus were used for this purpose. Methodically, we start with the smallest possible configuration and then gradually scale up based on the observed results. More concretely, we started on Emulab with the 1-2-2-2-em configuration, which consists of two SQL nodes and two Data nodes, and we performed experiments from 1,000 to 10,000 workloads (i.e., concurrent clients) with 1,000-workload increments. Each client runs as a separate thread and generates HTTP requests using an exponential distribution with an average think time of 7 s. As shown in Figure 2(a), for the 1-2-2-2-em configuration throughput saturated at 3,000 workloads. Detailed analysis of monitoring data shows that SQL node CPU utilization had become the bottleneck (see Figure 3(a)). To resolve it, we scaled up our experiment and introduced two additional Data nodes (the 1-2-2-4-em configuration) by partitioning the database into two groups. Running experiments with 4 Data nodes and 2 SQL nodes, we observed that the knee point shifted from 3,000 to 5,000. Next, we increased the number of SQL nodes to 4 to obtain the 1-2-4-4-em configuration, and we continued this scaling-up process until we resolved the hardware bottleneck. Through our analysis we found 1-4-16-16-em to be the smallest configuration that can sustain our workload without system hardware resources becoming bottlenecked (software resources started to saturate [34]). Notably, as shown in Figure 2(a), our benchmark application scales well on Emulab.

To better explain the observed performance characteristics on Emulab we have provided CPU utilizations for the 1-2-2-2-em and 1-2-2-4-em configurations; Figures 3(a), 3(b), and 3(c) illustrate the CPU utilization in the Tomcat server, the SQL nodes, and the Data nodes, respectively. As shown in Figure 3(b), for both configurations the SQL node CPU has become the bottleneck, and as a result we have a flat curve for the throughput values (see Figure 2(a)).

We used the same methodology and performed a similar set of experiments on the Open Cirrus cloud. We achieved comparatively higher throughput values, and more importantly, Open Cirrus shows much better scalability. Due to space limitations we have only provided the observed throughput values (see Figure 2(b)). Notably, in Open Cirrus we observed very low resource utilization (e.g., less than 30% CPU utilization), and thus we have omitted the utilization graphs.

Our analysis on the two reference platforms illustrated good application scalability. More specifically, we did not observe any performance degradation when migrating the n-tier application from a traditional datacenter to an IaaS cloud.

Fig. 2. Horizontal scalability (same node type) analysis with heterogeneous throughput performance characteristics for the three testbeds: (a) Emulab scale-out; (b) Open Cirrus scale-out; (c) EC2 scale-out.

Fig. 3. Average CPU utilization for the 1-2-2-2 and 1-2-2-4 configurations in Emulab and EC2: (a) Tomcat CPU; (b) SQL CPU; (c) Data node CPU.
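To make the client model of Section III-A concrete (each client is a thread that issues HTTP requests with exponentially distributed think time, 7 s on average), here is a minimal closed-loop client sketch in Python. The real RUBBoS workload generator is a Java program with 24 interaction types; the URL, thread count, and duration below are illustrative placeholders, not parameters from the experiments.

```python
# Minimal closed-loop client sketch: one thread per emulated user,
# exponential think time with a 7 s mean (Section III-A). Illustrative only.
import random, threading, time, urllib.request

BASE_URL = "http://apache-host/rubbos/"   # placeholder front-end URL
THINK_TIME_MEAN_S = 7.0
RUN_SECONDS = 60                          # the real run period is 12 minutes
rtts = []                                 # client-side round trip times

def client_loop(stop_at):
    while time.time() < stop_at:
        start = time.time()
        try:
            urllib.request.urlopen(BASE_URL, timeout=10).read()
        except OSError:
            pass                          # a real client would log the error
        rtts.append(time.time() - start)  # RTT as observed by the client thread
        time.sleep(random.expovariate(1.0 / THINK_TIME_MEAN_S))

threads = [threading.Thread(target=client_loop, args=(time.time() + RUN_SECONDS,))
           for _ in range(100)]           # e.g., 100 concurrent emulated users
for t in threads: t.start()
for t in threads: t.join()
print("requests completed:", len(rtts))
```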
B. Amazon EC2 Performance

1) Horizontal Scale-Out (same nodes): We used the same methodology as in Section III-A and performed a similar set of experiments in EC2. We kept as many experimental parameters untouched as possible (except platform-specific parameters). Our initial expectation was to achieve better or similar results to those from Emulab since we used better hardware in EC2. Surprisingly, our assumptions were shown to be wrong when we observed quite different behavior in EC2. As shown in Figures 2(a) and (b), the benchmark application scales well on both Emulab and Open Cirrus, and increasing the number of Data nodes gives better performance; however, in EC2 increasing the number of Data nodes reduces the performance. As shown in Figure 2(c), scaling from 1-2-2-2-large to 1-2-2-4-large significantly reduced the throughput, and moving from 1-2-2-4-large to 1-2-2-8-large reduced the throughput further. More specifically, the application that is scalable on Emulab and Open Cirrus shows poor scalability on EC2.

Our analysis of the system monitoring data in EC2 shows interesting results. Figures 3(a), 3(b), and 3(c) illustrate CPU utilization for both Emulab and EC2. As we have already discussed, in Emulab the key limiting factor was CPU consumption (mostly saturated); scaling up and down causes a CPU bottleneck shift from one tier to another or a concurrent bottleneck phenomenon in several tiers. In EC2, however, we observed a completely different utilization pattern and found that all the server types have very low CPU utilization. As illustrated in Figures 3(a), (b), and (c), the EC2 performance degradation was not caused by CPU utilization. For our experiments, we collected over 24 types of monitoring data (e.g., CPU, memory, disk read/write, network read/write, context switches, interrupts), and thus we extended our analysis to the other monitoring data and to micro-benchmarks. We observed two key issues: overhead associated with concurrent threads (e.g., context switching and scheduling overhead) and network driver overhead (e.g., latency and queuing delays).

2) Vertical Scalability (different nodes): Here we present our analysis results for vertical scalability on EC2. We started our analysis with 1-2-2-2-ec2 on small instances (similar hardware configurations as the PC3000s on Emulab) and observed very poor performance (as compared to EC2's horizontal scalability); our results are shown in Figure 4(a). Node-level data analysis revealed that for small instances, the Data node has considerably different CPU and I/O characteristics compared to that of large instances. There are two main reasons causing this issue: first, as Guohui et al. [28] described, small instances always get 40% to 50% of the physical CPU sharing, and second, differences in memory configurations (Table I). According to the MySQL Cluster documentation, small instances do not have sufficient memory to keep the complete database in memory, and thus the Data node needs to use virtual memory. As a practical solution to this problem, we moved to large instances and followed the same procedure; likewise, we continued our process for the other EC2 instance types. As shown in Figure 4(a), EC2 shows good vertical scalability as compared to its horizontal scalability. As illustrated in the figure, we achieved the highest throughput with cluster instances (cluster instances were the most powerful instances available at the time we performed our experiments). In addition, for 1-2-2-2-cluster we used software colocation and used only two cluster instances. In our colocation we deployed two sets of 1 Tomcat, 1 SQL node, and 1 Data node in a single cluster instance (i.e., a total of two nodes). In summary, the representative benchmark shows good horizontal scalability on Emulab and Open Cirrus and good vertical scalability on EC2, but, in contrast, poor horizontal scalability on EC2.
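The monitoring data mentioned above (in particular context switches and interrupts, sampled at one-second granularity) can be collected with standard utilities such as sar or dstat; the sketch below shows one equivalent way to derive the same two counters directly from Linux's /proc/stat, which exposes cumulative context-switch (ctxt) and interrupt (intr) counts. It is a stand-in for the actual monitoring scripts, not a reproduction of them.

```python
# Sample system-wide context switches and interrupts per second from /proc/stat,
# similar in spirit to the 1-second dstat/sar monitoring used in the experiments.
import time

def read_counters():
    counters = {}
    with open("/proc/stat") as f:
        for line in f:
            fields = line.split()
            if fields[0] in ("ctxt", "intr"):
                counters[fields[0]] = int(fields[1])  # cumulative since boot
    return counters

def sample(seconds=10):
    prev = read_counters()
    for _ in range(seconds):
        time.sleep(1)
        cur = read_counters()
        print("ctxt/s:", cur["ctxt"] - prev["ctxt"],
              "intr/s:", cur["intr"] - prev["intr"])
        prev = cur

if __name__ == "__main__":
    sample()
```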
IV. MULTITHREADING COSTS IN EC2

Here we present a multithreading overhead analysis for EC2. We discuss the observed application instability due to an increased number of threads, in Section IV-A we quantify the context switching overhead, and Section IV-B shows how to redesign the application to guard against the multithreading issues.

Multi-tier systems are inherently difficult to analyze due to the complex dependencies of the system. As previously discussed (see Section III-B), the benchmark application saturated in EC2 due to the effects of initially hidden resource utilization. Consequently, we extended our systematic analysis with a detailed micro-level analysis. Subsequently, we observed an interesting phenomenon when using multithreaded applications on EC2. To illustrate, we selected a RUBBoS client workload generator and measured the round trip time (RTT) for each request while increasing the number of concurrent threads. We observed a sudden increase in RTT when the number of threads increases. Typically, these types of observations lead one to assume server saturation; despite this, we observed the issue to be caused by instability at the client.

To further illustrate the observed phenomena we used the RUBBoS application and logged the response time for each individual request at the client nodes and at the Apache web server. Next, we used these two logs and calculated the average RTT separately for each. The results are shown in Figure 4(b). For higher workloads, the RTT calculated using the client logs differs significantly from that of the Apache logs. In contrast, for lower workloads, the RTT difference is negligible (e.g., the RTT for workloads less than 3000). Figure 4(b) illustrates how a workload increase (i.e., more concurrent threads) causes the recording of larger and inaccurate RTTs at the client. Using micro-benchmark studies we observed that the recorded RTT difference is largely due to I/O-intensive thread dispatching overheads, which lead to the unstable behavior in the client.

Fig. 4. Detailed EC2 throughput and RTT analysis: (a) Vertical scalability (scales well); (b) RTT comparison of Apache, client, and client with reduced threads; (c) Throughput before and after reducing thread overhead.

A. Context Switching Overhead

To further analyze the previously observed phenomenon we used LMbench [5]. We started with the default settings of LMbench and performed all supported tests on Emulab and EC2. We found surprising results for context switching overhead: it recorded the context switching overhead in EC2 to be twice as high as in Emulab. Figure 5(a) shows the time for a single switch when varying the number of threads. As generally accepted, measuring context switching time with high accuracy is a difficult task. Nevertheless, the analysis shows that for a small number of threads (< 40) EC2 takes less time per switch compared to Emulab; in contrast, with a greater number of threads, EC2 takes considerably longer.

In our benchmark, the client is configured to have more than 250 concurrent threads; therefore, context switching becomes a key issue. To illustrate the importance of this issue, we calculated the average number of context switches when running the benchmark application. Figure 5(b) shows the calculated averages for both EC2 and Emulab. As shown in the figure, Emulab has a significantly higher number of context switches compared to that of EC2. Moreover, combining Figure 2(a), (c) and Figure 5(b), we can observe a close relationship between EC2 throughput and the number of context switches.

To analyze the context switching overhead from the resource perspective, we calculated the overhead as CPU percentages. In our experiments we measured the number of switches per second (see Figure 5(b)); using LMbench we approximated the time for a single context switch (against the number of threads); and finally, combining the two values, we estimated the context switching overhead for an experiment cycle. Figure 5(c) illustrates the calculated overhead. As shown in the figure, for workloads higher than 3000, context switching uses 10% of the CPU resources in both Emulab and EC2. Note that the overhead is much more significant in EC2, because the number of context switches and the throughput are significantly lower while the overhead is similar.

Fig. 5. Comparative context switching analysis for Emulab and EC2: (a) Time per switch (measured by LMbench); (b) Context switches per second; (c) Fraction of CPU spent on switching.
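The overhead estimate described above is simply the product of two measurements: the context-switch rate observed during a run (from sar/dstat) and the per-switch cost approximated with LMbench at the corresponding thread count. The short sketch below spells out that arithmetic; the sample numbers are hypothetical and chosen only to show how a rate and a per-switch cost combine into a CPU fraction of about 10%, as reported for workloads above 3000.

```python
def context_switch_cpu_fraction(switches_per_sec: float, usec_per_switch: float) -> float:
    """Fraction of one CPU spent context switching = switch rate x per-switch cost."""
    return switches_per_sec * (usec_per_switch * 1e-6)

# Hypothetical illustration (not measured values from the paper):
# 20,000 switches/s at 5 microseconds per switch occupy about 10% of a CPU.
print(f"{context_switch_cpu_fraction(20_000, 5.0):.1%}")
```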
B. Proposed Alternative Approach

We experimentally showed how multithreading can affect application performance in cloud environments. Here we propose how to redesign the application to resolve the observed issues.

First, the most trivial solution in commercial clouds is to rent more nodes so that each node gets a lower number of users (threads). This approach helps to reduce the overhead caused by concurrent threads (see Section IV-A) and eventually brings the application into a stable state. While this solution is acceptable for occasional users that run an application only a few times, for long-term users who run their applications often or continuously it is not a viable option, mainly due to the associated renting cost.

Second, the overhead can be reduced by reducing the number of threads and increasing the amount of work done by a single thread. For example, we redesigned the RUBBoS workload generator to generate 7 times more messages (with the same number of threads) by reducing the average client think time (e.g., reducing the think time from 7 s to 1 s). This code-level change is analogous to loop unrolling. Our solution shows significant improvements, both in RTT (Figure 4(b)) and in throughput (Figure 4(c)). We confirmed our hypothesis by calculating the average response time using the Apache logs and the clients' logs. With the thread reduction, the response time recorded at a client is much closer to that of the Apache logs. The remaining difference can be explained by the queuing effects that naturally take place between the two tiers. While this solution is practical in many cases, it depends on the nature of the application whether the desired concurrency level can still be reached in this manner.
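A back-of-the-envelope closed-loop calculation shows why cutting the think time from 7 s to 1 s yields roughly 7 times more requests from the same number of threads: each client thread completes about 1/(Z + R) requests per second, where Z is the think time and R is the response time, so when R is small relative to Z the request rate scales almost inversely with Z. The numbers below are illustrative, not measurements from the paper.

```python
def requests_per_sec(threads: int, think_time_s: float, response_time_s: float) -> float:
    """Closed-loop client model: each thread completes 1/(Z + R) requests per second."""
    return threads / (think_time_s + response_time_s)

threads, response_time = 250, 0.05          # hypothetical values for illustration
before = requests_per_sec(threads, 7.0, response_time)
after = requests_per_sec(threads, 1.0, response_time)
print(f"before: {before:.1f} req/s, after: {after:.1f} req/s, gain: {after / before:.1f}x")
```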
V. NETWORK DRIVER OVERHEAD

We increased the number of client nodes to 10 and thereby resolved the client issue, which subsequently shifted the bottleneck to the back-end. In Section V-A we show the network traffic increase caused by database partitioning and the subsequent network transmission overhead on EC2. In Section V-B we provide our solutions to achieve higher performance in production clouds.

A. Network Driver Overhead

We extended our analysis to the network layer, and our results revealed two important findings. First, increasing the number of Data nodes (partitioning the database) caused a significant increase in the network traffic generated at the Data nodes. Second, we observed significantly higher transmission queue sizes for EC2 compared to that of Emulab. We observed that moving from 2 Data nodes to 4 Data nodes doubles the network traffic. Nevertheless, for the 1-2-2-2 configuration both Emulab and EC2 show similar network traffic patterns and also somewhat similar throughput values. However, the 1-2-2-4 configuration shows different behavior both in network traffic and in throughput (Figure 2(a) and (c)).

We selected 1-2-2-4-em for Emulab and 1-2-2-4-large for EC2 and used our network analysis tools to observe the network behavior. Through our data, we observed surprising behavior in the Data node server. As shown in Figure 6(a), the sending queue size (at the network layer) of the Data node in EC2 is much higher than in Emulab. In contrast, the receiving sides of both EC2 and Emulab show similar characteristics, and both have negligible queue sizes.

In our benchmark, the Data node is the first server that starts to generate a large amount of data in the request propagation chain. Initially, users send an HTTP request to Apache HTTPd, which forwards the request to Tomcat. Tomcat processes the request and generates SQL statements that are sent to the SQL nodes. The SQL nodes send the queries to the Data nodes, which then process those queries and generate results. In this process, especially in the read-only scenario, the sizes of the generated output messages are significantly larger than those of the incoming messages. Therefore, when the message rate is high, the Data node generates more data. However, when there is a problem at the transmitting side, the Data node cannot send the data as required, which then results in a long queue at the network buffers. Next, in the other tiers (e.g., Tomcat and Apache HTTPd), the connections start to wait as well. Therefore, the overall message transmission rate is reduced. Eventually this behavior affects the entire system performance.

To further analyze the observed hypothesis we used NetPIPE [6] and measured the achievable bandwidth. NetPIPE is a protocol-independent performance tool that visually represents the network performance under a variety of conditions. Our results show that the two systems have similar bandwidth, thus confirming that the sending buffer issue is not caused by network bandwidth.

We then used the Unix ping program and measured the RTT between the two Data nodes in the two environments. First, we evaluated the ping RTT without generating any application load, thus making the ping program the only process. Second, we evaluated the RTT while generating representative application load. We observed the following very interesting behavior. When there is no load, EC2 shows significantly lower RTT (approximately 20 times less) compared to Emulab (see Figure 6(b)). In contrast, when populating the database, EC2 shows very large RTTs that vary between 0.4 ms and 30 ms (see Figure 6(c)), while Emulab shows similar results in both cases (an average of 0.2 ms). This provides additional evidence to confirm the network driver overhead on EC2 and the potential performance degradation.

Fig. 6. Analysis of network driver overhead and characteristics for Emulab and EC2: (a) Data node sending queue length analysis; (b) Ping RTT without load (EC2 is lower); (c) Ping RTT with load (Emulab is lower).
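The paper does not state which tool produced the queue-length data in Figure 6(a); one common way to observe per-connection send and receive queues on Linux is the ss utility (or netstat), whose output includes Recv-Q and Send-Q columns. The hedged sketch below samples those columns once per second and reports the total send-queue backlog; the port filter is a placeholder, since MySQL Cluster transporter ports depend on the cluster configuration.

```python
# Sample total TCP send-queue backlog once per second using `ss -tn`
# (columns: State, Recv-Q, Send-Q, Local Address:Port, Peer Address:Port).
# Illustrative stand-in for the network-layer queue measurement in Section V-A.
import subprocess, time

def total_send_queue_bytes(port_filter: str = "") -> int:
    out = subprocess.run(["ss", "-tn"], capture_output=True, text=True).stdout
    total = 0
    for line in out.splitlines()[1:]:          # skip the header line
        fields = line.split()
        if len(fields) >= 5 and (not port_filter or port_filter in fields[3]):
            total += int(fields[2])            # Send-Q column, in bytes
    return total

if __name__ == "__main__":
    for _ in range(10):
        print("send-Q bytes:", total_send_queue_bytes())
        time.sleep(1)
```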
Fig. 7. Performance comparison of C-JDBC vs. MySQL Cluster: (a) Throughput improvement; (b) Database tier total traffic.

B. Proposed Alternative Approach

As we discussed in Section V-A, due to this network overhead, the scalable software system becomes unscalable in EC2. We experimentally evaluated ways to overcome this issue. In general, MySQL Cluster is a sophisticated and heavily used database middleware in industry. It has a number of advantages, including availability, reliability, and scalability, compared to simpler replication solutions. Unfortunately, due to the higher network traffic generation of MySQL Cluster, the complete application shows poor scalability.

To illustrate the significance of this issue and to provide an alternative solution, we used the C-JDBC [30] middleware and performed the same set of experiments as in Section III-A on EC2. The observed throughput values are shown in Figure 7(a). As shown in the figure, C-JDBC shows very good scalability and achieves very high throughput. Next, we measured the amount of data generated at the database tier by the two approaches, and our results are shown in Figure 7(b). As demonstrated in the figure, C-JDBC generates a significantly smaller amount of data compared to MySQL Cluster. Consequently, the middleware puts lower pressure on the network, which results in better performance. In contrast, MySQL Cluster produced large network traffic and higher pressure on the network driver.

VI. RELATED WORK

Traditionally, performance analysis in IT systems builds models based on expert knowledge and uses a small set of experimental data to parameterize them [15], [16], [33]. The most popular representative of such models is queuing theory. Queuing networks have been widely applied in many performance prediction methodologies [17], [18]. These approaches are often constrained because of their rigid assumptions when handling n-tier systems due to the complex dependencies. As an illustration of significant characteristics that are hard to capture with traditional analysis, consider the significance of context switching, as shown in this paper, for system performance when a large number of threads is involved. While the increasing popularity of cloud computing has spawned very interesting research on private and public clouds, to the best of our knowledge, no previous work has evaluated IaaS clouds using complex n-tier systems to find performance and scalability issues at all tiers.

Li et al. presented a method for achieving optimization in clouds by using performance models in the development, deployment, and operation of applications that run in the cloud [10]. They illustrated the architecture of the cloud, the services offered by the cloud to support optimization, and the methodology used by developers to enable runtime optimization of the clouds. Ristenpart et al. analyzed the cloud for fundamental risks that may arise from sharing physical infrastructure between mutually distrustful users, even when their actions are isolated through machine virtualization [14]. Ward et al. discussed the issues and risks associated with migrating workloads to clouds, and the authors also proposed an automated framework for "smooth" migration to the cloud [25]. In contrast, we have studied the potential scalability and performance issues after migration. Palankar et al. evaluated Amazon S3 as a black box and formulated recommendations for integrating S3 with science applications and for designing future storage utilities targeting this class of applications [13]. They discussed how costs can be reduced by exploiting data usage and application characteristics to improve performance and, more importantly, by introducing user-managed collaborative caching in the system. Despite the apparent differences, this work has some important similarities to our work: first, both approaches evaluate commercial clouds; second, both papers use real applications and not simulations; third, both approaches derive recommendations to address the uncovered issues. Wang et al. have deeply analyzed the network overhead on EC2 and presented a quantitative study on end-to-end network performance among different EC2 instances [28]. Furthermore, Walker presented a performance analysis showing that a performance gap exists between performing HPC computations on a traditional scientific cluster and on EC2-provisioned scientific clusters [26]. Similar to our findings, his experiments also revealed the network as one of the key overhead sources.
As part of our analysis, we have shown the significance of virtualization and the associated network overhead. Apparao et al. have looked at both the transmit and receive aspects of this kind of workload. Concretely, they analyzed virtualization overheads and found that in both cases (i.e., transmit and receive) the limitations of VM scaling are due to dom0 bottlenecks for servicing interrupts [3]. Koh et al. also analyzed the I/O overhead caused by virtualization and proposed an outsourcing approach to improve the performance [27].

VII. CONCLUSION

In this paper we presented an experimental analysis of performance and scalability when n-tier applications are migrated to IaaS clouds. We employed the RUBBoS benchmark to measure the performance on Emulab, Open Cirrus, and EC2. The comparison of EC2 and Emulab, in particular, yielded some surprising results. In fact, the best-performing configuration in Emulab became the worst-performing configuration in EC2 due to a combination of several factors. Moreover, in EC2 the network sending buffers limited the overall system performance. For computation-intensive workloads, a higher number of concurrent threads performed better in EC2, while for network-bound workloads, high thread counts in EC2 showed significantly lower performance. Our data also exemplified the significance of context switching overheads. We provided a set of suitable candidate solutions to overcome the observed performance problems. More generally, this work enhances the understanding of the risks and rewards when migrating n-tier application workloads into clouds and shows that clouds will require a variety of further experimental analyses to be fully understood and accepted as a mature technological alternative. In addition, our experimental results show the need to refactor n-tier applications before they are deployed on commercial clouds, especially when production SLAs have to be satisfied.

VIII. ACKNOWLEDGMENT

This research has been partially funded by the National Science Foundation through the IUCRC, CyberTrust, CISE/CRI, and NetSE programs, National Institutes of Health grant U54 RR 02438001, PHS grants (UL1 RR025008, KL2 RR025009, or TL1 RR025010) from the Clinical and Translational Science Award program, National Center for Research Resources, and gifts, grants, or contracts from Wipro Technologies, Fujitsu Labs, the Amazon Web Services in Education program, and the Georgia Tech Foundation through the John P. Imlay, Jr. Chair endowment. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or other funding agencies and companies mentioned above.

REFERENCES

[1] The Elba project. http://www.cc.gatech.edu/systems/projects/Elba/.
[2] MySQL Cluster. http://www.mysql.com/products/database/cluster/.
[3] P. Apparao, S. Makineni, and D. Newell. Characterization of network processing overheads in Xen. In Second International Workshop on VTDC, 2006.
[4] Emulab - Network Emulation Testbed. http://www.emulab.net.
[5] LMbench - Tools for performance analysis. http://www.bitmover.com/lmbench/.
[6] NetPIPE - A Network Protocol Independent Performance Evaluator. http://www.scl.ameslab.gov/netpipe/.
[7] RUBBoS: Bulletin board benchmark. http://jmob.objectweb.org/rubbos.html.
[8] Amazon Elastic Compute Cloud (EC2). http://aws.amazon.com/ec2/.
[9] The Economist. A special report on corporate IT. http://www.economist.com/specialReports/showsurvey.cfm?issue=20081025, October 2008.
[10] J. (Zhanwen) Li, J. Chinneck, M. Litoiu, and G. Iszlai. Performance Model Driven QoS Guarantees and Optimization in Clouds. In CLOUD'09, Vancouver, Canada, May 2009.
[11] A. Lenk, M. Klems, J. Nimis, S. Tai, and T. Sandholm. What's Inside the Cloud? An Architectural Map of the Cloud Landscape. In CLOUD'09.
[12] E. P. Holden, J. W. Kang, D. P. Bills, and M. Ilyassov. Databases in the Cloud: a Work in Progress. In SIGITE'09, Fairfax, Virginia, USA, October 2009.
[13] M. Palankar, A. Iamnitchi, M. Ripeanu, and S. Garfinkel. Amazon S3 for Science Grids: a Viable Solution? In DADC'08, Boston, USA.
[14] T. Ristenpart, E. Tromer, H. Shacham, and S. Savage. Hey, You, Get Off of My Cloud: Exploring Information Leakage in Third-Party Compute Clouds. In CCS'09, Chicago, USA, November 2009.
[15] R. Jain. The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling. John Wiley & Sons, Inc., New York, NY, USA, 1991.
[16] D. J. Lilja. Measuring Computer Performance - A Practitioner's Guide. Cambridge University Press, New York, NY, USA, 2000.
[17] C. Stewart and K. Shen. Performance modeling and system management for multi-component online services. In NSDI '05.
[18] B. Urgaonkar, G. Pacifici, P. Shenoy, M. Spreitzer, and A. Tantawi. An analytical model for multi-tier internet services and its applications. SIGMETRICS Perform. Eval. Rev., 33(1):291–302, 2005.
[19] I. Cohen, M. Goldszmidt, T. Kelly, J. Symons, and J. S. Chase. Correlating instrumentation data to system states: a building block for automated diagnosis and control. In OSDI'04.
[20] M. K. Aguilera, J. C. Mogul, J. L. Wiener, P. Reynolds, and A. Muthitacharoen. Performance debugging for distributed systems of black boxes. In SOSP '03.
[21] G. Swint, G. Jung, C. Pu, and A. Sahai. Automated Staging for Built-to-Order Application Systems. In NOMS 2006, Vancouver, Canada.
[22] G. Swint, C. Pu, C. Consel, G. Jung, A. Sahai, W. Yan, Y. Koh, and Q. Wu. Clearwater - Extensible, Flexible, Modular Code Generation. In ASE 2005.
[23] S. Akhil, C. Pu, G. Jung, Q. Wu, W. Yan, and G. Swint. Towards Automated Deployment of Built-to-Order Systems. In DSOM 2005.
[24] M. Welsh, D. Culler, and E. Brewer. SEDA: An Architecture for Well-Conditioned, Scalable Internet Services. In SOSP-18.
[25] C. Ward, N. Aravamudan, K. Bhattacharya, K. Cheng, R. Filepp, R. Kearney, B. Peterson, and L. Shwartz. Workload Migration into Clouds - Challenges, Experiences, Opportunities. In CLOUD 2010.
[26] E. Walker. Benchmarking Amazon EC2 for High Performance Scientific Computing. USENIX ;login:, vol. 33(5), Oct 2008.
[27] Y. Koh, C. Pu, Y. Shinjo, H. Eiraku, G. Saito, and D. Nobori. Improving Virtualized Windows Network Performance by Delegating Network Processing. In NCA 2009.
[28] G. Wang and T. S. Eugene Ng. The Impact of Virtualization on Network Performance of Amazon EC2 Data Center. In INFOCOM 2010.
[29] M. Ronstrom and J. Oreland. Recovery Principles of MySQL Cluster 5.1. In VLDB 2005, Trondheim, Norway.
[30] E. Cecchet, J. Marguerite, and W. Zwaenepoel. C-JDBC: Flexible database clustering middleware. In Proceedings of the USENIX 2004 Annual Technical Conference, 2004.
[31] N. Yigitbasi, A. Iosup, S. Ostermann, and D. H. J. Epema. C-Meter: A Framework for Performance Analysis of Computing Clouds. In Cloud 2009.
[32] S. Malkowski, M. Hedwig, D. Jayasinghe, C. Pu, and D. Neumann. CloudXplor: A tool for configuration planning in clouds based on empirical data. In ACM SAC 2010.
[33] K. Xiong and H. Perros. Service Performance and Analysis in Cloud Computing. In CLOUD 2010.
[34] Q. Wang, S. Malkowski, D. Jayasinghe, P. Xiong, C. Pu, Y. Kanemasa, M. Kawaba, and L. Harada. The Impact of Soft Resource Allocation on n-Tier Application Scalability. In IPDPS 2011.