Scalable and Highly Available Infrastructure for J2EE Applications

A Case Study: ETA - Education and Training Administration System
Embry-Riddle Aeronautical University

Written by
John Vaughan, DataRoad, Inc.
Marty Smith, Embry-Riddle Aeronautical University

Introduction

In this white paper we discuss the development of a highly available, scalable and secure infrastructure designed to support the operation of a web-based J2EE application. The project involved implementing a flight training management application for Embry-Riddle Aeronautical University. This application provides management of flight training operations to a variety of organizations and is the main focus of efforts at Embry-Riddle to standardize, automate and secure control over flight training activities globally.

Applications that provide these services must be able to combine existing information with new business functions that deliver services to a broad range of users. These services need to be:

Highly available, to meet the demands of an extended business environment
Secure, to protect the privacy and integrity of data
Reliable and scalable, to guarantee that business transactions are accurately and promptly processed

In reality, Java technology is only as scalable, available, and manageable as the infrastructure on which it runs. When the platform can’t keep up with growth in the number of users, transactions per user, or transaction bandwidth, applications perform poorly and websites slow to a crawl. Like any other enterprise application, a server-side Java application can be brought down by a hardware fault, a software fault, a network fault, or an environment fault. Whatever the reason, there is no room for downtime in an e-business environment. Properly designed data centers address the network and environment fault issues by providing redundant power and Internet connectivity.
This paper presents a case study of implementing a highly available and scalable solution that combines Oracle9iAS, Oracle9i RAC, SSL accelerators, and hardware load balancers. This solution was designed and implemented for Embry-Riddle Aeronautical University, the largest flight-training school in the world. ERAU's J2EE application supports every aspect of the school's flight training program, so scalability and 24/7 availability are critical.

Embry-Riddle Approaches the Future

Embry-Riddle Aeronautical University is the largest independent aeronautical university in the world. The not-for-profit institution educates more than 24,000 students annually through thirty degree programs. Its ROTC detachments train more Air Force pilots and commissioned officers than any other institution except the Air Force Academy.

The flagship program at Embry-Riddle, however, is the education of commercial pilots. This training is a demanding process that requires a comprehensive and continuously reviewed program of advanced learning. Consequently, the university is launching a new curriculum to revolutionize pilot training by bringing information technology to bear on the issues involved.

Embry-Riddle oversees all the usual elements of an academic program, such as student records, human resources, financials and facilities. But when it comes to sending a student airborne, the school must also track a host of factors that can change by the minute, such as weather, air traffic, the condition of the plane or simulator, and the health and certifications of both the student and the instructor/pilot. For every sortie, the instructor, student and craft must meet specific qualifications. Relying on manual methods, Embry-Riddle staff scramble to match craft, student and instructor.

Compounding the challenges is the scale of Embry-Riddle’s operation, which includes campuses in Daytona Beach, Florida and Prescott, Arizona, 130 education centers and a distance-learning network.
At the Daytona campus alone, 80 instructor/pilots supervise 550 student flights daily using a fleet of 139 instructional aircraft.

Until recently, the tracking of training programs worldwide has largely been a paper and pencil procedure. This practice is inexpensive and requires little training to maintain, but it is fraught with opportunity for error that can lead to inaccurate information being provided to the training institution, trainers and students alike.

Embry-Riddle’s approach to dealing with these increasingly dynamic requirements was to develop an automated, Internet-capable information management system for tracking flight-training data for its students. This allowed Embry-Riddle to re-engineer the training of commercial pilots to provide them with a better education at less cost. The new curriculum blends all of the required skills into one seamless course.

This innovative approach to curriculum management is Embry-Riddle’s pioneering Education and Training Administration (ETA) system, which applies “just-in-time” methods to orchestrate the costly human and capital assets required for flight training. Embry-Riddle expects the ETA system to enhance instructional quality while reducing student expenses and institutional overhead.

As mentioned earlier, until ETA, flight training management was largely organized through pencil and paper operations. These operations are antiquated, time consuming and inaccurate. Difficulty in keeping data current and human error compound the problem. This method was also very fragmented and lacked comprehensive communication across disciplines and organizations.

The priority, then, was to develop a system that would keep current operations running smoothly while improving the integrity, availability and accuracy of the data being managed. This was absolutely essential if the solution was to be perceived as beyond criticism and doubt.
To accomplish this, the ETA system would need to exceed expectations for all service levels, academic/classroom support, student services and daily campus support. It would need to be available 24x7, anytime, anywhere, and the network, infrastructure, and applications would need to operate flawlessly.

The answer to these demands was to develop a real-time information management system for tracking flight-training data that is highly available, highly scalable and provides fast web access any time, anywhere. The system also needed to address usability with user-friendly interfaces and portal-based individualization. The data must be continually updated and current. The system must also be secure, with authentication and intrusion-proof data.

Education and Training Administration – Aviation Learning Management System

To meet the challenge of establishing such a system, Embry-Riddle Aeronautical University, DataRoad, Inc. and Talon Systems collaborated to produce an Aviation Learning Management System called ETA: Education and Training Administration. This system is the most comprehensive flight training management program to date and serves as an enterprise model for those using it.

ETA is a flight training management tool: a 100% Internet-based J2EE application accessed through a standard web browser. It is completely electronic and supports concurrent operations at Daytona, Prescott, affiliate operations and now the US Air Force Academy. The number of locations it supports continues to grow.

Integrating data from Embry-Riddle’s maintenance, HR, payroll, accounting and student record systems, the ETA system provides students, instructor/pilots and managers with secure, Web-based access to all of the information and tools they require to participate in the new curriculum.
The ETA system translates Embry-Riddle’s new course for commercial pilots into a continuum of stages, lessons and units, each structured by line-item objectives. Students can extract customized training plans that guide and track their progress through the curriculum. An electronic grade sheet automatically posts any incomplete line items until their satisfactory completion.

Whether a lesson takes place in a cockpit or a classroom, the ETA system identifies and schedules all human and capital resources required to fulfill the session’s line-item objectives and confirms the readiness of these resources. Flagging any issues, the system automatically checks all relevant details, including the student’s prerequisite courses, flight hours and registration, financial and health records; the instructor’s pilot ratings and certifications for the prescribed craft and sortie; and the maintenance status of the vehicle.

While translating documents into real-time data, the ETA system also streamlines the execution of paper-laden processes, from FAA-mandated safety and security documentation to Embry-Riddle’s own internally generated paperwork. Far more than scheduling software, the ETA system is a repository of real-time information and tools. The ETA system’s tools reinforce best practices in managing human and capital assets, enabling not-for-profit Embry-Riddle to deploy its educational resources more efficiently.

Embry-Riddle worked with Talon Systems LLC to develop the entirely Web-based (J2EE) system. Oracle partner DataRoad, Inc. designed and implemented the infrastructure for ETA at one of its secure data centers in Jacksonville, Florida. This infrastructure uses Oracle 9i Application Server (9iAS) and Oracle 9i RDBMS software, HP servers, and Alteon load balancers and SSL accelerators.

Preparing for Growth

Because of the growth potential of the ETA system's user base, scalability and high availability were essential.
As more users come on to the system, it must scale up appropriately and be available immediately, 24x7. Why? Embry-Riddle provides multinational flight training day and night across multiple time zones.

For the ETA project, the system runs on a real-time, 24x7 platform that combines the Oracle 9i Real Application Clusters (RAC) database, HP TruCluster Server software, and DataRoad’s technical experience to provide a highly available, highly scalable solution that meets Embry-Riddle’s needs. Unique to the solution is the single-system manageability of the software, which makes operating multiple servers as simple and economical as managing one.

DataRoad’s end-to-end solution exploits all of these advantages to efficiently meet Embry-Riddle’s requirements for high availability, security and data integrity. DataRoad provides a dedicated platform for the ETA system that comprises servers, software, and networking. DataRoad hosts and administers both the system and the application, which users access through a secure VPN.

Definitions

Before discussing specific configurations, it is important to define the general architecture terms relevant to deploying a highly available and scalable infrastructure.

Firewalls

Firewalls are devices that restrict access between different LAN segments for security purposes. Firewalls perform this function by analyzing traffic and can impose restrictions based on IP address, port, protocol used, protocol transitions and message content. For example, Check Point Firewall-1 products provide a software solution that includes a feature called "stateful inspection" that can restrict access based on illegal Internet protocol transitions. Cisco's PIX is an example of an integrated hardware-software firewall solution. Some devices that are called firewalls are software-only products that are loaded onto client or server machines.
These may be useful but are inadequate as corporate firewalls, which should always be deployed on machines separate from those running application or infrastructure software. Firewalls are a main defense for sites providing Internet access. Different firewall products vary considerably in features and performance. Appropriate use of firewalls can protect against many common vulnerabilities by prohibiting Internet access to services such as FTP or rsh (especially if such services were inadvertently left running on Internet servers).

Load Balancers

Load balancers have two essential functions. The first is to balance traffic across multiple servers, resulting in better scalability. In high-traffic situations this can be very important. The second is to provide fault tolerance for servers: the load balancer ensures that a single failing server does not result in the loss of a critical resource, and it accomplishes this by routing new requests to alternate servers if one server fails. In short, load-balancing hardware is used both to provide scalability, by spreading load across multiple processors, and to provide fault tolerance in case of processor failures.

Load balancers can typically route traffic both in situations where the infrastructure keeps application state and in situations where it does not. In the case of stateless communication, the load balancer can route to any of its managed servers, since no particular server holds state that is needed to process the message correctly. This is generally more efficient, since requests can always go to the least busy server, but stateless operation often puts an unacceptable burden on application writers. Many Oracle products require that the infrastructure maintain application state. For transactions where the infrastructure keeps state, load balancers switch incoming messages to the server containing the state.
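The two routing modes described above can be illustrated with a minimal Python sketch. It is not a model of any particular load balancer product: the class name `LoadBalancerSketch`, the server names and the session key are all hypothetical, and a real device would derive the session key from request attributes rather than receive it as an argument. Stateless requests go to the least busy healthy server, while requests that carry a session key are pinned to the server holding their state and are re-pinned only if that server fails.

```python
class LoadBalancerSketch:
    """Toy model of least-busy routing with optional session stickiness."""

    def __init__(self, servers):
        # Active request count per server; every server starts healthy.
        self.active = {name: 0 for name in servers}
        self.healthy = set(servers)
        self.sticky = {}  # session key -> server holding that session's state

    def mark_failed(self, server):
        """Take a server out of rotation so new requests avoid it."""
        self.healthy.discard(server)

    def _least_busy(self):
        if not self.healthy:
            raise RuntimeError("no healthy servers available")
        # Sort the names first so that ties resolve deterministically.
        return min(sorted(self.healthy), key=lambda s: self.active[s])

    def route(self, session_key=None):
        """Return the server chosen for one incoming request."""
        if session_key is None:
            server = self._least_busy()  # stateless: any server will do
        else:
            server = self.sticky.get(session_key)
            if server not in self.healthy:
                # First request of the session, or its server has failed:
                # pin the session to a new server (state may need rebuilding).
                server = self._least_busy()
                self.sticky[session_key] = server
        self.active[server] += 1
        return server

    def finish(self, server):
        """Record that a request on the given server has completed."""
        self.active[server] -= 1


# Hypothetical two-server pool.
lb = LoadBalancerSketch(["app1", "app2"])
first = lb.route()                      # stateless request
second = lb.route()                     # goes to the other, idle server
pinned = lb.route(session_key="s-42")   # stateful request gets pinned
again = lb.route(session_key="s-42")    # same session, same server
lb.mark_failed(pinned)
moved = lb.route(session_key="s-42")    # re-pinned after the failure
print(first, second, pinned, again, moved)
```

In this sketch a re-pinned session simply continues on the new server. Whether that is safe in practice depends on whether the surviving server can obtain the session's state; if it cannot, the transaction has to be restarted.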
Switching criteria are determined by analyzing cookies, headers or other attributes. The term "sticky" or "persistent" transaction is often used to denote transactions that should be routed to the particular load-balancer-managed hardware that holds intermediate application transaction state. Sometimes only a single server contains the state. In that case, a processor failure causes all transactions that have state in the failed processor to fail, and such transactions must be restarted. In other situations there are preferred processors, but all processors can obtain the state. When failures occur in these situations, a redirect due to failure will result in successful processing, although there may be added overhead for transactions that had state in the failed processors.

SSL Accelerators

At many sites, SSL key exchange operations can dominate CPU usage. For such sites, HTTPS accelerator appliances can yield significant cost reductions and improved performance. Expanding HTTPS use improves security, so where HTTPS use is limited by performance considerations, HTTPS accelerators should be considered.

There are different types of SSL accelerators. One type is essentially a math coprocessor that offloads expensive cryptographic operations from general-purpose CPUs. A second type is a stand-alone device that takes incoming HTTPS traffic and converts it to HTTP. Since the SSL processing of the HTTPS protocol can consume a large percentage, or even most, of a CPU's time, offloading SSL processing may significantly reduce the number of CPUs required to support a workload. Such a reduction can result in both cost savings and improved scalability.

A current problem with HTTPS-to-HTTP appliances occurs when client-side X.509 certificates are used.
This is because these appliances terminate the SSL session, and there is no standard way to pass the client-side X.509 certificate information along with the forwarded message. If client-side certificates are used only to allow or deny access to a site or virtual host, this may be acceptable. However, if the application or other infrastructure items need certificate information, custom solutions are currently required. Since client-side certificates are infrequently used at this time, this consideration is not important for most sites. Customers interested in using X.509 client-side certificates with such devices should contact Oracle or appliance providers, as progress toward standard, supported solutions is being made.

Clustering

Clustering, while complex in practice, is fairly simple in definition. Clustering is the grouping together of hardware and software into nodes that work together as a single system to ensure that an application remains online for users during excessive loads, or if one of the nodes fails.

Clustering enables you to construct a multi-node system in which several independent servers, connected together, appear as a single integrated system. If any part of the system goes down, whether intentionally or unintentionally, failover masks the failure from end users, thereby making the system more available. The down member of the cluster is then reactivated, if possible, through a restart. This reduces the need for administrator intervention. The system can also be scaled more effectively to support more users through load balancing. Advanced tools for managing the cluster also assist in monitoring the activities of the system and alerting administrators to potential issues.

Availability

Delivering high availability requires a variety of approaches. These go hand in hand to provide a highly available service to the end user.
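As a rough illustration of why redundant cluster members improve availability, the arithmetic can be sketched in Python. This is a simplification under stated assumptions: members are assumed to fail independently, any single surviving member is assumed able to carry the service, and the 99% per-member figure is a hypothetical input, not a measurement from the ETA system. The function name `cluster_availability` is also just illustrative.

```python
def cluster_availability(member_availability, members):
    """Probability that at least one of `members` independent nodes is up."""
    if not 0.0 <= member_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    # The cluster is down only when every member is down at the same time.
    return 1.0 - (1.0 - member_availability) ** members


HOURS_PER_YEAR = 24 * 365

# Hypothetical per-member availability of 99%.
for n in (1, 2, 3):
    a = cluster_availability(0.99, n)
    downtime_hours = (1.0 - a) * HOURS_PER_YEAR
    print(f"{n} member(s): availability {a:.6f}, "
          f"about {downtime_hours:.2f} hours down per year")
```

Real clusters share power, network and storage, so failures are rarely fully independent and the actual gain is smaller than this idealized figure suggests.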
As mentioned earlier, in a clustered environment multiple servers act in concert to present a single system. For one member of the cluster to take the place of another that is experiencing trouble, the state of requests must be shared across all members of the cluster. Because that state is shared, a new cluster member can take over smoothly from a failing one.

In the event of a failure, transparent failover enables a member of a clustered system to take the place of another member without the end user being aware that a change is taking place. This gives users a sense of continuity: the downtime of an individual cluster member does not affect operation on the user side at all.

Once failover has executed and the system is stable again, which happens rapidly, an automated restart takes place. The down member is identified and restarted automatically. If it cannot be restarted, an error is generated and administrators are notified of the situation for further attention. This process reduces the need for direct administrative intervention, thereby minimizing downtime and increasing availability.

If the system suffers a serious failure that requires significant downtime, the cluster can gracefully degrade the service provided to the end user. This provides a limited level of service rather than a total failure. Single points of failure are also reduced or eliminated, limiting the risk of significant failure and unnecessary downtime.

High availability is also improved through the use of load balancers. Load balancing is necessary because servers servicing one application can quickly be overwhelmed and crash if the workload is not split up. Load balancing divides work between two or more computers. The work gets done in the same amount of time without any one computer getting overloaded.
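The failover and automated-restart sequence described above can be sketched in a few lines of Python. This is an illustrative model only, not the behavior of any specific clustering product: the `ClusterSketch` class, the member names, the session contents, and the `restart_ok` flag that simulates whether a restart succeeds are all hypothetical. The point it shows is that, because session state is shared, a surviving member can take over a failed member's sessions, and administrators are alerted only when the automatic restart fails.

```python
class ClusterSketch:
    """Toy cluster: shared session state, failover, and automatic restart."""

    def __init__(self, members):
        self.up = list(members)   # members currently serving requests
        self.down = []            # members that failed and could not restart
        self.owner = {}           # session id -> member serving it
        self.shared_state = {}    # session state visible to every member
        self.alerts = []          # messages sent to administrators

    def open_session(self, session_id, state):
        """Place a new session on the member with the fewest sessions."""
        owners = list(self.owner.values())
        member = min(self.up, key=owners.count)
        self.owner[session_id] = member
        self.shared_state[session_id] = state
        return member

    def fail(self, member, restart_ok):
        """Simulate a member failure, then fail over and try a restart."""
        self.up.remove(member)
        # Failover: a surviving member takes over the failed member's sessions.
        for session_id, owner in list(self.owner.items()):
            if owner == member and self.up:
                self.owner[session_id] = self.up[0]
        # Automated restart; escalate to administrators only if it fails.
        if restart_ok:
            self.up.append(member)
        else:
            self.down.append(member)
            self.alerts.append(f"{member} could not be restarted")


# Hypothetical two-member cluster.
cluster = ClusterSketch(["node1", "node2"])
served_by = cluster.open_session("user-7", {"lesson": 3})
cluster.fail(served_by, restart_ok=False)
print(cluster.owner["user-7"], cluster.shared_state["user-7"], cluster.alerts)
```

The session's state survives the failure because it lives in the shared store rather than on the failed member, which is what makes the takeover invisible to the user in this simplified model.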
Cluster resources are dynamically re-balanced for optimal cluster utilization.

Scalability

Scalability is also essential to maintaining acceptable levels of service while keeping costs under control. System growth must be progressive, and the system must be easily expandable to meet increasing demand from the user base. A clustered environment provides the most appropriate solution. Nodes added to the cluster are automatically utilized; no manual re-allocation of resources is needed. This enables low-cost incremental scaling, allowing DataRoad to reduce hosting expenses for Embry-Riddle by using only the server power it needs at any time.

Because of the flexible nature of the environment, more equipment can be brought online at a moment's notice to address any scaling requirements the system may demand. This provides ERAU with a “Scale as you grow” option that minimizes the initial capital outlay for equipment, thereby significantly reducing hosting fees. Hosting fees can then be based on real growth rather than merely anticipated growth.

In a rapidly changing environment, opportunities for growth appear and disappear quickly. It is difficult to accurately predict the demand for a database or application server two years out, yet having too little computing horsepower at any given time is unacceptable. Even if growth is initially underestimated, the scalability of the system allows for cost-effective sizing of the infrastructure. Real Application Clusters provide scalability on demand, so it is no longer necessary to predict capacity needs far in advance.

Application Server (9iAS)

This paper focuses on the ‘core’ components of Oracle9iAS Release 2. Hence, a reference to Oracle9iAS in this paper is in general a reference to the Oracle9iAS Release 2 J2EE and Web Cache install. The components that fall in the core category are:

Web Cache: This is typically the first component of Oracle9iAS to receive the request.
For both static and dynamic requests, Web Cache can cache the result and then replay it, reducing the workload on the machines behind it. In addition, Web Cache instances can themselves be clustered.

Oracle HTTP Server (OHS): This is next in line after Web Cache to receive a request. This subsystem comprises a web server (based on Apache), a Perl execution environment, and a PL/SQL and OC4J routing system.

Oracle Containers for J2EE (OC4J): This is the J2EE-compliant container in Oracle9iAS. It provides clustering capabilities for the J2EE components: Servlets, JSP, and EJB. It also contains other mechanisms, such as Java Object Cache, which provides distributed caching capabilities.

Real Application Clusters (9iRAC)

Real Application Clusters is an option for an Oracle9i database. Oracle9i Real Application Clusters provides both scalability and availability as a single, easy-to-manage database product. With Oracle9i Real Application Clusters, your enterprise database delivers scale-out economics with the ease of use and power of a scale-up approach.

To any database application, a Real Application Clusters database looks just like an Oracle9i database on a single server. Real Application Clusters supports all types of applications, from update-intensive online transaction processing to read-intensive data warehousing.

An Oracle9i Real Application Clusters database not only appears like a standard Oracle9i database to users; the same maintenance tools and practices used for a single Oracle9i database can also be used on the entire cluster. All of the standard backup and recovery operations, including the use of Recovery Manager, work transparently with Real Application Clusters. All SQL operations, including data definition language and integrity constraints, are also identical for both configurations.

Real Application Clusters provides rapid, automatic failover for users if their servers go down.
This automatic failover capability can eliminate the need for a complex series of operations to restore access to a database, actions which, if not performed promptly or correctly, can increase the duration of downtime or even jeopardize the integrity of your data.

The Solution