Site Visit Report
Edinburgh University SOA Review
Doc Ref: CONS/300399457/SVR/001
06-Sep-2013

SITE VISIT REPORT

Client / Project: Edinburgh University SOA Review
Oracle Consultants: Alan Maxwell
Date(s) of work covered by this report: 28-Aug-2013
Oracle Project Code: 300399457
Total Billable Time: 1 day

Description of work performed, including any advice given:

The purpose of the visit was to carry out an initial high-level review of the SOA 11g work currently being undertaken within the University’s Information Services Team. This constitutes, at least initially, a migration of existing SOA Suite 10g logic to release 11g, but will in future include further services developed to meet current or future business needs.

The review was based on a series of discussions with various team members, and covered 3 main aspects :
o SOA infrastructure (ie installation and configuration)
o Development and deployment
o Run time monitoring

All of these areas were addressed during the day. The limited time meant that the review was necessarily high level, but the main conclusions are discussed below for each area.

SOA infrastructure

The SOA infrastructure comprises 3 environments: development, test and production. Test and production are replicas of each other. Each environment contains one WebLogic domain including 3 servers : one Admin server, one Managed Server for SOA, and one Managed Server for BAM (which is not used at present). In each case the underlying database is a single-instance (non-RAC) Oracle database. The databases run on physical servers, and the WebLogic servers run on virtualized servers implemented on VMware.

Although each environment is single server throughout, and therefore does not have high availability features, there is a plan for a disaster recovery capability. At the database level, this will involve the use of Active Data Guard. At the application server level, it will involve the use of a product called VEEAM to replicate the VM data to the standby site.
In addition to the components already described, the production installation (at least) includes a load balancing router, which acts as an SSL termination point. There is also an Oracle HTTP Server (OHS) installation located between the load balancer and the SOA Suite domain, acting purely as a passthrough proxy at present.

There are several observations and recommendations to be made from this brief description :

Use of VMware. Oracle Support has a published policy on support for products running on VMware, which is described in Support Note 249212.1. One of the first paragraphs in the note states :

‘Oracle has not certified any of its products on VMware virtualized environments. Oracle Support will assist customers running Oracle products on VMware in the following manner: Oracle will only provide support for issues that either are known to occur on the native OS, or can be demonstrated not to be as a result of running on VMware.’

Given this policy, the use of VEEAM as a disaster recovery solution for the middle tier is not supported either. The supported solution, as described in the current Fusion Middleware Disaster Recovery Guide at :

http://docs.oracle.com/cd/E28280_01/doc.1111/e15250/intro.htm#sthref4

involves the use of Data Guard to replicate the database contents, and disk replication technology to support the middle tier data movement.

Versions. The version of Fusion Middleware which is in use is release 11.1.1.6. The currently shipping version is 11.1.1.7. Oracle Support has a stated policy and timeline for different levels of support, contained in Note 944866.1. To quote one of the relevant sections :

‘Grace Period: ….. You have up to one year from the initial release of the patch set to install the new patch set, and can receive new bug fixes for the previous patch set during that time.
The patch set grace period became effective with the release of FMW 10g ….. This continues through the 11g patch sets (e.g. 11.1.1.3 is supported for one year from the initial release of 11.1.1.4, etc.).’

In effect, this means that once 11.1.1.7 has been available for a year (which is likely to be March-April 2014), the degree of bug fix support available for release 11.1.1.6 will change. The recommendation is therefore to move to release 11.1.1.7 at a suitable point, preferably before the support date mentioned above. This could be done now (by running the upgrade procedure for SOA Suite and the infrastructure database contents), or at a suitable point in the system’s lifecycle (eg creation of a new production environment, see below).

Recommendation (added after site visit) : Consider developing a plan for migration to the current 11.1.1.7 release of SOA Suite.

High Availability/Scalability. The current production environment contains only one managed server running SOA Suite (and also uses a single instance, non-RAC database). As the team are well aware, this restricts the environment’s availability, since the managed server and the database both constitute single points of failure. It also restricts scalability, since there is no ability to add more managed servers to increase capacity if this proves necessary.

The reasons for the decision to go into production in this configuration are well understood, and will not be rehearsed here. In addition, the nature of the initial workload, where most SOA composites will be triggered by a database adapter poll, and so initiated by the SOA Suite itself, means that outages will have a less dramatic impact on the external applications: the changed records will simply accumulate in the database until SOA Suite restarts.
Nevertheless, there are also some SOAP based Web Services being offered by the SOA environment and used by external applications, which means that a service outage will already be visible to other systems. In addition, this type of traffic may increase over time.

Furthermore, at the time of writing, no performance testing has been carried out. There is a possibility that the new environment may already be required to process traffic volumes which exceed the capacity of a single server. (Note : According to the team, there are moves afoot to reduce the throughput in SOA, by reducing the number of source database updates which trigger SOA processes. This may reduce the initial capacity related risks.)

As a result, the recommendation would be to move to a multi managed server configuration as soon as possible, both for availability and scalability reasons. Depending on how this is carried out, this could entail a simple copy of the existing managed server configuration, or it could involve creation of a new environment, especially if Oracle’s recommended approach of ‘Whole Server Migration’ is adopted. The pros and cons of each approach were briefly discussed on site. This point should be explored in more detail during a planned visit by another Oracle consultant in the near future (time permitting).

Recommendation : Explore options, and develop a plan, to migrate the production environment to a configuration with superior high availability and scalability characteristics.

Disaster Recovery Environment. While it is good that a disaster recovery environment is being set up, there are well known issues which need to be considered as part of the decision to fail over to a DR site.
For example :
o Assuming that the database and middleware components both replicate asynchronously and independently, there may be timing and therefore data discrepancies between the two tiers after a failover. This can lead to issues in the immediate post failover period.
o In addition, should other applications which interact with SOA Suite fail over as well, then there can be similar timing issues related to the (presumably asynchronous) failover mechanisms used by these other applications. This can lead to ‘impossible’ situations, such as SOA Suite receiving the ‘same’ update twice, because the source application has recovered to a different point in time than SOA Suite.

These considerations are not specific to Edinburgh University, nor are they specific to SOA Suite. They are instead related to creation of a distributed disaster recovery environment by asynchronous data replication. However, they do mean that the decision to fail over from the primary to the standby site is not something which should be taken lightly. In particular, it would not be wise to consider the DR site as representing a suitable alternative to including High Availability features in the primary site.

The observations above are based on a brief discussion with the team members looking after the creation of the environments. However, the environment setup was not reviewed in detail, on time grounds. It has been agreed between Oracle and Edinburgh University that another consultant, who specializes in this area, should review the production environment setup at least.

Development and deployment

Again, this area was briefly reviewed, mainly on the basis of a discussion with the team leader and one of the team members.
The outline summary from this discussion is :

The development aspects which the team are considering, such as service versioning, use of a generic error handling component, automated unit testing, and automation of deployment as far as possible, are representative of good practices which the author has either seen or has recommended to other customers.

Based on this high level discussion, the decision was taken not to devote more of the time available to review development and deployment practices in more detail. This could certainly be done, but a full review (covering areas such as development and coding standards, design and development approach, procedures and approaches to testing of all types, and deployment practices) would take a significant amount of effort, well in excess of the budget available at present.

Instead, the remaining time with the development team was spent answering some specific questions which the team had sent prior to the visit. The answers were given verbally on the day. In some cases, the author undertook to send further information after the visit; this will be done under separate cover.

In addition to these existing questions, some recommendations were given to the team to consider other areas, such as use of a continuous integration server, and use of the Metadata Services (MDS) repository to hold common artefacts such as WSDLs and XSDs.

Run time monitoring

Again, this area was discussed in the context of current SOA 10g monitoring practices. The key points are :

The current monitoring approach, from the description given, appears to work for the current scenario.

However, as a general observation, the current monitoring approach is not necessarily something which could be ‘scaled up’ to deal with a larger service portfolio.
o There is a significant degree of manual monitoring required, including inspection of database states in the applications integrated with SOA.
o In addition, there is a close working relationship with the business users for the key applications involved. This could not necessarily be guaranteed if the application landscape, in terms of SOA integration, were to be expanded.

In addition to the monitoring of external applications, the current approach involves using the ESB and BPEL consoles in 10g to track the behaviour of the environment as a whole, as well as specific process instances.

In SOA Suite 11g, the same type of console based monitoring approach is available. There are some enhancements which are available immediately, such as the ability to track processes end to end across multiple composite instances. There are also some enhancements which can be leveraged with code changes. One key enhancement, which would be recommended and is very straightforward, is the use of so-called ‘composite sensors’ to allow key data item values to be recorded for each instance. The composite instances can then be searched on these data item values, greatly assisting in locating specific process instances (the ‘where is order number XXX’ type of question). Composite sensors are described in the SOA Suite Developer’s Guide at :

http://docs.oracle.com/cd/E28280_01/dev.1111/e10224/sca_compsensors.htm#CIHGIDDE

Recommendation : Include composite sensors in all 11g composites, or at least those which represent ‘entry points’ for external applications.

In addition to these ‘like for like’ monitoring features, there are also some other areas which could be considered. Examples include :
o Integration with Oracle Business Activity Monitoring (BAM). If the composite applications are instrumented to send data to BAM, then this can be used to display near real time information on business process execution behaviour (data can be sent from SOA to a BAM report in a few seconds).
Although BAM is presented as a tool for business level monitoring, business issues (such as non-completion of processes in time) frequently have underlying technology causes. As a result, BAM information is potentially valuable to IT support staff.
o Oracle Enterprise Manager Grid Control can be used to monitor the behaviour of the entire Oracle environment, both database and middle tier, from the infrastructure through to the different service engines (BPEL, Mediator etc) within SOA Suite. In addition, with the use of the SOA Management Pack, extra facilities such as end to end transaction tracing and monitoring are available.
o Both BAM and Enterprise Manager include the ability to carry out automated pro-active monitoring, albeit in different contexts (business oriented and technical infrastructure oriented). This kind of capability is likely to be required if the SOA environment starts to grow in scale.

In terms of costs, BAM is part of SOA Suite, and so is already available, although it may require some code additions to integrate BAM and SOA in the optimal fashion. Enterprise Manager is a distinct product from SOA Suite, and has its own licencing model. In addition, installation and configuration of a product as sophisticated and feature-rich as Enterprise Manager to monitor a relatively small environment may be felt to be a disproportionate action.

However, it is possible that the University may already have Enterprise Manager licences and/or expertise as part of its monitoring solution for other parts of its Oracle estate. If this is the case, then it may be possible to leverage existing installation(s) to monitor the new environment as well.

As can be seen, there is scope for monitoring the new 11g environment in a similar way to the current 10g one, and that may be sufficient for the initial stages.
However, once some expertise has been gained in 11g, some more effort should be devoted to exploring potential extensions to the current monitoring approach.

Recommendation : As 11g expertise grows, review default monitoring options available for suitability.
Recommendation : Explore use of BAM for monitoring key activities in environment.
Recommendation : Review Enterprise Manager as a potential external monitoring solution.

On a more specific topic, there are some issues being experienced with 10g components (ESB services) ceasing to operate, but without giving any external signs of failure. It is not clear if these are product issues, or expected behaviour. ESB in its current form does not exist in SOA 11g, with other components taking its place. As a result, it is not possible to say whether these issues would persist in an 11g implementation.

It would be possible for Oracle Consulting to review the issues being encountered with ESB in 10g, but the feeling was that this was probably not worthwhile. The issues are known and can be worked round, for the limited ESB 10g lifetime left.

Security

From a brief discussion on security within the SOA environment, the following points were noted :

The University has a policy of using SSL for internal traffic, with the exception of the ‘last mile’ within a trusted network. In the case of the SOA environment, the SSL offload takes place at a load balancing router, and plain text traffic is used between the router, HTTP server and SOA Suite. While perfectly feasible, and by no means unique, the author found this approach slightly surprising. The university is effectively treating most of its own intranet as an untrusted and insecure network, which is not necessarily what would have been expected. No further comment will be made.

In terms of SOA service security, any SOAP accessible web service has an authentication policy, at least, defined on it.
The credentials being used correspond to a small number of ‘service’ accounts. In effect, there is currently one service account per consuming application. There is no perceived requirement, now or in the future, to propagate the ‘true’ user identity from the calling application to the called service. This is not an unusual situation in this kind of environment, although the approach of not propagating user identities should be considered and documented as a conscious decision.

In addition, a form of role based access control is in operation. Each SOAP service has an authorization policy attached to it, restricting access to the service to users in a particular group. A different group is used for each service. When an application wishes to use a particular SOA service, that application’s ‘service’ account is added to the relevant group, permitting access to the service.

This approach can be made to work, but there are potential concerns with scalability and manageability as the number of services and applications grows. It is currently believed that most of the access control decisions will be related to a particular ‘School’ within the University, and that each consuming application will be associated with a particular school. However, there is also a desire to monitor usage by individual applications.

One possible solution would be to define a series of ‘roles’ (in practice, LDAP groups) for each School, and control service access on the basis of this more limited set of roles. The monitoring of individual applications could be done by (eg) including a common, University specific, ‘header’ element in every message, and including the application name as an element in this header. This would allow usage by application to be monitored, while still retaining the school level access controls.
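The school level roles and application header described above can be illustrated with a minimal Python sketch. It is not taken from the University's implementation, and every name in it is an assumption made for illustration: the header namespace, the `RequestContext` and `ApplicationName` elements, the service accounts, the role names and the `StudentLookup` service. In a real SOA Suite deployment the access decision would normally be made by the authorization policy attached to the service rather than by hand-written code.

```python
import xml.etree.ElementTree as ET
from collections import Counter

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
# Hypothetical namespace for the University specific header (an assumption).
UNI_NS = "urn:example:university:soa:header"

# Path to the hypothetical application name element inside the SOAP header.
APP_NAME_PATH = (
    f"{{{SOAP_NS}}}Header/{{{UNI_NS}}}RequestContext/{{{UNI_NS}}}ApplicationName"
)

# Hypothetical data: each service account maps to one School role
# (an LDAP group in practice), and each service lists the roles it accepts.
ACCOUNT_ROLES = {
    "svc_timetabling": "school_informatics",
    "svc_finance": "school_business",
}
SERVICE_ROLES = {"StudentLookup": {"school_informatics"}}

# Usage counter keyed by (service name, application name).
usage = Counter()


def build_request(app_name: str) -> str:
    """Build a SOAP envelope carrying the application name in the common header."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    header = ET.SubElement(envelope, f"{{{SOAP_NS}}}Header")
    context = ET.SubElement(header, f"{{{UNI_NS}}}RequestContext")
    ET.SubElement(context, f"{{{UNI_NS}}}ApplicationName").text = app_name
    ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    return ET.tostring(envelope, encoding="unicode")


def handle_request(service: str, account: str, envelope_xml: str) -> bool:
    """Authorize on the School role, then record usage per application."""
    role = ACCOUNT_ROLES.get(account)
    if role is None or role not in SERVICE_ROLES.get(service, set()):
        return False  # access is decided at School level only
    root = ET.fromstring(envelope_xml)
    # The application name is used for monitoring, not for access control.
    app = root.findtext(APP_NAME_PATH, default="unknown")
    usage[(service, app)] += 1
    return True
```

In this sketch the application name in the header is self-asserted by the caller, so it is used only for usage statistics. The access decision rests solely on the authenticated service account and its School role.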
Recommendation : Review current security policy for manageability/scalability, and map against defined business requirements.

Any Problems or Issues raised (business or technical) and actions taken during this visit:
See above

Conclusions and any client response:
See above

Future plans (e.g. next visit or follow-up actions):
It was agreed on the day that another visit, by a second Oracle consultant, should be scheduled, to review the production setup in more detail. While it is perfectly possible that such a review might not find anything of significance, it is still felt to be a prudent step to take by both Oracle and the University.

The author undertook to send some relevant links to blog entries etc to some of the people involved in the discussions during the day. This will be done in a separate email.

In addition, once the monitoring requirements in 11g become clearer, it would potentially be useful to have an Oracle consultant visit and review the requirements in the light of BAM and Enterprise Manager’s capabilities.

Details of any deliverables given to the client or any documentation/software left on the client’s machine (including location and filenames):
No specific deliverables.

Knowledge Transfer (eg. Explain what actions you have taken to share knowledge of the work you have undertaken):
The day was based around a series of discussions with University team members, so knowledge transfer took place during these discussions.

This report approved by: Ian Fiddes (deemed approved, since no comments received)
6 September 2013

Document1 (Issue 1)