CSA Guidance Version 3

Domain 9: Incident Response

Incident Response (IR) is one of the cornerstones of information security management: even the most diligent planning, implementation, and execution of preventive security controls cannot completely eliminate the possibility of an attack on the Confidentiality, Integrity, or Availability of information assets. One of the central questions for organizations moving into the cloud must therefore be: what must be done to enable efficient and effective handling of security incidents that involve resources in the cloud?

Cloud computing does not necessitate a new conceptual framework for Incident Response; rather it requires that the organization appropriately map its extant IR programs, processes and tools to the specific operating environment it embraces. This is consistent with the guidance found throughout this document; a gap analysis of the controls that encompass your organization’s incident response function should be carried out in a similar fashion.

This domain seeks to identify those gaps pertinent to Incident Response that are created by the unique characteristics of cloud computing. Security professionals may use this as a reference when developing response plans and conducting other activities during the preparation phase of the IR lifecycle.

Overview. This domain is organized in accord with the commonly accepted Incident Response Lifecycle as described in NIST 800-61[1]. After establishing the characteristics of cloud computing that impact Incident Response most directly, each subsequent section addresses a phase of the lifecycle and explores the potential considerations for responders.

1. Introduction

1.1 Cloud Computing Characteristics that Impact Incident Response

Although cloud computing brings change on many levels, certain facets of the cloud ecosystem bear more direct challenges to Incident Response activities than others[2].
First, the on-demand self-service nature of cloud computing environments means that a cloud customer may find it hard or even impossible to receive the required cooperation from its cloud service provider (CSP) when handling a security incident. Depending on the service and deployment models used, interaction with the Incident Response function at the CSP will vary. Indeed, the extent to which security incident detection, analysis, containment, and recovery capabilities have been engineered into the service offering is a key question for provider and customer to address.

Second, the resource pooling practiced by cloud services, as well as the rapid elasticity offered by cloud infrastructures, may drastically complicate incident handling. Precisely identifying the assets affected during an incident, and collecting the telemetry and artifacts associated with the attack (logging, netflow data, memory, machine images, storage, etc.) without compromising the privacy of co-tenants, is a technical challenge that must be addressed primarily by the provider. It is nevertheless up to the cloud customer to verify that its cloud service provider has done so and can provide the incident-handling support the customer requires.

Third, although it is not described as an essential cloud characteristic, cloud computing frequently leads to data crossing geographic and jurisdictional boundaries, in the worst case without the cloud customer's explicit knowledge. The ensuing legal and regulatory implications affect the incident handling process in all phases of the lifecycle by placing limitations on what may be done and/or prescribing what must be done during incident response. As always, it is advisable that an organization include representatives from its legal department on the Incident Response team to provide guidance on these issues.

Cloud computing also presents opportunities for information security professionals to deliver an enhanced response. Virtualization technologies and the elasticity inherent in cloud computing platforms can allow for more efficient and effective containment and recovery, with less service interruption than would be expected with more traditional data center technologies. Investigation of incidents may also be easier in some respects, as virtual machines can easily be moved into lab environments where runtime analysis can be conducted and forensic images taken and examined.

1.2 The Cloud Architecture Security Model as a Reference

To a great extent, deployment and service models dictate the division of labor when it comes to Incident Response in the cloud ecosystem. Using the architectural framework and security controls review advocated in Domain 1 (see Cloud Reference Model figure 1.5.2a) can be valuable in identifying which technical and process components are owned by which organization, at which level of the “stack.”

Cloud service models (IaaS, PaaS, SaaS) differ appreciably in the amount of visibility and control a customer has over the underlying IT systems and other infrastructure that deliver the computing environment. This has implications for all phases of Incident Response, as it does for all other domains in this guidance document. For instance, in a SaaS solution, response activities will mostly reside with the CSP, whereas in IaaS, a greater degree of responsibility and capability for detecting and responding to security incidents will reside with the customer. However, even in IaaS there are significant dependencies on the CSP. Data from physical hosts, network devices, shared services, security devices like firewalls, and any management backplane systems must be delivered by the provider.
To be sure, some providers are already provisioning the capability to deliver this telemetry to their customers, and managed security service providers are advertising cloud-based solutions to receive and process this data.

Given the complexities, the Security Control Model described in Domain 1 (figure 1.5.1c) and the activities an organization performs to map security controls to its particular cloud deployment should inform IR planning, and vice versa. Traditionally, controls for Incident Response have concerned themselves more narrowly with higher-level organizational requirements; however, security professionals must take a more holistic view in order to be truly effective. Those responsible for IR should be fully integrated into the auditing and planning of any security control that would directly, or even indirectly, affect response. At a minimum, this process can help in mapping roles and responsibilities during each phase of the IR lifecycle.

Cloud deployment models (public, private, hybrid, community) are also considerations when reviewing IR capabilities in a cloud deployment; the ease of gaining access to IR data varies for each deployment model. The same continuum of control and responsibility exists here as well. In this domain, we primarily concern ourselves with the more public end of the continuum. We assume that the more private your cloud, the more control you will have to develop the appropriate security controls, or to have them delivered by your provider to your satisfaction.

2. Incident Response Lifecycle Examined

2.1 Preparation

Preparation may be the most important phase in the Incident Response Lifecycle when information assets are deployed to the cloud. Identifying the challenges (and opportunities) for Incident Response should be a formal project undertaken by information security professionals within your organization prior to migration to the cloud.
This exercise should be undertaken during every refresh of the enterprise Incident Response Plan. In each lifecycle phase discussed below, the questions raised and suggestions provided can serve to inform the organization’s planning process. Integrating the concepts discussed into a formally documented plan should serve to drive the right activities to remediate any gaps and take advantage of any opportunities.

Preparation begins with a clear understanding and full accounting of where the organization's data resides, in motion and at rest. Because the organization's information assets now traverse organizational, and likely geographic, boundaries, threat modeling is necessary on both the physical and logical planes. Data flow diagrams that map to physical assets and that map organizational, network, and jurisdictional boundaries should serve to highlight any dependencies that might arise during a response.

Since multiple organizations are now involved, Service Level Agreements (SLAs) and contracts between the parties become the primary means of communicating and enforcing expectations for responsibilities in each phase of the IR lifecycle. It is advisable to share Incident Response Plans with the other parties and to precisely define any terminology. Where possible, any ambiguities should be cleared up in advance of an incident. It is unreasonable to expect CSPs to create separate IR plans for each customer.
However, the existence of some (or all) of the following points in a contract/SLA should give your organization some confidence that your provider has done some advance planning for Incident Response:

o Points of Contact, communication channels, and availability of IR teams for each party
o Notification criteria, both SOC to SOC and to any external parties
o Incident declaration criteria and event data available
o Explication of roles/responsibilities during a security incident, including
o The incident data/artifacts that are to be shared during an incident, in what format, and the means of dissemination
o Identification of any "sanitization" operations that may be undertaken on data provided
o Description of any IR testing done by the parties to the contract and whether results will be shared
o Post-mortem activities, including any expectations on final Incident Reports and root cause analyses

Once the roles and responsibilities have been determined, your organization can properly resource, train, and equip your Incident Responders to handle the tasks for which they will have direct responsibility. For example, if your application resides in a PaaS model and your cloud provider has agreed to provide (or allow retrieval of) platform-specific logging, having the technologies and personnel on staff to receive, process, and analyze those types of logs is an obvious need. For IaaS and PaaS, aptitude with virtualization and the means to conduct forensics and other investigation on virtual machines will be integral to any response effort. Whether the particular expertise required is organic to your organization or is outsourced to a third party is a decision to be made during the preparation phase. Note that outsourcing then prompts another set of contracts/NDAs/SLAs to manage.

The customer organization will be responsible for any application layer code that is developed on-premise and deployed to the cloud.
Scenarios where the root cause is determined to be a code flaw in the application put the onus squarely on the organization to enhance secure coding practices and to emphasize response activities that allow the efficient and effective remediation of bugs. It is easy to envision scenarios where an enterprise application is offline after having been properly contained; waiting for the bug fix then becomes the roadblock to full recovery. This may represent a shift in emphasis for in-house IR teams; your organization is presumably outsourcing some (or all) responsibility for network-level incident detection and response to the provider.

The most important part of preparing for an incident is testing the plan. Tests should be thorough and mobilize all the parties who are likely to be involved during a true incident. It is unlikely that your CSP has the resources to participate in tests with each of its customers; consider role-playing as a means to identify what tasking or requests for information are likely to be directed at the CSP. Use this information to inform future discussions with the provider while in the preparation phase. Another possibility is for the customer to volunteer to participate in any testing the CSP may have planned.

2.2 Detection and Analysis

Timely detection of security incidents and successful subsequent analysis of the incident (what has happened, how did it happen, which resources are affected, etc.) depend on the availability of the relevant data and the ability to correctly interpret that data. In both cases, cloud computing provides challenges: (1) availability of data largely depends on what the cloud provider supplies to the customer; (2) analysis is complicated by the fact that it at least partly concerns provider-owned infrastructure, of which the customer usually has little knowledge.
Elasticity and resource pooling complicate both the collection of relevant data and the interpretation of such data. The following subsections provide guidance regarding data sources for detection, analysis, and the interpretation of data.

Given the distribution of responsibilities discussed in section 1.2 above, Incident Response becomes heavily dependent on the logistics of collecting and correlating the right telemetry (logs, forensic artifacts, etc.) so that a coherent picture can be developed in the detection and analysis phases. Understanding where data resides at every step in the transaction is required to mobilize the right parties to develop strategies to contain and eradicate an attack. It is imperative to enumerate the attack as soon as possible: doing so dictates all further strategies and helps to identify who should be passed command of the incident. Ultimate responsibility

2.2.1 Data Sources

As in any hosted IT service integration, the IR team will need to determine the appropriate logging required to adequately detect anomalous events and identify malicious activity that would affect their cloud applications. It is imperative for the customer organization to conduct an assessment of what logs (and other data) are available, how they are collected and processed, and finally how and when they may be delivered by the CSP.

Some incidents can only be detected by the cloud provider, e.g., because they concern the infrastructure hosted by the cloud provider; the SLAs must be such that the cloud provider informs the customer about these incidents in a timely and reliable manner. For other incidents, even though they are detectable by the customer, the cloud provider may be in a better position for detection. Cloud customers should prefer cloud providers that optimally assist in the detection of incidents. For the detection and subsequent analysis of incidents on the customer side, logging information is required.
When collecting required log information, ensure that the following are taken into consideration:

o Clock Synchronization. All cloud component clocks should be synchronized. Ensure that the cloud provider's log data sufficiently distinguishes time zones for accurate forensic interpretation.
o Elasticity Characteristics. As new cloud resources (VMs, etc.) are brought online to service demand, the log information produced by each new resource instance will need to be added to the stream of log data when appropriate.
o Virtualization Components. As appropriate, ensure that you can retrieve required hypervisor log data.
o Audit Logs. Acquisition of audit logs from all required components (e.g. network, system, application, and cloud administration roles and accesses, backup and restore activities, maintenance access, change management activity)
o Performance Logs. These logs may help provide indications of notification
o Legal Requirements. All organizations involved in the response must be able to ensure that any data collected
o Data Formats. Normalization of event and other data is a considerable challenge. The use of open formats (such as the emerging CEE [4]) may ease processing at the customer side.

The amount of data produced from the cloud deployment may be considerable. It may be necessary to investigate the cloud provider's options for filtering logs within the cloud service, before they are sent to the customer, to reduce the impact on the network and on the customer's internal processing. Additional considerations include the level of analysis or correlation performed by the CSP and the cloud tenant to identify possible incidents prior to forensics. If analysis is performed at the CSP, the escalation and hand-off points for the incident investigation must be determined.
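
The clock synchronization and elasticity considerations above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each resource instance delivers its log as (ISO 8601 timestamp, message) pairs carrying an explicit UTC offset; the function names and the sample records are hypothetical and do not correspond to any particular provider's log format. The sketch converts every timestamp to UTC and merges the per-instance streams into a single timeline.

```python
import heapq
from datetime import datetime, timezone


def to_utc(timestamp):
    """Parse an ISO 8601 timestamp and convert it to UTC.

    Timestamps without an explicit offset are rejected, because their
    time zone cannot be reliably determined after the fact.
    """
    parsed = datetime.fromisoformat(timestamp)
    if parsed.tzinfo is None:
        raise ValueError(f"timestamp has no time zone offset: {timestamp!r}")
    return parsed.astimezone(timezone.utc)


def normalize(source, records):
    """Yield (utc_time, source, message) tuples for one log source."""
    for timestamp, message in records:
        yield (to_utc(timestamp), source, message)


def merge_streams(streams):
    """Merge per-source logs into one timeline ordered by UTC time."""
    # Sort each source on its own, then merge the sorted sequences.
    per_source = [sorted(normalize(name, recs)) for name, recs in streams.items()]
    return list(heapq.merge(*per_source))


# Hypothetical sample data: two instances logging in different time zones.
streams = {
    "vm-a": [
        ("2011-06-01T14:00:05+02:00", "login failure"),
        ("2011-06-01T14:00:01+02:00", "service start"),
    ],
    "vm-b": [
        ("2011-06-01T08:00:03-04:00", "new instance online"),
    ],
}

timeline = merge_streams(streams)
for when, source, message in timeline:
    print(when.isoformat(), source, message)
```

Rejecting timestamps that lack an offset is a deliberate choice in this sketch: guessing a time zone during an investigation risks reordering events, so such records are better flagged and resolved with the provider.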

2.2.2 Forensic and other Investigative Support

Although they are still immature, efforts are already underway within the forensic community to develop the tools and protocols to collect and examine forensic images derived from virtualized environments. It is important that customers understand their own forensic requirements, research what the vendor may offer for meeting those requirements, and address any gaps.

The IaaS cloud customer should request that the vendor provide access to virtual images through such mechanisms as VM snapshots or VM introspection. On the customer side, the capability to stand up a

To greatly facilitate detailed offline analyses, look for cloud providers with the ability to deliver snapshots of the customer’s entire virtual environment: firewalls, network (switches), systems, applications, and data. Also, providers that can use their management backplane/systems to scope an incident and identify only those nodes that are under attack can greatly enhance the response.

The organization’s IR team should familiarize themselves with the information tools the cloud vendor provides to assist the operations and IR processes of its customers. Knowledge base articles, FAQs, incident diagnosis matrices, etc. can help fill the experience gap a cloud customer will have with regard to the cloud infrastructure and its operating norms. This information may assist the IR team in discriminating operational issues from true security events and incidents.

2.2.3 Communications during an Incident

Standards exist to communicate incident information for the purpose of sharing indicators of compromise or to actively engage another party in an investigation. The standards were developed in the Internet Engineering Task Force (IETF) and are also incorporated in the International Telecommunication Union’s (ITU) Cyber Security Exchange (CYBEX) project.
The Incident Object Description Exchange Format (IODEF), specified in RFC 5070, provides a standard XML schema for describing an incident, and Real-time Inter-network Defense (RID), specified in RFC 6045 and RFC 6046, describes a standard method for communicating incident information between entities, which include a CSP and its tenant. Parties should consider the means by which sensitive information is transmitted between them, ensuring that out-of-band channels are available and that encryption schemes are used to ensure the integrity and authenticity of information.

2.3 Containment, Eradication, and Recovery

As with the other phases of Incident Response, close coordination with all stakeholders is required to ensure that strategies developed to contain, eradicate, and recover from an incident are effective, efficient, and take into consideration all legal and privacy implications. The options must also be consistent with business goals and seek to minimize disruption to service. This is considerably more challenging when multiple organizations are at the table.

At the technical level, options for this phase will differ depending upon the deployment and service model, and also upon the layer of the stack at which the attack was targeted. There may be multiple strategies that can be employed, possibly by different entities equipped with different technological solutions. If at all possible, thought exercises should be conducted in the preparation phase to anticipate these scenarios, and a conflict resolution process should be identified. Once the “owner” (or owners) of a particular containment or eradication strategy is identified, that owner must verify implementation of the strategy. There may be multiple steps to the strategy that rely on coordinated timing to be successful.

Consumers of IaaS are primarily responsible for containment of, eradication of, and recovery from incidents; however, providers may be able to assist with certain categories of attack, such as Denial of Service. The extent to which, and the conditions under which, facilities at the provider will be made available to the customer to assist in responding to an attack should be identified in the preparation phase.

The situation is more complicated for SaaS and PaaS deployments. Organizations are advised to investigate the facilities offered by their providers to contain, eradicate, and recover from an incident. Consumers may have little (technical) ability to contain an incident in SaaS and PaaS services other than closing down user access and inspecting their data as hosted within the service prior to a later re-opening. As in traditional deployments, such a decision must be based on the business impact of losing the service weighed against the business impact of the service being corrupted. Furthermore, SaaS and PaaS consumers are reliant upon their CSPs to provide timely fixes to flaws affecting the provider's code before they are able to resume service. Customers must also consider how their provider will handle incidents affecting the provider itself or affecting other tenants on a shared platform, in addition to incidents that are directly targeted at their own organization.

Cloud deployments may have some benefits in this phase. For example, if there are issues with a service running on a particular IaaS cloud, the customer may have the option of moving the service to another cloud, especially if it has implemented one of the metacloud management solutions. As discussed in the introduction, the relative ease with which nodes can be shut down and new instances brought up may help to minimize service interruption when a code fix needs to be deployed.
Smaller enterprises may benefit from the economies of scale that allow more expensive mitigation technologies, such as DoS protection, to be extended to their sites.

In a cloud environment there may be many system images with identical or similar vulnerabilities that can allow an exploit to propagate beyond the initial entry point. It must be determined that no further intrusion took place and that the exploit was not used on other instances in the cloud. The impacted images must be identified and isolated to prevent propagation, and analyzed to determine how the attack took place. If the hypervisor layer is intact, the cloud environment has an advantage because of the ability to rapidly create copies for analysis. Network isolation is often also required to triage the environment, and this is a weaker area for the cloud because of the challenge of inserting network monitoring between virtual systems.

Once the attack has been identified, the impacted systems can be isolated by network filtering, altering the impacted code, or pausing the images. For applications with the ability to perform failover or execute a DR plan, their functions can be moved to unaffected systems. Simultaneously, the historical extent of the incident is determined so that known good copies of the system and data can be identified for the recovery phase.

Recovery efforts should include robust verification that the root cause has been identified and remediated prior to the application coming back online. This is crucial to avoiding a “race condition” where an unidentified vulnerability allows the attacker to compromise newly provisioned nodes. This may require recreating a base image or restoring a known good backup and applying the mitigation.
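
The scoping step described above, finding other instances that may carry the same vulnerability as a compromised one, can be sketched as follows. This is a minimal illustration that assumes the customer holds an inventory of (instance identifier, base image identifier) pairs; the inventory format, the function name, and the sample identifiers are hypothetical and are not taken from any provider's API.

```python
def instances_sharing_image(inventory, compromised_ids):
    """Return instances built from the same base image as a compromised one.

    inventory: iterable of (instance_id, base_image_id) pairs.
    compromised_ids: set of instance ids already known to be affected.
    """
    inventory = list(inventory)
    image_of = {instance_id: image_id for instance_id, image_id in inventory}
    # Base images used by at least one known-compromised instance.
    suspect_images = {
        image_of[instance_id]
        for instance_id in compromised_ids
        if instance_id in image_of
    }
    # Siblings of compromised instances are candidates for isolation and review.
    return sorted(
        instance_id
        for instance_id, image_id in inventory
        if image_id in suspect_images and instance_id not in compromised_ids
    )


# Hypothetical inventory: three instances share the image "img-a".
inventory = [
    ("vm-1", "img-a"),
    ("vm-2", "img-a"),
    ("vm-3", "img-b"),
    ("vm-4", "img-a"),
]
candidates = instances_sharing_image(inventory, {"vm-2"})
print(candidates)
```

Sharing a base image only makes an instance a candidate for review; whether the exploit was actually used against it still has to be established from logs and forensic artifacts.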
For attacks targeted lower in the stack, the “owner” of the particular system affected should verify that any configuration fixes, patches, or other remediation efforts have been universally deployed.

Post-recovery, a "Lessons Learned" activity takes place, leading with the Incident Report. A detailed Incident Report is generated based on the previous activities, to be shared with impacted parties. In a cloud environment this includes the cloud provider and related organizations, in addition to your internal IR team. The Incident Report should include the timeline of the incident, analysis of the root cause or vulnerability, actions taken to mitigate problems and restore service, and recommendations for long-term corrective action.

Corrective actions are likely to be a blend of customer-specific and provider-supported measures, and the provider Incident Response team should provide a section with its perspective on the incident and proposed resolution. After an initial review of the Incident Report by the customer and service provider, joint discussions should be held to develop and approve a remediation plan.

3. Recommendations

o Cloud customers must understand how the CSP defines events of interest vs. security incidents, which events/incidents the CSP reports to the cloud customer, and in which way. Event/incident reports that are supplied in an open format (such as the emerging CEE [4], IODEF [5], or IDMEF [6]) can facilitate the processing of these reports at the customer side.
o Cloud customers must understand the CSP's support for incident analysis, particularly the nature (content and format) of data the CSP will supply for analysis purposes and the level of interaction with the CSP's incident response team.
  In particular, it must be evaluated whether the data available for incident analysis satisfies legal requirements on forensic investigations that may be relevant to the cloud customer.
o Especially in the case of IaaS, cloud customers should favor CSPs that leverage the opportunities virtualization offers for forensic analysis and incident recovery, such as access/roll-back to snapshots of virtual environments, virtual-machine introspection, etc.
o Cloud customers must set up proper communication paths with the CSP that can be utilized in the event of an incident.
o For each cloud service, cloud customers should identify the most relevant incident classes and prepare strategies for the containment, eradication, and recovery of incidents; it must be assured that each cloud provider can deliver the necessary assistance to execute those strategies.

4. Requirements

o For each cloud service provider that is used, the approach to detecting and handling incidents involving resources hosted at that provider must be planned and described in the enterprise incident response plan.
o The SLA of each cloud service provider that is used must guarantee the support for incident handling required for effective execution of the enterprise incident response plan for each stage of the incident handling process: detection, analysis, containment, eradication, and recovery.
o Testing will be conducted at least annually. Customers should seek to integrate their testing procedures with those of their provider (and other partners) to the greatest extent possible.

Bibliography

[1] GRANCE, T., KENT, K., and KIM, B., Computer Security Incident Handling Guide. NIST Special Publication 800-61
[2] GROBAUER, B. and SCHRECK, T., Towards Incident Handling in the Cloud: Challenges and Approaches. In Proceedings of the 3rd ACM Cloud Computing Security Workshop (CCSW), Chicago, Illinois, October 2010
[3] REED, J., Following Incidents into the Cloud. SANS Reading Room
[4] FITZGERALD, E., et al., Common Event Expression (CEE) Overview. Report of the CEE Editorial Board, 2010
[5] DANYLIW, R., et al., The Incident Object Description Exchange Format. RFC 5070, IETF, 2007
[6] DEBAR, H., et al., The Intrusion Detection Message Exchange Format (IDMEF). RFC 4765, IETF, 2007

Copyright © 2011 Cloud Security Alliance