An Integrated CyberSecurity Approach for HEP Grids
Workshop Report
http://hpcrd.lbl.gov/HEPCybersecurity/

1. Introduction

The CMS and ATLAS experiments at the Large Hadron Collider (LHC), being built at CERN in Switzerland, each involve approximately 2000 physicists from around the world. DOE, as the host of the US CMS and ATLAS Tier 1 centers, is providing a key element of the global support infrastructure for these experiments. The LHC Grid will combine resources from many sites, including several very large compute clusters. Reliable and sustained access to these data, compute, and communication resources holds the key to the productivity of the CMS and ATLAS communities. The challenges posed in protecting and maximizing the utility of this widely distributed ensemble of resources, while providing open access to the community of physicists, are significant. Current experience dictates that we must be able to quickly identify, isolate, and react to intentional unacceptable use of any part of the computing infrastructure. The sheer size and prominence of the LHC worldwide Grid attracts attention. The potential to harness its enormous compute power may encourage malicious attacks which, if successful, can then turn that power to further mischief. At a March 2005 workshop in Oakland, CA, participants identified a number of critical areas to be addressed that will build on existing work and provide a coherent program to reduce the risk to the large investment in LHC computing. These fall into four general categories: risk analysis; the ability for a VO to monitor and control its resources; the ability to recover quickly from an incident; and vulnerability analysis of the middleware. Every physicist is expected to be able to access every resource, as governed by experiment policy and authorization.
By breaking into a single vulnerable system, therefore, an intruder can potentially gain access to many other resources. The program of work therefore takes account of the inherently distributed nature of the problem by putting strong emphasis on coordinated response to and control of an incident, since security at one location can be compromised by events at another. Last year the San Diego Supercomputer Center was completely offline for an entire week due to a security compromise. The LHC Grid represents an extremely valuable resource, and our goal is to develop capabilities and procedures that will minimize the impact a security incident has on the availability and effectiveness of our production infrastructure. In an environment such as the LHC Grid, covering a very diverse set of resources down to individual laptops, we plan on the assumption that some security compromise is inevitable. If a system is compromised, it must be quickly isolated, and recovery must be rapid, efficient, and thorough so that cost and latent risk are minimized. Sites must be able to regain control of their resources as quickly as possible and to prevent the compromise from spreading to other sites. This includes the ability to quickly and selectively disable both users and services. If a user credential is known to be compromised, it is important to be able to quickly determine the complete list of resources that were accessed using that credential since it was compromised. The ability to recover quickly from a security incident has the added value of allowing fast recovery from non-malicious user errors; in fact, user or administrator errors can cause as much damage as a malicious hacker. It is also important to be able to quickly determine when problems are due to unauthorized activities and when they are due to activities triggered by legitimate members of the CMS and ATLAS communities.
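The requirement above, reconstructing the complete list of resources touched by a compromised credential, can be illustrated with a minimal sketch. This assumes a centralized audit trail already exists; the record fields ("subject", "resource", "timestamp") and example values are hypothetical, not an existing LHC Grid log format.

```python
# Hypothetical sketch: given parsed audit records, list every distinct
# resource a compromised credential accessed after the estimated time of
# compromise, ordered by first access. Field names are assumptions.
from datetime import datetime

def resources_accessed(records, subject_dn, compromised_since):
    """Return distinct resources accessed by subject_dn on or after
    compromised_since, ordered by first access time."""
    first_seen = {}
    for rec in records:
        ts = datetime.fromisoformat(rec["timestamp"])
        if rec["subject"] == subject_dn and ts >= compromised_since:
            first_seen.setdefault(rec["resource"], ts)
    return sorted(first_seen, key=first_seen.get)

audit_trail = [
    {"subject": "/DC=org/CN=alice", "resource": "tier1.example/ce01",
     "timestamp": "2005-03-01T09:00:00"},
    {"subject": "/DC=org/CN=alice", "resource": "tier2.example/se03",
     "timestamp": "2005-03-02T14:30:00"},
    {"subject": "/DC=org/CN=bob",   "resource": "tier1.example/ce01",
     "timestamp": "2005-03-02T15:00:00"},
]

print(resources_accessed(audit_trail, "/DC=org/CN=alice",
                         datetime(2005, 3, 2)))
# → ['tier2.example/se03']
```

In practice such a query would run against the distributed Security Logging and Auditing Service described later in this report, not an in-memory list; the sketch only shows why normalized subject and timestamp fields make the question answerable at all.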
To help our planning and the prioritization of the proposed program of work, we define the vulnerabilities of concern in a potential incident as follows:

- Loss of unique data
- Insertion of fraudulent data
- Inability to reestablish control of the computing infrastructure after an incident
- Subversion of system software (loss of integrity)
- Inability to ingest detector output
- Massive coherent failure of the ensemble of resources
- Compromise of key infrastructure
- Pervasive slowdown due to a compromise that cannot be removed

We have arrived at the program of work below by weighing risk, likelihood, impact, and our ability to mitigate and respond. Clearly, responsibility for the defense and continuous operation of the LHC computing systems spans all organizations involved: the experiments, the facility administrators, the middleware and service providers, and the end users. All these players are already closely involved in the planning and execution of tasks to protect the LHC systems and to provide the end-to-end security and trust infrastructure that allows controlled access to and open use of the systems by the physics communities. In this document we describe a set of tasks that, delivered as a coherent and managed program of work across the facility security teams, the technology providers, and the experiments, will significantly reduce the vulnerability of the LHC computing environment to security incidents. We feel it is crucial to begin this work as soon as possible. We recommend that the community begin working on a set of "best practice" documents to help build consensus within the HEP community on what the risks and issues are and what the best solutions are.

2. Goals and Requirements

An experimental collaboration constitutes a "Virtual Organization," or VO, and is expected to operate information resources as its infrastructure. The VO has a duty to contribute to the overall security of the shared infrastructure.
For example, it is important to ensure that a compromise at a Tier 3 site or on a scientist's laptop does not compromise the entire Grid. Only the VO can know what jobs are running and what the current resource utilization should be. The VO is also responsible for detecting and terminating runaway jobs, which may be due to user error or software bugs. A virtual organization is composed of multiple real organizations, each of which has its own security requirements as well. Security tools and solutions must be designed and deployed in a manner that facilitates the exchange of information between organizations and virtual organizations.

1. The impact of a compromised user credential should be restricted to that user's work, and should ideally be short-lived, such that its malicious capabilities time out on a manageable time scale. The same goes for compromised host credentials.
2. The impact of a compromise (root account, etc.) on a resource should be restricted as much as possible to that resource.
3. Higher-risk services should be structured such that the impact and scope of any compromise is minimized.
4. Response to and control of incidents should be tested in a realistic distributed environment.
5. The latency of response to and containment of incidents should be minimized.
6. Usable and timely forensic information should be available to the incident response teams to allow tracing of the source and scope of an incident.
7. Stakeholders (site security, VO administration, etc.) need to collect and review information independently, and have the ability to share and compare their analyses.

3. Program of Work

At a March 2005 workshop in Oakland, CA, participants identified a number of critical areas to be addressed that will build on existing work and provide a coherent program to reduce the risk to the large investment in LHC computing.
These fall into four general categories: risk analysis; the ability for a VO to monitor and control its resources; the ability to recover quickly from an incident; and vulnerability analysis of the middleware.

Item 1: Risk Analysis and Best Practices

It is essential to perform ongoing risk analysis of the LHC computing infrastructure. This includes analysis of the software stacks, the configurations of resources and services, and the trust relationships between all parties. It also includes closely monitoring new security exploits as they appear. This activity will provide periodic information to guide the program of work and prioritize the focus of the security teams.

Item 2: Security Logging and Auditing Service

The core component of this task is a real-time Security Logging and Auditing Service. This information service would contain as much log data as possible related to a set of Grid jobs, including host syslogs, CA logs, middleware logs, and so on. Some level of logging from firewalls and IDSs would also be very useful, but these will likely need to be sanitized before sites would release them. This data will be used to help identify problems and to recover quickly from an incident. It will also be used to help debug authentication and firewall problems (situations where there is currently no useful error message to explain why something did not connect). It would also provide the audit trail necessary for fast recovery after a security incident.

Requirements:
1. Standardize the audit entry formats wherever possible to facilitate subsequent browsing, querying, and filtering.
2. Instrument the middleware runtimes to securely log relevant audit information.
3. Provide an integrating and organizing framework to collect many diverse sources of information (e.g., routers, job logs, etc.) to reconstruct the thread-of-work through the Grid fabric.
4. Make the audit information discoverable and accessible to diverse organizations through common interfaces.
5. Provide real-time collection and analysis of the information to enable timely response.
6. Build in data filtering mechanisms so that we are not overwhelmed by too much log data.
7. Provide trusted organizations secure access to the distributed audit information.

The tasks required to enable an organization or VO to monitor and control their Grid are:

Security Logging and Auditing Service: Deployment of a scalable and reliable real-time service. Existing solutions such as the EDG logging and bookkeeping service will be evaluated. Tools to integrate existing log files will be developed.

Auditing of all components: We will perform an analysis of what needs to be audited from each component, and work with middleware developers to ensure they are logging the necessary information. This logging will be integrated with the information service.

Resource vulnerability scanning: Organizations and VOs need the ability to scan site Grid resources for vulnerabilities, since small sites may not be doing this and large sites might miss something. This will help VOs perform security certification of the Grid resources they are responsible for, and help maximize the utility of their Grid.

IDS / IPS: Intrusion detection systems should be deployed to monitor Grid use and detect unauthorized behavior (due to user error, users breaking the rules, or unauthorized use). This data must be integrated into the information service.

Border Control (site and VO): The boundaries of enclaves of trust are places where information is gathered and control may be applied. These borders must be clearly defined, and then protected.

Configuration Verification: Many security mechanisms, such as firewalls, VOMS servers, and so on, are configured and maintained by hand, and the chance of misconfiguring something is high.
Therefore, it is critical that the various layers are integrated and that configuration of the system is automated to the extent possible. Mechanisms to check the configuration of each of the layers and to analyze the security of the configured whole are essential.

Item 3: Incident Response and Recovery

The key to incident response is to be able to quickly contain its scope and to recover. It is often very difficult to determine the extent of the damage and what must be done to clean up after an incident. For example, if a user credential is known to be compromised, what is the complete list of resources that were accessed using that credential? If a single host at a site has been rootkitted, what other hosts might be compromised as well? If a vulnerability is found in a Grid middleware component, how do we locate all sites where that version of the middleware is installed, disable those resources until the vulnerability is fixed, and then patch or upgrade the software on all those resources? This task includes the following work items:

Incident Response: Incident response typically needs to be coordinated among the local resource, local network, border, virtual organization, and wide-area network, and needs to be automated to the extent possible. Effective incident response requires accurate information and analysis of the attack, which will be provided by the VO information service. It also requires coordination between several sites by means of a confidential communication channel, which the team of responders must be able to create rapidly. A suite of secure information and communication services, tuned to the needs of security officers and their partners responding to an ongoing incident, is needed.

Forensics: Forensic data from all levels of the system are critical to long-term response (i.e., prosecution) and effective recovery.
There are two primary goals: to collect evidence and to minimize recovery effort. This data is often high volume and comes from a diverse set of sources. The responders need to be able to analyze the data in real time and determine exactly which hosts have been compromised, and the nature of the compromises, in order to contain the attack and narrow the recovery effort.

Security Testing: Tiger teams will be formed to look for vulnerabilities, and response planning will be done. We will also perform two major security drills.

Item 4: Middleware Vulnerability Testing and Analysis

This activity will be responsible for evaluating and enhancing the quality of the middleware from a security perspective. This includes, but is not limited to, vulnerability to attacks, ease of patching, and installation procedures. From a security perspective, the end-to-end quality of a software stack is not determined only by its resistance to malicious attacks. The time it takes to replace a version with a known vulnerability by a new version that eliminates the vulnerability plays an important role in determining the quality of our software stack. Testing and analysis of all middleware is required. External software audits are needed; these could be done as software peer reviews, in which middleware developers review each other's architectures.

Other Work

This workshop also identified several other areas that we feel are important, but whose scope is much broader than LHC computing alone. We hope other projects will address these issues. They include:

Wide-Area Network Monitoring: The wide-area network provides an excellent place to track attack trends, detect worms and viruses, and recognize attack patterns. Connection logs and netflow data from the routers or from a network IDS can be used for this. By monitoring key Internet exchange points, one can provide an early warning system for viruses, worms, and attacks, and potentially block an attack before it reaches the end sites.
Also, through cooperation with the end sites, an attack manifesting at one site can be blocked from attacking other sites.

Data Integrity: User error, hardware error, TCP checksum issues, intentional corruption, and so on.

Authentication / Authorization Issues: Protection against theft of short-term credentials or session keys, and against session hijacking. As the revocation of credentials is very expensive from an operational and management perspective, short-lived assertions should be used wherever possible. This would require further development and deployment of credential issuing services, such as MyProxy and GridLogon.

Authorized Audit Log Write/Read Access: The audit data is both sensitive and vital for investigations and recovery. Writes to those logs should therefore be integrity protected and authenticated. Furthermore, access to the logs should be subject to access control policy and should allow trusted audit officers to access logs in other administrative domains to reconstruct the forensic trail through the Grid fabric. This would require a fine-grained access control policy framework integrated with the audit log and collection services.

Disposable Execution Environments: Virtual machine technologies such as Xen and VMware allow the creation of a restricted execution environment that can be destroyed or reloaded. The isolation properties of VM technologies may help confine compromises to a single image and prevent rootkits from taking over complete physical machines. Furthermore, paused/frozen images of an OS with a selected set of installed and configured applications could be used to facilitate security-related updates and patches, and to substantially speed up recovery after a detected compromise. These technologies are maturing and should be evaluated for use in the LHC Grid.

Rootkit Detection: Better tools for the detection of rootkits are needed.
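The audit-log integrity requirement above (integrity-protected, authenticated writes) can be sketched with a keyed hash chain: each entry carries an HMAC computed over the entry and the previous entry's tag, so in-place tampering or reordering is detectable by anyone holding the verification key. This is a minimal illustration, not a proposed implementation; key management and the field layout are assumptions.

```python
# Hypothetical sketch of a hash-chained, HMAC-authenticated audit log.
# Tampering with any stored entry breaks verification of the chain.
import hmac
import hashlib

def append_entry(log, key, message):
    """Append message with a tag chained to the previous entry's tag."""
    prev_tag = log[-1][1] if log else b""
    tag = hmac.new(key, prev_tag + message.encode(), hashlib.sha256).digest()
    log.append((message, tag))

def verify_log(log, key):
    """Recompute the chain; False if any entry was altered or reordered."""
    prev_tag = b""
    for message, tag in log:
        expected = hmac.new(key, prev_tag + message.encode(),
                            hashlib.sha256).digest()
        if not hmac.compare_digest(tag, expected):
            return False
        prev_tag = tag
    return True

key = b"shared-audit-key"  # illustrative only; real deployments need managed keys
log = []
append_entry(log, key, "job 42 submitted by /CN=alice")
append_entry(log, key, "job 42 started on tier1.example")
print(verify_log(log, key))   # an intact chain verifies (True)

# Rewrite the first entry in place, keeping its old tag:
log[0] = ("job 42 submitted by /CN=mallory", log[0][1])
print(verify_log(log, key))   # tampering is detected (False)
```

Note that this simple chain does not by itself detect truncation of the most recent entries; deployments typically add periodic anchoring of the latest tag to a separate trusted store.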
Best Practices / Community Consensus

It is important to start building community consensus on the best way to secure sites that are part of the LHC Grid. We recommend that a set of "best practices" documents on several aspects of Grid security be written, including the following:

- Risk analysis of the LHC Grid: What are the main risks in terms of likelihood and recovery cost?
- Key management: What are the issues involving user and host key management (e.g., caching, revocation, etc.)?
- Logging and auditing: What components should be included in standard logging and auditing? This would include a detailed report on what we log today and some ideas on how this information can be collected and used. What information should be logged locally, what should be logged centrally, and what data filtering can be done?
- Scanning and VO certification: What vulnerabilities can be monitored via scanning, and what checklist of items could a VO use to certify that a given Grid resource meets its security standards?
- Integrated IDS: What should the IDSs be looking for, and what information should be exchanged between sites?
- Incident response: What steps should be taken to contain and recover from an incident?
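The logging-and-auditing and data-filtering questions above can be made concrete with a strawman normalized entry format. The field set below is an assumption offered as a discussion starting point, not an agreed LHC standard; the point is that once diverse site logs are mapped into one schema, cross-site filtering (so collectors are not overwhelmed) becomes a one-line operation.

```python
# Hypothetical strawman for a common audit-entry schema plus severity
# filtering. Field names and severity scale are assumptions.
import json

FIELDS = ("timestamp", "site", "component", "subject", "action", "severity")

def make_entry(**kwargs):
    """Build a normalized audit entry as a JSON line; reject unknown fields."""
    unknown = set(kwargs) - set(FIELDS)
    if unknown:
        raise ValueError(f"unexpected fields: {unknown}")
    entry = {f: kwargs.get(f, "") for f in FIELDS}
    return json.dumps(entry, sort_keys=True)

def filter_entries(entries, min_severity):
    """Forward only entries at or above a severity threshold, so central
    collectors are not flooded with low-value log data."""
    return [e for e in entries
            if json.loads(e)["severity"] >= min_severity]

entries = [
    make_entry(timestamp="2005-03-02T14:30:00", site="FNAL",
               component="gatekeeper", subject="/CN=alice",
               action="job-submit", severity=1),
    make_entry(timestamp="2005-03-02T14:31:07", site="FNAL",
               component="gatekeeper", subject="/CN=unknown",
               action="auth-failure", severity=5),
]
print(filter_entries(entries, min_severity=4))
# only the high-severity auth-failure entry is forwarded
```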