An Integrated CyberSecurity Approach for HEP Grids
Workshop Report
http://hpcrd.lbl.gov/HEPCybersecurity/
1. Introduction
The CMS and ATLAS experiments at the Large Hadron Collider (LHC), being built at
CERN in Switzerland, each involve approximately 2000 physicists from around the
world. DOE, as the host of the US CMS and ATLAS tier 1 centers, is providing a key
element of the global support infrastructure for these experiments. The LHC Grid will
combine resources from many sites, including several very large compute clusters.
Reliable and sustained access to these data, compute and communication resources holds
the key to the productivity of the CMS and ATLAS communities.
The challenges posed in protecting and maximizing the utility of the widely distributed
ensemble of resources while providing open access to the community of physicists are
significant. Current experience dictates that we must be able to quickly identify, isolate
and react to intentional unacceptable use of any part of the computing infrastructure. The
sheer size and prominence of the LHC worldwide Grid attract attention. The potential to
be able to harness the enormous compute power may encourage malicious attacks which,
if successful, can then be turned around to use that power for further mischief.
At a March 2005 workshop in Oakland, CA, workshop participants identified a number
of critical areas to be addressed that will build on existing work and provide a coherent
program to reduce the risk to the large investment in LHC computing. These fall into four
general categories: risk analysis, the ability of a virtual organization (VO) to monitor and control
its resources, the ability to recover quickly from an incident, and vulnerability analysis
of the middleware.
Each and every physicist is expected to be able to access each and every resource
controlled by experiment policy and authorization. By breaking into a single vulnerable
system, therefore, an intruder can potentially gain access to many other resources. The
program of work, therefore, takes account of the inherently distributed nature of the
problem by putting strong emphasis on coordinated response to and control of an
incident, since security at one location can be compromised by events at another location.
Last year the San Diego Supercomputer Center was completely offline for an entire week
due to a security compromise. The LHC Grid represents an extremely valuable resource,
and our goal is to develop capabilities and procedures that will minimize the impact a
security incident will have on the availability and effectiveness of our production
infrastructure.
In an environment such as the LHC Grid, covering a very diverse set of resources down
to individual laptops, we plan on the assumption that some security compromise is
inevitable. If a system is compromised, the system must be quickly isolated and recovery
needs to be rapid, efficient, and thorough so that cost and latent risk are minimized. Sites
must be able to regain control of their resources as quickly as possible and to prevent the
compromise from spreading to other sites. This includes the ability to quickly and
selectively disable both users and services. If a user credential is known to be
compromised, it is important to be able to quickly determine the complete list of
resources that were accessed using that credential since it was compromised.
The ability to quickly recover from a security incident has the additional benefit of
allowing fast recovery from non-malicious user errors. In fact, user or administrator
errors can cause as much damage as a malicious hacker. It is also important to be able to
quickly determine when problems are due to unauthorized activities and when they are
due to activities triggered by legitimate members of the CMS and ATLAS communities.
To help our planning and the prioritization of the proposed program of work we define
the vulnerabilities to a potential incident thus:
 Loss of unique data
 Insertion of fraudulent data
 Inability to reestablish control of the computing infrastructure after an incident
 Subversion of system software (loss of integrity)
 Inability to ingest detector output
 Massive coherent failure of the ensemble of resources
 Compromise of key infrastructure
 Pervasive slowdown due to a compromise that cannot be removed
We have arrived at the program of work below by weighing risk, likelihood, impact and our ability
to mitigate and respond. Clearly responsibility for the defense and continuous operation
of the LHC computing systems spans all organizations involved: the experiments, the
facility administrators, the middleware and service providers and the end users. All these
players are already closely involved in the planning and execution of tasks to protect the
LHC systems, and to provide the end-to-end security and trust infrastructure to allow
controlled access to and open use of the systems by the physics communities.
In this document we describe a set of tasks that, delivered as a coherent and managed
program of work across the facility security teams, the technology providers and the
experiments, will significantly reduce the vulnerability of the LHC computing
environment to security incidents.
We feel it is crucial to begin this work as soon as possible. We recommend that the
community begin working on a set of “best practice” documents to help build
consensus within the HEP community on what the risks and issues are, and what the
best solutions are.
2. Goals and Requirements
An experimental collaboration constitutes a “Virtual Organization,” or VO, and is
expected to operate information resources as its infrastructure. The VO has a duty to
contribute to the overall security of the shared infrastructure. For example, it is important
to ensure that a compromise at a Tier 3 site or a scientist’s laptop does not compromise
the entire grid. Only the VO can know what jobs are running and what the current set of
resource utilization should be. The VO is also responsible for detecting and terminating
runaway jobs, which may be due to user error or software bugs.
A virtual organization is composed of multiple real organizations, each of which has
its own security requirements as well. Security tools and solutions must be designed
and deployed in a manner that facilitates the exchange of information between organizations
and virtual organizations.
1. The impact of a compromised user credential should be restricted to that user’s work,
and the credential should ideally be short-lived so that its usefulness to an attacker times out on a
manageable time scale. The same applies to compromised host credentials.
2. The impact of a compromise (root account etc) on a resource should be restricted as
much as possible to that resource.
3. Higher risk services should be structured such that the impact and scope of any
compromise is minimized.
4. Response to and control of incidents should be tested in a realistic distributed
environment.
5. The latency of response to and containment of incidents should be minimized.
6. Usable and timely forensic information should be available to the incident response
teams to allow tracing of the source and scope of an incident.
7. Stakeholders (site security, VO administration, etc) need to collect and review
information independently, and have the ability to share and compare their analyses.
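Goal 1 can be illustrated with a minimal sketch of a short-lived credential. The `Credential` class, its fields, and the `exposure_window` helper are hypothetical names chosen for illustration; they do not correspond to any real Grid proxy format or middleware API. The sketch assumes only that a credential carries an issue time and a lifetime.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Credential:
    """Hypothetical short-lived credential; not a real Grid proxy format."""
    subject: str
    issued_at: datetime
    lifetime: timedelta

    def expires_at(self) -> datetime:
        return self.issued_at + self.lifetime

    def is_valid(self, now: datetime) -> bool:
        # Valid only inside the half-open window [issued_at, expires_at).
        return self.issued_at <= now < self.expires_at()


def exposure_window(cred: Credential, compromised_at: datetime) -> timedelta:
    """How long a stolen credential stays usable if it is never revoked."""
    remaining = cred.expires_at() - compromised_at
    return max(remaining, timedelta(0))


# Example values are invented for illustration.
issued = datetime(2005, 3, 1, 12, 0, tzinfo=timezone.utc)
cred = Credential("physicist-a", issued, timedelta(hours=12))
```

Under these assumptions, a credential stolen ten hours after issue remains usable for at most two more hours, which bounds the damage even when revocation is slow.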
3. Program of Work
At a March 2005 workshop in Oakland, CA, workshop participants identified a number
of critical areas to be addressed that will build on existing work and provide a coherent
program to reduce the risk to the large investment in LHC computing. These fall into four
general categories: risk analysis, the ability of a VO to monitor and control
its resources, the ability to recover quickly from an incident, and vulnerability analysis
of the middleware.
Item 1: Risk Analysis and Best Practices
It is essential to perform ongoing risk analysis of the LHC computing infrastructure. This
includes analysis of the software stacks, the configurations of resources and services, and
the trust relationships between all parties. It also includes closely monitoring new
security exploits as they come out. The activity will provide periodic information to guide
the program of work and prioritize the focus of the security teams.
Item 2: Security Logging and Auditing Service
The core component of this task is a real-time Security Logging and Auditing Service.
This information service would contain as much log data as possible related to a set of
Grid jobs, including host syslogs, CA logs, middleware logs, and so on. Some level of
logging from firewalls and IDSs would also be very useful, but these will likely need to
be sanitized before sites would release them.
This data will be used to help identify problems and to quickly recover from an incident.
It will also be used to help debug authentication and firewall problems (situations where
there is not currently a useful error message to understand why something did not
connect). It would also help provide the necessary audit trail to help perform fast
recovery after a security incident.
Requirements:
1. Standardize the audit entry formats wherever possible to facilitate the
subsequent browsing, querying and filtering.
2. Instrument the middleware runtimes to securely log relevant audit information.
3. Provide an integrating and organizing framework to collect many diverse sources
of information (e.g. routers, job logs etc) to reconstruct the thread-of-work
through the Grid fabric.
4. Make the audit information discoverable and accessible to diverse organizations
through common interfaces.
5. Provide real time collection and analysis of the information to enable timely
response.
6. Build in data filtering mechanisms so that we are not overwhelmed by too much
log data.
7. Provide the trusted organizations secure access to the distributed audit
information.
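Requirements 1 and 6 can be sketched as follows. The record layout, the field names, and the severity scale are assumptions made for illustration; the workshop did not specify an audit entry format. The sketch shows one common record type, a one-line-per-entry JSON encoding, and a simple filter.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AuditEntry:
    """Hypothetical common audit record; field names are illustrative."""
    timestamp: str  # ISO 8601, UTC
    site: str
    source: str     # e.g. "syslog", "middleware", "firewall"
    subject: str    # identity the action ran under
    job_id: str
    event: str
    severity: int   # assumed scale: 0 = debug ... 4 = critical


def to_line(entry: AuditEntry) -> str:
    # One JSON object per line, with sorted keys for stable output.
    return json.dumps(asdict(entry), sort_keys=True)


def from_line(line: str) -> AuditEntry:
    # Inverse of to_line; raises TypeError if fields are missing or unknown.
    return AuditEntry(**json.loads(line))


def select(entries, min_severity=0, subject=None):
    """Keep entries at or above a severity floor, optionally for one subject."""
    return [
        e for e in entries
        if e.severity >= min_severity
        and (subject is None or e.subject == subject)
    ]


# Example entries are invented for illustration.
entries = [
    AuditEntry("2005-03-01T12:00:00Z", "site-a", "middleware",
               "physicist-a", "job-1", "job submitted", 1),
    AuditEntry("2005-03-01T12:05:00Z", "site-b", "syslog",
               "physicist-b", "job-2", "authentication failure", 3),
]
```

A shared record type of this kind lets logs from different sources be queried with the same filter, and filtering by severity before forwarding is one way to keep the central service from being overwhelmed.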
The tasks required to enable an organization or VO to monitor and control its Grid are:

Security Logging and Auditing Service: Deployment of a scalable and reliable
real-time service. Existing solutions such as the EDG logging and bookkeeping
service will be evaluated. Tools to integrate existing log files will be developed.

Auditing of all components: We will perform an analysis of what needs to be
audited from each component, and work with middleware developers to ensure
they are logging the necessary information. This logging will be integrated with
the information service.

Resource vulnerability scanning: Organizations and VOs need the ability to scan
site Grid resources for vulnerabilities, since small sites may not be doing this, and
large sites might miss something. This will help VOs to perform security
certification of the Grid resources they are responsible for, and help maximize the
utility of their Grid.

IDS / IPS: Intrusion Detection systems should be deployed to monitor Grid use
and detect unauthorized behavior (whether due to user error, users breaking the rules,
or unauthorized use). This data must be integrated into the information
service.

Border Control (site and VO): The boundaries of enclaves of trust are places
where information is gathered and control may be applied. These borders must be
clearly defined, and then protected.

Configuration Verification: Many security mechanisms such as firewalls, VOMS
servers, and so on are configured and maintained by hand, and the chance of
misconfiguring something is high. Therefore it is critical that the various layers are
integrated and configuration of the system is automated to the extent possible.
Mechanisms to check the configuration of each of the layers and to analyze the
security of the configured whole are essential.
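The configuration check described in the last item above can be sketched as a comparison against an approved baseline. The function name and the setting names in the example are hypothetical, and the sketch assumes that each layer's configuration can be exported as a flat mapping with string keys.

```python
def find_deviations(baseline, actual):
    """Return (key, expected, found) for every setting that differs.

    A setting absent on one side is reported with None on that side, so a
    setting whose legitimate value is None cannot be told apart from a
    missing one in this simplified form.
    """
    deviations = []
    for key in sorted(set(baseline) | set(actual)):
        expected = baseline.get(key)
        found = actual.get(key)
        if expected != found:
            deviations.append((key, expected, found))
    return deviations


# Example settings are invented for illustration.
baseline = {"ssh_root_login": "no", "open_ports": [22, 2811]}
actual = {"ssh_root_login": "yes", "open_ports": [22, 2811], "telnet": "on"}
report = find_deviations(baseline, actual)
```

Running such a check on a schedule, rather than only after a change, is one way to catch hand edits that drift away from the approved configuration.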
Item 3: Incident Response and Recovery
The key to incident response is the ability to quickly contain the scope of an incident and to recover.
Often it is very difficult to determine the extent of the damage, and what must be done to
clean up after an incident. For example, if a user credential is known to be compromised,
what is the complete list of resources that were accessed using that credential? Or if a
single host at a site has been compromised by a rootkit, what other hosts might be compromised as well?
If a vulnerability is found in a Grid middleware component, how do we locate all
locations where that version of the middleware is installed, disable those resources until
the vulnerability is fixed, and then patch / upgrade the software on all those resources?
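Two of the questions above reduce to simple queries once the audit data and a software inventory are available in one place. The following is a minimal sketch; the record layout (timestamp, subject, resource), the inventory layout (host to installed component versions), and all names in the example are assumptions, not an existing service interface.

```python
from datetime import datetime


def resources_touched(records, subject, compromised_at):
    """Resources accessed with a credential at or after its compromise time.

    Each record is assumed to be a (timestamp, subject, resource) tuple.
    """
    return sorted({
        resource
        for timestamp, who, resource in records
        if who == subject and timestamp >= compromised_at
    })


def hosts_running(inventory, component, version):
    """Hosts on which the given version of a component is installed.

    The inventory is assumed to map host name -> {component: version}.
    """
    return sorted(
        host for host, installed in inventory.items()
        if installed.get(component) == version
    )


# Example data is invented for illustration.
records = [
    (datetime(2005, 3, 1, 9, 0), "physicist-a", "storage.site-a"),
    (datetime(2005, 3, 1, 13, 0), "physicist-a", "compute.site-b"),
    (datetime(2005, 3, 1, 14, 0), "physicist-b", "compute.site-c"),
    (datetime(2005, 3, 1, 15, 0), "physicist-a", "storage.site-a"),
]
inventory = {
    "node1.site-a": {"gatekeeper": "2.4"},
    "node2.site-b": {"gatekeeper": "2.5"},
    "node3.site-c": {"gatekeeper": "2.4", "gridftp": "1.0"},
}
```

The queries themselves are trivial; the hard part, and the reason for the logging work in Item 2, is having complete and trustworthy records to run them against.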
This task includes the following work items:

Incident Response: Incident response typically needs to be coordinated between
the local resource, local network, border, virtual organization, and wide-area
network and needs to be automated to the extent possible. Effective incident
response requires accurate information and analysis of the attack, which will be
provided by the VO information server. Effective incident response also requires
coordination between several sites by means of a confidential communication
channel. The team of responders must be able to rapidly create a communication
channel to respond to incidents. A suite of secure information and communication
services, tuned to the needs of security officers and their partners responding to an
ongoing incident, is needed.

Forensics: Forensics data from all levels of the system are critical to long-term
response (i.e. prosecution) and effective recovery. There are two primary goals: to
collect evidence and minimize recovery effort. This data is often high volume and
from a diverse set of sources. The responders need to be able to analyze the data in
real time and determine exactly what hosts have been compromised and
the nature of the compromises to contain the attack and narrow the recovery
effort.

Security Testing: Tiger teams will be formed to look for vulnerabilities, and
response planning will be done. We will also perform two major security drills.
Item 4: Middleware vulnerability testing and analysis
This activity will be responsible for evaluating and enhancing the quality of the
middleware from a security perspective. This includes but is not limited to vulnerability
to attacks, ease of patching, and installation procedures. From a security perspective, the
end-to-end quality of a software stack is not determined only by its resistance to
malicious attacks. The time it takes to replace a version with a known vulnerability with a
new version that eliminates this vulnerability plays an important role in determining the
quality of our software stack. Testing and analysis of all middleware is required.
External software audits are needed, which could be done as software peer reviews,
where middleware developers could review each other’s architectures.
Other Work
This workshop also identified several other areas that we feel are important, but we feel
these issues are much broader than just LHC computing. We hope other projects will be
addressing these issues. These include:

Wide-Area Network Monitoring: The wide-area network provides an excellent
place to track attack trends, to detect worms and viruses, and to recognize attack
patterns. Connection logs and netflow data from the routers or from a network
IDS can be used for this. By monitoring key Internet exchange points, one can
provide an early warning system for viruses, worms, and attacks, and potentially
block the attack before it reaches the end sites. Also, through cooperation with the
end-sites, an attack manifesting at one site can be blocked from attacking other
sites.

Data Integrity: protection against data corruption caused by user error, hardware
error, TCP checksum issues, intentional corruption, and so on.

Authentication / Authorization Issues: protection against the theft of short-term
credentials or session keys, and protection against session hijacking. As the
revocation of credentials is very expensive from an operational and management
perspective, short-lived assertions should be used wherever possible. This would
require further development and deployment of credential issuing services, like
MyProxy and GridLogon.

Authorized Audit Log Write/Read Access: The audit data is both sensitive and
vital for investigations and recovery. The writing to those logs should therefore be
integrity protected and authenticated. Furthermore, the access to the logs should
be subject to access control policy and should allow trusted audit officers to
access the logs in other administrative domains to reconstruct the forensic trail
through the Grid fabric. This would require a fine-grained access control policy
framework integrated with the audit log and collection services.

Disposable Execution Environments: Virtual Machine techniques such as Xen
and VMWare allow the creation of a restricted execution environment that can be
destroyed or reloaded. The insulation properties of VM technologies may be able
to help confine compromises to a single image and prevent rootkits from taking over
complete physical machines. Furthermore, paused/frozen images of an OS with a
selective set of installed and configured applications could be used to facilitate
security related updates and patches, and substantially speed up the recovery
process after a detected compromise. These technologies are maturing, and should
be evaluated for use in the LHC Grid.

Rootkit detection: Better tools for detection of rootkits are needed.
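The integrity protection and authentication of audit log writes, called for in the Authorized Audit Log Write/Read Access item above, can be sketched with a chained message authentication code. This is one possible technique, not a design chosen by the workshop. The function names are hypothetical, and the sketch assumes the writer and the verifier share a secret key; it uses only Python's standard `hmac` and `hashlib` modules.

```python
import hashlib
import hmac

GENESIS = b"\x00" * 32  # fixed starting value for the chain (an assumption)


def _tag(key: bytes, prev_tag: bytes, message: str) -> bytes:
    # Each tag covers the previous tag, which links the entries into a chain.
    data = prev_tag + message.encode("utf-8")
    return hmac.new(key, data, hashlib.sha256).digest()


def append_entry(log, key: bytes, message: str) -> None:
    """Append a message together with its chained HMAC-SHA256 tag."""
    prev_tag = log[-1][1] if log else GENESIS
    log.append((message, _tag(key, prev_tag, message)))


def verify_log(log, key: bytes) -> bool:
    """Recompute the chain; any edited, reordered, or dropped entry breaks it."""
    prev_tag = GENESIS
    for message, tag in log:
        if not hmac.compare_digest(tag, _tag(key, prev_tag, message)):
            return False
        prev_tag = tag
    return True


# Example key and messages are invented for illustration.
key = b"example-shared-secret"
log = []
append_entry(log, key, "job-1 submitted by physicist-a")
append_entry(log, key, "job-1 started on compute.site-b")
append_entry(log, key, "job-1 finished")
```

Note that a chain of this kind does not detect removal of the most recent entries; that would require the latest tag to be stored or countersigned elsewhere. Read access control is a separate concern that this sketch does not address.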
Best Practices / Community Consensus
It is important to start to build community consensus on the best way to secure
sites that are part of the LHC Grid. We recommend that a set of “best practices”
documents on several aspects of Grid Security be written. These include the following:

Risk Analysis of the LHC Grid: What are the main risks in terms of likelihood
and recovery cost?

Key management: What are the issues involving user and host key management
(e.g., caching, revocation)?

Logging and auditing: What components should be included for standard logging
and auditing? This would include a detailed report on what we log today and
some ideas on how this information can be collected and used. What information
should be logged locally, what should be logged centrally, and what data filtering
can be done?

Scanning and VO certification: What vulnerabilities can be monitored via
scanning, and what checklist of items could a VO use to certify that a given Grid
resource meets its security standards?

Integrated IDS: What should the IDSs be looking for, and what information should be
exchanged between the sites?

Incident Response: What steps should be taken to contain and recover from an
incident?