
Towards Privacy Aware Data Analysis Workflows for e-Science

William K. Cheung
Department of Computer Science, Hong Kong Baptist University
Kowloon Tong, Hong Kong
william@comp.hkbu.edu.hk

Yolanda Gil
Information Sciences Institute, University of Southern California
4676 Admiralty Way, Marina del Rey, CA 90292
gil@isi.edu
Abstract

e-Science is becoming more distributed and collaborative, and data privacy is quickly becoming a major concern, especially when the data contain sensitive information. Existing data access policies for privacy management are too restrictive to support the large variety of data analysis needs in e-Science. In this paper, we argue for a new type of policy that governs data privacy based on the type of processing done on the data. A semantic workflow approach is proposed to address the challenge. Data analysis processes are described as workflows. Ontologies for data analysis and privacy preservation describe the functionalities and the privacy attributes of the processes, as well as process-constraining privacy policies. We give examples of such policies and explain their potential areas of application. We also present a case study on distributed data clustering to illustrate how the approach can be integrated with a workflow system to make it privacy aware.
Introduction
Data privacy is important in e-Science, especially when distributed and collaborative data analysis processes are involved. It is not difficult to find scenarios where distributed data analysis and data privacy protection are both needed at the same time. For example, one can analyze individuals' clinical data such as brain images by gaining access to related remote sources for disease diagnosis (Beltrame et al. 2006), where the patients' identities have to be kept strictly confidential. Beyond the threat to subjects' identities, scientists themselves may have privacy concerns about their scientific findings, as data sets, preliminary results, and data analysis processes can now be easily and widely shared in e-Science collaborations (Deelman and Gil 2006). These privacy concerns are important even though the advantages of sharing data to facilitate collaborative scientific research are well understood (NIH 2004). Thus, the need for privacy management support in e-Science is immediate.
Existing data access policies offer a very basic privacy protection mechanism: a user can access a data source if his or her certificates and credentials satisfy the access policies defined for that data source. A less restrictive alternative is to apply privacy preserving techniques to the data before releasing it, so that sensitive information such as personal identity or medical background is properly hidden through anonymization and partitioning (Samarati 2001). In addition, one can design new data analysis algorithms that are themselves privacy preserving, a fast-growing area over the past few years (Clifton et al. 2003). These algorithms preserve data privacy through techniques such as secure multiparty computation (Lin, Clifton & Zhu 2005) and data generalization (Zhang & Cheung 2005), and yet can perform reasonable data analysis.

These existing mechanisms alone are too restrictive for many applications, especially in e-Science. Consider the case of a cancer patient signing a release form for their medical records. The patient may be not only willing but eager to allow access for medical research purposes, as long as the records are anonymized. However, the patient may suspect that, once released, the records could be used by insurance companies to design more profitable insurance rates that would raise his or her medical expenses. Given only the choice of releasing the data or not, without any say on how the data will be used, the patient may decline to allow the use of the records. Clearly, a more flexible privacy mechanism would allow the patient to specify a policy on how the data will be processed, not just on the blanket release of the data per se. Consider another case, where a cancer research laboratory has collected treatment protocol data for thousands of patients over several decades. The lab is happy to share its data with medical researchers in other fields to analyze the relationship of cancer with liver transplantation or with heart failure, but would not want competing groups to use the data in competing cancer research quests. Existing approaches would not allow the lab to specify this kind of policy concerning the use of the data, only to specify who would have access to it.

The goal of our work is to investigate a new kind of privacy protection policy that constrains the type of processing applied to the data, rather than access to the data. That is, instead of defining policies that specify who can access a data set and how much of it can be accessed, our goal is to define policies that specify what can be done with the data. This allows a more flexible approach to privacy that covers data processing in addition to the existing data access techniques.

Motivation

The key idea in our approach to privacy protection is the use of workflows to describe the type of processing done to a dataset, and the expression of policies that can be used to control the creation and execution of workflows. Workflows have recently emerged as a useful paradigm to represent and manage complex computations in many scientific applications (Deelman and Gil 2006). We propose to extend workflow systems to be privacy-aware, so that they can be given privacy policies defined in terms of the types of analysis and data handling performed by the workflow system. Figure 1 illustrates a traditional access control approach (a) and contrasts it with a privacy-aware workflow system (b). A data access control policy either enables or disables a user's access solely based on their credentials and certificates. In this example, the user is only able to access D1 but not D2, D3, or D4. In contrast, a privacy-aware workflow system would enable the expression of additional kinds of privacy policies that grant access based on the type of analysis done as
expressed in the workflow. The workflow system could
assist the user in modifying the analysis in order to satisfy
the privacy policies stated. In this framework we can
selectively specify privacy policies for the same data set
that would allow it to be accessible for certain types of
workflows (analyses) and not for others.
This paper describes our work to date in defining a new
class of privacy policies that apply at the workflow (data
processing) level. To define these workflow-level policies,
one needs to address the following major issues:
• how to properly represent policies in terms of data
analysis processes, data privacy concepts and related
workflow constructs (ontological issue),
• how to automatically enforce those policies in data
analysis process management within the workflow
system (policy enforcement issue), and
• how to provide provenance regarding the privacy of
the data as well as their data analysis history so that
the system can justify its use of the data (provenance
issue).
Figure 1. Data analysis using (a) traditional access
control, (b) privacy-aware workflow systems.
In this paper, we describe how a semantic approach can be adopted to address these issues in the context of workflows. In particular, we show how semantic web technology can be used to describe data analysis workflows as well as their data privacy requirements. We also explain how the privacy related ontology can be used together with a policy framework to represent privacy policies for controlling data analysis process creation and execution. Examples of possible privacy policies enabled by the introduced privacy awareness are provided. In addition, a case study is presented to illustrate how the proposed approach can be used to govern a particular distributed data mining process.
Managing privacy in data analysis processes in e-Science has two different aspects of concern, namely data privacy and process privacy. The former concerns the privacy of data sources as well as of the data products created during data analysis processes. The latter concerns the privacy of the knowledge captured in the data analysis processes themselves. In this paper, we mainly focus on the protection of data privacy in the context of workflows.
Our Goal: Privacy-Aware Data Analysis
Instead of specifying policies to control data access, we set policies on the types of data analysis that can be applied to the data. The following are some high-level expressions of the policies we are targeting, in which notions such as the purpose of the analysis, the type of analysis, characteristics related to data privacy, and analysis accuracy are involved in governing data analysis processes.
Before we proceed, it is worth mentioning that our concern is not limited to the identity threat. In fact, one can generalize the target to be hidden from individuals' identities to important data attributes or experiment runs, depending on the particular application and situation at hand.
Example 1: Patient medical images should not be released for analysis except for the purpose of supporting a particular medical image analysis project, and the images have to be encrypted if they are transmitted over untrusted networks.

Example 2: Given the purpose of medical diagnosis, any classification performed on clinical data must provide the confidence level for each data item and have its overall accuracy reach a particular quality standard.
Example 3: Data containing drug dosage information should not be released for any analysis except for the purpose of public health care study, and the data should not contain any personal identification attribute and have to be properly anonymized before they can be used.

For the three examples provided, the terms in italics reveal the need for a vocabulary to describe workflow-relevant concepts about data privacy and data analysis, while the remaining non-italic portions correspond to the constructs for describing policies in existing data access control frameworks. By a similar analogy, our proposed approach (shown in the later sections) also involves two parts, namely an ontological description of privacy and data analysis, and the adoption of a policy framework with the ontological description integrated.

Related Research on Policy Governed Data Analysis

As an alternative approach to privacy protection in data analysis, policies on data usage can be adopted to govern data analysis processes (Weitzner et al. 2006). For example, a research lab may want some of its on-line data and analysis tools to be used only for the purpose of demonstrating the system's analysis capability, and thus posts a corresponding data usage policy. If the data set is later found to have been used (say, together with other data sources) for re-identification of the subjects who provided the data, whoever did so is accountable for the consequences. In addition, it would be even more appealing if such policy-violating data analysis processes could be caught early on and stopped before they are actually executed.
While there has been work in the literature on representing and reasoning about privacy policies (Bradshaw et al. 2003, Kagal, Finin & Joshi 2003), only conventional security concepts such as authentication, authorization and encryption have been considered. We envision that privacy preserving data analysis techniques will soon become more mature and widely accepted. The family of privacy policies will then need to be further enriched, with the additional privacy related semantics properly represented and reasoned about.
Related Research on Privacy Preserving Data Analysis

While restricting access to the data can be too limiting to support various kinds of data analysis, one can instead restrict the information in the data so that it is (a) free of identifiers that would permit linkage to any target individual, and (b) free of content that would create unacceptably high risks of individual identification. For example, one may allow a data set to be released and analyzed as long as the fields related to personal information are anonymized.
In the literature, techniques for releasing data without disclosing sensitive information have been proposed for various applications. For example, cryptography-based techniques have been found useful for private data communication over untrusted networks (Stallings 2005). Techniques like anonymization (Samarati 2001) and microaggregation (Domingo-Ferrer & Mateo-Sanz 2002) have been found useful in applications like statistical disclosure control. Also, there has been recent research interest in developing data mining algorithms that are privacy preserving, with underlying techniques including secure multiparty computation (Lin, Clifton & Zhu 2005), random data perturbation (Kargupta et al. 2005) and data generalization (Cheung et al. 2006).
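To make the secure multiparty computation primitive concrete, the following is a minimal sketch of a secure-sum protocol based on additive secret sharing, a common building block behind secure-add style operations. It is an illustrative toy rather than the protocol of any system cited above; the modulus, party count, and values are all made up for the example.

```python
import random

# Minimal sketch of secure sum via additive secret sharing. Each
# party splits its private value into random shares that sum to the
# value modulo a large prime; no single share reveals the value.
PRIME = 2**61 - 1  # modulus chosen only for illustration

def make_shares(value: int, n_parties: int) -> list[int]:
    """Split `value` into n additive shares modulo PRIME."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % PRIME
    return shares + [last]

def secure_sum(private_values: list[int]) -> int:
    """Party i sends its j-th share to party j; each party publishes
    only the sum of the shares it received."""
    n = len(private_values)
    all_shares = [make_shares(v, n) for v in private_values]
    partial_sums = [sum(all_shares[i][j] for i in range(n)) % PRIME
                    for j in range(n)]
    return sum(partial_sums) % PRIME

# Three clinics jointly compute a total patient count without any
# clinic disclosing its own count (the counts are hypothetical).
print(secure_sum([120, 87, 203]))  # -> 410
```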
Approach

Our approach to developing a privacy-aware data analysis framework is to extend existing workflow systems to incorporate privacy policies that control the type of data analysis done on the data. Modeling data analysis processes as workflows, also called scientific workflows, is common in e-Science (Deelman & Gil 2006, Ludascher et al. 2006, Oinn et al. 2006, Wassermann et al. 2006). However, workflow systems possessing data privacy awareness are still lacking. Conventional workflow systems were designed with the primary objectives of providing component abstraction, interconnectivity and reliable execution. In e-Science, data oriented and user (scientist) oriented perspectives have also been stressed. Examples include the use of visual programming environments for constructing data analysis processes, e.g., Taverna (Oinn et al. 2006), Kepler (Ludascher et al. 2006) and Sedna (Wassermann et al. 2006), and the adoption of a template oriented approach for workflow creation, e.g., Wings (Gil et al. 2007).
Regarding data privacy control, most workflow systems are designed to support only the conventional data access policies. If other types of privacy policies are to be respected, they can only be managed manually. We propose a semantic approach for data privacy management in workflow systems. In particular, we first derive ontologies that describe fundamental concepts of data privacy and data analysis, and integrate them into the ontological description of workflows to characterize their privacy related properties. Then, we argue that a new type of privacy policy can be specified using the derived ontologies together with a policy description framework, so that policy compliance tests for data analysis workflows can be performed automatically via metadata reasoning. In the following, instead of providing a comprehensive view of every aspect of data privacy protection (e.g., user authentication, data/workflow access control, etc.), we focus on the aspects most related to privacy and data analysis.

Building Blocks

Figure 2 depicts an ontology that contains most of the important concepts needed for describing privacy aware data analysis workflows. The classes shown in normal face are building blocks for describing essential constructs in workflows. They include WorkflowTemplate, Link, Node, Data and ComponentType, which are adopted from Gil et al. (2007) and thus are not detailed further here. The classes in bold face, as well as their related properties, are the building blocks for modeling privacy awareness and the extensions of the workflow related ontologies constructed to embrace privacy awareness.

Figure 2. An ontology for describing privacy aware data analysis workflows.

Privacy preservation ontology. This ontology enables us to describe workflow components that perform privacy preservation techniques. PrivacyPreservation is the root class, and possible subclasses and their instances include:
• Encryption, {e.g., RSA, DES, RC4}
• Anonymization, {e.g., MD5, sequential numbering}
• Generalization, {e.g., Gaussian mixture model, k-anonymity}
• Perturbation, {e.g., additive, multiplicative}
• SecureComputation, {e.g., secure-add, secure-multiply}

Data analysis ontology. This ontology gives the taxonomy of workflow components that process the data. We consider here statistical data analysis algorithms (Hastie, Tibshirani & Friedman 2003) that are widely used in many domains, but domain-specific analysis types would be part of the ontology as well (Cannataro et al. 2004). DataAnalysis is the root class, and possible subclasses and their instances include:
• Clustering, {e.g., k-means, Gaussian mixture model}
• Classification, {e.g., C4.5, support vector machine}
• Manifold Learning, {e.g., GTM, ISOMAP, LLE}
• Association Rule, {e.g., Apriori}

Note that the ontology described here is a very simple one. Data analysis is a mature area, and a more comprehensive data analysis ontology that can also describe the available tools (e.g., Bernstein, Provost & Hill 2005) would be needed.
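As a concrete illustration, the two taxonomies above can be written down as RDF triples. The following is a minimal sketch using the rdflib Python library; the namespace URI is a placeholder of our own, not an identifier from the paper or from Wings.

```python
from rdflib import Graph, Namespace, RDF, RDFS, Literal

# Sketch of the PrivacyPreservation and DataAnalysis taxonomies as
# RDF triples. Class and instance names follow Figure 2; the
# namespace URI is an assumed placeholder.
EX = Namespace("http://example.org/privacy-workflow#")
g = Graph()
g.bind("ex", EX)

taxonomy = {
    EX.PrivacyPreservation: {
        EX.Encryption: ["RSA", "DES", "RC4"],
        EX.Anonymization: ["MD5", "sequential numbering"],
        EX.Generalization: ["Gaussian mixture model", "k-anonymity"],
        EX.Perturbation: ["additive", "multiplicative"],
        EX.SecureComputation: ["secure-add", "secure-multiply"],
    },
    EX.DataAnalysis: {
        EX.Clustering: ["k-means", "Gaussian mixture model"],
        EX.Classification: ["C4.5", "support vector machine"],
        EX.ManifoldLearning: ["GTM", "ISOMAP", "LLE"],
        EX.AssociationRule: ["Apriori"],
    },
}

for root, subclasses in taxonomy.items():
    for subclass, instances in subclasses.items():
        g.add((subclass, RDFS.subClassOf, root))
        for name in instances:
            inst = EX[name.replace(" ", "_").replace(".", "_")]
            g.add((inst, RDF.type, subclass))
            g.add((inst, RDFS.label, Literal(name)))

print(g.serialize(format="turtle"))
```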
Extensions of workflow related ontologies. Given the privacy preservation ontology and the data analysis ontology, the workflow template ontology is extended with privacy awareness as follows.

Two properties are added as metadata of the workflow:
• hasPurpose, whose range captures the purpose of the workflow, e.g., medical diagnosis or public health study; the purposes can be further described by a domain ontology specific to the application.
• hasOutputQuality, whose range captures the semantics and metric of the workflow's output quality, e.g., F-score or classification accuracy, which again can be further described using an additional analysis quality ontology.
Two properties are added as metadata of the Data class (whose instances can be data sources or intermediate data products):
• hasDataType, whose range can take values such as numerical, nominal, relational, semi-structured, etc.
• hasDataAttributeSet, which describes the schema of the data attributes; these should further be described by the domain specific ontology needed by hasPurpose.
Two properties and two subclasses are added to ComponentType:
• hasParameterSet, whose range captures the set of parameters needed for configuring a particular component, which is especially common for data analysis components.
• hasProtocol, which is added to ComponentCollection and has its range referring to the distributed computing protocol needed among the components in the collection, e.g., an iterative protocol for split-and-merge, or a secure multiparty computation protocol.
• PPComponentType and DAComponentType, which are subclasses of ComponentType specialized for privacy preservation and data analysis respectively. The latter, in addition, carries three new properties: supportPPType and supportDataType, which characterize what kinds of data its instances can process, and hasOutputQuality, which is semantically the same as that of the WorkflowTemplate class.

A new class called PPData is introduced as a subclass of Data to represent privacy preserved data, and it has two properties:
• ProtectedBy, whose range captures the privacy preservation technique being adopted, e.g., encryption or anonymization.
• LevelofProtection, whose range captures the level of protection possessed by the data. While there is so far no universal way to specify the level of protection, each privacy preservation technique has its own way of specifying it. For example, for encryption, the key length is a good indicator of the protection level; for anonymization, the size of the smallest indistinguishable set is a good indicator. Related knowledge can be further described in the privacy preservation ontology.

Some of the concepts related to data and data analysis components are closely related to the specification of the Predictive Model Markup Language (PMML: http://sourceforge.net/projects/pmml). PMML is an XML-based language for describing data mining models to facilitate their exchange across different platforms. Given this goal, it includes elements for describing the details of data structures (like matrices) and mining model schemas. In contrast to the goal of PMML, ours is situated at a higher level, for data analysis validation, and leaves the detailed interoperability issues to the lower levels of the workflow system stack.
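To keep the later compliance examples concrete, the sketch below mirrors a few of these metadata properties as plain Python dataclasses. The class and field names follow the ontology terms introduced above, but the Python rendering itself is our illustrative stand-in, not the OWL definitions.

```python
from dataclasses import dataclass, field

# Plain-Python mirror of the metadata properties introduced above.
# Field names follow the ontology properties; everything here is an
# illustrative stand-in for the actual ontology classes.

@dataclass
class Data:
    name: str
    has_data_type: str                 # hasDataType: numerical, nominal, ...
    has_data_attribute_set: list[str]  # hasDataAttributeSet: attribute schema

@dataclass
class PPData(Data):
    protected_by: str = "anonymization"  # ProtectedBy: technique adopted
    level_of_protection: float = 0.0     # LevelofProtection: e.g., k of k-anonymity

@dataclass
class DAComponent:
    name: str
    support_pp_type: list[str] = field(default_factory=list)    # supportPPType
    support_data_type: list[str] = field(default_factory=list)  # supportDataType
    has_output_quality: str = ""                                # hasOutputQuality

# A 5-anonymized clinical table and a clustering component that can
# consume generalization-protected inputs (values are hypothetical).
table = PPData("clinic_A", "numerical", ["age", "dosage"],
               protected_by="generalization", level_of_protection=5)
kmeans = DAComponent("k-means", support_pp_type=["generalization"])
print(table.protected_by in kmeans.support_pp_type)  # -> True
```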
Privacy Policies for Data Analysis Workflows: Evaluation of Some Existing Policy Frameworks

Given the ontologies derived in the previous section, we have a better idea of the privacy awareness that can be incorporated into data analysis, in addition to conventional user authentication and authorization. This creates opportunities for new types of privacy policies for data analysis to be derived, and at the same time raises challenges regarding how the policies are to be represented and reasoned about, together with the workflow's metadata, for policy compliance checking.
KAoS (Bradshaw et al. 2003) and Rei (Kagal, Finin & Joshi 2003) are two representative projects that make use of Semantic Web technology to specify privacy related policies. The former was proposed in the context of multiagent systems and the latter in the context of pervasive computing. Their latest versions follow an RDF-Schema based syntax and support four types of policies: positive authorization, negative authorization, positive obligation, and negative obligation (Tonti et al. 2003). Authorization refers to the notion that some action is or is not permitted. Obligation refers to the notion that some action must or must not be done given a certain condition.
Both projects provide ontologies for specifying policies, with concepts like resource, actor, action, context, policy and control included, and both are expected to be extended in application specific ways. Take KAoS as an example. If one wants to describe privacy policies that also support the notion of privacy awareness presented in the previous section, one can map the ComponentType class to the Action class in KAoS, so that PPComponentType and DAComponentType correspond to privacy preservation actions and data analysis actions respectively. The properties attached to them as metadata become the context of those actions in the terminology of KAoS.
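The sketch below illustrates, under the mapping just described, how the four policy modes could be evaluated against an action and its context. It mimics the KAoS/Rei policy types in plain Python and is not either system's actual API; the purposes and conditions are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

# Components become actions, their metadata becomes the action
# context, and a policy is one of the four modes (positive/negative
# authorization or obligation) guarded by a condition over that
# context. This is an illustrative sketch, not the KAoS or Rei API.

@dataclass
class Action:
    kind: str      # e.g., "DataAnalysis", "PrivacyPreservation"
    context: dict  # component metadata, per KAoS terminology

@dataclass
class Policy:
    mode: str                          # "A+", "A-", "O+", or "O-"
    applies: Callable[[Action], bool]  # condition over the context

def authorized(action: Action, policies: list[Policy]) -> bool:
    """Negative authorization overrides; otherwise require some A+."""
    hits = [p for p in policies if p.applies(action)]
    if any(p.mode == "A-" for p in hits):
        return False
    return any(p.mode == "A+" for p in hits)

# Permit classification on clinical data only for medical diagnosis,
# a hypothetical instantiation of Example 2's setting.
policies = [
    Policy("A+", lambda a: a.kind == "DataAnalysis"
           and a.context.get("purpose") == "medical diagnosis"),
    Policy("A-", lambda a: a.context.get("purpose") == "insurance rating"),
]
run = Action("DataAnalysis", {"purpose": "medical diagnosis"})
print(authorized(run, policies))  # -> True
```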
In contrast with our need for policies to control data privacy, KAoS primarily focuses on real-time communication and interaction among software agents, and the related dynamics are therefore carefully modeled. For data analysis workflows in e-Science, those constructs are of less relevance, since a relatively reliable distributed computing environment is mostly assumed. Instead, the concern, which is also the theme of this paper, is more about the data, the overall accuracy of the analysis, and the privacy issues within an execution. Also, as numerical measures are often involved in the context of privacy awareness, how to formally incorporate them into policy conditions and provide an engine that can reason upon them is clearly an open issue. Last but not least, the enriched set of privacy preservation and data analysis
techniques opens up opportunities for new privacy policies to be specified (as detailed in the following subsection) and at the same time imposes new challenges on how those policies can be automatically enforced. More effort in formalizing and addressing the underlying challenges in a disciplined manner is essential.
Examples of privacy policies needed in data analysis workflows

To better illustrate the new set of privacy policies needed for data analysis workflows, we present the following examples and relate them to some possible real scenarios. Some of them are similar to those shown in the Motivation section, but with further contextual details added using the vocabulary provided by the ontologies described in the previous section.

• Policies governing privacy in data transmission
Policy 1: Patient medical images (data) should not be the input of any data analysis component or the output of the overall workflow except for the purpose of medical diagnosis, and the images have to be encrypted using RSA with a key length of at least 128 bits before being transmitted.
As the analysis involves medical image comparison, privacy preservation techniques like generalization and perturbation are not suitable, as they would degrade the image quality. The encryption approach fits the situation well, as stated in Policy 1. An example of a research initiative where distributed images are shared is the Biomedical Informatics Research Network (http://www.nbirn.net/).
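A minimal sketch of how Policy 1 might be enforced on a single workflow link follows. The link is represented as a dict of metadata; key names such as protected_by and key_length_bits are hypothetical stand-ins for the ProtectedBy and LevelofProtection properties, and the values are made up.

```python
# Sketch of enforcing Policy 1 on one workflow link. The metadata
# keys are hypothetical stand-ins for the ontology properties.

def policy1_allows(link: dict) -> bool:
    """Medical images may flow only for medical diagnosis, and only
    RSA-encrypted with a key of at least 128 bits (per Policy 1)."""
    if link["data_class"] != "patient_medical_image":
        return True  # the policy does not apply to other data
    return (link["purpose"] == "medical diagnosis"
            and link["protected_by"] == "RSA"
            and link["key_length_bits"] >= 128)

link = {"data_class": "patient_medical_image",
        "purpose": "medical diagnosis",
        "protected_by": "RSA",
        "key_length_bits": 256}
print(policy1_allows(link))  # -> True
```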
• Policies governing privacy in data source selection

Policy 2: Data containing drug dosage information should not be the input of any data analysis component or the output of the overall workflow except for the purpose of public health care study, and the data should not contain person identification attributes and should be generalized to at least 5-anonymity.
Health care data are known to be sensitive, especially when they touch on critical issues like mental health. As health care data analysis mostly focuses on trends and patterns rather than on particular cases, the generalization approach fits well. Policy 2 and similar policies are relevant to analyses conducted at health related organizations such as the National Survey of Family Growth (http://www.cdc.gov/nchs/nsfg.htm) and the National Institutes of Health (http://www.nih.gov/).
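Policy 2's generalization condition can be tested mechanically. Below is a minimal sketch of a k-anonymity check over quasi-identifier columns; the records and column names are hypothetical.

```python
from collections import Counter

# Sketch of verifying Policy 2's requirement: every combination of
# quasi-identifier values must occur at least k times (k-anonymity).

def satisfies_k_anonymity(records: list[dict], quasi_ids: list[str],
                          k: int) -> bool:
    groups = Counter(tuple(r[a] for a in quasi_ids) for r in records)
    return all(count >= k for count in groups.values())

# Five generalized records sharing one quasi-identifier group.
records = [{"age_band": "40-49", "zip3": "940", "dosage_mg": d}
           for d in (10, 20, 10, 30, 20)]
print(satisfies_k_anonymity(records, ["age_band", "zip3"], k=5))  # -> True
```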
Policy 3: A data set satisfying k-anonymity as required for its privacy protection should not be the input of any data analysis component if the component takes more than one input data source and the sources have overlapping attributes.

This policy exemplifies the need to handle the potential for information leakage when multiple privacy preserved data sources are combined. While a rigorous explanation of why Policy 3 is essential is somewhat beyond the scope of this paper, the intuitive idea is as follows. Data items in different input sources (e.g., clinical records and on-line phone books) may in fact refer to the same entities. If the association between the multiple data sources can be derived via the overlapping attributes, one can learn more about those entities by studying their group attributes across the sources. In general, more research attention is needed to handle situations where multiple sources of information are aggregated, even if each of them is privacy preserved on its own. The case study presented in the next section further elaborates this point.

• Policies governing privacy in intermediate data products

Policy 4: Given that the purpose of the analysis may lead to critical actions being taken (e.g., patient isolation), the confidence level of the final output should be available and be higher than an associated threshold, or the workflow execution should be considered unsatisfactory and halted.

This policy governs the analysis output, whose quality may vary depending on the data analysis algorithm itself (e.g., whether a training phase is needed during the analysis) and on the imprecision of the input data, possibly caused by privacy preservation. As the quality measure is not known during the workflow creation phase, this type of policy can only be enforced during workflow execution.
Case Study – Distributed Data Clustering
Suppose that there is a clinical trial study related to mental health. Data are collected from patients at different clinics and contain a range of medical measurements, drug dosage, as well as patient related demographic information. The sets of attributes are not identical across the data sets collected from different clinics, but all of them are supposed to have the personal identification fields removed before release (Policy CS1). Also, the data are restricted from further analysis unless the patients' data are generalized into groups, each satisfying k-anonymity and being represented by only its first and second order statistics (Policy CS2).
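Under Policy CS2, each k-anonymous group leaves a clinic only as its first and second order statistics. A minimal sketch of producing such a release with NumPy follows; the measurements below are hypothetical.

```python
import numpy as np

# Sketch of the release format implied by Policy CS2: a k-anonymous
# patient group is published only as its mean vector (first order)
# and covariance matrix (second order), never as raw records.

def group_summary(measurements: np.ndarray, k: int) -> dict:
    if len(measurements) < k:
        raise ValueError(f"group smaller than k={k}; cannot release")
    return {"size": len(measurements),
            "mean": measurements.mean(axis=0),          # 1st order
            "cov": np.cov(measurements, rowvar=False)}  # 2nd order

# One group of five patients: blood pressure and dosage (made up).
group = np.array([[120.0, 10], [135.0, 20], [128.0, 10],
                  [122.0, 30], [131.0, 20]])
print(group_summary(group, k=5)["mean"])  # mean of each column
```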
Clustering is a common technique for discovering patterns in datasets, so one of the analysis tasks in this study is to apply clustering to the combined dataset to identify patient groups with similar medical measurements under a certain amount of drug dosage. The researcher uses
a privacy aware workflow system as we have described, and we assume that the data provided by the hospitals and clinics have their metadata accurately tagged. The researcher creates a workflow template that puts all the data together and feeds them directly into a clustering component. The workflow system applies the policies to the data analysis workflow and finds that one data source still contains the patient name, as revealed in the metadata (hasDataAttributeSet). This violates Policy CS1, and the system therefore prompts the user to transform the dataset to address the issue.

Figure 3. Creation of a distributed data clustering workflow.

The workflow system finds that Policy CS1 is respected this time, but a violation of Policy CS2 is detected instead, as the data fed into the clustering component are not privacy protected as required. The researcher learns about Policy CS2 from the system and thus first feeds the data into a generalization-based privacy preserving component before the clustering component.

With both policies satisfied, the researcher overlooks the fact that the clustering component originally selected does not support privacy preserved data. The workflow system detects the mismatch with the help of the metadata captured in instances of PPData (via ProtectedBy) and DAComponentType (via supportPPType), and prompts the researcher about the problem. As suggested by the system, the researcher switches the clustering component to one that supports clustering of PPData abstracted as Gaussian mixture models (GMMs) (Cheung et al. 2006). The system eventually finds the data analysis workflow valid (as shown in Figure 3) and executes it.

Later on, it is brought to the attention of the researcher that one patient's identity has been revealed. Using the workflow system's provenance function, enabled by the metadata, the researcher manages to trace back to the clustering execution just described. After careful investigation, it is found that in some unexpected cases the same patient went to several clinics, and the complaining patient is one of those cases. As the same patient falls into different groups at different clinics, the intersections of those groups can be uniquely characterized, and the k-anonymity property no longer holds. From this, the researcher learns that Policy CS2 is insufficient and needs to be revised: it is the combined dataset that should satisfy k-anonymity, not only the per-clinic datasets before combination. The researcher simply modifies the policy so that the workflow system can prevent similar incidents in the future.
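The failure mode the researcher uncovered can be reproduced in a few lines: groups that are k-anonymous at each clinic can intersect in a single patient once combined. The patient identifiers below are hypothetical.

```python
# Each clinic's group is 5-anonymous locally, yet a patient who
# visited both clinics is pinned down by intersecting the groups
# they belong to. Identifiers are made up for illustration.

clinic_a_group = {"p1", "p2", "p3", "p4", "p5"}  # 5-anonymous at clinic A
clinic_b_group = {"p1", "p6", "p7", "p8", "p9"}  # 5-anonymous at clinic B

overlap = clinic_a_group & clinic_b_group
print(overlap)       # -> {'p1'}: the intersection is unique,
print(len(overlap))  # so 5-anonymity no longer holds for p1
```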
The data analysis task described in this case study is a relatively simple one, and the goal is to show how a workflow system with embedded privacy awareness would operate. As the case study shows, the policies involved apply directly to particular links in the workflow, so the corresponding implementation should not pose a big challenge. Whether privacy policies need to be described in a more holistic sense, and how such global constraints can be decomposed and propagated to the corresponding parts of the workflow for privacy control, are interesting issues worth future investigation.
Conclusion and Future Work

In this paper, we motivated the need for managing privacy in data analysis workflows, so that a new type of privacy policy that constrains the processing done on data can be supported. We also described our initial work on a semantic approach to represent privacy policies relevant to data analysis. We argued for the validity of the approach by showing how analysis-relevant terms can be defined in ontologies, and how they can be combined within a policy framework to represent the policies. Finally, we discussed how those policies can be applied via various examples, potential areas of application, and a detailed case study. We believe that workflow systems incorporating the proposed privacy awareness could ease the task of scientists in setting appropriate privacy policies that suit different types of collaborative research projects, and at the same time help them safeguard the privacy of sensitive data throughout the data analysis lifecycle.
We are currently implementing the approach by extending the Wings framework. The ontologies and policies shown in the paper will be represented in OWL and SWRL, so that policy enforcement can be carried out by the workflow system with the help of OWL reasoners. How to design a privacy policy framework that best suits data analysis
is no doubt an open research issue. Also, as hinted in the Case Study section, a related open issue is to gain a further understanding of the full spectrum of policies that need to be representable in the policy framework.
Extending the focus from data privacy to process privacy is another important direction of investigation. Recently, the need to share experimental processes among scientists has been identified as important for further facilitating collaborative knowledge discovery in empirical science. Collaborative process sharing tools (e.g., myExperiment.org) have been built to ease the corresponding sharing management. However, workflow systems that incorporate the policy-based enforcement described in this paper for controlling workflow provenance sharing are still lacking, and would form an important complement to what we have discussed here.

Acknowledgements

This research was supported in part by the Air Force Office of Scientific Research (AFOSR) through grant FA9550-06-1-0031.

References

Beltrame, F., Canesi, B., Molinari, E., Porro, I., Schenone, A., and Torterolo, L. 2006. Book of Abstracts of the Enabling Grids for E-sciencE User Forum. (http://egee-intranet.web.cern.ch/egee-intranet/User-Forum/2006/)

Bernstein, A., Provost, F., and Hill, S. 2005. Toward Intelligent Assistance for a Data Mining Process: An Ontology-Based Approach for Cost-Sensitive Classification. IEEE Transactions on Knowledge and Data Engineering, 17(4), 503-518.

Bradshaw, J., Uszok, A., Jeffers, R., Suri, N., Hayes, P., Burstein, M., Acquisti, A., Benyo, B., Breedy, M., Carvalho, M., Diller, D., Johnson, M., Kulkarni, S., Lott, J., Sierhuis, M., and Van Hoof, R. 2003. Representation and Reasoning for DAML-based Policy and Domain Services in KAoS and Nomads. In Proceedings of the Second International Joint Conference on Autonomous Agents and Multiagent Systems, 835-842, New York, ACM Press.

Cannataro, M., Comito, C., Schiavo, F.L., and Veltri, P. 2004. Proteus, a Grid based Problem Solving Environment for Bioinformatics: Architecture and Experiments. IEEE Computational Intelligence Bulletin, 3(1), 7-18.

Cheung, W.K., Zhang, X., Wong, H., Liu, J., Luo, Z., and Tong, F. 2006. Service-Oriented Distributed Data Mining. IEEE Internet Computing, 10(4), 44-54.

Clifton, C., Kantarcioglu, M., Vaidya, J., Lin, X., and Zhu, M. 2003. Tools for Privacy Preserving Distributed Data Mining. ACM SIGKDD Explorations, 4(2), 19-26.

Deelman, E., and Gil, Y. 2006. Final Report of the NSF Workshop on the Challenges of Scientific Workflows. (http://vtcpc.isi.edu/wiki/images/3/3a/NSFWorkflowFinal.pdf)

Domingo-Ferrer, J., and Mateo-Sanz, J.M. 2002. Practical Data-oriented Microaggregation for Statistical Disclosure Control. IEEE Transactions on Knowledge and Data Engineering, 14(1), 189-201.

Gil, Y., Ratnakar, V., Deelman, E., Mehta, G., and Kim, J. 2007. Wings for Pegasus: Creating Large-Scale Scientific Representations of Computational Workflows. In Proceedings of the 19th Annual Conference on Innovative Applications of Artificial Intelligence (IAAI).

Hastie, T., Tibshirani, R., and Friedman, J.H. 2003. The Elements of Statistical Learning. Springer-Verlag.

Kagal, L., Finin, T., and Joshi, A. 2003. A Policy Language for a Pervasive Computing Environment. In Proceedings of the Fourth IEEE International Workshop on Policies for Distributed Systems and Networks, 63-76.

Kargupta, H., Datta, S., Wang, Q., and Sivakumar, K. 2005. Random Data Perturbation Techniques and Privacy-Preserving Data Mining. Knowledge and Information Systems, 7(4), 387-414.

Lin, X., Clifton, C., and Zhu, M. 2005. Privacy-Preserving Clustering with Distributed EM Mixture Modeling. Knowledge and Information Systems, 8(1), 68-81.

Ludascher, B., Altintas, I., Berkley, C., Higgins, D., Jaeger, E., Jones, M., Lee, E.A., Tao, J., and Zhao, Y. 2006. Scientific Workflow Management and the Kepler System. Concurrency and Computation: Practice & Experience, 18(10), 1039-1065.

NIH. 2004. Data Sharing Workbook, National Institutes of Health. (http://grants2.nih.gov/grants/policy/data_sharing/data_sharing_workbook.pdf)

Oinn, T., Greenwood, M., Addis, M., Alpdemir, M.N., Ferris, J., Glover, K., Goble, C., Goderis, A., Hull, D., Marvin, D., Li, P., Lord, P., Pocock, M.R., Senger, M., Stevens, R., Wipat, A., and Wroe, C. 2006. Taverna: Lessons in Creating a Workflow Environment for the Life Sciences. Concurrency and Computation: Practice & Experience, 18(10), 1067-1100.

Samarati, P. 2001. Protecting Respondents' Identities in Microdata Release. IEEE Transactions on Knowledge and Data Engineering, 13(6), 1010-1027.

Singh, M.P., and Vouk, M.A. 1996. Scientific Workflows: Scientific Computing Meets Transactional Workflows. In Proceedings of the NSF Workshop on Workflow and Process Automation in Information Systems: State-of-the-Art and Future Directions, SUPL 28-34.

Stallings, W. 2005. Cryptography and Network Security: Principles and Practices (4th ed.). Prentice Hall.

Tonti, G., Bradshaw, J., Jeffers, R., Montanari, R., Suri, N., and Uszok, A. 2003. Semantic Web Languages for Policy Representation and Reasoning: A Comparison of KAoS, Rei, and Ponder. In Proceedings of the 2nd International Semantic Web Conference, Lecture Notes in Computer Science 2870, 419-437, Springer-Verlag.

Wassermann, B., Emmerich, W., Butchart, B., Cameron, N., Chen, L., and Patel, J. 2006. Sedna: A BPEL-based Environment for Visual Scientific Workflow Modeling. In Workflows for e-Science: Scientific Workflows for Grids, Taylor, I.J., Deelman, E., Gannon, D., and Shields, M.S. (eds.), Springer-Verlag.

Weitzner, D.J., Abelson, H., Berners-Lee, T., Hanson, C., Hendler, J., Kagal, L., McGuinness, D.L., Sussman, G.J., and Waterman, K.K. 2006. Transparent Accountable Data Mining: New Strategies for Privacy Protection. Technical Report MIT-CSAIL-TR-2006-007, MIT.