>> Bryan Parno: Anupam was once at Stanford but has been at CMU now for six years.
>> Anupam Datta: Five.
>> Bryan Parno: Five years. And did a lot of interesting work both
on verifying the correctness of various cryptographic protocols and
more recently looking at various aspects of privacy, from differential
privacy to private audit logs, and trying to apply a lot of rigor to a
difficult area. And in addition to understanding things like the HIPAA
laws, which have their own fun and challenges.
So today he's going to talk a little bit about the audit piece.
>> Anupam Datta: All right. Thank you very much, Bryan. And thank you
all for coming. I want to summarize some work that we have been doing
over the last five years or so at CMU on trying to understand privacy
at a more semantic level so that it's more clear what privacy policies
actually mean operationally. And it has led us to mechanisms based on
audit and accountability for enforcement.
And just to place things in perspective, we now live in a world where
personal information is everywhere. Information about individuals is
collected through search engines and social networks and online retail
stores and cell phones and so forth, and all this information is being
collected by these, in many cases by companies which now hold vast
repositories of personal information.
So this raises a basic privacy question: how can we ensure that these
organizations that collect these vast repositories of personal
information actually respect privacy expectations in the collection,
disclosure and use of personal information?
And one approach that has evolved in practice is that we are beginning
to see emergence of more and more privacy laws in various sectors. As
well as privacy promises that companies make like Microsoft and Google
and other companies that hold personal information about conditions
under which information will be collected and used and shared.
Just to make things a little more concrete, here are some snippets from
the HIPAA privacy rule for healthcare organizations. It gives you a
sense of the kind of policies that exist in this space. So here's one
example that says a covered entity, that's a hospital or a healthcare
organization, is permitted to use and disclose protected health
information without an individual's authorization for treatment,
payment and healthcare operations.
There's another that says that a covered entity must obtain an
authorization for any disclosure or use of psychotherapy notes. In
general, for personal health information, so long as the information is
being shared for these purposes there is no need to get authorization
from the patient, except for the special case of therapy notes. These
are just two snippets in a large and complicated law that includes over
80 other clauses. It's the kind of thing that's going to be extremely
tedious to enforce manually.
So we would like some algorithmic support in enforcement of these kinds
of policies. And there are many other laws. This is just one example.
There's the Gramm-Leach-Bliley Act that financial institutions have to
comply with. There's FERPA for educational institutions in the U.S.
The EU data privacy directive and so forth.
>>: I have a question. Concerning the first two points, you somehow
assume deny overrides permit.
>> Anupam Datta: So the way the law is written -- I have not shown you
the entire piece -- it uses mechanisms like exceptions. So the way it
says it is: this is the clause, except if this. And so the priority, if
you will, which clause preempts which other, is encoded into the law
using this exception mechanism. And that has a natural logic; it leads
to a natural logical presentation of the law also.
Now, so those are some examples of the kind of policies we see in
regulated sectors like healthcare, but on the Web also companies make
promises in their privacy policies that they're expected to respect.
Otherwise the FTC can and has imposed fines on them and such. So
here's some examples. So Yahoo!'s practice is not to use the content
of messages for marketing purposes. And then there's something from
Google's policy here. Now, it's one thing to have laws and promises.
But the question is really are these promises actually being kept?
Right? Are these laws actually being complied with? And there's a lot
of evidence that we have lots and lots of violations of the kinds of
clauses that appear in these laws. So there are examples in the
healthcare sector of breaches and violations of HIPAA. This is an
example of where Facebook was called out and had to pay a price for not
living up to their promises. Google ran into rough waters and so forth.
And this Wall Street Journal series of articles, what they call the
What They Know series, had lots of episodes where privacy promises are
not being respected or the practices are somewhat questionable.
So that's the broader context. But the technical question that I and
others are interested in is what can we as computer scientists do to
help address this problem? So let me focus on a very concrete scenario
and a snippet of a privacy policy from HIPAA to motivate some of the
technical questions that I'll get into later in the talk.
So we will consider a simple scenario from healthcare privacy where a
patient comes and shares his personal health information with the
hospital. The hospital might use this information both internally for
purposes like treatment and payment and operations, and share this
information with third parties like an insurance company for the
purpose of, say, in this case payment.
But when this information flows in a manner that's not permitted, for
example, if this information goes out to a drug company, which is then
used for advertising, then that would be a violation.
And then in addition to these cross-organizational flows, inside the
hospital also there's a complex process, and there could be violations
internally as well, where information is used in inappropriate ways.
So I guess the one thing I want to point out here is that many of these
violations are going to happen because of authorized insiders. People
who have been given access to this information for certain purposes
like treatment and payment and operations, but then they either use
that information in ways that are not permitted or share it under
conditions that are not permitted. So we do not expect that the
problem will be solved using well-known techniques from access control,
because the people who have a legitimate reason for access are exactly
the ones who are committing the violations.
So access control is, of course, necessary in this setting to ensure
that people who are not employees of the hospital do not get access to
the information, but that is not going to be sufficient for enforcing
the kind of policies that appear in this application domain.
So let's look now at a very concrete example from the HIPAA privacy
rule. And it's going to be a bit of a mouthful. But the reason I
picked this particular clause from HIPAA is that all the concepts that
are relevant in a whole bunch of privacy policies including HIPAA and a
lot of other policies show up in this one example.
So the policy says that a covered entity, again think of that as a
hospital, may disclose an individual's protected health information to
law enforcement officials for the purpose of identifying an individual
if the individual made a statement admitting participation in a violent
crime that the covered entity believes may have caused serious harm to
the victim. It's a long and convoluted sentence, but let's look at the
different pieces and the corresponding formal concepts that they
represent.
So first of all, when we talk about privacy policies, there's often
going to be an action involving personal information. So in this
setting, in this example the action is a transmission or send action
that goes from a sender P-1 to recipient P-2 and it contains the
message M. In this example, the sender P-1 is the hospital, the
covered entity. P-2 is the law enforcement official. And M is the
message that contains the information that is being shared.
But in addition to transmission actions, there could be accesses or use
actions. Other kinds of actions that involve this personal
information.
The second piece -- and this is common in HIPAA and many other privacy
policies -- is that in order to get some abstraction, roles are often
used to group large classes of individuals. So in this case the recipient of
the message is in the role of law enforcement. And the same policy
will apply to all law enforcement officials, not just to P-2.
There are data attributes of the message that is being transmitted.
And this clause applies to messages that are of the type protected
health information. And you could imagine a characterization of an
attribute hierarchy that says that various types of information are
protected health information. A prescription might be protected health
information, whereas maybe an e-mail that contains some scheduling
appointments is not.
Then there are these temporal concepts. In this case the policy
permits the flow of information if in the past a statement was made to
this effect. And temporal concepts capture a bunch of privacy idioms.
For example, often in privacy policies you see concepts like: if in the
past a customer or patient gave consent, then it is okay to share
information.
There are also things like notification that require bounded future
temporal constraints. So notification policies are of the form that if
any organization loses their personal health information or personal
information about customers, then within a certain number of days, like
30 days, they have to inform the customers that they have lost their
information. So that's a bounded future requirement.
So temporal constraints about the past and bounded future timed
temporal constraints are going to show up a lot in these kinds of
privacy policies.
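As a rough illustration of these two idioms (the predicate names and notation here are invented for exposition, not the talk's actual syntax), they might be written in a metric past/future temporal logic as:

```latex
% Illustrative sketch only; predicate names are invented.
% Past-looking consent: share only if consent was given at some earlier time.
\forall p, m.\;\; \mathit{send}(\mathit{org}, p, m) \;\supset\; \Diamond^{-}\, \mathit{consent}(p, m)

% Bounded-future notification: if data about p is lost, notify p within 30 days.
\forall p.\;\; \mathit{breach}(\mathit{org}, p) \;\supset\; \Diamond_{\leq 30\,\mathrm{days}}\, \mathit{notify}(\mathit{org}, p)
```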
Now, in addition to these concepts, now we will begin to see some other
grayer concepts that are also somewhat vaguer. Right? So in
particular, there is this notion of a purpose, which shows up all over
the place in privacy policies. Right. So in this case the disclosure
is permitted for the purpose of identifying criminal activity. In many
cases you'll see companies promise that information will be used for
one purpose. Say to improve the search experience. Or to improve
treatment but not for other purposes. Like marketing and so forth.
Right? So that raises the basic question: what does it mean for an
action to be for a purpose, or for information to be used only for a
purpose or not for a purpose? What does that mean semantically, and how
would we enforce these kinds of promises or policies?
Another thing that shows up is beliefs. You see here in this example
there is this notion that it's okay to do the disclosure if the
hospital believes that the crime caused serious harm. So this requires
some understanding of beliefs. Now, I'm going to separate out these
two classes of concepts. What I'm going to call black-and-white
concepts are things for which semantics are not that hard to give,
based on previous work on giving semantics to policies and enforcement
techniques. The enforcement will be difficult for some other reasons
that I'll explain.
And the gray concepts are much more vague. Purposes in particular are
one thing we're going to focus on. Beliefs I'm not going to talk about
too much today, partly because I believe that a lot of the work that
has been done on authorization logics could directly apply to the
belief part of this.
So we are going to not deal with beliefs appealing to prior work. But
I do want to highlight these two broad classes of concepts that arise
in privacy policies and I'm going to discuss some methods for their
enforcement.
So very broadly, the 10,000-foot view here is that there's a research
area to be explored: formulating privacy policies, giving semantics to
various privacy concepts, and their enforcement. For enforcement I'll
talk about at least two classes of techniques. One is to detect
violations of policy; in some cases, as we'll see, it will be very hard
to prevent violations.
And also accountability, whereby, when a violation is detected in a
multi-agent system where many people could have done many different
things that ultimately led to the violation, how do we identify agents
to blame for policy violations as well as incentives to deter policy
violations.
So much of this is a work in progress. I'm going to focus primarily on
the audit piece today. So at a very high level, one way to think about
this is that it parallels how the justice system works, in practice,
where there is a law but the enforcement is not always based on
prevention.
It requires police officers to do detection of violations and give
parking tickets and other mechanisms for assigning blame and
appropriate punishments. And that's what we're going to try and mirror
in the digital world. And detection will be the first step of that
process.
Now, we're not the only ones to talk about the importance of audit and
accountability in this setting. There have been a couple of recent
position papers from MIT, from Hal Abelson and Danny Weitzner and
others, and also from Butler Lampson, and in the corporate world there
is an increasing push for accountability-based privacy governance, in
which the corporate privacy people at Microsoft have been heavily
involved.
The goal, one big goal of this work is suppose that there is a
regulation that organizations are expected to comply with. How do they
demonstrate to a third party that in fact they are complying with these
regulations?
And so far much of this has been very ad hoc, but part of what I'm
going to talk about today is a step towards producing algorithmic
support for these kinds of tasks.
So at a very high level, the approach for audit is going to parallel
the separation between the black and white concepts and the gray
concepts. So we start off with the privacy policy and put it into a
computer readable form. And we have actually done complete
formalizations of both the HIPAA privacy rule and the
Gramm-Leach-Bliley Act for financial institutions.
And one way to think about this audit box is that it takes as input an
organizational audit log, which records what pieces of data software
systems and people have touched and what pieces of information are
shared, and the policy, and then the audit box comes back with, well,
this policy was violated on this execution -- that kind of information,
whether a violation happened or not.
And paralleling the informal separation that I mentioned when I walked
through the policy example, we will look at -- we have one algorithm
that does fully automated audit for these black and white concepts.
And a different algorithm, you can think of that as an oracle that
provides guidance on these gray policy concepts. The second algorithm
is going to be a little tricky to get at; in particular, it focuses on
this purpose piece, and the reason that it's complicated is that we are
trying to figure out whether a person accessed information while
thinking about achieving a purpose like treatment or not. It's as if
we're trying to understand the human psyche. We are not there yet. We
don't have the oracle from the Matrix yet, but we'll try to approximate
it using some AI techniques.
So those are the two big pieces of the technical presentation. So let
me first talk about auditing black-and-white policy concepts. This is
joint work with two of my former post-docs: Deepak Garg, who is now
faculty at Max Planck, and Limin Jia, who is still at CMU.
In order to audit the black-and-white concepts, although I said they're
somewhat simpler than the gray concepts, there are two main technical
challenges. One challenge is that these audit logs are incomplete, in
the sense that they may not have sufficient information to decide
whether a policy is true or false.
So if you think about access control, let's say file system access
control, when I try to read a file there's a reference monitor that's
sitting there and it will either let me access the file or not
depending on what the policy says.
So in access control there's often enough information to decide whether
to allow access or not, so the reference monitor will come back with a
yes or no answer. We'll see that in the presence of incomplete audit
logs we will not always get a yes/no answer. The parallel of the
reference monitor, the audit algorithm, can either say yes, the policy
was satisfied, no, the policy was violated, or it can say I don't know.
And we want to deal with the I-don't-know scenario in a graceful
manner.
So there are a bunch of sources of incompleteness. One is future
incompleteness. So since we might have these notification-like laws
that talk about what needs to happen in the future, there may not be
enough information in the log at the current moment to say whether or
not it's violated.
But the hope is that as the log grows over time we'll get to a point
where we'll know for sure. There may not be information about some of
these gray concepts, these somewhat subjective concepts -- evidence may
not be recorded for purposes or beliefs and things of that nature.
Sometimes logs may be spatially distributed, and there may not be
enough information in any one log to decide whether the policy's
violated or not.
>>: [inaudible].
>> Anupam Datta: Can what?
>>: Can't keeping a log violate some of these policies?
>> Anupam Datta: Yes, it could. The class of policies I'm enforcing
here is primarily policies that talk about conditions under which
information can be shared or not, or used for a certain purpose or not.
We don't have mechanisms to deal with data retention policies and
things like that. They have to be dealt with using other mechanisms.
>>: Who has access -- the log is operating on human subjects data. And
did all of the people who contributed to this system with data consent
that their medical records would create records in a log which then you
would access and do studies on?
>> Anupam Datta: That's an interesting point. So under law, HIPAA
requires these healthcare organizations to maintain these audit logs.
So healthcare organizations have to maintain these audit logs. And
then there is often -- the way it works in practice -- I should say
there are audit tools that are now appearing in the market for
healthcare audits that are getting bought and used.
Often the way that they're getting used is that there are some
designated people, in the audit office if you will, who access these
logs, and these existing commercial tools do very simple things. You
can only issue SQL-like queries, so you can find all employees who
accessed records more than 100 times in the last two days, things of
that nature, right? Among these tools there is the FairWarning tool,
from a start-up that is doing reasonably well, and the P2 Sentinel
tool, which does a similar thing. What they do in addition, which is
partly what you're getting at, is they keep track of who accessed the
audit log. So there's another layer, but that's as far as it goes:
there's a trail of who is accessing the log information.
>>: Right. But you said they're required to keep the audit log. But
the moment a researcher goes in and does some research using the audit
log that's different than keeping the audit log.
>> Anupam Datta: The researcher going in -- HIPAA allows -- I'm not
particularly happy about this clause in HIPAA, but HIPAA allows
deidentified data. So we have not -- if you're asking me whether I have
looked at these logs, the answer is no. But if you're asking if it's
permitted under law, the answer is yes. HIPAA allows deidentification,
under a very operational notion of deidentified that may be very
unrelated to protecting privacy. It allows deidentified information to
be shared for the purpose of research. So under HIPAA it's permitted.
You don't need consent from patients to do that. All right. So this
is one big challenge, dealing with incompleteness. And the way we're
going to do that is a simple idea: we'll model the incomplete logs with
three-valued structures, meaning that given a predicate, the log might
tell us that the predicate is true or false or unknown -- it doesn't
have enough information.
>>: Basically it seems there is a presumption that things are
consistent. Otherwise there's a fourth category.
>> Anupam Datta: You mean the policy?
>>: The policy is inconsistent.
>> Anupam Datta: The policy we are assuming is consistent. If the
policy's inconsistent, then all bets are off, because false will imply
anything.
>>: Is HIPAA --
>> Anupam Datta: We haven't found -- that's a good question. We
haven't found any inconsistencies in HIPAA. And part of the reason for
that is HIPAA is largely operational. That's a good thing. That's
part of the reason we looked at that. But part of the reason for that
is inconsistencies might arise when one part of the policy says do
something and another part says don't do it.
But whenever that has arisen in HIPAA, it has always come through this
exception mechanism, so that it's clear what overrides what. Now, we
haven't done a mechanical, automated analysis to check for consistency.
But maybe that's something we can do, because now we have it in a
machine-readable formalization.
>>: [inaudible] you're pulling lots of different logs together, you're
assuming they're all consistent, might just be missing some pieces?
>> Anupam Datta: It might be missing pieces. That's incompleteness.
Incompleteness we can deal with. Inconsistency in the logs will also
be problematic for the same reason because we're assuming that if a
predicate is true, then it cannot be false.
But the logs are not necessarily pulled from different places. So the
application we're going to do is with the real logs -- which it has
taken more than a year to get close to -- from Northwestern Memorial
Hospital, part of the SHARPS project; Carl Gunter has done experiments
and published results on that. It's one place, it's like a Cerner log,
it's not a distributed log.
All right. So it's a very simple abstraction: given a predicate, the
log will tell us whether it's true, false or unknown. And then the
meaning of larger policy formulas can be defined on top of this. The
idea of this iterative reduce algorithm, which is the audit algorithm,
is that it takes a log and a policy, but unlike standard access control
or runtime monitoring, instead of coming back with a true/false answer,
it checks as much of the policy as it possibly can given the
information in the log and outputs a residual policy that contains only
the predicates for which the log says unknown -- I don't know whether
it's true or false. That will be a simpler policy.
When the log is extended with additional information, you can run the
algorithm again, and you proceed in this way iteratively. So
pictorially you have the log and the policy -- think of this as HIPAA.
You run reduce and get the residual policy phi-1; when the log grows
there's potentially more information about the predicates in phi-1 that
there wasn't information about before. You run this again and the
process continues. And at any intermediate point we can invoke the
oracles for gray concepts -- like we have an algorithm for determining
purpose restrictions, and you can call that algorithm -- because this
algorithm is not going to deal with those gray concepts. So that's the
picture.
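A minimal sketch of that iterative idea, with invented names and a toy two-connective formula representation (an illustration only, not the authors' implementation):

```python
# Toy three-valued reduction over a growing log. A log maps ground
# predicates to True/False; a missing entry means "unknown".
from dataclasses import dataclass
from typing import Union

@dataclass
class Pred:
    name: str
    args: tuple

@dataclass
class And:
    left: "Formula"
    right: "Formula"

Formula = Union[bool, Pred, And]

def reduce_formula(log: dict, phi: Formula) -> Formula:
    """Partially evaluate phi against the log; return residual formula."""
    if isinstance(phi, bool):
        return phi
    if isinstance(phi, Pred):
        val = log.get((phi.name, phi.args))   # None means "unknown"
        return phi if val is None else val
    if isinstance(phi, And):
        left = reduce_formula(log, phi.left)
        right = reduce_formula(log, phi.right)
        if left is False or right is False:
            return False
        if left is True:
            return right
        if right is True:
            return left
        return And(left, right)
    raise TypeError(phi)

# Iterate as the log grows: re-run reduce on the residual policy.
policy = And(Pred("send", ("hospital", "insurer", "m1")),
             Pred("purpose", ("m1", "payment")))
log_t1 = {("send", ("hospital", "insurer", "m1")): True}   # purpose still unknown
residual = reduce_formula(log_t1, policy)                   # residual: purpose(m1, payment)
log_t2 = {**log_t1, ("purpose", ("m1", "payment")): True}
print(reduce_formula(log_t2, residual))                      # True
```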
Then, in a little bit more detail, the policy logic looks a little bit
like this. I don't want you to try and read everything on this slide.
It's a fragment of first order logic with quantification over unbounded
domains. The interesting technical challenge here is we have to allow
for quantification over infinite domains, because HIPAA talks about all
messages: the messages sent out by the hospital have to respect some
policies.
And because of that, that's the technical challenge where we had to go
beyond what is already known in runtime monitoring. And the logic is
expressive since it has quantification over these infinite domains, can
quantify over time, can express timed temporal properties.
Now, if I take this policy and write it out in this logic it looks a
little bit like this. Again, I don't want you to necessarily read the
formula. The important thing here is that there's going to be a
distinction. Well, there's quantification over all messages, the set
of messages in English is infinite.
And all time points. And the other thing to take away is that there's
the black part of the policy, which the algorithm will deal with
automatically, and then there are the red parts, which are really the
gray concepts and which this algorithm will not deal with -- things
like purposes and beliefs and so forth.
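The slide's formula itself is not in the transcript, but as a hedged sketch, the general shape of such a formalization, with invented predicate names, might look roughly like this, with the gray parts being the purpose and belief predicates:

```latex
% Illustrative shape only; predicate names are invented, not the slide's formula.
\forall p_1, p_2, m, t.\;
  \big( \mathit{send}(p_1, p_2, m, t) \wedge \mathit{inrole}(p_2, \mathit{law\mbox{-}enforcement})
        \wedge \mathit{attr}(m, \mathit{phi}) \big) \supset \\
  \quad \big( \exists t' < t.\; \mathit{admits}(\mathit{individual}, \mathit{violent\mbox{-}crime}, t') \big)
        \;\wedge\; \underbrace{\mathit{purp}(\mathit{send}, \mathit{identify})}_{\text{gray}}
        \;\wedge\; \underbrace{\mathit{believes}(p_1, \mathit{serious\mbox{-}harm})}_{\text{gray}}
```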
Now, for the formal definition of the reduce algorithm, let me show you
little snippets of it. If the formula is just a predicate, then the
algorithm will find out from the log whether that predicate is true or
false or unknown.
If it's true, it returns true. If it's false, it returns false. But
if it's unknown, then it returns the whole predicate phi. So in this
case the residual formula is the entire predicate P.
And then we apply this recursively. If it's a conjunction, you just
apply reduce on the two parts and so forth. The interesting case is
when we have universal quantification over an infinite domain. One
naive way to try to do this is to consider all substitutions for X, and
then this becomes a conjunction: phi with X set to x1, conjoined with
phi with X set to x2, and so on. But that's going to be an infinite
formula. The algorithm will never terminate if you do that.
Instead, we are going to restrict the syntax to have these guards.
Since this is an implication, it's going to be trivially true when C is
false. The interesting case is when C is true. Now, C will be such
that there's only going to be a finite number of substitutions for X
that make it true, and that finite set of substitutions can be
computed. So intuitively one
way to think about this is if you think about HIPAA and what is HIPAA
saying? HIPAA is saying that every message sent out by the hospital
should respect some complicated logical formula that represents HIPAA.
But the number of messages sent out by the hospital is finite. The
hospital does not send out every possible information in the English
language. There are only maybe a few messages that the hospital sent
out to third parties.
So the guard predicate is going to be true only for those messages, and
in that case the instances you get form a finite conjunction.
Right? So if I write this out in Greek form, then you get one conjunct
for each of the finite substitutions that makes C true. And then the
rest is allowing for the incompleteness in the audit log. Since in the
future you might get other substitutions -- more messages might get
sent out, or maybe the log was not past complete, so maybe there were
some messages that will show up as the log expands -- we have to
somehow deal with that, and that's captured by this conjunct, which is
saying if I get other substitutions, other than the ones that I've
already considered, then I should also have a piece for that in the
formula.
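Schematically (a paraphrase of the case just described, not the paper's exact definition), the guarded-universal case of reduce is:

```latex
% sat(L, c) is the finite, computable set of substitutions making the guard c true on log L.
\mathrm{reduce}\big(L,\ \forall \vec{x}.\ (c \supset \varphi)\big) \;=\;
  \Big( \bigwedge_{\sigma \in \mathrm{sat}(L,\, c)} \mathrm{reduce}(L,\ \varphi\sigma) \Big)
  \;\wedge\;
  \forall \vec{x}.\ \big( (c \wedge \vec{x} \notin \mathrm{sat}(L,\, c)) \supset \varphi \big)
```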
>>: Did I understand you correctly: did you say that in HIPAA it is
always the case that an appropriate finitizing guard can be found?
>> Anupam Datta: That's right. That's right.
>>: Why would it not be acceptable to convert for all into -- for all
must into none may?
>> Anupam Datta: For all must --
>>: So requiring that there's no message sent by the organization that
violates the rule, rather than that all of them must follow the rule.
>> Anupam Datta: So that's fine. That doesn't change anything, right?
So if I express that -- well, if I express that in this first order
logic, that's an existential quantifier over an infinite domain. And
that will become an infinite disjunction.
Converting universal to existential will not help. Maybe we can take
that offline. All right. So I guess coming back to your question, the
general theorem we have is that if this initial policy satisfies a
syntactic mode check -- we're using the idea of mode checking from
logic programming -- then the finite substitutions can be computed. So
now we have a syntactic characterization of these guards for which the
finite substitutions can always be computed.
It turns out that, as an instance of that, we get the whole of HIPAA
and the whole of Gramm-Leach-Bliley. If someone comes up with a third
law that we want to look at, we would like to see first whether this
theorem applies to that law or not. So the result has that nice
generality, and our argument for why this somewhat esoteric theorem
that uses techniques from mode checking is useful in this setting is
that, look, the whole of HIPAA and Gramm-Leach-Bliley actually
satisfies this test, right?
So going back to our policy example, here's an example of an incomplete
audit log. Now, if you look at all those quantifiers, we are going to
find substitutions for the variables P1 and P2 and M and so forth by
mining this log. Mining this log, you see that for the send predicate
there's only one instance of send, that's this instance over here, and
that will give us the substitution that P1 corresponds to UPMC and P2
corresponds to Allegheny, and the message corresponds to exactly this
message M2 and so forth.
So we can mine these from the log, and when you do that, now we know
the truth values for various predicates and we're left with a residual
formula that only contains the gray part of the policy. The rest all
become true and disappear.
We have actually implemented this and applied it to a simulated audit
log. So this is not over real audit logs; we hadn't gotten our hands
on hospital audit logs yet when we wrote this paper for CCS last year.
And it turns out that the average time for checking compliance of each
disclosure of protected health information is about 0.12 seconds for a
15 megabyte log. So this does scale reasonably well. Now, so that's
performance. One thing to be careful about for performance is that
as you apply this algorithm, the residual formula can actually grow.
Because of the finite substitutions -- you know, whenever we see a for
all, the formula becomes bigger, because you get one entry for each of
the substitutions that you mine from the log, and then there's the
residual piece. So after a few iterations, the policy will become too
big for the algorithm to work with. And the residual formula largely
has things like purposes and such, which this algorithm cannot handle.
How to handle purpose is the next part of this talk.
The other thing that is a relevant question for this application is
that reduce can only deal with what I'm calling black-and-white
concepts. Everything else shows up in the residual formula. In HIPAA
about 80 percent of all the atomic predicates are actually what I'm
calling black and white, which means the algorithm can deal with those
completely automatically. That doesn't mean that 80 percent of the
clauses of HIPAA are black and white, though; I'm just counting the
number of atomic predicates.
Now, in HIPAA there are about 85 clauses, of which about 17, which is
20 percent, are completely black and white.
So for those 20 percent of the clauses, this algorithm will be fully
automated. And for the others, there are things like purposes and
beliefs that we have to deal with.
So purpose remains an important problem -- and that gives us a sense of
purpose. Now, in terms of related work, there has been a fair amount
of work on specification languages and logics for privacy policies.
Some of them are either not expressive enough -- they don't have the
richness to capture these kinds of policies -- or they are not designed
for enforcement. P3P, for example, is primarily a specification
language, not a mechanism for enforcement. And there is prior work,
work I did back at Stanford, where we used quantifiers for
specification, but enforcement was really using propositional logic.
So we couldn't enforce the whole of HIPAA using that prior work.
In terms of actual specification of laws, we had looked at a few
clauses in this earlier work, and Gunter looked at one section, and
some work at Stanford from John Mitchell's group looked at a few more
sections, but our work is the first that does the whole of HIPAA.
The nearest technical work in terms of runtime monitoring is this work
from David Basin's group. They have a much more comprehensive
implementation and evaluation; they have actually applied it to real
audit logs from Nokia, which we're also trying to get now.
And it seems to work quite well. The two things that make that
approach unsuitable for this application are, one, they also have a
kind of mode checking, which they don't call mode checking but a
safe-range check, but that is a much less expressive way of restricting
the guards, and in fact the restriction is too restrictive; it cannot
express the kind of clauses we see in HIPAA and GLBA. The other thing
is that, like much of the work in runtime monitoring, they assume that
audit logs in applications are past complete, so the only source of
incompleteness is future incompleteness.
And that's also something we wanted to avoid; it would be more general
to allow other forms of incompleteness.
Now moving on to this piece on purpose restrictions. This is work that
will appear next month at the Oakland symposium; it's joint work with
my student Michael Tschantz and Jeannette Wing at CMU.
So purpose restrictions, as I've motivated before, show up in a lot of
privacy policies, and there are at least two kinds of purpose
restrictions. There are not-for restrictions that say that information
is not going to be used for a purpose, and the only-for restrictions
that say we want to use information only for a certain purpose.
And potentially not for other things, right? So the goal is to really
give semantics to not-for and only-for purpose restrictions, but
parametrically in the information and the action, meaning that you want
it to apply for all purposes, all types of information and all types of
actions.
And then to provide an auditing algorithm once we have a sense of what
it means. So let me try to motivate, give you a sense of how we arrive
at the final notion by using this running example. Imagine that an
x-ray is taken in a medical office. It's added to the medical record,
and the medical record is then shipped off for diagnosis by a
specialist in reading x-rays. And the policy says that medical records
are only going to be used for diagnosis. Right? So the question is,
is this action for diagnosis or not? If it was not for diagnosis, then
this is a violation of the policy. So how can we go about defining
what it means for an action to be for a purpose? So here's the first
attempt. The first attempt is we're going to say an action is for a
purpose if it is labeled as such. So someone -- whoever is sending out
the message -- is going to label the action with a purpose. And the
obvious problem is that it kind of begs the question, because how do we
know that that labeling is correct? What is the basis for doing that
labeling?
The other problem is that one action can have different purposes
depending on context. If I go back to this example, intuitively it
appears that these two actions are for diagnosis, but imagine another
send record action from up there, that cannot be for diagnosis, because
this was an x-ray specialist, and the x-ray wasn't even added to the
medical record. So although these two actions are the same
syntactically, they're both send records, one is for diagnosis and not
the other.
So these two appear to be for diagnosis and this is not for diagnosis.
So actions cannot direct -- just labeling actions with purposes is not
sufficient. The natural next step is to also try and -- whether this
action is for diagnosis or not depends on the state from which the
action was done. So maybe we should take the state into account, which
is very natural, right?
So our formalization of purpose must also include states. Now, if I
look at this example, it appears that these two actions are necessary
and sufficient for achieving the purpose of diagnosis. Whereas that
one up there is not sufficient, because no diagnosis is achieved.
>>: Necessity seems obvious.
>> Anupam Datta: Sufficiency is the dual. Okay.
>>: Suppose we go from the correct state and do the send, but send to
somebody who is not a specialist and cannot possibly do the diagnosis.
>> Anupam Datta: Right, so then it's not going to be sufficient for
diagnosis. I'm going to say this attempt, attempt two, is: an action
is for a purpose if it's necessary and sufficient as a part of a chain
of actions for achieving the purpose. If you send it to someone who is
not a specialist, there is no diagnosis, and you will violate this.
That action could not be for the purpose because it wasn't sufficient.
But now I have presented it only to say that this is not very great.
So now I'll argue that necessity is too strong, and sufficiency is both
too strong and too weak, in a specific sense. So let's first look at
necessity.
So here's a slight modification of this example. Now, you see neither
of these actions are necessary. Because instead of sending to the
first specialist, I could have sent to the second specialist and vice
versa. So necessity is too strong. Instead we'll use nonredundancy.
Nonredundancy is a necessity with respect to a fixed sequence of
actions. Given a sequence of actions that reaches a goal state, an
action in that sequence is nonredundant, if removing that action from
the sequence results in the goal state no longer being reached.
If I go back to this example, both of these actions are nonredundant,
because if I fix an execution like this one, if I remove this, then the
purpose is no longer achieved. So nonredundancy is the weakening of
necessity that we are going to try and work with. All right? So
attempt three is now: an action is for a purpose if it's part of a
sufficient and nonredundant chain of actions for achieving that
purpose. This actually coincides with a slight adaptation of the
counterfactual definition of causality that philosophers have worked
on, and computer scientists like Judea Pearl have similar notions of
causality.
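A minimal sketch of that nonredundancy test on a toy deterministic model (the states, actions and transition function are invented for illustration, not the paper's formalism):

```python
# Given a transition relation, a start state, a goal predicate, and a logged
# action sequence that reaches the goal, an action is nonredundant if
# deleting it from the sequence means the goal is no longer reached.

def run(transition, start, actions):
    """Replay actions from start; return the final state, or None if an
    action is not applicable in the current state."""
    state = start
    for a in actions:
        state = transition.get((state, a))
        if state is None:
            return None
    return state

def nonredundant(transition, start, goal, actions, i):
    """Is actions[i] nonredundant for reaching the goal via this sequence?"""
    assert goal(run(transition, start, actions)), "sequence must reach the goal"
    shortened = actions[:i] + actions[i + 1:]
    return not goal(run(transition, start, shortened))

# Toy x-ray example (invented labels):
transition = {
    ("start", "take_xray"): "have_xray",
    ("have_xray", "add_to_record"): "record_ready",
    ("record_ready", "send_to_specialist"): "diagnosed",
}
goal = lambda s: s == "diagnosed"
seq = ["take_xray", "add_to_record", "send_to_specialist"]
print(nonredundant(transition, "start", goal, seq, 1))   # True: removing it breaks the run
```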
>>: What if you send it to a specialist and the specialist says I can't
make a diagnosis?
>> Anupam Datta: Good. Next slide. So I have now said why -- I've
now argued why necessity's too strong, and we have to replace it by
nonredundancy.
Next I want to argue that sufficiency is too strong, partly because of
what you're saying: probabilistic outcomes. It's too strong because of
probabilistic outcomes. But also it's too weak in a certain sense,
because if we look at this picture, you might have a more accurate
diagnosis if you used one specialist versus another.
And in that case we would like the record to have gone to the right
specialist -- the one who always does better -- if we know that that
person always does better. Right? But instead of saying always does
better, you can have an expectation over it, things like that, right?
The other thing is exactly what you're saying: you may not always get a
diagnosis, even if you did an action with the intention of getting a
diagnosis, because of probabilistic outcomes.
So that's why -- that's the difference between the adapted
counterfactual definition of causality and what we're going to end up
with.
So there are at least two things that motivate our thesis, which will
be based on planning. Because of probabilistic failures we cannot
require that a sequence of actions actually furthers a purpose; we can
only require that the agent formed a plan to perform actions that
further the purpose. And quantitative purposes say that the agent
adopts the plan that optimizes the expected satisfaction of the
purpose. So that leads us to the thesis that an action is for a
purpose if the agent planned to perform the action while considering
that purpose.
In other words, an action is for a purpose if it is part of a plan,
part of an optimal plan, for achieving that purpose. Now, having
reduced the question of what it means for an action to be for a purpose
to this more concrete thing about planning, we can hope for algorithms
for checking it.
>>: For the "not for a purpose" case, that seemed reasonable. But if
we say "only for a purpose" -- now, it's very common, suppose I'm doing
x-rays, to send to a specialist we work with, on our terms also.
There's another purpose, to increase our payment -- to send you to our
specialist and not outside.
>> Anupam Datta: Right. So good. So what you're getting at is that
this algorithm is not going to be both sound and complete. Let me try
to think about this. This is saying an action is for a purpose if it
is part of a plan for achieving the purpose. And if I show that the
action could not have been part of any plan for achieving the
purpose -- all possible plans that you can think of given the
environment model -- then I know for sure that there was a violation.
But I don't know that the policy was respected if I cannot argue that.
So there will be a space of tenable deniability, and for a
not-for-a-purpose policy it will be the other way around.
>>: It seems we all come from mathematical logic. And I think the
logic appropriate here, more appropriate than that, is the logic of
courts: the logic of not making something true or false but
establishing it beyond a reasonable doubt, lowering the --
>> Anupam Datta: That's a very interesting point. So this is as far
as we have gotten. I'm not -- I think we are presenting this as where
we are after two years of thought.
This is by no means the final word on this topic. So I'll be happy to
hear about your thoughts in greater detail when we talk.
And we haven't even gotten to a logic. We just have a model at the
moment. We are wondering what the right logic for it will be.
Initially, we were stuck for a long time on the counterfactual, the
nonredundant-but-sufficient definition, which seemed very close to
causality and all the work that has been done in that literature, but
since we got to planning it seems a little bit different from what
exists out there.
So in the remaining few minutes -- I don't want to exceed the time -- I
want to give you a sense of what we did in terms of trying to test
whether this hypothesis, our thesis, is true or not: what does it
formally mean for an action to be for a purpose, and how do we audit
for it? For the first one, one obvious way is to try little examples,
which is what we did for two years before we got to this point. The
sequence I showed you is a condensed summary of the kind of thought
process that went into it.
The other thing we tried was a survey on Mechanical Turk of about 200
people, and we compared the predictions of two hypotheses to the
provided responses. The planning hypothesis is ours; it says an action
is for a purpose if and only if that action is part of a plan for
furthering the purpose. The other hypothesis says an action is for a
purpose if and only if that action furthers the purpose. This was kind
of the reigning champion before this work.
And the survey kind of overwhelmingly supports the planning hypothesis,
at least for those 200 people -- we cannot generalize too much from a
sample of 200. We had four scenarios: one in which both hypotheses
were satisfied, and they all said yes; a scenario in which both of the
hypotheses were false, and they said no; one in which the planning
hypothesis was true and the other was false, and many more people said
yes and very few people said no; and one in which the other was true
and the planning hypothesis was false, and most people said no and a
few people said yes. The details of the survey I don't have time to
get into, but I have some back-up slides at the end to show you the
kind of questions that we asked them.
Now, coming to auditing, the picture now is in addition to the privacy
policy and the audit log, there is an environment model for planning,
which is essentially the state machine kind of thing that I was showing
you, which formally is a Markov decision process, which is a standard
model that is used in the planning community. Given a privacy policy a
Markov decision process that keeps track of the environment model and
an audit log which records the actual execution, what actually
happened, this algorithm comes back with obeyed, inconclusive or
violated. And the basic idea for only-for policies is that you want to
find all strategies that are consistent with the observed agent
behavior under the model for purpose P, and if none of these strategies
optimize the purpose P under the model, then the agent violated only
for P.
And there is -- this can be mapped to the Markov decision process and
the details I don't want to get into. But I'll show you an example in
the next slide. The soundness theorem here is saying that the
algorithm returns true if and only if the actions recorded in the log
are not for the purpose. So if the algorithm returns true, you know
that the not-for-purpose policy is satisfied, and the only-for policy
is violated.
But the other side you don't know. There is tenable deniability on the
other side. Coming to this example, let me illustrate how the
algorithm works. This is the Markov decision process. I've shown you
states and actions and there's the reward function. The reward
function is essentially when you go from this state to this state you
get the difference between these diagnosis scores. So you don't get
anything coming here. If you go from here to here you get six, and
here you get two. So the optimal path through this model is this path.
If any other path happened, then that would be a violation. So if you
went on this path, for example, then this action was not for diagnosis,
because it could not have been part of a plan, an optimal plan for
achieving this purpose. Now, this can be algorithmically done using
adaptations of plan recognition algorithms. So this is where we are.
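A much-simplified sketch of that idea on a small deterministic model (this is not the paper's plan-recognition algorithm; the names, states and rewards are invented): compute optimal values for the purpose's reward, then flag any logged action that no optimal plan would take.

```python
# Toy "only for" audit: value iteration on a deterministic MDP, then check
# whether each logged (state, action) pair is optimal for the purpose.

def value_iteration(states, actions, transition, reward, gamma=1.0, iters=100):
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        for s in states:
            options = [reward.get((s, a), 0.0) + gamma * V[transition[(s, a)]]
                       for a in actions if (s, a) in transition]
            V[s] = max(options, default=0.0)
    return V

def audit_only_for(log, states, actions, transition, reward):
    """Return 'violated' if some logged action is not optimal for the purpose,
    else 'inconclusive' (optimal behavior alone cannot prove intent)."""
    V = value_iteration(states, actions, transition, reward)
    for s, a in log:
        best = max(reward.get((s, b), 0.0) + V[transition[(s, b)]]
                   for b in actions if (s, b) in transition)
        taken = reward.get((s, a), 0.0) + V[transition[(s, a)]]
        if taken < best:
            return "violated"
    return "inconclusive"

# Toy model: sending to specialist 2 yields a higher diagnosis score (6 vs 2).
states = ["record_ready", "done1", "done2"]
actions = ["send_spec1", "send_spec2"]
transition = {("record_ready", "send_spec1"): "done1",
              ("record_ready", "send_spec2"): "done2"}
reward = {("record_ready", "send_spec1"): 2.0,
          ("record_ready", "send_spec2"): 6.0}
print(audit_only_for([("record_ready", "send_spec1")],
                     states, actions, transition, reward))   # violated
```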
Now, in terms of past approaches, while past approaches were useful,
they were not semantically justified. They included things like
labeling actions, or labeling sequences of actions or agent roles or
code, but there was no justification for how the labeling should be
done. So if the labeling was done incorrectly, there wasn't any way of
catching it. Our work provides a semantic foundation for these
approaches: it provides a basis for correct labeling, and the paper
discusses in great detail what the limitations are.
Now, I'm not claiming that this work is usable as it currently exists;
there's a bunch of reasons for that. One extension that we have worked
out is enforcing purpose restrictions on information use. What I was
talking about so far was whether an action is for a purpose or not.
But whether information is used for a purpose or not is a little bit
more subtle, because there could be implicit flows. So suppose you
want to say the gender will not be used for targeted advertising, then
we have to show that the planning process is unaltered whether you keep
the gender as male or female. So it's similar to notions of
noninterference and information flow. It's a combination of planning
and information flow if you will.
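A toy sketch of that noninterference-style check (the planner and attributes are invented, not the paper's definition): an attribute is "used" for the purpose if varying its value changes the plan chosen while optimizing that purpose.

```python
# If the plan is the same for every value of the attribute, the attribute did
# not influence the planning process -- a planning analogue of noninterference.

def chosen_ad(profile):
    # A toy planner: picks the ad with the highest made-up expected value.
    scores = {"sports_ad": 1.0, "generic_ad": 1.2}
    if profile.get("gender") == "male":
        scores["sports_ad"] += 0.5          # implicit use of gender
    return max(scores, key=scores.get)

def uses_attribute(planner, profile, attr, values):
    """Does changing `attr` over `values` change the planner's choice?"""
    choices = {planner({**profile, attr: v}) for v in values}
    return len(choices) > 1

profile = {"age": 34, "gender": "female"}
print(uses_attribute(chosen_ad, profile, "gender", ["male", "female"]))  # True: gender is used
```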
Now, the other thing that gets in the way of using this approach
directly in practice is that we are assuming these environment models,
these Markov decision processes, are an input to this algorithm, but
constructing them for complicated organizations is going to be a lot of
work that we ideally don't want to do manually.
So one of the things we would like to explore once we get the real
audit logs is to see if we can learn these things from the logs. It's
not out of the question, because there's a lot of work on Q-learning
and reinforcement learning that tries to learn Markov decision
processes given observed data.
Another thing here: if you think about this model of planning, the
reason we chose MDPs is that it's kind of the most developed formalism
for planning in the AI community.
But it's a good model of planning for things like robots, where you can
program them to act in certain ways -- people, not so much. Right?
It's assuming that people are completely rational and can do all these
complicated calculations.
So if the AI community comes up with better models of human planning,
then we can instantiate this framework with those models.
However, I should say that although this is not such a great model for
human planning, it is something that an organization could use to
analyze this environment model and prescribe more operational policies
for humans where the calculation is done by the organization, not by
humans.
And finally we really want to push on applying this in practice. So as
we get the audit logs from the SHARPS project, from Northwestern
Memorial and Johns Hopkins and Vanderbilt, that's the next thing on our
agenda. So I'll leave you with this big picture of how audit works,
separated into auditing for these black-and-white concepts and gray
concepts like purposes, and the final kind of big picture of using
audit and accountability for enforcement. Thank you very much.
[applause].
>>: On HIPAA -- what is your impression of how well it is written? Did
they consult computer scientists?
>> Anupam Datta: Right. So HIPAA was very operational. That was one
very good thing about HIPAA. That's part of the reason why the first
algorithm that I presented was quite effective for big chunks of it --
especially for disclosure. The conditions under which disclosure can
happen are very operational.
Now, the other thing -- the specific question that you had asked
earlier about one thing having higher priority over another -- that was
also spelled out quite explicitly. The structure of the clauses --
maybe I can use the board -- was often of this form. So there
were a bunch of what we were calling negative clauses, and there was a
conjunction over these negative clauses: all of these conditions -- the
negative clauses -- have to hold; all of the clauses have to hold.
And then there were a bunch of positive clauses that said if any one of
these held, it was okay to share information. So one way to think
about this is these are saying under these conditions it's okay to
share information, and those are saying -- it's a generalization of
deny clauses -- that unless these things are satisfied, you cannot
share information.
Now, the interesting thing about the way the law was structured is that
the positive clauses had conjunctive exceptions. And the negative
clauses had disjunctive exceptions. Right? Meaning that this
condition has to be always satisfied, except if this other thing
happens, which means that this can be written as a negative core or an
exception.
And these had conjunctive exceptions, meaning that if this holds, you
can release information to this third party, except if this other thing
also holds.
So these had like a core with a conjunctive exception. And so the
preemption mechanism, the priority mechanism, was encoded this way;
when we tried to formalize it, this is what it looked like in the
logic. You can imagine multiple nestings of these kinds of exceptions,
but in HIPAA there was only one level.
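Schematically (a sketch of the structure just described at the board, not the actual encoding):

```latex
% Negative clauses all must hold (with disjunctive exceptions);
% at least one positive clause must hold (with conjunctive exceptions).
\mathit{permitted}(\mathit{send}) \;\equiv\;
  \Big( \bigwedge_i \big( \mathit{neg}_i \vee \mathit{exc}^{-}_i \big) \Big)
  \;\wedge\;
  \Big( \bigvee_j \big( \mathit{pos}_j \wedge \neg\, \mathit{exc}^{+}_j \big) \Big)
```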
>>: [inaudible] this language is used in [inaudible] when you write
these things.
>> Anupam Datta: Right. So if you have -- I guess I'll talk with you
in greater detail if you have suggestions. We're using first order
logic, right, but you can imagine using better logics that are more
suited for this application. Because of these exceptions and their
encoding in this manner, the formulas get a lot bigger, right?
I know [inaudible] tried to do something with [inaudible] logics but
for a very different application. So I don't know what other logics
might be suitable here. But that would be an interesting conversation
to have.
>>: In general, the English is fairly unambiguous?
>> Anupam Datta: Yeah, I think that most of it -- it required a clever
CMU grad student; the formalization took over a year to do, but Henry
was also doing other things with Frank Pfenning. There were only a
couple of places where we had to go talk to a lawyer, and the lawyers
came back with some answers that helped us a little bit. But I don't
want to claim that this is an authoritative encoding of the law. I
think we have to make these things open source and up for discussion
and debate. We do need input from lawyers, but for the most part the
high-level bit here is that HIPAA is largely operational. And that's
actually the way the law should be written.
The places where it's more abstract are things that have to do with
purposes and beliefs. But at some level that seems like the right
level of abstraction, because, you know, how are you going to -- unless
you say that personal health information can be shared for the purpose
of payment, treatment and operations, if you try to make it more
operational, then you have to list all the conditions.
And that seems like it will vary from organization to organization.
Right? So I wouldn't know how to make it more operational. On the
other hand, one other thing that will impede these kinds of techniques
from being applied, if there are lots and lots of laws and policies, is
that you will need a Henry DeYoung for every one of those laws and
policies.
If it could be written in a more structured way -- if we can figure out
an input language that is more structured while being more usable, so
that it's not first order logic or some other complicated logic -- that
could then serve as a way to do automatic compilation. I don't know
that it will ever happen with laws. It's interesting: I gave this talk
and Michelle Dennedy, who is now the chief privacy officer of McAfee
but was at Sun and Oracle before that, got very excited. She went to
Washington and tried to convince the senators to write their laws in
some structured language. And she didn't get very far, unsurprisingly.
But maybe that's something that companies can do, right? A company
like Microsoft, if they have a usable way of writing policies, could
then compile them down to something that an enforcement engine could
operate on.
That seems much more doable, because what ultimately gets enforced in
an organization is not the law -- the law tends to be more abstract --
but internally the policies tend to be more operational. So yesterday
when I was talking with the corporate privacy people here, that's one
thing that we discussed we might want to do: look at the internal
privacy policy -- they're apparently working right now on making it
more operational -- and see if we can work with that.
>>: Can you give us an example of one of the places where you may have
to talk to a lawyer?
>> Anupam Datta: Oh, I will have to look up my e-mail. It's been -- the formalization was done about three years ago. So I don't remember.
But I'll dig it up and let you know. I have no memory of it anymore,
blissfully forgotten.
>> Bryan Parno: Let's thank Anupam.
>> Anupam Datta: Thanks. Yeah.
[applause]