Lecture16 - The University of Texas at Dallas

Trustworthy Semantic Web
Confidentiality, Privacy
and Trust
Prof. Bhavani Thuraisingham
The University of Texas at Dallas
February 2011
Outline of the Unit
 What are logic and inference rules
 Why do we need rules?
 Example rules
 Logic programs
 Monotonic and Nonmonotonic rules
 Rule Markup
 Example Rule Markup in XML
 Policy Specification
 Relationship to the Inference and Privacy problems
 Summary and Directions
 Confidentiality, Privacy and Trust
Logic and Inference
 First order predicate logic
 High level language to express knowledge
 Well understood semantics
 Logical consequence - inference
 Proof systems exist
 Sound and complete
 OWL is based on a subset of logic – description logic
Why Rules?
 RDF is built on XML and OWL is built on RDF
 We can express subclass relationships in RDF; additional
relationships can be expressed in OWL
 However reasoning power is still limited in OWL
 Therefore the need for rules and subsequently a markup language
for rules so that machines can understand
Example Rules
 Studies(X,Y), Lives(X,Z), Loc(Y,U), Loc(Z,U) → HomeStudent(X)
 i.e. if John studies at UTDallas and John lives on Campbell Road,
and both Campbell Road and UTDallas are located in Richardson,
then John is a home student
 Note that
Person(X) → Man(X) or Woman(X) is not a rule in predicate logic (the head of a rule cannot be a disjunction)
That is, if X is a person then X is either a man or a woman. This can be
expressed in OWL
However we can have a rule of the form
Person(X) and Not Man(X) → Woman(X)
Monotonic Rules
 → Mother(X,Y)
 Mother(X,Y) → Parent(X,Y)
If Mary is the mother of John, then Mary is the parent of John
Syntax: Facts and Rules
Rule is of the form:
B1, B2, ..., Bn → A
That is, if B1, B2, ..., Bn hold then A holds
Logic Programming
 Deductive logic programming is in general based on deduction
- i.e., Deduce data from existing data and rules
- e.g., Father of a father is a grandfather, John is the father of
Peter and Peter is the father of James and therefore John is the
grandfather of James
 Inductive logic programming induces rules from the data
- e.g., John is the father of Peter, Peter is the father of James,
John is the grandfather of James, James is the father of Robert,
Peter is the grandfather of Robert
- From the above data, induce that the father of a father is a
grandfather
 Popular in Europe and Japan
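The deductive example above (father of a father is a grandfather) can be sketched as a tiny forward-chaining loop. This is a minimal illustration, not a real logic programming engine; the tuple encoding of facts and the function names are assumptions made for the example.

```python
# Facts are (predicate, subject, object) tuples, taken from the slide's example.
facts = {
    ("father", "John", "Peter"),
    ("father", "Peter", "James"),
}


def apply_grandfather_rule(known):
    """Rule: father(X, Y), father(Y, Z) -> grandfather(X, Z)."""
    derived = set()
    for pred1, x, y1 in known:
        for pred2, y2, z in known:
            if pred1 == "father" and pred2 == "father" and y1 == y2:
                derived.add(("grandfather", x, z))
    return derived


def forward_chain(known):
    """Apply the rule repeatedly until no new facts are derived."""
    known = set(known)
    while True:
        new_facts = apply_grandfather_rule(known) - known
        if not new_facts:
            return known
        known |= new_facts


closure = forward_chain(facts)
for fact in sorted(closure):
    print(fact)
```

Because the rule only ever adds facts, the loop is monotonic: nothing derived earlier is retracted later.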
Nonmonotonic Rules
 If we have X and NOT X, we do not treat them as inconsistent as in
the case of monotonic reasoning.
 For example, consider an apartment that is
acceptable to John. That is, in general John is prepared to rent an
apartment unless the apartment has fewer than two bedrooms,
does not allow pets, etc. This can be expressed as follows:
 → Acceptable(X)
 Bedroom(X,Y), Y<2 → NOT Acceptable(X)
 NOT Pets(X) → NOT Acceptable(X)
 Note that there could be a contradiction. But with nonmonotonic
reasoning this is allowed.
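One way to read the apartment rules is as a default with exceptions: the apartment is acceptable unless some exception rule fires. The sketch below assumes that exceptions override the default and that missing information triggers no exception; the slides only say that the contradiction is allowed, so both choices are assumptions of this example.

```python
def acceptable(apartment):
    """Default rule: -> Acceptable(X). Exception rules defeat the default."""
    reasons_against = []

    # Bedroom(X, Y), Y < 2 -> NOT Acceptable(X)
    bedrooms = apartment.get("bedrooms")
    if bedrooms is not None and bedrooms < 2:
        reasons_against.append("fewer than two bedrooms")

    # NOT Pets(X) -> NOT Acceptable(X)
    if apartment.get("pets") is False:
        reasons_against.append("pets not allowed")

    # Assumed conflict resolution: any exception overrides the default.
    return not reasons_against


# With no information the default conclusion holds.
print(acceptable({}))
# Learning more about the apartment can retract that conclusion.
print(acceptable({"bedrooms": 1}))
```

The second call shows the nonmonotonic behaviour: adding a fact removes a conclusion that was previously drawn.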
Rule Markup
 The various components of logic are expressed in the Rule Markup
Language – RuleML
 Both monotonic and nonmonotonic rules can be represented
 Example representation of Fact P(a) - a is a parent
<fact>
<atom>
<predicate>p</predicate>
<term>
<const>a</const>
</term>
</atom>
</fact>
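To show how a machine might read such a fact, here is a small sketch that parses a well-formed version of the markup with Python's standard `xml.etree.ElementTree`. The element names follow the slide; the helper `read_fact` is a name made up for this example and is not part of any RuleML tool.

```python
import xml.etree.ElementTree as ET

# Well-formed version of the fact P(a) from the slide.
FACT_XML = """
<fact>
  <atom>
    <predicate>p</predicate>
    <term>
      <const>a</const>
    </term>
  </atom>
</fact>
"""


def read_fact(xml_text):
    """Return (predicate, [constants]) for a single <fact> element."""
    root = ET.fromstring(xml_text)
    atom = root.find("atom")
    predicate = atom.findtext("predicate").strip()
    constants = [const.text.strip() for const in atom.iter("const")]
    return predicate, constants


predicate, constants = read_fact(FACT_XML)
print(f"{predicate}({', '.join(constants)})")
```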
Policies in RuleML
<fact>
<atom>
<predicate>p</predicate>
<term>
<const>a</const>
</term>
</atom>
Level = L
</fact>
Example Policies
 Temporal Access Control
- After 1/1/05, only doctors have access to medical records
 Role-based Access Control
- Manager has access to salary information
- Project leader has access to project budgets, but he does not
have access to salary information
- What happens if the manager is also the project leader?
 Positive and Negative Authorizations
- John has write access to EMP
- John does not have read access to DEPT
- John does not have write access to Salary attribute in EMP
- How are conflicts resolved?
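The positive and negative authorizations above can be sketched as data plus a conflict-resolution rule. The slide leaves the resolution strategy open; this example assumes that denials take precedence and that the absence of any rule means no access. The `Authorization` class and the dotted `EMP.Salary` naming are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Authorization:
    subject: str
    action: str
    obj: str
    allow: bool  # True = positive authorization, False = negative


POLICY = [
    Authorization("John", "write", "EMP", True),
    Authorization("John", "read", "DEPT", False),
    Authorization("John", "write", "EMP.Salary", False),
]


def covers(rule_obj, requested_obj):
    """A rule on a relation also covers its attributes (e.g. EMP covers EMP.Salary)."""
    return requested_obj == rule_obj or requested_obj.startswith(rule_obj + ".")


def is_permitted(subject, action, obj, policy=POLICY):
    matches = [
        auth for auth in policy
        if auth.subject == subject and auth.action == action and covers(auth.obj, obj)
    ]
    # Assumed conflict resolution: a matching denial always wins.
    if any(not auth.allow for auth in matches):
        return False
    # Assumed closed policy: access needs at least one positive authorization.
    return any(auth.allow for auth in matches)


print(is_permitted("John", "write", "EMP.Name"))
print(is_permitted("John", "write", "EMP.Salary"))
```

Other strategies, such as most-specific-rule-wins or permissions-take-precedence, would change only the two commented lines in `is_permitted`.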
Privacy Policies
 Privacy constraints processing
- Simple Constraint: an attribute of a document is private
- Content-based constraint: If document contains information
about X, then it is private
- Association-based Constraint: Two or more documents taken
together are private; individually each document is public
- Release constraint: After X is released Y becomes private
 Augment a database system with a privacy controller for constraint
processing
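The four constraint types can be sketched as checks that a privacy controller runs before releasing a document. The document ids, topic labels, and data structures below are hypothetical and chosen only to illustrate each constraint; they do not describe the actual controller.

```python
# Hypothetical constraint base (all ids and names are illustrative).
PRIVATE_ATTRIBUTES = {("d1", "ssn")}        # simple constraint
PRIVATE_TOPICS = {"diagnosis"}              # content-based constraint
PRIVATE_TOGETHER = [{"d2", "d3"}]           # association-based constraint
PRIVATE_AFTER_RELEASE = [("d4", "d5")]      # release constraint: after d4, d5 is private


def visible_attributes(doc):
    """Simple constraint: hide attributes that are marked private."""
    return {
        name: value
        for name, value in doc["attributes"].items()
        if (doc["id"], name) not in PRIVATE_ATTRIBUTES
    }


def may_release(doc, released_ids):
    """Decide whether a whole document may be released, given what is already out."""
    # Content-based: the document mentions a private topic.
    if doc["topics"] & PRIVATE_TOPICS:
        return False
    # Association-based: releasing it would complete a private combination.
    for group in PRIVATE_TOGETHER:
        if doc["id"] in group and group - {doc["id"]} <= released_ids:
            return False
    # Release: an earlier release has made this document private.
    for trigger, target in PRIVATE_AFTER_RELEASE:
        if doc["id"] == target and trigger in released_ids:
            return False
    return True


d3 = {"id": "d3", "topics": set(), "attributes": {}}
print(may_release(d3, released_ids=set()))
print(may_release(d3, released_ids={"d2"}))
```

Note that the association-based and release constraints depend on the release history, so the controller has to keep state between queries.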
System Architecture for Access Control
[Figure: architecture diagram. Components: User (pull/query, push/result), RuleML-Access, RuleML-Admin, policy base, credential base, RuleML data documents, admin tools.]
RuleML Data Management
 Data is presented as RuleML documents
 Query language – Logic programming based?
 Policies in RuleML
 Reasoning engine
- Use the one developed for RuleML
Inference/Privacy Control
[Figure: inference/privacy controller, technology by UTD. Components: interface to the Semantic Web, inference engine/rules processor, policies, ontologies, rules, rule-based data management, rules data.]
Summary and Directions
 Rules have expressive and reasoning power
 Handles some of the inadequacies of OWL
 Both monotonic and nonmonotonic reasoning
 Logic programming based
 Policies specified in RuleML
 Need to build an integrated system
 Other rules: SWRL (Semantic Web Rule Language)
CPT: Confidentiality, Privacy and Trust
 Before I as a user of Organization A send data about me to
organization B, I read the privacy policies enforced by
organization B
- If I agree to the privacy policies of organization B, then I
will send data about me to organization B
- If I do not agree with the policies of organization B, then I
can negotiate with organization B
 Even if the web site states that it will not share private
information with others, do I trust the web site?
 Note: while confidentiality is enforced by the organization,
privacy is determined by the user. Therefore for
confidentiality, the organization will determine whether a user
can have the data. If so, then the organization can further
determine whether the user can be trusted
What is Privacy
 Medical Community
- Privacy is about a patient determining what
patient/medical information the doctor should release
about him/her
 Financial community
- A bank customer determines what financial information the
bank should release about him/her
 Government community
- FBI would collect information about US citizens. However
FBI determines what information about a US citizen it can
release to say the CIA
Some Privacy concerns
 Medical and Healthcare
- Employers, marketers, or others knowing of private medical
concerns
 Security
- Allowing access to individual’s travel and spending data
- Allowing access to web surfing behavior
 Marketing, Sales, and Finance
- Allowing access to individual’s purchases
Data Mining as a Threat to Privacy
 Data mining gives us “facts” that are not obvious to human analysts
of the data
 Can general trends across individuals be determined without
revealing information about individuals?
 Possible threats:
Combine collections of data and infer information that is private
 Disease information from prescription data
 Military action from pizza delivery to the Pentagon
 Need to protect the associations and correlations between the data
that are sensitive or private
Some Privacy Problems and Potential Solutions
 Problem: Privacy violations that result due to data mining
- Potential solution: Privacy-preserving data mining
 Problem: Privacy violations that result due to the Inference problem
- Inference is the process of deducing sensitive information from
the legitimate responses received to user queries
- Potential solution: Privacy Constraint Processing
 Problem: Privacy violations due to un-encrypted data
- Potential solution: Encryption at different levels
 Problem: Privacy violation due to poor system design
- Potential solution: Develop methodology for designing privacy-enhanced systems
Privacy Constraint /Policy Processing
 Privacy constraints processing
- Based on prior research in security constraint processing
- Simple Constraint: an attribute of a document is private
- Content-based constraint: If document contains information
about X, then it is private
- Association-based Constraint: Two or more documents taken
together are private; individually each document is public
- Release constraint: After X is released Y becomes private
 Augment a database system with a privacy controller for constraint
processing
Inference/Privacy Control
[Figure: inference/privacy controller, technology by UTD. Components: interface to the Semantic Web, inference engine/rules processor (reasoning in OWL?), privacy policies, ontologies, rules, OWL/RDF data management, OWL/RDF documents, web pages, databases.]
Semantic Model for Privacy Control
[Figure: semantic network about Patient John. Labels in the diagram: has disease, cancer, influenza, John's address, address, England, travels frequently. Dark lines/boxes contain private information.]
Privacy Preserving Data Mining
 Prevent useful results from mining
- Introduce “cover stories” to give “false” results
- Only make a sample of data available so that an adversary is
unable to come up with useful rules and predictive functions
 Randomization
- Introduce random values into the data and/or results
- Challenge is to introduce random values without significantly
affecting the data mining results
- Give range of values for results instead of exact values
 Secure Multi-party Computation
- Each party knows its own inputs; encryption techniques used to
compute final results
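The randomization idea can be illustrated with a short sketch: each individual value is perturbed with zero-mean noise, so single records are distorted while an aggregate such as the mean stays close to the true value. The data, the uniform noise, and the `spread` parameter are assumptions chosen for illustration, not a specific published PPDM algorithm.

```python
import random
import statistics


def randomize(values, spread, seed=0):
    """Add zero-mean uniform noise in [-spread, spread] to every value."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [value + rng.uniform(-spread, spread) for value in values]


# Hypothetical ages of 1000 individuals.
true_ages = [23, 35, 41, 29, 52, 47, 38, 31, 60, 44] * 100
noisy_ages = randomize(true_ages, spread=10)

print("true mean: ", statistics.mean(true_ages))
print("noisy mean:", round(statistics.mean(noisy_ages), 2))
print("first record, true vs released:", true_ages[0], round(noisy_ages[0], 1))
```

A larger `spread` hides individual values better but makes mined results less accurate, which is the trade-off the slide calls the challenge.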
Platform for Privacy Preferences (P3P):
What is it?
 P3P is an emerging industry standard that enables web sites
to express their privacy practices in a standard format
 The format of the policies can be automatically retrieved and
understood by user agents
 It is a product of the W3C (World Wide Web Consortium)
www.w3c.org
 When a user enters a web site, the privacy policies of the web
site are conveyed to the user; if the privacy policies are
different from user preferences, the user is notified; the user can
then decide how to proceed
Platform for Privacy Preferences (P3P):
Organizations
 Several major corporations are working on P3P
standards including:
Microsoft
IBM
HP
NEC
Nokia
NCR
 Web sites have also implemented P3P
 Semantic web group has adopted P3P
Platform for Privacy Preferences (P3P):
Specifications
 Initial version of P3P used RDF to specify policies; Recent version
has migrated to XML
 P3P Policies use XML with namespaces for encoding policies
 P3P has its own statements and data types expressed in XML; P3P
schemas utilize XML schemas
 P3P specification released in January 2005 uses a catalog shopping
example to explain concepts; P3P is an international standard and is
an ongoing project
 Example: Catalog shopping
- Your name will not be given to a third party but your purchases will be
given to a third party
<POLICIES xmlns="http://www.w3.org/2002/01/P3Pv1">
<POLICY name = - - - </POLICY>
</POLICIES>
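The user-agent behaviour described for P3P (retrieve the site's policy, compare it with the user's preferences, notify the user on a mismatch) can be sketched as a simple comparison. The dictionaries and value labels below are hypothetical stand-ins based on the catalog shopping example; they are not the P3P vocabulary or a P3P parser.

```python
# Hypothetical, simplified stand-ins for a site policy and user preferences.
SITE_POLICY = {
    "name": "not-shared",
    "purchases": "shared-with-third-parties",
}
USER_PREFERENCES = {
    "name": "not-shared",
    "purchases": "not-shared",
}


def find_conflicts(site_policy, user_preferences):
    """Return the data items whose handling differs from what the user wants."""
    return sorted(
        item
        for item, wanted in user_preferences.items()
        if site_policy.get(item) != wanted
    )


conflicts = find_conflicts(SITE_POLICY, USER_PREFERENCES)
if conflicts:
    print("Policy differs from your preferences for:", ", ".join(conflicts))
else:
    print("Policy matches your preferences")
```

In this sketch the agent only notifies; as the slides say, the user then decides how to proceed.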
P3P and Legal Issues
 P3P does not replace laws
 P3P works together with the law
 What happens if the web sites do not honor their P3P policies?
- Then appropriate legal actions will have to be taken
 XML is the technology to specify P3P policies
 Policy experts will have to specify the policies
 Technologists will have to develop the specifications
 Legal experts will have to take actions if the policies are
violated
Privacy for Assured Information Sharing
[Figure: federation diagram. Labels: data/policy for federation; three export data/policy boxes; component data/policy for Agency A, Agency B, and Agency C.]
Key Points
 1. There is no universal definition for privacy; each
organization must define what it means by privacy and
develop appropriate privacy policies
 2. Technology alone is not sufficient for privacy. We need
technologists, policy experts, legal experts and social
scientists to work on privacy
 3. Some well-known people have said "Forget about privacy".
Therefore, should we pursue research on privacy?
- There are interesting research problems, so there is a need to continue
with research
- Something is better than nothing
- Try to prevent privacy violations and if violations occur
then prosecute
 4. We need to tackle privacy from all directions
Application Specific Privacy?
 Examining privacy may make sense for healthcare and
financial applications
 Does privacy work for Defense and Intelligence applications?
 Is it meaningful to have privacy for surveillance and
geospatial applications
- Once the image of my house is on Google Earth, then how
much privacy can I have?
- I may want my location to be private, but does it make
sense if a camera can capture a picture of me?
- If there are sensors all over the place, is it meaningful to
have privacy preserving surveillance?
 This suggests that we need application specific privacy
 It is not meaningful to examine PPDM for every data mining
algorithm and for every application
Data Mining and Privacy: Friends or Foes?
 They are neither friends nor foes
 Need advances in both data mining and privacy
 Need to design flexible systems
- For some applications one may have to focus entirely on
“pure” data mining while for some others there may be a
need for “privacy-preserving” data mining
- Need flexible data mining techniques that can adapt to the
changing environments
 Technologists, legal specialists, social scientists, policy
makers and privacy advocates MUST work together
Popular Social Networks

Facebook - A social networking website. Initially the membership was restricted to
students of Harvard University. It was originally based on what first-year students were
given, called the “face book”, which was a way to get to know other students on campus.
As of July 2007, there were over 34 million active members worldwide. From September 2006 to
September 2007 it rose from the 60th to the 6th most visited web site, and was the
number one site for photos in the United States.

Twitter - A free social networking and micro-blogging service that allows users to send
“updates” (text-based posts, up to 140 characters long) via SMS, instant messaging,
email, to the Twitter website, or an application/widget within a space of your choice, like
MySpace, Facebook, a blog, an RSS aggregator/reader.

MySpace - A popular social networking website offering an interactive, user-submitted
network of friends, personal profiles, blogs, groups, photos, music and videos
internationally. According to Alexa Internet, MySpace is currently the world’s sixth most
popular English-language website and the sixth most popular website in any language,
and the third most popular website in the United States, though it has topped the chart on
various weeks. As of September 7, 2007, there are over 200 million accounts.
Social Networks: More formal definition

A structural approach to understanding
social interaction.

Networks consist of Actors and the Ties
between them.

We represent social networks
as graphs whose vertices are
the actors and whose edges
are the ties.

Edges are usually weighted to
show the strength of the tie.

In the simplest networks, an Actor is an
individual person.

A tie might be “is acquainted with”. Or it
might represent the amount of email
exchanged between persons A and B.
Social Network Examples

Effects of urbanization on individual wellbeing

World political and economic system

Community elite decision-making

Social support, Group problem solving

Diffusion and adoption of innovations

Belief systems, Social influence

Markets, Sociology of science

Exchange and power

Email, Instant messaging, Newsgroups

Co-authorship, Citation, Co-citation

SocNet software, Friendster

Blogs and diaries, Blog quotes and links
History

“Sociograms” were invented in 1933 by Moreno.

In a sociogram, the actors are represented as points in a two-dimensional
space. The location of each actor is significant. E.g. a “central actor” is plotted
in the center, and others are placed in concentric rings according to “distance”
from this actor.

Actors are joined with lines representing ties, as in a social network. In other
words a social network is a graph, and a sociogram is a particular 2D
embedding of it.

These days, sociograms are rarely used (most examples on the web are not
sociograms at all, but networks). But methods like MDS (Multi-Dimensional
Scaling) can be used to lay out Actors, given a vector of attributes about them.

Social Networks were studied early by researchers in graph theory (Harary et
al. 1950s). Some social network properties can be computed directly from the
graph.

Others depend on an adjacency matrix representation (Actors index rows and
columns of a matrix, matrix elements represent the tie strength between them).
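The adjacency matrix representation can be sketched in a few lines. The actors and tie strengths below are invented for illustration; ties are treated as undirected, which is an assumption (email counts, for instance, could also be modelled as directed).

```python
# Hypothetical actors and weighted, undirected ties (weight = tie strength).
actors = ["A", "B", "C", "D"]
ties = [("A", "B", 3), ("A", "C", 1), ("B", "C", 2), ("C", "D", 5)]


def adjacency_matrix(actors, ties):
    """Rows and columns are indexed by actor; entries hold tie strength (0 = no tie)."""
    index = {actor: i for i, actor in enumerate(actors)}
    matrix = [[0] * len(actors) for _ in actors]
    for a, b, weight in ties:
        matrix[index[a]][index[b]] = weight
        matrix[index[b]][index[a]] = weight  # symmetric because ties are undirected
    return matrix


def degree(matrix, i):
    """Number of actors that actor i is tied to."""
    return sum(1 for weight in matrix[i] if weight > 0)


def strength(matrix, i):
    """Total tie strength of actor i."""
    return sum(matrix[i])


matrix = adjacency_matrix(actors, ties)
for actor, row in zip(actors, matrix):
    print(actor, row)
```

Properties such as degree and strength are then simple row operations on the matrix.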
Social Network Analysis of 9/11 Terrorists
(www.orgnet.com)
Early in 2000, the CIA was informed of two terrorist suspects linked to al-Qaeda.
Nawaf Alhazmi and Khalid Almihdhar were photographed attending a meeting of
known terrorists in Malaysia. After the meeting they returned to Los Angeles,
where they had already set up residence in late 1999.
Social Network Analysis of 9/11 Terrorists
What do you do with these suspects? Arrest or deport them
immediately? No, we need to use them to discover more of the al-Qaeda network.
Once suspects have been discovered, we can use their daily activities
to uncloak their network. Just like they used our technology against
us, we can use their planning process against them. Watch them, and
listen to their conversations to see...
•who they call / email
•who visits with them locally and in other cities
•where their money comes from
The structure of their extended network begins to emerge as data is
discovered via surveillance.
Social Network Analysis of 9/11 Terrorists
A suspect being monitored may have many contacts -- both accidental and intentional. We
must always be wary of 'guilt by association'. Accidental contacts, like the mail delivery
person, the grocery store clerk, and neighbor may not be viewed with investigative interest.
Intentional contacts are like the late afternoon visitor, whose car license plate is traced back to
a rental company at the airport, where we discover he arrived from Toronto (got to notify the
Canadians) and his name matches a cell phone number (with a Buffalo, NY area code) that our
suspect calls regularly. This intentional contact is added to our map and we start tracking his
interactions -- where do they lead? As data comes in, a picture of the terrorist organization
slowly comes into focus.
How do investigators know whether they are on to something big? Often they don't. Yet in this
case there was another strong clue that Alhazmi and Almihdhar were up to no good -- the
attack on the USS Cole in October of 2000. One of the chief suspects in the Cole bombing
[Khallad] was also present [along with Alhazmi and Almihdhar] at the terrorist meeting in
Malaysia in January 2000.
Once we have their direct links, the next step is to find their indirect ties -- the 'connections of
their connections'. Discovering the nodes and links within two steps of the suspects usually
starts to reveal much about their network. Key individuals in the local network begin to stand
out. In viewing the network map in Figure 2, most of us will focus on Mohammed Atta because
we now know his history. The investigator uncloaking this network would not be aware of
Atta's eventual importance. At this point he is just another node to be investigated.
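Finding the nodes within two steps of the known suspects is a bounded breadth-first search. The sketch below uses an invented graph with placeholder node names; it illustrates the general technique and is not the analysis or data from orgnet.com.

```python
from collections import deque


def within_steps(graph, seeds, max_steps):
    """Return {node: distance} for every node within max_steps links of any seed."""
    distance = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if distance[node] == max_steps:
            continue  # do not expand beyond the step limit
        for neighbour in graph.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance


# Hypothetical contact graph (undirected, stored as adjacency lists).
graph = {
    "s1": ["a", "b"],
    "s2": ["b"],
    "a": ["s1", "c"],
    "b": ["s1", "s2", "d"],
    "c": ["a", "e"],
    "d": ["b"],
    "e": ["c"],
}

reachable = within_steps(graph, seeds=["s1", "s2"], max_steps=2)
for node, steps in sorted(reachable.items()):
    print(node, steps)
```

Here node `e` is three links away from the nearest seed, so it is not reported at the two-step limit.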
Figure 2 shows the two suspects and
Social Network Analysis of 9/11 Terrorists
We now have enough data for two key conclusions:
• All 19 hijackers were within 2 steps of the two original suspects uncovered in 2000!
• Social network metrics reveal Mohammed Atta emerging as the local leader
With hindsight, we have now mapped enough of the 9-11 conspiracy to stop it. Again, the
investigators are never sure they have uncovered enough information while they are in
the process of uncloaking the covert organization. They also have to contend with
superfluous data. This data was gathered after the event, so the investigators knew
exactly what to look for. Before an event it is not so easy.
As the network structure emerges, a key dynamic that needs to be closely monitored is the
activity within the network. Network activity spikes when a planned event approaches. Is
there an increase of flow across known links? Are new links rapidly emerging between
known nodes? Are money flows suddenly going in the opposite direction? When activity
reaches a certain pattern and threshold, it is time to stop monitoring the network, and
time to start removing nodes.
The author argues that this bottom-up approach of uncloaking a network is more effective
than a top down search for the terrorist needle in the public haystack -- and it is less
invasive of the general population, resulting in far fewer "false positives".
Social Network Analysis of Steroid Usage in Baseball
(www.orgnet.com)
When the Mitchell Report on steroid use in Major League Baseball [MLB], was published, people were
surprised at who and how many players were mentioned. The diagram below shows a human network created
from data found in the Mitchell Report. Baseball players are shown as green nodes. Those who were found to
be providers of steroids and other illegal performance enhancing substances appear as red nodes. The links
reveal the flow of chemicals -- from provider to player.
Knowledge Sharing in Organizations: Finding Experts
Knowledge Sharing Network: Finding Experts
(www.orgnet.com)
Organizational leaders are preparing for the potential loss of expertise and knowledge flow
due to turnover, downsizing, outsourcing, and the coming retirements of the baby boom
generation. The model network (previous chart) is used to illustrate the knowledge continuity
analysis process.
Each node in this sample network (previous chart) represents a person that works in a
knowledge domain. Some people have more / different knowledge than others. Employees
who will retire in 2 years or less have their nodes colored red. Those who will retire in 3-4
years are colored yellow. Those retiring in 5 years or later are colored green.
A gray, directed line is drawn from the seeker of knowledge to the source of expertise. A-->B
indicates that A seeks expertise / advice from B. Those with many arrows pointing to them are
sought often for assistance.
The top subject matter experts -- SMEs -- in this group are nodes 29, 46, 100, 41, 36 and 55.
The SMEs were discovered using a network metric in InFlow that is similar to how the
Google search engine ranks web pages -- using both direct and indirect links.
Of the top six SMEs in this group, half are colored red[100] or yellow[46, 55]. The loss of
person 46 has the greatest potential for knowledge loss. 90% of the network is within
3 steps of accessing this key knowledge source.
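A ranking that uses both direct and indirect links can be sketched with a PageRank-style power iteration over the "seeks advice from" arrows. This is an assumption about the general family of metric; the actual InFlow metric is not specified in the text, and the graph and parameter values below are invented.

```python
def rank_experts(edges, damping=0.85, iterations=50):
    """Rank nodes by a PageRank-style score; edges are (seeker, source) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for seeker, source in edges:
        out_links[seeker].append(source)

    score = {node: 1 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_score = {node: (1 - damping) / len(nodes) for node in nodes}
        for node in nodes:
            # A node that seeks nobody spreads its score evenly over all nodes.
            targets = out_links[node] or nodes
            share = damping * score[node] / len(targets)
            for target in targets:
                new_score[target] += share
        score = new_score
    return sorted(nodes, key=lambda node: score[node], reverse=True)


# Hypothetical advice network: A->B means A seeks expertise from B.
edges = [("A", "B"), ("C", "B"), ("D", "B"), ("B", "E"), ("D", "E")]
ranking = rank_experts(edges)
print(ranking)
```

In this example B has more direct seekers than E, yet E ranks first because the much-consulted B in turn relies on E; that is the effect of counting indirect links.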
Social Networks: Security and Privacy Issues: European
Network and Information Security Agency

The European Network and Information Security Agency (ENISA) has released its first
issue paper "Security Issues and Recommendations for Online Social Networks".


http://www.enisa.europa.eu/doc/pdf/deliverables/enisa_pp_social_networks.pdf
Four groups of threats: privacy related threats, variants of traditional network and
information security threats, identity related threats, social threats.

Recommendations are given for governments (oversight and adaptation of existing data
protection legislation), companies that run such networks, technology developers, and
research and standardisation bodies.

Some concerns: the paper recommends using automated filters against "offensive, litigious or
illegal content". This raises potential freedom of speech issues. European Digital Rights
has started a campaign against a similar recommendation by the Council of Europe.
The portability of profiles and social graphs is also addressed. However, what is missing
is that “Information about social links is not about only one user, but also the others
which he is linked to. They have to agree if this information is moved to different
platforms”.
Social Networks: Security and Privacy Issues: Microsoft
Recommendations http://www.microsoft.com/protect/yourself/personal/communities.mspx

Online communities require you to provide personal information. Profiles are public.
Comments you post are permanently recorded on the community site. You might even
mention when you plan to be out of town.

E-mail and phishing scammers count on the appealing sense of trust that is often
fostered in online communities to steal your personal information. The more you reveal in
profiles and posts, the more vulnerable you are to scams, spam, and identity theft.

Here are some features to look for when you're considering joining an online community:
• Privacy policies that explain exactly what information the service will collect and
how it might be used.
• User guidelines that outline a basic code of conduct for users
on their sites. Sites have the option to penalize reported violators with account
suspension or termination.
• Special provisions for children and their parents, such as
family-friendly options geared towards protecting children under a certain
age.
• Password protection to help keep your account secure.
• E-mail address hiding,
which lets you display only part of your e-mail address on the site's membership
lists.
• Filtering options: offered on blogging sites, these tools let you choose which
subscribers can see what you've written.
Role of Semantic Web
 FOAF (Friend of a Friend)
 Social Graph represented in RDF
 Use the reasoning tools and analyze the social network for
suspicious events
 Protect the privacy of individuals
FOAF: http://www.foaf-project.org/about
http://en.wikipedia.org/wiki/FOAF_(software)
 FOAF (an acronym of Friend of a Friend) is a machine-
readable ontology describing persons, their activities and
their relations to other people and objects. Anyone can use
FOAF to describe him or herself. FOAF allows groups of
people to describe social networks without the need for a
centralised database.
 FOAF's descriptive vocabulary is expressed using RDF
Resource Description Framework and OWL Web Ontology
Language.
 Computers may use these FOAF profiles to find, for example,
all people living in Europe, or to list all people both you and a
friend of yours know. This is accomplished by defining
relationships between people. Each profile has a unique
identifier (such as the person's e-mail address, or a URI of the
homepage or weblog of the person), which is used when
defining these relationships.
FOAF: http://www.foaf-project.org/about
http://en.wikipedia.org/wiki/FOAF_(software)
 The FOAF project, which defines and extends the vocabulary
of a FOAF profile, was started in 2000 by Libby Miller and Dan Brickley. It can be
considered the first Social Semantic Web application, in that it
combines RDF technology with 'Social Web' concerns.
 Tim Berners-Lee in a recent essay redefined the Semantic
web concept into something he calls the Giant Global Graph,
where relationships transcend networks/documents. He
considers the GGG to be on equal grounds with Internet and
World Wide Web, stating that "I express my network in a
FOAF file, and that is a start of the revolution."
FOAF: http://www.foaf-project.org/about
http://en.wikipedia.org/wiki/FOAF_(software)
 The following FOAF profile (written in XML format) states that Jimmy Wales
is the name of the person described here. His e-mail address, homepage and
depiction are resources, which means that each of them can be described
using RDF as well. He has Wikipedia as an interest, and knows Angela
Beesley (which is the name of a 'Person' resource).
<rdf:RDF xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <foaf:Person rdf:about="#JW">
    <foaf:name>Jimmy Wales</foaf:name>
    <foaf:mbox rdf:resource="mailto:jwales@bomis.com" />
    <foaf:homepage rdf:resource="http://www.jimmywales.com/" />
    <foaf:nick>Jimbo</foaf:nick>
    <foaf:depiction rdf:resource="http://www.jimmywales.com/aus_img_small.jpg" />
    <foaf:interest rdf:resource="http://www.wikimedia.org" rdfs:label="Wikipedia" />
    <foaf:knows>
      <foaf:Person>
        <foaf:name>Angela Beesley</foaf:name> <!-- Wikimedia Board of Trustees -->
      </foaf:Person>
    </foaf:knows>
  </foaf:Person>
</rdf:RDF>
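As a rough illustration of how software can read such a profile, the sketch below parses a shortened copy of the FOAF document with Python's standard `xml.etree.ElementTree` and lists who the described person knows. It treats the profile as plain XML rather than using an RDF library, which is a simplifying assumption; a real application would normally use an RDF parser.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes used in the profile.
NS = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}

# Shortened copy of the profile shown above.
FOAF_XML = """<rdf:RDF xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <foaf:Person rdf:about="#JW">
    <foaf:name>Jimmy Wales</foaf:name>
    <foaf:knows>
      <foaf:Person>
        <foaf:name>Angela Beesley</foaf:name>
      </foaf:Person>
    </foaf:knows>
  </foaf:Person>
</rdf:RDF>"""

root = ET.fromstring(FOAF_XML)
person = root.find("foaf:Person", NS)
name = person.findtext("foaf:name", namespaces=NS)
known = [
    friend.findtext("foaf:name", namespaces=NS)
    for friend in person.findall("foaf:knows/foaf:Person", NS)
]

print(f"{name} knows: {', '.join(known)}")
```

Each `foaf:knows` entry is an edge in the social graph, so a collection of such profiles yields the graph that the reasoning tools would analyze.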