An Ontological Approach to the
Document Access Problem of
Insider Threat
Boanerges Aleman-Meza (1)
Phillip Burns (2)
Matthew Eavenson (1)
Devanand Palaniswami (1)
Amit P. Sheth (1)
(1) LSDIS Lab, Computer Science Dept., University of Georgia, USA
(2) CTA – Computer Technology Associates, USA
ISI 2005 (May 20)
Objective & Approach
 Determine if (classified) documents reviewed by
an IC analyst satisfy his/her “need to know”
Characterization of “need to know” w.r.t. ontology
Characterizing document content in terms of
ontology
Discovering weighted semantic relationships
between document content and “need to know”
6/21/2004
Characterizing “Need to Know”
using a Semantic Approach (using Ontology)
 Requires domain ontology
models important concepts & relationships of
domain (schema), captures factual knowledge
(instances)
 Relate analyst’s need to know to concepts &
relationships in ontology
e.g. terrorist organization, funding sources,
facilitators, members, methods
“Need to know” = context of investigation
26,489 entities
34,513 (explicit) relationships
Add relationship
to context
Characterizing document content in terms
of ontology “Semantic Annotation”
 Correlate words/phrases from document with
entities/relationships in ontology
 Entity identification
 Meta-data added to document (from
associated ontological knowledge)
 Active area of research but practically useful
technology now available
 Constrained to content of ontology
Semantic Relationships between
Document & “Need to Know”
 Semantic associations: relationships between
document concepts & “need to know”
concepts are discovered and ranked
 Ranking based on multiple factors
no. of links, types of links, location in ontology, …
 Ranking indicates degree of semantic
“closeness”
and therefore, how related document is to “need
to know”
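The ranking factors listed above (number of links, types of links) can be illustrated with a small sketch. The toy graph, the link-type weights, and every function name below are hypothetical and chosen only for illustration; the slide does not specify the ranking function actually used in the project.

```python
from collections import deque

# Hypothetical toy ontology: (subject, relationship, object) triples.
TRIPLES = [
    ("e1:Person", "friends with", "e3:Person"),
    ("e3:Person", "listed in", "e4:WatchList"),
    ("e1:Person", "citizen of", "e2:Country"),
    ("e1:Person", "listed in", "e4:WatchList"),
    ("e3:Person", "member of", "e7:TerrorOrganization"),
    ("e7:TerrorOrganization", "claims responsibility for", "e8:Event"),
]

# Assumed link-type weights (illustrative values, not from the source).
LINK_WEIGHT = {
    "friends with": 0.5,
    "listed in": 1.0,
    "citizen of": 0.3,
    "member of": 1.0,
    "claims responsibility for": 0.9,
}


def build_graph(triples):
    """Undirected adjacency map: node -> list of (neighbor, relationship)."""
    graph = {}
    for subj, rel, obj in triples:
        graph.setdefault(subj, []).append((obj, rel))
        graph.setdefault(obj, []).append((subj, rel))
    return graph


def find_paths(graph, start, goal, max_links=3):
    """Enumerate simple paths from start to goal as lists of relationship names."""
    paths = []
    queue = deque([(start, [start], [])])
    while queue:
        node, visited, rels = queue.popleft()
        if node == goal and rels:
            paths.append(rels)
            continue
        if len(rels) == max_links:
            continue
        for nxt, rel in graph.get(node, []):
            if nxt not in visited:
                queue.append((nxt, visited + [nxt], rels + [rel]))
    return paths


def score(rels):
    """Assumed scoring: mean link weight divided by the number of links."""
    mean_weight = sum(LINK_WEIGHT.get(r, 0.1) for r in rels) / len(rels)
    return mean_weight / len(rels)


def rank_associations(graph, doc_entity, context_entity):
    """Return the discovered associations, semantically 'closest' first."""
    found = find_paths(graph, doc_entity, context_entity)
    return sorted(found, key=score, reverse=True)


graph = build_graph(TRIPLES)
ranked = rank_associations(graph, "e1:Person", "e4:WatchList")
```

In this sketch a direct "listed in" link outranks the two-link path through a friend, which mirrors the idea that fewer and stronger links indicate a closer association.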
Documents
Ranking
 Highly relevant
 Closely related
 Ambiguous
 Not relevant
 Undeterminable
Research Content
 Discovery & Ranking of semantic associations
 Characterizing “need to know” in terms of
ontological concepts & relationships
 Meta-data annotation of data and (semistructured & unstructured) documents
correlation of document content & concepts in
ontology
Research Challenges
In this project we are addressing:
 Discovery of Semantic Associations per entity
per document
 Input/Visualization/Management of Context of
Investigation
 Scalability on number of documents & ontology
size
Performs well with thousand documents
 Ranking of documents
Ranking of Documents Relevance
“Closely related entities are more relevant than distant entities”
\[ E = \{\, e \mid e \in \text{Document} \,\} \]
\[ E_k = \{\, f \mid \text{distance}(f,\ e \in E) = k \,\} \]
\[
\text{Document Relevance} = \sum_{k=0}^{n}
\begin{bmatrix}
\text{classes\_Relevance}(E_k) \\
\text{relations\_Relevance}(E_k) \\
\text{entities\_Relevance}(E_k)
\end{bmatrix}
\]
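A minimal sketch of how the relevance sum over the sets Ek might be computed. The slide does not say how the three components (classes, relations, entities) are combined or weighted, so this example assumes they are simply added and discounted by 1/(k+1) so that closer entities count more. All names and the toy data are hypothetical.

```python
from collections import deque


def entities_by_distance(graph, doc_entities, max_k):
    """Return {k: entities at shortest distance k from any document entity}."""
    dist = {e: 0 for e in doc_entities}
    queue = deque(doc_entities)
    while queue:
        node = queue.popleft()
        if dist[node] == max_k:
            continue  # do not expand beyond the largest distance of interest
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    levels = {k: set() for k in range(max_k + 1)}
    for entity, k in dist.items():
        levels[k].add(entity)
    return levels


def document_relevance(levels, entity_class, entity_relations, context):
    """Sum the three components over every E_k (addition is an assumption)."""
    total = 0.0
    for k, entities in levels.items():
        discount = 1.0 / (k + 1)  # assumed: closer entities are more relevant
        classes = sum(entity_class.get(e) in context["classes"] for e in entities)
        relations = sum(
            len(entity_relations.get(e, set()) & context["relations"])
            for e in entities
        )
        names = sum(e in context["entities"] for e in entities)
        total += discount * (classes + relations + names)
    return total


# Hypothetical toy data: e1 appears in the document; e3 and e4 are nearby.
GRAPH = {"e1": ["e3"], "e3": ["e1", "e4"], "e4": ["e3"]}
ENTITY_CLASS = {"e1": "Person", "e3": "Person", "e4": "WatchList"}
ENTITY_RELATIONS = {
    "e1": {"friends with"},
    "e3": {"friends with", "listed in"},
    "e4": {"listed in"},
}
CONTEXT = {"classes": {"WatchList"}, "relations": {"listed in"}, "entities": {"e3"}}

levels = entities_by_distance(GRAPH, ["e1"], max_k=2)
relevance = document_relevance(levels, ENTITY_CLASS, ENTITY_RELATIONS, CONTEXT)
```

Here E0 = {e1}, E1 = {e3}, and E2 = {e4}; e3 contributes through a matching relation and a matching entity name, and e4 through a matching class and relation, each scaled by its distance discount.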
Components of Document Relevance
1. Entities belong to classes in the Context
type(entity) ∈ Context
2. Relationships constraints
Relationship ∈ [Class]
3. Entities match a list of entities of interest (in the Context)
entity ∈ Entities-List (specific entities)
• Abu Abdallah
• Turkmenistan
• Konduz Province
• …
[Figure: the Context of Investigation drawn as a graph over the entities e1:Person, e2:Country, e3:Person, e4:WatchList, e5:Person, e6:Company, e6:State, e7:Terror Organization, e8:Event, and e9:Person, connected by the relationships “citizen of”, “friends with”, “listed in”, “works at”, “lives in”, and “claims responsibility for”. The graph is repeated once per component with the matching elements highlighted.]
Schematic of Ontological Approach to the Legitimate Access Problem
[Figure: system schematic; two components are labeled “Semagix Freedom”.]
Conclusions
 New Semantic Approach to the challenging
problem
 Viability demonstrated on a small scale
 Significant new research that builds upon the
latest Semantic Platform
 Many applications of this approach: vendor
vetting, knowledge discovery, …
Acknowledgements
 Semagix provided technology to populate
ontology using knowledge extraction, and
(semi-)automatic metadata extraction from
documents (Freedom toolkit).
 NSF-funded projects provided core research:
"Semantic Association Identification and Knowledge
Discovery for National Security Applications" (Grant No.
IIS-0219649) and "Semantic Discovery: Discovering
Complex Relationships in Semantic Web" (Grant No. IIS-0325464)
Security and Terrorism Part of SWETO Ontology
Semantic Annotation
 Document searched for entity names (or synonyms)
contained in ontology
 Then document entities are annotated with
additional information from corresponding entities in
ontology including named relationships to other
entities
 The following chart is an example
 Highlighted text marks entities found that correspond to
concepts in the ontology
 The XML is the corresponding meta-data annotation
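The annotation step described above (find entity names or synonyms, then attach ontology metadata) can be sketched as a simple dictionary-based matcher. This is only an illustration under stated assumptions: the ontology fragment, the relationship values, the `<entity>` tag, and its attributes are all invented here and are not the format produced by the Freedom toolkit.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical ontology fragment: id -> class, names/synonyms, named relationships.
# The relationship listed for e3 is invented purely for illustration.
ONTOLOGY = {
    "e2": {"class": "Country", "names": ["Turkmenistan"], "relations": []},
    "e3": {
        "class": "Person",
        "names": ["Abu Abdallah"],
        "relations": [("listed in", "e4")],
    },
}


def annotate(text, ontology):
    """Wrap each known entity name in an <entity> tag carrying ontology metadata."""
    # Map every name or synonym (lower-cased) to its entity id.
    lookup = {
        name.lower(): eid
        for eid, info in ontology.items()
        for name in info["names"]
    }
    if not lookup:
        return escape(text)
    # Try longer names first so multi-word names win over shorter overlaps.
    names = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b", re.IGNORECASE
    )

    out, pos = [], 0
    for match in pattern.finditer(text):
        eid = lookup[match.group(0).lower()]
        info = ontology[eid]
        rels = ";".join(f"{rel}:{target}" for rel, target in info["relations"])
        out.append(escape(text[pos:match.start()]))
        out.append(
            f"<entity id={quoteattr(eid)} class={quoteattr(info['class'])} "
            f"relations={quoteattr(rels)}>{escape(match.group(0))}</entity>"
        )
        pos = match.end()
    out.append(escape(text[pos:]))
    return "".join(out)


annotated = annotate("Abu Abdallah travelled to Turkmenistan.", ONTOLOGY)
```

As the earlier slide notes, such an annotator is constrained to the content of the ontology: names that are not in the lookup table are left unannotated.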
Relevance Measures for Documents
(relating document content to IA “need to know”)
 Relevance engine input
 the set of semantically annotated documents
the context of investigation for the assignment
the ontology schema represented in RDFS, and
the ontology instances represented in RDF
 Relevance measure function used to verify
whether the entity annotations in the
annotated document can be fit into the entity
classes, entity instances, and/or keywords
specified in the context of investigation.
The Big Picture
[Figure: architecture diagram. Labels: Open/proprietary Heterogeneous Data Sources; documents; databases; Html pages; XML feeds; emails; Semistructured data; Trusted Sources; “populates” (two arrows); Ontology / knowledge base; Massive Metadata Store; Knowledge Discovery Algorithms; Browsing API; SWETO Web Service.]
SWETO – Ontology Schema Visualization
See SemDis project of LSDIS Lab, University of Georgia
Relevance Measures for Documents
(relating document content to IA “need to know”) (cont)
 Documents classified as:
 Highly relevant
 Document entities directly related
 Closely related
 Document entities related through strong semantic
associations
 Ambiguous
 Document entities related through weak semantic
associations
 Not relevant
 Document entities not related to “need to know”
 Undeterminable
 Document entities not found in ontology
IA Context of Investigation
(characterization of “Need to Know”)
We define the context of investigation as a combination of
the following:
 A set of entity classes and relationships, and/or a
negation of a set of entity classes and relationships
 A set of entity instance names, and/or a negation of a set
of entity instance names
 A set of keyword values that might appear at any
attribute of the populated instance data, and/or a
negation of a set of keyword values
Context of Investigation (cont)
 Goal is to capture, at a high level, the types of
entities (or relationships) that are considered
important.
 Relationships can be constrained to be associated
with specified class types
 E.g., it can be specified that the relation ‘affiliated with’ is part
of the context only when it is connected to an entity that
belongs to a specific class, say, ‘Terror Organization’
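One possible way to represent such a context of investigation is sketched below. The class name, field names, and matching rules are assumptions made for this example; in particular, each "negation" is modeled as an exclusion set that overrides a positive match, which is one reading of the definition rather than a confirmed one.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class ContextOfInvestigation:
    """Hypothetical container for a 'need to know' description."""
    classes: Set[str] = field(default_factory=set)
    excluded_classes: Set[str] = field(default_factory=set)
    entities: Set[str] = field(default_factory=set)
    excluded_entities: Set[str] = field(default_factory=set)
    keywords: Set[str] = field(default_factory=set)
    excluded_keywords: Set[str] = field(default_factory=set)
    # Relationship name -> class the connected entity must belong to
    # (None means the relationship is in the context for any class).
    relations: Dict[str, Optional[str]] = field(default_factory=dict)

    def matches_entity(self, name: str, cls: str) -> bool:
        if name in self.excluded_entities or cls in self.excluded_classes:
            return False
        return name in self.entities or cls in self.classes

    def matches_relation(self, relation: str, connected_class: str) -> bool:
        if relation not in self.relations:
            return False
        required = self.relations[relation]
        return required is None or required == connected_class

    def matches_keyword(self, attribute_value: str) -> bool:
        value = attribute_value.lower()
        if any(k.lower() in value for k in self.excluded_keywords):
            return False
        return any(k.lower() in value for k in self.keywords)


# Example values are invented; only 'affiliated with' and
# 'Terror Organization' come from the slide text.
ctx = ContextOfInvestigation(
    classes={"Terror Organization"},
    excluded_entities={"Example Org"},
    keywords={"funding"},
    relations={"affiliated with": "Terror Organization"},
)
```

With this context, `ctx.matches_relation("affiliated with", "Company")` is false because the relationship is constrained to entities of class ‘Terror Organization’.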
Ranking of Documents Relevance
Four groups of document-ranking:
- Not Related Documents
  - unable to determine relation to context
- Ambiguously Related Documents
  - some relationship exists to the context
- Somehow Related Documents
  - Entities are closely related to the context
- Highly Related Documents
  - Entities are a direct match to the context
Cut-off values determine grouping of documents w.r.t. relevance
- These are customizable cut-off values (more control and more meaningful parameters compared to, say, automatic classification or statistical approaches)
“Inspection” of a document is possible via (a) original document or (b) original document with highlighted entities
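The grouping by customizable cut-off values can be sketched as follows. The assumption here is that each document already has a relevance score normalized to the range 0 to 1; the default cut-offs (0.25, 0.5, 0.75) and the function name are placeholders, not values from the project.

```python
def group_document(score, cutoffs=(0.25, 0.5, 0.75)):
    """Map a normalized relevance score to one of the four ranking groups.

    `cutoffs` holds the (assumed) lower bounds of the Ambiguously, Somehow,
    and Highly Related groups, in increasing order.
    """
    low, mid, high = cutoffs
    if score >= high:
        return "Highly Related"
    if score >= mid:
        return "Somehow Related"
    if score >= low:
        return "Ambiguously Related"
    return "Not Related"


# Hypothetical scores for three documents.
groups = {doc: group_document(s) for doc, s in {"d1": 0.9, "d2": 0.3, "d3": 0.1}.items()}

# An analyst can tighten the top group by passing different cut-offs.
strict = group_document(0.8, cutoffs=(0.4, 0.7, 0.95))
```

Because the cut-offs are plain parameters, changing how strict each group is does not require retraining or re-running a classifier, which is the kind of control the slide contrasts with automatic classification.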