Big Data and How to Overcome the Problems it Causes

Ontology Engineering CSE 510/PHI 598
Fall 2014
September 8, 2014
Big Data Problem
• Wikipedia defines Big Data as “…a collection of data
sets so large and complex that it becomes difficult to
process using on-hand database management tools.”
• Gartner defines Big Data with three ‘V’s:
– Volume
– Velocity (of production and analysis)
– Variety
• This means that Big Data are beyond our control (as
opposed to those complex and big systems with
diverse and changing data where the complexity is
known)
The Promise of Big Data
• Great insights can be obtained from large
diverse data sets if properly exploited with the
right analytics
• Proper exploitation requires solutions in the
areas of
– Hardware
– Software
– Method
Knowledge Representations: Attribute-Value Systems
Restaurant        | Cuisine  | Cost | Avg. Diner Review | Avg. Critic Review | Reservation Required
Tom’s Diner       | American | $    | 3.2               | 2.8                | No
Les Gros Poissons | French   | $$$$ | 4.5               | 4.8                | Yes
Il Grand Pesce    | Italian  | $$$  | 3.8               | 3.5                | Yes
El Gran Pez       | Spanish  | $$   | 4.3               | 4.4                | No
Den Stora Fisken  | Swedish  | $$$  | 3.2               | 4.8                | Yes
De Grote Vis      | Dutch    | $$$$ | 4.0               | 2.2                | Preferred
A Shortcoming of Attribute-Value Systems
• Duplicate Attributes
Restaurant        | Cuisine  | … | Owner          | Owner 2          | Owner 3
Tom’s Diner       | American |   | Tom Washington |                  |
Les Gros Poissons | French   |   | Jean Adams     | Simone Jefferson |
Il Grand Pesce    | Italian  |   | Robert Madison | Simone Jefferson |
El Gran Pez       | Spanish  |   | Louis Adams    |                  |
Den Stora Fisken  | Swedish  |   | Philip Jackson |                  |
De Grote Vis      | Dutch    |   | Kate Tyler     | Claire Van Buren | Susan Harrison
Relational Database Solutions
• 1st Normal Form – No Attributes which are themselves sets
Restaurant        | Cuisine  | … | Owner
Tom’s Diner       | American |   | Tom Washington
Les Gros Poissons | French   |   | Jean Adams
Les Gros Poissons | French   |   | Simone Jefferson
Il Grand Pesce    | Italian  |   | Robert Madison
Il Grand Pesce    | Italian  |   | Simone Jefferson
El Gran Pez       | Spanish  |   | Louis Adams
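The move to 1st Normal Form can be sketched in Python. This is only an illustration under stated assumptions: the record layout (an "owners" list per restaurant) and the function name `to_first_normal_form` are hypothetical and do not come from the slides.

```python
# Hypothetical records whose "owners" attribute is itself a set of values,
# which 1st Normal Form does not allow.
records = [
    {"restaurant": "Tom's Diner", "cuisine": "American",
     "owners": ["Tom Washington"]},
    {"restaurant": "Les Gros Poissons", "cuisine": "French",
     "owners": ["Jean Adams", "Simone Jefferson"]},
]


def to_first_normal_form(rows):
    """Return one flat row per (restaurant, owner) pair."""
    flat = []
    for row in rows:
        for owner in row["owners"]:
            flat.append({
                "restaurant": row["restaurant"],
                "cuisine": row["cuisine"],
                "owner": owner,
            })
    return flat


flat_rows = to_first_normal_form(records)
for row in flat_rows:
    print(row["restaurant"], "|", row["cuisine"], "|", row["owner"])
```

Each multi-owner restaurant now appears once per owner, so the cuisine value is repeated on every one of those rows.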
Rows Represent Unique Objects
• Each row now uniquely represents an aggregate entity of Restaurant and Owner
• This aggregate forms the primary key of the table
Restaurant        | Cuisine  | … | Owner
Tom’s Diner       | American |   | Tom Washington
Les Gros Poissons | French   |   | Jean Adams
Les Gros Poissons | French   |   | Simone Jefferson
Il Grand Pesce    | Italian  |   | Robert Madison
Il Grand Pesce    | Italian  |   | Simone Jefferson
El Gran Pez       | Spanish  |   | Louis Adams
A Shortcoming of 1st Normal Form
• Since attributes such as Cuisine depend on only a part of the primary key (i.e. Restaurant), the table risks inconsistency if an attribute is changed in one row for a restaurant but not in the others
Restaurant        | Cuisine  | … | Owner
Tom’s Diner       | American |   | Tom Washington
Les Gros Poissons | Creole   |   | Jean Adams
Les Gros Poissons | French   |   | Simone Jefferson
Il Grand Pesce    | Italian  |   | Robert Madison
Il Grand Pesce    | Italian  |   | Simone Jefferson
El Gran Pez       | Spanish  |   | Louis Adams
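The inconsistency shown on this slide can be detected mechanically. The following Python sketch assumes the table is held as a list of (restaurant, cuisine, owner) tuples; the function name `find_inconsistent_cuisines` is a hypothetical helper, not something from the slides.

```python
from collections import defaultdict

# The 1st Normal Form table from the slide, including the conflicting
# Cuisine values for Les Gros Poissons.
rows = [
    ("Tom's Diner", "American", "Tom Washington"),
    ("Les Gros Poissons", "Creole", "Jean Adams"),
    ("Les Gros Poissons", "French", "Simone Jefferson"),
    ("Il Grand Pesce", "Italian", "Robert Madison"),
    ("Il Grand Pesce", "Italian", "Simone Jefferson"),
    ("El Gran Pez", "Spanish", "Louis Adams"),
]


def find_inconsistent_cuisines(table):
    """Map each restaurant to its cuisines when more than one is recorded."""
    cuisines = defaultdict(set)
    for restaurant, cuisine, _owner in table:
        cuisines[restaurant].add(cuisine)
    return {name: sorted(values)
            for name, values in cuisines.items() if len(values) > 1}


conflicts = find_inconsistent_cuisines(rows)
print(conflicts)
```

A check like this only reports the symptom; removing the partial dependency is what prevents it.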
Relational Database Solutions
• 2nd Normal Form requires that any attribute must describe the object designated by the primary key rather than just some part of it
Restaurant        | Cuisine  | Cost | …
Tom’s Diner       | American | $    |
Les Gros Poissons | Creole   | $$$$ |
Il Grand Pesce    | Italian  | $$$  |
El Gran Pez       | Spanish  | $$   |
Den Stora Fisken  | Swedish  | $$$  |
De Grote Vis      | Dutch    | $$$$ |

Restaurant        | Owner
Tom’s Diner       | Tom Washington
Les Gros Poissons | Jean Adams
Les Gros Poissons | Simone Jefferson
Il Grand Pesce    | Robert Madison
Il Grand Pesce    | Simone Jefferson
El Gran Pez       | Louis Adams
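A minimal Python sketch of the 2nd Normal Form decomposition, assuming the same tuple layout of (restaurant, cuisine, owner). The function name `decompose` is hypothetical, and the sketch simply keeps the last cuisine seen for a restaurant rather than resolving conflicts.

```python
# 1st Normal Form rows: Cuisine depends only on Restaurant, not on Owner.
rows = [
    ("Tom's Diner", "American", "Tom Washington"),
    ("Les Gros Poissons", "French", "Jean Adams"),
    ("Les Gros Poissons", "French", "Simone Jefferson"),
    ("Il Grand Pesce", "Italian", "Robert Madison"),
    ("Il Grand Pesce", "Italian", "Simone Jefferson"),
    ("El Gran Pez", "Spanish", "Louis Adams"),
]


def decompose(table):
    """Split into a restaurant table and a restaurant-owner table."""
    restaurants = {}   # restaurant -> cuisine, stored once per restaurant
    ownership = []     # (restaurant, owner) pairs
    for restaurant, cuisine, owner in table:
        restaurants[restaurant] = cuisine  # last value wins in this sketch
        ownership.append((restaurant, owner))
    return restaurants, ownership


restaurants, ownership = decompose(rows)

# A change of cuisine is now a single update in a single place.
restaurants["Les Gros Poissons"] = "Creole"
print(restaurants)
print(ownership)
```

Because the cuisine is stored once per restaurant, the update cannot leave two owner rows disagreeing.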
A Shortcoming of 2nd Normal Form
• While both Date and Day of Purchase describe the unique object of the table (i.e. the Restaurant+Owner primary key), there are duplicate combinations of the two
• If one of the combinations is changed without the other, a date may be shown as falling on two days of the week
Restaurant        | Owner            | Date of Purchase | Day of Purchase
Tom’s Diner       | Tom Washington   | 5/3/1994         | Wednesday
Les Gros Poissons | Jean Adams       | 4/14/2008        | Friday
Les Gros Poissons | Simone Jefferson | 4/14/2008        | Saturday
Il Grand Pesce    | Robert Madison   | 10/28/2003       | Thursday
Il Grand Pesce    | Simone Jefferson | 2/2/1998         | Monday
El Gran Pez       | Louis Adams      | 7/30/2012        | Tuesday
Relational Database Solutions
• 3rd Normal Form requires that any attribute describes the entity represented by the primary key and only that entity
• No transitive descriptions as in the example from the previous slide
Restaurant        | Owner            | Date of Purchase
Tom’s Diner       | Tom Washington   | 5/3/1994
Les Gros Poissons | Jean Adams       | 4/14/2008
Les Gros Poissons | Simone Jefferson | 4/14/2008
Il Grand Pesce    | Robert Madison   | 10/28/2003
Il Grand Pesce    | Simone Jefferson | 2/2/1998
El Gran Pez       | Louis Adams      | 7/30/2012

Date       | Day of Week
5/3/1994   | Wednesday
4/14/2008  | Friday
10/28/2003 | Thursday
2/2/1998   | Monday
7/30/2012  | Tuesday
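A short Python sketch of how the two 3rd Normal Form tables are recombined at query time. The values are copied from the slide as given; the function name `with_day_of_week` is a hypothetical helper.

```python
# Purchases table: keyed by Restaurant + Owner, no Day of Week column.
purchases = [
    ("Tom's Diner", "Tom Washington", "5/3/1994"),
    ("Les Gros Poissons", "Jean Adams", "4/14/2008"),
    ("Les Gros Poissons", "Simone Jefferson", "4/14/2008"),
]

# Date table: each date appears once, so it can carry only one day of week.
day_of_week = {
    "5/3/1994": "Wednesday",
    "4/14/2008": "Friday",
}


def with_day_of_week(purchase_rows, days):
    """Join each purchase with the single day recorded for its date."""
    return [(restaurant, owner, date, days[date])
            for restaurant, owner, date in purchase_rows]


joined = with_day_of_week(purchases, day_of_week)
for row in joined:
    print(row)
```

Two purchases on the same date can no longer disagree about the day, because the day is looked up rather than stored on each purchase row.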
Knowledge Representations As Highly Designed Artifacts
Restaurant        | Cuisine  | Cost | …
Tom’s Diner       | American | $    |
Les Gros Poissons | Creole   | $$$$ |
Il Grand Pesce    | Italian  | $$$  |
El Gran Pez       | Spanish  | $$   |
Den Stora Fisken  | Swedish  | $$$  |
De Grote Vis      | Dutch    | $$$$ |

Restaurant        | Owner            | Date of Purchase
Tom’s Diner       | Tom Washington   | 5/3/1994
Les Gros Poissons | Jean Adams       | 4/14/2008
Les Gros Poissons | Simone Jefferson | 4/14/2008
Il Grand Pesce    | Robert Madison   | 10/28/2003
Il Grand Pesce    | Simone Jefferson | 2/2/1998
El Gran Pez       | Louis Adams      | 7/30/2012

Date       | Day of Week
5/3/1994   | Wednesday
4/14/2008  | Friday
10/28/2003 | Thursday
2/2/1998   | Monday
7/30/2012  | Tuesday
Application Translation Layers
Presentation Layer
Business Layer
Data Access Layer
Big Data Hardware Solution
• Big Data is costly to process and can overrun the capabilities of the largest single machines
• A solution is to distribute information across
many smaller machines
Hardware Solution is Contrary to Relational Design
• Relational databases are designed to run on single machines
• Attempting to disassemble them and run them on a cluster of machines is very difficult
• Big Data requires a different Data Model, one
that is cluster friendly, that is, one that can be
distributed while still being efficient at
retrieving the data that is needed
NoSQL Database Solutions
• Do not require a highly structured
representation of data, the data models are
relatively simple
– Key – Value Model
– Document Model
– Column Family Model
– Graph Model
Key-Value Data Model
• A key-value pair, in which each key is associated with some value
• The value can be any type of object: a number, a text value, an array, an image, a file, etc.
Key               -> Value
Tom’s Diner       -> Value associated with Tom’s Diner
Les Gros Poissons -> Value associated with Les Gros Poissons
Il Grand Pesce    -> Value associated with Il Grand Pesce
El Gran Pez       -> Value associated with El Gran Pez
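A key-value store can be sketched in a few lines of Python. The class name `KeyValueStore` and its methods are hypothetical and stand in for no particular product; the point is that the store only looks things up by key and treats the value as opaque.

```python
class KeyValueStore:
    """A toy in-memory key-value store; values are opaque to the store."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)


store = KeyValueStore()
# Values may be of any type: a structured record, a number, raw bytes, etc.
store.put("Tom's Diner", {"Cuisine": "American", "Cost": "$"})
store.put("Il Grand Pesce", 3.8)
store.put("El Gran Pez", b"raw bytes of an image or file")

print(store.get("Tom's Diner"))
print(store.get("Den Stora Fisken"))  # no such key, so the default is returned
```

Nothing in the store can answer "which restaurants serve American food?" without reading every value, which is the transparency gap the document model addresses.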
Document Data Model
• Each element is a document, that is, a complex data
structure of some type, usually expressed in JSON
(JavaScript Object Notation)
• No set schema for the documents
• More transparent than the Key-Value model
[
{
"id": 1,
"Name": "Tom's Diner",
"Cuisine": "American",
"Cost": "$",
"Average Diner Review": 3.2,
"Average Critic Review": 2.8,
"Reservation Required": "No",
"Owner": "Tom Washington"
}
]
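The lack of a set schema and the added transparency can both be seen in a small Python sketch. The first document is the one from the slide; the second document (its "id" of 2 and its reduced field set) and the helper `find` are assumptions made for illustration.

```python
import json

# Two documents with different fields: there is no set schema.
raw = '''
[
  {"id": 1, "Name": "Tom's Diner", "Cuisine": "American", "Cost": "$",
   "Average Diner Review": 3.2, "Average Critic Review": 2.8,
   "Reservation Required": "No", "Owner": "Tom Washington"},
  {"id": 2, "Name": "El Gran Pez", "Cuisine": "Spanish", "Cost": "$$"}
]
'''
documents = json.loads(raw)


def find(docs, field, value):
    """Return the documents whose given field equals the given value."""
    return [doc for doc in docs if doc.get(field) == value]


print(find(documents, "Cuisine", "American"))
print(find(documents, "Owner", "Tom Washington"))
```

Unlike an opaque value in a key-value store, the fields inside each document are visible to the query.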
Column Family Data Model
• A Row Key is associated with n-many column families (i.e. groups of columns that store related data)
Row Key: 1234
Restaurant Column Family
  Name       | “Tom’s Diner”
  Cuisine    | “American”
  Cost       | “$”
  Avg Review | 2.8
Owner Column Family
  Name       | “Tom Washington”
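The column family layout can be approximated with nested Python dictionaries: row key, then column family, then column. This is a sketch of the data shape only, and the helper name `read_family` is hypothetical.

```python
# row key -> column family -> column -> value
table = {
    "1234": {
        "Restaurant": {
            "Name": "Tom's Diner",
            "Cuisine": "American",
            "Cost": "$",
            "Avg Review": 2.8,
        },
        "Owner": {
            "Name": "Tom Washington",
        },
    },
}


def read_family(rows, row_key, family):
    """Fetch one column family of one row without touching the others."""
    return rows[row_key][family]


print(read_family(table, "1234", "Owner"))
print(read_family(table, "1234", "Restaurant")["Cuisine"])
```

Grouping related columns into a family lets a reader fetch, say, only the owner data for a row.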
Aggregate Orientation
• As noticed and described by Martin Fowler*, all of the aforementioned NoSQL data models share an orientation towards storing the description of a significant object as a single unit (an aggregate)
• This enables the distribution of data that tends to be requested together (cluster-friendly)
• It tends to be difficult to re-order the data to query by different aggregates
* NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, by Sadalage, P.J. and Fowler, M. (2012)
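Why aggregates are cluster-friendly can be sketched in Python: each aggregate is placed whole on one node, chosen from its key. The hashing scheme, the node count of 3, and the names `node_for` and `distribute` are all assumptions for illustration; real systems use their own partitioning strategies.

```python
import hashlib


def node_for(key, node_count):
    """Pick a node index for a key using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


def distribute(aggregates, node_count):
    """Place each whole aggregate on exactly one node."""
    nodes = [dict() for _ in range(node_count)]
    for key, aggregate in aggregates.items():
        nodes[node_for(key, node_count)][key] = aggregate
    return nodes


aggregates = {
    "Tom's Diner": {"Cuisine": "American", "Owner": "Tom Washington"},
    "El Gran Pez": {"Cuisine": "Spanish", "Owner": "Louis Adams"},
    "Den Stora Fisken": {"Cuisine": "Swedish", "Owner": "Philip Jackson"},
}
nodes = distribute(aggregates, 3)

# A lookup by key touches a single node and finds the full aggregate there.
key = "El Gran Pez"
print(nodes[node_for(key, 3)][key])
```

The difficulty named in the last bullet shows up here as well: finding every restaurant with a given owner means visiting every node, because the data is partitioned by restaurant.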
Graph Data Model
[Figure: graph centered on the node Tom’s Diner, linked to the nodes Restaurant, American Cuisine, Cost of $, Avg. Diner Review of 3.2, Avg. Critic Review of 2.8, Reservations Not Required, Owner Tom Washington, Date of Purchase 5/13/94, and Wednesday]
Graph Data Model
• Does not have an aggregate orientation,
rather the opposite, a granular orientation
that breaks the aggregate into its composite
elements
• Good for data exploration
• Still cluster-friendly: similar data can be stored in separate graphs
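The exploratory strength of a granular graph can be sketched with a breadth-first traversal in Python. The edge labels below (for example "has_owner") are assumptions chosen for illustration, as is the function name `reachable`.

```python
from collections import deque

# Each fact is its own edge: (source node, edge label, target node).
edges = [
    ("Tom's Diner", "is_a", "Restaurant"),
    ("Tom's Diner", "has_owner", "Tom Washington"),
    ("Tom's Diner", "offers", "American Cuisine"),
    ("Tom's Diner", "has_cost", "$"),
]


def reachable(graph_edges, start):
    """Return every node connected to start, ignoring edge direction."""
    adjacency = {}
    for source, _label, target in graph_edges:
        adjacency.setdefault(source, set()).add(target)
        adjacency.setdefault(target, set()).add(source)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen - {start}


# Exploration can start from any node, not only from the restaurant.
explored = reachable(edges, "Tom Washington")
print(sorted(explored))
```

Because no node is privileged as the aggregate root, the same data can be explored starting from the owner as easily as from the restaurant.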
RDF Data Model
• RDF specifies a regular syntax for well formed expressions
– rdf:statement – a simple expression that relates one entity to
another
– rdf:subject – the entity the statement is about
– rdf:predicate – the relationship said to hold between the two
entities
– rdf:object – the entity that is related to the subject
• Humans are mortal
• UB’s website homepage has URL http://www.buffalo.edu/
• Remus is the brother of Romulus
RDF Data Model
Subject     | Predicate                  | Object
Tom’s Diner | Is_a                       | Restaurant
Tom’s Diner | Offers                     | American Cuisine
Tom’s Diner | Costs                      | $
Tom’s Diner | Has_average_diners_review  | 3.2
Tom’s Diner | Has_average_critics_review | 2.8
Tom’s Diner | Requires_reservation       | No
Tom’s Diner | Has_owner                  | Tom Washington
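The subject-predicate-object pattern lends itself to simple pattern matching. The Python sketch below stores the statements as plain tuples and is not an RDF library; the function name `match` and the use of `None` as a wildcard are assumptions for illustration.

```python
# Statements as (subject, predicate, object) tuples.
triples = [
    ("Tom's Diner", "Is_a", "Restaurant"),
    ("Tom's Diner", "Offers", "American Cuisine"),
    ("Tom's Diner", "Costs", "$"),
    ("Tom's Diner", "Requires_reservation", "No"),
    ("Tom's Diner", "Has_owner", "Tom Washington"),
]


def match(store, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None matches anything."""
    return [
        (s, p, o)
        for s, p, o in store
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


print(match(triples, predicate="Has_owner"))
print(match(triples, subject="Tom's Diner", obj="Restaurant"))
```

Every query is the same operation, a pattern with some positions left open, regardless of which attribute is being asked about.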
Methodological Solution
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
Origin
• Formats of data sources included free text,
semi-structured and structured
• Some data sets are made available only a
short time prior to system testing
• Data sets and domain of interest will change
• Data cannot be collected into a single store
• Provide cross-source searching and analytics
• Need to maintain the provenance of data
High Level View of Ontology Content
• Enable Description of Human Activity
[Diagram: nodes People & Organizations, Actions, Artifacts, Natural & Artificial Environments, Time, Attributes; link labels: use, to perform, that take place in, are distinguished by]
High Level View of Ontology Content
• Including the Activity of Describing Human Activity
[Diagram: nodes People & Organizations, Information, Action, People & Orgs, Artifacts, Natural & Artificial Environments, Time, Attribute, Attributes; link labels: produce, that describe, at a, is distinguished by]
Current Import Structure of the I2WD Ontologies
[Diagram: import structure in three tiers (Upper Level Ontology, Mid-Level Ontology, Domain Ontology). Ontologies shown: Relation Ontology (RO), RO BFO Bridge 1.1, Basic Formal Ontology (BFO), Extended Relation Ontology, Agent Ontology, Artifact Ontology, ChEBI Ontology, Event Ontology, Geospatial Ontology, Information Entity Ontology, Quality Ontology, Time Ontology, Emotion Ontology, Manufactured Chemicals Ontology, AIRS Mid-Level Ontology, Information Technology Ontology, Counterterrorism Ontology]
Highlighted Capabilities of Ontologies
• Objects (persons, organizations, facilities,
materials, etc.) are linked to qualities,
functions and roles
– these links can be time-stamped
– these attributes can be differentiated between
designed and improvised
– these attributes can be measured using nominal (tall, average), ordinal (1st, best), interval (30° Celsius), and ratio (30mm, 10 gallons) measurement types
Highlighted Capabilities of I2WD
Ontologies
• Events can be linked together with temporal
or causal relationships
• Ambiguous times (… occurred during the Spring of 2010) and
places (… happened in New York) can be integrated with
more precise information (…occurred on April 18th, 2010,
…happened in Central Park)
• Vocabulary for output of sentiment analysis
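One way to picture the integration of ambiguous and precise times is to treat both as intervals and test containment. The Python sketch below is only an illustration: the bounds chosen for "Spring of 2010" (March 20 to June 20) are an assumption, and the function name `during` is hypothetical rather than a term from the I2WD ontologies.

```python
from datetime import date

# Each time is a (start, end) interval; a precise day is a one-day interval.
spring_2010 = (date(2010, 3, 20), date(2010, 6, 20))  # assumed bounds
april_18_2010 = (date(2010, 4, 18), date(2010, 4, 18))


def during(inner, outer):
    """True when the inner interval lies entirely within the outer one."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


# The precise report is compatible with the ambiguous one.
print(during(april_18_2010, spring_2010))
print(during(spring_2010, april_18_2010))
```

The same containment idea carries over to places, where Central Park lies within New York.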
Using States to Express Time Dependent Attributes
• In 2004, Alaa al-Tamimi became Mayor of Baghdad.
[Diagram: classes Temporal Interval, Year, Gain Of Role, Mayor Role, Person, Government, City; instances 2004, Temporal Interval of Gain of Alaa al-Tamimi’s Mayor Role, Gain of Alaa al-Tamimi’s Mayor Role, Alaa al-Tamimi’s Mayor Role, Alaa al-Tamimi, City Government Of Baghdad, Baghdad; link labels: Is instance of, Has role, Participates in, Occurs on, Interval during, Delimited by, Is organizational Context of]
Designed and Measured Artifact Attributes
[Diagram: nodes Lithium, Cobalt, Oxygen, Portion of Lithium Cobalt Oxide, Lithium Ion Battery, Samsung Galaxy S4, Design Specifications of Samsung Galaxy S4, Thermal Stability, Thermal Stability Nominal Measurement, Thermal Stability Nominal Measurement Value (text value: Poor), Data Transfer Speed, Data Transfer Speed Ratio Measurement, Data Transfer Speed Measurement Value (decimal value 36.6, measurement unit Mbps), Data Transfer Speed Specification, Data Transfer Speed Specification Value (decimal value 42.2, measurement unit Mbps); link labels: is made of, has_part, bearer_of, Inheres_in, Is nominal measurement of, Is ratio measurement of, Has text value, Has decimal value, Uses measurement unit, prescribes, prescribed_by]
Ontology Content Based on Standards
Partial List of Doctrine and Standards Used
• Basic Formal Ontology (BFO)
• DOD Dictionary of Military and Associated Terms (JP 1-02)
• Operations (FM 3-0)
• Multinational Operations (JP 3-16)
• Counterinsurgency (FM 3-24)
• International Standard Industrial Classification of all Economic Activities Rev.4 (ISIC4)
• Universal Joint Task List (CJCSM 3500.04C)
• Weapon Technical Intelligence (WTI) Improvised Explosive Device (IED) Lexicon
• JC3IEDM
• Information Artifact Ontology (IAO)
• Phenotype and Trait Ontology (PATO)
• Foundational Model of Anatomy (FMA)
• Regional Connection Calculus (RCC-8)
• Allen Time Calculus
• Wikipedia
Ontology Content Tested Against Data
Partial List of Data Sources Used
• Treasury Office of Foreign Assets Control – Specially Designated
Nationals and Blocked Persons
• NCTC – Worldwide Incidents Tracking System
• UMD – Global Terrorism Database
• RAND – Database of Worldwide Terrorism Incidents
• LDM version .60 (TED)
• VMF PLI
• DCGS-A Event Reporting
• BFT Report (CCRi test data)
• Cidne Sigact (CCRi test data)
• Long War Journal
• Harmony Documents from CTC at West Point
• Threats Open Source Intelligence Gateway
Ontologies Use a Common Upper Ontology
[Diagram: class hierarchy under Entity with Object (subclasses Organization, Physical Artifact) and Quality (subclasses Quality of Organization, Quality of Physical Artifact); link labels: bearer_of, has_quality, has_quality]
• Produces common patterns within ontologies
– Reuse of mappings from the sources
• Easier to include new sources of data
– Enables more uniformity between queries
• Easier to transition to new domains of interest
Ontologies are Modular
[Diagram: classes Entity, Object, Physical Artifact, Organization, Spatial Location; link labels: located_at, located_at]
• Each Class is defined in one place
– Facilitates locating a class within the target
ontologies
– Provides better recall in queries
• Less likely to overlook relevant data
Ontologies Enable both Early and Late Fusion
• Granular classes allow direct mappings from various perspectives on the same domain while preserving information that can be later used for entity resolution
[Diagram: Data Source 1, Data Source 2, and Data Source 3 mapped to shared classes. Labels shown: Car, Model, Car Make, Manufacturer, Full Size, Mid Size, Compact, Length of Wheelbase, Vehicle Identification Number, Car VIN, VIN, Owner; link labels: prescribes, has quality, manufactures, designates, is nominally measured by]
Organization of Ontologies
• A limited number of upper and mid-level
ontologies are carefully managed
• Domain ontologies are developed by subject
matter experts and tested by automated
procedures
• Content is pushed from domain ontologies to
mid-level ontologies as usage levels warrant
Future Re-Organization of Ontologies
[Diagram: planned structure in three tiers (Upper Level Ontology, Mid-Level Ontology, Domain Ontology). Labels shown: BFO, Extended Relation Ontology, Information Artifact Ontology, Quality Ontology, Agent Ontology, Artifact Ontology, Event Ontology, Geospatial Ontology, Time Ontology, Military Events, Interpersonal Events, Human Anatomy, Watercraft, Ethnicities, Ground Vehicles, Occupations, Aircraft, Weather Events, Nationalities, Military Units, Clothing, Acts of Government, Religions, Ideologies, Disease Ontology, Weapons, Communication Devices, Tools, Legal System Events, Acts of Artifact Use, Chemical Ontology, Plant Taxonomy, Animal Taxonomy, Geological Taxonomy, Anthropogenic Feature, Atmospheric Feature, Hydrographic Feature, Landform, Geopolitical Feature, Role Defined Area, Criminal Acts, Mental Function Ontology]
Conformance Testing
• Inconsistency – A class is identified as being uninstantiable
• Semantic Smuggling – A class or property is reused with
changed content
• Multiple Inheritance – A class or property is asserted to be
a subclass of more than one superclass
• Taxonomy Overloading – A class or property is related to its
parent by a relationship other than subclass
• Containment – A class or property is not a child of any class
or property of the imported ontologies
• Conflation – A class or property includes information
model assertions that are not true of the domain
• Logic of Terms – A class or property is a set-theoretic
combination of other classes or properties
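Two of the checks listed above can be sketched in Python over a toy taxonomy. The class names, the set of imported classes, and the function names are all hypothetical; the containment check is also simplified to flag only classes with no asserted parent at all, and the sketch does not represent the actual automated procedures.

```python
from collections import defaultdict

# Asserted (subclass, superclass) pairs in a toy domain ontology.
subclass_of = [
    ("Car", "Vehicle"),
    ("Boat", "Vehicle"),
    ("Amphibious Car", "Car"),
    ("Amphibious Car", "Boat"),
]
all_classes = {"Vehicle", "Car", "Boat", "Amphibious Car", "Gadget"}
imported_classes = {"Vehicle"}  # classes supplied by the imported ontologies


def find_multiple_inheritance(pairs):
    """Classes asserted to be a subclass of more than one superclass."""
    parents = defaultdict(set)
    for child, parent in pairs:
        parents[child].add(parent)
    return {c: sorted(p) for c, p in parents.items() if len(p) > 1}


def find_uncontained(pairs, classes, imported):
    """Non-imported classes with no asserted parent (simplified check)."""
    with_parent = {child for child, _parent in pairs}
    return {c for c in classes if c not in with_parent and c not in imported}


print(find_multiple_inheritance(subclass_of))
print(find_uncontained(subclass_of, all_classes, imported_classes))
```

Checks such as Inconsistency need a reasoner and are outside what a structural scan like this can show.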
Building a Taxonomy – Common Problems
• Use – Mention Errors
• Part of rather than subclass of
[Diagram: Postal Address shown as the parent of Country, Address Locality, Address Region, Postal Code, Post Office Box Number, Street Address]
Building a Taxonomy – Common Problems
• Narrower in meaning than rather than subclass of
• Logic of Terms
[Diagram: hierarchy under Adhesives & Sealants. Labels shown: Adhesives, Sealants, Applicators & Dispensers, Adhesive Application Services, Glue Applicators, Epoxy Dispensers]
In Thomasnet.com (http://www.thomasnet.com/browse) classes are formed by conjunctions and the class hierarchy contains examples of subclasses based on search patterns
Building a Taxonomy – Common Problems
• Narrower in meaning than rather than subclass of
[Diagram: Color, with subclass Green, which has subclasses Brown Green, Dark Green, Desaturated Green, Light Green, Saturated Green, Yellow Green]
In the Phenotypic Quality Ontology (http://purl.obolibrary.org/obo/PATO_0000320) classes are subclasses by hue.
Building a Taxonomy – Common Problems
• Non-Disjoint Classes
[Diagram: Day, with the classes Day of Week, Holiday, Anniversary, Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, Saturday shown beneath it]
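The non-disjointness problem can be made concrete in Python: a single day can fall under a weekday class and under Holiday at once. The holiday set below is a hypothetical example, and the function name `classes_of` is an assumption for illustration.

```python
from datetime import date

# date.weekday() returns 0 for Monday through 6 for Sunday.
WEEKDAY_CLASSES = ["Monday", "Tuesday", "Wednesday", "Thursday",
                   "Friday", "Saturday", "Sunday"]
holidays = {date(2014, 12, 25)}  # hypothetical holiday calendar


def classes_of(day):
    """Return every sibling class under Day that this day belongs to."""
    classes = {WEEKDAY_CLASSES[day.weekday()]}
    if day in holidays:
        classes.add("Holiday")
    return classes


print(classes_of(date(2014, 12, 25)))
print(classes_of(date(2014, 9, 8)))
```

Since one instance belongs to two sibling classes, the siblings do not partition Day, which is what makes them non-disjoint.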