Big Data and How to Overcome the Problems it Causes
Ontology Engineering
CSE 510/PHI 598 Fall 2014
September 8, 2014

Big Data Problem
• Wikipedia defines Big Data as “…a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools.”
• Gartner defines Big Data with three ‘V’s:
  – Volume
  – Velocity (of production and analysis)
  – Variety
• This means that Big Data are beyond our control (as opposed to those complex and big systems with diverse and changing data where the complexity is known)

The Promise of Big Data
• Great insights can be obtained from large diverse data sets if properly exploited with the right analytics
• Proper exploitation requires solutions in the areas of
  – Hardware
  – Software
  – Method

Knowledge Representations: Attribute-Value Systems

Restaurant          Cuisine    Cost   Avg. Diner Review   Avg. Critic Review   Reservation Required
Tom’s Diner         American   $      3.2                 2.8                  No
Les Gros Poissons   French     $$$$   4.5                 4.8                  Yes
Il Grand Pesce      Italian    $$$    3.8                 3.5                  Yes
El Gran Pez         Spanish    $$     4.3                 4.4                  No
Den Stora Fisken    Swedish    $$$    3.2                 4.8                  Yes
De Grote Vis        Dutch      $$$$   4.0                 2.2                  Preferred

A Shortcoming of Attribute-Value Systems
• Duplicate Attributes

Restaurant          Cuisine    …   Owner            Owner 2            Owner 3
Tom’s Diner         American       Tom Washington
Les Gros Poissons   French         Jean Adams       Simone Jefferson
Il Grand Pesce      Italian        Robert Madison   Simone Jefferson
El Gran Pez         Spanish        Louis Adams
Den Stora Fisken    Swedish        Philip Jackson
De Grote Vis        Dutch          Kate Tyler       Claire Van Buren   Susan Harrison

Relational Database Solutions
• 1st Normal Form
  – No Attributes which are themselves sets

Restaurant          Cuisine    …   Owner
Tom’s Diner         American       Tom Washington
Les Gros Poissons   French         Jean Adams
Les Gros Poissons   French         Simone Jefferson
Il Grand Pesce      Italian        Robert Madison
Il Grand Pesce      Italian        Simone Jefferson
El Gran Pez         Spanish        Louis Adams

Rows Represent Unique Objects
• Each row now uniquely represents an aggregate entity of Restaurant and Owner
• This aggregate forms
the primary key of the table

Restaurant          Cuisine    …   Owner
Tom’s Diner         American       Tom Washington
Les Gros Poissons   French         Jean Adams
Les Gros Poissons   French         Simone Jefferson
Il Grand Pesce      Italian        Robert Madison
Il Grand Pesce      Italian        Simone Jefferson
El Gran Pez         Spanish        Louis Adams

A Shortcoming of 1st Normal Form
• Since the attributes depend on only a part of the primary key (i.e. Restaurant), the table is at risk of inconsistency if the attributes of one of the objects are changed in one row but not in the others

Restaurant          Cuisine    …   Owner
Tom’s Diner         American       Tom Washington
Les Gros Poissons   Creole         Jean Adams
Les Gros Poissons   French         Simone Jefferson
Il Grand Pesce      Italian        Robert Madison
Il Grand Pesce      Italian        Simone Jefferson
El Gran Pez         Spanish        Louis Adams

Relational Database Solutions
• 2nd Normal Form requires that any attribute must describe the object designated by the primary key rather than just some part of it

Restaurant          Cuisine    Cost   …
Tom’s Diner         American   $
Les Gros Poissons   Creole     $$$$
Il Grand Pesce      Italian    $$$
El Gran Pez         Spanish    $$
Den Stora Fisken    Swedish    $$$
De Grote Vis        Dutch      $$$$

Restaurant          Owner
Tom’s Diner         Tom Washington
Les Gros Poissons   Jean Adams
Les Gros Poissons   Simone Jefferson
Il Grand Pesce      Robert Madison
Il Grand Pesce      Simone Jefferson
El Gran Pez         Louis Adams

A Shortcoming of 2nd Normal Form
• While both Date and Day of Purchase describe the unique object of the table (i.e.
the Restaurant+Owner primary key) there are duplicate combinations of the two
• If one of the combinations is changed without the other, a date may be shown as falling on two days of the week

Restaurant          Owner              Date of Purchase   Day of Purchase
Tom’s Diner         Tom Washington     5/3/1994           Wednesday
Les Gros Poissons   Jean Adams         4/14/2008          Friday
Les Gros Poissons   Simone Jefferson   4/14/2008          Saturday
Il Grand Pesce      Robert Madison     10/28/2003         Thursday
Il Grand Pesce      Simone Jefferson   2/2/1998           Monday
El Gran Pez         Louis Adams        7/30/2012          Tuesday

Relational Database Solutions
• 3rd Normal Form requires that any attribute describes the entity represented by the primary key and only that entity
• No transitive descriptions as in the example from the previous slide

Restaurant          Owner              Date of Purchase
Tom’s Diner         Tom Washington     5/3/1994
Les Gros Poissons   Jean Adams         4/14/2008
Les Gros Poissons   Simone Jefferson   4/14/2008
Il Grand Pesce      Robert Madison     10/28/2003
Il Grand Pesce      Simone Jefferson   2/2/1998
El Gran Pez         Louis Adams        7/30/2012

Date         Day of Week
5/3/1994     Wednesday
4/14/2008    Friday
10/28/2003   Thursday
2/2/1998     Monday
7/30/2012    Tuesday

Knowledge Representations As Highly Designed Artifacts

Restaurant          Cuisine    Cost   …
Tom’s Diner         American   $
Les Gros Poissons   Creole     $$$$
Il Grand Pesce      Italian    $$$
El Gran Pez         Spanish    $$
Den Stora Fisken    Swedish    $$$
De Grote Vis        Dutch      $$$$

Restaurant          Owner              Date of Purchase
Tom’s Diner         Tom Washington     5/3/1994
Les Gros Poissons   Jean Adams         4/14/2008
Les Gros Poissons   Simone Jefferson   4/14/2008
Il Grand Pesce      Robert Madison     10/28/2003
Il Grand Pesce      Simone Jefferson   2/2/1998
El Gran Pez         Louis Adams        7/30/2012

Date         Day of Week
5/3/1994     Wednesday
4/14/2008    Friday
10/28/2003   Thursday
2/2/1998     Monday
7/30/2012    Tuesday

Application Translation Layers
• Presentation Layer
• Business Layer
• Data Access Layer

Big Data Hardware Solution
• Costly and can overrun the capabilities of the largest single machines
• A solution is to distribute information across many smaller machines
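
As a recap of the normal-form slides above, the 3rd Normal Form split can be sketched with Python's built-in sqlite3 module. This is only an illustrative sketch: the table names (ownership, calendar), the column names, and the helper purchases_with_day are assumptions introduced here, and the "Monday" update is a made-up edit used to show that one change now reaches every row that shares the date.

```python
import sqlite3

# In-memory database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ownership (
    restaurant       TEXT,
    owner            TEXT,
    date_of_purchase TEXT,
    PRIMARY KEY (restaurant, owner)      -- the Restaurant+Owner aggregate key
);
CREATE TABLE calendar (
    purchase_date TEXT PRIMARY KEY,      -- each date is stored exactly once
    day_of_week   TEXT
);
""")

# Rows copied from the 3rd Normal Form slide.
conn.executemany("INSERT INTO ownership VALUES (?, ?, ?)", [
    ("Tom's Diner", "Tom Washington", "5/3/1994"),
    ("Les Gros Poissons", "Jean Adams", "4/14/2008"),
    ("Les Gros Poissons", "Simone Jefferson", "4/14/2008"),
    ("Il Grand Pesce", "Robert Madison", "10/28/2003"),
    ("Il Grand Pesce", "Simone Jefferson", "2/2/1998"),
    ("El Gran Pez", "Louis Adams", "7/30/2012"),
])
conn.executemany("INSERT INTO calendar VALUES (?, ?)", [
    ("5/3/1994", "Wednesday"),
    ("4/14/2008", "Friday"),
    ("10/28/2003", "Thursday"),
    ("2/2/1998", "Monday"),
    ("7/30/2012", "Tuesday"),
])


def purchases_with_day(connection):
    """Rebuild the wide 2nd Normal Form view by joining the two tables."""
    return connection.execute("""
        SELECT o.restaurant, o.owner, o.date_of_purchase, c.day_of_week
        FROM ownership AS o
        JOIN calendar AS c ON c.purchase_date = o.date_of_purchase
        ORDER BY o.restaurant, o.owner
    """).fetchall()


# Illustrative edit: changing the day for one date touches a single row ...
conn.execute(
    "UPDATE calendar SET day_of_week = ? WHERE purchase_date = ?",
    ("Monday", "4/14/2008"),
)

# ... so both owners who bought on that date see the same day of the week.
rows = purchases_with_day(conn)
days_for_date = {day for _, _, d, day in rows if d == "4/14/2008"}
print(days_for_date)
```

Because the day of the week lives only in the calendar table, the Friday/Saturday conflict from the 2nd Normal Form slide cannot be stored in the first place.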
Hardware Solution is Contrary to Relational Design
• Relational databases are designed to run on single machines
• Attempting to disassemble them and run them on a cluster of machines is very difficult
• Big Data requires a different Data Model, one that is cluster-friendly, that is, one that can be distributed while still being efficient at retrieving the data that is needed

NoSQL Database Solutions
• Do not require a highly structured representation of data; the data models are relatively simple
  – Key-Value Model
  – Document Model
  – Column Family Model
  – Graph Model

Key-Value Data Model
• Key-Value pair where the key is associated with some value
• The value can be any type of object: a number, a text value, an array, an image, a file, etc.

Key                 Value
Tom’s Diner         Value associated with Tom’s Diner
Les Gros Poissons   Value associated with Les Gros Poissons
Il Grand Pesce      Value associated with Il Grand Pesce
El Gran Pez         Value associated with El Gran Pez

Document Data Model
• Each element is a document, that is, a complex data structure of some type, usually expressed in JSON (JavaScript Object Notation)
• No set schema for the documents
• More transparent than the Key-Value model

  [
    {
      "id": 1,
      "Name": "Tom's Diner",
      "Cuisine": "American",
      "Cost": "$",
      "Average Diner Review": 3.2,
      "Average Critic Review": 2.8,
      "Reservation Required": "No",
      "Owner": "Tom Washington"
    }
  ]

Column Family Data Model
• A Row Key is associated with n-many column families (i.e.
groups of columns that store related data)

Row Key: 1234
  Restaurant Column Family: Name “Tom’s Diner”, Cuisine “American”, Cost “$”, Avg Review 2.8
  Owner Column Family: Name “Tom Washington”

Aggregate Orientation
• As noticed and described by Martin Fowler*, all of the aforementioned NoSQL data models share an orientation towards storing the description of a significant object
• This enables the distribution of data that tends to be requested together (cluster-friendly)
• Tends to be difficult to re-order the data to query by different aggregates
* NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, by Sadalage, P.J. and Fowler, M. (2012)

Graph Data Model
(Graph diagram; node and edge labels: Tom’s Diner, Restaurant, Owner, Tom Washington, Date of Purchase, Wednesday, 5/13/94, American Cuisine, Cost of $, Avg. Diner Review of 3.2, Avg. Critic Review of 2.8, Reservations Not Required)

Graph Data Model
• Does not have an aggregate orientation; rather the opposite, a granular orientation that breaks the aggregate into its composite elements
• Good for data exploration
• Still cluster-friendly; similar data can be stored in separate graphs

RDF Data Model
• RDF specifies a regular syntax for well-formed expressions
  – rdf:statement – a simple expression that relates one entity to another
  – rdf:subject – the entity the statement is about
  – rdf:predicate – the relationship said to hold between the two entities
  – rdf:object – the entity that is related to the subject
• Humans are mortal
• UB’s website homepage has URL http://www.buffalo.edu/
• Remus is the brother of Romulus

RDF Data Model

Subject       Predicate                    Object
Tom’s Diner   Is_a                         Restaurant
Tom’s Diner   Offers                       American Cuisine
Tom’s Diner   Costs                        $
Tom’s Diner   Has_average_diner’s review   3.2
Tom’s Diner   Has_average_critics_review   2.8
Tom’s Diner   Requires_reservation         No
Tom’s Diner   Has_owner                    Tom Washington

Methodological Solution
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch.
http://lod-cloud.net/

Origin
• Formats of data sources included free text, semi-structured and structured
• Some data sets are made available only a short time prior to system testing
• Data sets and domain of interest will change
• Data cannot be collected into a single store
• Provide cross-source searching and analytics
• Need to maintain the provenance of data

High Level View of Ontology Content
• Enable Description of Human Activity
(Diagram; labels in extraction order: to perform, People & Organizations, use, Actions, that take place in, Artifacts, Natural & Artificial Environments, are distinguished by, Time, Attributes)

High Level View of Ontology Content
• Including the Activity of Describing Human Activity
(Diagram; labels in extraction order: People & Organizations, produce, Information, that describe, Action, People & Orgs, at a, Artifacts, Natural & Artificial Environments, Time, Attribute, Time, is distinguished by, Attributes)

Current Import Structure of the I2WD Ontologies
(Import diagram. Level labels: Upper Level Ontology, Mid-Level Ontology, Domain Ontology. Ontologies shown: Relation Ontology (RO), RO BFO Bridge 1.1, Basic Formal Ontology (BFO), Extended Relation Ontology, Agent Ontology, Artifact Ontology, ChEBI Ontology, Event Ontology, Geospatial Ontology, Information Entity Ontology, Quality Ontology, Time Ontology, Emotion Ontology, Manufactured Chemicals Ontology, AIRS Mid-Level Ontology, Information Technology Ontology, Counterterrorism Ontology)

Highlighted Capabilities of Ontologies
• Objects (persons, organizations, facilities, materials, etc.)
are linked to qualities, functions and roles
  – these links can be time-stamped
  – these attributes can be differentiated between designed and improvised
  – these attributes can be measured using nominal (tall, average), ordinal (1st, best), interval (30° Celsius), and ratio (30mm, 10 gallons) measurement types

Highlighted Capabilities of I2WD Ontologies
• Events can be linked together with temporal or causal relationships
• Ambiguous times (… occurred during the Spring of 2010) and places (… happened in New York) can be integrated with more precise information (…occurred on April 18th, 2010, …happened in Central Park)
• Vocabulary for output of sentiment analysis

Using States to Express Time Dependent Attributes
• In 2004, Alaa al-Tamimi became Mayor of Baghdad.
(Diagram. Classes: Temporal Interval, Gain Of Role, Year, Mayor Role, Person, Government, City. Instances: Baghdad, Alaa al-Tamimi, Alaa al-Tamimi’s Mayor Role, 2004, City Government Of Baghdad, Gain of Alaa al-Tamimi’s Mayor Role, Temporal Interval of Gain of Alaa al-Tamimi’s Mayor Role. Relations: Is instance of, Has role, Is organizational Context of, Participates in, Occurs on, Interval during, Delimited by)

Designed and Measured Artifact Attributes
(Diagram. Nodes: Samsung Galaxy S4, Design Specifications of Samsung Galaxy S4, Lithium Ion Battery, Portion of Lithium Cobalt Oxide, Lithium, Cobalt, Oxygen, Thermal Stability, Thermal Stability Nominal Measurement, Thermal Stability Nominal Measurement Value (Has text value: Poor), Data Transfer Speed, Data Transfer Speed Specification, Data Transfer Speed Ratio Measurement, Data Transfer Speed Measurement Value (Has decimal value 36.6, Uses measurement unit Mbps), Data Transfer Speed Specification Value (Has decimal value 42.2, Uses measurement unit Mbps). Relations: has_part, is made of, bearer_of, Inheres_in, prescribes, prescribed_by, Is nominal measurement of, Is ratio measurement of)

Ontology Content Based on Standards
Partial List of Doctrine and Standards Used
• Basic Formal Ontology (BFO)
• DOD Dictionary of Military and Associated Terms (JP 1-02)
• Operations (FM 3-0)
• Multinational Operations (JP 3-16)
• Counterinsurgency (FM 3-24)
• International Standard Industrial Classification of all Economic Activities Rev.4 (ISIC4)
• Universal Joint Task List (CJSCM 3500.04C)
• Weapon Technical Intelligence (WTI) Improvised Explosive Device IED Lexicon
• JC3IEDM
• Information Artifact Ontology (IAO)
• Phenotype and Trait Ontology (PATO)
• Foundational Model of Anatomy (FMA)
• Regional Connection Calculus (RCC-8)
• Allen Time Calculus
• Wikipedia

Ontology Content Tested Against Data
Partial List of Data Sources Used
• Treasury Office of Foreign Assets Control – Specially Designated Nationals and Blocked Persons
• NCTC – Worldwide Incidents Tracking System
• UMD – Global Terrorism Database
• RAND – Database of Worldwide Terrorism Incidents
• LDM version .60 (TED)
• VMF PLI
• DCGS-A Event Reporting
• BFT Report (CCRi test data)
• Cidne Sigact (CCRi test data)
• Long War Journal
• Harmony Documents from CTC at West Point
• Threats Open Source Intelligence Gateway

Ontologies Use a Common Upper Ontology
(Diagram. Classes: Entity, Object, Quality, Organization, Physical Artifact, Quality of Organization, Quality of Physical Artifact. Relations: bearer_of, has_quality)
• Produces common patterns within ontologies
  – Reuse of mappings from the sources
• Easier to include new sources of data
  – Enables more uniformity between queries
• Easier to transition to new domains of interest

Ontologies are Modular
(Diagram. Classes: Entity, Object, Physical Artifact, Organization, Spatial Location. Relation: located_at)
• Each Class is defined in one place
  – Facilitates locating a class within the target ontologies
  – Provides better recall in queries
• Less likely to overlook relevant data

Ontologies Enable both Early and Late Fusion
(Diagram label: Data Source 1)
• Granular classes allow direct mappings from various perspectives on the same domain while
preserving information that can be later used for entity resolution
(Diagram. Source labels: Data Source 2, Data Source 3. Other labels in extraction order: prescribes, Model, Car, Full Size, Mid Size, Compact, Car, has quality, manufactures, Full Size, Manufacturer, Length of Wheelbase, Mid Size, designates, Model, Compact, Vehicle Identification Number, Car, Make, is nominally measured by, Car, VIN, VIN, Owner)

Organization of Ontologies
• A limited number of upper and mid-level ontologies are carefully managed
• Domain ontologies are developed by subject matter experts and tested by automated procedures
• Content is pushed from domain ontologies to mid-level ontologies as usage levels warrant

Future Re-Organization of Ontologies
(Diagram. Level labels: Upper Level Ontology, Mid-Level Ontology, Domain Ontology. Ontologies and modules shown, in extraction order: BFO, Extended Relation Ontology, Information Artifact Ontology, Quality Ontology, Agent Ontology, Artifact Ontology, Event Ontology, Geospatial Ontology, Military Events, Interpersonal Events, Human Anatomy, Watercraft, Ethnicities, Ground Vehicles, Occupations, Aircraft, Weather Events, Nationalities, Military Units, Clothing, Acts of Government, Religions, Ideologies, Disease Ontology, Weapons, Communication Devices, Tools, Legal System Events, Acts of Artifact Use, Time Ontology, Chemical Ontology, Plant Taxonomy, Animal Taxonomy, Geological Taxonomy, Anthropogenic Feature, Atmospheric Feature, Hydrographic Feature, Landform, Geopolitical Feature, Role Defined Area, Criminal Acts, Mental Function Ontology)

Conformance Testing
• Inconsistency
  – A class is identified as being uninstantiable
• Semantic Smuggling
  – A class or property is reused with changed content
• Multiple Inheritance
  – A class or property is asserted to be a subclass of more than one superclass
• Taxonomy Overloading
  – A class or property is related to its parent by a relationship other than subclass
• Containment
  – A class or property is not a child of any class or property of the imported ontologies
• Conflation
  – A class or property includes information model assertions that are not true of the domain
• Logic of Terms
  – A
class or property is a set-theoretic combination of other classes or properties

Building a Taxonomy – Common Problems
• Use-Mention Errors
• Part of rather than subclass of
(Taxonomy diagram; classes shown: Postal Address, Country, Address Locality, Address Region, Postal Code, Post Office Box Number, Street Address)

Building a Taxonomy – Common Problems
• Narrower in meaning than rather than subclass of
• Logic of Terms
(Taxonomy diagram; classes shown: Adhesives & Sealants, Adhesives, Applicators & Dispensers, Sealants, Adhesive Application Services, Glue Applicators, Epoxy Dispensers)
In Thomasnet.com (http://www.thomasnet.com/browse) classes are formed by conjunctions and the class hierarchy contains examples of subclasses based on search patterns

Building a Taxonomy – Common Problems
• Narrower in meaning than rather than subclass of
(Taxonomy diagram; classes shown: Color, Green, Brown Green, Dark Green, Desaturated Green, Light Green, Saturated Green, Yellow Green)
In the Phenotypic Quality Ontology (http://purl.obolibrary.org/obo/PATO_0000320) classes are subclasses by hue.

Building a Taxonomy – Common Problems
• Non-Disjoint Classes
(Taxonomy diagram; classes shown: Day, Day of Week, Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Holiday, Anniversary)
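
The non-disjoint class problem on the last slide can be illustrated with a small Python sketch. It assumes, for illustration only, that the weekday classes and Holiday are all meant to classify the same individual days; the holiday set, the sample dates, and the function names classes_of and overlapping_classes are hypothetical and not part of the slides.

```python
from datetime import date
from itertools import combinations

# date.weekday() returns 0 for Monday through 6 for Sunday.
WEEKDAY_CLASSES = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]

# Hypothetical instance data: one US holiday (Labor Day 2014) and three sample days.
holidays = {date(2014, 9, 1)}
sample_days = [date(2014, 9, 1), date(2014, 9, 2), date(2014, 9, 8)]


def classes_of(day):
    """Return every class under Day that this individual day falls into."""
    classes = {WEEKDAY_CLASSES[day.weekday()]}
    if day in holidays:
        classes.add("Holiday")
    return classes


def overlapping_classes(days):
    """Map each pair of classes to the days that are instances of both."""
    overlaps = {}
    for day in days:
        for pair in combinations(sorted(classes_of(day)), 2):
            overlaps.setdefault(pair, []).append(day)
    return overlaps


# Any non-empty result means the two classes in the pair are not disjoint.
overlaps = overlapping_classes(sample_days)
for (first, second), shared in overlaps.items():
    print(first, "and", second, "share", len(shared), "instance(s)")
```

A check of this kind only finds overlaps that are present in the sample data; declaring classes disjoint in the ontology itself lets a reasoner flag such cases in general.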