Oracle – Big Data
The Intelligence Life-Cycle and Schema-Last Approach
Dr Neil Brittliff PhD

A little about myself…
• Awarded a PhD at the University of Canberra in March this year for my work in the Big Data space
• Currently employed as a Data Scientist within the Australian Government
• Have been employed by 5 law enforcement agencies
• Developed cryptographic software to support the Australian Medicare system
• First used Oracle products back in 1986
• Worked in the IT industry since 1982
• Reside in Canberra (capital of Australia)
• Canberra is the only capital city in Australia that is not named after a person
• Interests: tennis (play), cricket (watch), bushwalking and camping, piano playing (very bad), making stuff out of wood
• Enjoy the art of programming (prefer the 'C' language)
• Pushing the limits of the Raspberry Pi

Talk Structure
• Motivation
• Principles and Constraints
• Intelligence Life-Cycle: Collect & Collate, Analyse & Produce, Report & Disseminate
• Research
• What is a Schema
• The Problem with ETL
• Data Cleansing versus Data Triage
• A New Architecture
• Oracle Big Data
• The Schema-Last Approach
• Indexing Technologies and Exploitation
• User Reaction
• Observations and Opportunities

National Criminal Intelligence
The Law Enforcement community are also in the business of collecting and analysing criminal intelligence and data, and where possible, sharing that resulting information…

To do this, they need rich, contemporary, and comprehensive criminal intelligence…

The National Criminal Intelligence Fusion Capability, which brings together subject matter experts, analysts, technology and big data to identify previously unknown criminal entities, criminal methodologies, and patterns of crime.

Fusion capability identifies the threats and vulnerabilities through the use of data. It brings together, monitors and analyses data and information from Customs, other law enforcement, Government agencies and industry to build an intelligence picture of serious and organised crime in Australia.

Australian Institute of Criminology
• While many of the challenges posed by the volume of data are addressed in part by new developments in technology, the underlying issue has not been adequately resolved.
• Over many years, there have been a variety of different ideas put forward in relation to addressing the increasing volume of data, such as data mining.
Darren Quick and Kim-Kwang Raymond Choo, Australian Institute of Criminology, September 2014

Objectives
• Support the Australian Criminal Intelligence Model
• Simple interface to exploit the data
• Data ingestion must be simple to do and minimise transformation
• Support the large variety of data sources
• Fast ingestion and retrieval times
• Enable exact and fuzzy searching
• Support 'Identity Resolution'
• Support metadata
• Maintain the data's integrity
• Preserve data lineage/provenance
• Reproduce the ingested data source exactly!
We don't want this!

The Intelligence Life-Cycle
• Plan, prioritise & direct
• Collect & collate
• Analyse & produce
• Report & disseminate
• Evaluate & review

Intelligence – Data Source Classification
[Figure: data source classification chart; axes run from Low to High; regions labelled "Collect & collate" and "Analyse & produce"; values shown: 5% and 95%]

Some Definitions (Collect & Collate)
• Schema is from the Greek word meaning 'form' or 'figure' and is a formal representation of a data model which has integrity constraints controlling permissible data values.
• Data munging, sometimes referred to as data wrangling, means taking data that's stored in one format and changing it into another format.
• A major problem for the data scientist is to flatten the bumps that result from the heterogeneity of data.
  Jimmy Lin and Dmitriy Ryaboy. Scaling big data mining infrastructure: The Twitter experience.
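The schema definition above (a data model plus integrity constraints on permissible values) can be sketched in a few lines of Python. This is only an illustration: the field names `surname` and `postcode`, the constraint rules, and the helper `violations` are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Field:
    """One schema field with an integrity constraint (hypothetical layout)."""
    name: str
    check: Callable[[str], bool]


# Assumed constraints: surname must be non-empty, postcode must be 4 digits.
SCHEMA: List[Field] = [
    Field("surname", lambda v: v.strip() != ""),
    Field("postcode", lambda v: len(v) == 4 and v.isdigit()),
]


def violations(record: Dict[str, str], schema: List[Field] = SCHEMA) -> List[str]:
    """Return the names of fields that are missing or break their constraint."""
    return [
        f.name
        for f in schema
        if f.name not in record or not f.check(record[f.name])
    ]


if __name__ == "__main__":
    print(violations({"surname": "Smith", "postcode": "2601"}))
    print(violations({"surname": "Smith", "postcode": "ACT"}))
```

A schema-first pipeline would typically reject or repair any record for which `violations` is non-empty before storing it; that step is where the heterogeneity of real data starts to cost effort.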
Schema Application
[Diagram: two pipelines, each labelled with "Schema"]
Schema First: Raw Data → Cleanse → Storage → Analyse
Schema Last: Raw Data → Triage → Storage → Analyse

Data Cleansing … (Collect & Collate)
Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve the quality of data.

"Data cleansing is the process of analysing the quality of data in a data source, manually approving/rejecting the suggestions by the system, and thereby making changes to the data. Data cleansing in Data Quality Services (DQS) includes a computer-assisted process that analyses how data conforms to the knowledge in a knowledge base, and an interactive process that enables the data steward to review and modify computer-assisted process results to ensure that the data cleansing is exactly as they want to be done."
Microsoft: 2012

Data Sources – Always Increasing (Collect & Collate)
[Figure: chart with a region labelled "Gap"]

Data Cleansing - Doesn't WORK (Collect & Collate)
"Data cleansing can be time-consuming and tedious, but robust estimators are not a substitute for careful examination of the data for clerical errors and other problems."
David Ruppert. Inconsistency of resampling algorithms for high-breakdown regression estimators and a new algorithm. Journal of the American Statistical Association, 97: 148–149, 2002.

"Formal data cleansing can easily overwhelm any human or perhaps the computing capacity of an organization."
N. Brierley, T. Tippetts, and P. Cawley. Data fusion for automated non-destructive inspection. Proceedings of the RSPA, 2014.

"that the data volume may overwhelm the Extract Transform Load process and that data cleansing may introduce unintentional errors."
Vincent McBurney, 17 mistakes that ETL designers make with very large data, 2007.
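One way to see how cleansing can introduce unintentional errors is a date with a two-digit year such as 20-07-14, which has more than one valid reading. The sketch below is a minimal illustration, not the system described in the talk: the function names `cleanse_date` and `triage_date` and the list of candidate formats are assumptions made for the example.

```python
from datetime import datetime
from typing import Dict, List, Optional

# Assumed candidate layouts: day-month-year and year-month-day, 2-digit year.
CANDIDATE_FORMATS = ["%d-%m-%y", "%y-%m-%d"]


def possible_readings(raw: str) -> List[str]:
    """Return every ISO date the raw string could mean under the formats above."""
    readings = []
    for fmt in CANDIDATE_FORMATS:
        try:
            readings.append(datetime.strptime(raw, fmt).date().isoformat())
        except ValueError:
            pass  # the raw value does not fit this layout
    return readings


def cleanse_date(raw: str) -> Optional[str]:
    """Cleansing-style: keep the first reading and discard the raw value."""
    readings = possible_readings(raw)
    return readings[0] if readings else None


def triage_date(raw: str) -> Dict[str, object]:
    """Triage-style: keep the raw value and record the candidate readings."""
    return {"raw": raw, "readings": possible_readings(raw)}


if __name__ == "__main__":
    print(cleanse_date("20-07-14"))
    print(triage_date("20-07-14"))
```

The cleansing-style function silently commits to one interpretation, so the original value and its ambiguity are lost. The triage-style function stores the source value exactly as ingested and defers the choice to analysis time.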
Data Cleansing – Loss of Format (Collect & Collate)
Input Date | Cleansed Date | Comment
20 July 2014 | 20-07-2014 | Australian Date
July-20-2014 | 20-07-2014 | American Format (mmm-dd-yyyy)
2014-20-07 | 20-07-2014 | Arabic Format (right to left)
20-07-14 | 20-07-2014 | Data Ambiguity
July 2014 | 01-07-2014 | Imputed Value

"If you torture the data long enough, it will confess."
Clifton R. Musser

ETL vs Triage (Collect & Collate)
[Figure: two flowcharts comparing ETL and Triage. Recoverable step labels: Initiate, Extract, Triage, Determine Suitability?, Load, Transform, Application Assessment?, Verify?, Fuse, Report, Resolve, Complete]

We did our research …

Oracle's BDA (Big Data Appliance) (Collect & Collate)

Data Storage/Collation (Collect & Collate)
• Store the data semantically
• Built on a defined taxonomy/ontology
• Perfect to capture metadata
• Searched for the perfect triple store
[Diagram labels: Graph; List; Triple = Subject, Predicate, Object]

The Architecture
[Diagram: BDA data flow across the Collect & Collate and Analyse & Produce phases. Recoverable component labels: Feeds, New Data, Historical Data, Apache Pig, RDF / Modelling, Semantic Store, SPARQL, R Language, HBase, Index (IIR), Index (SOLR), Set Store, Search Assistant, Data Exploitation, Data Exploration, Disseminate, Palantir]

Schema Last … (Collect & Collate)
[Diagram: 'Triaged' Data fields mapped through the Schema to Models]
First Name, Middle Name, Last Name → Full-Name
Street Number, Street Name, Suburb, State, Postcode → Full-Address

ACC Search Engines – 'Smackdown' (Collect & Collate)
Feature | SOLR | IIR
License | Apache License | Commercial
Storage | Inverted List | Third-party Database
Score Model | Inverse Document Frequency | Normalized Score
Programming Interface | REST | SOAP API
Other features compared on the slide: Google-like search support; result pagination; homophone support (can use synonym support); phoneme search; spreading indexes across multiple nodes; schema-less support; geo-spatial. One cell reads "Next Release".

Collect & Collation Tool (Collect & Collate)

Bongo – Exploration (Analyse & Produce)

Palantir – Semantic Interface (Report & Disseminate)

User Reaction
[Charts: Time to Triage (< 1 hour; 1 to 24 hours; > 24 hours) and General Size % in megabytes (< 1; 1 to 100; 100 to 1000; > 1000)]
• Developed a Palantir plugin to search the Fusion Data Holding
• Bulk Matching was a great success
• In general, user reaction has been positive
• Time to Triage was usually under an hour, whereas cleansing could take weeks!

Ingestion Rate – The Improvement (Collect & Collate)

Observations…
• The Bulk Matcher
• Performance and Reliability
• Interaction with Palantir
• Configuration over Customisation
• Search for the 'Single Source of Truth' Golden Record
• Acceptance of the Schema Last Approach
• Overwhelmed by Search Results

Further Reading and Contacts
• Strategic Thinking in Criminal Intelligence, Jerry H Ratcliffe, The Federation Press – 2009, ISBN 978 186287 734-4
• Intelligence-Led Policing, Jerry Ratcliffe, Routledge – 2008, ISBN 978-1-843292-339-8
• Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection, Peter Christen, Springer – 2012, ISBN 978-3-642-31163-5
• Foundations of Semantic Web Technologies, Pascal Hitzler, Markus Krötzsch, Sebastian Rudolph, CRC Press – 2010, ISBN 978-1-4200-9050-5
• Big Data – A revolution that will transform how we live, work, and think, Viktor Mayer-Schönberger and Kenneth Cukier, HMH – 2013, ISBN 978-0-544-00269-2
• The Schema Last Approach to Data Fusion, Neil Brittliff and Dharmendra Sharma, AusDM 2014
• A Triple Store Implementation to support Tabular Data, Neil Brittliff and Dharmendra Sharma, AusDM 2014
• University of Canberra: http://www.Canberra.edu.au
• Australian Institute of Criminology: http://www.aic.gov.au