BIG DATA Conceptual Modeling to the Rescue David W. Embley with special thanks to Stephen W. Liddle and the Data Extraction Research Group at Brigham Young University Roadmap • What is BIG DATA? • Why should Conceptual Modeling apply? • Examples to show how Conceptual Modeling can “come to the rescue” • Summary (and take-home message): – Principles that guide the use of Conceptual Modeling in BIG DATA applications – Challenges and Research Opportunities ER 2013 Keynote 2 Roadmap • What is BIG DATA? • Why should Conceptual Modeling apply? • Examples to show how Conceptual Modeling can “come to the rescue” • Summary (and take-home message): – Principles that guide the use of Conceptual Modeling in BIG DATA applications – Challenges and Research Opportunities ER 2013 Keynote 3 BIG DATA • Volume: typically exceeding terabytes • Variety: heterogeneous sources; diverse needs • Velocity: phenomenal rate of acquisition • Veracity: trustworthiness & uncertainty ER 2013 Keynote 4 Volume: Kilobyte (103) A paragraph of text ER 2013 Keynote 5 Volume: Megabyte (106) A small novel ER 2013 Keynote 6 Volume: Gigabyte (109) Sound wave of Beethoven’s Fifth Symphony ER 2013 Keynote 7 Volume: Terabyte (1012) All the X-ray images in a large hospital ER 2013 Keynote 8 Volume: Petabyte (1015) 10 billion Facebook photos ER 2013 Keynote 9 Volume: Exabyte (1018) 1/5 of the words ever spoken ER 2013 Keynote 10 Volume: Zettabyte (1021) Grains of sand on all the world’s beaches ER 2013 Keynote 11 Volume: Yottabyte (1024) Atoms in 7,000 human bodies NSA data site – purportedly designed to store yottabytes of data. ER 2013 Keynote 12 Variety: Heterogeneous Sources & Diverse Needs Radiology Report (John Doe, July 19, 12:14 pm) ER 2013 Keynote 13 Velocity Astronomers expect to be processing 10 petabytes of data every hour from the SKA telescope. Square Kilometer Array Telescope ER 2013 Keynote 14 Velocity One minute on the Internet: 640TB data transferred, 100k tweets, 204 million e-mails sent ER 2013 Keynote 15 Veracity: Uncertainty • An age-old question: “What is truth?” • Einstein: “The pursuit of truth and beauty is a sphere of activity in which we are permitted to remain children all our lives.” • Of one thing we can be certain: ER 2013 Keynote 16 Roadmap • What is BIG DATA? • Why should Conceptual Modeling apply? • Examples to show how Conceptual Modeling can “come to the rescue” • Summary (and take-home message): – Principles that guide the use of Conceptual Modeling in BIG DATA applications – Challenges and Research Opportunities ER 2013 Keynote 17 Conceptual Modeling & BIG DATA • Main thrust: organizing data [Chen, TODS’76] • And, that’s one of the challenges of BIG DATA … but – Volume: too big – Variety: too much – Velocity: too fast – Veracity: too uncertainty ER 2013 Keynote 18 Looking Backward select PART-NO, QUANTITY-ON-HAND where … ER 2013 Keynote 19 Looking Forward • Conceptualization of the Web – Semantic search as well as keyword search – World-wide knowledge sharing • Examples: – DB-pedia – Conceptual Graphs • • • • Google’s Knowledge Graph Yahoo!’s Web of Objects Facebook’s Graph Search Microsoft’s/Bing’s Satori Knowledge Base – Metaweb – FamilySearch • Conceptual Modeling should apply! ER 2013 Keynote 20 SELECT ?name ?description_en ?description_de ?musician WHERE { ?musician <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:German_musicians> . ?musician foaf:name ?name . OPTIONAL { ?musician rdfs:comment ?description_en . FILTER (LANG(?description_en) = 'en') . } OPTIONAL { ?musician rdfs:comment ?description_de . ER 2013 Keynote 21 Google’s Knowledge Graph ER 2013 Keynote 22 Yahoo!’s Web of Objects Yahoo!’s image answer to: What is a food? ER 2013 Keynote 23 Facebook’s Graph Search ER 2013 Keynote 24 Satori Knowledge Base ER 2013 Keynote 25 Metaweb Boston ? ER 2013 Keynote 26 Metaweb Don’t forget to take Wendy to Boston’s birthday party at 2:00. ER 2013 Keynote 27 Metaweb Don’t forget to take Wendy to Boston’s birthday party at 2:00. ER 2013 Keynote 28 Roadmap • What is BIG DATA? • Why should Conceptual Modeling apply? • Examples to show how Conceptual Modeling can “come to the rescue” • Summary (and take-home message): – Principles that guide the use of Conceptual Modeling in BIG DATA applications – Challenges and Research Opportunities ER 2013 Keynote 29 A service provided by The Church of Jesus Christ of Latter-day Saints. © 2013 by Intellectual Reserve, Inc. All rights reserved. A free family history web site Visitors per day: 85,000+ Pages viewed per day: 5M+ ER 2013 Keynote 30 & the WoK-HD Project • FamilySearch – Volume: • 1.8PB+ online (1.2B records along with 900M 2MB jpeg images) • 42PB+ offline (1.2B 30–40MB tiff images) – Velocity: • 500M+ images in 2013 • 200K+ volunteer indexers • WoK-HD scanned-book project (within FamilySearch) – Volume: 100,000 books (3.5TB expected) – Velocity: 25,000 books / year ER 2013 Keynote 31 WoK-HD (A Web of Knowledge Superimposed over Historical Documents) … … … … ER 2013 Keynote 32 WoK-HD (A Web of Knowledge Superimposed over Historical Documents) grandchildren of Mary Ely … … … … ER 2013 Keynote 33 WoK-HD (A Web of Knowledge Superimposed over Historical Documents) grandchildren of Mary Ely … … … … ER 2013 Keynote 34 WoK-HD (A Web of Knowledge Superimposed over Historical Documents) grandchildren of Mary Ely … … … … ER 2013 Keynote 35 WoK-HD (A Web of Knowledge Superimposed over Historical Documents) grandchildren of Mary Ely … … … … ER 2013 Keynote 36 WoK-HD Construction • Mitigating Velocity, Variety, & Volume – CM-based information extraction • PatternReader (for semi-structured text) • OntoSoar (for unstructured text) – Automated information harvesting & organization • Assuring Veracity – CM-based query processing (with links and reasoning chains for extracted information) – Automated analysis with evidence-based CMs ER 2013 Keynote 37 PatternReader THE ELY ANCESTRY. 419 SEVENTH GENERATION. 241213. Mary Eliza Warner, b. 1826, dau. of Samuel Selden Warner and Azubah Tully; m. 1850, Joel M. Gloyd (who was connected with Chief Justice Waite's family), 24331 1. Abigail Huntington Lathrop (widow), Boonton, N. J., b. 1810, dau. of Mary Ely and Gerard Lathrop ; m. 1835, Donald McKenzie. West Indies, who was b. 1812, d. 1839. (The widow is unable to give the names of her husband's parents.) Their children 1. Mary Ely, b, 1836, d. 1859. 2. Gerard Lathrop, b. 1838. 243312. William Gerard Lathrop, Boonton, N. J., b. 1812, d. 1882, son of Mary Ely and Gerard Lathrop; m. 1837, Charlotte Brackett Jennings, New York City, who was b. 1818, dau. of Nathan Tilestone Jennings and Maria Miller. Their children: 1. Maria Jennings, b. 1838, d. 1840. 2. William Gerard, b. 1840. ) . 3. Donald McKenzie, b. 1840, d. 1843. ] 4. Anna Margaretta, b. 1843. 5. Anna Catherine, b. 1845. 243314. Charles Christopher Lathrop, N. Y. City, b. 1817, d. 1865, son of Mary Ely and Gerard Lathrop ; m. 1856, Mary Augusta Andruss, 992 Broad St., Newark, N. J., who was b. 1825, dau. of Judge Caleb Halstead Andruss and Emma Sutherland Goble. Mrs. Lathrop died at her home, 992 Broad St., Newark, N. J., Friday morning, Nov. 4, 1898. The funeral services were held at her residence on Monday, Nov. 7, 1898, at half-past two o'clock P. M. Their children: 1. Charles Halstead, b. 1857, d. 1861. 2. William Gerard, b. 1858, d. 1861. 3. Theodore Andruss, b. i860. 4. Emma Goble, b. 1862. Miss Emma Goble Lathrop, official historian of the New York Chapter of the Daughters of the American Revolution, is one of the youngest members to hold office, but one whose intelligence and capability qualify her for such distinction. Miss Lathrop is not without experience; in her present home and native city, Newark, N. J., she has filled the positions of secretary and treasurer to the Girls' Friendly Society for nine years, secretary and president of the Woman's Auxiliary of Trinity Church Parish, treasurer of the St. Catherine's Guild of St. Barnabas Hospital, and manager of several of Newark's charitable institutions which her grandparents were instrumental in founding. Miss Lathrop traces her lineage back through many generations of famous progenitors on both sides. Her maternal ancestors were among the early settlers of New Jersey, among them John Ogden, who received patent in 1664 for the purchase of Elizabethtown, and who in 1673 was ER 2013 Keynote 38 PatternReader THE ELY ANCESTRY. 419 SEVENTH GENERATION. 241213. Mary Eliza Warner, b. 1826, dau. of Samuel Selden Warner and Azubah Tully; m. 1850, Joel M. Gloyd (who was connected with Chief Justice Waite's family), 24331 1. Abigail Huntington Lathrop (widow), Boonton, N. J., b. 1810, dau. of Mary Ely and Gerard Lathrop ; m. 1835, Donald McKenzie. West Indies, who was b. 1812, d. 1839. (The widow is unable to give the names of her husband's parents.) Their children 1. Mary Ely, b, 1836, d. 1859. 2. Gerard Lathrop, b. 1838. 243312. William Gerard Lathrop, Boonton, N. J., b. 1812, d. 1882, son of Mary Ely and Gerard Lathrop; m. 1837, Charlotte Brackett Jennings, New York City, who was b. 1818, dau. of Nathan Tilestone Jennings and Maria Miller. Their children: 1. Maria Jennings, b. 1838, d. 1840. 2. William Gerard, b. 1840. ) . 3. Donald McKenzie, b. 1840, d. 1843. ] 4. Anna Margaretta, b. 1843. 5. Anna Catherine, b. 1845. 243314. Charles Christopher Lathrop, N. Y. City, b. 1817, d. 1865, son of Mary Ely and Gerard Lathrop ; m. 1856, Mary Augusta Andruss, 992 Broad St., Newark, N. J., who was b. 1825, dau. of Judge Caleb Halstead Andruss and Emma Sutherland Goble. Mrs. Lathrop died at her home, 992 Broad St., Newark, N. J., Friday morning, Nov. 4, 1898. The funeral services were held at her residence on Monday, Nov. 7, 1898, at half-past two o'clock P. M. Their children: 1. Charles Halstead, b. 1857, d. 1861. 2. William Gerard, b. 1858, d. 1861. 3. Theodore Andruss, b. i860. 4. Emma Goble, b. 1862. Miss Emma Goble Lathrop, official historian of the New York Chapter of the Daughters of the American Revolution, is one of the youngest members to hold office, but one whose intelligence and capability qualify her for such distinction. Miss Lathrop is not without experience; in her present home and native city, Newark, N. J., she has filled the positions of secretary and treasurer to the Girls' Friendly Society for nine years, secretary and president of the Woman's Auxiliary of Trinity Church Parish, treasurer of the St. Catherine's Guild of St. Barnabas Hospital, and manager of several of Newark's charitable institutions which her grandparents were instrumental in founding. Miss Lathrop traces her lineage back through many generations of famous progenitors on both sides. Her maternal ancestors were among the early settlers of New Jersey, among them John Ogden, who received patent in 1664 for the purchase of Elizabethtown, and who in 1673 was … 1. Mary Ely, b, 1836, d. 1859. 2. Gerard Lathrop, b. 1838. … 1. Maria Jennings, b. 1838, d. 1840. 2. William Gerard, b. 1840. ) . 3. Donald McKenzie, b. 1840, d. 1843. ] 4. Anna Margaretta, b. 1843. 5. Anna Catherine, b. 1845. … 1. Charles Halstead, b. 1857, d. 1861. 2. William Gerard, b. 1858, d. 1861. 3. Theodore Andruss, b. i860. 4. Emma Goble, b. 1862. ER 2013 Keynote 39 PatternReader … 1. Mary Ely, b, 1836, d. 1859. 2. Gerard Lathrop, b. 1838. … 1. Maria Jennings, b. 1838, d. 1840. 2. William Gerard, b. 1840. ) . 3. Donald McKenzie, b. 1840, d. 1843. ] 4. Anna Margaretta, b. 1843. 5. Anna Catherine, b. 1845. … 1. Charles Halstead, b. 1857, d. 1861. 2. William Gerard, b. 1858, d. 1861. 3. Theodore Andruss, b. i860. 4. Emma Goble, b. 1862. ER 2013 Keynote 40 PatternReader … 1. Mary Ely, b, 1836, d. 1859. 2. Gerard Lathrop, b. 1838. … 1. Maria Jennings, b. 1838, d. 1840. 2. William Gerard, b. 1840. ) . 3. Donald McKenzie, b. 1840, d. 1843. ] “ 4. Anna Margaretta, b. 1843. 5. Anna Catherine, b. 1845. … 1. Charles Halstead, b. 1857, d. 1861. 2. William Gerard, b. 1858, d. 1861. 3. Theodore Andruss, b. i860. 4. Emma Goble, b. 1862. } Twins” (lost in OCR) OCR Error ER 2013 Keynote 41 PatternReader Conflate symbols and induce grammar … #. Aaaa Aaaa, b, 18##, d. 18##. #. Aaaa Aaaa, b. 18##. … #. Aaaa Aaaa, b. 18##, d. 18##. #. Aaaa Aaaa, b. 18##. ) . #. Aaaa AaAa, b. 18##, d. 18##. ] #. Aaaa Aaaa, b. 18##. #. Aaaa Aaaa, b. 18##. … #. Aaaa Aaaa, b. 18##, d. 18##. #. Aaaa Aaaa, b. 18##, d. 18##. #. Aaaa Aaaa, b. i8##. #. Aaaa Aaaa, b. 18##. ^(\d)\.\s([A-Z][a-z]{3,7})\s([A-Z][a-z]{4,9}),\sb\.\s([i1]8\d\d)$ ^(\d)\.\s(([A-Z][a-z][A-Z][a-z]{5})|([A-Z][a-z]{3,7}))\s([A-Z][a-z]{4,9}),\sb[.,]\s(18\d\d)\sd.\s(18\d\d)\.$ ER 2013 Keynote 42 PatternReader … 1. Mary Ely, b, 1836, d. 1859. 2. Gerard Lathrop, b. 1838. … 1. Maria Jennings, b. 1838, d. 1840. 2. William Gerard, b. 1840. ) . 3. Donald McKenzie, b. 1840, d. 1843. ] 4. Anna Margaretta, b. 1843. 5. Anna Catherine, b. 1845. … 1. Charles Halstead, b. 1857, d. 1861. 2. William Gerard, b. 1858, d. 1861. 3. Theodore Andruss, b. i860. 4. Emma Goble, b. 1862. ER 2013 Keynote 43 PatternReader … 1. Mary Ely, b, 1836, d. 1859. 2. Gerard Lathrop, b. 1838. … 1. Maria Jennings, b. 1838, d. 1840. 2. William Gerard, b. 1840. ) . 3. Donald McKenzie, b. 1840, d. 1843. ] 4. Anna Margaretta, b. 1843. 5. Anna Catherine, b. 1845. … 1. Charles Halstead, b. 1857, d. 1861. 2. William Gerard, b. 1858, d. 1861. 3. Theodore Andruss, b. i860. 4. Emma Goble, b. 1862. ER 2013 Keynote 44 Conceptual Modeling—the Backbone ER 2013 Keynote 45 Conceptual Modeling—the Backbone (\d)\.\s([A-Z][a-z]{3,7})\s([A-Z][a-z]{4,9}),\sb\.\s([i1]8\d\d) ER 2013 Keynote 46 Conceptual Modeling—the Backbone (\d)\.\s([A-Z][a-z]{3,7})\s([A-Z][a-z]{4,9}),\sb\.\s([i1]8\d\d) ER 2013 Keynote 47 Conceptual Modeling—the Backbone ER 2013 Keynote 48 Extraction Ontologies Linguistically Grounded Conceptual Models ER 2013 Keynote 49 Lexical Object-Set Recognizers BirthDate external representation: \b[1][6-9]\d\d\b left context: b\.\s right context: [.,] … ER 2013 Keynote 50 Non-lexical Object-Set Recognizers Person object existence rule: {Name} … Name external representation: \b{FirstName}\s{LastName}\b … ER 2013 Keynote 51 Relationship-Set Recognizers Person-BirthDate external representation: ^\d{1,3}\.\s{Person},\sb\.\s{BirthDate}[.,] … ER 2013 Keynote 52 Ontology-Snippet Recognizers ChildRecord external representation: ^(\d{1,3})\.\s+([A-Z]\w+\s[A-Z]\w+) (,\sb\.\s([1][6-9]\d\d))?(,\sd\.\s([1][6-9]\d\d))?\. ER 2013 Keynote 53 HMM Recognizers 54 OntoSoar Recognizers ER 2013 Keynote 55 OntoSoar Recognizers +---------------------------------Xp------------------------------+ | +--------Ost--------+ +-----Js-----+ | +-Wd-+-Ss-+ +-----A-----+--Mp---+ +---DG--+ | | | | | | | | | | ^ Emma was.v official.a historian.n of the NYCDAR . Soar OntoES “of”(x1,x2) “NYCDAR”(x2) “Emma”(x1) “historian”(x1) “official”(x1) Name(“Emma”) Officer(“historian”) Organization(“NYCDAR”) Person–Name(y1,“Emma”) Person-Officer-Organization(y1,“official historian”,“NYCDAR”) ER 2013 Keynote 56 Beyond Extraction • Canonicalization • Reasoning – Extraction of implied assertions – Generation of implied assertions – Object identity resolution • Free-form query processing • Form-based advanced query processing All based on Conceptual Modeling ER 2013 Keynote 57 Canonicalization for Lexical Object Sets • • • • “Easter 1832” JulianDate(1832113) JulianDate(1832113) 22 Apr 1832 “Sam’l” and “Geo.” “Samuel” and “George” “Boonton, N.J.” “Boonton, NJ, USA” • Operations: – before(Date1, Date2): Boolean – probabilityMale(Name): 0.0..1.0 ER 2013 Keynote 58 Implied Assertions Author’s View Desired View Gender: Female Maria Jennings … daughter of … William Gerard Lathrop ER 2013 Keynote Name: GivenName: Maria Jennings Surname: Lathrop 59 Implied Assertions Maria Jennings Lathrop … child of … William Gerard Lathrop … son of … Mary Ely … Female Mary Ely … grandmother of … Maria Jennings Lathrop ER 2013 Keynote 60 Object Identity Resolution 0.032081 0.995030 0.032081 ER 2013 Keynote 61 Free-Form Query Processing Persons born in 1838 ER 2013 Keynote 62 Free-Form Query Processing Persons born in 1838 born Person(s)? ER 2013 Keynote 63 Free-Form Query Processing Persons born in 1838 born = 1838 Person(s)? Person Name BirthDate Person11 Gerard Lathrop McKenzie 1838 Person18 Maria Jennings Lathrop 1838 ER 2013 Keynote 64 Free-Form Query Processing Persons born in 1838 Person Name BirthDate Person11 Gerard Lathrop McKenzie 1838 Person18 Maria Jennings Lathrop 1838 “Gerard Lathrop McKenzie” because: Person(Person11) has GivenName (“Gerard Lathrop”) and Child(Person11) of Person(Person9) and Person(Person9) has Gender(“Male”) and Person(Person9) has Surname(“McKenzie”) ER 2013 Keynote 65 Form-Based Advanced Query Processing Cousins of Donald Lathrop who died before he was born or were born after he died. Cousin ER 2013 Keynote 66 Form-Based Advanced Query Processing Cousins of Donald Lathrop who died before he was born or were born after he died. … … 1. Mary Ely, b, 1836, d. 1859. 2. Gerard Lathrop, b. 1838. … 1. Maria Jennings, b. 1838, d. 1840. 2. William Gerard, b. 1840. ) . 3. Donald McKenzie, b. 1840, d. 1843. ] 4. Anna Margaretta, b. 1843. 5. Anna Catherine, b. 1845. … 1. Charles Halstead, b. 1857, d. 1861. 2. William Gerard, b. 1858, d. 1861. 3. Theodore Andruss, b. i860. 4. Emma Goble, b. 1862. ER 2013 Keynote 67 Veracity Mitigating Uncertainty with Conceptual Modeling • Knowledge – Populated conceptual model – Plato: “justified true belief” • FamilySearch – Conceptual model of reality – Constraint violation (discovery) – Assertion verification (evidence) • Conceptual modeling for veracity ER 2013 Keynote 68 Veracity: “Justified True Belief” Persons born in 1838 Person Name BirthDate Person11 Gerard Lathrop McKenzie 1838 Person18 Maria Jennings Lathrop 1838 “Gerard Lathrop McKenzie” because: Person(Person11) has GivenName (“Gerard Lathrop”) and Child(Person11) of Person(Person9) and Person(Person9) has Gender(“Male”) and Person(Person9) has Surname(“McKenzie”) ER 2013 Keynote 69 FamilySearch: Wiki-like Updates + Uncertain Information Cyclic Pedigree: Sources of error: 1. Incorrect person merges 2. Incorrect parent-child relationship assertions FamilySearch: Useful More Expressive Conceptual Model Specifications p = 0.79 2 Nov 1846 x 1 Nov 1845 p = 0.35 1:* 1:2.1:* Evidence-Based Conceptual Modeling (1) Model Reality, (2) Allow/Discover Discrepancies, (3) Add Evidence p = 0.79 2 Nov 1846 x 1 Nov 1845 p = 0.35 1:* 1:2.1:* Roadmap • What is BIG DATA? • Why should Conceptual Modeling apply? • Examples to show how Conceptual Modeling can “come to the rescue” • Summary (and take-home message): – Principles that guide the use of Conceptual Modeling in BIG DATA applications – Challenges and Research Opportunities ER 2013 Keynote 73 Principles that guide the use of Conceptual Modeling in BIG DATA • Harvest wrt a conceptual model – Extraction ontologies – And … • Organize wrt a conceptual model – Rich conceptualizations – And … • Analyze wrt a conceptual model – Evidence-based reasoning – And … ER 2013 Keynote 74 More Examples of Conceptual Modeling in BIG DATA Applications • • • • • • • • Knowledge Bundle Building for Research Studies (KBB) Multi-Lingual Query Processing (ML-OntoES) Table Understanding (TISP, Table Ontology) Dream! Automating Ontology Creation Think Big! (TANGO) Contribute! Automated Reading (OntoSoar) Homeland Security Twitter Suicide Study Human Genome Project ER 2013 Keynote 75 Knowledge Bundle Building (i.e., Construct and Populate CMs) Example: Bio-Medical Research • Objective: Study the association of: – TP53 polymorphism and – Lung cancer • Task: locate, gather, organize data from: – Single Nucleotide Polymorphism database – Medical journal articles – Medical-record database – Radiology images and reports ER 2013 Keynote 76 Form-Based Extraction Ontologies Gather SNP Information from the NCBI dbSNP Repository ER 2013 Keynote 77 Linguistically Grounded Conceptual Models Search PubMed Literature ER 2013 Keynote 78 Reverse-Engineer Human Subject Information from INDIVO into a Conceptual Model ER 2013 Keynote 79 Add Annotated Images into the Conceptual Knowledge Bundle Radiology Report (John Doe, July 19, 12:14 pm) ER 2013 Keynote 80 Query and Analyze Data in Knowledge the Bundle ER 2013 Keynote 81 Honda moins de 8000 en «excellent état» Q 한국어 marque prix Honda 7826€ 8000€ mots de clé Honda (2) Multi-Lingual Query Processing 가격 주행거 리 색상 자동차 제조 사 특징 + 연식 모델등 급 차종 엔진 모델 등급 액세서리 변속기 français ER 2013 Keynote 82 Honda moins de 8000 en «excellent état» Q 한국어 marque prix Honda 7826€ 8000€ mots de clé Honda (2) 가격 주행거 리 색상 자동차 제조 사 특징 + 연식 모델등 급 차종 엔진 모델 등급 액세서리 변속기 français ER 2013 Keynote 83 Table Understanding • Tables on the web – 14.1 billion HTML tables [Cafarella et al. 08] • Most are tables for layout • 154 million high-quality relational tables – 50 million spreadsheet tables [Adelfio & Samet 13] • Web table complexity (sampling statistics) [ibid] – Simple relational table: 25% (spreadsheet) 68% (HTML) – Multiple header rows: 15% (spreadsheet) 7% (HTML) – More complex: 60% (spreadsheet) 25% (HTML) ER 2013 Keynote 84 Table Understanding A 1 2 3 4 5 6 7 8 9 10 11 12 12 12 ER 2013 Keynote Schools Pupils B 2003/04 2004/05 2005/06 2006/07 2007/08 2008/09 2009/101 2003/04 2004/05 2005/06 2006/07 2007/08 2008/09 2009/101 C D E Less than 100 100-299 p 300 pupils or more 36.2 39 24.8 35.2 39 25.8 35.2 39 25.8 34.3 40 25.7 34 39.6 26.4 33.3 40 26.7 32 40.7 27.3 8.7 39.3 52 8.7 38.3 53 8.8 38.3 52.9 8.4 39 52.6 8.3 38.2 53.5 8.1 38.2 53.7 7.7 38.2 54.1 85 Table Understanding A 1 2 3 4 5 6 7 8 9 10 11 12 12 12 ER 2013 Keynote Schools Pupils B 2003/04 2004/05 2005/06 2006/07 2007/08 2008/09 2009/101 2003/04 2004/05 2005/06 2006/07 2007/08 2008/09 2009/101 C D E Less than 100 100-299 p 300 pupils or more 36.2 39 24.8 35.2 39 25.8 35.2 39 25.8 34.3 40 25.7 34 39.6 26.4 33.3 40 26.7 32 40.7 27.3 8.7 39.3 52 8.7 38.3 53 8.8 38.3 52.9 8.4 39 52.6 8.3 38.2 53.5 8.1 38.2 53.7 7.7 38.2 54.1 86 Table Understanding A 1 2 3 4 5 6 7 8 9 10 11 12 12 12 ER 2013 Keynote Schools Pupils B 2003/04 2004/05 2005/06 2006/07 2007/08 2008/09 2009/101 2003/04 2004/05 2005/06 2006/07 2007/08 2008/09 2009/101 C D E Less than 100 100-299 p 300 pupils or more 36.2 39 24.8 35.2 39 25.8 35.2 39 25.8 34.3 40 25.7 34 39.6 26.4 33.3 40 26.7 32 40.7 27.3 8.7 39.3 52 8.7 38.3 53 8.8 38.3 52.9 8.4 39 52.6 8.3 38.2 53.5 8.1 38.2 53.7 7.7 38.2 54.1 87 Table Understanding A 1 2 3 4 5 6 7 8 9 10 11 12 12 12 ER 2013 Keynote Schools Pupils B 2003/04 2004/05 2005/06 2006/07 2007/08 2008/09 2009/101 2003/04 2004/05 2005/06 2006/07 2007/08 2008/09 2009/101 C D E Less than 100 100-299 p 300 pupils or more 36.2 39 24.8 35.2 39 25.8 35.2 39 25.8 34.3 40 25.7 34 39.6 26.4 33.3 40 26.7 32 40.7 27.3 8.7 39.3 52 8.7 38.3 53 8.8 38.3 52.9 8.4 39 52.6 8.3 38.2 53.5 8.1 38.2 53.7 7.7 38.2 54.1 88 Table Understanding A 1 2 3 4 5 6 7 8 9 10 11 12 12 12 ER 2013 Keynote Schools Pupils B 2003/04 2004/05 2005/06 2006/07 2007/08 2008/09 2009/101 2003/04 2004/05 2005/06 2006/07 2007/08 2008/09 2009/101 C D E Less than 100 100-299 p 300 pupils or more 36.2 39 24.8 35.2 39 25.8 35.2 39 25.8 34.3 40 25.7 34 39.6 26.4 33.3 40 26.7 32 40.7 27.3 8.7 39.3 52 8.7 38.3 53 8.8 38.3 52.9 8.4 39 52.6 8.3 38.2 53.5 8.1 38.2 53.7 7.7 38.2 54.1 89 Table Understanding A 1 2 3 4 5 6 7 8 9 10 11 12 12 12 ER 2013 Keynote Schools Pupils B 2003/04 2004/05 2005/06 2006/07 2007/08 2008/09 2009/101 2003/04 2004/05 2005/06 2006/07 2007/08 2008/09 2009/101 C D E Less than 100 100-299 p 300 pupils or more 36.2 39 24.8 35.2 39 25.8 35.2 39 25.8 34.3 40 25.7 34 39.6 26.4 33.3 40 26.7 32 40.7 27.3 8.7 39.3 52 8.7 38.3 53 8.8 38.3 52.9 8.4 39 52.6 8.3 38.2 53.5 8.1 38.2 53.7 7.7 38.2 54.1 90 Automating Ontology Creation with TANGO Agglomeration Population Continent Country Tokyo 31,139,900 Asia Japan New York-Philadelphia 30,286,900 The Americas United States of America Mexico 21,233,900 The Americas Mexico Seoul 19,969,100 Asia Korea (South) Sao Paulo 18,847,400 The Americas Brazil Jakarta 17,891,000 Asia Indonesia Osaka-Kobe-Kyoto 17,621,500 Asia Japan … … … Agglomeration Population Country Continent … Niigata 503,500 Asia Japan Raurkela 503,300 Asia India Homjel 502,200 Europe Belarus Zunyi 501,900 Asia China Santiago 501,800 The Americas Dominican Republic Pingdingshan 501,500 Asia China Fargona 501,000 Asia Uzbekistan Kirov 500,200 Europe Russia Newcastle 500,000 Australia /Oceania Australia Automating Ontology Creation with TANGO Longitude Latitude Latitude and longitude designates location Name Geopolitical Entity Population Country Continent Merge Location names Agglomeration has Results Has GMT Time Country Population City Longitude Latitude Latitude and longitude designates location Name Continent Geopolitical Entity Country Location Agglomeration City Automated Reading: NELL http://rtw.ml.cmu.edu ER 2013 Keynote 93 Automated Reading: OntoSoar • Populate conceptual model from text – Directly – By inference • Augment conceptual model and populate ER 2013 Keynote 94 Homeland Security: Terrorist Example ER 2013 Keynote 95 Homeland Security: Terrorist Example ER 2013 Keynote 96 Homeland Security: Terrorist Example Abu Aziz ? White House White House ER 2013 Keynote 97 Homeland Security: Terrorist Example Abu Aziz ? White House White House ER 2013 Keynote 98 Homeland Security: Terrorist Example Abu Aziz What If! ? White House White House ER 2013 Keynote 99 Twitter Suicide Study Oct. 10 2013 Tweets could warn of a suicide risk, BYU study says … Over three months, the computer scientists screened millions of tweets and identified 37,717 that were "genuinely troubling" from 28,088 unique users … ER 2013 Keynote 100 Conceptual Modeling for Studying the Human Genome ER 2013 Keynote 101 Conceptual Modeling for Studying the Human Genome ER 2013 Keynote 102 Roadmap • What is BIG DATA? • Why should Conceptual Modeling apply? • Examples to show how Conceptual Modeling can “come to the rescue” • Summary (and take-home message): – Principles that guide the use of Conceptual Modeling in BIG DATA applications – Challenges and Research Opportunities ER 2013 Keynote 103 Principles that guide the use of Conceptual Modeling in BIG DATA • Harvest wrt a conceptual model – Extraction ontologies – And: table understanding, automated reading, … • Organize wrt a conceptual model – Rich conceptualizations – And: KBs for research studies, multilingual web, … • Analyze wrt a conceptual model – Evidence-based reasoning – And: “what-if”, warning signs search, DNA, … ER 2013 Keynote 104 Summary & Challenge • Conceptual Modeling Applies to BIG DATA (perhaps more than you might have thought) • Challenge: find ways to use conceptual modeling to “rescue”—resolve BIG DATA issues www.deg.byu.edu ER 2013 Keynote 105 BYU Data Extraction Research Group Summary & Challenge • Conceptual Modeling Applies to BIG DATA (perhaps more than you might have thought) • Challenge: find ways to use conceptual modeling to “rescue”—resolve BIG DATA issues www.deg.byu.edu ER 2013 Keynote 106 BYU Data Extraction Research Group