BIG DATA * Conceptual Modeling to the Rescue

BIG DATA
Conceptual Modeling to the Rescue
David W. Embley
with special thanks to Stephen W. Liddle
and the Data Extraction Research Group at
Brigham Young University
Roadmap
• What is BIG DATA?
• Why should Conceptual Modeling apply?
• Examples to show how Conceptual Modeling
can “come to the rescue”
• Summary (and take-home message):
– Principles that guide the use of Conceptual
Modeling in BIG DATA applications
– Challenges and Research Opportunities
ER 2013 Keynote
2
Roadmap
• What is BIG DATA?
• Why should Conceptual Modeling apply?
• Examples to show how Conceptual Modeling
can “come to the rescue”
• Summary (and take-home message):
– Principles that guide the use of Conceptual
Modeling in BIG DATA applications
– Challenges and Research Opportunities
ER 2013 Keynote
3
BIG DATA
• Volume: typically exceeding terabytes
• Variety: heterogeneous sources; diverse needs
• Velocity: phenomenal rate of acquisition
• Veracity: trustworthiness & uncertainty
ER 2013 Keynote
4
Volume: Kilobyte (103)
A paragraph of text
ER 2013 Keynote
5
Volume: Megabyte (106)
A small novel
ER 2013 Keynote
6
Volume: Gigabyte (109)
Sound wave of Beethoven’s Fifth Symphony
ER 2013 Keynote
7
Volume: Terabyte (1012)
All the X-ray images in a large hospital
ER 2013 Keynote
8
Volume: Petabyte (1015)
10 billion Facebook photos
ER 2013 Keynote
9
Volume: Exabyte (1018)
1/5 of the words ever spoken
ER 2013 Keynote
10
Volume: Zettabyte (1021)
Grains of sand on all the world’s beaches
ER 2013 Keynote
11
Volume: Yottabyte (1024)
Atoms in 7,000 human bodies
NSA data site – purportedly
designed to store yottabytes of data.
ER 2013 Keynote
12
Variety: Heterogeneous Sources
& Diverse Needs
Radiology
Report
(John Doe, July
19, 12:14 pm)
ER 2013 Keynote
13
Velocity
Astronomers expect to be processing 10 petabytes
of data every hour from the SKA telescope.
Square Kilometer Array Telescope
ER 2013 Keynote
14
Velocity
One minute on the Internet:
640TB data transferred,
100k tweets,
204 million e-mails sent
ER 2013 Keynote
15
Veracity: Uncertainty
• An age-old question:
“What is truth?”
• Einstein: “The pursuit of truth and
beauty is a sphere of activity in which
we are permitted to remain children all
our lives.”
• Of one thing we
can be certain:
ER 2013 Keynote
16
Roadmap
• What is BIG DATA?
• Why should Conceptual Modeling apply?
• Examples to show how Conceptual Modeling
can “come to the rescue”
• Summary (and take-home message):
– Principles that guide the use of Conceptual
Modeling in BIG DATA applications
– Challenges and Research Opportunities
ER 2013 Keynote
17
Conceptual Modeling & BIG DATA
• Main thrust: organizing data [Chen, TODS’76]
• And, that’s one of the challenges of BIG DATA …
but
– Volume: too big
– Variety: too much
– Velocity: too fast
– Veracity: too uncertainty
ER 2013 Keynote
18
Looking Backward
select PART-NO, QUANTITY-ON-HAND
where …
ER 2013 Keynote
19
Looking Forward
• Conceptualization of the Web
– Semantic search as well as keyword search
– World-wide knowledge sharing
• Examples:
– DB-pedia
– Conceptual Graphs
•
•
•
•
Google’s Knowledge Graph
Yahoo!’s Web of Objects
Facebook’s Graph Search
Microsoft’s/Bing’s Satori Knowledge Base
– Metaweb
– FamilySearch
• Conceptual Modeling should apply!
ER 2013 Keynote
20
SELECT ?name ?description_en ?description_de ?musician WHERE {
?musician <http://purl.org/dc/terms/subject>
<http://dbpedia.org/resource/Category:German_musicians> .
?musician foaf:name ?name .
OPTIONAL {
?musician rdfs:comment ?description_en .
FILTER (LANG(?description_en) = 'en') .
}
OPTIONAL {
?musician rdfs:comment ?description_de .
ER 2013 Keynote
21
Google’s Knowledge Graph
ER 2013 Keynote
22
Yahoo!’s Web of Objects
Yahoo!’s image answer to: What is a food?
ER 2013 Keynote
23
Facebook’s Graph Search
ER 2013 Keynote
24
Satori Knowledge Base
ER 2013 Keynote
25
Metaweb
Boston ?
ER 2013 Keynote
26
Metaweb
Don’t forget to take Wendy to
Boston’s birthday party at 2:00.
ER 2013 Keynote
27
Metaweb
Don’t forget to take Wendy to
Boston’s birthday party at 2:00.
ER 2013 Keynote
28
Roadmap
• What is BIG DATA?
• Why should Conceptual Modeling apply?
• Examples to show how Conceptual Modeling
can “come to the rescue”
• Summary (and take-home message):
– Principles that guide the use of Conceptual
Modeling in BIG DATA applications
– Challenges and Research Opportunities
ER 2013 Keynote
29
A service provided by The Church of Jesus
Christ of Latter-day Saints. © 2013 by
Intellectual Reserve, Inc. All rights reserved.
A free family history web site
Visitors per day: 85,000+
Pages viewed per day: 5M+
ER 2013 Keynote
30
& the WoK-HD Project
• FamilySearch
– Volume:
• 1.8PB+ online (1.2B records along with 900M 2MB jpeg images)
• 42PB+ offline (1.2B 30–40MB tiff images)
– Velocity:
• 500M+ images in 2013
• 200K+ volunteer indexers
• WoK-HD scanned-book project (within FamilySearch)
– Volume: 100,000 books (3.5TB expected)
– Velocity: 25,000 books / year
ER 2013 Keynote
31
WoK-HD
(A Web of Knowledge Superimposed over Historical Documents)
…
…
…
…
ER 2013 Keynote
32
WoK-HD
(A Web of Knowledge Superimposed over Historical Documents)
grandchildren of Mary Ely
…
…
…
…
ER 2013 Keynote
33
WoK-HD
(A Web of Knowledge Superimposed over Historical Documents)
grandchildren of Mary Ely
…
…
…
…
ER 2013 Keynote
34
WoK-HD
(A Web of Knowledge Superimposed over Historical Documents)
grandchildren of Mary Ely
…
…
…
…
ER 2013 Keynote
35
WoK-HD
(A Web of Knowledge Superimposed over Historical Documents)
grandchildren of Mary Ely
…
…
…
…
ER 2013 Keynote
36
WoK-HD Construction
• Mitigating Velocity, Variety, & Volume
– CM-based information extraction
• PatternReader (for semi-structured text)
• OntoSoar (for unstructured text)
– Automated information harvesting & organization
• Assuring Veracity
– CM-based query processing (with links and
reasoning chains for extracted information)
– Automated analysis with evidence-based CMs
ER 2013 Keynote
37
PatternReader
THE ELY ANCESTRY. 419
SEVENTH GENERATION.
241213. Mary Eliza Warner, b. 1826, dau. of Samuel Selden Warner
and Azubah Tully; m. 1850, Joel M. Gloyd (who was connected with
Chief Justice Waite's family),
24331 1. Abigail Huntington Lathrop (widow), Boonton, N. J., b.
1810, dau. of Mary Ely and Gerard Lathrop ; m. 1835, Donald McKenzie.
West Indies, who was b. 1812, d. 1839.
(The widow is unable to give the names of her husband's parents.)
Their children
1. Mary Ely, b, 1836, d. 1859.
2. Gerard Lathrop, b. 1838.
243312. William Gerard Lathrop, Boonton, N. J., b. 1812, d. 1882,
son of Mary Ely and Gerard Lathrop; m. 1837, Charlotte Brackett
Jennings, New York City, who was b. 1818, dau. of Nathan Tilestone
Jennings and Maria Miller. Their children:
1. Maria Jennings, b. 1838, d. 1840.
2. William Gerard, b. 1840. ) .
3. Donald McKenzie, b. 1840, d. 1843. ]
4. Anna Margaretta, b. 1843.
5. Anna Catherine, b. 1845.
243314. Charles Christopher Lathrop, N. Y. City, b. 1817, d. 1865,
son of Mary Ely and Gerard Lathrop ; m. 1856, Mary Augusta Andruss,
992 Broad St., Newark, N. J., who was b. 1825, dau. of Judge Caleb
Halstead Andruss and Emma Sutherland Goble. Mrs. Lathrop died
at her home, 992 Broad St., Newark, N. J., Friday morning, Nov. 4,
1898. The funeral services were held at her residence on Monday, Nov.
7, 1898, at half-past two o'clock P. M. Their children:
1. Charles Halstead, b. 1857, d. 1861.
2. William Gerard, b. 1858, d. 1861.
3. Theodore Andruss, b. i860.
4. Emma Goble, b. 1862.
Miss Emma Goble Lathrop, official historian of the New York Chapter of the
Daughters of the American Revolution, is one of the youngest members to hold
office, but one whose intelligence and capability qualify her for such distinction.
Miss Lathrop is not without experience; in her present home and native city, Newark,
N. J., she has filled the positions of secretary and treasurer to the Girls'
Friendly Society for nine years, secretary and president of the Woman's Auxiliary
of Trinity Church Parish, treasurer of the St. Catherine's Guild of St. Barnabas
Hospital, and manager of several of Newark's charitable institutions which her
grandparents were instrumental in founding. Miss Lathrop traces her lineage
back through many generations of famous progenitors on both sides. Her maternal
ancestors were among the early settlers of New Jersey, among them John Ogden,
who received patent in 1664 for the purchase of Elizabethtown, and who in 1673 was
ER 2013 Keynote
38
PatternReader
THE ELY ANCESTRY. 419
SEVENTH GENERATION.
241213. Mary Eliza Warner, b. 1826, dau. of Samuel Selden Warner
and Azubah Tully; m. 1850, Joel M. Gloyd (who was connected with
Chief Justice Waite's family),
24331 1. Abigail Huntington Lathrop (widow), Boonton, N. J., b.
1810, dau. of Mary Ely and Gerard Lathrop ; m. 1835, Donald McKenzie.
West Indies, who was b. 1812, d. 1839.
(The widow is unable to give the names of her husband's parents.)
Their children
1. Mary Ely, b, 1836, d. 1859.
2. Gerard Lathrop, b. 1838.
243312. William Gerard Lathrop, Boonton, N. J., b. 1812, d. 1882,
son of Mary Ely and Gerard Lathrop; m. 1837, Charlotte Brackett
Jennings, New York City, who was b. 1818, dau. of Nathan Tilestone
Jennings and Maria Miller. Their children:
1. Maria Jennings, b. 1838, d. 1840.
2. William Gerard, b. 1840. ) .
3. Donald McKenzie, b. 1840, d. 1843. ]
4. Anna Margaretta, b. 1843.
5. Anna Catherine, b. 1845.
243314. Charles Christopher Lathrop, N. Y. City, b. 1817, d. 1865,
son of Mary Ely and Gerard Lathrop ; m. 1856, Mary Augusta Andruss,
992 Broad St., Newark, N. J., who was b. 1825, dau. of Judge Caleb
Halstead Andruss and Emma Sutherland Goble. Mrs. Lathrop died
at her home, 992 Broad St., Newark, N. J., Friday morning, Nov. 4,
1898. The funeral services were held at her residence on Monday, Nov.
7, 1898, at half-past two o'clock P. M. Their children:
1. Charles Halstead, b. 1857, d. 1861.
2. William Gerard, b. 1858, d. 1861.
3. Theodore Andruss, b. i860.
4. Emma Goble, b. 1862.
Miss Emma Goble Lathrop, official historian of the New York Chapter of the
Daughters of the American Revolution, is one of the youngest members to hold
office, but one whose intelligence and capability qualify her for such distinction.
Miss Lathrop is not without experience; in her present home and native city, Newark,
N. J., she has filled the positions of secretary and treasurer to the Girls'
Friendly Society for nine years, secretary and president of the Woman's Auxiliary
of Trinity Church Parish, treasurer of the St. Catherine's Guild of St. Barnabas
Hospital, and manager of several of Newark's charitable institutions which her
grandparents were instrumental in founding. Miss Lathrop traces her lineage
back through many generations of famous progenitors on both sides. Her maternal
ancestors were among the early settlers of New Jersey, among them John Ogden,
who received patent in 1664 for the purchase of Elizabethtown, and who in 1673 was
…
1. Mary Ely, b, 1836, d. 1859.
2. Gerard Lathrop, b. 1838.
…
1. Maria Jennings, b. 1838, d. 1840.
2. William Gerard, b. 1840. ) .
3. Donald McKenzie, b. 1840, d. 1843. ]
4. Anna Margaretta, b. 1843.
5. Anna Catherine, b. 1845.
…
1. Charles Halstead, b. 1857, d. 1861.
2. William Gerard, b. 1858, d. 1861.
3. Theodore Andruss, b. i860.
4. Emma Goble, b. 1862.
ER 2013 Keynote
39
PatternReader
…
1. Mary Ely, b, 1836, d. 1859.
2. Gerard Lathrop, b. 1838.
…
1. Maria Jennings, b. 1838, d. 1840.
2. William Gerard, b. 1840. ) .
3. Donald McKenzie, b. 1840, d. 1843. ]
4. Anna Margaretta, b. 1843.
5. Anna Catherine, b. 1845.
…
1. Charles Halstead, b. 1857, d. 1861.
2. William Gerard, b. 1858, d. 1861.
3. Theodore Andruss, b. i860.
4. Emma Goble, b. 1862.
ER 2013 Keynote
40
PatternReader
…
1. Mary Ely, b, 1836, d. 1859.
2. Gerard Lathrop, b. 1838.
…
1. Maria Jennings, b. 1838, d. 1840.
2. William Gerard, b. 1840. ) .
3. Donald McKenzie, b. 1840, d. 1843. ] “
4. Anna Margaretta, b. 1843.
5. Anna Catherine, b. 1845.
…
1. Charles Halstead, b. 1857, d. 1861.
2. William Gerard, b. 1858, d. 1861.
3. Theodore Andruss, b. i860.
4. Emma Goble, b. 1862.
}
Twins” (lost in OCR)
OCR Error
ER 2013 Keynote
41
PatternReader
Conflate symbols and induce grammar
…
#. Aaaa Aaaa, b, 18##, d. 18##.
#. Aaaa Aaaa, b. 18##.
…
#. Aaaa Aaaa, b. 18##, d. 18##.
#. Aaaa Aaaa, b. 18##. ) .
#. Aaaa AaAa, b. 18##, d. 18##. ]
#. Aaaa Aaaa, b. 18##.
#. Aaaa Aaaa, b. 18##.
…
#. Aaaa Aaaa, b. 18##, d. 18##.
#. Aaaa Aaaa, b. 18##, d. 18##.
#. Aaaa Aaaa, b. i8##.
#. Aaaa Aaaa, b. 18##.
^(\d)\.\s([A-Z][a-z]{3,7})\s([A-Z][a-z]{4,9}),\sb\.\s([i1]8\d\d)$
^(\d)\.\s(([A-Z][a-z][A-Z][a-z]{5})|([A-Z][a-z]{3,7}))\s([A-Z][a-z]{4,9}),\sb[.,]\s(18\d\d)\sd.\s(18\d\d)\.$
ER 2013 Keynote
42
PatternReader
…
1. Mary Ely, b, 1836, d. 1859.
2. Gerard Lathrop, b. 1838.
…
1. Maria Jennings, b. 1838, d. 1840.
2. William Gerard, b. 1840. ) .
3. Donald McKenzie, b. 1840, d. 1843. ]
4. Anna Margaretta, b. 1843.
5. Anna Catherine, b. 1845.
…
1. Charles Halstead, b. 1857, d. 1861.
2. William Gerard, b. 1858, d. 1861.
3. Theodore Andruss, b. i860.
4. Emma Goble, b. 1862.
ER 2013 Keynote
43
PatternReader
…
1. Mary Ely, b, 1836, d. 1859.
2. Gerard Lathrop, b. 1838.
…
1. Maria Jennings, b. 1838, d. 1840.
2. William Gerard, b. 1840. ) .
3. Donald McKenzie, b. 1840, d. 1843. ]
4. Anna Margaretta, b. 1843.
5. Anna Catherine, b. 1845.
…
1. Charles Halstead, b. 1857, d. 1861.
2. William Gerard, b. 1858, d. 1861.
3. Theodore Andruss, b. i860.
4. Emma Goble, b. 1862.
ER 2013 Keynote
44
Conceptual Modeling—the Backbone
ER 2013 Keynote
45
Conceptual Modeling—the Backbone
(\d)\.\s([A-Z][a-z]{3,7})\s([A-Z][a-z]{4,9}),\sb\.\s([i1]8\d\d)
ER 2013 Keynote
46
Conceptual Modeling—the Backbone
(\d)\.\s([A-Z][a-z]{3,7})\s([A-Z][a-z]{4,9}),\sb\.\s([i1]8\d\d)
ER 2013 Keynote
47
Conceptual Modeling—the Backbone
ER 2013 Keynote
48
Extraction Ontologies
Linguistically Grounded Conceptual Models
ER 2013 Keynote
49
Lexical Object-Set Recognizers
BirthDate
external representation: \b[1][6-9]\d\d\b
left context: b\.\s
right context: [.,]
…
ER 2013 Keynote
50
Non-lexical Object-Set Recognizers
Person
object existence rule: {Name}
…
Name
external representation: \b{FirstName}\s{LastName}\b
…
ER 2013 Keynote
51
Relationship-Set Recognizers
Person-BirthDate
external representation: ^\d{1,3}\.\s{Person},\sb\.\s{BirthDate}[.,]
…
ER 2013 Keynote
52
Ontology-Snippet Recognizers
ChildRecord
external representation: ^(\d{1,3})\.\s+([A-Z]\w+\s[A-Z]\w+)
(,\sb\.\s([1][6-9]\d\d))?(,\sd\.\s([1][6-9]\d\d))?\.
ER 2013 Keynote
53
HMM Recognizers
54
OntoSoar Recognizers
ER 2013 Keynote
55
OntoSoar Recognizers
+---------------------------------Xp------------------------------+
|
+--------Ost--------+
+-----Js-----+ |
+-Wd-+-Ss-+
+-----A-----+--Mp---+ +---DG--+ |
|
|
|
|
|
| |
| |
^ Emma was.v official.a historian.n of the NYCDAR .
Soar
OntoES
“of”(x1,x2)
“NYCDAR”(x2)
“Emma”(x1)
“historian”(x1)
“official”(x1)
Name(“Emma”)
Officer(“historian”)
Organization(“NYCDAR”)
Person–Name(y1,“Emma”)
Person-Officer-Organization(y1,“official historian”,“NYCDAR”)
ER 2013 Keynote
56
Beyond Extraction
• Canonicalization
• Reasoning
– Extraction of implied assertions
– Generation of implied assertions
– Object identity resolution
• Free-form query processing
• Form-based advanced query processing
All based on Conceptual Modeling
ER 2013 Keynote
57
Canonicalization for Lexical Object Sets
•
•
•
•
“Easter 1832”  JulianDate(1832113)
JulianDate(1832113)  22 Apr 1832
“Sam’l” and “Geo.”  “Samuel” and “George”
“Boonton, N.J.”  “Boonton, NJ, USA”
• Operations:
– before(Date1, Date2): Boolean
– probabilityMale(Name): 0.0..1.0
ER 2013 Keynote
58
Implied Assertions
Author’s View
Desired View
Gender: Female
Maria Jennings … daughter of …
William Gerard Lathrop
ER 2013 Keynote
Name:
GivenName: Maria
Jennings
Surname:
Lathrop
59
Implied Assertions
Maria Jennings Lathrop …
child of …
William Gerard Lathrop …
son of …
Mary Ely … Female

Mary Ely … grandmother of
… Maria Jennings Lathrop
ER 2013 Keynote
60
Object Identity Resolution
0.032081
0.995030
0.032081
ER 2013 Keynote
61
Free-Form Query Processing
Persons born in 1838
ER 2013 Keynote
62
Free-Form Query Processing
Persons born in 1838
born
Person(s)?
ER 2013 Keynote
63
Free-Form Query Processing
Persons born in 1838
born
= 1838
Person(s)?
Person Name
BirthDate
Person11 Gerard Lathrop McKenzie 1838
Person18 Maria Jennings Lathrop 1838
ER 2013 Keynote
64
Free-Form Query Processing
Persons born in 1838
Person Name
BirthDate
Person11 Gerard Lathrop McKenzie 1838
Person18 Maria Jennings Lathrop 1838
“Gerard Lathrop McKenzie” because:
Person(Person11) has GivenName (“Gerard Lathrop”)
and Child(Person11) of Person(Person9)
and Person(Person9) has Gender(“Male”)
and Person(Person9) has Surname(“McKenzie”)
ER 2013 Keynote
65
Form-Based Advanced Query Processing
Cousins of Donald Lathrop who died before he was born or were born after he died.
Cousin
ER 2013 Keynote
66
Form-Based Advanced Query Processing
Cousins of Donald Lathrop who died before he was born or were born after he died.
…
…
1. Mary Ely, b, 1836, d. 1859.
2. Gerard Lathrop, b. 1838.
…
1. Maria Jennings, b. 1838, d. 1840.
2. William Gerard, b. 1840. ) .
3. Donald McKenzie, b. 1840, d. 1843. ]
4. Anna Margaretta, b. 1843.
5. Anna Catherine, b. 1845.
…
1. Charles Halstead, b. 1857, d. 1861.
2. William Gerard, b. 1858, d. 1861.
3. Theodore Andruss, b. i860.
4. Emma Goble, b. 1862.
ER 2013 Keynote
67
Veracity
Mitigating Uncertainty with Conceptual Modeling
• Knowledge
– Populated conceptual model
– Plato: “justified true belief”
• FamilySearch
– Conceptual model of reality
– Constraint violation (discovery)
– Assertion verification (evidence)
• Conceptual modeling for veracity
ER 2013 Keynote
68
Veracity: “Justified True Belief”
Persons born in 1838
Person Name
BirthDate
Person11 Gerard Lathrop McKenzie 1838
Person18 Maria Jennings Lathrop 1838
“Gerard Lathrop McKenzie” because:
Person(Person11) has GivenName (“Gerard Lathrop”)
and Child(Person11) of Person(Person9)
and Person(Person9) has Gender(“Male”)
and Person(Person9) has Surname(“McKenzie”)
ER 2013 Keynote
69
FamilySearch:
Wiki-like Updates + Uncertain Information
Cyclic Pedigree:
Sources of error:
1. Incorrect person merges
2. Incorrect parent-child
relationship assertions
FamilySearch: Useful More Expressive
Conceptual Model Specifications
p = 0.79
2 Nov 1846
x
1 Nov 1845
p = 0.35
1:*
1:2.1:*
Evidence-Based Conceptual Modeling
(1) Model Reality, (2) Allow/Discover Discrepancies, (3) Add Evidence
p = 0.79
2 Nov 1846
x
1 Nov 1845
p = 0.35
1:*
1:2.1:*
Roadmap
• What is BIG DATA?
• Why should Conceptual Modeling apply?
• Examples to show how Conceptual Modeling
can “come to the rescue”
• Summary (and take-home message):
– Principles that guide the use of Conceptual
Modeling in BIG DATA applications
– Challenges and Research Opportunities
ER 2013 Keynote
73
Principles that guide the use of
Conceptual Modeling in BIG DATA
• Harvest wrt a conceptual model
– Extraction ontologies
– And …
• Organize wrt a conceptual model
– Rich conceptualizations
– And …
• Analyze wrt a conceptual model
– Evidence-based reasoning
– And …
ER 2013 Keynote
74
More Examples of Conceptual Modeling
in BIG DATA Applications
•
•
•
•
•
•
•
•
Knowledge Bundle Building for Research Studies (KBB)
Multi-Lingual Query Processing (ML-OntoES)
Table Understanding (TISP, Table Ontology)
Dream!
Automating Ontology
Creation
Think
Big! (TANGO)
Contribute!
Automated Reading (OntoSoar)
Homeland Security
Twitter Suicide Study
Human Genome Project
ER 2013 Keynote
75
Knowledge Bundle Building
(i.e., Construct and Populate CMs)
Example: Bio-Medical Research
• Objective: Study the association of:
– TP53 polymorphism and
– Lung cancer
• Task: locate, gather, organize data from:
– Single Nucleotide Polymorphism database
– Medical journal articles
– Medical-record database
– Radiology images and reports
ER 2013 Keynote
76
Form-Based Extraction Ontologies
Gather SNP Information from the NCBI
dbSNP Repository
ER 2013 Keynote
77
Linguistically Grounded Conceptual
Models Search PubMed Literature
ER 2013 Keynote
78
Reverse-Engineer Human Subject
Information from INDIVO into a
Conceptual Model
ER 2013 Keynote
79
Add Annotated Images into the
Conceptual Knowledge Bundle
Radiology Report
(John Doe, July 19, 12:14 pm)
ER 2013 Keynote
80
Query and Analyze Data in Knowledge
the Bundle
ER 2013 Keynote
81
Honda moins de 8000
en «excellent état»
Q 한국어
marque prix
Honda 7826€
 8000€
mots de clé
Honda (2)
Multi-Lingual Query Processing
가격
주행거
리
색상
자동차
제조
사
특징
+
연식
모델등
급
차종
엔진
모델
등급
액세서리
변속기
français
ER 2013 Keynote
82
Honda moins de 8000
en «excellent état»
Q 한국어
marque prix
Honda 7826€
 8000€
mots de clé
Honda (2)
가격
주행거
리
색상
자동차
제조
사
특징
+
연식
모델등
급
차종
엔진
모델
등급
액세서리
변속기
français
ER 2013 Keynote
83
Table Understanding
• Tables on the web
– 14.1 billion HTML tables [Cafarella et al. 08]
• Most are tables for layout
• 154 million high-quality relational tables
– 50 million spreadsheet tables [Adelfio & Samet 13]
• Web table complexity (sampling statistics) [ibid]
– Simple relational table: 25% (spreadsheet) 68% (HTML)
– Multiple header rows: 15% (spreadsheet) 7% (HTML)
– More complex: 60% (spreadsheet) 25% (HTML)
ER 2013 Keynote
84
Table Understanding
A
1
2
3
4
5
6
7
8
9
10
11
12
12
12
ER 2013 Keynote
Schools
Pupils
B
2003/04
2004/05
2005/06
2006/07
2007/08
2008/09
2009/101
2003/04
2004/05
2005/06
2006/07
2007/08
2008/09
2009/101
C
D
E
Less than 100 100-299 p 300 pupils or more
36.2
39
24.8
35.2
39
25.8
35.2
39
25.8
34.3
40
25.7
34
39.6
26.4
33.3
40
26.7
32
40.7
27.3
8.7
39.3
52
8.7
38.3
53
8.8
38.3
52.9
8.4
39
52.6
8.3
38.2
53.5
8.1
38.2
53.7
7.7
38.2
54.1
85
Table Understanding
A
1
2
3
4
5
6
7
8
9
10
11
12
12
12
ER 2013 Keynote
Schools
Pupils
B
2003/04
2004/05
2005/06
2006/07
2007/08
2008/09
2009/101
2003/04
2004/05
2005/06
2006/07
2007/08
2008/09
2009/101
C
D
E
Less than 100 100-299 p 300 pupils or more
36.2
39
24.8
35.2
39
25.8
35.2
39
25.8
34.3
40
25.7
34
39.6
26.4
33.3
40
26.7
32
40.7
27.3
8.7
39.3
52
8.7
38.3
53
8.8
38.3
52.9
8.4
39
52.6
8.3
38.2
53.5
8.1
38.2
53.7
7.7
38.2
54.1
86
Table Understanding
A
1
2
3
4
5
6
7
8
9
10
11
12
12
12
ER 2013 Keynote
Schools
Pupils
B
2003/04
2004/05
2005/06
2006/07
2007/08
2008/09
2009/101
2003/04
2004/05
2005/06
2006/07
2007/08
2008/09
2009/101
C
D
E
Less than 100 100-299 p 300 pupils or more
36.2
39
24.8
35.2
39
25.8
35.2
39
25.8
34.3
40
25.7
34
39.6
26.4
33.3
40
26.7
32
40.7
27.3
8.7
39.3
52
8.7
38.3
53
8.8
38.3
52.9
8.4
39
52.6
8.3
38.2
53.5
8.1
38.2
53.7
7.7
38.2
54.1
87
Table Understanding
A
1
2
3
4
5
6
7
8
9
10
11
12
12
12
ER 2013 Keynote
Schools
Pupils
B
2003/04
2004/05
2005/06
2006/07
2007/08
2008/09
2009/101
2003/04
2004/05
2005/06
2006/07
2007/08
2008/09
2009/101
C
D
E
Less than 100 100-299 p 300 pupils or more
36.2
39
24.8
35.2
39
25.8
35.2
39
25.8
34.3
40
25.7
34
39.6
26.4
33.3
40
26.7
32
40.7
27.3
8.7
39.3
52
8.7
38.3
53
8.8
38.3
52.9
8.4
39
52.6
8.3
38.2
53.5
8.1
38.2
53.7
7.7
38.2
54.1
88
Table Understanding
A
1
2
3
4
5
6
7
8
9
10
11
12
12
12
ER 2013 Keynote
Schools
Pupils
B
2003/04
2004/05
2005/06
2006/07
2007/08
2008/09
2009/101
2003/04
2004/05
2005/06
2006/07
2007/08
2008/09
2009/101
C
D
E
Less than 100 100-299 p 300 pupils or more
36.2
39
24.8
35.2
39
25.8
35.2
39
25.8
34.3
40
25.7
34
39.6
26.4
33.3
40
26.7
32
40.7
27.3
8.7
39.3
52
8.7
38.3
53
8.8
38.3
52.9
8.4
39
52.6
8.3
38.2
53.5
8.1
38.2
53.7
7.7
38.2
54.1
89
Table Understanding
A
1
2
3
4
5
6
7
8
9
10
11
12
12
12
ER 2013 Keynote
Schools
Pupils
B
2003/04
2004/05
2005/06
2006/07
2007/08
2008/09
2009/101
2003/04
2004/05
2005/06
2006/07
2007/08
2008/09
2009/101
C
D
E
Less than 100 100-299 p 300 pupils or more
36.2
39
24.8
35.2
39
25.8
35.2
39
25.8
34.3
40
25.7
34
39.6
26.4
33.3
40
26.7
32
40.7
27.3
8.7
39.3
52
8.7
38.3
53
8.8
38.3
52.9
8.4
39
52.6
8.3
38.2
53.5
8.1
38.2
53.7
7.7
38.2
54.1
90
Automating Ontology Creation
with TANGO
Agglomeration
Population
Continent
Country
Tokyo
31,139,900
Asia
Japan
New York-Philadelphia
30,286,900
The Americas
United States of
America
Mexico
21,233,900
The Americas
Mexico
Seoul
19,969,100
Asia
Korea (South)
Sao Paulo
18,847,400
The Americas
Brazil
Jakarta
17,891,000
Asia
Indonesia
Osaka-Kobe-Kyoto
17,621,500
Asia
Japan
…
…
…
Agglomeration
Population
Country
Continent
…
Niigata
503,500
Asia
Japan
Raurkela
503,300
Asia
India
Homjel
502,200
Europe
Belarus
Zunyi
501,900
Asia
China
Santiago
501,800
The Americas
Dominican Republic
Pingdingshan
501,500
Asia
China
Fargona
501,000
Asia
Uzbekistan
Kirov
500,200
Europe
Russia
Newcastle
500,000
Australia
/Oceania
Australia
Automating Ontology Creation
with TANGO
Longitude
Latitude
Latitude and longitude
designates location
Name
Geopolitical Entity
Population
Country
Continent
Merge
Location
names
Agglomeration
has
Results
Has
GMT
Time
Country
Population
City
Longitude
Latitude
Latitude and longitude
designates location
Name
Continent
Geopolitical Entity
Country
Location
Agglomeration
City
Automated Reading: NELL
http://rtw.ml.cmu.edu
ER 2013 Keynote
93
Automated Reading: OntoSoar
• Populate conceptual model from text
– Directly
– By inference
• Augment conceptual model and populate
ER 2013 Keynote
94
Homeland Security: Terrorist Example
ER 2013 Keynote
95
Homeland Security: Terrorist Example
ER 2013 Keynote
96
Homeland Security: Terrorist Example
Abu
Aziz
?
White House
White House
ER 2013 Keynote
97
Homeland Security: Terrorist Example
Abu
Aziz
?
White House
White House
ER 2013 Keynote
98
Homeland Security: Terrorist Example
Abu
Aziz
What If!
?
White House
White House
ER 2013 Keynote
99
Twitter Suicide Study
Oct. 10 2013
Tweets could warn of a suicide risk,
BYU study says
… Over three months, the computer scientists
screened millions of tweets and identified 37,717
that were "genuinely troubling" from 28,088 unique
users …
ER 2013 Keynote
100
Conceptual Modeling for Studying the
Human Genome
ER 2013 Keynote
101
Conceptual Modeling for Studying the
Human Genome
ER 2013 Keynote
102
Roadmap
• What is BIG DATA?
• Why should Conceptual Modeling apply?
• Examples to show how Conceptual Modeling
can “come to the rescue”
• Summary (and take-home message):
– Principles that guide the use of Conceptual
Modeling in BIG DATA applications
– Challenges and Research Opportunities
ER 2013 Keynote
103
Principles that guide the use of
Conceptual Modeling in BIG DATA
• Harvest wrt a conceptual model
– Extraction ontologies
– And: table understanding, automated reading, …
• Organize wrt a conceptual model
– Rich conceptualizations
– And: KBs for research studies, multilingual web, …
• Analyze wrt a conceptual model
– Evidence-based reasoning
– And: “what-if”, warning signs search, DNA, …
ER 2013 Keynote
104
Summary & Challenge
• Conceptual Modeling Applies to BIG DATA
(perhaps more than you might have thought)
• Challenge: find ways to use conceptual
modeling to “rescue”—resolve BIG DATA issues
www.deg.byu.edu
ER 2013 Keynote
105
BYU Data Extraction Research Group
Summary & Challenge
• Conceptual Modeling Applies to BIG DATA
(perhaps more than you might have thought)
• Challenge: find ways to use conceptual
modeling to “rescue”—resolve BIG DATA issues
www.deg.byu.edu
ER 2013 Keynote
106
BYU Data Extraction Research Group