Bargmeyer XMDR & Ecoinformatics 10-20 V10

eXtended Metadata Registry
(XMDR)
International Ecoinformatics Technical Collaboration
Berkeley, California
October 24, 2006
Bruce Bargmeyer,
Lawrence Berkeley National Laboratory
University of California
Tel: +1 510-495-2905
bebargmeyer@lbl.gov
1
Topics
• Challenges to address
• A brief tutorial on Semantics and semantic computing
  ◦ where XMDR fits
• Semantic computing technologies
• Traditional Data Administration
• XMDR project
• Test Bed demonstrations
2
The Internet Revolution
A world wide web of diverse content:
The information glut is nothing new. The access to it is astonishing.
3
Challenge: Find and process nonexplicit data
For example…
Patient data on drugs contains brand names (e.g. Tylenol, Anacin-3, Datril,…).
However, we want to study patients taking analgesic agents.
[Figure: concept hierarchy; indentation stands for is-a]
Analgesic Agent
  Non-Narcotic Analgesic
    Analgesic and Antipyretic
      Acetaminophen
        Tylenol
        Anacin-3
        Datril
    Nonsteroidal Antiinflammatory Drug
4
Challenge: Specify and compute across
Relations, e.g., within a food web in an
Arctic ecosystem
An organism is connected to another organism for which it is a source
of food energy and material by an arrow representing the direction of
biomass transfer.
Source: http://en.wikipedia.org/wiki/Food_web#Food_web (from SPIRE)
5
Challenge: Combine Data, Metadata & Concept Systems
Inference Search Query:
“find water bodies downstream from Fletcher Creek where chemical contamination was over 10 micrograms per liter between December 2001 and March 2003”

Data:
ID   Date      Temp  Hg
A    06-09-13  4.4   4
B    06-09-13  9.3   2
X    06-09-13  6.7   78

Concept system:
Contamination
  Biological
  Radioactive
  Chemical
    mercury
    lead
    cadmium

Metadata:
Name  Datatype  Definition                     Units
ID    text      Monitoring Station Identifier  not applicable
Date  date      Date                           yy-mm-dd
Temp  number    Temperature (to 0.1 degree C)  degrees Celsius
Hg    number    Mercury contamination          micrograms per liter
6
Challenge: Use data from systems that record the same facts with different terms
[Figure: eight kinds of registries, each holding "Common Content": Database Catalogs, ISO 11179 Registries, UDDI Registries, OASIS/ebXML Registries, CASE Tool Repositories, Software Component Registries, Ontological Registries, and Dublin Core Registries. The same item appears under different terms: Data Element, Table Column, Business Specification, XML Tag, Attribute, Business Object, Coverage, Term Hierarchy. The example item is Country Identifier.]
8
Challenge: Draw information together from a
broad range of studies, databases, reports, etc.
10
Challenge: Gain Common Understanding of meaning between Data Creators and Data Users
A common interpretation of what the data represents
[Figure: data creators and users (EEA, USGS, DoD, EPA, others) exchange text and data through information systems. The text is a list of subject terms (environ, agriculture, climate, human health, industry, tourism, soil, water, air), shown in English and in Spanish; the data are columns of uninterpreted numeric codes. The flow runs from Data Creation through Information systems to Users.]
11
Semantic Computing and XMDR
• We are laying the foundation to make a quantum leap toward a substantially new way of computing: Semantic Computing
• How can we make use of semantic computing for the environment and health?
• What do environmental agencies need to do to prepare for and stimulate semantic computing?
• What are the ecoinformatics challenges?
12
Coming: A Semantic Revolution
Searching and ranking
Pattern analysis
Knowledge discovery
Question answering
Reasoning
Semi-automated
decision making
13
The Nub of It
• Processing that takes “meaning” into account
• Processing based on the relations between things, not just computing about the things themselves
• Processing that takes people out of the processing, reducing the human toil
  ◦ Data access, extraction, mapping, translation, formatting, validation, inferencing, …
• Delivering higher-level results that are more helpful for the user’s thought and action
14
A Brief Tutorial on Semantics
• What is meaning?
• What are concepts?
• What are relations?
• What are concept systems?
• What is “reasoning”?
16
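Meaning: The Semiotic Triangle
[Figure: triangle with vertices Thought or Reference (Concept), Symbol (“Rose”, “ClipArt”), and Referent. The Symbol symbolises the Concept, the Concept refers to the Referent, and the Symbol stands for the Referent.]
C. K. Ogden and I. A. Richards. The Meaning of Meaning.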
17
Semiotic Triangle:
Concepts, Definitions and Signs
Definition
CONCEPT
Refers To
Symbolizes
“Rose”,
“ClipArt”
Referent
Sign
Stands For
18
Forms of Definitions
Definition - Define by:
--Essence & Differentia
--Relations
--Axioms
CONCEPT
Refers To
Symbolizes
“Rose”,
“ClipArt”
Referent
Sign
Stands For
20
Definition of Concept - Rose:
Dictionary - Essence & Differentia
1. any of the wild or cultivated, usually prickly-stemmed, pinnate-leaved, showy-flowered shrubs of the genus Rosa. Cf. rose family.
2. any of various related or similar plants.
3. the flower of any such shrub, of a red, pink, white, or yellow color.
--Random House Webster’s Unabridged Dictionary (2003)
21
Definitions in the EPA Environmental Data Registry
Mailing Address:
http://www.epa.gov/edr/sw/AdministeredItem#MailingAddress
The exact address where a mail piece is intended to be delivered, including urban-style address, rural route, and PO Box
State USPS Code:
http://www.epa.gov/edr/sw/AdministeredItem#StateUSPSCode
The U.S. Postal Service (USPS) abbreviation that represents a state or state equivalent for the U.S. or Canada
Mailing Address State Name:
http://www.epa.gov/edr/sw/AdministeredItem#StateName
The name of the state where mail is delivered
22
Definition of Concept - Rose:
Relations to Other Concepts
Love
Romance
Marriage
CONCEPT
Refers To
Symbolizes
“Rose”,
“ClipArt”
Referent
Stands For
23
SNOMED – Terms Defined by
Relations
24
Definition of Concept - Rose:
Defined by Axioms in OWL
rdfs:subClassOf
owl:equivalentClass
owl:disjointWith
CONCEPT
Refers To
Symbolizes
“Rose”,
“ClipArt”
Referent
Stands For
25
Class Axiom (Definitions)
Class Description is Building Block of Class Axiom
A class description is the term used in this document (and in the OWL Semantics and Abstract Syntax) for the basic building blocks of class axioms (informally called class definitions in the Overview and Guide documents). A class description describes an OWL class, either by a class name or by specifying the class extension of an unnamed anonymous class.
OWL distinguishes six types of class descriptions:
1. a class identifier (a URI reference)
2. an exhaustive enumeration of individuals that together form the instances of a class
3. a property restriction
4. the intersection of two or more class descriptions
5. the union of two or more class descriptions
6. the complement of a class description
The first type is special in the sense that it describes a class through a class name (syntactically represented as a URI reference). The other five types of class descriptions describe an anonymous class by placing constraints on the class extension.
Class descriptions of type 2-6 describe, respectively, a class that contains exactly the enumerated individuals (2nd type), a class of all individuals which satisfy a particular property restriction (3rd type), or a class that satisfies boolean combinations of class descriptions (4th, 5th and 6th type). Intersection, union and complement can be respectively seen as the logical AND, OR and NOT operators. The four latter types of class descriptions lead to nested class descriptions and can thus in theory lead to arbitrarily complex class descriptions. In practice, the level of nesting is usually limited.
26
Class Descriptions -> Class Axiom
• Class descriptions form the building blocks for defining classes through class axioms. The simplest form of a class axiom is a class description of type 1. It just states the existence of a class, using owl:Class with a class identifier.
• For example, the following class axiom declares the URI reference #Human to be the name of an OWL class:
    <owl:Class rdf:ID="Human"/>
  This is correct OWL, but does not tell us very much about the class Human. Class axioms typically contain additional components that state necessary and/or sufficient characteristics of a class. OWL contains three language constructs for combining class descriptions into class axioms:
• rdfs:subClassOf allows one to say that the class extension of a class description is a subset of the class extension of another class description.
• owl:equivalentClass allows one to say that a class description has exactly the same class extension as another class description.
• owl:disjointWith allows one to say that the class extension of a class description has no members in common with the class extension of another class description.
27
Computable Meaning
rdfs:subClassOf
owl:equivalentClass
owl:disjointWith
CONCEPT
Refers To
Symbolizes
“Rose”,
“ClipArt”
Referent
Stands For
If “rose” is owl:disjointWith “daffodil”, then a computer can determine that an
assertion is invalid, if it states that a rose is also a daffodil (e.g., in a knowledgebase).
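The disjointness check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, the individuals `flower1` and `flower2`, and the helper `find_disjointness_violations` are hypothetical, and a real OWL reasoner does far more than this set comparison.

```python
# Hypothetical mini knowledge base; all names here are illustrative only.
# Each frozenset records one owl:disjointWith-style statement between two classes.
DISJOINT = {frozenset({"Rose", "Daffodil"})}


def find_disjointness_violations(assertions, disjoint_pairs):
    """Return (individual, class_a, class_b) for every individual that is
    asserted to belong to two classes declared disjoint."""
    classes_of = {}
    for individual, cls in assertions:
        classes_of.setdefault(individual, set()).add(cls)

    violations = []
    for individual, classes in sorted(classes_of.items()):
        for pair in disjoint_pairs:
            if pair <= classes:  # both disjoint classes asserted for one individual
                class_a, class_b = sorted(pair)
                violations.append((individual, class_a, class_b))
    return violations


assertions = [
    ("flower1", "Rose"),
    ("flower1", "Daffodil"),  # contradicts the disjointness statement
    ("flower2", "Rose"),
]
violations = find_disjointness_violations(assertions, DISJOINT)
print(violations)
```

Only `flower1` is reported, because it is the only individual asserted into both disjoint classes.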
28
What are Relations?
Relation
WaterBody
Merced River
Fletcher Creek
isA
isA
Merced Lake
Merced
Lake
Fletcher Creek
Concepts and relations can be represented
as nodes and edges in formal graph
structures, e.g., “is-a” hierarchies.
29
Concept Systems have Nodes and may
have Relations
Nodes represent concepts
A
Lines (arcs) represent relations
1
a
2
b
c
d
Concept systems can be represented & queried as graphs
30
A More Complex Concept Graph
Concept lattice of inland water features
Linear
Large linear
Large
Non-linear
Non-linear
Small linear
Small non- linear
Deep
Natural
Flowing
Shallow
Stagnant
Artificial
River
Stream
Canal
Reservoir
Lake
Marsh
Pond
From Supervaluation Semantics for an Inland Water Feature Ontology
Paulo Santos
and Brandon Bennett http://ijcai.org/papers/1187.pdf#search=%22terminology%20water%20ontology%22
31
Types of Concept System Graph Structures
• Trees
• Partially Ordered Trees
• Ordered Trees
• Faceted Classifications
• Directed Acyclic Graphs
• Partially Ordered Graphs
• Lattices
• Bipartite Graphs
• Directed Graphs
• Cliques
• Compound Graphs
32
Types of Concept System Graph Structures
Tree
Partial Order Tree
Ordered Tree
Partial Order Graph
Bipartite Graph
Faceted Classification
Powerset of 3 element set
Directed Acyclic Graph
Clique
Compound Graph
33
Graph Taxonomy
Graph
Directed Graph
Undirected Graph
Directed Acyclic Graph
Bipartite Graph
Clique
Partial Order Graph
Faceted Classification
Lattice
Partial Order Tree
Tree
Note: not all bipartite graphs
are undirected.
Ordered Tree
34
What Kind of Relations are There?
Lots!
Relationship class: A particular type of connection existing between people related to or having dealings with each other.
• acquaintanceOf - A person having more than slight or superficial knowledge of this person but short of friendship.
• ambivalentOf - A person towards whom this person has mixed feelings or emotions.
• ancestorOf - A person who is a descendant of this person.
• antagonistOf - A person who opposes and contends against this person.
• apprenticeTo - A person to whom this person serves as a trusted counselor or teacher.
• childOf - A person who was given birth to or nurtured and raised by this person.
• closeFriendOf - A person who shares a close mutual friendship with this person.
• collaboratesWith - A person who works towards a common goal with this person.
…
35
Example of relations in a food web
in an Arctic ecosystem
An organism is connected to another organism for which it is a source
of food energy and material by an arrow representing the direction of
biomass transfer.
Source: http://en.wikipedia.org/wiki/Food_web#Food_web (from SPIRE)
36
Ontologies are a type of Concept System
• Ontology: explicit formal specifications of the terms in the domain and relations among them (Gruber 1993)
• An ontology defines a common vocabulary for researchers who need to share information in a domain. It includes machine-interpretable definitions of basic concepts in the domain and relations among them.
• Why would someone want to develop an ontology? Some of the reasons are:
  ◦ To share common understanding of the structure of information among people or software agents
  ◦ To enable reuse of domain knowledge
  ◦ To make domain assumptions explicit
  ◦ To separate domain knowledge from the operational knowledge
  ◦ To analyze domain knowledge
http://www.ksl.stanford.edu/people/dlm/papers/ontology101/ontology101-noy-mcguinness.html
37
What is Reasoning?
Inference
[Figure: is-a hierarchy; indentation stands for is-a. The legend marks one arrow style with "Signifies inferred is-a relationship".]
Disease
  Infectious Disease
    Polio
    Smallpox
  Chronic Disease
    Diabetes
    Heart disease
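A minimal sketch of how inferred is-a relationships could be derived from stated ones, assuming the figure shows Polio and Smallpox under Infectious Disease and Diabetes and Heart disease under Chronic Disease. The dictionary layout and function names are hypothetical choices for this illustration.

```python
# Stated (direct) is-a links, child -> parent, as read from the figure.
IS_A = {
    "Infectious Disease": "Disease",
    "Chronic Disease": "Disease",
    "Polio": "Infectious Disease",
    "Smallpox": "Infectious Disease",
    "Diabetes": "Chronic Disease",
    "Heart disease": "Chronic Disease",
}


def ancestors(concept, is_a):
    """Walk the is-a chain upward and return every ancestor, nearest first."""
    result = []
    while concept in is_a:
        concept = is_a[concept]
        result.append(concept)
    return result


def inferred_is_a(is_a):
    """Return (concept, ancestor) pairs that follow by transitivity of is-a
    but are not stated directly."""
    return sorted(
        (concept, ancestor)
        for concept in is_a
        for ancestor in ancestors(concept, is_a)[1:]  # skip the stated parent
    )


inferred = inferred_is_a(IS_A)
print(inferred)
```

Each leaf concept gains one inferred link to Disease; none of those links is stored explicitly.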
38
Reasoning: Taxonomies & partonomies can be used to support inference queries
E.g., if a database contains information on events by city, we could query that database for events that happened in a particular county or state, even though the event data does not contain explicit state or county codes.
[Figure: part-of hierarchy; indentation stands for part-of]
California
  Alameda County
    Oakland
    Berkeley
  Santa Clara County
    Santa Clara
    San Jose
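The city-to-county-to-state query can be sketched as follows. The part-of links come from the figure; the event records and the function names are hypothetical and exist only to make the example runnable.

```python
# part-of links, part -> whole, as shown in the figure.
PART_OF = {
    "Oakland": "Alameda County",
    "Berkeley": "Alameda County",
    "Santa Clara": "Santa Clara County",
    "San Jose": "Santa Clara County",
    "Alameda County": "California",
    "Santa Clara County": "California",
}

# Hypothetical event records that store only a city, no county or state code.
EVENTS = [
    {"event": "workshop", "city": "Berkeley"},
    {"event": "conference", "city": "San Jose"},
    {"event": "field survey", "city": "Oakland"},
]


def is_within(place, region, part_of):
    """True if place equals region or is transitively part of it."""
    while place is not None:
        if place == region:
            return True
        place = part_of.get(place)
    return False


def events_in(region, events, part_of):
    """Names of events whose city lies inside the given region."""
    return [e["event"] for e in events if is_within(e["city"], region, part_of)]


print(events_in("Alameda County", EVENTS, PART_OF))
print(events_in("California", EVENTS, PART_OF))
```

The event table never mentions a county or state; the partonomy supplies that knowledge at query time.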
39
Reasoning: Relationship metadata can be used to infer non-explicit data
For example…
(1) patient data on drugs currently being taken contains brand names (e.g. Tylenol, Anacin-3, Datril,…);
(2) concept system connects different drug types and names with one another (via is-a, part-of, etc. relationships);
(3) so… patient data can be linked and searched by inferred terms like “acetaminophen” and “analgesic” as well as trade names explicitly stored as text strings in the database
[Figure: concept hierarchy; indentation stands for is-a]
Analgesic Agent
  Non-Narcotic Analgesic
    Analgesic and Antipyretic
      Acetaminophen
        Tylenol
        Anacin-3
        Datril
    Nonsteroidal Antiinflammatory Drug
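One way to picture step (3) is query expansion: turn the requested concept into the set of all narrower terms, then match the stored strings against that set. The sketch below assumes the hierarchy reads as in the figure; the patient records and function names are hypothetical.

```python
from collections import defaultdict

# is-a links, child -> parent, following the figure.
IS_A = {
    "Tylenol": "Acetaminophen",
    "Anacin-3": "Acetaminophen",
    "Datril": "Acetaminophen",
    "Acetaminophen": "Analgesic and Antipyretic",
    "Analgesic and Antipyretic": "Non-Narcotic Analgesic",
    "Nonsteroidal Antiinflammatory Drug": "Non-Narcotic Analgesic",
    "Non-Narcotic Analgesic": "Analgesic Agent",
}

# Hypothetical patient records holding only the brand name as a text string.
PATIENT_DRUGS = [
    ("patient-1", "Tylenol"),
    ("patient-2", "Datril"),
    ("patient-3", "Vitamin C"),
]


def descendants(concept, is_a):
    """All concepts that are transitively a kind of the given concept."""
    children = defaultdict(list)
    for child, parent in is_a.items():
        children[parent].append(child)
    found, stack = set(), [concept]
    while stack:
        node = stack.pop()
        for child in children[node]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def patients_taking(concept, records, is_a):
    """Patients whose stored drug string is the concept or any narrower term."""
    terms = descendants(concept, is_a) | {concept}
    return sorted({patient for patient, drug in records if drug in terms})


print(patients_taking("Analgesic Agent", PATIENT_DRUGS, IS_A))
```

Searching for "Acetaminophen" or "Analgesic Agent" returns the same two patients, although neither term is stored in the records.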
40
Reasoning: Least Common Ancestor Query
What is the least common ancestor concept in the NCI Thesaurus for Acetaminophen and Morphine Sulfate? (answer = Analgesic Agent)
[Figure: concept hierarchy; indentation stands for is-a]
Analgesic Agent
  Opioid
    Opiate
      Morphine Sulfate
      Codeine Phosphate
  Non-Narcotic Analgesic
    Analgesic and Antipyretic
      Acetaminophen
    Nonsteroidal Antiinflammatory Drug
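A least common ancestor query over this fragment can be sketched as below. The sketch assumes each concept has a single parent, which holds for the fragment shown but not for the NCI Thesaurus in general; the function names are hypothetical.

```python
# is-a links, child -> parent, for the fragment in the figure (single parent assumed).
IS_A = {
    "Opioid": "Analgesic Agent",
    "Non-Narcotic Analgesic": "Analgesic Agent",
    "Opiate": "Opioid",
    "Morphine Sulfate": "Opiate",
    "Codeine Phosphate": "Opiate",
    "Analgesic and Antipyretic": "Non-Narcotic Analgesic",
    "Nonsteroidal Antiinflammatory Drug": "Non-Narcotic Analgesic",
    "Acetaminophen": "Analgesic and Antipyretic",
}


def path_to_root(concept, is_a):
    """The concept followed by its ancestors, nearest first."""
    path = [concept]
    while path[-1] in is_a:
        path.append(is_a[path[-1]])
    return path


def least_common_ancestor(a, b, is_a):
    """First node on b's path to the root that also lies on a's path."""
    on_path_of_a = set(path_to_root(a, is_a))
    for node in path_to_root(b, is_a):
        if node in on_path_of_a:
            return node
    return None  # the two concepts share no ancestor


print(least_common_ancestor("Acetaminophen", "Morphine Sulfate", IS_A))
```

For a concept system with multiple parents per concept, the same idea would need ancestor sets and a depth measure instead of single paths.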
41
Reasoning: Example “sibling” queries: concepts that share a common ancestor
Environmental:
• "siblings" of Wetland (in NASA SWEET ontology)
Health
• Siblings of ERK1 finds all 700+ other kinase enzymes
• Siblings of Novastatin finds all other statins
11179 Metadata
• Sibling values in an enumerated value domain
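A sibling query reduces to "same parent, different concept". The sketch below uses the GEMET chemical contamination terms that appear later in this deck rather than the SWEET or kinase examples, whose contents are not given here; the function name is hypothetical.

```python
# broader-term links, term -> broader term, from the GEMET entry shown in this deck.
BROADER = {
    "chemical contamination": "contamination",
    "cadmium contamination": "chemical contamination",
    "lead contamination": "chemical contamination",
    "mercury contamination": "chemical contamination",
}


def siblings(concept, broader):
    """Other concepts that share the given concept's broader term."""
    parent = broader.get(concept)
    if parent is None:
        return []
    return sorted(c for c, p in broader.items() if p == parent and c != concept)


print(siblings("mercury contamination", BROADER))
```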
42
Reasoning: More complex “sibling” queries: concepts with multiple ancestors
Health
• Find all the siblings of Breast Neoplasm
[Figure: Breast neoplasm has two parents. Under site neoplasms its siblings are Eye neoplasm and Respiratory System neoplasm; under breast disorders its sibling is Non-Neoplastic Breast Disorder.]
Environmental
• Find all chemicals that are a carcinogen (cause cancer) and toxin (are poisonous) and teratogenic (cause birth defects)
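Both queries can be sketched with sets once a concept may have more than one parent. The neoplasm links follow the figure as read here; the chemical names and their classifications are placeholders, not real data, and the function names are hypothetical.

```python
# Concept -> set of parents, following the figure (Breast neoplasm has two parents).
PARENTS = {
    "Breast neoplasm": {"Site neoplasms", "Breast disorders"},
    "Eye neoplasm": {"Site neoplasms"},
    "Respiratory system neoplasm": {"Site neoplasms"},
    "Non-neoplastic breast disorder": {"Breast disorders"},
}

# Placeholder chemicals with made-up classifications, for illustration only.
CLASSIFIED_AS = {
    "chemical-a": {"carcinogen", "toxin", "teratogen"},
    "chemical-b": {"toxin"},
    "chemical-c": {"carcinogen", "teratogen"},
}


def siblings_by_parent(concept, parents):
    """Map each parent of the concept to the concept's siblings under that parent."""
    return {
        parent: sorted(c for c, ps in parents.items() if parent in ps and c != concept)
        for parent in sorted(parents.get(concept, ()))
    }


def members_of_all(required, classified_as):
    """Items that belong to every one of the required classes."""
    return sorted(item for item, classes in classified_as.items() if required <= classes)


print(siblings_by_parent("Breast neoplasm", PARENTS))
print(members_of_all({"carcinogen", "toxin", "teratogen"}, CLASSIFIED_AS))
```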
43
End of Tutorial about concept systems
Where does ISO/IEC 11179 fit?
44
Data Generation and Use
Cost vs. Coordination
Full Control
$
Community of Interest
Data
Creation
Reporting
Autonomous
Coordination
45
Data Generation and Use
Cost vs. Coordination
Data
Use
$
Full Control
Community of Interest
Data
Creation
Reporting
Autonomous
Coordination
46
ISO/IEC 11179 Metadata Registries Reduce
Cost of Data Creation and Use
Data
Use
$
Full Control
Community of Interest
Data
Creation
Reporting
Autonomous
Coordination
47
Metadata Registries Increase the Benefit
from Data (Strategic Effectiveness)
Benefit
Community of Interest
Autonomous
Reporting
MDR
Full Control
48
What Can ISO/IEC 11179 MDR Do?
Traditional Data Management (11179 Edition 2)
• Register metadata which describes data in databases, applications, XML Schemas, data models, flat files, paper
• Assist in harmonizing, standardizing, and vetting metadata
• Assist data engineering
• Provide a source of well formed data designs for system designers
• Record reporting requirements
• Assist data generation, by describing the meaning of data entry fields and the potential valid values
• Register provenance information that can be provided to end users of data
• Assist with information discovery by pointing to systems where particular data is maintained.
49
Traditional MDR: Manage Code Sets

Data Element Concept
Name: Country Identifiers
Context:
Definition:
Unique ID: 5769
Conceptual Domain:
Maintenance Org.:
Steward:
Classification:
Registration Authority:
Others
Algeria, Belgium, China, Denmark, Egypt, France, ..., Zimbabwe

Data Elements
Name:
Context:
Definition:
Unique ID: 4572
Value Domain:
Maintenance Org.:
Steward:
Classification:
Registration Authority:
Others

ISO 3166       ISO 3166      ISO 3166      ISO 3166      ISO 3166
English Name   French Name   2-Alpha Code  3-Alpha Code  3-Numeric Code
Algeria        L`Algérie     DZ            DZA           012
Belgium        Belgique      BE            BEL           056
China          Chine         CN            CHN           156
Denmark        Danemark      DK            DNK           208
Egypt          Egypte        EG            EGY           818
France         La France     FR            FRA           250
...            ...           ...           ...           ...
Zimbabwe       Zimbabwe      ZW            ZWE           716
50
What Can XMDR Do?
Support a new generation of semantic computing
• Concept system management
• Harmonizing and vetting concept systems
• Linkage of concept systems to data
• Interrelation of multiple concept systems
• Grounding ontologies and RDF in agreed upon semantics
• Reasoning across XMDR content
• Provision of Semantic Services
51
Coming: A Semantic Revolution
Searching and ranking
Pattern analysis
Knowledge discovery
Question answering
Reasoning
Semi-automated decision
making
Full Control
Community of Interest
Reporting
Autonomous
52
We are trying to manage semantics in
an increasingly complex content space
Structured data
Semi-structured data
Unstructured data
Text
Pictographic
Graphics
Multimedia
Voice video
53
11179-3 (E3) Increases MDR Benefit
When communities create information according to a common
vocabulary the value of the resulting information
increases dramatically.
Benefit
Community of Interest
Autonomous
Reporting
MDR
Full Control
54
Example
• Combining Concept Systems, Data, and Metadata to answer queries.
55
Linking Concepts: Text Document
Title 40--Protection of Environment
CHAPTER I--ENVIRONMENTAL PROTECTION AGENCY
PART 141--NATIONAL PRIMARY DRINKING WATER REGULATIONS
§ 141.62 40 CFR Ch. I (7–1–02 Edition)
§ 141.62 Maximum contaminant levels
for inorganic contaminants.
(a) [Reserved]
(b) The maximum contaminant levels
for inorganic contaminants specified in
paragraphs (b) (2)–(6), (b)(10), and (b)
(11)–(16) of this section apply to community
water systems and non-transient,
non-community water systems.
The maximum contaminant level specified
in paragraph (b)(1) of this section
only applies to community water systems.
The maximum contaminant levels
specified in (b)(7), (b)(8), and (b)(9)
of this section apply to community
water systems; non-transient, noncommunity
water systems; and transient
non-community water systems.
Contaminant     MCL (mg/l)
(1) Fluoride    4.0
(2) Asbestos    7 Million Fibers/liter (longer than 10 μm)
(3) Barium      2
(4) Cadmium     0.005
(5) Chromium    0.1
(6) Mercury     0.002
(7) Nitrate     10 (as Nitrogen)
56
Thesaurus Concept System
(From GEMET)
Chemical Contamination
Definition The addition or presence of chemicals to, or in, another
substance to such a degree as to render it unfit for its intended purpose.
Broader Term contamination
Narrower Terms cadmium contamination, lead contamination,
mercury contamination
Related Terms chemical pollutant, chemical pollution
Deutsch: Chemische Verunreinigung
English (US): chemical contamination
Español: contaminación química
SOURCE General Multi-Lingual
Environmental Thesaurus (GEMET)
57
Concept System (Thesaurus)
Contamination
chemical pollutant
Biological
Radioactive
cadmium
Chemical
lead
chemical pollution
mercury
58
Chemicals in EPA Environmental Data Registry
Environmental Data Registry
Name
Mercury
Mercury, bis(acetato.kappa.O)
(benzenamine)-
Mercury, (acetato.kappa.O)
phenyl-, mixt. with
phenylmercuric
propionate
Type
Biological
Organism
Chemical
Chemical
Chemical
CAS
Number
7439-97-6
63549-47-3
No CAS Number
TSN
Acalypha
ostryifolia
28189
ICTV
EPA ID
E17113275
E965269
59
Data
[Figure: map with station X on Fletcher Creek, station B on Merced River, and station A on Merced Lake]

Monitoring Stations
Name  Latitude  Longitude  Location
A     41.45 N   125.99 W   Merced Lake
B     43.23 N   120.50 W   Merced River
X     39.45 N   118.12 W   Fletcher Creek

Measurements
ID  Date        Temp  Hg
A   2006-09-13  4.4   4
B   2006-09-13  9.3   2
X   2006-09-15  5.2   3
X   2006-09-13  6.7   78
60
Metadata

Contaminants
Contaminant  Threshold
mercury      5
lead         42?
cadmium      250?

Metadata
System               Data Element  Definition                        Units                 Precision
Measurements         ID            Monitoring Station Identifier     not applicable        not applicable
Measurements         Date          Date sample was collected         not applicable        not applicable
Measurements         Temp          Temperature                       degrees Celsius       0.1
Measurements         Hg            Mercury contamination             micrograms per liter  0.004
Monitoring Stations  Name          Monitoring Station Identifier
Monitoring Stations  Latitude      Latitude where sample was taken
Monitoring Stations  Longitude     Longitude where sample was taken
Monitoring Stations  Location      Body of water monitored
Contaminants         Contaminant   Name of contaminant
Contaminants         Threshold     Acceptable threshold value
61
Relations among Inland Bodies of Water
Fletcher Creek
feeds into
Merced River
feeds into
Merced River
fed from
Fletcher Creek
feeds into
Merced Lake
Merced Lake
62
Combining Data, Metadata & Concept Systems
Inference Search Query:
“find water bodies downstream from Fletcher Creek where chemical contamination was over 2 parts per billion between December 2001 and March 2003”

Data
ID   Date      Temp  Hg
A    06-09-13  4.4   4
B    06-09-13  9.3   2
X    06-09-13  6.7   78

Concept system
Contamination
  Biological
  Radioactive
  Chemical
    mercury
    lead
    cadmium

Metadata
Name  Datatype  Definition                     Units
ID    text      Monitoring Station Identifier  not applicable
Date  date      Date                           yy-mm-dd
Temp  number    Temperature (to 0.1 degree C)  degrees Celsius
Hg    number    Mercury contamination          micrograms per liter
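The inference query combines three sources: the "feeds into" relations, the contamination concept system, and the metadata that says the Hg column holds mercury contamination in micrograms per liter. A minimal sketch follows, using the three sample rows, with dates read as yy-mm-dd. The station-to-water-body mapping comes from the Monitoring Stations table earlier in the deck. The function names are hypothetical, and treating 2 parts per billion as 2 micrograms per liter is an assumption that holds approximately for dilute aqueous samples.

```python
from datetime import date

FEEDS_INTO = {"Fletcher Creek": "Merced River", "Merced River": "Merced Lake"}
NARROWER = {
    "contamination": ["biological", "radioactive", "chemical"],
    "chemical": ["mercury", "lead", "cadmium"],
}
STATION_LOCATION = {"A": "Merced Lake", "B": "Merced River", "X": "Fletcher Creek"}
# From the metadata: column Hg = mercury contamination, micrograms per liter.
COLUMN_CONTAMINANT = {"Hg": "mercury"}
MEASUREMENTS = [
    {"ID": "A", "Date": date(2006, 9, 13), "Temp": 4.4, "Hg": 4},
    {"ID": "B", "Date": date(2006, 9, 13), "Temp": 9.3, "Hg": 2},
    {"ID": "X", "Date": date(2006, 9, 13), "Temp": 6.7, "Hg": 78},
]


def downstream_of(body, feeds_into):
    """Water bodies reached by following 'feeds into' from the given body."""
    result = []
    while body in feeds_into:
        body = feeds_into[body]
        result.append(body)
    return result


def narrower_terms(term, narrower):
    """All terms transitively narrower than the given term."""
    found = []
    for child in narrower.get(term, []):
        found.append(child)
        found.extend(narrower_terms(child, narrower))
    return found


def contaminated_downstream(source, concept, threshold, start, end):
    """Downstream water bodies with any matching contaminant above threshold
    (micrograms per liter) within the date window."""
    bodies = set(downstream_of(source, FEEDS_INTO))
    contaminants = set(narrower_terms(concept, NARROWER))
    columns = [col for col, name in COLUMN_CONTAMINANT.items() if name in contaminants]
    hits = set()
    for row in MEASUREMENTS:
        body = STATION_LOCATION[row["ID"]]
        if body in bodies and start <= row["Date"] <= end:
            if any(row[col] > threshold for col in columns):
                hits.add(body)
    return sorted(hits)


# The sample rows are dated 2006, so a 2006 window is used here for illustration.
print(contaminated_downstream("Fletcher Creek", "chemical", 2, date(2006, 1, 1), date(2006, 12, 31)))
```

With the slide's own window of December 2001 to March 2003 the sample rows return nothing, since all three are dated 2006. Station X has the highest reading but is excluded because Fletcher Creek is the source, not a downstream body.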
63
Example – Environmental Text Corpus
• Idea: Develop an environmental research corpus that could attract R&D efforts. Include the reports and other material from over $1b EPA sponsored research.
• Prepare the corpus and make it available
  ◦ Research results from years of ORD R&D
• Publish associated metadata and concept systems in XMDR
• Use open source software for EPA testing
64
Extraction Engines
• Find concepts and relations between concepts in text, tables, data, audio, video, …
• Produce databases (relational tables, graph structures), and other output
• Functions:
  ◦ Segment: find text snippets (boundaries important)
  ◦ Classify: determine the database field for a text segment
  ◦ Association: decide which text segments belong together
  ◦ Normalization: put information into standard form
  ◦ Deduplication: collapse redundant information
65
Metadata Registries are Useful
Registered semantics
• For “training” extraction engines
• The “Normalize” function can make use of standard code sets that have mappings between representation forms.
• The “Classify” function can interact with pre-established concept systems.
Provenance
• High precision for proper nouns, less precision (e.g., 70%) for other concepts. This impacts downstream processing, so precision needs to be tracked.
66
Normalize – Need Registered and Mapped Concepts/Code Sets

Data Element Concept
Name: Country Identifiers
Context:
Definition:
Unique ID: 5769
Conceptual Domain:
Maintenance Org.:
Steward:
Classification:
Registration Authority:
Others
Algeria, Belgium, China, Denmark, Egypt, France, ..., Zimbabwe

Data Elements
Name:
Context:
Definition:
Unique ID: 4572
Value Domain:
Maintenance Org.:
Steward:
Classification:
Registration Authority:
Others

ISO 3166       ISO 3166      ISO 3166      ISO 3166      ISO 3166
English Name   French Name   2-Alpha Code  3-Alpha Code  3-Numeric Code
Algeria        L`Algérie     DZ            DZA           012
Belgium        Belgique      BE            BEL           056
China          Chine         CN            CHN           156
Denmark        Danemark      DK            DNK           208
Egypt          Egypte        EG            EGY           818
France         La France     FR            FRA           250
...            ...           ...           ...           ...
Zimbabwe       Zimbabwe      ZW            ZWE           716
67
Information Extraction & Semantic Computing
Extraction
Engine
Segment
Classify
Discover
patterns
Associate
Select models
Normalize
Fit parameters
Deduplicate
Inference
Report results
11179-3
(E3)
XMDR
Actionable
Information
Decision
Support
68
Example – 11179-3 (E3) Support
Semantic Web Applications
XMDR may be used to “ground” the Semantics
of an RDF Statement.
The address state code is “AB”. This can be expressed as a directed
Graph e.g., an RDF statement:
Graph
Node
RDF
Subject
Address
Edge
Predicate
Node
Object
State Code
AB
69
Example: Grounding RDF nodes and relations:
URIs Reference a Metadata Registry
dbA:e0139
ai: MailingAddress
dbA:ma344
ai: StateUSPSCode
“AB”^^ai:StateCode
@prefix dbA: “http://www.epa.gov/databaseA”
@prefix ai: “http://www.epa.gov/edr/sw/AdministeredItem#”
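Grounding can be pictured as resolving each predicate to a registry URI and looking up its registered definition. The sketch below keeps the triples as plain tuples instead of using an RDF library. The `REGISTRY` dictionary stands in for a metadata registry lookup and is an assumption of this example, with definitions copied from the Environmental Data Registry slide; the function names are hypothetical.

```python
AI = "http://www.epa.gov/edr/sw/AdministeredItem#"
PREFIXES = {"ai": AI}

# Stand-in for a metadata registry: URI -> registered definition.
REGISTRY = {
    AI + "MailingAddress": (
        "The exact address where a mail piece is intended to be delivered, "
        "including urban-style address, rural route, and PO Box"
    ),
    AI + "StateUSPSCode": (
        "The U.S. Postal Service (USPS) abbreviation that represents a state "
        "or state equivalent for the U.S. or Canada"
    ),
}

# The two statements from the slide as (subject, predicate, object) tuples.
TRIPLES = [
    ("dbA:e0139", "ai:MailingAddress", "dbA:ma344"),
    ("dbA:ma344", "ai:StateUSPSCode", "AB"),
]


def expand(name, prefixes):
    """Expand prefix:local to a full URI when the prefix is known."""
    prefix, _, local = name.partition(":")
    return prefixes[prefix] + local if prefix in prefixes and local else name


def ground(triples, prefixes, registry):
    """For each triple, return (predicate, full URI, registered definition or None)."""
    grounded = []
    for _subject, predicate, _obj in triples:
        uri = expand(predicate, prefixes)
        grounded.append((predicate, uri, registry.get(uri)))
    return grounded


for predicate, uri, definition in ground(TRIPLES, PREFIXES, REGISTRY):
    print(predicate, "->", uri)
    print("   ", definition)
```

A predicate whose URI is not registered comes back with `None`, which flags a relation with no agreed-upon semantics.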
70
Definitions in the EPA Environmental Data Registry
Mailing Address:
http://www.epa.gov/edr/sw/AdministeredItem#MailingAddress
The exact address where a mail piece is intended to be delivered, including urban-style address, rural route, and PO Box
State USPS Code:
http://www.epa.gov/edr/sw/AdministeredItem#StateUSPSCode
The U.S. Postal Service (USPS) abbreviation that represents a state or state equivalent for the U.S. or Canada
Mailing Address State Name:
http://www.epa.gov/edr/sw/AdministeredItem#StateName
The name of the state where mail is delivered
71
Ontologies for Data Mapping
Ontologies can help to capture and express semantics
Concept
Concept
Concept
Concept
Geographic Area
Geographic Sub-Area
Country
Country Identifier
Country Name
Short Name
Mailing Address
Country Name
Long Name
Distributor
Country Name
Country Code
ISO 3166
2-Character
Code
ISO 3166
3-Numeric Code
ISO 3166
3- Character
Code
FIPS Code
73
Example: Content Mapping Service
• Collect data from many sources. Files contain data that has the same facts represented by different terms. E.g., one system responds with Danemark, DK, another with DNK, another with 208; map all to Denmark.
• XMDR could accept XML files with the data from different code sets and return a result mapped to a single code set.
74
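The core of such a content mapping service is a lookup from every registered representation to one target code set. The sketch below uses the ISO 3166 rows from the code-set table in this deck and maps everything to the English name. It skips the XML input and output entirely, and the function names are hypothetical, not part of XMDR.

```python
# Rows from the code-set table in this deck:
# (English name, French name, 2-alpha code, 3-alpha code, 3-numeric code).
COUNTRY_ROWS = [
    ("Algeria", "L`Algérie", "DZ", "DZA", "012"),
    ("Belgium", "Belgique", "BE", "BEL", "056"),
    ("China", "Chine", "CN", "CHN", "156"),
    ("Denmark", "Danemark", "DK", "DNK", "208"),
    ("Egypt", "Egypte", "EG", "EGY", "818"),
    ("France", "La France", "FR", "FRA", "250"),
    ("Zimbabwe", "Zimbabwe", "ZW", "ZWE", "716"),
]


def build_lookup(rows):
    """Index every representation (case-insensitively) by its English name."""
    lookup = {}
    for row in rows:
        english_name = row[0]
        for representation in row:
            lookup[representation.casefold()] = english_name
    return lookup


def map_to_english_name(values, lookup):
    """Map each incoming value to the English name, or None if unregistered."""
    return [lookup.get(value.strip().casefold()) for value in values]


LOOKUP = build_lookup(COUNTRY_ROWS)
print(map_to_english_name(["Danemark", "DK", "DNK", "208"], LOOKUP))
```

Returning `None` for unregistered values keeps unmapped input visible instead of guessing a match.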
Ecoinformatics: Concept System Store
Concept systems:
Concept System
Thesaurus
Themes
Ontology
GEMET
Structured
Metadata
Data
Standards
}
Metadata Registry
Keywords
Controlled Vocabularies
Thesauri
Taxonomies
Ontologies
Axiomatized Ontologies
(Essentially graphs:
node-relation-node +
axioms)
75
Ecoinformatics: Management of
Concept Systems
Metadata Registry
Concept System
Thesaurus
Themes
Ontology
GEMET
Structured
Metadata
Data
Standards
Concept system:
}
Registration
Harmonization
Standardization
Acceptance (vetting)
Mapping
(correspondences)
76
Ecoinformatics: Life Cycle
Management
Metadata Registry
Concept System
Thesaurus
Themes
Ontology
GEMET
Structured
Metadata
Data
Standards
Life cycle
management:
Data and
Concept systems
(ontologies)
77
Ecoinformatics: Grounding
Semantics
Metadata
Registries
Metadata Registry
Concept System
Thesaurus
Themes
Ontology
GEMET
Structured
Metadata
Semantic Web
RDF Triples
Subject (node URI)
Verb (relation URI)
Object (node URI)
Ontologies
Data
Standards
78
XMDR Project Collaboration
• Collaborative, interagency effort
  ◦ EPA, USGS, NCI, Mayo Clinic, DOD, LBNL …& others
• Draws on and contributes to interagency/International Cooperation on Ecoinformatics
• Involves Ecoterm, international, national, state, local government agencies, other organizations as content providers and potential users
• Interacts with many organizations around the world through ISO/IEC standards committees
• Only loosely aligned with Ecoinformatics Cooperation
79
XMDR Project
• High risk R&D, sponsor expected likelihood of failure
• Targeted toward leading-edge semantics applications in a highly strategic environment
• Conceptualization of new capabilities, creation of designs (expressed as standards), development of a software architecture and prototype system for demonstrating capabilities and testing designs
  ◦ Reasoning, inference, linkage of concepts to data, …
• Demonstration of fundamental semantic management capabilities for metadata registries, understanding the potential applications that could be built in-house
80
Results to Date
• Completed the first version of designs for next generation metadata registries, expressed as figures in a UML model that is proposed for the next edition of the ISO/IEC 11179 standard
• Developed XMDR Prototype, available as open source software
• Content loaded in prototype: broad range of traditional metadata and concept systems
• Designs and prototype being explored and used in several locations. Potential for facilitating development and sharing of content by wide diversity of users.
• Starting the next version of designs, taking on more challenging content and capabilities
81
Status of Project
• NSF has funded a three-year project, providing a funding base
• Strong emphasis on the computer science R&D results and collaboration with EU and Asia
• Limited staffing
• Proposing further high risk R&D
• Developing proposals for collaborative efforts to demonstrate capabilities, especially in the area of water.
• Opportunity to collaborate with JRC and projects under the European Commission 7th Framework Program
82
Ecoinformatics Test Bed
• Proposed in Brussels in September 2004
  ◦ Project direction and statement developed
• Purpose
  ◦ Research and technical informatics to investigate metadata management techniques. Practical experiment for testing usability.
• Initial Focus
  ◦ Use metadata and semantic technologies for air quality (transportation) health effects
  ◦ Potential for extension to other areas
  ◦ Need for engaging ongoing operations and/or indicators
Bruce the unready
83
Ecoinformatics Test Bed
• Extend original charter to Water
  ◦ Use Water as example content
• Look for opportunities to coordinate with EU projects
  ◦ Metadata, concept systems
  ◦ WISE, EC 7th Framework program
• Identify and propose possible demonstrations
84