NLDB-04 - College of Engineering and Computer Science

On Embedding Machine-Processable Semantics into Documents
Krishnaprasad Thirunarayan
Department of Computer Science & Engineering
Wright State University
Dayton, OH-45435, USA
1
Talk Outline
Background and Motivation (Why?)
Goals (What?)
Details (How?)
Conclusions
2
Background and Motivation
3
Content Extraction:
Formalize doc, using controlled vocabulary
Heterogeneous Doc.
Spec. Defn. Rep.
4
Problems with this approach to
content extraction
Archiving spec (for human comprehension) separately from its formalization is not conducive to traceability.
Manual extraction from spec (from scratch) for each use is labor-intensive, time-consuming, and prone to typographical errors.
5
Observation
Conceptually, every piece of
information in an extraction owes its
existence to a phrase in spec, and
possibly, controlled vocabulary.
So, explore techniques to maintain
correspondence between a spec
fragment and its formalization.
6
Goal
7
General Problem
Embed domain-specific mark-up (annotations) into a human-comprehensible document
to make explicit the semantics of “content” text and complex data, and
to augment an interpretation in a modular fashion.

Document text: Human comprehensible
Semantic Mark-up: Machine processable
8
Details (How?)
9
Nature of Specs
Semi-structured
Heterogeneous
Text
 Tables
 Images

Constrained technical vocabulary
Available as MS Word document
10
Pre-processing Spec
Abstract content from spec document by removing display-oriented information
Save text
Save tabular data, preserving grid layout
Retain links to images
…

Note: “Save As text” option in MS Word is inadequate
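As a rough illustration of "save tabular data, preserving grid layout", the following Python sketch renders a table as plain text with padded columns. The function name `grid_to_text` and the in-memory list-of-rows input are assumptions made for this example; the talk does not say how the table cells are obtained from the MS Word document.

```python
def grid_to_text(rows):
    """Render a table as plain text, padding cells so columns stay aligned.

    Assumes every row has the same number of cells (hypothetical input format).
    """
    # Width of each column = length of its longest cell.
    widths = [max(len(str(cell)) for cell in column) for column in zip(*rows)]
    lines = []
    for row in rows:
        cells = [str(cell).ljust(width) for cell, width in zip(row, widths)]
        lines.append("  ".join(cells).rstrip())
    return "\n".join(lines)


table = [
    ["Thickness (mm)", "Tensile Strength (ksi)", "Yield Strength (ksi)"],
    ["0.50 and under", 165, 155],
    ["0.50 - 1.00", 160, 150],
]
text = grid_to_text(table)
print(text)
```

Each value starts in the same character column as its heading, so the grid survives in an ASCII file.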
11
Heterogeneous Document
12
XML generated by Majix
13
ASCII Output
14
Annotating Pre-processed Spec
Embedding Machine Processable Semantics

Recognizing and tagging text using controlled vocabulary
By-product of: Document Indexing and Semantic Search
Tagging tabular data to make explicit its semantics: same grid layout, but different interpretation and dependencies based on headings
Explore: XML-based programming language Water for defining data and its behavior (semantics)
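Recognizing and tagging controlled-vocabulary terms can be sketched in Python as below. This is a minimal sketch, not the tool used in the talk: the `VOCAB` mapping, the `<term concept="...">` element, and the function `tag_terms` are all hypothetical names chosen for illustration.

```python
import re

# Hypothetical controlled vocabulary: surface phrase -> concept identifier.
VOCAB = {
    "tensile strength": "strength.tensile",
    "yield strength": "strength.yield",
    "thickness": "thickness",
}


def tag_terms(text, vocab=VOCAB):
    """Wrap each vocabulary phrase in a <term> element, keeping the original wording."""
    # Try longer phrases first so multi-word terms win over their substrings.
    phrases = sorted(vocab, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(p) for p in phrases) + r")\b",
        re.IGNORECASE,
    )

    def wrap(match):
        concept = vocab[match.group(1).lower()]
        return f'<term concept="{concept}">{match.group(1)}</term>'

    return pattern.sub(wrap, text)


tagged = tag_terms("Yield strength depends on thickness.")
print(tagged)
```

The human-readable sentence is left intact, and the mark-up only adds the machine-processable concept names around it.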
15
Locating Controlled Vocabulary Terms
16
Example Table
Thickness (mm)    Tensile Strength (ksi)    Yield Strength (ksi)
0.50 and under    165                       155
0.50 – 1.00       160                       150
1.00 – 1.50       155                       145
17
Example of Tagged Table
Thickness (mm)    Tensile Strength (ksi)    Yield Strength (ksi)
table.<setHeading thickness strength.tensile strength.yield/>
0.50 and under    165    155
table.<addRow 0 0.50 165 155/>
0.50 - 1.00       160    150
table.<addRow 0.50 1.00 160 150/>
1.00 - 1.50       155    145
table.<addRow 1.00 1.50 155 145/>
...
18
Example of Processing Code
<defclass table rows=required=vector heading=optional=vector>
  <defmethod setHeading t=required ts=required ys=required>
    <set heading=<vector t ts ys/>/>
  </>
  <defmethod addRow smin smax ts ys>
    <set rows=
      table.rows.<insert <vector smin smax ts ys/>/>/>
  </>
  <defmethod computeYieldStrength>
    … </>
  <defmethod computeTensileStrength>
    … </>
  …
</>
19
(cont’d)
<defclass table rows=required=vector heading=optional=vector>
  …
  <defmethod computeTensileStrength>
    <set temp=fluid.Thickness/>
    <set i=0/>
    <do>
      <until <and temp.<less table.rows.<get i/>.1/>
                  temp.<more_or_equal table.rows.<get i/>.0/> /> >
        table.rows.<get i/>.2
      </until>
      <set i=i.<plus 1/>/>
    </do>
  </>
</>
20
(cont’d)
<defclass table rows=required=vector heading=optional=vector>
  …
</>
fluid.<set Thickness=0.60/>
<try
  <set TensileStrength=table.<computeTensileStrength/>/>
  TensileStrength
>
  "TABLE: out of range error occurred"
</try>
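For readers unfamiliar with Water, the lookup above can be approximated in Python as follows. This is a sketch under assumptions: the names `ROWS` and `lookup` are hypothetical, only the three rows shown on the slides are included, and the half-open interval (lower bound inclusive, upper bound exclusive) follows the `more_or_equal` and `less` comparisons in the Water code.

```python
# Rows mirror the addRow calls: (min thickness, max thickness, tensile ksi, yield ksi).
ROWS = [
    (0.0, 0.50, 165, 155),
    (0.50, 1.00, 160, 150),
    (1.00, 1.50, 155, 145),
]


def lookup(thickness, rows=ROWS):
    """Return (tensile, yield) for the row with min <= thickness < max."""
    for low, high, tensile, yield_strength in rows:
        if low <= thickness < high:
            return tensile, yield_strength
    # Same message as the Water <try> fallback.
    raise ValueError("TABLE: out of range error occurred")


tensile, yield_strength = lookup(0.60)
print(tensile, yield_strength)
```

A thickness outside every row raises an error instead of looping past the end of the table, which is the case the `<try>` block guards against.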
21
Water
XML-based OO Scripting Language
Facilitates creating Web Services

Run methods remotely via web-browser
Generalizes dynamic typing to
constraint checking

Conformance of actuals to formals
22
Pros and cons
Encoding Improvement

Amount of tagging can be controlled by suitably
delimiting table data and annotating it with
corresponding “string-processing” method
Master Copy Update

Changes to the spec require manual modification of the
archived annotated version.
Irregular Tables in Specs

Different units, etc.
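The "Encoding Improvement" idea, delimiting the table data once and attaching a single string-processing method instead of tagging every row, might look like this in Python. The `|` delimiter, the block layout, and the function `parse_strength_table` are assumptions for illustration only; the talk does not specify a concrete format.

```python
def parse_strength_table(block):
    """Turn one delimited text block into (min, max, tensile, yield) rows."""
    rows = []
    for line in block.strip().splitlines():
        # Each line holds four cells separated by "|" (hypothetical delimiter).
        low, high, tensile, yield_strength = (cell.strip() for cell in line.split("|"))
        rows.append((float(low), float(high), int(tensile), int(yield_strength)))
    return rows


BLOCK = """
0    | 0.50 | 165 | 155
0.50 | 1.00 | 160 | 150
1.00 | 1.50 | 155 | 145
"""
rows = parse_strength_table(BLOCK)
print(rows)
```

Only the block as a whole needs an annotation naming the method, so the amount of mark-up stays constant as the table grows. Irregular tables (for example, mixed units) would need a different method per layout.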
23
Some Related Work
Microsoft Smart Tags

Recognize “controlled” words in Office
2003 documents and associate
predefined list of actions with each
occurrence
SHOE

Table data in a declarative (logic)
language
24
Prolog rendition
strengthTableRow( 0, 0.50, 165, 155).
strengthTableRow(0.50, 1.00, 160, 150).
strengthTableRow(1.00, 1.50, 155, 145).
...
strengthTable(Thickness, TensileStrength, YieldStrength) :-
    strengthTableRow(L, U, TensileStrength, YieldStrength),
    L =< Thickness, U > Thickness.
thicknessToTensileStrength(Thickness, TensileStrength) :-
    strengthTable(Thickness, TensileStrength, _).
thicknessToYieldStrength(Thickness, YieldStrength) :-
    strengthTable(Thickness, _, YieldStrength).
?- thicknessToYieldStrength(0.6,YS).
25
Conclusions
26
A Step towards Holy Grail
Ultimately enable authoring and/or extracting the human-comprehensible
and machine-processable parts of a document “hand in hand”, and keep
them “side by side”.
27