On Embedding Machine-Processable Semantics into Documents

Krishnaprasad Thirunarayan
Department of Computer Science & Engineering
Wright State University
Dayton, OH-45435, USA

Talk Outline
  - Background and Motivation (Why?)
  - Goals (What?)
  - Details (How?)
  - Conclusions

Background and Motivation

Content Extraction: Formalize doc, using controlled vocabulary
  [Diagram labels: Heterogeneous Doc., Spec., Defn., Rep.]

Problems with this approach to content extraction
  - Archiving the spec (for human comprehension) separately from its formalization is not conducive to traceability.
  - Manual extraction from the spec (from scratch) for each use is labor intensive, time consuming, and prone to typographical errors.

Observation
  - Conceptually, every piece of information in an extraction owes its existence to a phrase in the spec and, possibly, the controlled vocabulary.
  - So, explore techniques to maintain the correspondence between a spec fragment and its formalization.

Goal

General Problem
  - Embed domain-specific mark-up (annotations) into a human-sensible document to make explicit the semantics of “content” text and complex data, and to augment an interpretation in a modular fashion.
  - Document text: human comprehensible
  - Semantic mark-up: machine processable

Details (How?)

Nature of Specs
  - Semi-structured
  - Heterogeneous
      - Text
      - Tables
      - Images
  - Constrained technical vocabulary
  - Available as MS Word document

Pre-processing Spec
  - Abstract content from the spec document by removing display-oriented information
      - Save text
      - Save tabular data, preserving grid layout
      - Retain links to images
      - …
  - Note: the “Save As text” option in MS Word is inadequate

Heterogeneous Document

XML generated by Majix

ASCII Output

Annotating Pre-processed Spec: Embedding Machine-Processable Semantics
  - Recognizing and tagging text using controlled vocabulary
      - By-product of: Document Indexing and Semantic Search
  - Tagging tabular data to make explicit its semantics
      - Same grid layout, but different interpretation and dependencies based on headings
      - Explore: XML-based programming language Water for defining data and its behavior (semantics)

Locating Controlled Vocabulary Terms

Example Table

  Thickness (mm)    Tensile Strength (ksi)    Yield Strength (ksi)
  0.50 and under    165                       155
  0.50 – 1.00       160                       150
  1.00 – 1.50       155                       145

Example of Tagged Table

  Thickness (mm)    Tensile Strength (ksi)    Yield Strength (ksi)
      table.<setHeading thickness strength.tensile strength.yield/>
  0.50 and under    165                       155
      table.<addRow 0 0.50 165 155/>
  0.50 – 1.00       160                       150
      table.<addRow 0.50 1.00 160 150/>
  1.00 – 1.50       155                       145
      table.<addRow 1.00 1.50 155 145/>
  ...

Example of Processing Code

  <defclass table rows=required=vector heading=optional=vector>
    <defmethod setHeading t=required ts=required ys=required>
      <set heading=<vector t ts ys/>/>
    </>
    <defmethod addRow smin smax ts ys>
      <set rows= table.rows.<insert <vector smin smax ts ys/>/>/>
    </>
    <defmethod computeYieldStrength> … </>
    <defmethod computeTensileStrength> … </>
    …
  </>

Example of Processing Code (cont’d)

  <defclass table rows=required=vector heading=optional=vector>
    …
    <defmethod computeTensileStrength>
      <set temp=fluid.Thickness/>
      <set i=0/>
      <do>
        <until <and temp.<less table.rows.<get i/>.1/>
                    temp.<more_or_equal table.rows.<get i/>.0/> /> >
          table.rows.<get i/>.2
        </until>
        <set i=i.<plus 1/>/>
      </do>
    </>
  </>

Example of Processing Code (cont’d)

  <defclass table rows=required=vector heading=optional=vector>
    …
  </>

  fluid.<set Thickness=0.60/>
  <try
    <set TensileStrength=table.<computeTensileStrength/>/>
    TensileStrength
  >
    "TABLE: out of range error occurred"
  </try>

Water
  - XML-based OO Scripting Language
  - Facilitates creating Web Services
      - Run methods remotely via web-browser
  - Generalizes dynamic typing to constraint checking
      - Conformance of actuals to formals

Pros and cons
  - Encoding Improvement
      - Amount of tagging can be controlled by suitably delimiting table data and annotating it with the corresponding “string-processing” method
  - Master Copy Update
      - Changes to the spec require manual modification of the archived annotated version.
  - Irregular Tables in Specs
      - Different units, etc.

Some Related Work
  - Microsoft Smart Tags
      - Recognize “controlled” words in Office 2003 documents and associate a predefined list of actions with each occurrence
  - SHOE
  - Table data in a declarative (logic) language

Prolog rendition

  strengthTableRow(0,    0.50, 165, 155).
  strengthTableRow(0.50, 1.00, 160, 150).
  strengthTableRow(1.00, 1.50, 155, 145).
  ...

  strengthTable(Thickness, TensileStrength, YieldStrength) :-
      strengthTableRow(L, U, TensileStrength, YieldStrength),
      L =< Thickness, U > Thickness.

  thicknessToTensileStrength(Thickness, TensileStrength) :-
      strengthTable(Thickness, TensileStrength, _).

  thicknessToYieldStrength(Thickness, YieldStrength) :-
      strengthTable(Thickness, _, YieldStrength).

  ?- thicknessToYieldStrength(0.6, YS).

Conclusions

A Step towards Holy Grail
  - Ultimately, enable authoring and/or extracting the human-comprehensible and machine-processable parts of a document “hand in hand”, and keep them “side by side”.
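
Supplementary sketch: the strength-table lookup from the Water and Prolog examples, restated in Python for readers unfamiliar with either language. This is only an illustration under stated assumptions, not part of the original talk: the names Row, STRENGTH_TABLE, lookup, tensile_strength, and yield_strength are hypothetical. As in the Water and Prolog versions, each row is treated as the half-open interval [min, max), so a thickness of exactly 0.50 falls in the second row even though the printed table reads "0.50 and under".

```python
from typing import NamedTuple


class Row(NamedTuple):
    """One table row; field names are illustrative, not from the talk."""
    t_min: float     # lower thickness bound in mm (inclusive)
    t_max: float     # upper thickness bound in mm (exclusive)
    tensile: float   # tensile strength in ksi
    yield_: float    # yield strength in ksi


# Row values copied from the example table in the slides.
STRENGTH_TABLE = [
    Row(0.0, 0.50, 165, 155),
    Row(0.50, 1.00, 160, 150),
    Row(1.00, 1.50, 155, 145),
]


def lookup(thickness: float) -> Row:
    """Return the row whose [t_min, t_max) interval contains thickness."""
    for row in STRENGTH_TABLE:
        if row.t_min <= thickness < row.t_max:
            return row
    # Same message as the Water <try> example uses for a failed lookup.
    raise ValueError("TABLE: out of range error occurred")


def tensile_strength(thickness: float) -> float:
    return lookup(thickness).tensile


def yield_strength(thickness: float) -> float:
    return lookup(thickness).yield_


if __name__ == "__main__":
    # Mirrors the Prolog query ?- thicknessToYieldStrength(0.6, YS).
    print(yield_strength(0.6))  # prints 150
```

Keeping the row data separate from the lookup logic follows the same split as the annotated table (addRow calls) and its processing methods: only the data rows need to change when the spec changes.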