Markup Languages

Yaakov J. Stein
Chief Scientist
RAD Data Communications

Stein Markup 1.1

What do I do?
I digest, edit and produce documents
– business letters
– email
– meeting summaries
– proposals
– reports
– requirement specifications
– project plans
– web pages
– research articles
– review articles
– books

What do others do?
Pretty much the same
– US corporations produce >100 billion documents per year
– 90% of a modern institution’s information is in documents
– >50% of a typical corporation’s effort involves documents
That’s why word processing software was expected to bring efficiency increases
But it didn’t!

Word processing?
PROs
– makes nicer looking documents
– expedites document sharing during creation
CONs
– typically 30% of effort goes to formatting and reformatting
– doesn’t increase information accessibility
– doesn’t facilitate information mining

Databases?
Databases are the natural alternative to documents
PROs
– increase information accessibility
– facilitate information mining
CONs
– not human readable
– format inflexible

The solution
What we really want is to write unconstrained text but to have information retrieval as well!
Method 1: Automatic text analysis
– AI program analyzes text
– recognizes document structure, sentence syntax
– performs gisting, facilitates information mining
– a complete solution is equivalent to passing the Turing test
Method 2: Manual markup
– document author is responsible for marking
– clarifies document structure
– enables automated retrieval of selected information
– suggests presentation format

Why is text analysis hard?
The man cried FIRE !
The man cried FIRE the gun !
The man cried FIRE the gun maker !

Are MLs computer languages?
There are many different types of computer languages:

procedural languages
    for (n = 0; n < 10; n++)
        if (n > 5) printf("markup languages are fun!\n");

graphic languages
    newpath 0 0 moveto 0 1 lineto 1 1 lineto 1 0 lineto closepath fill

database languages
    SELECT book FROM biblio WHERE subject='DSP' AND author='STEIN';

logical languages
    useful(dsp). useful(hardware). fun(dsp). fun(web).
    interesting(X) :- useful(X), fun(X).
    ?- interesting(X).

They are!
Markup languages do not instruct computers directly, as procedural languages do;
rather, they instruct them indirectly, as logical languages do
They do this by using:
– elements (tags)
– attributes
– entities
– text

    <BOOK SUBJECT="dsp">
    <TITLE FORMAT="short">DSP-CSP</TITLE>
    <AUTHOR>J. Stein</AUTHOR>
    This is a great book!
    &standard-disclaimer
    </BOOK>

Some markup element functions
Structural
– Clarifies document structure
– Delineates document parts
Descriptive (informative)
– Indicates
– Facilitates information retrieval
Presentational (display)
– Presents information in nice format
– Helps human readability
Referential (links, applications)
– Provide hypertext links
– Launch applications

Structural Markup
    <HEADING>September 1, 2000</HEADING>
    <GREETING>Dear Prof. Stein, </GREETING>
    <BODY>
    I would like to tell you how much I enjoyed reading your new text
    “Digital Signal Processing, A Computer Science Perspective”.
    I hope we will be able to meet at the next conference.
    </BODY>
    <SIGNATURE>
    Sincerely,
    Dee Espy
    </SIGNATURE>

Descriptive Markup
    <DATE>September 1, 2000</DATE>
    Dear <PERSON>Prof. Stein,</PERSON>
    I would like to tell you how much I enjoyed reading your new text
    <BOOK>
    “Digital Signal Processing, A Computer Science Perspective”.
    </BOOK>
    I hope we will be able to meet at the next <EVENT>conference.</EVENT>
    Sincerely,
    <PERSON>Dee Espy</PERSON>

Presentational Markup
    <RIGHT-JUSTIFY>September 1, 2000</RIGHT-JUSTIFY>
    <BOLD>Dear Prof.
    Stein,</BOLD>
    I would like to tell you how much I enjoyed reading your new text
    <UNDERLINE>
    “Digital Signal Processing, A Computer Science Perspective”.
    </UNDERLINE>
    I hope we will be able to meet at the next <BLINK>conference.</BLINK>
    Sincerely,
    <IMAGE SRC="deesignature.jpg" ALIGN="left">
    <FONT FACE="Times-Roman">Dee Espy</FONT>

Relational Markup
    <today xlink:form="simple" href="date" actuate="auto">
    Dear Prof. Stein,
    I would like to tell you how much I enjoyed reading your new text
    <A HREF="www.amazon.com/exec/obidos/ASIN/04712954">
    “Digital Signal Processing, A Computer Science Perspective”.
    </A>
    I hope we will be able to meet at the next <A HREF="conference">conference.</A>
    Sincerely,
    <IMAGE SRC="dee-signature.jpg" ALIGN="left">
    <A HREF="mailto:dee@dee-epsy.net">Dee Espy</A>

Generalized Markup Language
William Tunnicliffe, Stanley Rice [1960s]
– (independently) invent idea of structural markup language
Problem: need different ML for each type of document (letter, report, article, book, etc.)
Charles Goldfarb, Edward Mosher, Raymond Lorie (IBM) [1973]
– invent Generalized Markup Language (GML)
Solution: use metalanguage
– Document Type Definition (DTD) defines tags
IBM marked up 90% of its documents with GML

With GML structure is evident
  Library
    Novels, Journals, Textbooks
      Algebraic zoology, Botanical history, Computer poetry, DSP, DSP-CSP, DSP just for fun, Elementary QED
        Title
          Full: Digital Signal Processing a Computer Science Perspective
          Short: DSPCSP
        Author
          Name: Jonathan (Y) Stein
          Association: RAD Data Comm.
        Publication
          Publisher: John Wiley
          Year: 2000
          Location: New York
          ISBN: 04712954

Standard Generalized Markup Language
Problems with GML:
– No validating parser
– Not portable (between computer systems)
Solution: SGML
– ANSI [1978]
– ISO/IEC 8879 [1986] (Intl Org for Standardization / Intl Electrotechnical Commission)
– JTC1/SC34/WG1 (WG 1 of SubCommittee 34 of Joint Technical Committee 1)
For presentation: Document Style Semantics and Specification Language (DSSSL)

SGML - cont.
If SGML is so good, why doesn’t anyone use it?
Complexity
– base standard >500 pages
– SGML is a metalanguage
– writing DTD is complex programming
– marked up text is hard to read
– DSSSL adds to complexity
Inflexibility - requires absolute conformity
– assumes only one correct way to mark up
– constrains author to dictated structure
– not good at capturing author’s structure

HyperText Markup Language
CERN (particle physics institute in Switzerland) was an early Internet adopter
– used extensively for collaboration (articles have long author lists)
– major problems with format incompatibility: only straight ASCII worked reliably
Tim Berners-Lee (computer specialist) defined requirements
– simplicity (couldn’t expect physicists to use SGML)
– freedom (didn’t need validation, let browser ignore bad markup)
– hypertext links (including to documents over Internet)
– presentational markup (papers must look nice; authors were used to TeX)
Solution: HTML - a specific application of SGML (not a metalanguage)

HTML versions
HTML 1.0 (1989) Berners-Lee original CERN version
– hypertext, images, head+body structure, presentational markup
HTML 2.0 (1994) IETF standard - RFC 1866
– added lists, forms, etc.
HTML 3.2 (1997) W3C recommendation (incorporates Netscape extensions)
– added tables, applets, super/sub-scripts
HTML 4.0 (1997) W3C recommendation (and similar ISO/IEC 15445)
– minimizes presentational markup
XHTML 1.0 (2000) present W3C recommendation
– reformulates HTML in XML

HTML document structure
    <HTML>
    <HEAD>
    global definitions such as
    <TITLE>Web page title</TITLE>
    </HEAD>
    <BODY>
    marked-up text
    </BODY>
    </HTML>

Some HTML (body) elements
    Markup                                            Rendered as
    <H1>Level 1 Heading</H1>                          Level 1 Heading
    <H2>Level 2 Heading</H2>                          Level 2 Heading
    <H3>Level 3 Heading</H3>                          Level 3 Heading
    <EM> emphasized </EM>                             emphasized
    <P> Paragraph </P>                                Paragraph
    <A HREF=url>link</A>                              link
    <UL> <LI> item 1 </LI> <LI> item 2 </LI> </UL>    • item 1   • item 2
    <OL> <LI> item 1 </LI> <LI> item 2 </LI> </OL>    1. item 1  2. item 2
    <IMG SRC=url>

Problems with HTML
Presentational aspects have predominated
    <B> bold text </B>
    <BLINK> blinking text </BLINK>
    <FONT COLOR="red"> red text </FONT>
Practically no descriptive markup
– search engines are reduced to flat text search
– search by topic only through keywords or portals
Not extensible
– can’t add new tags
– unknown tags are ignored
Links are relatively simple
– usually user action is required (except IMG)
– only full document (with offset) linkable
– link management is a logistical nightmare

Not everything is HTML
Due to HTML limitations other tools are also used:
Multimedia extensions
– (dynamic) gif, jpg, …
– streaming audio
Common Gateway Interface
– generate HTML on-the-fly
– Perl, C, …
Server Push - Server Pull
Javascript
Java

eXtensible Markup Language
Simplified (best parts of) SGML (subset of features)
Flexible content management tool
W3C recommendation(s)
Extensible - can add new elements (even without DTD)
Easy to create special purpose languages (with DTD/SCHEMA)
Includes HTML-like hypertext links
– and extensions (XLINK, XPOINTER)
The future of the web!
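
The "can add new elements (even without DTD)" point can be illustrated with a minimal Python sketch, assuming the standard-library xml.etree.ElementTree parser. The <review> element and its stars attribute are invented for this illustration and are not declared anywhere; a non-validating parser still reads them.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: <review> and its "stars" attribute are made up for
# this sketch. No DTD or schema declares them.
DOC = """<?xml version="1.0"?>
<bibliography>
  <book>
    <title>DSP-CSP</title>
    <review stars="5">This is a great book!</review>
  </book>
</bibliography>"""


def collect_reviews(xml_text):
    """Return (title, stars, review text) for every reviewed book."""
    root = ET.fromstring(xml_text)
    found = []
    for book in root.iter("book"):
        title = book.findtext("title")
        for review in book.findall("review"):
            found.append((title, int(review.get("stars")), review.text))
    return found


reviews = collect_reviews(DOC)
for title, stars, text in reviews:
    print(f"{title}: {stars} stars, {text}")
```

The parser here only requires the document to be well-formed; checking it against a DTD or schema would be a separate, optional step.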

XML - an Example
    <?xml version="1.0" standalone="yes"?>
    <bibliography>
      <book isbn="04712954">
        <title>Digital Signal Processing: a Computer Science Perspective</title>
        <author>Jonathan (Y) Stein</author>
        <publisher>John Wiley and Sons</publisher>
      </book>
      <article>
        <title>False Alarm Reduction for ASR and OCR</title>
        <author>Yaakov Stein</author>
        <proceedings>Tenth AICVNN Symposium</proceedings>
        <pages>195-200</pages>
      </article>
      ...
    </bibliography>

What can we do with an XML file?
Check if well-formed
Check if valid (against DTD or schema)
Display “as-is” in browser
Parse in special-purpose program (SAX, DOM)
Process (XSL) to XML, HTML, etc.
Display after processing

Wireless Markup Language
Markup language element of Wireless Application Protocol
WAP forum (1997)
– Ericsson, Motorola, Nokia, Unwired Planet (phone.com)
– bring Internet to cellular phone users
– re-use fundamental Internet concepts (TCP/IP, http, html, javascript) but adapted to
    lower bandwidth
    smaller screen
    limited input facilities
    limited computational resources
– applications scale across transport options (GSM, TDMA, CDMA, 3G) and device types (mobile phones, personal assistants)

WML Philosophy
Defined using XML
Transported in compressed binary (for BW reduction)
Applications are modeled as decks of cards
Features:
– Actions (OK, navigation, help) can be performed
– Hyperlinks (like in HTML)
– String variables
– Timers
– wbmp images (B&W)
– Select boxes, forms (for input)
– wmlscript (like javascript)

WML structure
    <?xml version="1.0"?>
    <!DOCTYPE wml …>
    <wml>
    <card>
    <p> text </p>
    <p> text </p>
    </card>
    <card>
    ...
    </card>
    </wml>

Some WML elements
    <p> </p>                                 text
    <a href=...> </a>                        hyperlink (anchor)
    <do> </do>                               action
    <go href=.../>                           goto wml page
    <timer>                                  trigger event (units = tenths of a second)
    <input/>                                 input user text
    <prev/>                                  return to previous page
    $(…)                                     value of variable
    <img src=… />                            display image
    <postfield name=… value=…/>              set variable
    <select> <option> <option> </select>     select box

Some more markup languages
VML = Vector (graphics) Markup Language
VoiceXML
SSML = Speech Synthesis Markup Language
CPML = Call Policy Markup Language
DSML = Directory Services Markup Language
MathML = Mathematical Markup Language
CML = Chemical Markup Language
AML = Astronomical Markup Language
LegalXML
BSML = Bioinformatic Sequence Markup Language
GedML = Genealogical Data Markup Language
FinXML = Financial market Markup Language
ChessML
SDML = Signed Document Markup Language
RELML = Real Estate Listing Markup Language
etc.

Examples
HTML
– html examples
XML
– xml-file xsl-file xml
VML
– vml-file
WML (get M3gate emulator)
– wml examples
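
As one more example, the "Check if well-formed" step from "What can we do with an XML file?" can be sketched in Python. This is a minimal sketch assuming the standard-library xml.etree.ElementTree parser; the helper name is_well_formed and the three sample strings are made up for illustration. It checks well-formedness only, not validity against a DTD or schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Illustrative inputs (hypothetical, modeled on the bibliography example).
GOOD = '<book isbn="04712954"><title>DSP-CSP</title></book>'
UNQUOTED = '<book isbn=04712954><title>DSP-CSP</title></book>'  # HTML habit, not XML
UNCLOSED = '<book isbn="04712954"><title>DSP-CSP</book>'        # <title> never closed

for label, text in [("good", GOOD), ("unquoted", UNQUOTED), ("unclosed", UNCLOSED)]:
    print(label, is_well_formed(text))
```

Unlike an HTML browser, which ignores bad markup, an XML parser is required to reject a document that is not well-formed.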