Improve the way you create, manage and distribute information
INNOVATION
INSPIRATION
Bob DuCharme
XML 2008
December 9, 2008 www.innodata-isogen.com
We help companies lower the cost of creating and managing information.
• Solutions Architect, Innodata Isogen
• weblog: http://www.snee.com/bobdc.blog
• other writing: see http://www.snee.com/bob
• URLs referenced today: http://www.snee.com/xml/xml2008
[Diagram: Inputs 1-5, Processes A-F, an Editorial Master (XML), and Outputs 1-3]
• You’ve “inherited” some content
• Convert to your current editorial format
• Convert it to new output formats
• Efficient development of efficient conversion routines
• colors.txt (one color per line): red green blue green blue blue red
$ sort colors.txt
blue blue blue green green red red
$ sort colors.txt | uniq -c
3 blue
2 green
2 red
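The same tally can be sketched outside the shell. The following is a minimal Python illustration, not part of the original talk; the function name `uniq_c` and the in-memory `colors` list are assumptions made for the example.

```python
from collections import Counter

def uniq_c(lines):
    """Tally identical lines, like `sort | uniq -c`.

    Returns (count, value) pairs ordered by value, ignoring blank lines.
    """
    counts = Counter(line.strip() for line in lines if line.strip())
    return [(n, value) for value, n in sorted(counts.items())]

# Hypothetical stand-in for the contents of colors.txt, one color per line.
colors = ["red", "green", "blue", "green", "blue", "blue", "red"]

for n, value in uniq_c(colors):
    # Right-align the count, roughly as uniq -c does.
    print(f"{n:7d} {value}")
```

The shell pipeline stays the quicker choice at the command line; a version like this is mainly useful when the counts feed a later processing step.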
From http://www.thaiopensource.com/relaxng/trang.html:
Trang converts between different schema languages for XML. It supports the following languages:
• RELAX NG (XML syntax)
• RELAX NG compact syntax
• XML 1.0 DTDs
• W3C XML Schema
A schema written in any of the supported schema languages can be converted into any of the other supported schema languages, except that W3C XML Schema is supported for output only, not for input.
Trang can also infer a schema from one or more example XML documents.
<?xml version="1.0" encoding="UTF-8"?>
<whatever>
<somedoc>Here is one document</somedoc>
<somedoc>Here is another</somedoc>
<somedoc>Here is another</somedoc>
<somedoc>Here is another</somedoc>
</whatever>
• Elsevier article DTD: trang art510.dtd art510.rng
• Combined sample content: trang issueContents.xml issueContents.rng
• Compare results: saxon art510.rng compareElsRNG.xsl | sort > compareElsRNG.out
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:r="http://relaxng.org/ns/structure/1.0">
<xsl:strip-space elements="*"/>
<xsl:output method="text"/>
<xsl:variable name="schema" select="document('issueContents.rng')"/>
<xsl:template match="text()"/>
<xsl:template match="r:element">
<xsl:variable name="name" select="@name"/>
<xsl:choose>
<xsl:when test="$schema/r:grammar//r:element/@name[. = $name]">
Yes: <xsl:value-of select="$name"/>
</xsl:when>
<xsl:otherwise>
No: <xsl:value-of select="$name"/>
</xsl:otherwise>
</xsl:choose>
<xsl:apply-templates/>
</xsl:template>
</xsl:stylesheet>
No: tb:colspec
No: tb:left-border
No: tb:right-border
No: tb:top-border
Yes: aid
Yes: article
Yes: body
Yes: ce:abstract
Yes: ce:abstract-sec
Yes: ce:acknowledgment
Yes: ce:affiliation
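The comparison can also be sketched in Python for readers without an XSLT processor at hand. This is an illustration under stated assumptions, not the stylesheet from the talk: the two tiny schemas are invented for the example, and the sketch reports each element name once, whereas the stylesheet emits a line for every r:element it visits.

```python
import xml.etree.ElementTree as ET

# Namespace of RELAX NG's XML syntax, as declared in the stylesheet.
RNG = "{http://relaxng.org/ns/structure/1.0}"

def element_names(rng_text):
    """Return the set of names declared by rng:element in a schema."""
    root = ET.fromstring(rng_text)
    return {el.get("name") for el in root.iter(RNG + "element") if el.get("name")}

def compare(dtd_rng, sample_rng):
    """Flag each element declared in dtd_rng as used ('Yes') or unused ('No')
    in the schema inferred from sample content; sorted like the `sort` output."""
    used = element_names(sample_rng)
    return sorted(("Yes" if n in used else "No", n) for n in element_names(dtd_rng))

# Invented schemas: the "DTD" declares an aside element the samples never use.
DTD_RNG = """<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start><element name="article"><ref name="body"/></element></start>
  <define name="body"><element name="body">
    <optional><element name="aside"><text/></element></optional><text/>
  </element></define>
</grammar>"""
SAMPLE_RNG = """<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start><element name="article"><element name="body"><text/></element></element></start>
</grammar>"""

for flag, name in compare(DTD_RNG, SAMPLE_RNG):
    print(f"{flag}: {name}")
```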
• Or SGML, after using James Clark’s sx: sx -f err.out -x lower myfile.sgm > myfile.xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:strip-space elements="*"/>
<xsl:output method="text"/>
<xsl:template match="text()"/>
<xsl:template match="*">
<xsl:value-of select="name()"/>
<xsl:text>
</xsl:text>
<xsl:apply-templates/>
</xsl:template>
</xsl:stylesheet>
saxon issueContents.xml countElements.xsl | sort | uniq -c | sort
Start of list:
1 ce:chem
1 ce:displayed-quote
1 ce:inline-figure
1 ce:nomenclature
1 ce:textbox
1 ce:textbox-body
1 ce:underline
1 ce:vsp
1 doc
1 sb:e-host
2 small-caps
3 display
3 formula
End of list:
5726 ce:cross-ref
6916 entry
7225 mml:mo
7760 sb:maintitle
7760 sb:title
7929 ce:label
8458 ce:hsp
9326 mml:mi
10331 mml:mrow
12438 ce:italic
16453 sb:author
17082 ce:given-name
17095 ce:surname
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:strip-space elements="*"/>
<xsl:output method="text"/>
<xsl:template match="text()"/>
<xsl:template match="*">
<xsl:value-of select="name(..)"/>/<xsl:value-of select="name()"/>
<xsl:text>
</xsl:text>
<xsl:apply-templates/>
</xsl:template>
</xsl:stylesheet>
1 ce:displayed-quote/ce:simple-para
59 ce:biography/ce:simple-para
107 ce:legend/ce:simple-para
115 ce:abstract-sec/ce:simple-para
859 ce:caption/ce:simple-para
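For comparison, here is a rough Python sketch that produces both the element tally and the parent/child tally in one pass. It is an assumption-laden illustration rather than the talk's method: the sample document is invented, and ElementTree reports namespaced names as {uri}local rather than with the prefixes that XSLT's name() returns.

```python
from collections import Counter
import xml.etree.ElementTree as ET

def tally(xml_text):
    """Count element names and parent/child name pairs in a document."""
    names, pairs = Counter(), Counter()

    def walk(el, parent_tag):
        names[el.tag] += 1
        # The document element has no parent element, so its pair is "/name",
        # matching what name(..) yields for it in the stylesheet.
        pairs[f"{parent_tag}/{el.tag}"] += 1
        for child in el:
            walk(child, el.tag)

    walk(ET.fromstring(xml_text), "")
    return names, pairs

# Invented sample content, without namespaces for brevity.
SAMPLE = (
    "<doc><caption><simple-para>a</simple-para></caption>"
    "<legend><simple-para>b</simple-para><simple-para>c</simple-para></legend></doc>"
)

names, pairs = tally(SAMPLE)
# Least frequent first, like the final `sort` in the pipeline.
for pair, n in sorted(pairs.items(), key=lambda kv: (kv[1], kv[0])):
    print(f"{n:7d} {pair}")
```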
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:strip-space elements="*"/>
<xsl:output method="text"/>
<xsl:template match="text()"/>
<xsl:template match="@*">
<xsl:value-of select="name(..)"/>
<xsl:text>/@</xsl:text>
<xsl:value-of select="name()"/>
<xsl:text>
</xsl:text>
</xsl:template>
<xsl:template match="*">
<xsl:apply-templates select="*|@*"/>
</xsl:template>
</xsl:stylesheet>
1 ce:textbox/@id
28 ce:enunciation/@id
44 ce:table-footnote/@id
50 ce:biography/@id
79 ce:footnote/@id
104 ce:correspondence/@id
142 ce:table/@id
175 ce:affiliation/@id
180 ce:formula/@id
182 ce:section/@id
713 ce:figure/@id
4224 ce:bib-reference/@id
<xsl:stylesheet version="1.0" xmlns:ce="http://www.elsevier.com/xml/common/dtd" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/">
Yes: <!-- finds 180 -->
<xsl:value-of select="count(//ce:formula[@id])"/>
No: <!-- finds 208 -->
<xsl:value-of select="count(//ce:formula[not(@id)])"/>
</xsl:template>
</xsl:stylesheet>
<xsl:stylesheet version="1.0" xmlns:ce="http://www.elsevier.com/xml/common/dtd" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="*">
<xsl:apply-templates select="*|@*"/>
</xsl:template>
<xsl:template match="text()|@*"/>
<xsl:template match="ce:link/@locator">
<xsl:value-of select="."/><xsl:text>
</xsl:text>
</xsl:template>
</xsl:stylesheet>
xsltproc OneAttvalue.xsl issueContents.xml | sort | uniq -c | sort
• Output ending like this:
10 gr12
11 gr11
14 gr10
17 fx1
17 fx2
18 gr9
24 gr8
37 gr7
55 gr6
67 gr5
91 gr4
99 gr3
103 gr1
103 gr2
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="text()"/>
<xsl:template match="comment()">
<xsl:copy/>
</xsl:template>
</xsl:stylesheet>
Output just the processing instructions in a document
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml"/>
<xsl:template match="text()"/>
<xsl:template match="processing-instruction()">
<xsl:copy/>
</xsl:template>
</xsl:stylesheet>
• Go through rng schema
• For each element, output: dtdname.dtd \t elementName
• For each attribute, output: dtdname.dtd \t elementName \t attributeName
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:r="http://relaxng.org/ns/structure/1.0" version="1.0">
<xsl:param name="dtdname">no dtdname parameter supplied</xsl:param>
<xsl:strip-space elements="*"/>
<xsl:output method="text"/>
<xsl:template match="r:files|r:attribute|r:value"/>
<xsl:template match="r:element">
<xsl:variable name="elName" select="@name"/>
<xsl:value-of select="$dtdname"/>
<xsl:text>	</xsl:text>
<xsl:value-of select="@name"/>
<xsl:text>&#10;</xsl:text>
<xsl:for-each select="r:attribute | r:optional/r:attribute">
<xsl:value-of select="$dtdname"/>
<xsl:text>	</xsl:text>
<xsl:value-of select="$elName"/>
<xsl:text>	</xsl:text>
<xsl:value-of select="@name"/>
<xsl:text>&#10;</xsl:text>
</xsl:for-each>
<xsl:apply-templates/>
</xsl:template>
</xsl:stylesheet>
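The same inventory can be sketched in Python. This is a hedged illustration, not the talk's stylesheet: the sample schema is invented, the function name `inventory` is hypothetical, and attributes are listed with required ones first and optional ones second rather than in strict document order.

```python
import xml.etree.ElementTree as ET

# Namespace of RELAX NG's XML syntax.
RNG = "{http://relaxng.org/ns/structure/1.0}"

def inventory(rng_text, dtdname):
    """Build one row per element and one row per attribute of that element."""
    rows = []
    root = ET.fromstring(rng_text)
    for el in root.iter(RNG + "element"):
        name = el.get("name")
        rows.append((dtdname, name))
        # Attributes declared directly on the element, then those wrapped in optional.
        attrs = el.findall(RNG + "attribute") + el.findall(f"{RNG}optional/{RNG}attribute")
        for att in attrs:
            rows.append((dtdname, name, att.get("name")))
    return rows

# Invented schema standing in for the output of trang.
SAMPLE_RNG = """<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <element name="article">
      <attribute name="id"/>
      <optional><attribute name="lang"/></optional>
      <element name="title"><text/></element>
    </element>
  </start>
</grammar>"""

for row in inventory(SAMPLE_RNG, "art510.dtd"):
    # Tab-separated, so the result loads directly into a spreadsheet.
    print("\t".join(row))
```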
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:r="http://relaxng.org/ns/structure/1.0">
<xsl:output indent="yes"/>
<xsl:template match="r:element/r:ref | r:optional/r:ref">
<xsl:variable name="referent" select="@name"/>
<xsl:apply-templates select="//r:define[@name = $referent]" mode="copying"/>
</xsl:template>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="r:define" mode="copying">
<xsl:apply-templates select="node()"/>
</xsl:template>
</xsl:stylesheet>
• Why analyze an SGML DTD? For example, when migrating away from it
• RNG and W3C XSD schemas are both XML, but SGML DTDs are not
• Using Earl Hood’s perlSGML DTD analysis tools
1. Run Earl Hood’s dtd2html utility
2. Run tagsoup or HTML Tidy on output files
3. Now you’ve got XML where you can pull out element information with XSLT
1. Tweak dtd2html to add <div class="whatever"></div> elements
2. Run Earl Hood’s dtd2html utility
3. Run tagsoup or HTML Tidy on output files
4. Now you’ve got XML where you can pull out element information with XSLT
• This is not an integrated report generator. It’s Legos.
• Pipelining data between existing tools, re-usable scripts, and quick hacks.
• Document your command lines, e.g. saxon temp1.xml temp3.xsl > temp1a.xml
• Clients like reports, especially in spreadsheets.
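One way to get the tallies into a spreadsheet is to turn the `uniq -c` output into CSV. The sketch below is an assumption for illustration only; the function name `counts_to_csv` and the column headings are hypothetical, and the sample lines are taken from the attribute counts shown earlier.

```python
import csv
import io

def counts_to_csv(uniq_output):
    """Convert `uniq -c` style lines ("  count value") into CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["count", "name"])
    for line in uniq_output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split the leading count from the rest of the line.
        count, name = line.split(None, 1)
        writer.writerow([int(count), name.strip()])
    return buf.getvalue()

sample = (
    "      1 ce:textbox/@id\n"
    "     28 ce:enunciation/@id\n"
    "   4224 ce:bib-reference/@id\n"
)
print(counts_to_csv(sample), end="")
```

Storing the count as the first column keeps the report sortable by frequency once it is opened in a spreadsheet.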
• Referenced resources: http://www.snee.com/xml/xml2008