Automating Content Analysis with Trang and Simple XSLT Scripts

Bob DuCharme

XML 2008

December 9, 2008 www.innodata-isogen.com

What We Do

We help companies lower the cost of creating and managing information.

About me

• Solutions Architect, Innodata Isogen

• weblog: http://www.snee.com/bobdc.blog

• other writing: see http://www.snee.com/bob

• URLs referenced today: http://www.snee.com/xml/xml2008

Single source publishing and “editorial” XML

[Diagram: Inputs 1-5, Processes A-F, an Editorial Master (XML), and Outputs 1-3]

Content analysis: why?

• You’ve “inherited” some content

• Convert to your current editorial format

• Convert it to new output formats

• Efficient development of efficient conversion routines

Handy tool 1 before we get to the XML parts: sort

• colors.txt, one color per line: red, green, blue, green, blue, blue, red

$ sort colors.txt

blue
blue
blue
green
green
red
red

Handy tool 2 before we get to the XML parts: uniq

sort colors.txt | uniq -c

3 blue

2 green

2 red
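The later slides end their pipelines with a second sort to order the counts. A minimal sketch of that full pattern on the colors.txt sample, assuming a POSIX shell. The scratch directory, the awk step, and the numeric `sort -n` in place of a plain final sort are choices made for this sketch, not something the slides specify:

```shell
# Recreate the colors.txt sample, one color per line, in a scratch
# directory (the directory is only for this sketch).
workdir=$(mktemp -d)
printf '%s\n' red green blue green blue blue red > "$workdir/colors.txt"

# Count each distinct line, then order by count, rarest first.
# sort -n compares the counts as numbers rather than as text, and
# awk reduces uniq's padded output to "count value".
counts=$(sort "$workdir/colors.txt" | uniq -c | sort -n | awk '{print $1, $2}')
printf '%s\n' "$counts"
```

A plain final sort gives the same order only as long as uniq pads its counts to a common width, so the numeric sort is the safer habit for large counts.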

Sample data

trang

From http://www.thaiopensource.com/relaxng/trang.html:

Trang converts between different schema languages for XML. It supports the following languages:

• RELAX NG (XML syntax)

• RELAX NG compact syntax

• XML 1.0 DTDs

• W3C XML Schema

A schema written in any of the supported schema languages can be converted into any of the other supported schema languages, except that W3C XML Schema is supported for output only, not for input.

Trang can also infer a schema from one or more example XML documents.

trang

Trang can also infer a schema from one or more example XML documents!!!!!

Analyzing content with trang

<?xml version="1.0" encoding="UTF-8" ?>
<whatever>
  <somedoc>Here is one document</somedoc>
  <somedoc>Here is another</somedoc>
  <somedoc>Here is another</somedoc>
  <somedoc>Here is another</somedoc>
</whatever>
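Building that combined file by hand gets tedious with many documents. A rough sketch of one way to script it, assuming each input keeps its XML declaration on a line of its own and has no DOCTYPE declaration; the file names doc1.xml, doc2.xml, and combined.xml are hypothetical:

```shell
# Hypothetical sample inputs; in practice these are the inherited files.
workdir=$(mktemp -d)
printf '<?xml version="1.0" encoding="UTF-8"?>\n<somedoc>Here is one document</somedoc>\n' > "$workdir/doc1.xml"
printf '<?xml version="1.0" encoding="UTF-8"?>\n<somedoc>Here is another</somedoc>\n' > "$workdir/doc2.xml"

# Write a single declaration and the wrapper start tag, then every
# document with its own declaration line removed, then the end tag.
{
  printf '<?xml version="1.0" encoding="UTF-8"?>\n<whatever>\n'
  for f in "$workdir"/doc*.xml; do
    sed -e '/^<?xml/d' "$f"
  done
  printf '</whatever>\n'
} > "$workdir/combined.xml"

cat "$workdir/combined.xml"
```

The result can then be handed to trang in the same way as issueContents.xml on the next slide.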

Create RELAX NG versions of …

• Elsevier article DTD: trang art510.dtd art510.rng

• Combined sample content: trang issueContents.xml issueContents.rng

• Compare results: saxon art510.rng compareElsRNG.xsl | sort > compareElsRNG.out

compareElsRNG.xsl (1 of 2)

<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:r="http://relaxng.org/ns/structure/1.0">

  <xsl:strip-space elements="*"/>
  <xsl:output method="text"/>

  <!-- Schema inferred from the sample content, used for the comparison -->
  <xsl:variable name="schema" select="document('issueContents.rng')"/>

  <!-- Suppress text nodes -->
  <xsl:template match="text()"/>

compareElsRNG.xsl (2 of 2)

  <!-- For each element in the input schema, report whether the schema
       inferred from the content also has an element of that name -->
  <xsl:template match="r:element">
    <xsl:variable name="name" select="@name"/>
    <xsl:choose>
      <xsl:when test="$schema/r:grammar//r:element/@name[. = $name]">
        Yes: <xsl:value-of select="$name"/>
      </xsl:when>
      <xsl:otherwise>
        No: <xsl:value-of select="$name"/>
      </xsl:otherwise>
    </xsl:choose>
    <xsl:apply-templates/>
  </xsl:template>

</xsl:stylesheet>

compareElsRNG.xsl: some sample output

No: tb:colspec

No: tb:left-border

No: tb:right-border

No: tb:top-border

Yes: aid

Yes: article

Yes: body

Yes: ce:abstract

Yes: ce:abstract-sec

Yes: ce:acknowledgment

Yes: ce:affiliation
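The same sort and uniq tools can summarize compareElsRNG.out. A rough sketch, assuming each line starts with Yes: or No: exactly as in the sample output above; the file written here holds only a few lines copied from that sample, and the scratch directory is an assumption of the sketch:

```shell
workdir=$(mktemp -d)
# A few lines copied from the sample output.
cat > "$workdir/compareElsRNG.out" <<'EOF'
No: tb:colspec
No: tb:left-border
No: tb:right-border
No: tb:top-border
Yes: aid
Yes: article
Yes: body
EOF

# Tally how many declared elements the content uses (Yes) or never uses (No).
summary=$(cut -d: -f1 "$workdir/compareElsRNG.out" | sort | uniq -c | awk '{print $2 "=" $1}')
printf '%s\n' "$summary"

# List only the unused element names, dropping the "No: " prefix.
unused=$(sed -n 's/^No: //p' "$workdir/compareElsRNG.out")
printf '%s\n' "$unused"
```

The list of unused names is a starting point for deciding which parts of the DTD a conversion routine may not need to handle.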

Analyzing the XML itself

• Or SGML, after using James Clark’s sx: sx -f err.out -x lower myfile.sgm > myfile.xml

Counting elements: countElements.xsl

<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:strip-space elements="*"/>
  <xsl:output method="text"/>

  <xsl:template match="text()"/>

  <!-- Output each element's name on its own line, then recurse -->
  <xsl:template match="*">
    <xsl:value-of select="name()"/>
    <xsl:text>&#10;</xsl:text>
    <xsl:apply-templates/>
  </xsl:template>

</xsl:stylesheet>

Using countElements.xsl to count elements

saxon issueContents.xml countElements.xsl | sort | uniq -c | sort

Result of counting elements

Start of list:

1 ce:chem

1 ce:displayed-quote

1 ce:inline-figure

1 ce:nomenclature

1 ce:textbox

1 ce:textbox-body

1 ce:underline

1 ce:vsp

1 doc

1 sb:e-host

2 small-caps

3 display

3 formula

End of list:

5726 ce:cross-ref

6916 entry

7225 mml:mo

7760 sb:maintitle

7760 sb:title

7929 ce:label

8458 ce:hsp

9326 mml:mi

10331 mml:mrow

12438 ce:italic

16453 sb:author

17082 ce:given-name

17095 ce:surname
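The two ends of that list can be pulled apart mechanically. A small sketch, assuming the element names arrive one per line as countElements.xsl produces them; names.txt and its contents are hypothetical stand-ins for a real run:

```shell
workdir=$(mktemp -d)
# Stand-in for the stylesheet's output: one element name per line.
printf '%s\n' ce:italic ce:surname ce:italic doc ce:surname ce:italic > "$workdir/names.txt"

# Count, order by frequency (most common first), and reduce uniq's
# padded output to "count<TAB>name".
counts=$(sort "$workdir/names.txt" | uniq -c | sort -rn | awk '{print $1 "\t" $2}')

# Elements seen exactly once.
rare=$(printf '%s\n' "$counts" | awk -F'\t' '$1 == 1 {print $2}')

printf '%s\n' "$counts"
printf 'seen once: %s\n' "$rare"
```

Elements that occur only once may deserve a look at the source before any conversion code is written for them.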

Count element/parent combinations

<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:strip-space elements="*"/>
  <xsl:output method="text"/>

  <xsl:template match="text()"/>

  <!-- Output parentName/elementName on its own line, then recurse -->
  <xsl:template match="*">
    <xsl:value-of select="name(..)"/>/<xsl:value-of select="name()"/>
    <xsl:text>&#10;</xsl:text>
    <xsl:apply-templates/>
  </xsl:template>

</xsl:stylesheet>

Some parent/child counts

1 ce:displayed-quote/ce:simple-para

59 ce:biography/ce:simple-para

107 ce:legend/ce:simple-para

115 ce:abstract-sec/ce:simple-para

859 ce:caption/ce:simple-para
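A parent/child report like this one can be filtered for a single element of interest. A minimal sketch using the five counts above; the file name parentChild.out and the `child` variable are hypothetical, and the grep pattern assumes the element name contains no regular expression metacharacters such as a period:

```shell
workdir=$(mktemp -d)
# Counts copied from the slide; a real run would produce many more lines.
cat > "$workdir/parentChild.out" <<'EOF'
      1 ce:displayed-quote/ce:simple-para
     59 ce:biography/ce:simple-para
    107 ce:legend/ce:simple-para
    115 ce:abstract-sec/ce:simple-para
    859 ce:caption/ce:simple-para
EOF

# Hypothetical element of interest.
child='ce:simple-para'

# Keep only the lines whose child is $child and add up their counts.
total=$(grep "/$child\$" "$workdir/parentChild.out" | awk '{sum += $1} END {print sum}')

# Number of distinct parents the child appears under.
parents=$(grep -c "/$child\$" "$workdir/parentChild.out")

printf '%s appears %s times under %s different parents\n' "$child" "$total" "$parents"
```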

countAttributes.xsl

<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:strip-space elements="*"/>
  <xsl:output method="text"/>

  <xsl:template match="text()"/>

  <!-- Output elementName/@attributeName on its own line -->
  <xsl:template match="@*">
    <xsl:value-of select="name(..)"/>
    <xsl:text>/@</xsl:text>
    <xsl:value-of select="name()"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

  <!-- Visit child elements and attributes -->
  <xsl:template match="*">
    <xsl:apply-templates select="*|@*"/>
  </xsl:template>

</xsl:stylesheet>

Counting the attributes: an excerpt

1 ce:textbox/@id

28 ce:enunciation/@id

44 ce:table-footnote/@id

50 ce:biography/@id

79 ce:footnote/@id

104 ce:correspondence/@id

142 ce:table/@id

175 ce:affiliation/@id

180 ce:formula/@id

182 ce:section/@id

713 ce:figure/@id

4224 ce:bib-reference/@id

Count formula elements with/without ID values

<xsl:stylesheet version="1.0"
                xmlns:ce="http://www.elsevier.com/xml/common/dtd"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template match="/">
    Yes: <!-- finds 180 -->
    <xsl:value-of select="count(//ce:formula[@id])"/>
    No: <!-- finds 208 -->
    <xsl:value-of select="count(//ce:formula[not(@id)])"/>
  </xsl:template>

</xsl:stylesheet>

Find all values of a particular attribute

<xsl:stylesheet version="1.0"
                xmlns:ce="http://www.elsevier.com/xml/common/dtd"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Visit child elements and attributes -->
  <xsl:template match="*">
    <xsl:apply-templates select="*|@*"/>
  </xsl:template>

  <!-- Suppress text and all other attributes -->
  <xsl:template match="text()|@*"/>

  <!-- Output each value of the one attribute of interest -->
  <xsl:template match="ce:link/@locator">
    <xsl:value-of select="."/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>

Running OneAttValue.xsl

xsltproc OneAttValue.xsl issueContents.xml | sort | uniq -c | sort

• Output ending like this:

10 gr12

11 gr11

14 gr10

17 fx1

17 fx2

18 gr9

24 gr8

37 gr7

55 gr6

67 gr5

91 gr4

99 gr3

103 gr1

103 gr2

Output just the comments in a document

<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="text()"/>

  <xsl:template match="comment()">
    <xsl:copy/>
  </xsl:template>

</xsl:stylesheet>

Output just the processing instructions in a document

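<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml"/>

  <!-- Suppress text nodes, which the built-in rules would otherwise copy -->
  <xsl:template match="text()"/>

  <xsl:template match="processing-instruction()">
    <xsl:copy/>
  </xsl:template>

</xsl:stylesheet>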

elAttList.xsl goal

• Go through rng schema

• For each element, output dtdname.dtd \t elementName

• For each attribute, output dtdname.dtd \t elementName \t attributeName

elAttList.xsl part 1 of 2

<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:r="http://relaxng.org/ns/structure/1.0">

  <!-- Name of the DTD being reported on, passed in from the command line -->
  <xsl:param name="dtdname">no dtdname parameter supplied</xsl:param>

  <xsl:strip-space elements="*"/>
  <xsl:output method="text"/>

  <xsl:template match="r:files|r:attribute|r:value"/>

elAttList.xsl part 2 of 2

  <xsl:template match="r:element">
    <xsl:variable name="elName" select="@name"/>

    <!-- One line per element: dtdname TAB elementName -->
    <xsl:value-of select="$dtdname"/>
    <xsl:text>&#9;</xsl:text>
    <xsl:value-of select="@name"/>
    <xsl:text>&#10;</xsl:text>

    <!-- One line per attribute: dtdname TAB elementName TAB attributeName -->
    <xsl:for-each select="r:attribute | r:optional/r:attribute">
      <xsl:value-of select="$dtdname"/>
      <xsl:text>&#9;</xsl:text>
      <xsl:value-of select="$elName"/>
      <xsl:text>&#9;</xsl:text>
      <xsl:value-of select="@name"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>

    <xsl:apply-templates/>
  </xsl:template>

</xsl:stylesheet>

normalizeRNG.xsl

<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:r="http://relaxng.org/ns/structure/1.0">

  <xsl:output indent="yes"/>

  <!-- Replace a reference with the contents of the definition it names -->
  <xsl:template match="r:element/r:ref | r:optional/r:ref">
    <xsl:variable name="referent" select="@name"/>
    <xsl:apply-templates select="//r:define[@name = $referent]" mode="copying"/>
  </xsl:template>

  <!-- Copy everything else unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="r:define" mode="copying">
    <xsl:apply-templates select="node()"/>
  </xsl:template>

</xsl:stylesheet>

Analyzing an SGML DTD

• Why? When migrating away from it

• RNG or W3C XSD both XML, but not SGML

• Using Earl Hood’s perlSGML DTD analysis tools

XML-based analysis of SGML DTD

1. Run Earl Hood’s dtd2html utility

2. Run tagsoup or HTML Tidy on output files

3. Now you’ve got XML where you can pull out element information with XSLT

XML-based analysis of SGML DTD (revised)

1. Tweak dtd2html to add <div class="whatever"></div> elements

2. Run Earl Hood’s dtd2html utility

3. Run tagsoup or HTML Tidy on output files

4. Now you’ve got XML where you can pull out element information with XSLT

Summary

• This is not an integrated report generator. It’s Legos.

• Pipelining data between existing tools, re-usable scripts, and quick hacks.

• Document your command lines, e.g. saxon temp1.xml temp3.xsl > temp1a.xml

• Clients like reports, especially in spreadsheets.
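Getting one of these counts into a spreadsheet mostly means turning uniq's padded output into delimited columns. A rough sketch, assuming lines of the form "count element/@attribute" as in the attribute report earlier; the file names attCounts.out and attCounts.tsv and the column headers are hypothetical:

```shell
workdir=$(mktemp -d)
# Three lines copied from the attribute-count excerpt.
cat > "$workdir/attCounts.out" <<'EOF'
     50 ce:biography/@id
     79 ce:footnote/@id
    713 ce:figure/@id
EOF

# Turn "count element/@attribute" into element<TAB>attribute<TAB>count,
# with a header row, so that a spreadsheet can open the file directly.
{
  printf 'element\tattribute\tcount\n'
  awk '{ split($2, parts, "/@"); print parts[1] "\t" parts[2] "\t" $1 }' "$workdir/attCounts.out"
} > "$workdir/attCounts.tsv"

cat "$workdir/attCounts.tsv"
```

Keeping the commands in a script file like this also serves as the record of the command lines that produced the report.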

Thank you!

• Referenced resources: http://www.snee.com/xml/xml2008
