[VUFIND-493] XSLT Transformation: split keywords (single or compose terms) into distinct entries
Created: 30/Dec/11  Updated: 27/Jan/12  Resolved: 27/Jan/12
Status: Resolved
Project: VuFind
Component/s: OAI
Affects Version/s: None
Fix Version/s: 1.3
Type: Improvement
Reporter: Filipe M S Bento
Resolution: Fixed
Labels: oai, xml, xslt
Priority: Trivial
Assignee: Unassigned
Votes: 1
Description
Some OAI sources send values in a single instance of a certain field instead of repeating that
same field with the different values.
For instance:
<dc:subject>Brugada-like ECG, speckle tracking, two-dimensional strain imaging, TEI index,
sodium channel blocker</dc:subject>
enters VuFind as a single keyword
"Brugada-like ECG, speckle tracking, two-dimensional strain imaging, TEI index, sodium
channel blocker"
instead of 5 different ones: "Brugada-like ECG", "speckle tracking", etc.
This affects mainly alphabetical browsing and "direct" searches by clicking in that entry within
the record's full view.
A possible solution is to define an xsl:template to do that job and call it from within the main one
(<xsl:template match="oai_dc:dc">):
<xsl:template name="split_values">
  <xsl:param name="string" />
  <xsl:param name="delimiter" select="', '" />
  <xsl:choose>
    <xsl:when test="$delimiter and contains($string, $delimiter)">
      <field name="topic">
        <xsl:value-of select="substring-before($string, $delimiter)" />
      </field>
      <xsl:text></xsl:text>
      <xsl:call-template name="split_values">
        <xsl:with-param name="string" select="substring-after($string, $delimiter)" />
        <xsl:with-param name="delimiter" select="$delimiter" />
      </xsl:call-template>
    </xsl:when>
    <xsl:otherwise>
      <field name="topic">
        <xsl:value-of select="$string" />
      </field>
      <xsl:text> </xsl:text>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
... right before:
<xsl:template match="oai_dc:dc">
... and within this:
<xsl:call-template name="split_values">
<xsl:with-param name="string" select="//dc:subject" />
</xsl:call-template>
If the values are separated by dots, then change the value of the parameter "delimiter" to ". " (a dot followed by a space):
<xsl:param name="delimiter" select="'. '" />
Have a great 2012 when the time arrives!
All the best,
Filipe Bento
Comments
Comment by Filipe M S Bento [ 30/Dec/11 ]
PS: if someone knows a better / faster way to achieve this (some kind of "substring-split"
function that is implemented in XSLT 1.0), please do share it with us. Thanks!
Comment by Demian Katz [ 03/Jan/12 ]
In the .properties file associated with your import process (for example, import/ojs.properties),
there is a php_function[] setting which you can use to allow access to PHP functions from
within the XSLT. So you could edit the properties file to add this line:
php_function[] = explode
And then you could do something like this in the XSLT:
<xsl:for-each select="php:function('explode', $delimiter, string(//dc:subject))">
<field name="topic"><xsl:value-of select="normalize-space(string(.))" /></field>
</xsl:for-each>
That's untested code, and my knowledge of XSLT is a little shaky, so the syntax may not be
entirely correct... but hopefully it gives you the general idea.
Comment by Filipe M S Bento [ 03/Jan/12 ]
Thanks, Demian (and have a great year)!
I'll be sure to test and use that solution!
I had noticed that "php:function" call in some .properties files, but didn't know how to invoke it
(I thought those functions were defined somewhere within VuFind).
And yet, it's so simple! :)
Thanks and please do accept my best wishes of a great year to you, Demian and all your family
(@ home, Villanova Univ and VuFind's community one!),
Filipe
Comment by Demian Katz [ 04/Jan/12 ]
Thanks, let me know how it works out.
If you do get things working, would you be able to share an example properties/xslt file? It
would be nice to get a demonstration of this principle into the trunk so we can close out this
ticket.
Comment by Demian Katz [ 05/Jan/12 ]
It turns out that my proposed solution doesn't work -- you can't send a PHP array straight into
XSLT; XSLT expects DOMDocument objects. That makes my solution a bit less elegant...
I added a new custom function to import/xsl/vufind.php:
public static function explode($delimiter, $string)
{
    $parts = explode($delimiter, $string);
    $dom = new DOMDocument('1.0', 'utf-8');
    foreach ($parts as $part) {
        $element = $dom->createElement('part', $part);
        $dom->appendChild($element);
    }
    return $dom;
}
And then I made this change to the XSLT:
<xsl:for-each select="php:function('VuFind::explode', $delimiter, string(//dc:subject))/part">
<field name="topic"><xsl:value-of select="normalize-space(string(.))" /></field>
</xsl:for-each>
Note the “/part” at the end of the select to correspond with the part tags created by the PHP
code.
Let me know if you can think of a way to improve on this!
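One possible refinement, offered only as an untested sketch: the PHP manual notes that the value passed to DOMDocument::createElement() is not escaped, so a subject containing "&" could cause trouble. The variant below builds each part with createTextNode() and hangs the parts under a single root element. The function name explodeToDom and the root element name "parts" are hypothetical and not part of VuFind.

```php
<?php
// Hypothetical variant of the explode helper (not VuFind code).
// Each part is added as a text node so characters such as "&" are escaped,
// and all parts live under one root element named "parts".
function explodeToDom(string $delimiter, string $string): DOMDocument
{
    $dom = new DOMDocument('1.0', 'utf-8');
    $root = $dom->appendChild($dom->createElement('parts'));
    foreach (explode($delimiter, $string) as $part) {
        $element = $dom->createElement('part');
        $element->appendChild($dom->createTextNode($part));
        $root->appendChild($element);
    }
    return $dom;
}

// Example input containing an ampersand.
$dom = explodeToDom(', ', 'R&D, speckle tracking');
echo $dom->saveXML();
```

With this layout the select in the XSLT would presumably need to end in "/parts/part" rather than "/part"; that part is an assumption and has not been tried against a VuFind import.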
Comment by Filipe M S Bento [ 23/Jan/12 ]
Dear Demian, thank you for the improvements.
It works, and I am actually taking it a little further within the XSLT, trying to cover all the
possible situations, not leaving behind the possible combinations of separators and the effects
they may have on one another.
So I've used something like:
<!-- SUBJECT -->
<xsl:if test="//dc:subject">
  <xsl:for-each select="//dc:subject">
    <!-- Only the first matching xsl:when runs, so the order of the tests matters.
         string(.) splits the current dc:subject rather than the first one in the document. -->
    <xsl:choose>
      <xsl:when test="contains(., '. ')">
        <xsl:for-each select="php:function('VuFind::explode', '. ', string(.))/part">
          <field name="topic"><xsl:value-of select="normalize-space(string(.))" /></field>
        </xsl:for-each>
      </xsl:when>
      <xsl:when test="contains(., ', ')">
        <xsl:for-each select="php:function('VuFind::explode', ', ', string(.))/part">
          <field name="topic"><xsl:value-of select="normalize-space(string(.))" /></field>
        </xsl:for-each>
      </xsl:when>
      <xsl:when test="contains(., '; ')">
        <xsl:for-each select="php:function('VuFind::explode', '; ', string(.))/part">
          <field name="topic"><xsl:value-of select="normalize-space(string(.))" /></field>
        </xsl:for-each>
      </xsl:when>
      <xsl:otherwise>
        <field name="topic">
          <xsl:value-of select="." />
        </field>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:for-each>
</xsl:if>
Thanks,
Filipe
Comment by Demian Katz [ 23/Jan/12 ]
The only problem I see with your solution is that it can cause weird results if a string contains a
mix of separators. For example, something like "a, b; c" would result in these subjects being
indexed:
a
b; c
a, b
c
To avoid that issue, perhaps it would make more sense to use preg_split, which is like explode
but based on regular expressions. Then you could do something like preg_split('/[.,;]\s*/',
$string) to get better results -- i.e. in my previous example (assuming my regular expression
syntax is correct), you would get:
a
b
c
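A minimal sketch of the preg_split idea, to check the regular expression against the "a, b; c" example. The function name splitSubjects is hypothetical; only preg_split and the pattern come from the comment above, and the PREG_SPLIT_NO_EMPTY flag and the trim step are additions.

```php
<?php
// Hypothetical helper (not VuFind code): split a subject string on any of
// '.', ',' or ';' followed by optional whitespace.
function splitSubjects(string $string): array
{
    // PREG_SPLIT_NO_EMPTY drops empty pieces, e.g. after a trailing separator.
    $parts = preg_split('/[.,;]\s*/', $string, -1, PREG_SPLIT_NO_EMPTY);
    // Trim any whitespace left at the edges of each term.
    return array_map('trim', $parts);
}

print_r(splitSubjects('a, b; c')); // three entries: a, b, c
```

Note that this pattern also splits on a dot that is not followed by a space, so abbreviations such as "U.S." would be broken apart; using \s+ instead of \s* would stay closer to the ". ", ", " and "; " delimiters used elsewhere in this ticket.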
Comment by Filipe M S Bento [ 23/Jan/12 ]
Demian,
Sorry, when I said "not leaving behind the possible combinations and effects on them that they
may produce" I meant that the order of the different <xsl:when test="contains(., ... >
within the <xsl:choose> is by no means innocent; it matters, and a lot! And if the fields do
not use some of them as separators, just take those out.
I mean, correct me if I'm wrong, but the <xsl:choose> will only let one of those branches get
in, according to the order they are in; that is, if the first delimiter tested is "; " (even if the
second is ", "), the result will be two entries (the only ones and, may I say, the wanted ones!):
"a, b"
and
"c"
because it will not enter the second one (<xsl:when test="contains(., ', ')">),
and if the main delimiter is ";" then the combination "a, b" is for sure more important than
having "a" + "b" (for example, when "b" is a narrower term of the main subject "a" or any other
kind of subdivision).
I guess one has to look carefully at some random XML records from the source and check whether
they are consistent in the way they are indexed.
So far I have been lucky with this approach and it has worked perfectly, but it's a choice one
has to make.
All the best,
Filipe
PS: my apologies, I am no PHP or XSLT expert; I program by example, on a try-and-fail
basis; well, hopefully, "try and succeed", sometimes many, many hours later :)
Comment by Demian Katz [ 23/Jan/12 ]
You are correct -- the xsl:choose works as you describe. I wasn't thinking straight and confused
it with an if statement, but really it's a switch! Though my comment about using preg_split still
might be relevant if you end up in a situation where you might have multiple valid separators in
a single string.
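The first-match behaviour of xsl:choose discussed above can be mimicked in a few lines of PHP, purely as an illustration of why the order of the tests matters. The function name splitByFirstMatch is hypothetical and does not exist in VuFind.

```php
<?php
// Hypothetical helper (not VuFind code) mirroring xsl:choose:
// only the first delimiter found in the string is used for splitting.
function splitByFirstMatch(string $string, array $delimiters): array
{
    foreach ($delimiters as $delimiter) {
        if (strpos($string, $delimiter) !== false) {
            return array_map('trim', explode($delimiter, $string));
        }
    }
    // No delimiter matched: keep the value as a single entry (xsl:otherwise).
    return [$string];
}

// "; " tested first: the string splits into "a, b" and "c".
print_r(splitByFirstMatch('a, b; c', ['; ', ', ']));
// ". ", ", ", "; " order: ", " matches first, giving "a" and "b; c".
print_r(splitByFirstMatch('a, b; c', ['. ', ', ', '; ']));
```

So the list of delimiters should be ordered from the "main" separator used by the source down to the least significant one, as described in the comments above.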
Comment by Demian Katz [ 27/Jan/12 ]
A fix for the problem is demonstrated by the NDLTD import example committed as r4988.
Generated at Tue Feb 09 13:25:44 EST 2016 using JIRA 6.2.6#6264sha1:ee7642271310c09537d01e5848a003c4498a0eed.