[VUFIND-581] Authority Recommendation Module: search for non-preferred
term should return also the ones that have the preferred term & vice-versa.
Created: 16/May/12 Updated: 09/May/13
Status: Open
Project: VuFind
Component/s: Search
Affects Version/s: 1.3
Fix Version/s:
Type: Improvement
Reporter: Filipe M S Bento
Resolution: Unresolved
Labels: Search, feature Wishlist
Priority: Minor
Assignee: Unassigned
Votes: 0
Description
Following the great new module, ticket VUFIND-538, by Ronan McHugh with enhancements
from Demian, it would be nice if a search for a non-preferred term returned not only the records
that have that exact term but also the ones that have the preferred term.
This is quite a common feature in most ILS systems that have AUT bases up & running. I
mean, for search purposes they are equivalent.
Example from ours:
Poluição do ar
Use for:
Poluição da atmosfera
Poluição atmosférica
Poluentes atomosférico
A search for "Poluição atmosférica" should return the records that (wrongly) have this non-preferred term, but also the ones with "Poluição do ar" in their subject.
VuFind has a synonyms table; would it be too hard to connect them?
Well, I leave the idea for further analysis, I hope.
Filipe
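The equivalence requested above can be illustrated with a minimal sketch. It assumes the authority data is available as a plain mapping from each preferred term to its "Use for" variants; the names AUTHORITY and equivalent_terms are hypothetical and are not part of VuFind.

```python
# Hypothetical authority data: preferred heading -> non-preferred ("Use for") terms.
# The entries are copied from the example in the ticket description.
AUTHORITY = {
    "Poluição do ar": [
        "Poluição da atmosfera",
        "Poluição atmosférica",
        "Poluentes atomosférico",
    ],
}


def equivalent_terms(term, authority=AUTHORITY):
    """Return the preferred heading followed by its variants when term matches any of them."""
    wanted = term.casefold()
    for preferred, variants in authority.items():
        group = [preferred, *variants]
        if any(wanted == candidate.casefold() for candidate in group):
            return group
    # Unknown term: nothing to expand, search it unchanged.
    return [term]
```

Searching for any member of the group would then retrieve records indexed under every member, with the preferred heading listed first.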
Comments
Comment by Ronan McHugh (Inactive) [ 17/May/12 ]
If I understand correctly, this feature would essentially cut out the middle-man of the See Also
box, and send users straight to search results for the authoritative term. This might be
appropriate for searches where there is only one suggestion from the module, but I wonder if it
is appropriate for searches where there are several suggestions? In this case, being forced to be
more specific might actually be a good user experience, otherwise they will get a long list of
results that they did not want.
Comment by Filipe M S Bento [ 17/May/12 ]
@Ronan McHugh:
Hi!
No, I was not talking about "see also" (that is great as it is!), by no means: that would completely
ruin the search experience, giving results not wanted at all!
I was talking about, and only about, the non-preferred terms (Use for:, block 4, not block 5 [see also]),
terms that were replaced by new terminology and that shouldn't have associated records in the
database! In that case, the Authority Recommendation Module would display at the top: use xxx
instead of your term, and display the records for xxx in the result list instead of nothing found. Most
ILSs do that.
Thanks,
Filipe
PS: For reference on preferred and non-preferred terms, I kindly suggest consulting an
example found on the run: http://publish.uwo.ca/~craven/677/thesaur/main04.htm
Comment by Demian Katz [ 17/May/12 ]
The recommendation module interface allows the module to modify the main search before it is
executed... so it may be a matter of adjusting the query to include an "OR" clause. As Ronan
says, though, the trick is deciding when it is appropriate to do this -- if you get many possible
cross-references, you don't want to apply all of them or you will have anarchy.
Perhaps another possibility is to offer the option of either "change your query to this" or "add
this to your query" (sort of like the "expand search" option of the existing spell check code). The
problem there is finding a user interface that makes sense; I suspect nobody understands the
purpose of the expand spelling option as it stands, so it's not exactly a great model to build upon.
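The two options mentioned in the comment above ("change your query to this" versus "add this to your query") could be sketched as follows. This is only an illustration: rewrite_query and its mode values are hypothetical names, not the recommendation module's actual interface.

```python
def quote(term):
    """Wrap a term in double quotes so it is searched as a phrase."""
    return '"' + term.replace('"', '\\"') + '"'


def rewrite_query(user_term, preferred_term, mode="expand"):
    """Build the query string for one of the two options.

    mode="replace": "change your query to this"
    mode="expand":  "add this to your query" (an OR clause)
    """
    if mode == "replace":
        return quote(preferred_term)
    if mode == "expand":
        return f"{quote(user_term)} OR {quote(preferred_term)}"
    raise ValueError(f"unknown mode: {mode}")
```

The "expand" mode keeps the user's own term in the query, so records that still carry the non-preferred form are not lost.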
Comment by Filipe M S Bento [ 17/May/12 ]
I am so sorry, but, and again, I am talking about non-preferred terms. And sorry to ask, but are
you into AUT DBs deeply? I mean, are you familiar enough with "preferred and non-preferred
terms" ~ "authorized forms of headings and unauthorized forms"?
Sorry to be so rough, but this is quite different from "see also": preferred and non-preferred terms
are in practice equivalent; it's just a matter of the evolution of the term, or of using it to store the
so-called unauthorized forms of a heading.
For instance, from MeSH:
B, Lecithinase (no records in our DB)
Use instead:
Lysophospholipase (has link)
and if we click in it (Lysophospholipase):
Lysophospholipase > we get 3 records in our DB and a note:
Note:
91(80); was see under PHOSPHOLIPASES 1980-90
I mean in AlphaBrowse.
You should never get results from non-preferred terms if your ILS messaging queue is working
ok (update Bib records from AUT ones).
So to sum up: they are equivalent and "get many possible cross-references, you don't want to
apply all of them or you will have anarchy" will never apply, Demian! Sorry... :)
The problem, which I am going to note in VUFIND-538 along the lines of what you suggest,
Demian, is this: if we search for a non-preferred term (if it shows records --> well, they should be
corrected to the preferred term), the Authority Recommendation Module shows at the top, when
searching for the subject "Contaminação ambiental" (for instance):
See also:
Poluição
having the AUT record as follows:
Poluição
See also:
Engenharia ambiental
Resíduos perigosos
Biorremediação
Use for:
Contaminação
Contaminação ambiental
CORRECT > It should show:
Use instead:
Poluição
like AlphaBrowse shows! It seems like little or no difference, but ask your librarians and listen to what
they say (I'm not a Librarian, but I taught this and implemented our AUT DB from scratch).
You should consider these 4xx fields a help for users searching for the old term to get to
the right, current term in use in subject headings / fields in records (as the old one should
meanwhile have been replaced by the new preferred term), and also a log, a history of that term's
evolution. This block is also used to store unauthorized forms of headings (hey, that's why they
call it an Authority database), from which the ILS redirects the search to the authorized form of the
term.
Anyway, for the ones that like to have some strong authoritative reference, here you go:
http://www.loc.gov/marc/authority/ad4xx.html.
Hope this makes this discussion a little bit more clear.
Filipe
PS: another example from MeSH, with a date range for the ex-preferred term, now a non-preferred one (NANOSTRUCTURES > Nanoparticles, after 2007) - NANOSTRUCTURES
should return no records, yet this term should appear a lot in the literature written between 2005
and 2006:
Nanoparticles
Note:
2007; use NANOSTRUCTURES 2005-2006
Comment by Ronan McHugh (Inactive) [ 18/May/12 ]
ok, apologies for the misunderstanding, I think I've got it now... If the user searches for a non-preferred term, the module will automatically modify the search to be for the preferred term,
correct? In this case, there should be no problem with the redirect, since there is only the one
term. Now, it's worth pointing out that not every Authority Recommendation will only return
one result. If a user's search is vague, for example, there may well be several recommendations;
in this case, a redirect is clearly impossible.
In terms of implementation, I can imagine two paths.
1) An additional method to check for use of a non-preferred term in the authority index. If it
returns true, it will modify the search terms. This method will have to be quite strict in terms of
only returning true when it is definitely a non-preferred term and not just something which
could be a non-preferred term, but could also be something else.
2) Functionality in the Authority Recommendation module that will check the return from the
Authority index and modify the search terms if there is only one return which is a preferred
term.
I'd probably lean towards 1), but I'd be interested to hear other opinions.
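Path 1) above could look roughly like the sketch below: a strict check that only fires when the search term is exactly a non-preferred form of a single heading. The record layout and the function names (normalize, build_use_for_index, find_redirect) are assumptions made for illustration; the real authority index is a Solr core, not a Python dict.

```python
def normalize(term):
    """Collapse whitespace and ignore case so matching stays exact but forgiving."""
    return " ".join(term.split()).casefold()


def build_use_for_index(records):
    """Map each normalized non-preferred term to the set of headings that list it.

    records is an iterable of (preferred_heading, [use_for_terms]) pairs.
    """
    index = {}
    for preferred, use_for in records:
        for variant in use_for:
            index.setdefault(normalize(variant), set()).add(preferred)
    return index


def find_redirect(term, index):
    """Return the preferred heading only for an unambiguous, exact non-preferred match."""
    matches = index.get(normalize(term), set())
    if len(matches) == 1:
        return next(iter(matches))
    # No match, or the term is a variant of several headings: leave the search alone.
    return None
```

A partial keyword such as "ambiental" returns None here, which is the strictness the comment asks for.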
Comment by Demian Katz [ 18/May/12 ]
I think we have to be a little bit careful here. I didn't previously comment about the see also/use
instead distinction for a few reasons:
1.) Since we're doing a keyword search against the authority records, just because we find a
"preferred term" in the results matching the user's keywords, that doesn't necessarily mean that
it applies to the user's intended search. We're not in left-anchored heading search anymore, and
our users generally aren't thinking that way, so our strategies have to change.
2.) For optimal authority functionality, your records need to be consistently generated using
authorities. If authorities are applied properly, you have a guarantee that you'll never run across
a non-preferred term. However, in reality, records aren't necessarily going to be so clean. If you
harvest from multiple data sources, they may not use the same authorities. If your local
catalogers haven't updated their authorities in a while, they may get out of sync with the ones in
VuFind (if you're loading FAST data instead of a local authority file). Authorities may help
users find things, but I don't think we can safely assume that they offer the only answer.
3.) Librarians understand and care about this distinction. I'm a librarian -- I care too. However,
end users generally do not. I figured if we provided a bunch of links that might lead to better
search results, the user would click one without worrying about its exact nature or origin.
Perhaps it would be helpful to separate the results into "see also" and "use for" lists, but in the
keyword-searching context, I don't know if that is especially meaningful.
Anyway, all of that being said, I think there is some room for improvement... but a few
thoughts:
1.) Any functionality that modifies the search terms should be configurable so it can be turned
off. Some libraries will want it, but others will run into undesired side effects and will want to
turn it off. I would recommend adding an "OR" instead of completely changing a search query,
just in case some non-preferred versions of a term are lurking in the index in old records that
haven't been corrected yet.
2.) If the search is modified, there should probably be an on-screen message indicating what
happened and why.
3.) Perhaps search modification should happen after the search has been executed so we can
account for how many records matched the original search term.
Whew, this is getting long-winded. In any case, I agree with Filipe that we should use the
authority data to its best advantage. I just think we need to be careful that we account for the
unique strengths and weaknesses of VuFind's style of search. We can't simply behave like an
old OPAC, because the data and the interface work differently.
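The three suggestions above (a configuration switch, an on-screen message, and deciding after the original search has run) can be combined in a small sketch. Everything here is hypothetical: ExpansionConfig, plan_search and the hit-count threshold are illustrative assumptions, not existing VuFind settings.

```python
from dataclasses import dataclass


@dataclass
class ExpansionConfig:
    enabled: bool = True        # libraries can switch the feature off
    max_original_hits: int = 0  # assumed rule: expand only if the original search found at most this many records


def plan_search(user_term, preferred_term, original_hits, config):
    """Decide, after the first search has run, whether to re-run it with an OR clause.

    Returns (query, message); message is None when the query is left unchanged.
    """
    unchanged = (f'"{user_term}"', None)
    if not config.enabled or preferred_term is None:
        return unchanged
    if original_hits > config.max_original_hits:
        return unchanged
    query = f'"{user_term}" OR "{preferred_term}"'
    message = (
        f'Your search for "{user_term}" was expanded to include '
        f'the preferred term "{preferred_term}".'
    )
    return query, message
```

Using an OR clause rather than a replacement follows the recommendation in point 1.) of the comment, so old records with the non-preferred term still show up.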
Comment by Filipe M S Bento [ 18/May/12 ]
Demian,
> 3.) Librarians understand and care about this distinction. I'm a librarian -- I care too. However,
end users generally do not
that was why I was proposing that... as long as they get the relevant records, users don't care if
they entered a preferred or non-preferred term / authorized or unauthorized form of a
heading... blame Google and the like for that... who cares about entering a well-written term? Google will
suggest the correct spelling and even show the results for this correctly spelled term… hey: that
is exactly what I am suggesting in this ticket!! :)
... as long as...
> 2.) If the search is modified, there should probably be an on-screen message indicating what
happened and why.
Bingo! ... EDIT: Google again! :)
Ok, but as it is now, with the mentioned correction of not displaying "see also" in the cases of
4xx terms (use instead), and with Ronan's solution, which is a good one (a contingency one), we are
good and ready to go!
I mean, VuFind, being an NGC solution, should port to its core the same advanced features we
find in OGC… :)
Btw,
> 1.) Since we're doing a keyword search against the authority records, just because we find a
"preferred term" in the results matching the user's keywords, that doesn't necessarily mean that
it applies to the user's intended search. We're not in left-anchored heading search anymore, and
our users generally aren't thinking that way, so our strategies have to change.
Sorry, could you please elaborate on this... my bad; if I understood correctly, that is what I am
proposing, not the opposite. I think we should "skype"... :) perhaps in other words I will
understand what you are saying... :)
Filipe
PS: but wait.... You know what would be really, really nice, as we have our Library records
indexed with subjects in Portuguese? To have this feature enabled for the term in another
language:
e.g.,
AUT record:
Main heading: Contaminação da água
Term in English: Water - Pollution
Term in English: Water pollution
When a user searches for "Water pollution" the system should retrieve our Libraries' Catalog
records with "Contaminação da água" in their subject list (and yes, with a warning message too,
at the top)... and again, vice-versa. As long as the AUT DB is kept updated (or external ones are used,
like EUROVOC, http://eurovoc.europa.eu/), this is an entirely new world for searches: search for a
term (subject) in your native language and it would retrieve records indexed with the
corresponding term in any language present in the AUT DB records' fields.
Ah... ah... another one to think about (and I guess plenty of discussion meanwhile... :) )
PPS: we should mark these discussions as CONFIDENTIAL... :) Ok, seriously speaking, for sure
there are ILSs out there that do this (I'm pretty sure ALEPH does... perhaps we just don't have it
configured to do it).
Comment by Filipe M S Bento [ 24/May/12 ]
Know what?
I've test-run a solution for this, a very pragmatic one:
1) Using the base instructions here: http://vufind.org/wiki/stop_words_and_synonyms#synonyms
2) Data here: http://eurovoc.europa.eu/drupal/?q=pt/download/list_pt&cl=en (using EUROVOC
as a test bed)
EDIT: if you don't feel comfortable enough (yet) with EUROVOC's Portuguese interface, you
can use the EN one: http://eurovoc.europa.eu/drupal/?q=download/list_pt&cl=en (noticed? Drupal;
OSS is taking the world... for free!)
(download according to your main language and, let’s say... top of my head... I don't know
which to choose, really, ok, I think I'll go for... closed-eyes choice… English! :) or even more
langs if you have enough records to return something in those other languages);
2.1) column F: =B1 & "," & C1 (fill down for the remaining lines / expand the formula if you
have more langs, besides yours and the other one I’ve randomly chosen… I think it was
English… :) ).
3) Copy-paste column F to ./solr/biblio/conf/synonyms.txt (append);
3.1) Pay special attention to converting the file to UTF-8 encoding, if you have special chars in
your main language --- else, SOLR won't start, when you…
4) Restart VuFind;
5) Test with some of those words or expressions, any field, subject, etc. (well, those two... all
the other fields are the same whatever the lang);
5.1) Your search should retrieve not only records indexed with the term you have entered, but
also records with that term's translation in the other language;
6) If it is ok, good! If not, go to sleep (after all it’s 4am+) and with a fresh start tomorrow
(yours) it will work ok (happened to me twice, but instead of 4am it was... 7am!);
7) This is not, and I mean really, really, really not, a solution at all; it is just for you to get an idea
of how a multi-language thesaurus (much faster --- reading from a SOLR index, not a txt file) may
bring the discovery experience to a brand new level, one that even the most expensive solutions do
not "offer", if I'm not mistaken (do it on a dev/ server... don't know the load it will put on a prod/
server!).
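The spreadsheet steps above (join two term columns with a comma, append to synonyms.txt, keep the file in UTF-8) can also be scripted. The sketch below assumes the downloaded list has been saved as tab-separated text with one row per concept and one column per language; that layout and all function names are assumptions, not a description of the EUROVOC files. Commas inside a term are escaped with a backslash, which the Solr synonym format is understood to accept; check this against your Solr version.

```python
import csv
import io


def synonym_line(terms):
    """Join equivalent terms into one comma-separated synonyms.txt line."""
    cleaned = [t.strip().replace(",", "\\,") for t in terms if t.strip()]
    return ",".join(cleaned)


def build_synonym_lines(tsv_text):
    """Turn tab-separated rows (one term per language) into synonym lines."""
    rows = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    lines = []
    for row in rows:
        terms = [t for t in row if t.strip()]
        if len(terms) >= 2:  # a single term has no synonym to pair with
            lines.append(synonym_line(terms))
    return lines


def append_synonyms(path, lines):
    """Append the lines to a synonyms file, always writing UTF-8."""
    with open(path, "a", encoding="utf-8", newline="\n") as handle:
        for line in lines:
            handle.write(line + "\n")
```

As with the manual procedure, this is best tried on a dev/ server first, followed by the restart described in the steps above.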
All the best,
Filipe
PS: Demian: try in mine for "poluição do ar", for instance (= air pollution).
PPS: Demian, I've warned that it is the ugliest solution... it is just to give an idea of a new degree
in discovery... no more language barriers in finding the right terms to search for; users will be
able to access just the resources in the language they feel comfortable with, by facet
filtering.
PPPS: Demian (sorry, again): it's your fault that I'm reporting this now... I was reading
http://vufind.org/wiki/developers_call:minutes20120529 and had to relate this experiment, done
a couple of hours (4h) ago... :)
Comment by Demian Katz [ 25/May/12 ]
Filipe,
As you say, this is definitely a straightforward, pragmatic solution. The only limitation is that
you can't turn off the synonyms, and I suspect that in some (many?) cases they might cause the
result list to include undesirable or confusing results. I think the advantage to building a
recommendation module is that it could display at the top of the results:
"You can broaden your search by including synonyms. Click here to try it!"
...and then add some parameter to the URL to activate synonyms. Without doing some research,
I'm not sure if there's an easy way to toggle Solr's use of synonyms.txt -- I suspect it would
require creating two different field types (one with synonyms, one without) and two copies of
every field. I wouldn't call that an easy way. But another option is to simply let the
recommendation module pre-process the search query, as previously discussed -- not as elegant
or fast as letting Solr do the work for you, but possibly a less complicated solution if you want
toggling.
- Demian
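The "two field types, two copies of every field" idea could be toggled from a URL parameter along these lines. This is a sketch under stated assumptions: the synonyms parameter, the _syn field suffix and the subject field name are hypothetical, and lookfor is assumed to carry the search term.

```python
from urllib.parse import parse_qs


def search_field(query_string, base_field="subject"):
    """Pick the synonym-enabled copy of a field only when the user opted in."""
    params = parse_qs(query_string)
    use_synonyms = params.get("synonyms", ["0"])[0] == "1"
    return f"{base_field}_syn" if use_synonyms else base_field


def solr_query(query_string):
    """Build a fielded phrase query against the chosen copy of the field."""
    params = parse_qs(query_string)
    term = params.get("lookfor", [""])[0]
    return f'{search_field(query_string)}:"{term}"'
```

A "Click here to try it!" link would then simply repeat the current search with synonyms=1 added to the URL.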
Comment by Filipe M S Bento [ 25/May/12 ]
Demian, hi again!
The ideal would be to use the associated entry in the AUT(horities) DB for the term in other
languages (I can't find it in MARC21, but in UNIMARC it is block 6xx, if I'm not mistaken), if
the AUT DB were well maintained for the present terminology.
OR have a text / DB (even MySQL) version of this http://www.amazon.com/EnglishPortuguese-Portuguese-English-Dictionary-Technical-Scientific/dp/9722214926 loaded and an
OR behind the scenes (switchable in the search form). Demian, I also have some philosophies about
certain stuff... :) one is that in a Discovery Solution any extra click represents thousands of lost
clients... give them everything and, if they want, enable them to narrow the search (facets).
Most of the time, less is more, but in Discovery this might result in leaving some precious
resources hidden (as they were before), which is a step back... unless we are going in the wrong
direction (then it would be good to take a step back :) ).
Filipe
PS: Demian, so sorry, but I'm not seeing any situation at all where the solution I’ve sent (proof of
concept) could produce the effect of "include(ing) undesirable or confusing results"; like we
(and I mean you too :) ) in high level support always ask for: examples, please :)
Comment by Demian Katz [ 25/May/12 ]
The "undesirable or confusing results" comment really depends on the nature of the synonyms
in your system. For example, suppose we set up this synonym:
lift = elevator
Now someone does a search for lift, intending the concept in physics. Their search results are
now going to be polluted with results about mechanisms for moving people from floor to floor
within buildings.
I understand that within the context of exactly matching terms from a controlled vocabulary,
you are protected against this kind of thing. However, we are working in a keyword searching
environment here, so these situations are more dangerous.
I agree that in a discovery environment, it is good to provide the user with a lot of results and let
them filter down. However, it is important that the top results returned at least make some
degree of sense. Synonyms have the potential to give high relevance boosting to things that
don't have an obvious relationship to the user's search query. In my lift example, if the user gets
a result set where some things are about lift (the object) and others are about lift (the concept), at
least it is obvious that the word "lift" is present in all of these things. If they do the same search
and their top hit is the "Elevator Repair Manual" and the word lift is nowhere to be found, they
may or may not figure out what has happened. If the results appear to be completely illogical,
they are not going to stick around and refine them -- they are going to start over and try a
different strategy. I feel that an "opt-in" strategy is safer when you risk dramatically changing
the user's query.
I admit that this is all speculation -- the real risk depends entirely upon the data set used... and of
course it's very easy to implement this so that it can be configured to be either automatic or opt-in based on local preferences, so it's not like we need to come up with a single definitive answer
that works for everyone. I just feel like precision still counts, even if recall is more popular in
the age of discovery.
Comment by Filipe M S Bento [ 25/May/12 ]
Ok, you got me with that one:
(to) lift (verb) = elevator (object)
But I wonder, like you said, whether under a controlled vocabulary there is a chance of that happening
more than a mere couple of times in hundreds... and I guarantee you that when using an AUT DB it
won't happen, because it only retrieves the related records in context; that is something
Authorities have the onus of giving.
Hey, but this is me saying...
Have a nice holiday weekend and see you Tuesday at “Developers’” Call;
- Filipe
PS: I guess this is a never-ending discussion (good or bad, I just don’t know anymore :) ) --- But I'm
sure something positive for VuFind will come out of it, whichever way it goes (or ways, as you
said: let the final "customer" decide; better than not having a solution, or at least not having
analyzed a possible one, even if the decision is to keep things as they are)!
PPS: Yes, it is very annoying to explain why /Author/Home?author=Austen%2C%20Jane was
retrieving info from Wikipedia about another completely different author (now it's ok:
http://193.137.169.90/Author/Home?author=Austen%2C%20Jane%2C%201775-1817 >> even
without the dates) and have no explanation for it, just, sorry, that is the info the system is
seeing… I’m with you: accuracy above all (no excuses).
Generated at Tue Feb 09 09:39:58 EST 2016 using JIRA 6.2.6#6264sha1:ee7642271310c09537d01e5848a003c4498a0eed.