[VUFIND-581] Authority Recommendation Module: search for non-preferred term should return also the ones that have the preferred term & vice-versa.

Created: 16/May/12 Updated: 09/May/13

Status: Project: Component/s: Affects Version/s: Fix Version/s: Open VuFind Search 1.3
Type: Reporter: Resolution: Labels: Improvement Filipe M S Bento Unresolved Search, feature Wishlist
Priority: Assignee: Votes: Minor Unassigned 0

Description

Following the great new module from ticket VUFIND-538, by Ronan McHugh with enhancements from Demian, it would be nice if a search for a non-preferred term returned not only the records that have that exact term, but also the ones that have the preferred term. This is quite a common feature in most ILS systems that have AUT bases up & running. I mean, for search purposes they are equivalent.

Example from ours:

Poluição do ar
Use for:
Poluição da atmosfera
Poluição atmosférica
Poluentes atomosférico

A search for "Poluição atmosférica" should return the records that have (wrongly) this non-preferred term, but also the ones with "Poluição do ar" in their subject.

VuFind has a synonyms table; would it be too hard to connect them? Well, I leave the idea for further analysis, I hope.

Filipe

Comments

Comment by Ronan McHugh (Inactive) [ 17/May/12 ]
If I understand correctly, this feature would essentially cut out the middle-man of the See Also box, and send users straight to search results for the authoritative term. This might be appropriate for searches where there is only one suggestion from the module, but I wonder if it is appropriate for searches where there are several suggestions? In this case, being forced to be more specific might actually be a good user experience, otherwise they will get a long list of results that they did not want.

Comment by Filipe M S Bento [ 17/May/12 ]
@Ronan McHugh: Hi!
No, I was not talking about "see also" (that is great as it is!), by no means: that would completely ruin the search experience, giving results not wanted at all! I was talking only about the non-preferred terms (Use for:, block 4, not block 5 [see also]), terms that were replaced by new terminology and that shouldn't have associated records in the database! In that case, the Authority Recommendation Module would display at the top "use xxx instead of your term" and show the records for xxx in the result list instead of "nothing found". Most ILSs do that.

Thanks,
Filipe

PS: For reference on preferred and non-preferred terms, I kindly suggest consulting this example, found on the run: http://publish.uwo.ca/~craven/677/thesaur/main04.htm

Comment by Demian Katz [ 17/May/12 ]
The recommendation module interface allows the module to modify the main search before it is executed... so it may be a matter of adjusting the query to include an "OR" clause. As Ronan says, though, the trick is deciding when it is appropriate to do this -- if you get many possible cross-references, you don't want to apply all of them or you will have anarchy. Perhaps another possibility is to offer the option of either "change your query to this" or "add this to your query" (sort of like the "expand search" option of the existing spell check code). The problem there is finding a user interface that makes sense; I suspect nobody understands the purpose of the expand spelling option as it stands, so it's not exactly a great model to build upon.

Comment by Filipe M S Bento [ 17/May/12 ]
I am so sorry, but, and again, I am talking about non-preferred terms. And sorry to ask, but are you into AUT DBs deeply? I mean, are you familiar enough with "preferred and non-preferred terms" ~ "authorized forms of headings and unauthorized forms"?
Sorry to be so rough, but this is quite different from "see also": preferred and non-preferred terms are in practice equivalent ones; it's just a matter of the evolution of the term, or of using the field to store the so-called unauthorized forms of a heading. For instance, from MeSH:

B, Lecithinase (no records in our DB)
Use instead: Lysophospholipase (has link)

and if we click on it (Lysophospholipase):

Lysophospholipase > we get 3 records in our DB and a note:
Note: 91(80); was see under PHOSPHOLIPASES 1980-90

I mean in AlphaBrowse. You should never get results from non-preferred terms if your ILS messaging queue is working ok (updating Bib records from AUT ones).

So to sum up: they are equivalent, and "get many possible cross-references, you don't want to apply all of them or you will have anarchy" will never apply, Demian! Sorry... :)

The problem is a note I am going to put in VUFIND-538, in the line of what you suggest, Demian: if we search for a non-preferred term (if it shows records --> well, they should be corrected to the preferred term), the Authority Recommendation Module shows at the top, when searching for the subject "Contaminação ambiental" (for instance):

See also: Poluição

having the AUT record as follows:

Poluição
See also:
Engenharia ambiental
Resíduos perigosos
Biorremediação
Use for:
Contaminação
Contaminação ambiental

CORRECT > It should show:

Use instead: Poluição

like AlphaBrowse shows! It seems like little or no difference, but ask your librarians and listen to what they say (I'm not a librarian, but I taught this and implemented our AUT DB from scratch). You should consider these 4xx fields a help for users searching for the old term to get to the right, current term in use in subject headings / fields in records (as the old one should have been replaced meanwhile by the new preferred term), and also a log, a history of that term's evolution.
This block is also used to store unauthorized forms of headings (hey, that's why they call it an Authority database), from which the ILS redirects the search to the authorized form of the term. Anyway, for those who like to have a strong authoritative reference, here you go: http://www.loc.gov/marc/authority/ad4xx.html. Hope this makes this discussion a little bit clearer.

Filipe

PS: another example from MeSH, with a date range for the ex-preferred term, now a non-preferred one (NANOSTRUCTURES > Nanoparticles, after 2007) - NANOSTRUCTURES should return no records, yet this term should appear a lot in the literature written between 2005 and 2006:

Nanoparticles
Note: 2007; use NANOSTRUCTURES 2005-2006

Comment by Ronan McHugh (Inactive) [ 18/May/12 ]
OK, apologies for the misunderstanding, I think I've got it now... If the user searches for a non-preferred term, the module will automatically modify the search to be for the preferred term, correct? In this case, there should be no problem with the redirect, since there is only the one term. Now, it's worth pointing out that not every Authority Recommendation will only return one result. If a user's search is vague, for example, there may well be several recommendations; in this case, a redirect is clearly impossible. In terms of implementation, I can imagine two paths.

1) An additional method to check for use of a non-preferred term in the authority index. If it returns true, it will modify the search terms. This method will have to be quite strict in terms of only returning true when it is definitely a non-preferred term and not just something which could be a non-preferred term, but could also be something else.

2) Functionality in the Authority Recommendation module that will check the return from the Authority index and modify the search terms if there is only one return which is a preferred term.

I'd probably lean towards 1), but I'd be interested to hear other opinions.
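Editorial sketch of path 1) above. This is illustrative Python, not VuFind code: the `AuthorityRecord` class, the in-memory index, and all function names are hypothetical stand-ins for a lookup against the real authority index. It only shows what "quite strict" could mean in practice, assuming a rewrite should happen only when the whole query matches a "Use for" entry exactly (ignoring case and extra spaces), maps to exactly one preferred term, and is not itself a preferred heading. The sample data comes from the examples earlier in this thread.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional, Set, Tuple


@dataclass
class AuthorityRecord:
    """Hypothetical, simplified authority record: one heading plus its 4xx forms."""
    preferred: str
    use_for: List[str] = field(default_factory=list)


def normalize(term: str) -> str:
    """Collapse whitespace and ignore case so matching stays exact but forgiving."""
    return " ".join(term.split()).casefold()


def build_index(records: Iterable[AuthorityRecord]) -> Tuple[Dict[str, Set[str]], Set[str]]:
    """Return (non-preferred form -> preferred headings, set of preferred headings)."""
    use_for_index: Dict[str, Set[str]] = {}
    preferred_forms: Set[str] = set()
    for rec in records:
        preferred_forms.add(normalize(rec.preferred))
        for variant in rec.use_for:
            use_for_index.setdefault(normalize(variant), set()).add(rec.preferred)
    return use_for_index, preferred_forms


def preferred_term_for(query: str, use_for_index: Dict[str, Set[str]],
                       preferred_forms: Set[str]) -> Optional[str]:
    """Strict check: answer only when the query is definitely a non-preferred term."""
    key = normalize(query)
    if key in preferred_forms:
        return None  # already an authorized heading: leave the search alone
    matches = use_for_index.get(key, set())
    if len(matches) != 1:
        return None  # unknown term, or ambiguous between several headings
    return next(iter(matches))


def rewrite_query(query: str, use_for_index: Dict[str, Set[str]],
                  preferred_forms: Set[str]) -> Tuple[str, Optional[str]]:
    """Return (query to run, on-screen message or None if nothing changed)."""
    preferred = preferred_term_for(query, use_for_index, preferred_forms)
    if preferred is None:
        return query, None
    return preferred, f'Use instead: {preferred} (you searched for "{query}")'


if __name__ == "__main__":
    records = [
        AuthorityRecord("Poluição do ar", ["Poluição da atmosfera", "Poluição atmosférica"]),
        AuthorityRecord("Poluição", ["Contaminação", "Contaminação ambiental"]),
    ]
    use_for_index, preferred_forms = build_index(records)
    print(rewrite_query("Poluição atmosférica", use_for_index, preferred_forms))
    print(rewrite_query("Poluição", use_for_index, preferred_forms))
```

Returning the message alongside the query keeps the redirect visible to the user rather than silent. A keyword search that merely contains a non-preferred term is deliberately left untouched here; handling that case would need a looser, riskier match.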
Comment by Demian Katz [ 18/May/12 ]
I think we have to be a little bit careful here. I didn't previously comment about the see also/use instead distinction for a few reasons:

1.) Since we're doing a keyword search against the authority records, just because we find a "preferred term" in the results matching the user's keywords, that doesn't necessarily mean that it applies to the user's intended search. We're not in left-anchored heading search anymore, and our users generally aren't thinking that way, so our strategies have to change.

2.) For optimal authority functionality, your records need to be consistently generated using authorities. If authorities are applied properly, you have a guarantee that you'll never run across a non-preferred term. However, in reality, records aren't necessarily going to be so clean. If you harvest from multiple data sources, they may not use the same authorities. If your local catalogers haven't updated their authorities in a while, they may get out of sync with the ones in VuFind (if you're loading FAST data instead of a local authority file). Authorities may help users find things, but I don't think we can safely assume that they offer the only answer.

3.) Librarians understand and care about this distinction. I'm a librarian -- I care too. However, end users generally do not. I figured if we provided a bunch of links that might lead to better search results, the user would click one without worrying about its exact nature or origin. Perhaps it would be helpful to separate the results into "see also" and "use for" lists, but in the keyword-searching context, I don't know if that is especially meaningful.

Anyway, all of that being said, I think there is some room for improvement... but a few thoughts:

1.) Any functionality that modifies the search terms should be configurable so it can be turned off. Some libraries will want it, but others will run into undesired side effects and will want to turn it off.
I would recommend adding an "OR" instead of completely changing a search query, just in case some non-preferred versions of a term are lurking in the index in old records that haven't been corrected yet.

2.) If the search is modified, there should probably be an on-screen message indicating what happened and why.

3.) Perhaps search modification should happen after the search has been executed so we can account for how many records matched the original search term.

Whew, this is getting long-winded. In any case, I agree with Filipe that we should use the authority data to its best advantage. I just think we need to be careful that we account for the unique strengths and weaknesses of VuFind's style of search. We can't simply behave like an old OPAC, because the data and the interface work differently.

Comment by Filipe M S Bento [ 18/May/12 ]
Demian,

> 3.) Librarians understand and care about this distinction. I'm a librarian -- I care too. However, end users generally do not

That was why I was proposing that... as long as they get the relevant records, users don't care if they inserted a preferred or a non-preferred term / an authorized or an unauthorized form of a heading... blame Google and the like for that... who cares about typing a term correctly? Google will suggest the correct spelling and even show the results for this correctly spelled term… hey: that is exactly what I am suggesting in this ticket!! :) ... as long as...

> 2.) If the search is modified, there should probably be an on-screen message indicating what happened and why.

Bingo! ... EDIT: Google again! :)

Ok, but as it is now, with the mentioned correction of not displaying "see also" in the cases of 4xx terms (use instead), and with Ronan's solution being a good one (a contingency one), we are good and ready to go! I mean, VuFind, being an NGC solution, should port to its core the same advanced features we find in OGC… :)

Btw,

> 1.)
Since we're doing a keyword search against the authority records, just because we find a "preferred term" in the results matching the user's keywords, that doesn't necessarily mean that it applies to the user's intended search. We're not in left-anchored heading search anymore, and our users generally aren't thinking that way, so our strategies have to change.

Sorry, could you please elaborate on this... my bad; if I understood correctly, that is what I am proposing, not the opposite. I think we should "skype"... :) perhaps in other words I will understand what you are saying... :)

Filipe

PS: but wait.... You know what would be really, really nice, as we have our Library records indexed with subjects in Portuguese? To have this feature enabled for the term in another language: e.g., AUT record:

Main heading: Contaminação da água
Term in English: Water - Pollution
Term in English: Water pollution

When a user searches for "Water pollution", the system should retrieve our Libraries' Catalog records with "Contaminação da água" in their subject list (and yes, with a warning message too, at the top).... and again, vice-versa. As long as the AUT DB is updated (or external ones are used, like EUROVOC, http://eurovoc.europa.eu/), this is an entirely new world for searches > search for a term (subject) in your native language and it would retrieve records indexed with the corresponding term in any language present in the AUT DB records' fields. Ah... ah... another one to think about (and I guess plenty of discussion meanwhile... :) )

PPS: we should mark these discussions as CONFIDENTIAL... :) Ok, talking seriously, for sure there are ILSs out there that do this (I'm pretty sure ALEPH does... perhaps we just don't have it configured to do it).

Comment by Filipe M S Bento [ 24/May/12 ]
Know what?
I've test-run a solution for this, a very pragmatic one:

1) Using the base instructions here: http://vufind.org/wiki/stop_words_and_synonyms#synonyms

2) Data here: http://eurovoc.europa.eu/drupal/?q=pt/download/list_pt&cl=en (using EUROVOC as a test bed)
EDIT: if you don't feel comfortable enough (yet) with EUROVOC's Portuguese interface, you can use the EN one: http://eurovoc.europa.eu/drupal/?q=download/list_pt&cl=en (noticed? Drupal; OSS is taking the world... for free!)
(download according to your main language and, let's say... off the top of my head... I don't know which to choose, really, ok, I think I'll go for... a closed-eyes choice… English! :) or even more langs, if you have enough records to return something in that other language);

2.1) column F: =B1 & "," & C1 (fill down for the remaining lines / expand the formula if you have more langs besides yours and the other one I've randomly chosen… I think it was English… :) ).

3) Copy-paste column F to ./solr/biblio/conf/synonyms.txt (append);

3.1) Pay special attention to converting the file to UTF-8 encoding if you have special chars in your main language --- else, SOLR won't start when you…

4) Restart VuFind;

5) Test with some of those words or expressions, in any field, subject, etc. (well, those two... all the other fields are the same whatever the lang);

5.1) Your search should retrieve not only records indexed with the term you have inserted, but also records with that term's translation in the other language;

6) If it is ok, good! If not, go to sleep (after all, it's 4am+) and with a fresh start tomorrow (yours) it will work ok (happened to me twice, but instead of 4am it was... 7am!);

7) This is not, and I mean really, really, really not, a solution at all; it is just for you to get an idea of how a multi-language thesaurus (much faster --- reading from a SOLR index, not a txt file) may bring the discovery experience to a brand new level, one that even the most expensive solutions do not "offer", if I'm not mistaken (do it on a dev/ server...
don't know the load it will put on a prod/ server!).

All the best,
Filipe

PS: Demian: try it in mine with "poluição do ar", for instance (= air pollution).

PPS: Demian, I've warned that it is the ugliest solution... it is just to give an idea of a new degree in discovery... no more language barriers in finding the right terms to search for; users will still be able to access just the resources in the language they feel comfortable with, by facet filtering.

PPPS: Demian (sorry, again): it's your fault that I am reporting this now... I was analyzing http://vufind.org/wiki/developers_call:minutes20120529 and had to relate this experiment, done a couple of hours (4h) ago... :)

Comment by Demian Katz [ 25/May/12 ]
Filipe,

As you say, this is definitely a straightforward, pragmatic solution. The only limitation is that you can't turn off the synonyms, and I suspect that in some (many?) cases they might cause the result list to include undesirable or confusing results. I think the advantage to building a recommendation module is that it could display at the top of the results: "You can broaden your search by including synonyms. Click here to try it!" ...and then add some parameter to the URL to activate synonyms. Without doing some research, I'm not sure if there's an easy way to toggle Solr's use of synonyms.txt -- I suspect it would require creating two different field types (one with synonyms, one without) and two copies of every field. I wouldn't call that an easy way. But another option is to simply let the recommendation module pre-process the search query, as previously discussed -- not as elegant or fast as letting Solr do the work for you, but possibly a less complicated solution if you want toggling.

- Demian

Comment by Filipe M S Bento [ 25/May/12 ]
Demian, hi again!
The ideal would be to use the associated entry in the AUT(horities) DB for the term in other languages (I am not finding it in MARC21, but in UNIMARC it is block 6xx, if I'm not mistaken), provided the AUT DB is well maintained for the present terminology. OR have a text / DB (even MySQL) version of this http://www.amazon.com/EnglishPortuguese-Portuguese-English-Dictionary-Technical-Scientific/dp/9722214926 loaded, and an OR behind the scenes (switchable at the search form).

Demian, I also have some philosophies about certain stuff... :) one is that in a Discovery Solution any extra click represents thousands of lost clients... give them everything and, if they want, enable them to narrow the search (facets). Most of the time, less is more, but in Discovery this might result in leaving some precious resources hidden (as they were before), which is a step back... unless we are going in the wrong direction (then it would be good to take a step back :) ).

Filipe

PS: Demian, so sorry, but I am not seeing any situation at all in which the solution I've sent (proof of concept) could produce the effect of "include(ing) undesirable or confusing results"; as we (and I mean you too :) ) in high-level support always ask for: examples, please :)

Comment by Demian Katz [ 25/May/12 ]
The "undesirable or confusing results" comment really depends on the nature of the synonyms in your system. For example, suppose we set up this synonym:

lift = elevator

Now someone does a search for lift, intending the concept in physics. Their search results are now going to be polluted with results about mechanisms for moving people from floor to floor within buildings.

I understand that within the context of exactly matching terms from a controlled vocabulary, you are protected against this kind of thing. However, we are working in a keyword searching environment here, so these situations are more dangerous.

I agree that in a discovery environment, it is good to provide the user with a lot of results and let them filter down.
However, it is important that the top results returned at least make some degree of sense. Synonyms have the potential to give high relevance boosting to things that don't have an obvious relationship to the user's search query. In my lift example, if the user gets a result set where some things are about lift (the object) and others are about lift (the concept), at least it is obvious that the word "lift" is present in all of these things. If they do the same search and their top hit is the "Elevator Repair Manual" and the word lift is nowhere to be found, they may or may not figure out what has happened. If the results appear to be completely illogical, they are not going to stick around and refine them -- they are going to start over and try a different strategy. I feel that an "opt-in" strategy is safer when you risk dramatically changing the user's query.

I admit that this is all speculation -- the real risk depends entirely upon the data set used... and of course it's very easy to implement this so that it can be configured to be either automatic or opt-in based on local preferences, so it's not like we need to come up with a single definitive answer that works for everyone. I just feel like precision still counts, even if recall is more popular in the age of discovery.

Comment by Filipe M S Bento [ 25/May/12 ]
Ok, you got me with that one:

(to) lift (verb) = elevator (object)

But I wonder, like you said, whether under a controlled vocabulary there is a chance of that happening more than a mere couple of times in hundreds... and I guarantee you that when using an AUT DB it won't happen, because it only retrieves the related records in context; that is something that Authorities have the onus of giving. Hey, but this is me saying...
Have a nice holiday weekend and see you Tuesday at the “Developers’” Call;

- Filipe

PS: I guess this is a never-ending discussion (good or bad, I just don’t know anymore :) ) --- But I'm sure something positive for VuFind will come out of it, whichever way it goes (or ways, as you said: let the final "customer" decide; that is better than not having a solution, or not having at least analyzed a possible one, even if the decision is to keep things as they are)!

PPS: Yes, it is very annoying to have to explain why /Author/Home?author=Austen%2C%20Jane was retrieving info from Wikipedia about a completely different author (now it's ok: http://193.137.169.90/Author/Home?author=Austen%2C%20Jane%2C%201775-1817 >> even without the dates) and to have no explanation for it, just: sorry, that is the info the system is seeing… I’m with you: accuracy above all (no excuses).

Generated at Tue Feb 09 09:39:58 EST 2016 using JIRA 6.2.6#6264sha1:ee7642271310c09537d01e5848a003c4498a0eed.