Correct and Consumable Answers to Complex Questions
Beverly Jamison, PhD, Sr. Director IT Architecture
American Psychological Association
April 11, 2013
Copyright © 2013 MarkLogic® Corporation. All rights reserved.

Agenda
• Overview of the Search Product
• How simple features cause complex queries
• Search Architecture I (it seemed like a good idea at the time)
• Search Architecture 2: Making it Correct
• Users 2.0: Making it Consumable
• Looking Ahead: Making it Cool

APA PsycNET Content Types
• PsycINFO Database: Similar to MEDLINE
  • 3.5M records
  • 60M Cited References
• 500,000 Full Text items, including
  • Journal Articles
  • Book Chapters
  • Psychology-based critiques of books and films
  • “Gray Literature”
• 13,000 Psychological Tests and Measures
• 400 Streaming video psychotherapy demonstrations
• Thesaurus of Psychological Index Terms
• PsycNET delivery platform, powered by MarkLogic

PsycNET Search Results: ‘Autism’

Thesaurus Selection

Why MarkLogic Search Makes us Smile
• Takes a layer off the architecture for easier maintenance
• Performance tunes like a dream
• Access to full content, not just an index
• Ability to provide aggregated information from range indexes
• Smooth data delivery since ML is our content repository
• Unification of technologies as we move other searches to MarkLogic
• Allows us more options for how results are consumed:
  • Human Readable: Facets, charts, tables, content snippets
  • Machine to Machine: API, RDF, XML, Feeds

But Complications Can Still Happen
We expect complication in queries like this:
  Keywords: media OR "computer gam*" OR films OR movie* OR internet OR magazine* OR books OR multimedia OR music OR newspapers OR "social network*" OR photograph* OR radio OR "role playing gam*" OR "massively multiplayer" OR televis* OR TV OR websites
  AND NOT Document Type: Dissertation
  AND Year: 2000 TO 2013
But even simple queries can hold surprises:
• The thesaurus “shopping cart” yields queries such as:
  IndexTerms: (Depression OR Abandonment)
• Or relatively innocuous looking fields:
  Keywords = IndexTerms + Keywords + Title
  Anyfield = Searchable portion of the content

So What Could Go Wrong?
Constraint Plus Simple (non-nested) Boolean: IndexTerms: (Depression OR Abandonment)
• IndexTerms is looked up as a constraint
• All the nodes from the parse tree (in this case Depression and Abandonment) are concatenated in document order and the constraint is applied
• So we end up with IndexTerms (Depression AND Abandonment)
MarkLogic Field Plus Range Index:
• The Field mechanism is very helpful when we have one external field name for multiple elements
• The gotcha is that this mechanism is associated with a Word Index and we are increasingly attached to our Range matching

Our “light-MVC” search architecture

The Basic Search Flow
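
The failure mode on the “So What Could Go Wrong?” slide can be illustrated with a small toy model. This is only a sketch in Python, not MarkLogic code: the `Term` and `Or` classes, the `DOCS` corpus, and both matching functions are hypothetical names invented for illustration. It assumes, as the slide states, that applying the constraint to the concatenated terms behaves like an AND, and it contrasts that with evaluating each node of the parse tree separately.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Term:
    """A leaf of the (hypothetical) parse tree: a single search term."""
    text: str


@dataclass(frozen=True)
class Or:
    """An OR node joining two sub-queries."""
    left: "Node"
    right: "Node"


Node = Union[Term, Or]

# Hypothetical corpus: document id -> set of index terms.
DOCS = {
    "doc1": {"Depression"},
    "doc2": {"Abandonment"},
    "doc3": {"Depression", "Abandonment"},
}


def text_nodes(node):
    """Collect term texts in document order, discarding the operators."""
    if isinstance(node, Term):
        return [node.text]
    return text_nodes(node.left) + text_nodes(node.right)


def flattened_match(node, docs):
    """Apply the constraint once to the concatenated terms.

    Every term must match, so the original OR silently becomes an AND.
    """
    terms = text_nodes(node)
    return {d for d, index in docs.items() if all(t in index for t in terms)}


def per_node_match(node, docs):
    """Apply the constraint to each node, preserving the OR operator."""
    if isinstance(node, Term):
        return {d for d, index in docs.items() if node.text in index}
    return per_node_match(node.left, docs) | per_node_match(node.right, docs)


# IndexTerms: (Depression OR Abandonment)
query = Or(Term("Depression"), Term("Abandonment"))
flat = flattened_match(query, DOCS)
per_node = per_node_match(query, DOCS)
print(sorted(flat))      # ['doc3']
print(sorted(per_node))  # ['doc1', 'doc2', 'doc3']
```

In this toy corpus the flattened query returns only the document indexed with both terms, while the per-node evaluation returns all three documents, which is what a user selecting two thesaurus terms with OR would expect.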

The Solution Approach: A Custom Joiner Handler
Ingredients
• Application-defined constraints
• Application-defined field-mappings
• Query parse trees
Default Implementation: MLQP parses, then calls impl:textonly to extract text nodes
Custom Implementation: Call through to cc:apply-constraint for each node of the parse tree

What We are Excited About
• We wanted to keep search:search at the core
• We learned the hard way with Lucene to not mess up our upgrade path
• We wanted to take advantage of all of the new things MarkLogic would do
• We are excited about the increased access to the parse tree from published interfaces
• These are convenient for us and they could be cool for new ways to interact with advanced users

Any Questions?

For More Information
Beverly Jamison
bjamison@apa.org