Automated Question Answering

Motivation: support for students
• Demand is for 365 x 24 support
  – Students set aside time to complete task
  – If a problem is encountered, immediate help is required
• Majority of responses direct students to teaching materials; so not a case of “not there”
• Poor forum search
  – Search per forum - not course
  – Free-text search options fixed by RDBMS
    • No explicit operators (AND, OR, NEAR)

Research questions
• Given the current level of development of natural language processing (NLP) tools, is it possible to:
  – Classify messages as question/non-question
  – Identify the topic of the question
  – Direct users to specific course resources

Natural Language Processing tools
• Tokenisation (words, numbers, punctuation, whitespace)
• Sentence detection
• Part of speech tagging (verbs, nouns, pronouns, etc.)
• Named entity recognition (names, locations, events, organisations)
• Chunking/Parsing (noun/verb phrases and relationships)
• Statistical modelling tools
• Dictionaries, word-lists, WordNet, VerbNet
• Corpora tools (Lucene, Lemur)

Question answering solutions
• Open domain
  – No restrictions on question topic
  – Typically answers from web resources
  – Extensive literature
• Closed domain
  – Restricted question topics
  – Typically answers from small corpus
    • Company documents
    • Structured data

Open domain QA research
• Well established over two decades
• TREC (Text REtrieval Conference)
  – funded by NIST/DARPA since 1992
  – QA track 1999 – 2007, directed at ‘Factoids’
• CLEF (Cross Language Evaluation Forum)
  – 2001 – current
  – Information Retrieval, language resources
• NTCIR (NII Test Collection for IR Systems)
  – 1997 – current
  – IR, question answering, summarization, extraction

TREC Factoids
• Given a fact-based question:
  – How many calories in a Big Mac?
  – Who was the 16th President of the United States?
  – Where is the Taj Mahal?
• Return an exact answer in 50/250 bytes
  – 540 calories
  – Abraham Lincoln
  – Agra, India

Minimal factoid process
• Question analysis
  – Normalisation (verbs, auxiliaries, modifiers)
  – Identify entities (people, locations, events)
  – Pattern detection (who was X?, how high is Y?)
• Query creation, expansion, and execution
  – Ordered terms, combined terms, weighted terms
• Answer analysis
  – Match answer type to question type

OpenEphyra: open source QA
Source: http://www.cs.cmu.edu/~nico/ephyra/doc/images/overall_architecture.jpg

OpenEphyra: question analysis
Question: ‘who was the fourth president of the USA’
Normalization: ‘who be fourth president of USA’
Answer type: NEproperName->NEperson
Interpretation:
  property: NAME
  target: fourth president
  context: USA

OpenEphyra: query expansion
1. "fourth president USA"
2. (fourth OR 4th OR quaternary) president (USA OR US OR U.S.A. OR U.S. OR "United States" OR "United States of America" OR "the States" OR America)
3. "fourth president" "USA" fourth president USA
4. "was fourth president of USA"
5. "fourth president of USA was"

OpenEphyra: result
answer: James Madison
score: 0.7561732
docid: http://www.squidoo.com/james-madison-presidentusa
Document content:
<meta property="og:title" content="James Madison - 4th President of USA"/>
<h1>James Madison - 4th President of USA</h1>
<div class="module_intro">James Madison (March 16, 1751 - June 28, 1836) was fourth President of the United States (1809-1817), and one of the Founding Fathers of the United States...

Shallow answer selection
• Answer based on reformulation of question
  – Who was the fourth president of the <location>United States</location>?
  – <person>James Madison</person> was the fourth president of the <location>United States</location>

Students don’t ask questions and we don’t provide answers!
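The query expansion step shown for OpenEphyra can be approximated with a short sketch. This is not OpenEphyra's implementation: the synonym table `SYNONYMS` and the functions `quote` and `expand_query` are hypothetical names introduced here, the synonym lists are abbreviated, and only the exact-phrase, synonym-expanded and bag-of-words forms from the slide are reproduced.

```python
from typing import Dict, List

# Hypothetical, abbreviated synonym table (an assumption for illustration;
# the real system derives its variants from its own resources).
SYNONYMS: Dict[str, List[str]] = {
    "fourth": ["4th", "quaternary"],
    "USA": ["US", "U.S.A.", "United States"],
}


def quote(term: str) -> str:
    """Quote multi-word terms so a search engine treats them as a phrase."""
    return f'"{term}"' if " " in term else term


def expand_query(keywords: List[str]) -> List[str]:
    """Return an exact-phrase query, a synonym-expanded query and a bag-of-words query."""
    phrase = '"' + " ".join(keywords) + '"'
    groups = []
    for word in keywords:
        alternatives = [word] + SYNONYMS.get(word, [])
        if len(alternatives) > 1:
            # Alternatives for one keyword are OR-ed together inside brackets.
            groups.append("(" + " OR ".join(quote(a) for a in alternatives) + ")")
        else:
            groups.append(word)
    return [phrase, " ".join(groups), " ".join(keywords)]


queries = expand_query(["fourth", "president", "USA"])
for query in queries:
    print(query)
```

Each query trades precision for recall: the quoted phrase is the strictest form, while the OR-groups allow documents that say "4th President" or "United States" to be retrieved.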
Importance of named entities
Extracted NEs link question and answer
[Diagram labels: Question processed for NEs; Search engine; Search results tagged with NEs; Answer matching]

PREPARATORY TASKS

Task list: the real work
• Create database of forum messages
• Adapt open source NLP tools
  – Tokenisation, sentence detection, Parts Of Speech, parsing
• Establish question patterns
• Create language analysis tools
  – Word frequency
  – Named-entities: define, build, and train models
• Prepare corpus
  – Format and tag documents (doc, html, pdf)
  – Build Indri catalogue and search interface
Iterative process: build, test, refine

NLP tools
• Predominantly Java
  – Stanford, OpenNLP, Lingpipe
  – GATE: complete analysis + processing system
  – IKVM permits use with .NET framework
• Some C++, C#
  – WordNet, Lemur/Indri, Nooj, SharpNLP
• Python NLTK
  – Complete NLP toolset and corpus
• Lisp, Prolog

Message database
• MySQL database for FirstClass messages
• Extract:
  – Forum, Subject, Date, Author
  – Body
• Use subject to classify as Original or Reply
No clean-up or filtering of message content undertaken at this stage

Raw forum message (Sample 1)
<?xml version="1.0"?>
<firstclass>
<FCFORMSHEADER>
<fcobject objtype="oConfItem" formid="141" objname="Daniel Hughes 5">
<field id="3" index="0" type="number">-959014497</field>
<subject index="0" >Help Please!!!? Urgent</subject>
<tonames index="0" >T320 09B Eclipse Support</tonames>
</fcobject>
</FCFORMSHEADER>
<body>
I am trying to open an existing project but can't do it. It's driving me mad. I know the project folders are located in the workspaceblock4 folder. I have deleted all the open projects in the project explorer window (without deleting content). BUT how on earth do I know proceed to reload some of the projects without starting from scratch? When I select open file ... it doesn't let me open any projects files - only the individual files in the project folder. In other words I cannot get any project files to appear in the project explorer window.
Please can anyone help me as I have booked a lot of time off work to concentrate on the project, but I am a dead end.&#13;
</body>
</firstclass>

Raw forum message (Sample 2)
<?xml version="1.0"?>
<firstclass>
<FCFORMSHEADER>
<fcobject objtype="oConfItem" formid="141" objname="Simon Shadbolt">
<field id="3" index="0" type="number">-962619805</field>
<subject index="0" >Block 4 Practical booklet 6 activity 4- Unable to get a fault!</subject>
<tonames index="0" >T320 09B Eclipse Support</tonames>
</fcobject>
</FCFORMSHEADER>
<body>
I have followed the set up and altered the fault to &quot;none&quot; and simulation to normal, but I do not get any faults at all or a listing that resembles the list on page 12, particularly line 12. I have attached my bpel file and my screenshot, any help appreciated.&#13;
Simon&#13;
&#13;
Process bpelEcho3pScope: Instance 1 created.&#13;
Process bpelEcho3pScope: Executing [/process]&#13;
Process Suspended [/process]&#13;
Receive ClientRequestMessage: Executing [/process/flow/receive[@name='ClientRequestMessage']]&#13;
.
Scope : Completed normally [/process/flow/scope]&#13;
Reply ClientResponseMessage: Executing [/process/flow/reply[@name='ClientResponseMessage']]&#13;
Reply ClientResponseMessage: Completed normally [/process/flow/reply[@name='ClientResponseMessage']]&#13;
Process bpelEcho3pScope: Completed normally [/process]&#13;
</body>
</firstclass>
Eclipse console listing or XML

T320 09B database properties
• Total messages: 4246
• Non-replies: 1051
• Manually tagged questions: 777
• Average length (lines): 7.9
• Containing XML: 17
• Containing Eclipse content: 37

Creating question patterns
• Extract text from forum messages (non-replies)
• Create n-grams (‘n’ adjacent words)
• Perform frequency analysis of n-grams
• Manually review n-grams to create question patterns

N-gram results
Number of words    Unique patterns
6                  96900
5                  96780
4                  94975
3                  86338

5-word frequency analysis
Frequency    N-gram
17           An unexpected error has occurred.
16           point me in the right
14           I get the following error
13           me in the right direction
12           unexpected error has occurred. UDDIException
9            does not seem to be
8            get the following error message
8            I get an error message
8            system cannot find the path
7            Any help would be appreciated.
7            I am not sure if
7            I can not seem to
7            I do not know what
6            A problem occured while running
6            but I get the following
6            cannot find the path specified
6            error has occurred. UDDIException java.
6            has occurred. UDDIException java. net.
6            I am not sure how
6            I do not seem to
Top 20 results

Sliding window across message
Frequency    N-gram 1                                N-gram 2
1            am not that knowledgable Help           I am not that knowledgable
1            am not the early adopter                I am not the early
1            am not thinking straight today          I am not thinking straight
1            am not too far off                      I am not too far
1            am not too sure if                      I am not too sure
1            am not using the fault                  I am not using the
1            am noticing in the console              I am noticing in the
1            am now a while later                    I am now a while
1            am now adding my exception              I am now adding my
1            am now getting the following            I am now getting the
1            am now held up again                    I am now held up
1            am now not sure if                      I am now not sure
1            am now stuck on activity                I am now stuck on
1            am now trying not to                    I am now trying not
1            am now trying to start                  I am now trying to
1            am now willing to submit                I am now willing to
1            am obviously missing something here

Candidate question patterns
Class name     Pattern
#question      (a|my) question (about|on|for|is)
#appreciate    appreciate (.*) (advice|comment|guidance|help|direction)
#can/could     (can|could|will|would) (any|some)\s?(body|one) (.*) (explain|tell me)
#does          does (any|some)\s?(body|one) (have|know)
#having        (have|having) (.*) (problem|nightmare)s?
#how           how (best|can|does|do i|do you|do we)
#i am          i am not (really )?sure (if|how|what|when|whether|why)
#i cannot      i (can not|cannot|could not) find (.*) answer (.*) question
#just          just wonder(ed|ing)? (if|what)
#point me      point (me|one) (.*) right direction

Generalisation of patterns using POS
Question part              POS tag
any|some                   DT
advice|comment|guidance    NN
appreciated|welcomed       VB(N|D)
.                          ./.

Can/MD anyone/NN offer/VB some/DT help/NN ?/.
Can/MD someone/NN offer/VB some/DT help/NN ?/.
Can/MD anybody/RB give/VB some/DT guidance/NN ?/.
Could/MD somebody/RB give/VB some/DT direction/NN ?/.
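The four tagged examples above show why generalising by POS tag is fragile: a pattern written against the tags of the first sentence stops matching once "anybody" or "somebody" is tagged RB rather than NN. A minimal sketch, assuming the word/TAG notation used on the slide; `PATTERN`, `parse_tagged` and `matches_tags` are hypothetical names, and the tag pattern is written here for illustration rather than taken from the deck.

```python
import re
from typing import List, Tuple


def parse_tagged(sentence: str) -> List[Tuple[str, ...]]:
    """Split 'word/TAG' tokens; rsplit keeps any '/' that is part of the word."""
    return [tuple(token.rsplit("/", 1)) for token in sentence.split()]


def matches_tags(sentence: str, tag_pattern: str) -> bool:
    """True when the sentence's tag sequence matches the pattern exactly."""
    tags = " ".join(tag for _, tag in parse_tagged(sentence))
    return re.fullmatch(tag_pattern, tags) is not None


# Hypothetical tag pattern derived from the first example:
# modal, noun, verb, determiner, noun, sentence-final punctuation.
PATTERN = r"MD NN VB DT NN \."

examples = [
    "Can/MD anyone/NN offer/VB some/DT help/NN ?/.",
    "Can/MD someone/NN offer/VB some/DT help/NN ?/.",
    "Can/MD anybody/RB give/VB some/DT guidance/NN ?/.",
    "Could/MD somebody/RB give/VB some/DT direction/NN ?/.",
]
results = [matches_tags(sentence, PATTERN) for sentence in examples]
print(results)
```

Only the first two sentences match; the last two have the same structure but a different tag in the second position, so the pattern misses them.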
POS pattern matching failed due to errors in assigning tags

Final question patterns: RegExs
Pattern ID    Weighting    Regular Expression
1             0            (?<a>(a|my)\squestion\s)(?<b>about|on|for|is)
66            0            (?<a>(i\sam|i'm|im)?\shav(e|ing)\s(difficult(y|ie)|issue|problem)(s)?)
67            0            (?<a>i\s(am|have|was))\b(?<b>.*)\b(?<c>wonder(ed|ing)?\s(if|what|whether)?)
69            0            (?<a>i\sam\s(confused|assuming|unable\sto\scontinue))
70            0            (?<a>i\sam\s(still|getting))\b(?<b>.*)\b(?<c>confused)
71            0            (?<a>i\sam\snot\s(really\s)?sure)\s(?<b>if|how|what|when|whether|why)
72            0            (?<a>i\sam\snot\s(really\s)?sure)\s(?<b>what(\sit\swants\sfrom\sme|\sthey\sare\safter))
73            0            (?<a>(i|i\sam)\s(not\sat\sall\ssure))
88            0            (?<a>i\shave\s(encountered|found|got))\b(?<b>.*)\b(?<c>issue|problem)
139           0            (?<a>what\s(have\si|i\shave))\b(?<b>.*)\b(?<c>wrong)
164*          100          (?<a>problem\s)(?<b>.*)\b(?<c>WSDL\sconformance\scheck)
* Pattern derived from Eclipse error message
169 patterns using ‘explicit capture’

CHALLENGES PROCESSING MESSAGES

Poor message style

Incorrect POS tagging due to spelling errors
when/WRB I/PRP tried/VBD to/TO generate/VB the/DT sample/NN ,/, it/PRP said/VBD the/DT data/NNS is/VBZ available/JJ ./.

XML within messages
Detected as single sentence

Eclipse console listing within message
Line breaks not recognised as end of sentence

Open-source NLP problems
• Sentence detection failures:
  – Bad style (capitalisation, punctuation)
  – Ellipsis (i tried... it failed... error message...)
  – XML, BPEL segments concatenated to single sentence
• Tokenisation failures:
  – Multiple punctuation ???, !!! (student emphasis)
  – Abbreviations (im, cant, doesnt, etc.)
• POS errors
  – Spelling, grammar

Purpose built tools
• Tokeniser
  – Re-coded for typical forum content/style
    • Multiple punctuation
    • Abbreviations
    • Common contractions
• Sentence detector
  – New detector based on token sequences
• Pre-filter messages
  – Remove XML, console listing, error messages

Message pre-filters
• Short-forms
  – i’m, im, i m → i am
  – can’t, cant, can t → can not
• Line numbers
• Repeated punctuation (!!!, ???, ...)
• Smilies
• Salutations (Hi all, Hiya, etc.)
• Names, signature, course codes

Filtered message
Raw message containing Eclipse console listing
Filtered message ready to process

PRELIMINARY RESULTS: question classification

Message-set properties
• Number of messages: 1051 (100%)
• Number of questions (M): 777 (73.9%) (100%)
• Number of questions (A): 756 (97.3%)
• False Positives (A not M): 58 (7.4%)
• False Negatives (M not A): 79 (10.2%)
Approx 90% success rate
M = manually annotated question, A = automatically annotated question

Message-set properties – cont.
• Average # pattern matches: 2.7606
• Min # pattern matches: 1
• Max # pattern matches: 12
• Average # of lines (ASCII linefeed): 7.9
• Min # Lines in a message: 1
• Max # Lines in a message: 68
• Average # of sentences: 5.0
• Min # Sentences in a message: 1
• Max # Sentences in a message: 89
• Messages containing XML: 17
• Messages containing BPEL: 37

Distribution of pattern match count
[Figure: bar chart; x-axis: number of pattern matches (0 to 12); y-axis: number of messages]

Challenges: false positives

Challenges: false negatives

Challenges: detecting the question

Messages matching question pattern
[Figure: bar chart; x-axis: pattern ID (1 to 169); y-axis: number of messages; callouts for pattern IDs 10, 50, 68 and 31]

Common question patterns (10)
• any
  – (advice|clarification|clue|comment|further thought|guidance|help|hint|idea|opinion|pointer|reason|suggestion|taker)(s)?
• .*
• appreciated|welcome|welcomed
216 matches
Terms added over time to improve detection of questions

Sample question match (10)

Common question patterns (50)
• get|getting|gives|got|receive
• .*
• error(s)?
102 matches

Sample question match (50)

Discrimination vs Classification
[Figure: bar chart; x-axis: pattern ID (1 to 169); y-axis: number of messages; series: Multi-Matches, Single Match]
Low discrimination >>> Increases successful classification at the risk of false-positives
High discrimination >>> Reduces successful classification and risk of false-positives

Does process transfer?
• Tested against TT380 forums 04J – 07J
  – Preliminary results look promising
  – Need to manually tag >4000 messages
  – Review message pre-filters
• Need access to Humanities course material

PRELIMINARY RESULTS: question topic identification

Basic method
• Identify named entities
  – NEs are block-specific
  – Majority of questions linked to assignments
• Parse sentence for dependencies
  – Nouns (that are NEs)
  – Verbs

Named entities: inconsistent usage
Message body      Message subject
Error handling    Exception handling

Deep parsing: dependencies
advmod(delete-5, How-1)
aux(delete-5, can-2)
nsubj(delete-5, I-3)
advmod(delete-5, properly-4)
dobj(delete-5, PLTs-6)
conj_and(PLTs-6, PLs-8)
conj_and(PLTs-6, roles-10)
det(project-13, the-12)
prep_from(delete-5, project-13)
prep_in(delete-5, order-15)
aux(have-17, to-16)
xcomp(delete-5, have-17)
det(sheet-20, a-18)
amod(sheet-20, clean-19)
dobj(have-17, sheet-20)
advmod(have-17, again-21)
How can I properly delete PLTs and PLs and roles from the project in order to have a clean sheet again.
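The typed dependencies listed above can be read mechanically to pull out the main verb's objects, which is the information the basic method needs for topic identification. A minimal sketch, assuming the `relation(head-index, dependent-index)` text format shown on the slide; `Dependency`, `parse_dependency` and `objects_of` are hypothetical names, and following `conj_and` links to collect coordinated nouns is a design choice made here, not something the deck specifies.

```python
import re
from typing import List, NamedTuple, Optional


class Dependency(NamedTuple):
    relation: str
    head: str
    dependent: str


# Matches lines such as "dobj(delete-5, PLTs-6)"; the token indices are dropped.
DEP_RE = re.compile(r"(\w+)\((\S+)-\d+,\s*(\S+)-\d+\)")


def parse_dependency(line: str) -> Optional[Dependency]:
    """Parse one dependency line, or return None if it is not in the expected format."""
    match = DEP_RE.fullmatch(line.strip())
    return Dependency(*match.groups()) if match else None


def objects_of(verb: str, deps: List[Dependency]) -> List[str]:
    """Direct objects of the verb plus any nouns coordinated with them."""
    direct = [d.dependent for d in deps if d.relation == "dobj" and d.head == verb]
    joined = [d.dependent for d in deps
              if d.relation.startswith("conj") and d.head in direct]
    return direct + joined


# A subset of the dependencies from the slide.
lines = [
    "aux(delete-5, can-2)",
    "nsubj(delete-5, I-3)",
    "dobj(delete-5, PLTs-6)",
    "conj_and(PLTs-6, PLs-8)",
    "conj_and(PLTs-6, roles-10)",
    "prep_from(delete-5, project-13)",
]
deps = [d for d in map(parse_dependency, lines) if d is not None]
print(objects_of("delete", deps))
```

The resulting nouns could then be checked against the block-specific named entities to decide which course material the question concerns.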
Sentences per message
[Figure: bar chart; x-axis: sentence count (0 to 89); y-axis: number of messages (0 to 200)]
Sentence counts under-estimated due to spelling/grammar errors. Of the 120 questions detected as a single sentence, more than 80% actually contain multiple sentences.

Guess the topic
Excuse me for directing this question at you, but when I try to contact my tutor through my homepage i still go to the details for John Stephenson but I am sure that he is ill at the moment. My question refers to the entities described in ECA part2 page 2, it states that the term identifier must be unique within the UK business domain. I thought Buyers ID and Sellers ID could be their email address, however, I am stuck on the Order ID which might refer to a depatch note as I do not know what standard these identifiers have to conform to in UK business. I would appreciate being directed as to where I can find this information.

Current status
• Unable to establish question topic for 95% of detected questions
• Current NLP techniques (anaphora and co-reference resolution) for multi-sentence questions are not well established

Pattern matching in console listing

Practical work: exact patterns
• Process|Assign|Invoke|Scope|Reply
• .*
• Completed with fault:
• invalidVariables|uninitializedVariable|joinFailure
Provide direct link to FAQ or teaching materials

Future work
• Further work on sentence detection
  – Everything else depends on this
• Create patterns to identify content
  – “how do i (.*)”
  – “are you now saying (.*)”
  – “(.*) word count”
• Establish relationships between initial message and replies
• Build tool to process Eclipse console listings
  – Could address 5% of all ECA related questions
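The exact console patterns listed under "Practical work" could be combined into a single regular expression along the following lines. This is a sketch, not the tool proposed in the future work: `FAULT_RE` and `find_faults` are names introduced here, and the faulting console line in the sample listing is hypothetical (the Sample 2 listing earlier in the deck completes normally and contains no fault).

```python
import re
from typing import List, Tuple

# The four pattern parts from the slide, joined in order.
FAULT_RE = re.compile(
    r"(?P<activity>Process|Assign|Invoke|Scope|Reply)"
    r".*"
    r"Completed with fault:\s*"
    r"(?P<fault>invalidVariables|uninitializedVariable|joinFailure)"
)


def find_faults(console_listing: str) -> List[Tuple[str, str]]:
    """Return (activity, fault) pairs, one per console line that reports a fault."""
    found = []
    for line in console_listing.splitlines():
        match = FAULT_RE.search(line)
        if match:
            found.append((match.group("activity"), match.group("fault")))
    return found


# Hypothetical listing: the second line is invented for illustration.
listing = """Process bpelEcho3pScope: Instance 1 created.
Assign AssignInput: Completed with fault: uninitializedVariable
Scope : Completed normally [/process/flow/scope]"""
print(find_faults(listing))
```

Because the fault names form a small closed set, each one could key a lookup that returns the relevant FAQ entry or teaching material.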