Conference September 2013
Text analysis software needs more common sense and less intelligence!
John S. Lemon, University of Aberdeen

Open Day 2013
IT Services
John S. Lemon
Student Liaison Officer

Introduction
• History – setting the scene
• Problem – move from quantitative to qualitative
• How – Analysis / reporting
• Quantity – increases each year
• Constraints
  – Reports required earlier each year
  – Very limited budget

Disclaimer
• I am not a statistician – I just have to present reports
• When I started at university in 1975 almost all data was numeric / quantitative
• For the purposes of this paper I emulated a naive user
• To carry out the analysis there is no budget for:
  – Software
  – Training

History
• IT Services (formerly DISS & DIT) runs an annual survey of:
  – Staff
  – Students
• Purpose is to identify satisfaction with facilities and service
• Originally on paper and scanned – almost entirely tick boxes
• Moved to web but retained ‘tick box’ format

History
• Converted to WebHost around 2008/9
• Still retained the mainly quantitative original

History
• SNAP had been used to create Student Course Evaluation Forms (SCEF)
• On paper since 1999 – two sides of Likert scales
• Only one free text box
• 60,000 forms scanned / year
• In 2010 deemed to be not ‘green’ / ecological
• Move to special web based software
• Move to free text comments

History
• This is the 2007 paper form
• As SCEF forms had changed approach it was decided the annual survey would do the same
• Fewer tick boxes

History
• From 2011 some check boxes but more free text options.

Problem - quantitative to qualitative
• Report generation could no longer rely on
  – charts
  – tables.
• No thought given to how to cope with free text
• First year one person (me)
  – ‘skimmed’ the responses
  – Subdivided according to which area of service was commented on
  – Passed to section heads for action and responses

Problem - quantitative to qualitative
• Second year – manual coding
• Excel file of case number and free text comments
• Plus extra columns for coding comments / categorisation
• Code values were “Positive”, “Negative” or “Ambiguous”
• Limited number of categories
• Needed consistency so one person coded all

Problem - quantitative to qualitative
• Once coded loaded into SPSS
• Merged with original file
• Produced tables and charts combining demographic data and coded values
• Extremely labour intensive
• Needed an iterative approach for accuracy
  – Categories were too broad or too detailed
  – Codes were too restrictive

Problem - quantitative to qualitative
• This year attempted a new approach
• Use software
• New / updated versions of:
  – SNAP (11)
  – NVivo (10)
  – STAFS - SPSS Text Analysis for Surveys (4)
• Also consider use of concordance software

Problem - quantitative to qualitative
• Why choose these four products?
  – SNAP
    • Already had so no extra cost
    • Had SNAP format files so no translating / transforming the data
  – NVivo
    • Like SNAP already had on site
    • Claims that it would meet all requirements
    • Takes data from many sources

Problem - quantitative to qualitative
• Why choose these four products?
  – SPSS Text Analysis for Surveys
    • Reads SPSS files which SNAP would create
    • Export coded categories back to SPSS
    • Being considered for site licence
  – Concordance
    • Language / literature department recommendation
    • Cheap
    • Appeared easy to use.
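The second-year workflow described above (a coding sheet keyed by case number, merged with the demographic data and then tabulated) can be sketched in a few lines of Python. This is only an illustration of the merge-and-count step, not the SPSS procedure actually used: the case numbers, demographic groups, category names and the function name `cross_tabulate` are all hypothetical. Only the three code values come from the slides.

```python
from collections import Counter

# Hypothetical demographic records keyed by case number
# (a stand-in for the original survey file held in SPSS).
demographics = {
    101: "Undergraduate",
    102: "Postgraduate",
    103: "Undergraduate",
    104: "Staff",
}

# Hypothetical manually coded comments: (case number, category, code).
coded_comments = [
    (101, "Wireless", "Negative"),
    (102, "Library", "Positive"),
    (103, "Wireless", "Negative"),
    (104, "Wireless", "Ambiguous"),
    (103, "Library", "Positive"),
]

# The three code values named on the slide.
ALLOWED_CODES = {"Positive", "Negative", "Ambiguous"}


def cross_tabulate(coded, demo):
    """Count comments per (group, category, code), merging on case number."""
    table = Counter()
    for case, category, code in coded:
        if code not in ALLOWED_CODES:
            # Catch inconsistent coding early rather than in the final report.
            raise ValueError(f"Unexpected code {code!r} for case {case}")
        group = demo.get(case, "Unknown")
        table[(group, category, code)] += 1
    return table


table = cross_tabulate(coded_comments, demographics)
for (group, category, code), n in sorted(table.items()):
    print(f"{group:15} {category:10} {code:10} {n}")
```

Rejecting any code outside the agreed three is one cheap way to enforce the consistency that otherwise depended on a single person doing all the coding.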

SNAP
• Survey had been done in SNAP so tried first
• New features are:
  – word ‘cloud’
  – Auto coding of text / words
• Can combine all the free text questions into one new ‘derived’ / auto-recoded variable

SNAP
• Not very helpful
• Is there a difference between ‘computer’ and ‘computers’?
[Word cloud of free text comments. Words shown: MyAberdeen, university, library, computers, time, students, internet, help, work, staff, access, computer, good, Campus, eduroam, service, slow, Services, helpful, problems]

SNAP
• Not only ‘computer(s)’ presented problems
• But all the different terms students use for the wireless network.
• These are the more obvious spellings – ignoring the misspellings.
• Not ideal as did not allow for synonyms

SNAP - limitations
• Has a ‘Stop’ list – words to exclude
• No equivalent list to create synonyms
• Would like to be able to do: {wifi,wi-fi,eduroam,resnet,wireless}={wireless}
• Not just a limitation of SNAP word cloud
• In the time available could not find how to export auto-coded variables to SPSS

Concordance
• Cheaper but very limited
• No ability to easily export the results
• Positive point is it shows need for synonyms !!

NVivo
• Very powerful
• Accepts data from a wide variety of sources:
  – Text
  – Video
  – Pictures
  – Web
  – Social media
  – Etc.

NVivo
• Data needed some pre-preparation before input
• Some of the concepts weren’t obvious
• Took a number of attempts to get the data into the correct format
• It will combine terms
  – But may not be exactly what you want
  – Some of the words for ‘connect’ are quite imaginative to say the least.

NVivo

NVivo
• Depending on how ‘tight’ or ‘loose’ the word associations were made could end up with entirely different results / word clouds

NVivo
• Found difficulty in:
  – Trying to get the data categorised
  – Exporting the results to merge back to SPSS
  – Alternatively try and produce tables and charts linked to demographic data within NVivo
• Problems with all the different software were:
  – Time to learn all idiosyncrasies
  – Impatient line managers
  – Nomenclature

STAFS
• Appears to be very powerful and comprehensive
• Very large manual
• Like NVivo has different nomenclature for the aspects of analysis
• Will read data from SPSS files
  – Providing the text fields are less than 4000 characters in length
• Looked the most promising to solve the problem

STAFS
• Foolishly left it until last for evaluation
• Very little time left to get to grips with yet another set of concepts
• The deadline for the report was approaching so not a lot of time
• Also trial version which lasted 14 days
• Appears to have a bit more intelligence in matching words together

STAFS

STAFS
• Has the ability to indicate “good” and “bad” phrases in green and red
• It also highlights the context in amber

STAFS
• Problem is that the file that ‘drives’ this appears to be rather general in approach
• To really be useful in future it needs tailoring
• Ran out of time to really develop expertise in this
• Potential to apply a level of ‘common sense’
• Not easy to actually do in the time available.
• Export back to merge with SPSS appeared OK
• But had to abandon any further experiments

What was used finally
• Time for testing / experimentation had run out
• Only one course of action
  – By hand
  – One person – me
• Scale of problem
  – When loaded into Word as single spaced, normal margins, 12 pt Calibri
  – Just under 500 pages
• A ream of paper

Next year
• Try and get a longer trial period for STAFS
• Experiment with this year’s data to provide coding file
• Use STAFS from the start

Conclusion
• Don’t try and learn a lot of new software when there are deadlines from “management”
• Word clouds don’t help much
• A concordance really only highlights speeling idiosyncrasies
• Care must be taken when allowing software to make choices in coding

Conclusion
• Does text analysis software have intelligence?
• Up to a point
• Does it have common sense?
• Of the four tried only one does BUT
• It needs teaching “common sense” and that takes time
• Just like a child !!
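
The synonym list wished for under ‘SNAP - limitations’, {wifi,wi-fi,eduroam,resnet,wireless}={wireless}, is simple to express outside any of the four packages. The Python sketch below shows one way it might work alongside a ‘Stop’ list before words are counted for a word cloud. It is an illustration only: the stop words, the sample comments and the function name `count_terms` are made up, and the ‘computers’ entry is added from the word cloud example rather than taken from any product’s dictionary.

```python
import re
from collections import Counter

# Synonym map: every variant on the left is counted as the term on the right.
# The wireless group is the one from the slide; "computers" is an added example.
SYNONYMS = {
    "wifi": "wireless",
    "wi-fi": "wireless",
    "eduroam": "wireless",
    "resnet": "wireless",
    "wireless": "wireless",
    "computers": "computer",
}

# Illustrative 'Stop' list (words to exclude); not SNAP's own list.
STOP_WORDS = {"the", "is", "in", "on", "a", "and", "to", "my", "are"}


def count_terms(comments):
    """Count words across comments after stop-word removal and synonym mapping."""
    counts = Counter()
    for comment in comments:
        # Keep hyphens inside words so "wi-fi" survives as a single token.
        for word in re.findall(r"[a-z]+(?:-[a-z]+)*", comment.lower()):
            if word in STOP_WORDS:
                continue
            counts[SYNONYMS.get(word, word)] += 1
    return counts


# Made-up comments in the style of the survey responses.
comments = [
    "The wifi is slow in the library",
    "Eduroam drops on my computer",
    "Wi-Fi and ResNet are slow",
    "Library computers are slow",
]
counts = count_terms(comments)
print(counts.most_common(3))
```

The map still has to be built and maintained by a person who knows that ‘eduroam’ and ‘resnet’ mean the wireless network, which is the “common sense” the software has to be taught.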