Vox populi: the public searching of the Web
A longitudinal study of large samples of Excite queries

Dietmar Wolfram (U. of Wisconsin - Milwaukee)
Amanda Spink (Penn State U.)
Major Bernard J. Jansen (U.S. Army)
Tefko Saracevic (Rutgers U.)

© Tefko Saracevic, Rutgers

Excite@Home
• A major Internet media company
• Search capabilities:
  – Up to 10 terms per query; default OR
  – Advanced search:
    • Boolean AND, OR, AND NOT & parentheses
    • "phrase": must appear in answer
    • + or - before term: term must or must not be in answer
  – More Like This: clickable relevance feedback
  – proprietary algorithms & concept linking method, but follow basic information retrieval

Samples
Three samples:
• pilot: 51,000 queries by 18,000 users collected in March 1997 (label: 51K)
• 1 million queries by over 200,000 users collected in September 1997 (1M97)
• 1 million queries by over 200,000 users collected in December 1999 (1M99)

Number of queries per user

Queries     51K     1M97    1M99
Mean        2.8     2.5     2
1 query     67%     48%     78%
2           19%     21%     13%
3           7%      11%     4%

Sessions (as to no. of queries) are SHORT

Terms per query distribution

Terms       51K     1M97    1M99
Mean        2.32    2.4     2.35
1 term      31%     26%     26%
2           31%     31%     26%
3           18%     18%     15%

SHORT QUERIES: Some 60% have 1 or 2 terms

Use of Boolean operators

Sample      51K     1M97    1M99
Of queries  >10%    >5%     20%

Many uses of Boolean operators are wrong: not according to the instructions on how to use them

Number of pages viewed per user

Pages       51K     1M97    1M99
Mean        2.3     1.8     1.7
1 page      58%     29%     43%
2           19%     19%     21%
3           9%      14%     [one value missing]

Most users view VERY FEW pages beyond the first or first two

Distribution of terms
• TOP: a very small number of distinct terms used with very high frequency
• BOTTOM: unusually high number of distinct terms used with low frequency
• Web query vocabulary contains a very large number of distinct terms
  – more than in ordinary English texts
  – has its own & unique characteristics

Term distribution 51K sample
• Top: frequency of 100 or more: 74 terms
  – 0.34% of all unique terms (of 21,862)
  – 18% of all terms in all queries (of 113,793)
• Bottom: frequency of one: 9,790 terms
  – 44.8% of all unique terms
  – 8.6% of all terms in all queries
• In freq. of 100 or more (subject terms only):
  – 63 subject terms: 0.29% of unique terms; 10.3% of all terms

Term distribution 1M97 sample

Top 15 terms (common excluded)

Rank   51K          1M97         1M99
1      sex          sex          sex
2      nude         free         Christmas
3      free         nude         nude
4      pictures     pictures     pictures
5      new          university   new
6      university   pics         pics
7      women        chat         music
8      chat         adult        university
9      gay          women        games
10     girls        new          porn
11     xxx          xxxx         cards
12     music        girls        state
13     software     music        stories
14     pics         porn         women
15     NCAA         gay          xxxx

Top 10 co-occurring terms (only meaningful ones)

Rank   1M97               1M99
1      free - pics        new - york
2      university - of    free - sex
3      new - york         free - pics
4      free - sex         university - of
5      real - estate      pictures - of
6      home - page        greeting - cards
7      free - nude        britney - spears
8      pictures - of      free - nude
9      free - pictures    free - pictures
10     high - school      real - estate

Classification of queries - a sample

Subject category                  1M97             1M99             1M99
                                  (2414 queries)   (2539 queries)   (rank)
1. Entertainment, recreation      16.9%            7.5%             (6)
2. Sex, porn, preferences         16.8%            7.5%             (4)
3. Commerce, travel, economy      13.3%            24.4%            (1)
4. Computers & the Internet       12.5%            10.9%            (3)
5. Health & the sciences          9.5%             7.8%             (5)
6. People, places, things         6.7%             20.3%            (2)
7. Society, culture, religion     5.7%             4.2%             (9)
8. Education & the humanities     5.6%             5.3%             (8)
9. Performing & fine arts         5.4%             1.1%             (11)
10. Government                    3.4%             1.6%             (10)
11. Incomprehensible              4.1%             6.8%             (7)

Major findings (across all three samples)
• Users: not many queries per search
  – 2.4 mean
• Terms: not many per query
  – 2.4 mean
  – queries in traditional IR are 3 to 7 times longer
• Boolean operators not used much
  – used in from 1 in 10 to 1 in 5 queries

Major findings ...
• Users did not view many pages
  – mean 1.9 pages; percentage of views falling
  – 1 in 2 or 1 in 3 of users did not go beyond the first page
• Relevance feedback (More Like This) not used much
  – used in about 1 in 20 queries
• Over time searching did NOT change much
  – use changed mostly in greater use of advanced features

Major findings ...
• Frequency of use of terms is highly skewed
  – highest 1/3 of 1% of terms accounted for 1 in every 10 terms used; terms that were used only once were 1/2 of unique terms
  – Web query language quite unique
• Lot of searching about sex, but queries in category Sex still represent a small proportion of all categories
  – great many other topics searched
  – diversity of subjects searched very high

Conclusions
• Web searching is still IR, but very different IR
  – Web users search in different & simplified ways
• Many Web search features need redesign to accommodate the way users use the Web
• Web is a marvelous new technology
  – but people are unpredictable in use of any new technology
  – how are they really using the Web?

Thank you   Gracias   Danke   Merci   Hvala
… until the next installment ...
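
Appendix sketch: recomputing term-distribution figures

The term-distribution percentages reported for the 51K sample (share of unique terms with frequency of 100 or more, share of terms used only once) could be recomputed from a query log roughly as follows. This is a minimal sketch, not the study's actual procedure: the function name term_distribution, the whitespace tokenization, the lowercasing, and the toy queries are all assumptions made for illustration. The slides do not describe how terms were parsed.

```python
from collections import Counter


def term_distribution(queries, top_threshold=100):
    """Summarize how skewed term usage is across a list of query strings.

    Assumptions (not from the slides): terms are split on whitespace and
    lowercased, and `queries` is a non-empty list of non-empty strings.
    """
    # Count every occurrence of every term across all queries.
    counts = Counter(term for q in queries for term in q.lower().split())
    unique = len(counts)          # number of distinct terms
    total = sum(counts.values())  # number of term occurrences

    # "Top" terms: used at least `top_threshold` times.
    top = {t: c for t, c in counts.items() if c >= top_threshold}
    # "Bottom" terms: used exactly once.
    once = [t for t, c in counts.items() if c == 1]

    return {
        "unique_terms": unique,
        "total_terms": total,
        "mean_terms_per_query": total / len(queries),
        "top_share_of_unique": len(top) / unique,
        "top_share_of_total": sum(top.values()) / total,
        "once_share_of_unique": len(once) / unique,
        "once_share_of_total": len(once) / total,
    }


# Toy data for illustration only; these are not queries from the Excite logs.
sample = [
    "free pictures",
    "new york",
    "free music",
    "university of wisconsin",
    "free",
]
stats = term_distribution(sample, top_threshold=3)
```

With top_threshold=100 on a real log, top_share_of_unique and once_share_of_unique would correspond to the 0.34% and 44.8% figures on the 51K slide, provided the tokenization matched the one used in the study.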