Excite longitudinal.ppt

Vox populi: the public searching of the Web:
A longitudinal study of large samples of Excite queries
Dietmar Wolfram (U. of Wisconsin - Milwaukee)
Amanda Spink (Penn State U.)
Major Bernard J. Jansen (U.S. Army)
Tefko Saracevic (Rutgers U.)
© Tefko Saracevic, Rutgers
Excite@Home
• A major Internet media company
• Search capabilities:
– Up to 10 terms per query; default OR
– Advanced search:
• Boolean AND, OR, AND NOT & parentheses
• “phrase” : must appear in answer
• + or - before a term: the term must or must not be in answer
– More Like This : clickable relevance feedback
– proprietary algorithms & concept linking
method, but follow basic information retrieval
Samples
Three samples:
• pilot: 51,000 queries by 18,000 users
collected in March 1997 (label: 51K)
• 1 million queries by over 200,000 users
collected in September 1997 (1M97)
• 1 million queries by over 200,000 users
collected in December 1999 (1M99)
Number of queries per user
Queries    51K    1M97   1M99
Mean       2.8    2.5    2
1 query    67%    48%    78%
2          19%    21%    13%
3          7%     11%    4%
Sessions (as to no. of queries) are SHORT
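A minimal sketch of how per-user figures like these could be computed from a query log. The log layout (pairs of user id and query string) and the name `queries_per_user` are assumptions for illustration; the slides do not describe how the Excite logs were actually processed.

```python
from collections import Counter

def queries_per_user(log):
    """log: iterable of (user_id, query) pairs (hypothetical layout).

    Returns the mean number of queries per user and the share of users
    who issued exactly k queries, for each k that occurs.
    """
    per_user = Counter(user_id for user_id, _ in log)
    n_users = len(per_user)
    mean = sum(per_user.values()) / n_users
    sizes = Counter(per_user.values())
    distribution = {k: sizes[k] / n_users for k in sorted(sizes)}
    return mean, distribution

# Toy log, invented for this example only.
sample_log = [
    ("u1", "new york hotels"),
    ("u1", "new york cheap hotels"),
    ("u2", "real estate"),
    ("u3", "greeting cards"),
    ("u3", "free greeting cards"),
    ("u3", "christmas cards"),
    ("u4", "high school"),
]
mean_queries, distribution = queries_per_user(sample_log)
```

Here every logged query counts once; whether repeated identical queries should be counted separately is a design choice the slides leave open.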
Terms per query distribution
Terms     51K    1M97   1M99
Mean      2.32   2.4    2.35
1 term    31%    26%    26%
2         31%    31%    26%
3         18%    18%    15%
SHORT QUERIES:
Some 60% have 1 or 2 terms
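A small sketch of the terms-per-query calculation. Splitting on whitespace is an assumption made here; the slides do not state how a "term" was delimited. The function name and the toy queries are hypothetical.

```python
from collections import Counter

def terms_per_query(queries):
    """Return (mean terms per query, share of queries with exactly k terms)."""
    # Assumption: a term is any whitespace-delimited string.
    lengths = [len(q.split()) for q in queries if q.strip()]
    mean = sum(lengths) / len(lengths)
    counts = Counter(lengths)
    shares = {k: counts[k] / len(lengths) for k in sorted(counts)}
    return mean, shares

# Toy queries, invented for this example only.
toy_queries = ["music", "real estate", "new york", "free greeting cards"]
mean_terms, shares = terms_per_query(toy_queries)

# Share of "short" queries: those with 1 or 2 terms.
short_share = shares.get(1, 0) + shares.get(2, 0)
```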
Use of Boolean operators
                  51K     1M97    1M99
In % of queries   >10%    >5%     20%
Many uses of Boolean operators are wrong:
they do not follow the instructions for using them
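A rough sketch of how the share of queries containing Boolean operators might be estimated. The operator list (AND, OR, AND NOT) comes from the Excite@Home slide; treating only upper-case forms as operators is an assumption made here, as are the function names and toy queries.

```python
import re

# Assumption: only upper-case AND, OR, AND NOT count as operators,
# so "rock and roll" is treated as three plain terms.
BOOLEAN_PATTERN = re.compile(r"\b(?:AND NOT|AND|OR)\b")

def uses_boolean(query):
    """True if the query contains at least one Boolean operator."""
    return bool(BOOLEAN_PATTERN.search(query))

def boolean_share(queries):
    """Fraction of queries that use at least one Boolean operator."""
    return sum(uses_boolean(q) for q in queries) / len(queries)

# Toy queries, invented for this example only.
toy_queries = [
    "cats AND dogs",
    "new york",
    "jazz OR blues",
    "oregon trail",
    "rock and roll",
]
share = boolean_share(toy_queries)
```

This only detects the presence of an operator; judging whether an operator was used correctly, as the slide discusses, would need a check against the engine's own syntax rules.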
Number of pages viewed per user
Pages     51K    1M97   1M99
Mean      2.3    1.8    1.7
1 page    58%    29%    43%
2         19%    19%    21%
3         9%     14%
Most users view VERY FEW pages beyond
the first or first two
Distribution of terms
• TOP: a very small number of distinct
terms used with very high frequency
• BOTTOM: unusually high number of
distinct terms used with low frequency
• Web query vocabulary contains a very
large number of distinct terms
– more than in ordinary English texts
– has its own & unique characteristics
Term distribution 51K sample
• Top: frequency of 100 or more: 74 terms
– 0.34% of all unique terms (of 21,862)
– 18% of all terms in all queries (of 113,793)
• Bottom: frequency of one: 9,790 terms
– 44.8% of all unique terms
– 8.6% of all terms in all queries
• In freq. of 100 or more (subject terms only):
– 63 subject terms:
0.29% of unique terms; 10.3% of all terms
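The percentages on this slide can be re-derived from the counts it gives (21,862 unique terms; 113,793 terms in all queries). The sketch below does that arithmetic and also shows one way such a summary could be computed from a term-frequency table. The function `skew_summary` and the toy counts are hypothetical, not taken from the study.

```python
from collections import Counter

def skew_summary(term_counts, top_threshold=100):
    """Summarise how skewed a term-frequency table is (percentages)."""
    unique = len(term_counts)
    total = sum(term_counts.values())
    top = [c for c in term_counts.values() if c >= top_threshold]
    once = [c for c in term_counts.values() if c == 1]
    return {
        "top_pct_of_unique": 100 * len(top) / unique,
        "top_pct_of_all": 100 * sum(top) / total,
        "once_pct_of_unique": 100 * len(once) / unique,
        "once_pct_of_all": 100 * sum(once) / total,
    }

# Re-deriving the slide's percentages from its own counts.
UNIQUE_TERMS, ALL_TERMS = 21_862, 113_793
top_pct_of_unique = round(100 * 74 / UNIQUE_TERMS, 2)       # 0.34
once_pct_of_unique = round(100 * 9_790 / UNIQUE_TERMS, 1)   # 44.8
once_pct_of_all = round(100 * 9_790 / ALL_TERMS, 1)         # 8.6
subject_pct_of_unique = round(100 * 63 / UNIQUE_TERMS, 2)   # 0.29

# Toy frequency table, invented for this example only.
toy_counts = Counter({"free": 120, "new": 100, "music": 3, "ncaa": 1, "rutgers": 1})
toy = skew_summary(toy_counts)
```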
Term distribution 1M97 sample
Top 15 terms (common excluded)
51K           1M97          1M99
sex           sex           sex
nude          free          Christmas
free          nude          nude
pictures      pictures      pictures
new           university    new
university    pics          pics
women         chat          music
chat          adult         university
gay           women         games
girls         new           porn
xxx           xxxx          cards
music         girls         state
software      music         stories
pics          porn          women
NCAA          gay           xxxx
Top 10 co-occurring terms
(only meaningful ones)
1M97                1M99
free - pics         new - york
university - of     free - sex
new - york          free - pics
free - sex          university - of
real - estate       pictures - of
home - page         greeting - cards
free - nude         britney - spears
pictures - of       free - nude
free - pictures     free - pictures
high - school       real - estate
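A minimal sketch of counting co-occurring term pairs. It assumes that two terms co-occur when they appear in the same query in any order, and that case is ignored; the slides do not say whether the study required adjacency or a fixed order. The function name and toy queries are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurring_pairs(queries):
    """Count unordered pairs of distinct terms appearing in the same query."""
    pairs = Counter()
    for q in queries:
        # Sorting gives each pair one canonical form: ('new', 'york').
        terms = sorted(set(q.lower().split()))
        pairs.update(combinations(terms, 2))
    return pairs

# Toy queries, invented for this example only.
toy_queries = [
    "new york hotels",
    "New York",
    "free pics",
    "york new restaurants",
]
pair_counts = cooccurring_pairs(toy_queries)
top_pair = pair_counts.most_common(1)
```

Filtering down to "meaningful" pairs, as the slide title indicates was done, would be a separate manual or stop-list step not shown here.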
Classification of queries - a sample
Subject category                  1M97            1M99 (rank)
                                  2414 queries    2539 queries
1. Entertainment, recreation      16.9%           7.5% (6)
2. Sex, porn, preferences         16.8%           7.5% (4)
3. Commerce, travel, economy      13.3%           24.4% (1)
4. Computers & the Internet       12.5%           10.9% (3)
5. Health & the sciences          9.5%            7.8% (5)
6. People, places, things         6.7%            20.3% (2)
7. Society, culture, religion     5.7%            4.2% (9)
8. Education & the humanities     5.6%            5.3% (8)
9. Performing & fine arts         5.4%            1.1% (11)
10. Government                    3.4%            1.6% (10)
11. Incomprehensible              4.1%            6.8% (7)
Major findings
(across all three samples)
• Users: not many queries per search
– 2.4 mean
• Terms: not many per query
– 2.4 mean
– traditional IR queries are 3 to 7 times longer
• Boolean operators not used much
– used in between 1 in 10 and 1 in 5 queries
Major findings ...
• Users did not view many pages
– mean 1.9 pages - percentage of views falling
– between 1 in 3 and 1 in 2 users did not go beyond the
first page
• Relevance feedback (More Like This) not
used much
– used in about 1 in 20 queries
• Over time searching did NOT change much
– use changed mostly in greater use of advanced
features
Major findings ...
• Frequency of use of terms is highly skewed
– highest 1/3 of 1% of terms accounted for 1 in every
10 terms used; terms that were used only once were
1/2 of unique terms
– Web query language quite unique
• A lot of searching about sex, but queries in
category Sex still represent a small
proportion of all queries
– great many other topics searched
– diversity of subjects searched very high
Conclusions
• Web searching is still IR, but very different IR
– Web users search in different & simplified ways
• Many Web search features need redesign to
accommodate the way users use the Web
• Web is a marvelous new technology
– but people are unpredictable in use of any new
technology – how are they really using the Web?
Thank you
Gracias
Danke
Merci
Hvala
… until the next installment ...