Classification at Northern Light
Presentation to Access 98
October 4, 1998
www.nlsearch.com
“This year, the World Wide Web
has arrived as a serious supplier
of ‘serious’ online information.”
Sue Feldman, “Web Search Services in
1998: Trends and Challenges,” Searcher
Magazine, June 1998
Search engines are being held to higher standards
• All users want freshness and manageable results sets
• Professional information seekers want
– high relevance and high quality content first
– good descriptive information for all results
– precision searching
– text and tables
Web search environment
• constant growth in all dimensions (pages, countries, languages, file formats)
• constantly increasing traffic
• continuous onslaught of spam
Practical considerations for search engines
• significant engineering time spent counteracting spam
• constantly adding disk space: 3 terabytes at Northern Light
• crawler efficiency: must balance new page discovery with known-page re-crawl
You step in the stream,
but the water has moved on.
This page is not here.
Search engines: limitations
• lack the higher quality sources not found on the Web
• no concept of classification as found in library systems
• like an index of every word on every page in every book in your library
– with no subject catalog
Northern Light’s fundamental goals
• Combine Web data with quality information not on the Web in a single integrated search
• Make results set manageable for user (already a problem; worse after non-Web data is added)
Research Engine: Content as of Oct 98
• Web
– 96,000,000 pages
• Special Collection
– 3,600,000+ full-text documents
– 4,600 journals, magazines, books, trusted reference works, etc.
• Mixes free (Web) and fee (Special Collection)
Relevancy ranking still critical
• Engines continue to improve their ranking algorithms
• All seem to agree that relevancy ranking alone is not enough to manage results lists of the size commonly seen now
Techniques for taming results sets
• abridge the database (Excite, Lycos, Infoseek)
• re-sort by popularity (HotBot/Direct Hit)
• suggest further refinement steps to user (AltaVista Refine)
• sort based on number of inbound links (Infoseek…?)
• sort by classification metadata (Northern Light)
Research Engine: Classification
• classify the Web according to the same standards found in journal literature
• sort results for user, based on this classification
• work with the user to refine the question (reference interview approach)
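Sorting results by classification can be pictured as grouping an already-ranked list into subject folders. The following is a minimal sketch, not Northern Light's implementation: the function name, the record fields, the `max_folders` cutoff and the sample results are all assumptions made for illustration.

```python
from collections import defaultdict

def group_into_folders(results, max_folders=3):
    """Group ranked results by subject; keep the largest folders, lump the rest."""
    by_subject = defaultdict(list)
    for result in results:  # results are assumed to arrive relevancy-ranked
        by_subject[result["subject"]].append(result["title"])
    # Largest folders first; ties are broken alphabetically for stable output.
    ordered = sorted(by_subject.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    folders = dict(ordered[:max_folders])
    leftovers = [title for _, titles in ordered[max_folders:] for title in titles]
    if leftovers:
        folders["all others..."] = leftovers
    return folders

# Hypothetical results for the query "baseball".
results = [
    {"title": "Red Sox schedule", "subject": "Baseball teams"},
    {"title": "History of baseball", "subject": "Baseball history"},
    {"title": "Yankees roster", "subject": "Baseball teams"},
    {"title": "Baseball card prices", "subject": "Collectibles"},
]
folders = group_into_folders(results, max_folders=2)
for name, titles in folders.items():
    print(name, titles)
```

Each folder name then acts like a reference librarian's follow-up question: choosing one narrows the original query.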
Relevancy ranking has its limits
• Library patron: “I need some baseball information.”
• Librarian: “OK. Here are 41,536 books and sources about baseball, relevancy ranked.”
• Good general sources may be ranked on top, but the user probably had something more specific in mind...
Reference librarian approach: work with the user to refine the question
• “I need some baseball information.”
• “OK. Tell me more. Do you want general info, teams and players, recent news...?”
• “Um... team info”
• “OK. Red Sox, Yankees, ...?”
• “Red Sox.”
Classification helps organize results
• shows aspects of a topic (‘baseball’, ‘diagnostic tests’)
• disambiguates queries (‘what is balance’)
• sometimes answers questions directly (‘12th President’)
Search Current News
Special Collection
Computer networks
Local area networks
Modems
Cable modems
Personal computers
Computer caches
Buses (computer)
Health care software
Software industry
Circuit design
all others...
Special Collection documents
Commercial sites
Sociology of the family
Employee assistance programs
Neurology
Online banking
Helicopters
Martial arts
Chinese philosophy
all others...
1. WHAT IS BALANCE?
84% - Articles & General info: WHAT IS BALANCE? Back to New Evangelicanism Reports. Back to the Way of Life Home Page Way of Life Literature Online Catalog You Can Own… 11/09/97
Personal Page: http://www.dsinclair.com/~dcloud/fbns/whatisbalance.htm
2. Emotional Stability is Balance
77% - Articles & General info: Emotional Stability is Balance Emotional Stability is Balance - 1 He is unbalanced - 2 She’s not on an even keel - 3 They’re upset… 03/24/95
Educational site: http://cogsci.berkeley.edu/metaphors/EmotionalStabilityIsBalance.html
3. What is balance?
73% - Biographical sources: “What is balance?” This is an ongoing, soul-searching, head-scratching question that my husband, Don, and I ponder on a regular bases…. 07/01/96
Exceptional parent (magazine): Available at Northern Light
Subject classification of Web documents
• exists for sites in Web directories (Yahoo, Looksmart, The Mining Co)
• exists behind CGI interfaces
• doesn’t exist at the document level, except where supplied by the page creator
Cost of document classification
• Original cataloging of book: $37
• Creating a journal article abstract: $1.50
• Deriving subject headings from journal abstract: $.20
• At $1.70 per document (abstract plus subject headings), 95,000,000 Web documents = $161.5 million
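The $161.5 million figure appears to combine the abstracting and subject-heading costs for each Web document. A short check of that arithmetic, working in whole cents to avoid floating-point rounding (the variable names are illustrative):

```python
# Per-document costs from the slide, in cents.
ABSTRACT_CENTS = 150   # creating a journal article abstract: $1.50
HEADINGS_CENTS = 20    # deriving subject headings from the abstract: $.20
WEB_DOCUMENTS = 95_000_000

per_document_cents = ABSTRACT_CENTS + HEADINGS_CENTS
total_dollars = per_document_cents * WEB_DOCUMENTS // 100

print(f"per document: ${per_document_cents / 100:.2f}")
print(f"total: ${total_dollars:,}")
```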
Metadata manufacturing
• Automatically determine document’s subject, type, source and language metadata
• Controlled vocabularies interoperate with classifier system
• System classifies pages
• Fraction of a cent per document
NL’s controlled vocabularies
• Editorially developed
• Hierarchical in form (graph)
• Exist for subjects, types, and sources
NL’s subject vocabulary
• Subject scope is unlimited (as in LC, Dewey, Yahoo)
• Major points of reference were DDC, LC Subject headings, UMI subject headings, and subject-specialized classification schemes
• Unique, selective conflation of these
• Mapping NL with content partners’ vocabularies gives freshness, completeness
• 20,000 concepts; 200,000-300,000 concept equivalents
Subject classification process
• Three main techniques:
– mapping
– automatic classification
– editorial classification of whole web sites
Mapping
• Indexing vocabularies of content partners are normalized with NL vocabularies
• Excellent source of new terms; helps maintain freshness and ensure complete coverage of a topic
• All terms become synonyms, equivalents of NL terms and are used in automatic classification... creating a ‘network effect’ of subject knowledge
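One way to picture mapping is as a lookup table in which every partner term resolves to a single NL concept. The sketch below is illustrative only: the class, its methods and the partner terms are hypothetical, and the NL terms are borrowed from the folder list shown earlier in the deck.

```python
def normalize(term):
    """Case-fold and collapse whitespace so variant spellings compare equal."""
    return " ".join(term.lower().split())

class VocabularyMap:
    """Hypothetical table mapping equivalent terms to controlled NL terms."""

    def __init__(self, nl_terms):
        nl_terms = list(nl_terms)
        self._nl_terms = set(nl_terms)
        # Every NL term is trivially an equivalent of itself.
        self._equivalents = {normalize(term): term for term in nl_terms}

    def map_partner_term(self, partner_term, nl_term):
        """Record a partner's indexing term as an equivalent of an NL term."""
        if nl_term not in self._nl_terms:
            raise KeyError(f"unknown NL term: {nl_term!r}")
        self._equivalents[normalize(partner_term)] = nl_term

    def lookup(self, term):
        """Return the NL term for any known equivalent, or None."""
        return self._equivalents.get(normalize(term))

vocab = VocabularyMap(["Local area networks", "Cable modems"])
vocab.map_partner_term("LANs", "Local area networks")            # invented partner term
vocab.map_partner_term("Broadband cable modem", "Cable modems")  # invented partner term

print(vocab.lookup("lans"))
print(vocab.lookup("Token ring"))
```

Because each mapped term becomes another equivalent, every new partner vocabulary widens the set of words a classifier can recognise for the same concept.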
Partner vocabularies mapped to date
• journal aggregators: UMI, IAC, Ethnic News Watch, Responsive Database Services
• news databases: AP News, Comtex Newswires, Newsbytes
• others: U.S. Pharmacopeia, American Banker, Engineering News Record
Automatic classification
• based on words contained in document
• uses Term Frequency/Inverse Document Frequency (TF-IDF) methods
• a document must show a strong degree of ‘aboutness’ to be classified
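The general idea of TF-IDF scoring with an ‘aboutness’ cutoff can be sketched as follows. This is a textbook-style illustration, not Northern Light's classifier: the tokenizer, the scoring formula, the threshold value, the concept table and the sample documents are all assumptions.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def idf_table(corpus):
    """Inverse document frequency: words found in fewer documents score higher."""
    n_docs = len(corpus)
    doc_freq = Counter(word for doc in corpus for word in set(tokenize(doc)))
    return {word: math.log(n_docs / count) for word, count in doc_freq.items()}

def classify(doc, concepts, idf, threshold):
    """Return concepts whose summed TF-IDF score reaches the aboutness threshold."""
    words = tokenize(doc)
    if not words:
        return {}
    term_freq = Counter(words)
    scores = {}
    for concept, terms in concepts.items():
        score = sum(term_freq[t] / len(words) * idf.get(t, 0.0) for t in terms)
        if score >= threshold:
            scores[concept] = score
    return scores

# Invented corpus and concept equivalents.
docs = [
    "modem modem cable modem speeds compared",
    "the red sox won the baseball game",
    "my modem broke during the baseball game",
    "the weather today",
]
concepts = {"Modems": ["modem", "modems"], "Baseball": ["baseball"]}
idf = idf_table(docs)

print(classify(docs[0], concepts, idf, threshold=0.2))
print(classify(docs[2], concepts, idf, threshold=0.2))
```

Under these assumptions the first document is classed under Modems, while the third, which only mentions a modem and baseball in passing, falls below the threshold and receives no subject.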
NL’s type classification
• This scheme too is hierarchical, e.g.
– Reviews
  – Book reviews
  – Movie reviews
  – Product reviews
• classification process based on words and structure of document
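A hierarchical scheme like the Reviews example can be held as a child-to-parent table, so that a search for a broad type also matches its narrower types. This sketch assumes each type has at most one parent, which a true graph-shaped vocabulary need not satisfy; the table and function names are illustrative.

```python
# Child -> parent links for the Reviews example; None marks a top-level type.
TYPE_PARENT = {
    "Reviews": None,
    "Book reviews": "Reviews",
    "Movie reviews": "Reviews",
    "Product reviews": "Reviews",
}

def path_to_root(type_name):
    """Return the type followed by each of its broader types, narrowest first."""
    path = []
    while type_name is not None:
        path.append(type_name)
        type_name = TYPE_PARENT[type_name]
    return path

def is_a(doc_type, wanted_type):
    """True if doc_type equals wanted_type or is narrower than it."""
    return wanted_type in path_to_root(doc_type)

print(path_to_root("Book reviews"))
print(is_a("Movie reviews", "Reviews"))
```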
Librarians at Northern Light
• Build and maintain controlled vocabulary
• Map vocabularies of new partners
• Continually tune classification performance
• Help design and test user interface
• Mine and classify whole web sites
• Edit databases
Database editing
• Classification used to slice NL database into “vertical search engines”
• Since Feb 98, we’ve released
– 17 subject search engines on NL Power Search
– 26 industry databases (for NL; also on Netscape Netcenter)
– 5 personal finance databases (for Doubleclick)
– music industry database (with Billboard magazine)
– construction industry database (with Engineering News Record)
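Slicing a classified database into a vertical search engine amounts to keeping only the documents whose subjects fall within a chosen set. A minimal sketch, with an invented record layout, example.com URLs and subject labels:

```python
def build_vertical(database, subjects):
    """Keep documents classified under at least one of the chosen subjects."""
    wanted = set(subjects)
    return [doc for doc in database if wanted.intersection(doc["subjects"])]

# Hypothetical classified records.
database = [
    {"url": "http://example.com/lan-guide", "subjects": ["Local area networks"]},
    {"url": "http://example.com/red-sox", "subjects": ["Baseball"]},
    {"url": "http://example.com/cable-modems", "subjects": ["Cable modems", "Modems"]},
]
networking = build_vertical(database, ["Local area networks", "Modems"])
for doc in networking:
    print(doc["url"])
```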
Automatic classification is still a fledgling technology, however...
• it has proved practical for classifying close to 100 million web pages
• it is remarkably accurate, given the breadth of concept space it covers
• it is responsive to tuning
• it is effective in managing results sets for users
Joyce Ward
Director, Content Classification
Northern Light Technology LLC
222 Third St.
Cambridge, MA 02172
jward@northernlight.com
617-577-2778