Classification at Northern Light
Presentation to Access 98
October 4, 1998
www.nlsearch.com

“This year, the World Wide Web has arrived as a serious supplier of ‘serious’ online information.”
Sue Feldman, “Web Search Services in 1998: Trends and Challenges,” Searcher Magazine, June 1998

Search engines are being held to higher standards
• All users want freshness and manageable results sets
• Professional information seekers want
  – high relevance and high quality content first
  – good descriptive information for all results
  – precision searching
  – text and tables

Web search environment
• constant growth in all dimensions (pages, countries, languages, file formats)
• constantly increasing traffic
• continuous onslaught of spam

Practical considerations for search engines
• significant engineering time spent counteracting spam
• constantly adding disk space: 3 terabytes at Northern Light
• crawler efficiency: must balance new page discovery with known-page re-crawl

You step in the stream,
but the water has moved on.
This page is not here.

Search engines: limitations
• lack the higher quality sources not found on the Web
• no concept of classification as found in library systems
• like an index of every word on every page in every book in your library – with no subject catalog

Northern Light’s fundamental goals
• Combine Web data with quality information not on the Web in a single integrated search
• Make results set manageable for user (already a problem; worse after non-Web data is added)

Research Engine: Content as of Oct 98
• Web
  – 96,000,000 pages
• Special Collection
  – 3,600,000+ full-text documents
  – 4600 journals, magazines, books, trusted reference works, etc.
• Mixes free (Web) and fee (Special Collection) content

Relevancy ranking still critical
• Engines continue to improve their ranking algorithms
• All seem to agree that relevancy ranking is not enough to manage results lists of the size commonly seen now

Techniques for taming results sets
• abridge the database (Excite, Lycos, Infoseek)
• re-sort by popularity (HotBot/Direct Hit)
• suggest further refinement steps to user (AltaVista Refine)
• sort based on number of inbound links (Infoseek…?)
• sort by classification metadata (Northern Light)

Research Engine: Classification
• classify the Web according to the same standards found in journal literature
• sort results for user, based on this classification
• work with the user to refine the question (reference interview approach)

Relevancy ranking has its limits
Library patron: “I need some baseball information.”
Librarian: “OK. Here are 41,536 books and sources about baseball, relevancy ranked.”
Good general sources may be ranked on top, but the user probably had something more specific in mind...

Reference librarian approach: work with the user to refine the question
“I need some baseball information.”
“OK. Tell me more. Do you want general info, teams and players, recent news...?”
“Um... team info”
“OK. Red Sox, Yankees, ...?”
“Red Sox.”

Classification helps organize results
• shows aspects of a topic (‘baseball’, ‘diagnostic tests’)
• disambiguates queries (‘what is balance’)
• sometimes answers questions directly (‘12th President’)

[Screenshot: result folders]
Search Current News; Special Collection; Computer networks; Local area networks; Modems; Cable modems; Personal computers; Computer caches; Buses (computer); Health care software; Software industry; Circuit design; all others...

[Screenshot: search results]
1. WHAT IS BALANCE? 84% - Articles & General info: WHAT IS BALANCE? Back to New Evangelicanism Reports.
Back to the Way of Life Home Page Way of Life Literature Online Catalog You Can Own… 11/09/97 Personal Page: http://www.dsinclair.com/~dcloud/fbns/whatisbalance.htm
2. Emotional Stability is Balance 77% - Articles & General info: Emotional Stability is Balance Emotional Stability is Balance - 1 He is unbalanced - 2 She’s not on an even keel - 3 They’re upset… 03/24/95 Educational site: http://cogsci.berkeley.edu/metaphors/EmotionalStabilityIsBalance.html
3. What is balance? 73% - Biographical sources: “What is balance?” This is an ongoing, soul-searching, head-scratching question that my husband, Don, and I ponder on a regular bases…. 07/01/96 Exceptional parent (magazine): Available at Northern Light
Result folders: Special Collection documents; Commercial sites; Sociology of the family; Employee assistance programs; Neurology; Online banking; Helicopters; Martial arts; Chinese philosophy; all others...

Subject classification of Web documents
• exists for sites in Web directories (Yahoo, Looksmart, The Mining Co)
• exists behind CGI interfaces
• doesn’t exist at the document level except where supplied by the page creator

Cost of document classification
• Original cataloging of book: $37
• Creating a journal article abstract: $1.50
• Deriving subject headings from journal abstract: $.20
• At $1.70 per document (abstract plus subject headings), 95,000,000 Web documents = $161.5 million

Metadata manufacturing
• Automatically determine document’s subject, type, source and language metadata
• Controlled vocabularies interoperate with classifier system
• System classifies pages
• Fraction of a cent per document

NL’s controlled vocabularies
• Editorially developed
• Hierarchical in form (graph)
• Exist for subjects, types, and sources

NL’s subject vocabulary
• Subject scope is unlimited (as in LC, Dewey, Yahoo)
• Major points of reference were DDC, LC Subject Headings, UMI subject headings, and subject-specialized classification schemes
• Unique, selective conflation of these
• Mapping NL with content partners’ vocabularies gives freshness, completeness
• 20,000 concepts; 200,000-300,000 concept equivalents

Subject classification process
Three main techniques:
– mapping
– automatic classification
– editorial classification of whole web sites

Mapping
• Indexing vocabularies of content partners are normalized with NL vocabularies
• Excellent source of new terms; helps maintain freshness and ensure complete coverage of a topic
• All terms become synonyms, equivalents of NL terms and are used in automatic classification... creating a ‘network effect’ of subject knowledge

Partner vocabularies mapped to date
• journal aggregators: UMI, IAC, Ethnic News Watch, Responsive Database Services
• news databases: AP News, Comtex Newswires, Newsbytes
• others: U.S. Pharmacopeia, American Banker, Engineering News Record

Automatic classification
• based on words contained in document
• uses Term Frequency/Inverse Document Frequency methods
• document must have a strong degree of ‘aboutness’ to be classed

NL’s type classification
• This scheme too is hierarchical, e.g.
  • Reviews
    – Book reviews
    – Movie reviews
    – Product reviews
• classification process based on words and structure of document

Librarians at Northern Light
• Build and maintain controlled vocabulary
• Map vocabularies of new partners
• Continually tune classification performance
• Help design and test user interface
• Mine and classify whole web sites
• Edit databases

Database editing
• Classification used to slice NL database into “vertical search engines”
• Since Feb 98, we’ve released
  – 17 subject search engines on NL Power Search
  – 26 industry databases (for NL; also on Netscape Netcenter)
  – 5 personal finance databases (for Doubleclick)
  – music industry database (with Billboard magazine)
  – construction industry database (with Engineering News Record)

Automatic classification is still a fledgling technology, however...
• it has proved practical for classifying close to 100 million web pages
• it is remarkably accurate, given the breadth of concept space it covers
• it is responsive to tuning
• it is effective in managing results sets for users

Joyce Ward
Director, Content Classification
Northern Light Technology LLC
222 Third St.
Cambridge, MA 02172
jward@northernlight.com
617-577-2778
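
Illustrative sketch (not part of the original slides)
The "Automatic classification" slide says only that the classifier is based on the words in a document, uses Term Frequency/Inverse Document Frequency methods, and requires a strong degree of ‘aboutness’. The Python sketch below shows one way those three ideas could fit together. It is not Northern Light's classifier: the vocabulary, the smoothed IDF formula, the scoring rule, the 0.2 threshold, and the names `VOCABULARY`, `idf_table`, and `classify` are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical controlled vocabulary: concept -> terms treated as its
# synonyms/equivalents. Invented for this example only.
VOCABULARY = {
    "Baseball": {"baseball", "inning", "pitcher", "shortstop"},
    "Modems": {"modem", "modems", "baud", "dialup"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def idf_table(corpus):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(tokenize(doc)))  # count each term once per document
    return {term: math.log((1 + n) / (1 + d)) + 1.0 for term, d in df.items()}


def classify(doc, idf, threshold=0.2):
    """Return concepts whose terms carry at least `threshold` of the
    document's total TF-IDF weight (an assumed stand-in for 'aboutness')."""
    tf = Counter(tokenize(doc))
    unseen = max(idf.values(), default=1.0)  # terms not in the corpus count as rare
    weights = {t: count * idf.get(t, unseen) for t, count in tf.items()}
    total = sum(weights.values())
    if total == 0:
        return []
    scores = {
        concept: sum(w for t, w in weights.items() if t in terms) / total
        for concept, terms in VOCABULARY.items()
    }
    return sorted(c for c, s in scores.items() if s >= threshold)


corpus = [
    "The pitcher threw a perfect inning and the shortstop made the catch",
    "A new modem supports a higher baud rate over dialup lines",
    "The weather was nice and the traffic was light",
]
idf = idf_table(corpus)
for doc in corpus:
    print(classify(doc, idf), "<-", doc)
```

A document with no vocabulary terms, or with too small a share of its weight on any one concept, is left unclassified, which mirrors the slide's point that a page must be strongly "about" a subject to be classed. Adding a partner's terms to a concept's term set is a rough analogue of the mapping step: every mapped synonym immediately becomes evidence for that concept.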