Topic 4. Searching the Web
Document Management Systems

1 Introduction
• The WWW dates from the late 1980s.
• It is growing at an exponential rate.
• It holds textual information, but also multimedia.
• The Web can be regarded as an enormous unstructured database.

2 Introduction
The problem is how to find information on the Web. There are three different ways to search:
• Use search engines (they index part of the Web as documents in a text database).
• Use Web directories (they classify documents by subject).
• Search by exploiting the hyperlink structure.

3 Introduction
The main problems we face are:
• Distributed data.
• A high percentage of volatile data.
• An enormous amount of information.
• Redundant and unstructured data.
• Data quality.
• Heterogeneous data.

4 Types of search tools: Search Engines (& Meta-Search Engines)
Characteristics
• Full text of selected Web pages
• Search by keyword, trying to match exactly the words in the pages
• No browsing, no subject categories
• Databases compiled by "spiders" (computer-robot programs) with minimal human oversight
• Search-engine size: from small and specialized to huge (about 20 billion websites or pages)
• Meta-Search Engines quickly and superficially search several individual search engines at once and return results compiled into a sometimes convenient format. Caveat: they only catch about 1% of search results in any of the search engines they visit.
Examples
• Google, Yahoo Search, Ask.com
• Meta-Search Engines: Dogpile, Copernic

5 Types of search tools: Subject Directories
Characteristics
• Human-selected sites picked by editors (sometimes experts in a subject)
• Often carefully evaluated and kept up to date, but not always, and frequently not if large and general
• Usually organized into hierarchical subject categories
• Often annotated with descriptions (not in Yahoo!)
• Can browse subject categories or search using broad, general terms
• NO full text of documents. Searches need to be less specific than in search engines, because you are not matching on the words in the pages you eventually want. In directories you are searching only the subject categories and descriptions you see in their pages.
Examples
• Librarians' Index, Infomine, Google Directory, About.com, AcademicInfo
• There are thousands more Subject Directories on practically every topic you can think of.

6 Types of search tools: Specialized Databases (The Invisible Web)
Characteristics
• The Web provides access through a search box into the contents of a database in a computer somewhere
• Can be on any topic: trivial, commercial, task-specific, governmental, or a rich treasure devoted to your topic
• Also includes many pages generated as search results from libraries' online catalogs, and the many copyright-protected articles in the databases of journal and magazine publishers
Examples
• Locate specialized databases by looking for them in good Subject Directories like the Librarians' Index, Yahoo!, or AcademicInfo; in special guides to searchable databases; and sometimes by keyword searching in general search engines

7 Search Engines: How do they work?
• They do not search the Web directly.
• They use a database of web pages.
• The databases are built by spiders or crawlers.
• Spiders find pages by following the links that pages contain.
• A page that nothing links to will never be indexed.
• Spiders send the web pages to indexing programs, which identify text, links, ...
• The indexed terms are stored in the database.
• Some kinds of pages are excluded from indexing by rule (pages not found, unsuitable content, formats that cannot be processed, dynamically generated information, etc.).

8 Search Engines
Engines compared: Google (www.google.com), Yahoo! Search (search.yahoo.com), Ask.com (www.ask.com)

Size, type (size varies frequently and widely)
  Google: HUGE. Size not disclosed in any way that allows comparison. Probably the biggest. Biggest in tests.
  Yahoo! Search: HUGE. Claims over 20 billion total "web objects."
  Ask.com: LARGE. Claims to have 2 billion fully indexed, searchable pages. Strives to become #1 in size.
Noteworthy features and limitations
  Google: Popularity ranking using PageRank™. Indexes the first 101KB of a Web page, and 120KB of PDFs. ~ before a word sometimes finds synonyms (~help > FAQ, tutorial, etc.)
  Yahoo! Search: Shortcuts give quick access to dictionary, synonyms, patents, traffic, stocks, encyclopedia, and more.
  Ask.com: Subject-Specific Popularity™ ranking. Suggests broader and narrower terms.
Phrase searching
  Google: Yes. Use " ". Searches common "stop words" if in phrases in quotes.
  Yahoo! Search: Yes. Use " ".
  Ask.com: Yes. Use " ". Searches common "stop words" if in phrases in quotes.
Boolean logic
  Google: Partial. AND assumed between words. Capitalize OR. - excludes. No ( ) or nesting. In Advanced Search, partial Boolean available in boxes.
  Yahoo! Search: Accepts AND, OR, NOT or AND NOT, and ( ). Must be capitalized. You must enclose terms joined by OR in parentheses (classic Boolean).
  Ask.com: Partial. AND assumed between words. Capitalize OR. - excludes. No ( ) or nesting.
+ Requires / - Excludes
  Google: - excludes. + will allow you to retrieve "stop words" (e.g., +in)
  Yahoo! Search: - excludes. + will allow you to search common words: "+in truth"
  Ask.com: - excludes. + will allow you to retrieve "stop words" (e.g., +in)

9 Search Engines
Sub-searching
  Google: Sort of. At bottom of results page, click "Search within results" and enter more terms. Adds terms.
  Yahoo! Search: Add terms.
  Ask.com: Sort of. Add terms.
Results ranking
  Google: Based on page popularity measured in links to it from other pages: high rank if a lot of other pages link to it. Fuzzy AND also invoked. Matching and ranking based on "cached" version of pages that may not be the most recent version.
  Yahoo! Search: Automatic Fuzzy AND.
  Ask.com: Based on Subject-Specific Popularity™, links to a page by related pages.
Field limiting
  Google: link: site: intitle: inurl: Advanced Search boxes for most of these. Offers Uncle Sam for US federal pages and other special searches.
  Yahoo! Search: link: site: intitle: inurl: url: hostname:
  Ask.com: intitle: inurl: site:

10 Search Engines
Truncation, stemming
  Google: No truncation. Stems some words. Search variant endings and synonyms separately, separating with OR (capitalized): airline OR airlines
  Yahoo! Search: Neither. Search with OR as in Google.
  Ask.com: Neither. Search with OR as in Google.
Case sensitivity
  Google: No.
  Yahoo! Search: No.
  Ask.com: No.
Language
  Google: Yes. Major Romanized and non-Romanized languages in Advanced Search.
  Yahoo! Search: Yes. Major Romanized and non-Romanized languages.
  Ask.com: Yes. Major Romanized languages. Use Advanced Search to limit.
Limit by age of documents
  Google: In Advanced Search.
  Yahoo! Search: In Advanced Search.
  Ask.com: In Advanced Search.
Translation
  Google: Yes, in the "Translate this page" link following some pages. To and sometimes from English and major European languages and Chinese, Japanese, Korean.
  Yahoo! Search: Yes.
  Ask.com: No.

11 Search Engines
Search Engines Features Chart. Last updated Oct. 1, 2007.

Google
  Boolean: -, OR
  Default: and
  Proximity: Phrase
  Truncation: No (stems), word in phrase
  Fields: intitle, inurl, link, site, more
  Limits: Language, filetype, date, domain
  Stop words: Few, + searches
  Sorting: Relevance, site
Yahoo!
  Boolean: AND, OR, NOT, ( ), -
  Default: and
  Proximity: Phrase
  Truncation: No, word in phrase
  Fields: intitle, inurl, link, site, more
  Limits: Language, file type, date, domain
  Stop words: No
  Sorting: Relevance, site
Ask
  Boolean: -, OR
  Default: and
  Proximity: Phrase
  Truncation: No
  Fields: intitle, inurl, site
  Limits: Language, site, date
  Stop words: Yes, + searches
  Sorting: Relevance, metasites
Live Search
  Boolean: AND, OR, NOT, ( ), -
  Default: and
  Proximity: Phrase
  Truncation: No
  Fields: intitle, link, site, loc, url
  Limits: Language, site
  Stop words: Varies, + searches
  Sorting: Relevance, site, sliders
Gigablast
  Boolean: AND, OR, AND NOT, ( ), +, -
  Default: and
  Proximity: Phrase
  Truncation: No
  Fields: title, site, ip, more
  Limits: Domain, type
  Stop words: Varies, + searches
  Sorting: Relevance
Exalead
  Boolean: AND, OR, NOT, ( ), -
  Default: and
  Proximity: Phrase, NEAR
  Truncation: Yes and stems
  Fields: intitle, inurl, link, site
  Limits: Language, file type, date, domain
  Stop words: Varies, + searches
  Sorting: Relevance, date

12 Search Engines (different?)
http://www.bruceclay.com/searchenginerelationshipchart.htm

13-16 Search Engines
[Figure slides]

17 Metasearch
For each tool: what is searched (as of the date at the bottom of the page; they change often), complex search ability, and results display.

Clusty (clusty.com)
  What's searched: Currently searches a number of free search engines and directories, not Google or Yahoo.
  Complex search ability: Accepts and "translates" complex searches with Boolean operators and field limiting.
  Results display: Results accompanied with subject subdivisions based on words in search results, usually giving the major themes (Vivisimo Clustering Engine™). Click on these to search within results on each theme.
Dogpile (www.dogpile.com)
  What's searched: Searches Google, Yahoo, LookSmart, AskJeeves/Teoma, Google ADS, MSN search. Sites that have purchased ranking and inclusion are blended in. Watch for "Sponsored by..." links below search results.
  Complex search ability: Accepts Boolean logic, especially in advanced search modes.
  Results display: Dogpile allows you to see each search engine's results separately in a useful list for comparison. Click the search engine icons by "Best of Breed."

18 Metasearch
SurfWax (www.surfwax.com)
  What's searched: A better than average set of search engines. Can mix with educational, US Govt tools, and news sources, or many other categories.
  Complex search ability: Accepts " ", +/-. Default is AND between words. I recommend fairly simple searches, allowing SurfWax's SiteSnaps and other features to help you dig deeply into results.
  Results display: Click on source link to view complete search results there. Click on [icon] to view helpful "SiteSnap™" extracted from most sites in frame on right. Many additional features for probing within a site.
Copernic Agent (www.copernic.com)
  What's searched: Select from list of search engines by clicking the Properties button following the Advanced Search search box.
  Complex search ability: ALL, ANY, Phrase, and more. Also Boolean searching within results under Refine (powerful!).
  Results display: Must be downloaded and installed, but Basic version is free of charge.

19 Metasearch
Dogpile (http://www.dogpile.com)
  Popular metasearch site owned by InfoSpace that sends a search to a customizable list of search engines, directories and specialty search sites, then displays results from each search engine individually.
Vivisimo (http://vivisimo.com/)
  Enter a search term, and Vivisimo will not only pull back matching responses from major search engines but also automatically organize the pages into categories. Slick and easy to use.
Kartoo (http://www.kartoo.com)
  If you like the idea of seeing your web results visually, this meta search site shows the results with sites being interconnected by keywords.
Mamma (http://www.mamma.com)
  Founded in 1996, Mamma.com is one of the oldest meta search engines on the web. Mamma searches against a variety of major crawlers, directories and specialty search sites. The service also provides a paid listings option for advertisers, Mamma Classifieds.
SurfWax (http://www.surfwax.com)
  Searches against major engines, or gives those who open free accounts the ability to choose from a list of hundreds. Using the "SiteSnaps" feature, you can preview any page in the results and see where your terms appear in the document.
  Allows results or documents to be saved for future use.

20 Metasearch
Clusty http://www.clusty.com
InfoGrid http://www.infogrid.com
MetaEureka http://www.metaeureka.com
CurryGuide http://web.curryguide.com/
Infonetware RealTerm Search http://www.infonetware.com
ProFusion http://www.profusion.com
Excite http://www.excite.com
Ixquick http://www.ixquick.com/
Query Server http://www.queryserver.com/web.htm
Fazzle http://www.fazzle.com/
iZito http://www.izito.com
Turbo10 http://turbo10.com
Gimenei http://gimenei.com/
Jux2 http://www.jux2.com/
Search.com http://www.search.com
IceRocket http://www.icerocket.com/
Meceoo http://www.meceoo.com/
Ujiko http://www.ujiko.com/
Info.com http://www.info.com
MetaCrawler http://www.metacrawler.com
WebCrawler http://www.webcrawler.com
ZapMeta http://www.zapmeta.com

21 Directories
Subject directories compared: Librarians' Index (www.lii.org), Infomine (infomine.ucr.edu), Academic Info (www.academicinfo.us, recommends browsing), About.com (www.about.com), Google Directory (directory.google.com), Yahoo! (dir.yahoo.com)

Size, type
  Librarians' Index: Over 16,000. Compiled by public librarians in information supply business. Highest quality sites only. Great, reliable annotations.
  Infomine: Over 120,000. Cooperatively compiled by university & college-level academic librarians of the UC campuses. Great, reliable annotations.
  Academic Info: Rich selection of about 25,000 pages, selected as "college and research level Internet resources" aimed "at the undergraduate level or above." Brief annotations.
  About.com: Over 2 million. Generally good annotations done by "Guides" with various levels of expertise.
  Google Directory: About 5 million web pages, selected by the Open Directory Project and enhanced by Google searching and ranking. Often useful to find "better" results, especially on broad or widely covered topics.
  Yahoo!: About 4 million. Scarce descriptions and annotations. Often useful, especially for popular and commercial topics.
Phrase searching
  Librarians' Index: Yes. Use " ".
  Infomine: Yes. Use " ". |term term| requires exact match.
  Academic Info: No. " " make searches fail.
  About.com: Yes. Use " ".
  Google Directory: Yes. Use " ".
  Yahoo!: Yes. Use " ".

22 Directories
Boolean logic
  Librarians' Index: AND implied between words. Also accepts OR and NOT, and ( ).
  Infomine: AND implied between words. Also accepts OR, NOT, and ( ).
  Academic Info: OR implied between words. Accepts AND, OR, NOT and ( ). Recommend AND between words in most searches.
  About.com: No.
  Google Directory: OR, capitalized, as in Google's web search engine.
  Yahoo!: Yes, as in Yahoo! Search web search engine.
Truncation
  Librarians' Index: Use *. Also stems. Can turn stemming off on Advanced Search page.
  Infomine: Use *. Also stems. Can turn stemming off. Use " " or | | to search exact terms.
  Academic Info: No.
  About.com: Use *. Not accepted consistently.
  Google Directory: No.
  Yahoo!: No.
Field searching
  Librarians' Index: Advanced Search allows Boolean searching within subject, titles, description, parts of URLs, and more.
  Infomine: Select boxes under search box to limit.
  Academic Info: No.
  About.com: No.
  Google Directory: Same as in Google's web search engine.
  Yahoo!: As in Yahoo! Search web search engine.

23 The Invisible Web: What is it?
• The visible Web is what you see as the result of a query in a search engine or in the directories.
• The Invisible Web is made up of all the pages and content that search engines cannot process and catalog in their indexes. For example:
  • Dynamic information.
  • Searchable databases.
  • Pages excluded from search engines by some processing policy.
• Search engines cannot find the information offered on these pages.
• To reach Invisible Web information you have to go directly to the page that offers it and search there.

24 The Invisible Web: How do you search it?
• Keep the concept of "databases" in mind and stay alert to any information that search engines and directories may offer. Such pages can turn up at any moment while browsing or running queries.
• To find Invisible Web pages you can use search engines, adding the term "base de datos" or "database" to the query. Example: plane crash database
• Besides planning a good search with a suitable strategy in a search engine or directory, take the time to investigate the databases you find on the topics of your information need.

25 The Invisible Web
When dealing with the Deep Web, keep these points in mind:
• Information that is likely to be stored in a database is a part of the deep Web.
• Information that is new and dynamically changing in content will appear on the deep Web.
• Web sites of searchable databases can be retrieved via directories and search engines.
• Many search engine sites and commercial portals feature searchable databases as part of their package of services.
• Some search engines will search the deep Web for related content subsequent to an initial search.
• Topical coverage on the deep Web is extremely varied.
• Some of the information stored on Web-accessible databases may not be substantive or useful to most searchers.

26 The Invisible Web
The Invisible Web: databases not accessible to ordinary search engines.
• Librarians' Internet Index (lii.org): Lots of categorized databases.
• Complete Planet (www.completeplanet.com): Hundreds of databases by category.
• All Academic (www.allacademic.com): Journals & other free academic content.
• Invisible-Web.net (www.invisible-web.net): Companion site to Invisible Web book.
• Findarticles.com (www.findarticles.com): Search hundreds of journals.
• Magportal (www.magportal.com): Full text magazine articles.
• Infomine (infomine.ucr.edu): Scholarly Internet Resource Collections.
• Online Books Page (onlinebooks.library.upenn.edu): Full text of more than 18,000 books.
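
The tip on slide 24 (add "database" or "base de datos" to a query to surface Invisible Web sources) can be sketched in a few lines of Python. This is only an illustration: the function name `database_queries` and the default marker list are assumptions made here, not part of the original slides, and the resulting strings would still have to be typed into a search engine by hand.

```python
def database_queries(topic, markers=("database", "base de datos")):
    """Return one query per marker, e.g. 'plane crash database'."""
    # Collapse repeated whitespace in the topic.
    topic = " ".join(topic.split())
    queries = []
    for marker in markers:
        # Quote multi-word markers so engines with phrase search keep them together.
        term = f'"{marker}"' if " " in marker else marker
        queries.append(f"{topic} {term}")
    return queries


print(database_queries("plane crash"))
# ['plane crash database', 'plane crash "base de datos"']
```

Quoting the Spanish marker follows the phrase-searching convention (" ") described in the search engine comparison slides.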
27-28 Some statistics
[Figure slides]

29 Some statistics
[Figure: Millions of Textual Documents Indexed]

30 Some statistics
[Figure: Billions of Textual Documents Indexed, December 1995-September 2003]

Search Engine Size, November 2004
  Search Engine   Reported Size            Page Depth
  Google          8.1 billion              101K
  MSN             5.0 billion              150K
  Yahoo           4.2 billion (estimate)   500K
  Ask Jeeves      2.5 billion              101K+

31-39 Some statistics
[Figure slides]

40 Some statistics
How many searches are performed each day? The table shows the number of searches performed within the United States in March 2006, based on comScore figures.
  Searches   Per Day (Millions)   Per Month (Millions)
  Google     91                   2,733
  Yahoo      60                   1,792
  MSN        28                   845
  AOL        16                   486
  Ask        13                   378
  Others     6                    166
  Total      213                  6,400

41 Some statistics
[Figure slide]

42-44 How others search the Web
[Figure slides]

45 How to search the Web: Strategies
Step #1. Analyze your topic to decide where to begin
Does your topic...
• have distinctive words or phrases?
  methernitha: unique meaning
  "affirmative action": specific, accepted meaning in word cluster
• have NO distinctive words or phrases you can think of? You have only common or general terms that get the "wrong" pages.
  "order out of chaos": used in too many contexts to be useful
  sundiata: retrieves a myth, a rock group, a person, etc.
• seek an overview of a broad topic?
  victorian literature, alternative energy sources
• specify a narrow aspect of a broad or common topic?
  automobile recyclability: want current research, future designs, not how to recycle or oil recycling or other community efforts
• have synonymous, equivalent terms, or variant spellings or endings that need to be included?
  echinoderm OR echinoidea OR "sea urchin": any may be in useful pages
  "cold fusion energy" OR "hydrogen energy": some use one term, some the other; you want both, although they are not precisely equivalent
  millennium OR millennial OR millenium OR millenial OR "year 2000", etc.: pages you want may contain any or all
• make you feel confused? Don't really know much about the topic yet? Need guidance?

46 How to search the Web: Strategies
Step #2. Pick the right starting place using this table:

YOUR TOPIC'S FEATURES: Distinctive word or phrase?
  Search Engines: Enclose phrases in " ". Test run your word or phrase in Google.
  Subject Directories: Search the broader concept, what your term is "about."
YOUR TOPIC'S FEATURES: NO distinctive words or phrases?
  Search Engines: Use more than one term or phrase in " " to get fewer results.
  Subject Directories: Try to find distinctive terms in Subject Directories.
YOUR TOPIC'S FEATURES: Seek an overview?
  Search Engines: NOT RECOMMENDED
  Subject Directories: Look for a specialized Subject Directory focused on your topic.
YOUR TOPIC'S FEATURES: Narrow aspect of broad or common topic?
  Search Engines: Boolean searching as in Yahoo! Search.
  Subject Directories: Look for a Directory focused on the broad subject.
YOUR TOPIC'S FEATURES: Synonyms, equivalent terms, variants
  Search Engines: Choose search engines with Boolean OR, or Truncation, or Field limiting.
  Subject Directories: NOT RECOMMENDED
YOUR TOPIC'S FEATURES: Confused? Need more information?
  Search Engines: NOT RECOMMENDED
  Subject Directories: Look for a Gateway Page (Subject Guide). Try an encyclopedia. Ask at a library reference desk.
Whatever your topic's features:
  Specialized Databases ("Invisible Web"): Want data? Facts? Statistics? All of something? One of many like things? Schedules? Maps? Look for a specialized database on the Invisible Web. Hard to predict what you might find.
  Find an Expert: Look for a specialized subject directory on your topic. E-mail the author of a good page you find. Ask a discussion group or blog. Never hurts to seek help.
  LUCK: Always on your side. Keep your mind open. Learn as you search.

47 How to search the Web: Strategies
Step #3. Learn as you go & VARY your approach with what you learn. Don't assume you know what you want to find. Look at search results and see what you might use in addition to what you've thought of.
Step #4. Don't bog down in any strategy that doesn't work. Switch from search engines to directories and back. Find specialized directories on your topic. Think about possible databases and look for them.
Step #5. Return to previous strategies better informed.

48 How to search the Web: Strategies
Search Strategies We Do NOT Recommend
Because of their inefficiency and often haphazard and frustrating results, we do not recommend either of the following two approaches to finding Web documents:
• Browsing searchable directories. If you can find a search box, search a directory. BROWSING is sometimes fun but rarely as efficient. The term "directories" refers here to any collection of web resources organized into subject categories or some other breakdown appropriate to the content (Subject Directories or directories of specialized databases). Browsing locates documents by trying to match your topic first in the top, broadest layer of a subject hierarchy, then by choosing narrower sub-categories in the hierarchy that you hope will lead to your target. Browsing runs into the difficulty of guessing under which subject category your topic is classified. The taxonomy in every directory differs, making browsing inconsistent from one search tool to another. The category "health" may contain documents on medicine, homeopathy, psychiatry, and fitness in one directory. In another, "medicine" may include health, mental health, and alternative medicine, but not the term psychiatry, and may classify fitness only under "lifestyle." Searching (typing keywords in a search box) retrieves occurrences of your words no matter where they may be classified by subject. Use broad terms in searching any directory.
• Following links to sites recommended by heavy use or commercial interest. Often in search engine results, you will see links to sites that are selected based on how often they are visited by others, or based on fees paid to the search engine. Or you may see recommended "cool" sites. Use these with caution! Others may visit sites for reasons having no relation to your information interests, and the best sites for you may still be largely undiscovered by the vast public searching the Web. Taste varies and should vary. Make your own evaluations.

49 How to search the Web: Strategies
Features of your search inquiry, each followed by the matching search tool features worth learning.

Are you looking for a proper name or a distinct phrase?
• The name of an organization or society or movement
• A proper name of an individual
• A distinctive string of words generally associated with your topic
Can you think of an organization, proper name, or phrase to search for? It might help zoom in on the pages you want.
  PHRASE SEARCHING is a feature you want in every search tool you choose. It requires your terms all to appear in exactly the order you enter them. Enclose the phrase in double quotation marks " ".
  Examples: "affirmative action"   "world health organization"   "a person's name"
  In [...], capitalizing initial letters will cause the terms to be searched as a phrase: World Health Organization

Are some of your terms common words with many meanings and contexts?
• Children in conjunction with television and also violence
• Censorship as an aspect of ethics in journalism
  BOOLEAN AND will help:
    children AND television AND violence
    journalism AND ethics AND censorship
  Google, AllTheWeb and most other search engines put AND in between words automatically (by default):
    children television violence
    journalism ethics censorship

Do you anticipate lots of search results with terms you do not want?
• Your search for biomedical engineering and cancer brings you lots of academic programs, and you want research reports. So you try to exclude documents containing Department of or School of.
  BOOLEAN AND NOT will help:
    "biomedical engineering" AND cancer AND NOT "Department of" AND NOT "School of"
  or its near equivalent, - (EXCLUDES):
    "biomedical engineering" cancer -"Department of" -"School of"

50 How to search the Web: Strategies
Are there synonyms, spelling variations, or foreign spellings for some of your terms?
• women, females with networking
• Sarajevo, Sarayevo with peace
• literature, litterature with French, francaise
  BOOLEAN OR will help:
    (women OR females) AND networking
    (Sarajevo OR Sarayevo) AND peace
    (literature OR litterature) AND (French OR francaise)
  In Google, capitalize OR (no need to type "and"):
    peace sarajevo OR sarayevo
    literature OR litterature french OR francaise
  In AllTheWeb, use parentheses and omit the OR:
    peace (sarajevo sarayevo)
    (literature litterature) (french francaise)

Are you looking for home pages and/or other documents primarily about your term(s)?
• The home page of the American Dietetic Association
• Pages primarily about Affirmative Action
  LIMIT TO TITLE FIELD IN DOCUMENTS:
    intitle:"American Dietetic Association"
    intitle:"affirmative action"
  In Google, use intitle:"affirmative action"

Are you looking for terms with many possible endings?
• Feminism, feminist, feminine
• Children, child
  Some systems search word ending variants automatically (stemming). See the specific instructions for each of the recommended search tools. To be sure, use OR searches: children OR child

51 How to search the Web: Commands
Each entry gives the command, how to type it, and which engines support it.

Must Include Term
  + : All
Must Exclude Term
  - : All
Must Include Phrase
  " " : All
Match All Terms
  Automatic at all
Match Any Terms
  Via Advanced Search: AllTheWeb, AltaVista, Google, Lycos, MSN Search, Teoma, Yahoo (HotBot offers it, but it failed to work when tested)
  OR: AltaVista, AOL Search, Ask Jeeves, Google, HotBot, MSN Search, Teoma, Yahoo (must be done in ALL CAPS)
  AllTheWeb, Lycos (only works for two words)

52 How to search the Web: Commands
Title Search (updated March 11, 2003)
  title: AltaVista, AllTheWeb, Inktomi
  intitle: Google, Teoma
  allintitle: Google
Site Search
  host: AltaVista
  site: Excite, Google (Netscape, Yahoo)
  url.host: AllTheWeb, Lycos (for AllTheWeb results only)
  domain: Inktomi (HotBot, iWon, LookSmart)
  none: AOL, Direct Hit, HotBot, LookSmart, Lycos, MSN, Netscape, Northern Light, Open Directory, Yahoo

53 How to search the Web: Commands
URL Search
  url: AltaVista, Excite, Northern Light
  url.all: AllTheWeb, Lycos (for AllTheWeb results only)
  allinurl: and inurl: Google
  originurl: Inktomi (AOL, GoTo, HotBot)
  u: Yahoo
  none: AOL, Direct Hit, HotBot, LookSmart, MSN. Not yet updated, but may be still correct: Open Directory
Link Search
  link: AltaVista, Google, Northern Light
  linkdomain: Inktomi (AOL, HotBot, iWon, MSN) (NOTE: measures links to entire domains)
  link.all: AllTheWeb, Lycos (for AllTheWeb results only)
  none: AOL, Direct Hit, Excite, HotBot, LookSmart, Northern Light. Not yet updated, but may be still correct: Netscape, Yahoo (n/a)

54 How to search the Web: Commands
Wildcard
  * : AltaVista, Inktomi (iWon), Northern Light. Not yet updated, but may be still correct: Yahoo
  ? : AOL Search, Inktomi (iWon)
  % : Northern Light
  none: AllTheWeb, Direct Hit, Excite, Google, HotBot, LookSmart, Lycos, MSN (MSN's help says it offers a wildcard, but it failed during testing)
Anchor Search
  anchor: AltaVista
  none: AllTheWeb, AOL Search, Direct Hit, Excite, Google, Inktomi, HotBot, Lycos

55 How to search the Web: Aids
Each entry gives the feature and the engines that offer it.
  Related Searches: AltaVista, AllTheWeb, Excite, HotBot, Lycos, MSN, Yahoo. Not yet updated, but may be still correct: iWon
  Clustering: AltaVista, AllTheWeb, Excite, Google, HotBot, MSN, Northern Light
  Find Similar: AltaVista, AOL Search, Google
  Stemming: AOL Search, Direct Hit, HotBot, Inktomi (HotBot, MSN)
  Search Within: AltaVista, Google, HotBot, Lycos
  Spidered Version: Google
  Search By Language: AltaVista, AllTheWeb, Excite, Google, HotBot, Lycos, MSN, Northern Light
  Page Translation: AltaVista, Google, Lycos
  Porn Filter: AltaVista, AllTheWeb, Google
  Porn Warning: HotBot, MSN, Northern Light

56 How to search the Web: Aids
Each entry gives the feature and the engines that support it.
  Number Of Listings Shown (10 unless noted): AltaVista, AllTheWeb, AOL Search (5), Direct Hit, Excite, Google, HotBot, LookSmart (15), Lycos, MSN (15), Northern Light. Not yet updated, but may be still correct: iWon, Netscape, Yahoo (20)
  Ability To Increase Number Of Listings?: AltaVista, AllTheWeb, Excite, Google, HotBot, MSN. Not yet updated, but may be still correct: Yahoo
  See 20 Results: AltaVista, AllTheWeb, Excite, Google, HotBot, MSN. Not yet updated, but may be still correct: Yahoo
  See 50 Results: AltaVista, AllTheWeb, Excite, Google, HotBot, MSN. Not yet updated, but may be still correct: Yahoo
  See 100 Results: AllTheWeb, Google, HotBot. Not yet updated, but may be still correct: Yahoo
  Sort By Date: MSN Search, Northern Light
  Date Range: AltaVista, Google, HotBot, MSN, Northern Light. Not yet updated, but may be still correct: iWon, Yahoo
  Date Displayed?: AltaVista, HotBot (for Inktomi results), Northern Light
  Display Titles Only?: AltaVista, Excite, HotBot (URLs only option), MSN
  Other Major Customize Options: AltaVista, AllTheWeb, Google

57 How to search the Web: Operators
Each entry gives the operator, how to type it, and which engines support it.
Or
  OR: AltaVista, AOL Search, Excite, Google, Inktomi (HotBot, MSN), Lycos, Northern Light
  None: AllTheWeb, Direct Hit, LookSmart. Not yet updated, but may be still correct: Yahoo
And
  AND: AltaVista, AOL Search, Excite, Inktomi (HotBot, MSN), Lycos, Northern Light
  None: AllTheWeb, Direct Hit, Google, LookSmart. Not yet updated, but may be still correct: Yahoo
Not
  NOT: AOL Search, Excite, Inktomi (HotBot), Lycos, Northern Light
  AND NOT: AltaVista, Inktomi (MSN). Not yet updated, but may be still correct: Netscape
  None: AllTheWeb, Direct Hit, Google, LookSmart. Not yet updated, but may be still correct: Yahoo
Nesting
  ( ): AltaVista, AOL Search, Excite, Inktomi (MSN), Northern Light
  None: AllTheWeb, Direct Hit, Google, Inktomi (HotBot), LookSmart, Lycos. Not yet updated, but may be still correct: Yahoo
Near
  NEAR: AltaVista (10 words), AOL Search (specify number), Lycos (25 words)
  None: AllTheWeb, Direct Hit, Google, Inktomi (HotBot, MSN), LookSmart
Notes
• At AltaVista, Boolean only works on the advanced search page.
• At Excite, Google & MSN, Boolean commands must be in UPPERCASE.
• At Inktomi-powered services, set the menu to "Boolean".

58 An example: Google
• Google = [googol] = 10^100
• Goal at its creation (1997): to improve on the existing search engines in search quality. For example, of the four main search engines of the time, only one could find itself (return its own page for a query on its own name).
• It aims for very high precision at the expense of recall.
• It uses both the text and the structure of links as an improvement over other systems.

59 An example: Google
Characteristics:
• It uses the link structure to compute the ranking of each page, through a measure called PageRank.
• It uses links to improve search results.
• Link information is recorded both for the page that contains the link and for the linked page (in some cases the link text describes the linked page better than that page's own contents).
• Keeps information on the location of terms. It can therefore support proximity searches and use proximity when computing relevance.
• Keeps information on the typography and display of characters (bold, quotation marks, ...) to determine the importance of a term.
• Keeps every page it analyzes in compressed form (HTML content only).

An example: Google
PageRank
An objective measure of the importance of a page, based on the number of references to it in other pages. It takes into account:
• The number of references to that page.
• The quality of the pages that reference that page.
• The total number of references contained in each page that references that page.

An example: Google
Factors considered:
• The Web is not a controlled collection.
• Improving search need not be limited to improving the query (users can ask whatever they want, however they want).
• There is no control over what people put on the Web.
• Commercial companies exploit the way search engines work to manipulate them and obtain high rankings.

An example: Google
Architecture (figure not reproduced)

An example: Google
How it works
• The URLServer sends URLs to the crawlers.
• The pages fetched are sent to the StoreServer, which stores them (compressed) in the Repository.
• The Indexer reads the repository, decompresses the documents and parses them.
    • It converts each document into a set of word occurrences called hits.
    • Hits record the word, its position in the document, font size and capitalization.
    • It distributes the hits into the barrels, creating the partially sorted forward index.
    • It stores information about the links found in the pages.
• The URLResolver converts relative addresses into absolute ones and generates the document identifiers. It builds the links database used to compute PageRank.
• The Sorter re-sorts the information in the barrels by word identifier instead of by document identifier, which produces the inverted file.
• The Searcher is responsible for answering queries.

An example: Google
Data structures:
• BigFiles: virtual files.
• Repository: compressed documents.
• Document index.
• Lexicon: complete list of words.
• Hit lists.
• Forward index: partially sorted (barrels).
• Inverted index: fully sorted (barrels).

An example: Google
The indexing process:
• Parsing: many problems arise from syntax errors and content types.
• Indexing documents into the barrels: parsing produces documents that are encoded into the barrels.
• Sorting: the inverted index is generated by sorting on word identifiers.

An example: Google
The search process
• Parse the query.
• Convert words into identifiers.
• Find the start of the document list for each word.
• Find the documents that contain all the words.
• Compute the rank of each document.
• Sort and show the first k documents.

An example: Google
Some statistics (1997)

Storage statistics
    Total size of fetched pages: 147.8 GB
    Compressed repository: 53.5 GB
    Short inverted index: 4.1 GB
    Full inverted index: 37.2 GB
    Lexicon: 293 MB
    Temporary anchor data (not in total): 6.6 GB
    Document index incl. variable width data: 9.7 GB
    Links database: 3.9 GB
    Total without repository: 55.2 GB
    Total with repository: 108.7 GB

Web page statistics
    Number of web pages fetched: 24 million
    Number of URLs seen: 76.5 million
    Number of email addresses: 1.7 million
    Number of 404s: 1.6 million

Evaluating pages
• Search engines retrieve information but (for now) give no data on the quality of the pages found.
• In some cases the ranking of query results tries to take page quality into account (PageRank, Google), but there are no objective criteria for assessing it.
• The pages found need to be evaluated objectively. This requires:
    • Using techniques to identify the characteristics of the pages and the information needed.
    • Thinking critically about the content and asking a series of questions to decide on its quality.

1. What can the URL tell you?

Questions to ask: Is it somebody's personal page?
• Read the URL carefully:
• Look for a personal name (e.g., jbarker or barker) following a tilde ( ~ ), a percent sign ( % ), or the words "users," "members," or "people."
• Is the server a commercial ISP or other provider mostly of web page hosting (like aol.com or geocities.com)?
Implications: Personal pages are not necessarily "bad," but you need to investigate the author very carefully. For personal pages, there is no publisher or domain owner vouching for the information in the page.

Questions to ask: What type of domain does it come from? (educational, nonprofit, commercial, government, etc.)
• Is the domain appropriate for the content?
• Government sites: look for .gov, .mil, .us, or other country code
• Educational sites: look for .edu
• Nonprofit organizations: look for .org
• If from a foreign country, look at the country code and read the page to be sure who published it.
Implications: Look for appropriateness and fit. What kind of information source do you think is most reliable for your topic?

Questions to ask: Is it published by an entity that makes sense? Who "published" the page?
• In general, the publisher is the agency or person operating the "server" computer from which the document is issued.
• The server is usually named in the first portion of the URL (between http:// and the first /).
• Have you heard of this entity before?
• Does it correspond to the name of the site? Should it?
Implications: You can rely more on information that is published by the source:
• Look for New York Times news from www.nytimes.com
• Look for health information from any of the agencies of the National Institutes of Health on sites with nih somewhere in the domain name.

2. Scan the perimeter of the page

Questions to ask: Who wrote the page?
• Look for the name of the author, or the name of the organization, institution, agency, or other entity responsible for the page.
• An e-mail contact is not enough.
• If there is no personal author, look for an agency or organization that claims responsibility for the page.
• If you cannot find this, locate the publisher by truncating back the URL (see technique above). Does this publisher claim responsibility for the content? Does it explain why the page exists in any way?
Implications: Web pages are all created with a purpose in mind by some person or agency or entity. They do not simply "grow" on the web like mildew grows in moist corners. You are looking for someone who claims accountability and responsibility for the content. An e-mail address with no additional information about the author is not sufficient for assessing the author's credentials. If this is all you have, try e-mailing the author and asking politely for more information about him/her.

Questions to ask: Is the page dated? Is it current enough?
• Is it "stale" or "dusty" information on a time-sensitive or evolving topic?
• CAUTION: Undated factual or statistical information is no better than anonymous information. Don't use it.
Implications: How recent the date needs to be depends on your needs. For some topics you want current information. For others, you want information put on the web near the time it became known. In some cases, the importance of the date is to tell you whether the page author is still maintaining an interest in the page, or has abandoned it.

Questions to ask: What are the author's credentials on this subject?
• Does the purported background or education look like that of someone who is qualified to write on this topic?
• Might the page be by a hobbyist, self-proclaimed expert, or enthusiast?
• Is the page merely an opinion? Is there any reason you should believe its content more than any other page?
• Is the page a rant, an extreme view, possibly distorted or exaggerated?
• If you cannot find strong, relevant credentials, look very closely at documentation of sources (next section).
Implications: Anyone can put anything on the web for pennies in just a few minutes. Your task is to distinguish between the reliable and questionable. Many web pages are opinion pieces offered in a vast public forum. You should hold the author to the same degree of credentials, authority, and documentation that you would expect from something published in a reputable print resource (book, journal article, good newspaper).

3. Look for indicators of quality information

Questions to ask: Are sources documented with footnotes or links?
• Where did the author get the information?
• As in published scholarly/academic journals and books, you should expect documentation.
• If there are links to other pages as sources, are they to reliable sources?
• Do the links work?
Implications: In scholarly/research work, the credibility of most writings is proven through footnote documentation or other means of revealing the sources of information. Saying what you believe without documentation is not much better than just expressing an opinion or a point of view. What credibility does your research need? An exception can be journalism from highly reputable newspapers. But these are not scholarly. Check with your instructor before using this type of material. Links that don't work or are to other weak or fringe pages do not help strengthen the credibility of your research.

Questions to ask: If reproduced information (from another source), is it complete, not altered, not fake or forged?
• Is it retyped?
  If so, it could easily be altered.
• Is it reproduced from another publication?
• Are permissions to reproduce and copyright information provided?
• Is there a reason there are no links to the original source if it is online (instead of reproducing it)?
Implications: You may have to find the original to be sure a copy of something is not altered and is complete. Look at the URL: is it from the original source? If you find a legitimate article from a reputable journal or other publication, it should be accompanied by the copyright statement and/or permission to reprint. If it is not, be suspicious. Try to find the source. If the URL of the document is not to the original source, it is likely that it is illegally reproduced, and the text could be altered, even with the copyright information present.

Questions to ask: Are there links to other resources on the topic?
• Are the links well chosen, well organized, and/or evaluated/annotated?
• Do the links work?
• Do the links represent other viewpoints?
• Do the links (or absence of other viewpoints) indicate a bias?
Implications: Many well developed pages offer links to other pages on the same topic that they consider worthwhile. They are inviting you to compare their information with other pages. Pages that link to opposing viewpoints as well as their own are more likely to be balanced and unbiased than pages that offer only one view. Is anything left unsaid that could be said, and perhaps would be said if all points of view were represented? Always look for bias. Especially when you agree with something, check for bias.

4. What do others say?

Questions to ask: Who links to the page?
• Are there many links?
• What kinds of sites link to it?
• What do they say?
• Are any of them directories? Try looking at what directories say.
Implications: Sometimes a page is linked to only by other parts of its own site (not much of a recommendation). Sometimes a page is linked to by its fan club, and by detractors. Read both points of view. If a page or its site is in a bona fide directory, think about whether there is much critical evaluation of the links in the directory.

Questions to ask: Is the page listed in one or more reputable directories or pages?
Implications: Good directories include a tiny fraction of the web, and inclusion in a directory is therefore noteworthy. But read what the directory says! It may not be 100% positive.

Questions to ask: What do others say about the author or responsible authoring body?
Implications: "Googling someone" (new term for this) can be revealing. Be sure to consider the source. If the viewpoint is radical or controversial, expect to find detractors. Think critically about all points of view.

5. Does it all add up?

Questions to ask (So what?): Why was the page put on the web?
• Inform, give facts, give data?
• Explain, persuade?
• Sell, entice?
• Share?
• Disclose?
Implications: These are some of the reasons to think of. The web is a public place, open to all. You need to be aware of the entire range of human possibilities of intentions behind web pages.

Questions to ask: Might it be ironic? Satire or parody?
• Think about the "tone" of the page.
• Humorous? Parody? Exaggerated? Overblown arguments?
• Outrageous photographs or juxtaposition of unlikely images?
• Arguing a viewpoint with examples that suggest that what is argued is ultimately not possible.
Implications: It is easy to be fooled, and this can make you look foolish in turn.

Questions to ask: Is this as good as resources I could find if I used the library, or some of the web-based indexes available through the library, or other print resources?
• Are you being completely fair? Too harsh? Totally objective? Requiring the same degree of "proof" you would from a print publication?
• Is the site good for some things and not for others?
• Are your hopes biasing your interpretation?
Implications: What is your requirement (or your instructor's requirement) for the quality and reliability of your information? In general, published information is considered more reliable than what is on the web. But many, many reputable agencies and publishers make great stuff available by "publishing" it on the web. This applies to most governments, most institutions and societies, many publishing houses and news sources. But take the time to check it out.

Evaluating search engines
Index creation
• How is the index compiled?
• Size (number of pages indexed)
• Coverage (http, ftp, www, news, …)
• Are there special inclusion criteria?
• Does the spider have access to password-protected sites?
• Where does the engine not search?
• Which elements of the pages are indexed?
• Is there vocabulary control?
• Are stopwords used?
• Update frequency
• Time taken to index a submitted page
• Pages indexed per day
• Dead-link checking

Evaluating search engines
Search capability
• Where does it search (what is in the index)?
• Searching several places at once
• Handling of stopwords
• Range of search functions
• Search refinement
• Advanced options
• Use of fields
• Use of Boolean logic (yes/no, easy/hard, …)
• Handling of synonyms / use of thesauri
• Can the search be saved?

Evaluating search engines
Quality of answers
• Response time
• Number of results
• Quality of the hit-list summary (host, reason, link, ranking, …)
• Detail of the relevance criterion used
• Duplicate removal
• Handling of results (display, sorting, export, find-similar, …)
• Saving search results
• Methodological analysis (precision, recall, relevance, coverage, reliability, usefulness, novelty, …)

Evaluating search engines
Usability
• Interface (clarity, simplicity, …)
• Readability (font size, text layout, paragraph arrangement, …)
• Ease of use (navigation)
• Online help
• Query construction process
• Customization capability
• Saving preferences
• Load and response times
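
The methodological analysis listed under quality of answers names precision and recall (exhaustividad) among its criteria. As a minimal sketch, assuming that a set of relevant documents has already been judged by hand for a query, the two measures could be computed as below. The function names and the sample document identifiers are hypothetical and do not come from the slides.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for the results of one query.

    precision = relevant documents retrieved / documents retrieved
    recall    = relevant documents retrieved / relevant documents in total
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = retrieved_set & relevant_set
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


def precision_at_k(ranked, relevant, k):
    """Precision computed over only the first k results of a ranked list."""
    top_k = ranked[:k]
    if not top_k:
        return 0.0
    relevant_set = set(relevant)
    return sum(1 for doc in top_k if doc in relevant_set) / len(top_k)


# Hypothetical data: a ranked answer list and hand-made relevance judgments.
ranked_results = ["d3", "d7", "d1", "d9", "d4"]
relevant_docs = {"d1", "d3", "d5", "d8"}

p, r = precision_recall(ranked_results, relevant_docs)
print(f"precision={p:.2f} recall={r:.2f}")
print(f"precision@2={precision_at_k(ranked_results, relevant_docs, 2):.2f}")
```

An engine tuned for very high precision at the expense of recall would score well on the first figure and poorly on the second. Since users rarely read past the first page of results, precision over the top k results is often the more telling number when comparing engines.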