Computer Science 1000
Information Searching I

Redistribution of these slides is strictly prohibited without permission.

World Wide Web – The Basics
- our next topic examines how to find information on the web
- we consider a few basic terms here (which you’re probably familiar with):
  - page/web page
  - link/hyperlink
  - site/web site
- later in the semester, we will revisit web technologies in much more detail

World Wide Web
- a system of linked documents accessed via the internet
- often simply referred to as the web
- sometimes used interchangeably with the internet, but this isn’t exactly correct
  - the internet is the global network of interconnected devices (computers, routers, etc.) that exchange data
  - the web refers to the documents being stored, the software that serves and receives them, and the protocols used for transmission

Web Page
- a document stored and accessed on the web
- identified by a unique URL (Uniform Resource Locator)
- often referred to simply as a page
- today’s web pages are very rich in content:
  - text
  - images
  - hyperlinks
  - videos

Web Site
- a collection of related web pages on the internet
- typically belonging to a common organization or event
- example: all pages served by the University of Lethbridge make up its website

Hyperlink
- a part of a web page that refers to a different location
- often just called a link
- hyperlinks can reference:
  - another place on the same page
  - another web page
- hypertext: text containing hyperlinks

The Age of Information
- the computer, internet, and web have changed how we interact with information
- information storage
  - the amount of available information is significantly greater than even a generation ago, and it is growing rapidly
- information transmission
  - large amounts of information are available with a single mouse click, and transfer almost immediately

Information Age – Rapid Onset
- the situation has transformed tremendously in your lifetimes
- consider the global information capacity:
  - in 1986: 2.6 exabytes (< 1 CD per person)
  - in 1993: 15.8 exabytes
  - in 2000: 54.5 exabytes
  - in 2007: 295 exabytes (61 CDs per person)
- how does one successfully navigate such a mountain of digital content?
- source: Hilbert and López. The World’s Technological Capacity to Store, Communicate, and Compute Information. Science 332(6025), 2011.

Information Access
- even in pre-internet days, there was a wealth of information
  - large-scale: library
  - medium-scale: encyclopaedia set
  - small-scale: newspaper
- strategies developed to manage information:
  - categories
  - hierarchies
  - indices

Classification
- "systematic arrangement in groups or categories according to established criteria" (Merriam-Webster)
- in other words, the information is categorized according to relevant features
- consider our course notes:
  - terminology (4 sets of slides)
  - information searching (2-3 sets of slides)
  - etc.

Classification
- classification is not specific to digital information
- library classification:
  - Dewey Decimal Classification
  - Library of Congress Classification
- newspaper classification

Classification
- the level of detail of a classification leads to tradeoffs
- consider a coarse level of detail
  - e.g. taxonomy of living organisms: classify organisms according to Domain (Archaea, Bacteria, Eukarya)
  - advantage: small number of groups
  - disadvantage: each group is massive
- consider a fine level of detail
  - e.g. taxonomy of living organisms: classify organisms according to Genus (Canis, Felis)
  - advantage: each group is reasonably small
  - disadvantage: massive number of groups
- solution: hierarchy

Hierarchy
- a decomposition of classifications according to detail
- hierarchies contain levels
  - at the top (root) level, there is typically a small number of broad categories
  - each category is decomposed into smaller categories
- a classification group is defined by its categorization at each level

Hierarchy
- organism taxonomy hierarchy:
  - each Domain is categorized into Kingdoms
  - Eukarya Domain, Kingdoms: Animalia, Fungi, Plantae, Protista
  - each Kingdom is classified into Phyla
  - each Phylum is classified into Classes
  - and so on ...
- figure source: http://ag.arizona.edu/pubs/garden/mg/entomology/intro.html

Hierarchy
- an object is still categorized, but by multiple levels (instead of one)
- figure source: http://schoolworkhelper.net/scientific-taxonomy/

Hierarchy
- facilitates efficient searching through exclusion
- example (text):
  - suppose you have a collection of a million items
  - these items are organized into 10 equal-sized groups (100,000 items each)
  - each top-level group is also organized into 10 equal subgroups (10,000 items each)
  - choosing the first category eliminates 900,000 items
  - choosing the second category eliminates a further 90,000 items
  - and so on ...

Hierarchy
- hierarchies are very popular
- consider our previous examples:
  - Library of Congress Classification
  - Newspaper

Index
- a detailed list of words, phrases, and/or topics indicating place of occurrence
- in essence, it maps keywords of interest to their location
  - e.g. a page number
- a bottom-up approach to information organization
  - as opposed to the top-down structure of a hierarchy
- particularly popular in printed material
  - books, magazines, volumes, etc.

Index - Example
[Figure: example index]

Index
- typically used on a small scale
  - books and volumes vs. libraries
- made efficient through an organizational scheme
  - alphabetical is very common
- some overlap with hierarchies
  - e.g. subtopics

Finding Information – The Web
- as discussed, the amount of information on the web is immense
- many of the discussed techniques for information finding also apply digitally
  - classification/hierarchies
  - indexing

Classification
- many commercial websites have a classification structure
  - navigation bars

Hierarchies
- many websites, especially large ones, will also arrange their categories in hierarchical fashion

Partition
- a hierarchy where every object occurs only once
- some hierarchies are necessarily partitions
  - organism taxonomy: every species appears only once
  - e.g. a particular book will only occur at one point in a library classification
- however, in some cases a partition is not natural
  - an object might have an inherent fit in more than one classification

Partitions
- digital content is often stored using overlapping hierarchies (non-partition)
  - potentially more intuitive
  - with hyperlinking, it’s easy to accomplish (two links to the same page)
- example (text): Three Books for Frugal Fashionistas was stored on NPR’s website under:
  - Home > Arts & Life > Books > Three Books for Frugal Fashionistas
  - Home > Listen > Latest Program > Three Books for Frugal Fashionistas

Indexes for the Web
- unlike hierarchies, indexes are much less common on individual websites
  - site maps might be considered an index of sorts
- however, there are analogous technologies to indexes that pertain to the web as a whole
  - Search Engines!
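
As a rough illustration of the index idea above (mapping keywords of interest to their locations), the following Python sketch builds a tiny keyword index over three made-up pages. The page names, page text, and the function names `build_index` and `lookup` are assumptions invented for this example. Real search engines are far more elaborate, so treat this only as a minimal sketch of the keyword-to-location mapping.

```python
from collections import defaultdict

# Hypothetical mini "web": page names and text are invented for illustration.
pages = {
    "taxonomy.html": "organism taxonomy is a hierarchy of categories",
    "library.html": "library classification is a hierarchy of subjects",
    "index.html": "an index maps keywords to locations",
}


def build_index(pages):
    """Map each word to the sorted list of pages that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    # Convert the sets to sorted lists so lookups return a stable order.
    return {word: sorted(urls) for word, urls in index.items()}


def lookup(index, word):
    """Return the pages containing the word, or an empty list if none do."""
    return index.get(word.lower(), [])


index = build_index(pages)
print(lookup(index, "hierarchy"))  # pages that mention "hierarchy"
print(lookup(index, "keywords"))   # pages that mention "keywords"
```

Much like the index at the back of a book, the lookup goes straight from a keyword to the places it occurs, without walking down through any category hierarchy first.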