Finding Information on the Internet
Session 1: Searching the World Wide Web

Dr. Hesham Azmi
Program of Information Science
Dept. of Mass Comm. & Information Science
Tuesday 17/1/2006

What is the Internet?

The Internet is a computer network made up of thousands of networks worldwide. No one knows exactly how many computers are connected to the Internet. It is certain, however, that they number in the millions and that the number is growing.

No one is in charge of the Internet. There are organizations which develop technical aspects of this network and set standards for creating applications on it. All computers on the Internet communicate with one another using the Transmission Control Protocol/Internet Protocol suite, abbreviated to TCP/IP.

The Internet consists primarily of a variety of access protocols. Many of these protocols feature programs that allow users to search for and retrieve material made available by the protocol.

COMPONENTS OF THE INTERNET

  WORLD WIDE WEB
  E-MAIL
  TELNET
  FTP
  E-MAIL DISCUSSION GROUPS
  USENET NEWS
  FAQ, RFC, FYI
  CHAT & INSTANT MESSAGING

Information on the Internet

The Internet provides access to a wealth of information on countless topics contributed by people throughout the world. On the Internet, a user has access to a wide variety of services: vast information sources, electronic mail, file transfer, interest group membership, interactive collaboration, multimedia displays, and more.

The Internet is not a library in which all available items are identified and can be retrieved through a single catalogue. In fact, no one knows how many individual files reside on the Internet. The number runs into a few billion and is growing at a rapid pace.

The Internet is a self-publishing medium. This means that anyone with little or no technical skill and access to a host computer can publish on the Internet. Also be aware that the addresses of Internet sites frequently change. Web sites can disappear altogether. Do not expect stability on the Internet.
One of the most efficient ways of conducting research on the Internet is to use the World Wide Web. Since the Web includes most Internet protocols, it offers access to a great deal of what is available on the Internet.

WORLD WIDE WEB

The World Wide Web (abbreviated as the Web or WWW) is a system of Internet servers that supports hypertext to access several Internet protocols on a single interface. Almost every protocol type available on the Internet is accessible on the Web. This includes email, FTP, Telnet, and Usenet News. In addition to these, the World Wide Web has its own protocol: Hypertext Transfer Protocol, or HTTP.

The Web gathers together these protocols into a single system. Because of the Web's ability to work with multimedia and advanced programming languages, the Web is the fastest-growing component of the Internet.

HOW TO FIND INFORMATION ON THE WEB

There are a number of basic ways to access information on the Web:

  Go directly to a site if you have the address
  Browse
  Conduct a search using a Web search engine
  Explore a subject directory
  Explore the information stored in live databases on the Web, known as the "deep Web"
  Join an e-mail discussion group or Usenet newsgroup

GO DIRECTLY TO A SITE IF YOU HAVE THE ADDRESS

URL stands for Uniform Resource Locator. The URL specifies the Internet address of the electronic document. Every file on the Internet, no matter what its access protocol, has a unique URL. Web browsers use the URL to retrieve the file from the host computer and the directory in which it resides. This file is then downloaded to the user's computer and displayed on the monitor.

This is the format of the URL:

  protocol://host.second-level-domain.top-level-domain/path/filename

READING WEB ADDRESSES

First, you need to know how to read a web address, or URL (Uniform Resource Locator).
Let's look at the URL for this tutorial:

  http://www.sc.edu/beaufort/library/pages/bones/lesson1.shtml

Here's what it all means:

  "http" means hypertext transfer protocol and refers to the format used to transfer and deal with information
  "www" stands for World Wide Web and is the general name for the host server that supports text, graphics, sound files, etc. (It is not an essential part of the address, and some sites choose not to use it)
  "sc" is the second-level domain name and usually designates the server's location, in this case, the University of South Carolina
  "edu" is the top-level domain name (see below)
  "beaufort" is the directory name
  "library" is the sub-directory name
  "pages" and "bones" are the folder and sub-folder names
  "lesson1" is the file name
  "shtml" is the file type extension. It marks a hypertext mark-up language (HTML) page, the language the browser reads. The added "s" indicates that the server will scan the page for commands that require additional insertion (server-side includes) before the page is sent to the user.

Top Level Domains

Only a few top-level domains are currently recognized, but this is changing. Here is a list of the domains generally accepted by all:

  .edu -- educational site (usually a university or college)
  .com -- commercial business site
  .gov -- U.S. governmental/non-military site
  .mil -- U.S. military sites and agencies
  .net -- networks, internet service providers, organizations
  .org -- U.S. non-profit organizations and others
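
As an illustration of the URL breakdown for the tutorial address, here is a minimal Python sketch that splits a URL into the parts named on the slide. The function name split_url and the dictionary keys are hypothetical labels chosen for this example. The sketch also assumes a simple three-part host name (host.second-level-domain.top-level-domain); addresses with more or fewer parts would need extra handling.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def split_url(url):
    """Split a URL into the parts described on the slide.

    Assumption: the host name has the simple form
    host.second-level-domain.top-level-domain (e.g. www.sc.edu).
    """
    parts = urlparse(url)
    labels = parts.hostname.split(".")
    path = PurePosixPath(parts.path)
    return {
        "protocol": parts.scheme,
        # "www" is optional, so the host label may be missing
        "host": labels[0] if len(labels) > 2 else "",
        "second_level_domain": labels[-2],
        "top_level_domain": labels[-1],
        # every folder between the host name and the file name
        "directories": list(path.parent.parts[1:]),
        "file_name": path.stem,
        "file_extension": path.suffix.lstrip("."),
    }


if __name__ == "__main__":
    url = "http://www.sc.edu/beaufort/library/pages/bones/lesson1.shtml"
    for name, value in split_url(url).items():
        print(f"{name}: {value}")
```

For the tutorial address, the directories entry holds beaufort, library, pages and bones in that order, matching the directory, sub-directory, folder and sub-folder names listed above.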
Additional Top Level Domains

In mid November 2000, the Internet Corporation for Assigned Names and Numbers (ICANN) voted to accept seven new suffixes, which are expected to be made available to users:

  .aero -- restricted use by air transportation industry
  .biz -- general use by businesses
  .coop -- restricted use by cooperatives
  .info -- general use by both commercial and noncommercial sites
  .museum -- restricted use by museums
  .name -- general use by individuals
  .pro -- restricted use by certified professionals and professional entities

CONDUCT A SEARCH USING A WEB SEARCH ENGINE

An Internet search engine allows the user to enter keywords relating to a topic and retrieve information about Internet sites containing those keywords. Search engines located on the Web have become quite popular as the Web itself has become the Internet's environment of choice. Web search engines have the advantage of offering access to a vast range of information resources located on the Internet. Web search engines tend to be developed by private companies, though most of them are available free of charge.

Search Engines

A Web search engine service consists of three components:

  Spider: Program that traverses the Web from link to link, identifying and reading pages
  Index: Database containing a copy of each Web page gathered by the spider
  Search engine mechanism: Software that enables users to query the index and that usually returns results in relevance-ranked order

With most search engines, you fill out a form with your search terms and then ask that the search proceed. The engine searches its index and generates a page with links to those resources containing some or all of your terms. These resources are usually presented in ranked order.
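
The three components described above (spider, index, search mechanism) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any real search engine is built: the three-page "Web" in PAGES is invented for the example, the index stores word counts per page instead of a full copy of each page, and the names spider, build_index and search are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical three-page "Web": each page has some text and outgoing links.
PAGES = {
    "home": {"text": "library search tutorial", "links": ["engines", "directories"]},
    "engines": {"text": "search engines index the web", "links": ["home"]},
    "directories": {"text": "subject directories are built by people", "links": []},
}


def spider(start):
    """Spider: follow links from page to page and return the pages visited."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in PAGES[page]["links"]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


def build_index(page_names):
    """Index: map each word to the pages containing it, with occurrence counts."""
    index = defaultdict(dict)
    for name in page_names:
        for word in PAGES[name]["text"].lower().split():
            index[word][name] = index[word].get(name, 0) + 1
    return index


def search(index, query):
    """Search mechanism: pages containing some or all terms, best match first."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for page, count in index.get(term, {}).items():
            scores[page] += count
    # Higher score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda page: (-scores[page], page))


if __name__ == "__main__":
    index = build_index(spider("home"))
    print(search(index, "search engines"))
```

Here the page "engines" ranks above "home" for the query "search engines" because it contains both terms, while "home" contains only one.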
Term ranking was once a popular ranking method, in which a document appears higher in your list of results if your search term appears many times, near the beginning of the document, close together in the document, in the document title, etc. These may be thought of as first generation search engines.

A more sophisticated development in search engine technology is the ordering of search results by concept, keyword, site, links or popularity. Engines that support these features may be thought of as second generation search engines. These engines offer improvements in the ranking of results.

It is important to stress that, by their very nature, search engines cannot index the entire content of the 'Net. Since the content of the Internet changes continuously, there will always be a delay in indexing the Net. Google is sometimes cited as a possible theoretical exception because it stores a snapshot (a cached copy) of each page, but that snapshot is taken when the page is crawled, so it can also be out of date. In practice it is estimated that no search engine indexes more than about 30% of the Web's content.

HOW TO FORMULATE QUERIES

1. Identify your concepts
When conducting any database search, you need to break down your topic into its component concepts.

2. List keywords for each concept
Once you have identified your concepts, you need to list keywords which describe each concept. Some concepts may have only one keyword, while others may have many.

3. Specify the logical relationships among your keywords
Once you know the keywords you want to search, you need to establish the logical relationships among them. The formal name for this is Boolean logic. Boolean logic allows you to specify the relationships among search terms by using any of three logical operators: AND, OR, NOT.
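
The three Boolean operators can be pictured as operations on sets of files. The Python sketch below is only an illustration: the file names and their contents in FILES are invented, and the helper names (containing, boolean_and, boolean_or, boolean_not) are hypothetical.

```python
# Hypothetical mini-collection: file name -> the words that file contains.
FILES = {
    "a.txt": {"cowboys", "wild", "west"},
    "b.txt": {"cowboys", "football", "dallas"},
    "c.txt": {"football", "scores"},
}


def containing(term):
    """Return the set of files that contain the given term."""
    return {name for name, words in FILES.items() if term in words}


def boolean_and(a, b):
    """A AND B: files containing both terms (set intersection)."""
    return containing(a) & containing(b)


def boolean_or(a, b):
    """A OR B: files containing at least one of the terms (set union)."""
    return containing(a) | containing(b)


def boolean_not(a, b):
    """A NOT B: files containing term A but not term B (set difference)."""
    return containing(a) - containing(b)


if __name__ == "__main__":
    print(sorted(boolean_and("cowboys", "football")))
    print(sorted(boolean_or("cowboys", "football")))
    print(sorted(boolean_not("cowboys", "football")))
```

AND narrows a search, OR broadens it, and NOT removes unwanted results, which is why the choice of operator matters when combining keywords.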
Simple vs. Advanced Search

Simple search
  Very broad: retrieves thousands of irrelevant files

Advanced search
  Narrowing the search
  Boolean
  Phrase searching
  Field search
  Truncation

Boolean Operators

  A AND B (files containing both terms)
  A OR B (files containing at least one of the terms)
  A NOT B (files containing term A but not term B)

QUICK TIPS

NOTE: These tips will work with most search engines in their basic search option.

Use the plus (+) and minus (-) signs in front of words to force their inclusion and/or exclusion in searches.
  EXAMPLE: +meat -potatoes
  (NO space between the sign and the keyword)

Use double quotation marks (" ") around phrases to ensure they are searched exactly as is, with the words side by side in the same order.
  EXAMPLE: "bye bye miss american pie"
  (Do NOT put quotation marks around a single word.)

Put your most important keywords first in the string.
  EXAMPLE: dog breed family pet choose

Type keywords and phrases in lower case to find both lower and upper case versions. Typing capital letters will usually return only an exact match.
  EXAMPLE: president retrieves both president and President

Use truncation (or stemming) and wildcards (e.g., *) to look for variations in spelling and word form.
  EXAMPLE: librar* returns library, libraries, librarian, etc.
  EXAMPLE: colo*r returns color (American spelling) and colour (British spelling)

Know whether or not the search engine you are using maintains a stop word list. If it does, don't use known stop words in your search statement. Also, consider trying your search on another engine that does not recognize stop words.

Combine phrases with keywords, using the double quotes and the plus (+) and/or minus (-) signs.
  EXAMPLE: +cowboys +"wild west" -football -dallas
  (In this case, if you use a keyword with a +sign, you must put the +sign in front of the phrase as well. When searching for a phrase alone, the +sign is not necessary.)
When searching within a document for the location of your keyword(s), use the "find" command on that page.

Know the default (basic) settings your search engine uses (OR or AND). This will have an effect on how you configure your search statement because, if you don't use any signs (+, -, " "), the engine will default to its own settings.

CREATING A SEARCH STATEMENT

When structuring your query, keep the following tips in mind:

Be specific
  EXAMPLE: Hurricane Hugo

Whenever possible, use nouns and objects as keywords
  EXAMPLE: fiesta dinnerware plates cups saucers

Put most important terms first in your keyword list; to ensure that they will be searched, put a +sign in front of each one
  EXAMPLE: +hybrid +electric +gas +vehicles

Use at least three keywords in your query
  EXAMPLE: interaction vitamins drugs

Combine keywords, whenever possible, into phrases
  EXAMPLE: "search engine tutorial"

Avoid common words, e.g., water, unless they're part of a phrase
  EXAMPLE: "bottled water"

Think about words you'd expect to find in the body of the page, and use them as keywords
  EXAMPLE: anorexia bulimia eating disorder

Write down your search statement and revise it before you type it into a search engine query box
  EXAMPLE: +"South Carolina" +"financial aid" +applications +grants

Meta Search Engines

Utilities that search more than one search engine and/or subject directory at once and then compile the results in a sometimes convenient display, sometimes consolidating all the results into a uniform format and listing.

Some offer added value features like the ability to refine searches, customize which search engines or directories are queried, the time spent in each, etc.

Some you must download and install on your computer, whereas most run as server-side applications.
Examples:
  Dogpile ( http://www.dogpile.com )
  Webcrawler ( http://www.webcrawler.com )

Subject Directories

  built by human selection -- not by computers or robot programs
  organized into subject categories, classification of pages by subjects -- subjects not standardized and vary according to the scope of each directory
  NEVER contain full-text of the web pages they link to -- you can only search what you can see (titles, descriptions, subject categories, etc.) -- use broad or general terms
  small and specialized to large, but smaller than most search engines -- huge range in size
  often carefully evaluated and annotated (but not always!!)

When to use directories?

Directories are useful for general topics, for topics that need exploring, for in-depth research, and for browsing.

There are two basic types of directories:

1. academic and professional directories, often created and maintained by subject experts to support the needs of researchers
2. commercial portals that cater to the general public and are competing for traffic

Be sure you use the directory that appropriately meets your needs.

Internet Subject Directories

  INFOMINE, from the University of California, is a good example of an academic subject directory
  Yahoo is a famous example of a commercial portal

Examples of Specialized Directories

Examples of subject-specific databases (i.e., vortals):

  Educator's Reference Desk (educational information)
  Expedia (travel)
  Internet Movie Database (movies)
  Jumbo Software (computer software)
  Kelley Blue Book (car values)
  Monster Board (jobs)
  Motley Fool (personal investment)
  MySimon (comparison shopping)
  PsychCrawler (psychology resources)
  Roller Coaster Database (roller coasters)
  SearchEdu (college & university sites)
  Voice of the Shuttle (humanities research)
  WebMD (health information)

WHAT ARE THE PROS AND CONS OF SUBJECT DIRECTORIES?

PROS: Directory editors typically organize directories hierarchically into browsable subject categories and sub-categories.
When you're clicking through several subject layers to get to an actual Web page, this kind of organization may appear cumbersome, but it is also the directory's strength. Because of the human oversight maintained in subject directories, they have the capability of delivering a higher quality of content. They may also provide fewer results out of context than search engines.

CONS: Unlike search engines, most directories do not compile databases of their own. Instead of storing pages, they point to them. This situation sometimes creates problems because, once a Web page has been accepted for inclusion in a directory, its content could change without the editors realizing it. The directory might continue to point to a page that has been moved or that no longer exists. Dead links are a real problem for subject directories, as is a perceived bias toward e-commerce sites.

WHAT IS THE "INVISIBLE WEB"?

There is a large portion of the Web that search engine spiders cannot, or may not, index. It has been dubbed the "Invisible Web" or the "Deep Web" and includes, among other things, password-protected sites, documents behind firewalls, archived material, the contents of certain databases, and information that isn't static but assembled dynamically in response to specific queries.

Web profilers agree that the "Invisible Web," which is made up of thousands of such documents and databases, accounts for 60 to 80 percent of existing Web material. This is information you probably assumed you could access by using standard search engines, but that's not always the case. According to the Invisible Web Catalog, these resources may or may not be visible to search engine spiders, although today's search engines are getting better and better at finding and indexing the contents of "Invisible Web" pages.
Sources to locate Invisible Web

  http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/InvisibleWeb.html

Criteria for Critical Evaluation of Information on the Internet

Evaluating Information Content on the Internet:

  Purpose
  Intended Audience
  Scope
  Currency
  Authority
  Bibliography
  Objectivity
  Accuracy

Evaluating Information Structure on the Internet:

  Design
  Software Requirements
  Hardware Requirements
  Style
  Uniqueness

Evaluating Information Accessibility on the Internet:

  Restrictions
  Stability
  Security