CIS 50: Computing and Information Technology

Internet Searches

At the end of this assignment, you will:
Utilize search engine subject directories and key words to find information

Subject Directory Searches
THERE WILL BE FOUR (4) PRINTOUTS/FILES TO SUBMIT FOR CREDIT:
1. Assignment #1: your_name_halifax.doc
2. Assignment #2: your_name_lpc_yahoo.doc
3. Assignment #3: your_name_lpc_google.doc
4. Assignment #4: your_name_myteam.doc

Key Word Searches
THERE WILL BE ONE (1) PRINTOUT/FILE TO SUBMIT FOR CREDIT:
5. Assignment #5: your_name_scavenger.doc

Online students: email assignments to cis_assignments@comcast.net. Be sure to put CIS 50online in the subject of the email and put your name in the content of the email. Attach the assignment files to the email. Send a copy to your own email address.

What Are Search Engines?

A Web search engine is a program designed to search for information on the World Wide Web. The results may consist of web pages, images, and other types of files. Some search engines also mine data available in newspapers, books, databases, or open directories. Unlike Web directories, which are maintained by human editors, search engines operate algorithmically or are a mixture of algorithmic and human input.

Although a search engine is really a general class of programs, the term is often used to specifically describe systems like Google, Alta Vista, Yahoo and Excite that enable users to search for documents on the World Wide Web.

Typically, a search engine works by sending out a spider to fetch as many documents as possible; some rely on human-powered directories instead. Another program, called an indexer, then reads these documents and creates an index based on the words contained in each document. Each search engine uses a proprietary algorithm to create its indices such that, ideally, only meaningful results are returned for each query.

http://searchenginewatch.com - this website provides a guide to the major search engines of the web.
Why are these considered to be "major" search engines? Because they are either well-known or well-used.

Want to know the top ten search terms by category? See http://www.hitwise.com/us/datacenter/main/dashboard-10134.html
Want to know the top search providers? See http://www.hitwise.com/us/datacenter/main/dashboard-23984.html

HW3: 8/12 Page: 1

How Search Engines Work

The term "search engine" is often used generically to describe both crawler-based search engines and human-powered directories. These two types of search engines gather their listings in radically different ways. See the YouTube video: http://www.youtube.com/watch?v=UV4bpjTt2P8

Crawler-Based Search Engines

Crawler-based search engines, such as Google, create their listings automatically. They "crawl" or "spider" the web, then people search through what they have found. If you change your web pages, crawler-based search engines eventually find these changes, and that can affect how you are listed. Page titles, body copy and other elements all play a role.

The Parts of a Crawler-Based Search Engine

Crawler-based search engines have three major elements. First is the spider, also called the crawler. The spider visits a web page, reads it, and then follows links to other pages within the site. This is what it means when someone refers to a site being "spidered" or "crawled." The spider returns to the site on a regular basis, such as every month or two, to look for changes.

Everything the spider finds goes into the second part of the search engine, the index. The index, sometimes called the catalog, is like a giant book containing a copy of every web page that the spider finds. If a web page changes, then this book is updated with new information. Sometimes it can take a while for new pages or changes that the spider finds to be added to the index. Thus, a web page may have been "spidered" but not yet "indexed."
Until it is indexed (added to the index), it is not available to those searching with the search engine.

Search engine software is the third part of a search engine. This is the program that sifts through the millions of pages recorded in the index to find matches to a search and rank them in order of what it believes is most relevant. You can learn more about how search engine software ranks web pages on the aptly-named How Search Engines Rank Web Pages page.

Human-Powered (Subject) Directories

A human-powered directory, such as the Open Directory, depends on humans for its listings. You submit a short description to the directory for your entire site, or editors write one for sites they review. A search looks for matches only in the descriptions submitted. Changing your web pages has no effect on your listing. Things that are useful for improving a listing with a search engine have nothing to do with improving a listing in a directory. The only exception is that a good site, with good content, might be more likely to get reviewed for free than a poor site.

Subject directories, unlike search engines, are created and maintained by human editors, not electronic spiders or robots. Editors review and select sites for inclusion in their directories on the basis of previously determined selection criteria. The resources they list are usually annotated. Directories tend to be smaller than search engine databases, typically indexing only the home page or top-level pages of a site. They may include a search engine for searching their own directory (or the web, if a directory search yields unsatisfactory or no results).

Because humans organize the websites in subject directories, you can often find a good starting point if your topic is included. Directories are also useful for finding information on a topic when you don't have a precise idea of what you need.
Many large directories include a keyword search option, which usually eliminates the need to work through numerous levels of topics and subtopics.

Like the yellow pages of a telephone book, subject directories are best for browsing and for searches of a more general nature. They are good sources for information on popular topics, organizations, commercial sites and products. When you'd like to see what kind of information is available on the Web in a particular field or area of interest, go to a directory and browse through the subject categories.

Because directories cover only a small fraction of the pages available on the Web, they are most effective for finding general information on popular or scholarly subjects. If you are looking for something specific, use a search engine.

Directories are categorized lists of sites picked out by human editors. Directory databases are therefore much smaller than those of search engines. However, the fact that the sites are handpicked often means that you will find very high quality sites or articles in the results. Example directories are http://dir.yahoo.com, http://directory.google.com, and http://www.dmoz.org.

Directory Search Example - Pittsburgh Steelers

Follow this example to find the home page for my favorite football team, the Pittsburgh Steelers.

Begin at http://dir.yahoo.com to search for Pittsburgh Steelers football. Follow the subject categories -- do NOT use the search box -- to search for information. Here is the list of links you click to find the URL for the official Pittsburgh Steelers site.

1. http://dir.yahoo.com
2. Click on Recreation and Sports link
3. Click on Sports link
4. Click on Football (American) link
5. Click on National Football League (NFL) link
6. Click on Teams link
7. Click on Pittsburgh Steelers link
8. When you find the Pittsburgh Steelers Football link, what is the Yahoo URL BEFORE you click the link for the official site for the Pittsburgh Steelers?
__ http://dir.yahoo.com/Recreation/Sports/Football__American_/Leagues/National_Football_League__NFL_/Teams/?b=0 __
9. Click on Pittsburgh Steelers link
10. After you click the link, what is the URL for the Pittsburgh Steelers: __ http://www.steelers.com/ __

Assignment #1: Yahoo Directory Search - Halifax

Reading the News from Halifax. In Halifax, a city in Nova Scotia, Canada, situated along the Atlantic Ocean, fishing is a major industry. In recent years, Halifax has also flourished as a tourist town, thanks to the area's natural beauty. David Wu wants to spend his summer working at one of the fishing resorts in the area. Before packing his bags, he wants to learn a little more about daily life in Halifax. He asks you to use the Web to get headlines from a Halifax newspaper, find out about the climate, and determine popular sporting events in Halifax.

Begin at http://dir.yahoo.com to search for the Halifax newspaper. Follow the subject categories -- do NOT use the search box -- to search for information. List the links you clicked to find the URL for the newspaper.

1. http://dir.yahoo.com
2. Click on News and Media link
3.
4.
5.
6. ….. Be sure to list all the links…..
7. When you find the Halifax newspaper link, what is the Yahoo URL BEFORE you click the link for the Halifax newspaper: _______________________________________________
8. What is the URL for the Halifax newspaper: _____________________________

STEP 1: Create a MS Word document named: your_name_halifax.doc
In the document provide:
your first and last name
CIS 50
today's date

Document: your_name_halifax.doc
1. List (or copy/paste) ALL the links you clicked from above to find the URL for the newspaper
2. Copy and paste the URL to the Halifax newspaper
3. Navigate through the pages to obtain information about today's weather in Halifax. Copy and paste today's weather report into the Word document
4. Save one image from the newspaper to your flash drive; use the name Halifax_picture. Hint: to save an image, right-click on the image, select Save Picture As, and save it on your flash drive
5. Copy/paste the picture image into your document
6. Save your document, submit to your instructor

Assignment #2: Yahoo Directory Search - Las Positas College

Use the Yahoo Web directory and the Education category to locate the home page of Las Positas College.

Begin at http://dir.yahoo.com to search for the Las Positas College home web page. Follow the subject categories -- do NOT use the search box -- to search for information. List the links you clicked to find the URL for the college home page.

1. http://dir.yahoo.com
2. Click on Education link
3.
4.
5.
6.
7.
8. ….. Be sure to list all the links…..
9. When you find the Las Positas College link, what is the Yahoo URL BEFORE you click the link for Las Positas College: _______________________________________________________________
10. What is the URL for the home page of Las Positas College: ___________________________________________________________

STEP 1: Create a MS Word document named: your_name_lpc_yahoo.doc
In that document provide:
your first and last name
CIS 50
today's date

Document: your_name_lpc_yahoo.doc
1. List (or copy/paste) ALL the links you clicked from above to find the URL for LPC
2. Copy and paste the URL to the Las Positas College homepage
3. Save one image from the college website to your flash drive; use the name LPC_picture1. Hint: to save an image, right-click on the image, select Save Picture As, and save it on your flash drive
4. Copy/paste the picture image into the Word document
5. Save your document, submit to your instructor

Assignment #3: Google Directory Search - Las Positas College

Use the Google Directory and the Reference category to locate the home page of Las Positas College.
Begin at http://directory.google.com to search for the Las Positas College home page. Follow the subject categories -- do NOT use the search box -- to search for information. List the links you clicked to find the URL for the college home page.

NOTE: directory.google.com is no longer available. Please use http://dmoz.org instead.

1. http://dmoz.org
2.
3.
4.
5.
6. ….. Be sure to list all the links…..
7. When you find the Las Positas College link, what is the dmoz URL BEFORE you click the link for Las Positas College: _______________________________________________
8. What is the URL for the home page of Las Positas College: _______________________________________

STEP 1: Create a MS Word document called your_name_lpc_dmoz.doc
In that document provide:
your first and last name
CIS 50
today's date

Document: your_name_lpc_dmoz.doc
1. List (or copy/paste) ALL the links you clicked from above to find the URL for LPC
2. Copy and paste the URL to the Las Positas College homepage
3. Save one image from the college website to your flash drive; use the name LPC_picture2. Hint: to save an image, right-click on the image and select Save Picture As.
4. Copy/paste the picture image into the Word document.
5. Save your document, submit to your instructor

Assignment #4: Dmoz or Yahoo Directory Search - your team

What is your favorite sports team? (football, softball, baseball, soccer, poker) ______________

Begin at http://dmoz.org or http://dir.yahoo.com to search for the home page of your favorite sports team. Follow the subject categories -- do NOT use the search box -- to search for information. List the links you clicked to find the URL for the home page of your favorite sports team.

1. http://dmoz.org or http://dir.yahoo.com
2.
3.
4.
5. ….. Be sure to list all the links…..
6. When you find the link to the official site of your favorite sports team, what is the URL BEFORE you click the link for the team's home webpage: _______________________________________________
7. What is the URL for your favorite sports team's home webpage: _______________________________________________

STEP 1: Create a MS Word document called your_name_myteam.doc.
In that document provide:
your first and last name
CIS 50
today's date

Document: your_name_myteam.doc
1. List (or copy/paste) ALL the links you clicked from above to find the URL for your favorite sports team
2. Copy and paste the URL to the home page of your favorite sports team
3. Save one image from the team's official webpage to your flash drive; use the name myteam_picture. Hint: to save an image, right-click on the image, select Save Picture As, and save it to your flash drive.
4. Copy/paste the picture image into your document.
5. Save your document, submit to your instructor

When done, email to me (cis_assignments@comcast.net) the four MS Word files.

Searching Myths and Morals

The moral of this story: When you are just playing, knowing HOW your search engine works probably will not matter. But for doing research, it is important to take a few minutes to see how your engine really works.

The Internet is NOT a library.

Search engine indexes are NOT a snapshot of everything online. No search engine, not even Google, knows everything. There is simply too much information and it is all flowing too fast to keep up. Then there is the content that a search engine notices but chooses not to index at all: movies, audio, Flash animations, and other special data formats.

NOT everything on the Web is credible. There are things on the Internet that are biased, distorted, or just plain wrong, whether intentionally or not.

Generally speaking, there are two types of search engines on the Internet. The first is the searchable subject index.
This kind of search engine searches only the titles and descriptions of sites, and does NOT search individual pages. Yahoo is a searchable subject index. Then there is the full-text search engine, which uses computerized "spiders" to index millions, sometimes billions, of pages. These pages can be searched by title or content, allowing for much narrower searches than a searchable subject index. Google is a full-text search engine.

The way most people use an Internet search engine is to type some keywords and see what turns up. While in certain domains that can yield some decent results, it is becoming less and less effective as the Internet gets larger and larger. Simple searches allow you to do quite a bit, but not everything.

Key Word Search Engine Basics

A search engine is a tool that helps people find information on the World Wide Web. Search engines use a specialized computer program called a spider that travels from site to site indexing, or cataloging, the contents of the pages based on keywords. The results are compiled into a database, so what you are searching is not the Web itself, but the contents of the search engine's database.

No single search engine can catalog the shifting contents of the Web, and even the most powerful engines cover only a fraction of known Web content. If a particular site is not widely linked, or its author does not submit it to major search engines, then the material is invisible to them. Also, any site that requires a visitor to type in data, such as a name, cannot be accessed by search engines.

Although search engine indexes are incomplete, and often dated, they are capable of delivering an overwhelming number of results, or hits. The real issue is quality, not quantity. When comparing search engines, it is important to know the company's policy toward allowing commercial sites to boost their ranking in a pay-for-performance arrangement.
Links that are subsidized by companies are called sponsored links.

Not all search engines work the same way. By understanding the underlying algorithms, or specific rules that drive these information engines, it is possible to better target your search. For example, sites like Google rank their pages by analyzing the number of other sites that link to that page. Other sites, like Ask Jeeves, organize results by concepts or intended meanings. Teoma uses an interesting approach called Subject-Specific Popularity, which ranks a site based on the number of same-subject pages that reference it, not just general popularity, to determine a site's level of authority. You usually can determine an engine's approach by clicking the About tab or link on the search site.

Search Engine Techniques

Consider the following suggestions when you begin to search:

Refine your topic. Unless you limit the scope of your topic, you might be overwhelmed by the number of results. If you are looking for general information on a broad topic, consider a subject directory site.

Translate your question into an effective search query. Searches are executed on keywords. You will improve your success if you pick the proper keywords. Try to find unique words or phrases and avoid those with multiple uses. For example, search for Siamese cat, rather than cat.

Consider using advanced search techniques.

Review the search results and evaluate the quality of the results. If the search needs refinement or additional material, you can either use the site's advanced search techniques or select a different Internet resource altogether.

To be effective, you should understand the mechanics of the search engine, use proper spelling, find unique phrases, and experiment with a variety of approaches. If you are consistently getting too many results, try using topic-specific terms and advanced search techniques.
Conversely, if too few results are returned, eliminate the least important terms or concepts, broaden your subject, or use more general vocabulary when you select terms.

Advanced Search Techniques

Many search engines offer powerful features that allow you to refine and control the type of information returned from searches. These features can include the option to search within returned results and the ability to search within specific areas, such as newsgroups. Perhaps the most powerful advanced feature is the option to use Boolean logic. You can use various combinations of the logical operators OR, AND, and NOT to improve your search success greatly.

Narrow, Exact, Trim, Similar - Four NETS for Better Searching

The perfect page is out there somewhere. It's the page that has exactly the information you're looking for, and to you it's beautiful and unattainable, like a faraway star. If only you had a super-sized net for capturing it!

Most people use a search engine by simply typing a few words into the query box and then scrolling through whatever comes up. Sometimes their choice of words ends up narrowing the search unduly and causing them not to find what they're looking for. More often the end result of the search is a haystack of off-target web pages that must be combed through.

You can do better than that. The most comprehensive engine out there at the moment seems to be Google, and that's what we'll focus on here. The first step in becoming a better catcher of web pages is to master Google's Advanced Search form located at http://www.google.com/advanced_search. Bookmark it! Drag the bookmark to your browser's toolbar so that it's always available.

If you make a habit of using the four techniques described below, you'll be a much better searcher than 90% of all web users. It's just four things, and each will provide you with a better net for information catching.
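To picture what the Boolean operators AND, OR, and NOT mentioned above actually do, here is a minimal sketch in Python. It is only an illustration, not how Google or any real engine is built: the four tiny "pages", the `build_index` and `search` functions, and their parameter names are all made up for this example. The sketch builds a small index that maps each word to the pages containing it, then combines those sets of pages.

```python
from collections import defaultdict

# Hypothetical mini "web": page names and their text (invented for illustration).
PAGES = {
    "page1": "atlantis the lost continent of legend",
    "page2": "space shuttle atlantis launch schedule",
    "page3": "atlantis the movie film review",
    "page4": "lost continent myths atlantis and lemuria",
}


def build_index(pages):
    """Map each word to the set of pages that contain it."""
    index = defaultdict(set)
    for name, text in pages.items():
        for word in text.lower().split():
            index[word].add(name)
    return index


def search(index, all_words=(), any_words=(), none_words=()):
    """AND together all_words, OR together any_words, then apply NOT to none_words."""
    # Start from every known page and narrow down.
    results = set().union(*index.values())
    for word in all_words:                      # AND: page must contain every word
        results &= index.get(word, set())
    if any_words:                               # OR: page must contain at least one
        results &= set().union(*(index.get(w, set()) for w in any_words))
    for word in none_words:                     # NOT: page must contain none of these
        results -= index.get(word, set())
    return sorted(results)


index = build_index(PAGES)
broad = search(index, all_words=["atlantis"])
narrow = search(index, all_words=["atlantis", "continent"],
                none_words=["shuttle", "film", "movie"])
print(len(broad), "pages for the broad search:", broad)
print(len(narrow), "pages for the narrow search:", narrow)
```

The broad search matches every page that mentions the word, while the narrow search keeps only the pages that contain all of the required words and none of the excluded ones. Each extra AND or NOT term can only shrink the result set, and each extra OR term can only widen it.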
Net 1: Start Narrow

The biggest problem people have with search engines (perhaps) is that they're so good! You can type in a word and within a fraction of a second you'll have 20,000 pages to look at. Most of those pages will not be exactly what you're after, and you have to spend a load of time wading through the 19,993 that aren't quite right.

If you know what you're after, why not start by asking for it as precisely as you can? Think of all the words that would always appear on the perfect page. Put those in the WITH ALL THE WORDS field.

Think of all the distracting pages that might also turn up because one or more of your search terms has multiple meanings. What words can you think of that might help you eliminate those pages? Put those in the WITHOUT field.

If there's a term with synonyms, either of which might appear on the page you're after, put them in the WITH ANY OF THE WORDS field.

Try each of the searches now, and record how many sites you find. As you do each search, take note of what kinds of things turn up. Notice that the more specific the terms you include and exclude, the more focused your search.

Query / # Matches (write the number of hits you get next to each query)

Imagine that you're interested in the legendary lost continent of Atlantis. There have been several movies with Atlantis in the title, but you're not interested in them. You are also not interested in the space shuttle Atlantis. Try this search...
WITH ALL: Atlantis continent
WITHOUT: shuttle film movie
# Matches: ________

Here's how to search for it badly:
WITH ALL: Atlantis
# Matches: ________

Here's another search to try:
WITH ALL: Waterbury
WITH AT LEAST ONE: Vermont VT
WITHOUT: Connecticut CT
# Matches: ________

Here's how to search for Waterbury, VT badly:
WITH: Waterbury
# Matches: ________

Net 2: Find Exact Phrases

Words hang together in predictable ways. If you type a phrase into the EXACT PHRASE field in Google, you'll be able to locate pages in which those words appear together in that order.
This is obviously useful for finding things that have a proper name consisting of several words (e.g., places, book titles, people).

It's also useful when you can remember a distinctive phrase in something you've read, but now need to locate it. What's the rest of the poem that starts with "Jenny kissed me when we met"?

The ability to search for phrases can be surprisingly useful. Do you suspect that something your student turned in was plagiarized, or at least heavily borrowed without attribution? Type in a phrase or two from the paper and see if it turns up elsewhere! You can also check to see if your own work is being copied without your permission.

Another use for this feature: stamping out urban legends. Next time you get an e-mail warning you about a repressive new law about to pass or a vicious computer virus about to attack, check it out before passing on misinformation to others. Type in any unusual or unique phrase you see in the e-mail and see if others have commented on this particular rumor.

Query / # Matches (write the number of hits you get next to each query)

You've heard of a fine public university in the lower left corner of the United States and you want to know more about it. Try this search...
EXACT PHRASE: San Diego State University
# Matches: ________

Here's how to search for it badly:
WITH ALL: San Diego State University
# Matches: ________

Here are some more searches to try:
EXACT PHRASE: Bill 602P
EXACT PHRASE: We know he has weapons of mass destruction
EXACT PHRASE: demonstrating genuine leadership
EXACT PHRASE: Jenny kissed me when we met

Net 3: Trim Back the URL

The next net is not Google-specific, though you'll find yourself using it often once you get better at Googling. Often you'll find a terrific page nestled deep down inside a folder inside a folder inside a folder. You suspect that there are other pages you'd find interesting nearby. How do you find them? Trim the URL step by step.
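Trimming is mechanical enough that a few lines of code can do it. The following is a minimal sketch in Python using the standard `urllib.parse` module; the function name `trimmed_urls` and the `www.example.com` address are made up for illustration. It removes one trailing piece of the path at a time until only the site's top level is left.

```python
from urllib.parse import urlsplit, urlunsplit


def trimmed_urls(url):
    """Yield the URL again and again, each time with one more trailing path piece removed."""
    parts = urlsplit(url)
    # Break the path into its folder/file pieces, ignoring empty pieces from extra slashes.
    segments = [s for s in parts.path.split("/") if s]
    while segments:
        segments.pop()  # trim away the last part
        path = "/" + "/".join(segments) + ("/" if segments else "")
        # Rebuild the URL without any query string or fragment.
        yield urlunsplit((parts.scheme, parts.netloc, path, "", ""))


# Hypothetical deep link, used only to show the steps.
steps = list(trimmed_urls("http://www.example.com/projects/webquests/shakespeare/index.html"))
for step in steps:
    print(step)
```

Each printed address is one step closer to the home page. Paste them into the browser one at a time, exactly as you would when trimming by hand.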
You found this Romeo & Juliet WebQuest that you really like. Are there more like that where this one came from? Start here:

http://oncampus.richmond.edu/academics/education/projects/webquests/shakespeare/

Now trim away the last part:

http://oncampus.richmond.edu/academics/education/projects/webquests/

What do you see? Trim it again:

http://oncampus.richmond.edu/academics/education/projects/
http://oncampus.richmond.edu/academics/education/
http://oncampus.richmond.edu/academics/
http://oncampus.richmond.edu/

Now try this: A friend told you of another cool Shakespeare WebQuest and emailed you the URL:

http://www.longwood.k12.ny.us/wmi/wq/collin/index.html

That URL turned out to be wrong, though. Can you find the real URL, and see if there are other worthy WebQuests at the same site?

Sometimes you'll get a notice saying FORBIDDEN! Sometimes you'll get a list of files and directories. Sometimes you'll get a web page with more links. Each step back tells you more about where the page came from.

This is also a good strategy to try when a page goes missing (that is, you get a 404 message). Perhaps someone at the site moved the page into a new folder or renamed a folder. Trace your way back to the top and drill down again to see if you can find it.

Net 4: Look for Similar Pages

Once you've found something you like on Google, it's very easy (and useful) to find similar pages. How? Below the advanced search fields that you've been using up until now are another two fields. These allow you to find pages that Google has deemed to be similar to or linked to any URL you type in.

How does Google know that two pages are similar? The details of the inner workings of search engines are a trade secret, but it's safe to assume that it's based on similarities in the words and the external links on each page. All that matters is that it works surprisingly well, especially when you're not sure what key words to look for.
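Since the real method is a trade secret, the following Python sketch is purely an assumption about what "similar words and similar links" could mean, not Google's algorithm. It scores two pages by how much their word sets and their outgoing-link sets overlap (a measure often called Jaccard similarity). The two sample pages, the `jaccard` and `page_similarity` functions, and the equal weighting of words and links are all invented for illustration.

```python
def jaccard(a, b):
    """Overlap of two collections: shared items divided by all distinct items."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # nothing to compare
    return len(a & b) / len(a | b)


def page_similarity(page_a, page_b):
    """Average the word overlap and the outgoing-link overlap of two pages."""
    words = jaccard(page_a["text"].lower().split(), page_b["text"].lower().split())
    links = jaccard(page_a["links"], page_b["links"])
    return (words + links) / 2


# Two hypothetical pages: their visible text and the sites they link to.
page_a = {"text": "online community for teachers",
          "links": {"x.example", "y.example"}}
page_b = {"text": "online community for educators",
          "links": {"x.example", "z.example"}}

score = page_similarity(page_a, page_b)
print(f"similarity score: {score:.2f}")
```

A score near 1 means the two pages share most of their words and links, and a score near 0 means they share almost nothing. A real engine would weigh far more signals than this.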
Use this tool to find more of a good thing. Use it to find pages that are linked to a page that you find useful. Chances are, those pages might be useful to you, too. And there's always ego surfing: if you've uploaded a page of your own to a public server and it's been there for a while, find out who else is linking to it.

Query / # Matches (write the number of hits you get next to each query)

Suppose that you've discovered Tapped In, an online community of educators, and you're wondering what else like that is out there. Using Google's similarity search will surface a number of sites that are likely to interest you.
SIMILAR TO: www.tappedin.org
# Matches: ________

Another way to explore a domain is to find out who else is linked to a page. Who else finds Tapped In useful enough to include on one of their pages?
LINKED TO: www.tappedin.org
# Matches: ________

Here's another search to try:
SIMILAR TO: kids.msfc.nasa.gov
# Matches: ________
LINKED TO: kids.msfc.nasa.gov
# Matches: ________

So, to recap... remembering the word NETS will help you to remember the four techniques you just experimented with.

More Search Techniques

The website www.21cif.com, 21st Century Information Fluency, is a great website for learning about effective Internet searching. Review the following sections of this website:

Digital Information Fluency: http://www.21cif.com/resources/difcore/index.html - the ability to find, evaluate and use digital information effectively. What are you looking for? Where will you find the information? How will you get there? How good is the information? How will you use the information?

Choosing keywords is crucial; many words are unnecessary or ineffective.
http://21cif.com/rkitp/curriculum/v1n3/use_flash_applications_v1n3.html
Try the 'Buffalo Challenge' game. Note: you need to use the Firefox browser.

Assignment #5: Keyword Searches

STEP 1: Create a MS Word document called your_name_scavenger.doc. In that document,
1. Type your name, CIS 50 Online, today's date, Instructor: DJFields
2. Type (or copy/paste) the questions into the document
3. Answer the following questions and email me the document

NAME: ___________________________________________________________
CIS 50 Online DJFields
Today's date: ____________________________________________________

1. What is the URL for the Smithsonian Institution's home page? (easy keyword search)
URL for Smithsonian: ___________________________________________
Which search engine did you use? __________________________________________
What keywords did you utilize? ______________________________________________
How many hits? _________________________________________________________

2. What is the day of the week of the vice president's birthday, next year? (not so easy keyword search, multiple steps required)
Day of week: ___________________________________________

3. Find the URL of the web page where you can find this picture of Kermit and hear what he is saying (it should take you 6 minutes to find the answer)
URL: ___________________________________________
What award did Kermit receive: ________________________________________

4. Using only two keywords, search for movies playing at the Vine Theater in Livermore.
What were your two keywords? ________________________________________

5. Using only two keywords, search for how many people live in California.
What were your two keywords? ________________________________________

6. Finish this sentence: My Google fu is ________________________________________ today.
What is Google fu/foo? ________________________________________

7. Using the Google calculator: How many 'smoots' in one mile? Include a printscreen of your results.
How long is a 'smoot'? ________________________________________

8. Google limits a query to how many words? Why that number? ________________________________________

9. Are employees encouraged to use Google time to work on their own projects? If so, what percent of the time? ________________________________________

10. What is the Google company headquarters called? ________________________________________

When done, email to me (cis_assignments@comcast.net) the MS Word file your_name_scavenger.doc