Using Google for Genealogical Searches

Manatee Genealogical Society
MGS Computer Special Interest Group (SIG)
http://www.colket.org/genealogy/MGS/
Overview
• History of Browsing
• Problem of Searching
• Solution to Search Problem
• Google Search Basics
• Search Results
Internet Searches
Static Searches (Indexable Nodes)
• Use Google, Bing, or another search engine
• Every word on a page is indexed by a web crawler
Dynamic Searches (Non-Indexable Nodes)
• Private Databases
  – Fee/membership (e.g., Ancestry, professional, news)
  – Many available with library membership
• Commercial Databases
  – Shopping
  – Or limited to employees and customers only
• Public Databases
  – City, county, state, and federal records
Dark Web
Static Searches
• Have web crawlers visit each node
• For “public domains”
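The idea that a crawler indexes every word on a page can be sketched as a tiny inverted index. This is only an illustration, not how any real search engine is built; the page addresses and text are hypothetical, and the names build_index and search are invented for this sketch.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of page addresses that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            index[word].add(url)
    return index


def search(index, query):
    """Return the pages that contain every word in the query."""
    words = re.findall(r"[a-z0-9']+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical pages, invented for illustration only.
pages = {
    "example.org/pelot": "Samuel Pelot family history",
    "example.org/bears": "Brown bear habitat and history",
}
index = build_index(pages)
print(sorted(search(index, "pelot history")))
```

A page that no crawler can reach (a private or fee-based database, for instance) never enters the index, so no query can find it.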
Who Invented the Internet?
History of Browsing
• Early on, browsing was very cumbersome
  • You generally logged in to a desired computer and searched based on its directory
  • Every computer had its own directory structure and search application(s)
• In 1980, Tim Berners-Lee proposed and prototyped ENQUIRE, a system to share documents
• In 1990, he collaborated with Robert Cailliau on a joint proposal for the World Wide Web (WWW), or W3, project: a protocol to share information using hypertext
  • This became HyperText Markup Language (HTML), which is defined using text
  • It allowed people to organize the information they wanted to share, with links to the information or files, which could then be downloaded
  • It requires a browser that can read these HTML files using a protocol called HyperText Transfer Protocol (HTTP)
• Many commercial browsers are available today
  • Internet Explorer (IE), Safari, Netscape, Mozilla Firefox, etc.
  • Even Google has its own browser, called “Google Chrome”
  • You need a current browser to access the latest information
Problem with Searching
Many search applications were developed based on HTML, BUT:
• A search on Coke returns 117,000,000 hits
• Many of these are menu items at restaurants – much useless information
• You get hits from every restaurant that has Coke on its menu
• If you are interested in Coca-Cola headquarters in Atlanta, it may not appear until item 23,672,344
How do you get RELEVANT hits?
How do you get hits ordered in a way that facilitates use?
Google found a way to “solve” this problem.
What’s a Google?
Solution to Search Problem - 1
• In 1995, Sergey Brin and Larry Page, while students at Stanford, came up with a concept of using the strength of the Internet community.
• Their technology evaluated a site primarily on how many other sites linked to it and ranked search results accordingly.
• The technology was called PageRank (named for Larry Page), although it does also rank pages by importance.
• PageRank tended to return results that people found useful, resulting in a surprisingly valuable system.
• PageRank was patented by Stanford University.
• In 1997, BackRub was a PageRank application, so called because the technology analyzed what was going on behind the scenes.
• In fall 1997, BackRub became Google
• http://infolab.stanford.edu/~backrub/google.html
• Google holds the exclusive licensing rights to PageRank; Stanford received Google shares in exchange, which it sold in 2005 for $336 million.
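Ranking a site by how many other sites link to it can be illustrated with a small, simplified PageRank calculation. This is a sketch under stated assumptions, not Google's production method: the damping factor of 0.85 and the fixed number of iterations are assumptions, and the four-page link graph is hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank scores by repeatedly redistributing rank.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on equally to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: three pages link to "hub", which links back to "a".
links = {"a": ["hub"], "b": ["hub"], "c": ["hub"], "hub": ["a"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

The page with the most incoming links ends up with the highest score, and a page linked from a highly ranked page outranks pages that nobody links to.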
Solution to Search Problem - 2
• Google is an adaptation of googol. A googol is the number 1 followed by 100 zeros (10^100), a term coined by Milton Sirotta, nephew of the mathematician Edward Kasner. The name reflects the enormous number of WWW pages it searches.
• In 1998, they left Stanford to develop Google.
• Set up shop in the Menlo Park garage of Susan Wojcicki
• 1998: 50 employees, 7 million searches a day.
• By 2005, Google was handling 250 million web searches per day.
• Sergey Brin’s net worth was $29.9 billion (17th richest in the world in 2014)
• Larry Page’s net worth was $29.8 billion (18th richest in the world in 2014)
• Google headquarters, the Googleplex, is located in Mountain View, California. As of March 31, 2009, the company had 19,786 full-time employees; 46,170 by May 2014, across 68 worldwide locations.
Solution to Search Problem - 3
Most Relevant Results First
Google Search Basics - 0
• Ready to do some Google searching
• Still a big problem: a simple surname search yields a huge number of results
    Colket   =>         89,600 results
    Pelot    =>        477,000 results
    Reger    =>      7,650,000 results
    Sparrow  =>     63,900,000 results
    Johnson  =>    978,000,000 results
    Smith    =>  1,500,000,000 results
• Need to find a way to reduce results
  • Google Search Basics discusses ways to do this in the search query
  • Google Results discusses ways to do this on the results page
Google Search Basics - 1
• Google cares about:
  – Singular versus plural – “apple” versus “apples”
  – Order of words, which is important for ranking
      “brown bear” – things named “Brown Bear” first – 20,800,000 Hits
      “bear brown” – emphasis on bears – 87,000,000 Hits
      Suggest putting surnames first – Pelot Samuel
  – Spelling
      Names originating in another alphabet have many valid transliterations
        Mohamed, Mohammed
        Pelot, Pelote, Pelotte
      Sometimes you get spelling suggestions
      Sometimes it helps to use misspelled queries
• Google does not care about:
  – Case sensitivity – hence “Samuel Pelot” = “samuel pelot”
  – Little words, which are ignored – such as I, where, how, the, of, an, for, from, it, in, is, single digits, single letters. If they matter, use quotes: “The Who” is a band
  – Punctuation – MOST PUNCTUATION IS IGNORED, with the exceptions to these rules listed next
Google Search Basics - 2
– Apostrophes are meaningful
    Hence Pauls, Paul’s, and Pauls’ require 3 different searches.
– A “-” before a word excludes that term – later
– A “-” between 2 or more words strongly connects the words
    Example: twelve-year-old dog is almost like “twelve year old”
– A “-” by itself is ignored
– A “_” between 2 or more words also strongly connects the words
    Underscore between 2 words as a formal name: Quick_Sort
    Mary_Beth: the underscore is treated as a search for MaryBeth | Mary Beth | Mary_Beth
– Quotes require an exact match – later
Exceptions:
– Punctuation in proper names: Google+, AB+, C++, A#
– $ is understood to be dollars
    “Nikon $400” ≠ “Nikon 400”
    Ditto for ¢, £, ¥, etc.
– @ is understood to be an email address, e.g., colket@colket.org
– Hashtags are understood to be trending topics, e.g., #newenglandpatriots
Google Search Basics - 3
Exact Order; Exact Phrase – Use quotation marks.
This technique is especially useful for genealogy – very different results for:
    Samuel George Pelot   11,000 Hits   versus   “Samuel George Pelot”   37 Hits
    George Samuel Pelot    8,670 Hits   versus   “George Samuel Pelot”    0 Hits
Huh??? The unquoted searches should get the same number – why? (Word order affects results.)
The exact phrase “George Samuel Pelot” does not exist.
What about the middle name?
Some sources report it as an initial or as no middle initial (nmi)
    “Samuel Pelot”        231 Hits
    “Samuel G Pelot”       24 Hits
    “Samuel G. Pelot”      24 Hits   (most punctuation is ignored)
    “Samuel nmi Pelot”      0 Hits
Remember, a search for “Alexander Bell” will miss hits for “Alexander G Bell”
    “Alexander Bell”          410,200 Hits
    with G.                    87,200 Hits
    with Graham             3,390,000 Hits
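Because an exact-phrase search for one form of a name misses the others, it can help to generate every form up front. The following is a minimal sketch of that idea; the function name name_queries is invented for illustration, and only the first-middle-surname patterns discussed above are covered.

```python
def name_queries(first, surname, middle=None):
    """Return exact-phrase queries covering common forms of a name."""
    phrases = [f"{first} {surname}"]
    if middle:
        # Full middle name, then middle initial only. A trailing period
        # is left off the initial because most punctuation is ignored.
        phrases.append(f"{first} {middle} {surname}")
        phrases.append(f"{first} {middle[0]} {surname}")
    # Straight double quotes request an exact phrase match.
    return [f'"{phrase}"' for phrase in phrases]


for query in name_queries("Samuel", "Pelot", middle="George"):
    print(query)
```

Each printed line is meant to be run as its own search, since any single phrase excludes the hits belonging to the other two.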
Google Search Basics - 4
Search Within Site/Domain – Identify the site in the query:
    iraq site:nytimes.com   returns hits on “Iraq” in the NY Times only
    iraq site:.gov          returns hits only from a .gov domain
    iraq site:.iq           returns hits only from an Iraq domain
Good for genealogy research:
    Pelot site:nytimes.com       157 Hits   NY Times only
    Pelot                    394,000 Hits   Worldwide
    Pelot site:.fr            14,700 Hits   French Domain
    Pelot site:.ch             1,070 Hits   Swiss Domain
    Pelot site:.ca             2,900 Hits   Canadian Domain
    Pelot site:.us             2,410 Hits   US Domain (not null)
    Pelot site:.mil               89 Hits   US Military Domain
    Pelot site:.gov              947 Hits   US Government Domain
    Pelot site:.biz            5,480 Hits   Business Domain
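Running the same surname against several sites or domain extensions is repetitive, so the queries can be generated. This sketch assumes the familiar https://www.google.com/search address with the query carried in the q parameter; the function names are invented for illustration.

```python
from urllib.parse import urlencode


def site_query(term, site):
    """Restrict a search term to one site or domain extension."""
    return f"{term} site:{site}"


def search_url(query):
    """Encode a query into a search address (the q parameter holds the query)."""
    return "https://www.google.com/search?" + urlencode({"q": query})


for site in ["nytimes.com", ".fr", ".gov"]:
    print(search_url(site_query("Pelot", site)))
```

The printed addresses can be pasted into a browser one at a time; the encoding step only replaces the space and colon with characters that are safe inside an address.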
Google Search Basics - 5
Exclude Terms – Use “-” preceded by a blank
Say you are searching for anti-virus stuff for humans:
    anti-virus             132,000,000 Hits   (includes antivirus, anti virus, and anti-virus)
    anti-virus -software    79,100,000 Hits
Note: the “-” is part of the word in “anti-virus” (strongly connected)
Can use multiple negations:
    jaguar -cars -football
And for the poor fellow with the surname of “Sparrow”:
    Sparrow                 63,400,000 Hits
    Sparrow -bird           60,400,000 Hits
    Sparrow -bird -book     45,500,000 Hits
Note: Combinations of search terms can be effective
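The exclusion rule (a “-” directly before the unwanted word, with a blank in front of it) is easy to get wrong by hand. A minimal sketch that builds such a query follows; the function name exclude is invented for illustration, and it deliberately handles single words only.

```python
def exclude(base, unwanted):
    """Append each unwanted word with a leading '-' preceded by a blank."""
    parts = [base]
    for term in unwanted:
        if not term or any(ch.isspace() for ch in term):
            # This sketch only handles single words.
            raise ValueError(f"expected a single word, got {term!r}")
        parts.append(f"-{term}")
    return " ".join(parts)


print(exclude("Sparrow", ["bird", "book"]))
print(exclude("anti-virus", ["software"]))
```

The hyphen inside anti-virus is left alone because it has no blank before it, which matches the distinction drawn above between a connecting hyphen and an excluding one.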
Google Search Basics - 6
OR Operator – Sometimes you want hits for either/or
Use capitalized “OR” or the OR operator “|”
    Tampa Bay Buccaneers                  2,620,000 Hits
    Tampa Bay Buccaneers 2004               298,000 Hits
    Tampa Bay Buccaneers 2005               409,000 Hits
    Tampa Bay Buccaneers 2004 2005          206,000 Hits
    Tampa Bay Buccaneers 2004 OR 2005       726,000 Hits
    Tampa Bay Buccaneers 2004 | 2005        726,000 Hits
Exceptions: Phrases such as “FOR BETTER OR FOR WORSE”
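The OR operator is a natural fit for surname spelling variants. A minimal sketch, assuming the capitalized OR form shown above; the function name either is invented for illustration.

```python
def either(alternatives):
    """Join alternatives with the capitalized OR operator."""
    quoted = []
    for term in alternatives:
        # Multi-word alternatives are quoted so they stay together.
        quoted.append(f'"{term}"' if " " in term else term)
    return " OR ".join(quoted)


print("Tampa Bay Buccaneers " + either(["2004", "2005"]))
print(either(["Pelot", "Pelote", "Pelotte"]))
```

The OR must stay in capitals; in lower case it is treated as an ordinary little word.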
Google Search Basics - 7
Feeling Lucky – Takes you straight to the first result.
Wild Cards
– Use a “*” – Works on whole words, not parts of words
– Use a “?” – Single characters (officially not in Google)
For questions such as “How often does Halley's comet appear?”
    Pose as: Halley’s Comet appears every * years – it’s 76 years
Also for unknown middle names:
    Samuel * Pelot        10,700,000 Hits
    “Samuel * Pelot”       7,910,000 Hits
    “Samuel ? Pelot”             624 Hits
Note, for comparison:
    Samuel Pelot             801,000 Hits
    “Samuel Pelot”               616 Hits
Ten Word Limit – Search terms over 10 are ignored
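Both points on this slide can be checked before a query is typed: the wildcard phrase for an unknown middle name, and whether a long query runs past the word limit. This is a sketch only. The limit of 10 is the figure quoted above and is passed as a parameter because the real limit may differ; the function names are invented for illustration.

```python
def wildcard_phrase(first, surname):
    """Build an exact phrase with '*' standing in for an unknown whole word."""
    return f'"{first} * {surname}"'


def split_at_limit(query, limit=10):
    """Return (kept, dropped) word lists, assuming words past the limit are ignored."""
    words = query.split()
    return words[:limit], words[limit:]


print(wildcard_phrase("Samuel", "Pelot"))

long_query = (
    "Samuel Pelot family history genealogy records census "
    "birth marriage death cemetery obituary"
)
kept, dropped = split_at_limit(long_query)
print(len(kept), "words kept; dropped:", dropped)
```

If anything lands in the dropped list, the usual remedy is to move the most distinctive words to the front or to split the search in two.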
Google Search Basics - 8
Misspellings – Try alternative spellings
    Thousands of Web sites mention Arnold Schwarznegger          70,000 Hits
    though the governator spells his name “Schwarzenegger”   34,500,000 Hits
Google recognizes some misspellings and provides alternatives (new since Mar 2010)
Google Search Basics - 9
• Proximity Search
Not an advertised Google tool, but a common search tool (e.g., Archive Grid) that seems to be useful with Google
Proximity search: “Samuel Pelot”~3
Hits for:
    Samuel Pelot             801,000 Hits
    “Samuel Pelot”               616 Hits
    “Samuel George Pelot”         27 Hits
    “Samuel G Pelot”              73 Hits
    “Samuel Pelot”~2             351 Hits (catches the initial)
    “Samuel Pelot”~3             190 Hits
    “Samuel Pelot”~4             158 Hits
    “Samuel Pelot”~7             126 Hits
    “Samuel Pelot”~10            173 Hits
Google Search Basics - 10
Keep Search Terms Simple
• Most queries do not require advanced operators or unusual syntax
  • Simply enter a name, place, product, or concept
  • Simple is good
• Think of terms likely to be on result pages
  • Don’t use: My Head Hurts
  • Instead use: Headache {term likely found on a medical page}
• Describe what you want in as few words as possible
  • Use: Weather Cancun
  • Instead of: Weather Report for Cancun Mexico
• Choose descriptive terms
  • Use: Celebrity Ringtones
  • Instead of: Celebrity Sounds
Google Results - 1
[Figure: annotated Google results page. Labels: Start Search; Search Term(s); Advanced Search; Filters (controls for advanced search options); Result Statistics; Sponsored Links; Result Links, each with Link, Uniform Resource Locator (URL), and Snippet; sometimes Similar Pages and Cached Pages.]
Google Results - 2
Ordered By Relevance [Indented same site, less relevant]
Also sponsored links, links to news stories, Ads
True, unpaid results are on the lower left
Ads are on the right (no more than 10 per page)
Sponsored Links on top (Ads, at a higher rate; colored background)
True Unpaid Search Results =>
Title
Text from site with Snippets of your search terms (in bold)
URL => Uniform Resource Locator
Size
Date – NOT created/updated, but when last crawled
Dataset in Jul crawl of 2014 is over 266TB containing 4.05 billion webpages
Indication if Cached – Good place to go if Page Removed
URL goes to current page
Cached link goes to cached page – handy if page deleted or link broken
Cached version is used to highlight key words
File Format
.html use browser
.pdf – read with Adobe’s free reader at www.adobe.com
.doc – read with Microsoft’s free reader at www.microsoft.com
.ppt – read with Microsoft’s free reader at www.microsoft.com
Similar Results
Google Results - 3
Location Feature – Sets default for searches
Location auto-detected
- by IP Address
- or entered into Google Toolbar
Can be changed, if you are looking for stuff in a different location
**Only works in your selected country**
Manually set location is stored in a “Cookie”
Can also be turned off
Type of Content – Limit results to a particular type of web content:
Called Filters
Images, Videos, News, Shopping, Books,
Discussions, Places, Blogs, Real-time (e.g., updates from Twitter)
or select the default – Everything
This is a big recent change. Five years ago one had to search each database separately.
The databases were not integrated. They are now.
Note on URLs
• Results of a Google search are provided as a Uniform Resource Locator (URL)
• URL format, using http://www.google.com.uk as the example:
    http://   HyperText Transfer Protocol
    www       World Wide Web
    google    Domain Name
    .com      Domain Name Extension
    .uk       Domain Name Country Extension
• Domain Names: http://www.networksolutions.com/whois/index.jsp
• URL for my domain name is: http://www.colket.org
• Domain name extensions include:
.com .mobi .mil .gov .edu .net .info .org .biz .bz .tv
• Domain Name Extensions (including Country):
http://www.networksolutions.com/glossary/glossaryd.jsp#domainnameextensions
• Domain Name Country Extensions –
.be .ca .cn .de .es .ru.com .se.com .us
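The parts of a URL named above can be pulled apart with Python's standard urllib.parse module. This is a minimal sketch: the function name url_parts and the dictionary keys are invented for illustration, and only the last label of the host name is treated as the extension, so two-part endings such as .ru.com are not recognized.

```python
from urllib.parse import urlparse


def url_parts(url):
    """Split a URL into its protocol, host name, and final extension."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    labels = host.split(".") if host else []
    return {
        "protocol": parsed.scheme,
        "host": host,
        "labels": labels,
        # Only the last label is treated as the extension in this sketch.
        "extension": "." + labels[-1] if labels else "",
    }


print(url_parts("http://www.colket.org"))
```

The extension is the piece that the site: operator matches when it is given a value such as .org or .gov.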
Note on IP Addresses
• The host name in every URL maps to a number called an IP (Internet Protocol) address
    http://www.google.com => 216.239.51.99
• IPv4 addresses have the format xxx.xxx.xxx.xxx (e.g., 208.77.188.166)
    2^32 can handle 4,294,967,296 addresses
    Expected to run out in early 2000s
• IPv6 addresses, introduced in the late 1990s, have the format x:x:x:x:x:x:x:x
    (e.g., 2001:db8:0:1234:0:567:1:1)
    2^128 (or 340,282,366,920,938,463,463,374,607,431,768,211,456) addresses
• IPv4 addresses still work, as they all map to IPv6
• Operating systems are migrating to IPv6 (e.g., Vista uses IPv6; XP uses IPv4)
    Go to help/support on your computer and search for IPv6
• Google crawls over 8,000,000,000 pages each month
• You need a current browser
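The address counts above follow directly from the number of bits in each format (32 for IPv4, 128 for IPv6), and the mapping of IPv4 into IPv6 can be shown with Python's standard ipaddress module. This is a small sketch that reuses the example addresses from the slide; the ::ffff: prefix is the standard IPv4-mapped form.

```python
import ipaddress

# Address space sizes: 32 bits for IPv4, 128 bits for IPv6.
print(f"{2**32:,}")
print(f"{2**128:,}")

# An IPv4 address embedded in IPv6 using the IPv4-mapped prefix ::ffff:
v4 = ipaddress.IPv4Address("216.239.51.99")
mapped = ipaddress.IPv6Address(f"::ffff:{v4}")
print(mapped, "->", mapped.ipv4_mapped)

# The IPv6 example from the slide parses as a version 6 address.
v6 = ipaddress.ip_address("2001:db8:0:1234:0:567:1:1")
print(v6.version)
```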
Static versus Dynamic Searches - 1
“Relevancy” might not be relevant to
Researchers and Genealogists.
Google’s use of Relevancy is not useful for doing many types of searches:
• Dynamic Databases
• Genealogy Searches on family surnames
• Obscure information
• Much non-business oriented information
• Rather unique information
Static versus Dynamic Searches - 2
Desired information is in a separate database
    Auction sites: eBay | Craigslist | UBid | Bid Start | Ebid | US Seek
Web pages are private and not available to Google
    Most businesses have a public web site and a private web site
    Only data that companies want to share is available via Google
Limited-access web sites – typically for-profit sites, with many models, e.g.,
    ACM’s Digital Library – No Google access at all
    Ancestry.com – Google provides “teaser” results to entice membership
    Chicago Tribune – Get “teaser” hits on Google, but have to pay to access data
Later we will discuss:
    The dark web
    Archive Grid
    New York Times Database
Future Plans
Future plans for Computer SIGs:
• Finding Pictures of Your Ancestor on the Internet – 3 Feb 2015
• Using Google for Genealogical Searches – Scheduled for 3 March 2015
• Manipulating Photos for Genealogy – Scheduled for April 2015
• Using Ancestry.com, requested by Dunham Swift – Maybe November 2015
• Need inputs; see sheet
What else would you like to have addressed at future Computer SIG meetings?
MGS Computer Special
Interest Group (SIG)
WHAT IS IT?
A meeting of genealogists interested in using their
personal computers to enhance their research.
WHEN?
Monthly -- On the first Tuesday of the month (October
through May) following main topic speaker.
TIME:
About 11:15 AM to 12:15 PM, following the meeting
break period after the main MGS speaker.
PLACE:
The Central Library Auditorium, Bradenton, FL (same
location as our MGS monthly meeting)
WHO:
Open to all those interested in using their personal
computers to enhance their genealogical research.
PROGRAM:
Each month we will discuss and view what's new in
genealogy on the Internet. We'll have demonstrations of software and
hardware that will facilitate our research. Tips and techniques will be
shared by and among those attending each meeting. Genealogically
related computer, Internet, digital photography and research questions will
be fielded during the sessions. We'll look at the newest technology but
will keep the discussions as low tech as possible.
What topics would you like to hear??????