Google XML Reference
Google Search Appliance
Confidential: For Customer Use Only
Revised October, 2005
Google has developed a simple HTTP-based protocol for serving search results.
Search administrators have complete control over how search results are requested
and presented to the end user. This document describes the technical details of
Google search request and results formats. It assumes that the reader has a basic
understanding of the HTTP protocol and the HTML document format.
Contents
1. Overview
2. Request Format
2.1 Request Overview
2.2 Search Parameters
2.3 Query Terms
2.4 Filtering
2.5 Internationalization
2.6 Sorting
2.7 Meta Tags
2.8 Limits
3. Results Format
3.1 Custom HTML
3.1.1 Custom HTML Output Overview
3.1.2 Internationalization
3.2 XML
3.2.1 XML Output Overview
3.2.2 Character Encoding Conventions
3.2.3 Google XML Results DTD
3.2.4 Google XML Tag Definitions
Appendices
Appendix A: Estimated vs. Actual Number of Results
Appendix B: URL Escaping
Glossary
1. Overview
A Google search request is a simple HTTP request to the Google search engine. The
search request format and options available are detailed in the Request Format
section.
The search results are returned in the output format specified in the search request.
Currently, Google supports output results in XML and HTML format. XML
formatted results give you the power to customize the display of the results through
the implementation of a custom XML parser. The HTML results can be customized
through the application of an XSL stylesheet to the standard XML results.
2. Request Format
This section is broken into the following categories:
Request Overview
Search Parameters
Query Terms
Filtering
Internationalization
Sorting
Meta Tags
Limits
2.1 Request Overview
Using the Google search protocol is as simple as requesting a page from a web server.
The Google search request is a standard HTTP GET command, which returns results
in either XML or HTML format as specified in the search request. The search request
is a URL that combines the search engine host name, port, and path with a
collection of name-value pairs (input parameters) separated by & characters. Some
examples are listed below. Explanations of input parameters and output results can be
found in the remainder of this document.
Note: Google recommends performing an HTTP version 1.0 (or later) GET command.
Note: To determine which host name and port to send your search requests to, please
review your specific configuration documentation. The path to send your search
requests to is always "/search".
Examples
The query
GET /search?q=bill+material&output=xml&client=test&site=operations
would return the first 10 results matching the query "bill material" in the "operations"
collection in the Google XML output format.
The query
GET /search?q=bill+material&start=10&num=5&output=xml_no_dtd&proxystylesheet=test&client=test&site=operations
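would return results numbering 11-15 matching the query "bill material" in the
"operations" collection. Because proxystylesheet=test is specified together with
output=xml_no_dtd, the results are returned as custom HTML produced by the XSL
stylesheet associated with the "test" front end.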
The query
GET /search?q=Star+Wars+Episode+%2BI&output=xml_no_dtd&lr=lang_de&ie=latin1&oe=latin1&client=test&site=movies&proxystylesheet=test
would return the first 10 German results matching the query "Star Wars Episode +I"
in the "movies" collection returned in the Google custom HTML output format by
applying the XSL stylesheet associated with the "test" front end to the standard XML
results.
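The request URLs in the examples above can also be assembled programmatically. The following is a minimal Python sketch, not an official client: the host name search.example.com, the port, the http scheme, and the function name build_search_url are assumptions made for illustration. Only the "/search" path and the parameter names come from this document.

```python
from urllib.parse import urlencode


def build_search_url(host, port, params):
    """Return a search request URL for the given host, port, and parameters.

    The path is always "/search". urlencode URL-escapes every value
    (spaces become "+") and joins the name-value pairs with "&".
    """
    return f"http://{host}:{port}/search?{urlencode(params)}"


# Hypothetical host and port; see your configuration documentation.
url = build_search_url(
    "search.example.com",
    80,
    {"q": "bill material", "output": "xml", "client": "test", "site": "operations"},
)
print(url)
```

The resulting URL can be requested with any HTTP client that issues a GET command.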
2.2 Search Parameters
This table lists all the valid name-value pairs that can be used in a search request and
descriptions of how these parameters will modify the search results.
Name
Description
Default Value
access
Defines whether the user is searching public content or all content (i.e., public
and secure). This parameter takes effect only if the Secured Content Search
capability is enabled.
The access parameter can have one of these values:
p - search public content
s - search secure content
a - search all content, both public and secure
Note: Secured Content Search is automatically enabled for clustered appliances.
Default value: p
as_dt
Modifies the as_sitesearch parameter as follows:
i - Include only results in the web directory specified by as_sitesearch
e - Exclude all results in the web directory specified by as_sitesearch
Default value: i
as_epq
Adds an additional search query term to search for the phrase specified.
This parameter has the same effect as the phrase special query term.
Note: New query terms specified will be combined with q query terms to generate
search results.
Note: The value specified for this parameter must be URL-escaped.
Default value: empty string
as_eq
Adds additional search query terms to exclude any of the terms specified.
This parameter has the same effect as the exclude (-) special query term.
Note: New query terms will be combined with q query terms to generate search results.
Note: The value specified for this parameter must be URL-escaped.
Default value: empty string
as_lq
Additional search query term to show any pages which link to the specified URL.
This parameter has the same effect as the link special query term.
Note: No other query terms can be specified when using this special query term.
Note: The value specified for this parameter must be URL-escaped.
Default value: empty string
as_occt
Additional search query term to specify where the search terms occur on the page:
anywhere on the page, in the title, or in the URL.
any - anywhere on the page
title - in the title of the page
URL - in the URL for the page
Note: Query terms specified will be combined with q query terms to generate search
results.
Note: The value specified for this parameter must be URL-escaped.
Default value: empty string
as_oq
Adds additional search query terms to find any of the terms specified.
This parameter has the same effect as the OR special query term.
Note: New query terms will be combined with q query terms to generate search results.
Note: The value specified for this parameter must be URL-escaped.
Default value: empty string
as_q
Search query terms as entered by the user.
(See Query Terms section for additional query features.)
Note: Query terms specified will be combined with q query terms to generate search
results.
Note: The value specified for this parameter must be URL-escaped.
Default value: empty string
as_sitesearch
Additional search query term to show links in the specified web directory or to
exclude those links depending on the value of as_dt.
This parameter has the same effect as the site special query term.
When the Google Search Appliance is sent a search request that includes the
as_sitesearch parameter, it converts the value of the parameter into an argument to
the site special query term and appends it to the value of q in the search results.
For example, if your search contains the following parameters:
q=mycompany&as_sitesearch=www.mycompany.com
The raw XML of your search results will contain the following:
<q>mycompany site:www.mycompany.com</q>
The default XSLT stylesheet displays the value of the q tag in the search box on the
results page. Consequently, using an as_sitesearch parameter will appear to change
the user's search query.
If the parameter and value as_dt=e are specified, -site: is appended to the end of
the query term.
Note: The value specified for this parameter must be URL-escaped and contain fewer
than 125 characters.
Default value: empty string
client
A string indicating any valid front end.
Default value: REQUIRED
filter
Activates or deactivates automatic results filtering performed by Google search. By
default, filtering is applied to all Google results returned to improve results
quality.
(See Automatic Filtering section for more details.)
Default value: 1
getfields
Requests that the names and values of the meta tags specified be returned with each
search result, when available.
(See Meta Tags section for more details.)
Note: All meta tag names or values specified must be double URL-escaped.
Default value: empty string
ie
Input Encoding
Sets the character encoding used to interpret the query string.
(See Internationalization section for details.)
Default value: latin1
lr
Language restrict
Restricts searches to pages in the specified language.
(See Language Filters section for more details.)
Default value: empty string
num
Number of results desired per request. The maximum allowable value is 100. (The
maximum number of results available for a query is 1,000.) See also start.
Note: The actual number of results may be smaller than the requested value.
Default value: 10
numgm
Number of KeyMatch results to return with the results. A value from 0 to 5
(inclusive) can be specified for this option.
Default value: 3
oe
Output Encoding
Sets the character encoding used to encode the results returned.
(See Internationalization section for details.)
Default value: UTF8
output
Selects the format of the search results. Valid formats are:
xml_no_dtd - XML results or custom HTML. (See proxystylesheet parameter for details.)
xml - XML results with Google DTD reference. If using this value, proxystylesheet
must be omitted from the parameters or must be set to an empty string.
Default value: REQUIRED
partialfields
Restricts the search results to documents with meta tags whose values contain the
words or phrases specified.
(See Meta Tags section for more details.)
Note: All meta tag names or values specified must be double URL-escaped.
Default value: empty string
proxycustom
Custom XML tags to be included in the XML results. The only permitted values for
this parameter are <HOME/>, <ADVANCED/>, or <TEST/>.
(See the Custom HTML output section for more details.)
Note: This parameter is disabled if the search request does not contain the
proxystylesheet parameter.
Note: If custom XML is specified, search results will not be returned with the
search request.
Note: Custom XML must be URL-escaped.
Default value: empty string
proxyreload
A value of 1 indicates that the Google Search Appliance should update the XSL
stylesheet cache to refresh the stylesheet currently being requested. This parameter
is optional. The XSL stylesheet cache is updated approximately every 15 minutes.
(See the Custom HTML section for more details.)
Default value: 0
proxystylesheet
If the value of the output parameter is xml_no_dtd, then the output format is
modified by the proxystylesheet value as follows:
Omitted - XML results
Empty - XML results have a content type of text/html (rather than text/xml), because
the XML results are not transformed.
Front End Name - Custom HTML results through application of the XSL stylesheet
associated with the specified front end
(See the Custom HTML section for more details.)
Note: This parameter may also specify the identifier of a valid collection. The
default XSL stylesheet associated with that collection will then be used for custom
HTML output.
Note: The value specified for this parameter must be URL-escaped.
Default value: NA
q
Search query as entered by the user.
(See Query Terms section for additional query features.)
Note: The value specified for this parameter must be URL-escaped.
Default value: empty string
requiredfields
Restricts the search results to documents that contain exact meta tag names or
name-value pairs specified.
(See Meta Tags section for more details.)
Note: All meta tag names or values specified must be double URL-escaped.
Default value: empty string
site
The name of a collection. Note that you can search over multiple collections using
the properly escaped OR (pipe character) to separate the collection names.
Default value: REQUIRED
sitesearch
Additional search query term to show links in the specified web directory. Requires
that a value for q (query) be submitted as well. (The value of as_dt does not modify
the effect of the sitesearch parameter.)
This parameter has the same effect as the site special query term.
Note: The sitesearch and as_sitesearch parameters differ in how they are returned in
the XML results. The sitesearch parameter is not appended to the search query in the
results. That is, the original query term will not be modified when you use the
sitesearch parameter.
Note: The value specified for this parameter must be URL-escaped and contain fewer
than 125 characters.
Default value: empty string
sort
Indicates alternate sorting method.
(See Sorting section for sort parameter format and details.)
Note: Only date sort is currently supported.
Default value: empty string
start
Use this parameter to support result set page navigation. The value is the
zero-based index of the first result to return; for example, start=10 with num=5
returns results 11-15. The maximum number of results available for a query is 1,000,
i.e., the value of the start parameter added to the value of the num parameter
cannot exceed 1,000. See also num.
Default value: 0
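The limits on num and start can be checked before a request is sent. The sketch below is one possible way to turn a page index into start and num values. The function name page_params, the zero-based page index, and the lower bound of 1 for num are assumptions for illustration; the limits of 100 and 1,000 come from the parameter descriptions above.

```python
MAX_NUM = 100        # maximum allowable value of num
MAX_RESULTS = 1000   # start + num cannot exceed this value


def page_params(page, per_page=10):
    """Return start/num for a zero-based page index, or None past the last page."""
    if page < 0:
        raise ValueError("page must not be negative")
    if not 1 <= per_page <= MAX_NUM:
        raise ValueError("num must be between 1 and 100")
    start = page * per_page
    if start >= MAX_RESULTS:
        return None  # no results are available beyond the first 1,000
    # Shrink the last page so that start + num stays within the limit.
    num = min(per_page, MAX_RESULTS - start)
    return {"start": start, "num": num}


print(page_params(2, 5))
```

With a page size of 5, the third page (index 2) corresponds to start=10 and num=5, which matches the request for results 11-15 in the Request Overview examples.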
Custom Parameters
If any custom parameters that contain spaces are added to the search URL, the space
characters will be replaced by an underscore (_).
For example:
http://search.customer.com/search?q=customer+query&site=collection&client=collection&output=xml_no_dtd&newparam=test+this
This URL adds the custom parameter newparam with a value of "test+this." For
security reasons, all space characters (represented as a "+") in the custom parameter
newparam will be replaced by "_" characters, while built-in variables, such as q, will
not be affected.
The resulting XML will look like this:
<PARAM name="q" value="customer query" original_value="customer+query"/>
<PARAM name="newparam" value="test_this" original_value="test+this"/>
The unmodified value can still be retrieved from the original_value attribute.
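A results parser can read both attributes of each PARAM element. The following Python sketch uses the standard library XML parser. The ROOT wrapper element and the function name read_params are assumptions added only so that the two PARAM lines form a parseable document; the real results document has its own structure, described in the XML section.

```python
import xml.etree.ElementTree as ET
from urllib.parse import unquote_plus

# Hypothetical wrapper: <ROOT> exists only to make this fragment well-formed.
fragment = """<ROOT>
<PARAM name="q" value="customer query" original_value="customer+query"/>
<PARAM name="newparam" value="test_this" original_value="test+this"/>
</ROOT>"""


def read_params(xml_text):
    """Map each PARAM name to its (value, original_value) pair."""
    root = ET.fromstring(xml_text)
    return {
        p.get("name"): (p.get("value"), p.get("original_value"))
        for p in root.iter("PARAM")
    }


params = read_params(fragment)
value, original = params["newparam"]
# The original_value attribute is still URL-escaped; decode it to recover spaces.
print(value, "|", unquote_plus(original))
```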
2.3 Query Terms
Default Search
By default, Google only returns pages that include all of your search terms. There is
no need to include "AND" between terms. Keep in mind that the order in which the
terms are typed will affect the search results. To restrict a search further, just include
more terms.
Google ignores common words and characters such as "where" and "how," as well as
certain single digits and single letters, because they tend to slow down your search
without improving the results. Google will indicate if a common word has been
excluded by including text in the search comments field of the search results returned.
Special Characters
By default, all non-alphanumeric characters that are included in a search query are
treated as query term separators (just like space characters).
The exceptions to this rule are the following characters: double quote mark ("), plus
sign (+), minus sign (hyphen) (-), decimal point (.), and ampersand (&). The
ampersand character (&) is treated as another character in the query term in which it
is included. The decimal point is a query term separator unless it is part of a number
(e.g., 250.01), in which case it counts as part of the query term. The remaining
exception characters correspond to search features listed in the section below.
If your document contains a number, with or without a decimal point, that has letters
immediately before or after it, the letters are treated as a separate word or words. For
example, the string 802.11a is indexed as two separate words, 802.11 and a.
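The separator rules can be approximated with a regular expression. This is only a rough sketch of the behavior described above, not the appliance's actual tokenizer: the pattern, the name split_terms, and the treatment of the quote, plus, and minus characters as plain separators are all assumptions.

```python
import re

# Assumption: a number is a run of digits with optional ".digits" groups, and a
# word is a run of letters, optionally joined by "&". Everything else (including
# the quote, plus, and minus operators, which this sketch does not interpret)
# separates terms.
TERM = re.compile(r"\d+(?:\.\d+)*|[^\W\d_]+(?:&[^\W\d_]+)*")


def split_terms(text):
    """Split text into query terms under the simplified rules above."""
    return TERM.findall(text)


print(split_terms("802.11a"))
print(split_terms("AT&T/mobile"))
```

Under these assumptions, 802.11a yields the two terms 802.11 and a, and the ampersand stays inside its term.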
Special Query Terms
Google supports the use of several special query terms that allow the user or search
administrator to access additional capabilities of the Google search engine. These
special query terms are listed below.
Note: All query terms must be correctly URL-escaped in the search request sent to
Google search.
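Because special query terms contain characters such as quotation marks, colons, slashes, and plus signs, escaping them by hand is error-prone. A minimal Python sketch follows, assuming that standard form-style escaping (spaces become "+") is what the appliance expects, as the Request Overview examples suggest. The function name escape_query is hypothetical.

```python
from urllib.parse import quote_plus


def escape_query(query):
    """URL-escape a query so it can be used as the value of the q parameter."""
    return quote_plus(query)


for query in [
    "bass -music",
    '"yellow pages"',
    "Star Wars Episode +I",
    "admission site:www.stanford.edu/group/uga",
]:
    print("q=" + escape_query(query))
```

The third query produces q=Star+Wars+Episode+%2BI, the same escaped form used in the Request Overview example.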
Special Query Capability
Sample Usage
Description
Exclude Query Term
Sample usage: bass -music
Sometimes what you're searching for has more than one meaning. For example, the term
"bass" can refer to either fishing or music. You can exclude a word from your search
by putting a minus sign ("-") immediately in front of the term you want to exclude
from the search results.
Note: The search request parameter, as_eq, can also be used to submit terms to
exclude.
Phrase Search
Sample usage: "yellow pages"
Search for complete phrases by enclosing them in quotation marks or connecting them
with hyphens. Words marked in this way will appear together in all results exactly as
you have entered them. Phrase searches are especially useful when searching for
famous sayings or proper names.
Note: The search request parameter, as_epq, can also be used to submit a phrase
search.
Boolean OR Search
Sample usage: vacation london OR paris
Google search supports the Boolean "OR" operator. To retrieve pages that include
either word A or word B, use an uppercase OR between terms.
Note: The search request parameter, as_oq, can also be used to submit a search for
any term in a set of terms.
Directory Restricted Search
Domain search examples:
site:www.google.com
site:google.com
site:com
Directory search examples:
admission site:www.stanford.edu/group/uga
site:www.google.com/about/
site:www.google.com/about
To search a domain, specify a partial string that matches complete name segments
from the end of the canonical host name.
To search a particular directory on a web server (including root), you must specify
the complete canonical name of the host server followed by the path of the directory.
The string must have a "/" character after the host name to limit searches to a
single server/directory. The path segments searched must be a complete match, because
there is no partial path segment matching. Enter the query followed by "site:"
followed by the host name and path of the web directory. If the "/" character is at
the end of the web directory path specified, then only files within that directory
will be searched and files in sub-directories will not be considered.
The URLs for these queries must contain fewer than 119 characters.
Note: The exclusion operator ("-") can be applied to this query term to remove a web
directory from consideration in the search.
Note: Only one "site:" search term per search request is supported at this time.
Note: The search request parameters, as_sitesearch and as_dt, can also be used to
submit "site:" and "-site:" search terms.
Title Search (term)
Sample usage: intitle:Google search
If you prepend "intitle:" to a query term, Google search will restrict the results to
documents containing that word in the title. The query term must appear in the first
10 words of the title. Note there can be no space between the "intitle:" and the
following word.
Note: Putting "intitle:" in front of every word in your query is equivalent to
putting "allintitle:" at the front of your query.
Title Search (all)
Sample usage: allintitle: Google search
If you start a query with the term "allintitle:", Google search will restrict the
results to those with all of the query words in the title. The query terms must
appear in the first 10 words of the title.
URL Search (term)
Sample usage: inurl:Google search
If you prepend "inurl:" to a query term, Google search will restrict the results to
documents containing that word in the result URL. Note there can be no space between
the "inurl:" and the following word.
Note: "inurl:" works only on words, not URL components. In particular, it ignores
punctuation and will only use the first word following the "inurl:" operator. To find
multiple words in a result URL, use the "inurl:" operator for each word.
Note: Putting "inurl:" in front of every word in your query is equivalent to putting
"allinurl:" at the front of your query.
URL Search (all)
Sample usage: allinurl: Google search
If you start a query with the term "allinurl:", Google search will restrict the
results to those with all of the query words in the result URL.
Note: "allinurl:" works only on words, not URL components. In particular, it ignores
punctuation. Thus, "allinurl: foo/bar" will restrict the results to pages with the
words "foo" and "bar" in the URL, but won't require that they be separated by a slash
within that URL, that they be adjacent, or that they be in that particular word
order. There is currently no way to enforce these constraints.
File Type Filtering
Sample usage: Google filetype:doc OR filetype:pdf
The query prefix "filetype:" will filter the results returned to only include
documents with the extension specified immediately after. Note there can be no space
between "filetype:" and the specified extension.
Note: Multiple file types can be included in a filtered search by adding more
"filetype:" terms to the search query, when used in conjunction with the Boolean OR.
File Type Exclusion
Sample usage: Google -filetype:doc -filetype:pdf
The query prefix "-filetype:" will filter the results to exclude documents with the
extension specified immediately after. Note there can be no space between
"-filetype:" and the specified extension.
Note: Multiple file types can be excluded in a filtered search by adding more
"-filetype:" terms to the search query.
Web Document Info
Sample usage: info:www.google.com
The query prefix "info:" will return a single result for the specified URL if it
exists in the index.
Note: No other query terms can be specified when using this special query term.
Back Links
Sample usage: link:www.google.com
The query prefix "link:" will list web pages that have links to the specified web
page. Note there can be no space between "link:" and the web page URL.
Note: No other query terms can be specified when using this special query term.
Note: The search request parameter, as_lq, can also be used to submit a link:
request.
Cached Results Page
Sample usage: cache:www.google.com web
The query prefix "cache:" will return the cached HTML version of the specified web
document that the Google search crawled. Note there can be no space between "cache:"
and the web page URL.
If you include other words in the query, Google will highlight those words within the
cached document.
Note: To use Google's default cached result display, simply omit the output parameter
in the cache request. To customize the display of cached results, simply request XML
or Custom HTML output as part of the cache request and ensure your parser or
stylesheet will handle the incoming cache data.
2.4 Filtering
Google search provides many ways for you to filter the results that are returned as
part of your query. These filtering options include:
Automatic Filtering
Language Filters
o Automatic Language Filters
o Combining Language Filters
Other filtering options can be applied through special query parameters, query terms
and meta tags, which are documented in their respective sections. Please review these
sections for more information on other filtering options.
2.4.1 Automatic Filtering
The quality of the results Google returns for searches is extremely important. One
method that makes sure the best results are returned for a query is automatic
"filtering" of the search results to weed out undesirable results.
Currently, Google search uses two techniques for automatic filtering of results:
Duplicate Snippet Filter - If multiple documents contain the same information
in their snippets in response to a query, then only the most relevant document
of that set will be displayed in the results.
Duplicate Directory Filter - If there are many results in a single web directory,
then only the two most relevant results for that directory will be returned in the
results. An output flag indicates that more results are available from that
directory.
By default, both types of filters are enabled. However, you can disable them with the
filter parameter.
Setting filter=1 enables both Duplicate Directory Filtering and Duplicate Snippet
Filtering. This is the default setting if no value for the filter parameter is provided.
Setting filter=0 will disable both Duplicate Directory Filtering and Duplicate
Snippet Filtering.
Although determining when to use this option is up to each search administrator,
Google recommends against setting filter=0 for typical search requests, since Google
has found that document filtering significantly enhances the quality of most search
results.
Setting filter=p will disable Duplicate Snippet Filtering only.
Setting filter=s will disable Duplicate Directory Filtering only.
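The four filter settings can be summarized as a small mapping from the two filters to a parameter value. This is a sketch only; the function name filter_value and its arguments are hypothetical, while the values 1, 0, p, and s are the ones listed above.

```python
def filter_value(snippet_filter=True, directory_filter=True):
    """Return the filter parameter value for the desired filter combination."""
    if snippet_filter and directory_filter:
        return "1"  # default: both filters enabled
    if not snippet_filter and not directory_filter:
        return "0"  # both filters disabled
    # Exactly one filter is disabled: "p" turns off Duplicate Snippet
    # Filtering only, "s" turns off Duplicate Directory Filtering only.
    return "p" if not snippet_filter else "s"


print(filter_value(directory_filter=False))
```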
When an end user submits a search request in which filtering removes any results, the
removal of the results will be noted in the output generated for the search results. See
the section on Estimated vs. Actual Number of Results for more information on how a
filtered result set is identified and recommendations for results display.
The appliance also will automatically group results from a single directory in the
search results.
If you set filter=0, then the order in which results are ranked can change depending
on the value of the num parameter.
For example, if you set num=10 and filter=0 you may get two results in a particular
directory that are among the 10 most relevant results. If one of these results is
the most relevant of all, then directory crowding will cause both to be displayed at
the top of the results.
If you now set num=20, you may get a third result in the same directory that would
otherwise be ranked between 11 and 20. However, this result will actually be ranked
third because of directory crowding.
2.4.2 Language Filters
This section covers:
Automatic Language Filters
Combining Language Filters
2.4.2.1 Automatic Language Filters
Language filters limit searches to pages in the specified languages. The algorithm for
automatically determining the language of a web document is not customizable. The
language determination algorithm is primarily based on the majority language used in
the web document body text. Automatic language collections may not be appropriate
for all users.
Note: Encoding schemes for input and output of search requests are important when
providing international search. Please review the Internationalization section for more
details.
The automatic language filters generated are:
Language
Automatic Language Filter Name
Arabic
lang_ar
Chinese (Simplified)
lang_zh-CN
Chinese (Traditional)
lang_zh-TW
Czech
lang_cs
Danish
lang_da
Dutch
lang_nl
English
lang_en
Estonian
lang_et
Finnish
lang_fi
French
lang_fr
German
lang_de
Greek
lang_el
Hebrew
lang_iw
Hungarian
lang_hu
Icelandic
lang_is
Italian
lang_it
Japanese
lang_ja
Korean
lang_ko
Latvian
lang_lv
Lithuanian
lang_lt
Norwegian
lang_no
Portuguese
lang_pt
Polish
lang_pl
Romanian
lang_ro
Russian
lang_ru
Spanish
lang_es
Swedish
lang_sv
Turkish
lang_tr
2.4.2.2 Combining Language Filters
Search requests that use the lr parameter support the Boolean operators identified in
the table below (in order of precedence).
Boolean Operator
Sample Usage
Description
Boolean NOT [-]
Sample usage: -lang_fr
Removes all results that are defined as part of the Language Filter immediately
following the "-" operator.
The example lr value would remove all results in French.
Boolean AND [.]
Sample usage: gloves.hats
Returns results that are in the intersection of the results returned by the
collection to either side of the "." operator.
The example restrict value would return all results which are in both the "hats" and
"gloves" custom collections.
Boolean OR [|]
Sample usage: lang_en|lang_fr
Returns results that are in either of the results returned by the collection to
either side of the "|" operator.
The example lr value would return all results matching the query that are in either
French or English.
Parentheses [( )]
Sample usage: (gloves).(-(lang_hu|lang_cs))
All terms within the innermost set of parentheses will be evaluated before terms
outside the parentheses are evaluated. Use parentheses to adjust the order of term
evaluation.
The example lr value would return all results in the "gloves" custom collection that
are not in either the Hungarian or Czech collections.
Note: Spaces are not valid characters in the collection string.
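Restrict expressions can be composed from small helper functions rather than typed by hand. The sketch below is an illustration only: the helper names any_of, all_of, negate, and validate are hypothetical, and the choice to wrap every AND operand in parentheses is a design assumption that keeps precedence explicit.

```python
def any_of(*exprs):
    """Boolean OR: join expressions with "|"."""
    return "|".join(exprs)


def all_of(*exprs):
    """Boolean AND: parenthesize each operand and join with "."."""
    return ".".join(f"({e})" for e in exprs)


def negate(expr):
    """Boolean NOT: prefix with "-", adding parentheses around compound input."""
    is_simple_name = expr.replace("_", "").replace("-", "").isalnum()
    return "-" + (expr if is_simple_name else f"({expr})")


def validate(expr):
    """Reject spaces, which are not valid characters in the collection string."""
    if " " in expr:
        raise ValueError("spaces are not valid in a restrict expression")
    return expr


print(validate(any_of("lang_en", "lang_fr")))
print(validate(all_of("gloves", negate(any_of("lang_hu", "lang_cs")))))
```

The second expression evaluates the OR inside the innermost parentheses first, then negates it, and finally intersects it with the "gloves" collection. The finished value still has to be URL-escaped before it is placed in a request.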
2.5 Internationalization
In order to support searching documents in multiple languages and character
encodings, Google provides the ie parameter to specify how Google search should
interpret characters in the search request, and the oe parameter to specify how
characters in the search results output should be encoded. To appropriately decode the
search query and correctly encode the search results, specify the correct ie and oe
parameters, respectively, in the search request.
Note: When providing search for multiple languages, Google recommends the usage
of the utf8 encoding value for the ie and oe parameters.
Example
The query
GET /search?q=gloves&client=test&site=test&lr=lang_en|lang_fr&ie=latin1&oe=latin1
would interpret the search query "gloves" using the latin1 encoding scheme, search
for English or French results, and return results in the latin1 encoding scheme.
The query
GET /search?q=gloves&client=test&site=test&lr=(-lang_hu).(-lang_cs)&ie=latin2&oe=latin2
would interpret the search query "gloves" using the latin2 encoding scheme, search
for any results which are not in Hungarian or Czech, and return results in the latin2
encoding scheme.
The query
GET /search?q=gloves&client=test&site=test&lr=lang_zh-CN|lang_zh-TW&ie=utf8&oe=utf8
would interpret the search query "gloves" using the utf8 encoding scheme, search for
any results which are in Simplified or Traditional Chinese, and return results in the
utf8 encoding scheme.
Note: See the Language Filters section for details of language specific searches using
the lr parameter.
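The effect of the ie parameter is easiest to see with a query that contains a non-ASCII character, because the escaped bytes differ between encodings. The following Python sketch is an illustration under assumptions: the mapping from ie values to Python codec names and the function name encoded_query are not part of the Google protocol.

```python
from urllib.parse import urlencode

# Assumed correspondence between ie/oe values and Python codec names.
CODECS = {"latin1": "latin-1", "latin2": "iso-8859-2", "utf8": "utf-8"}


def encoded_query(query, encoding_value):
    """Build q/ie/oe pairs, escaping the query in the encoding named by ie."""
    return urlencode(
        {"q": query, "ie": encoding_value, "oe": encoding_value},
        encoding=CODECS[encoding_value],
    )


print(encoded_query("Müller", "latin1"))
print(encoded_query("Müller", "utf8"))
```

The letter "ü" is escaped as %FC in latin1 but as %C3%BC in utf8, so the ie value sent with the request must match the encoding that was actually used to escape the query.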
Character Encoding Values
The following table lists all encoding values supported by these parameters:
Language
Encoding Value
Alternate Encoding Value
Chinese (Simplified)
gb
GB2312
Chinese (Traditional)
big5
Big5
Czech
latin2
ISO-8859-2
Danish
latin1
ISO-8859-1
Dutch
latin1
ISO-8859-1
English
latin1
ISO-8859-1
Estonian
latin4
ISO-8859-4
Finnish
latin1
ISO-8859-1
French
latin1
ISO-8859-1
German
latin1
ISO-8859-1
Greek
greek
ISO-8859-7
Hebrew
hebrew
ISO-8859-8
Hungarian
latin2
ISO-8859-2
Icelandic
latin1
ISO-8859-1
Italian
latin1
ISO-8859-1
Japanese
sjis
Shift_JIS
Korean
euc-kr
EUC-KR
Latvian
latin4
ISO-8859-4
Lithuanian
latin4
ISO-8859-4
Norwegian
latin1
ISO-8859-1
Portuguese
latin1
ISO-8859-1
Polish
latin2
ISO-8859-2
Romanian
latin2
ISO-8859-2
Russian
cyrillic
ISO-8859-5
Spanish
latin1
ISO-8859-1
Swedish
latin1
ISO-8859-1
latin3
ISO-8859-3
latin5
ISO-8859-9
latin6
ISO-8859-10
euc-jp
EUC-JP
Unicode (All Languages)
utf8
UTF-8
2.6 Sorting
Google search provides two sorting options for implementing your search solution:
Sort By Relevance
Sort By Date
2.6.1 Sort By Relevance (Default)
By default, Google combines hypertext analysis and PageRank technologies to
provide users with highly relevant results. Hypertext analysis uses the design of the
page, examining over 100 factors to determine the best result for your query term.
PageRank considers the link structure of the entire index to understand how each page
links to the other pages in the index.
2.6.2 Sort By Date
Google search also supports the ability to order search results by date. The date of a
web document is defined by parameters configured by the search administrator. When
a search is performed using the sort by date capability, the date associated with each
result document will be included with the results.
When using the Sort by Date feature, the automatic quality filter will sometimes
reorder results when performing result grouping. This can be disabled by adding the
"filter=0" parameter to the search request when performing a search by date.
Example
The query
GET /search?q=chicken+teriyaki&output=xml&client=test&site=test&sort=date:D:S:d1
would return the first 10 top results sorted by both date and relevancy which match
the query "chicken teriyaki" in the "test" collection.
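The sort value in this example follows the pattern date:<direction>:<mode>:<format>, whose parts are described under Details below. A small Python sketch can assemble and validate such a value. The function name sort_value and the choice of defaults are assumptions for illustration; the permitted letters are taken from the tables in this section.

```python
DIRECTIONS = {"A", "D"}   # ascending, descending
MODES = {"S", "R", "L"}   # sort relevant results, sort all results, look-up only
FORMATS = {"d1"}          # dates returned as YYYY-MM-DD


def sort_value(direction="D", mode="S", fmt="d1"):
    """Return a value for the sort parameter, e.g. "date:D:S:d1"."""
    if direction not in DIRECTIONS:
        raise ValueError(f"unknown direction: {direction!r}")
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if fmt not in FORMATS:
        raise ValueError(f"unknown format: {fmt!r}")
    return f"date:{direction}:{mode}:{fmt}"


print(sort_value())
```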
Details
To sort the results by date, the sort parameter must be formatted as follows:
date:<direction>:<mode>:<format>
where <direction>, <mode> and <format> can have the following values:

<direction> Value    Results
A                    Sort results in ascending date order
D                    Sort results in descending date order

<mode> Value         Results
S                    Sort relevant results. Google's algorithm will determine a
                     subset of the most relevant results from the set of all
                     results, and then sort that subset by date to return as
                     results for the search request.
R                    Sort all results
                     Note: Providing sort by date on queries with large result
                     sets may incur performance penalties.
L                    Perform a look-up on the date associated with each
                     document and return the date information for each result
                     returned; but no sorting is performed.

<format> Value       Results
d1                   The format of the value returned for each search result
                     returned is set to YYYY-MM-DD
2.7 Meta Tags
Google search provides two options for leveraging the meta tags that are available in
your content. Unless one of these parameters is specified, meta tag information will
not be considered in your search results, since that information is not visible to the
search user. These options are:


Requesting Meta Tag Values
Filtering by Meta Tags
2.7.1 Requesting Meta Tag Values
Through the use of the getfields parameter, the Google search engine allows a
search request to specify meta tag values to return with the search results. The search
engine will only return meta tag information for results which actually contain the
meta tags. The search for meta tags is case-insensitive. Use only whole words in the
getfields parameter, not partial words or word "stems." There is a limit of 320
characters returned for each meta tag when using getfields. This character limit
includes the meta tag name and content.
Usage
GET /search?q=[search term]&output=xml&client=test&site=test&getfields=[meta tag name]
Example
The query
GET /search?q=books&output=xml&client=[test]&site=[test]&getfields=author.title.keywords
would return the first 10 results that match the query "books" in the "test" collection.
If any of the results contain the author, title and/or keywords meta tags, then the
values of those meta tags will be returned with the respective results. For example, the
following tags could be returned with this search request:
<META NAME="author" CONTENT="Jakob Nielsen">
<META NAME="title" CONTENT="Usability Engineering">
<META NAME="keywords" CONTENT="Usability, User Interface, User Feedback">
Details
To specify multiple meta tag values to be returned, list all meta tag names separated
by a period (".") as in the example above. To request all available meta tags for each
search result, specify an asterisk ("*") as the value for the getfields parameter.
Note: When meta tag values are requested, they are not displayed in results requested
in the default HTML format. Please use the custom HTML or XML output options to
take advantage of this feature.
Note: All meta tag names or values specified must be double URL-escaped. See an
example in the following section.
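The rules for the getfields value can be sketched in a few lines of Python. This is an illustration under stated assumptions: the helper names are hypothetical, urllib.parse.quote_plus stands in for the URL escaping described in Appendix B, and the meta tag name "last modified" is invented purely to show the effect of double escaping.

```python
from urllib.parse import quote_plus


def double_escape(text):
    """Apply URL escaping twice in succession (assumed stand-in: quote_plus)."""
    return quote_plus(quote_plus(text))


def build_getfields(names):
    """Join meta tag names with periods; ["*"] requests all available meta tags."""
    names = list(names)
    if names == ["*"]:
        return "*"
    return ".".join(double_escape(name) for name in names)


print(build_getfields(["author", "title", "keywords"]))
print(build_getfields(["last modified"]))  # hypothetical name containing a space
```

Names made only of letters and digits pass through unchanged, so the value for the example above is still author.title.keywords.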
2.7.2 Filtering by Meta Tags
The Google search engine can filter results by the values of the result meta tags. This
section details how to use the requiredfields and partialfields input parameters
to filter on meta tag values. The term partialfields refers to part of the meta tag
content, rather than part of a word. Other filtering techniques are noted in the Filtering
section.
Usage
GET /search?q=[search term]&output=xml&client=test&site=test&requiredfields=[metatag name]:[metatag content]
Examples
The query
GET /search?q=checks&output=xml&client=test&site=test&requiredfields=department:Human%252BResources|department:Finance
returns the first 10 results which match the query "checks" in the "test" collection
and which also contain either of the following meta tags:
<META NAME="department" CONTENT="Human Resources">
<META NAME="department" CONTENT="Finance">
The query
GET /search?q=books&output=xml&client=test&site=test&partialfields=author:Scott
would return the first 10 results which match the query "books" in the "test" collection
and which also contain the word "Scott" somewhere in the "author" meta tag. Some
example meta tags satisfying this search request are:
<META NAME="author" CONTENT="Sir Walter Scott">
<META NAME="author" CONTENT="F. Scott Fitzgerald">
Details
Multiple meta tag constraints can be specified using the requiredfields and
partialfields parameters. To filter for the presence of a meta tag, indicate the name
of the meta tag to be found. To filter on a specific meta tag value, indicate the name of
the meta tag followed by the colon ":" character and then the specific value. The
partialfields parameter matches complete words, not parts of words. In addition,
the match must be within the first 160 characters of the meta tag. See the examples in
the table below for sample usage.
To combine multiple name-value pairs, use the following operators:
Boolean Operator     Sample Usage                           Description
Boolean AND [ . ]    author:William.keywords                Returns results which satisfy both
                                                            meta tag constraints.
Boolean OR [ | ]     department:Sales|department:Finance    Returns results which satisfy either
                                                            meta tag constraint.
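The two operators lend themselves to small helper functions. The sketch below is illustrative only: the names constraint, all_of and any_of are hypothetical, and the names and values passed in are assumed to be double URL-escaped already.

```python
def constraint(name, value=None):
    """A name alone filters on presence; name:value filters on a specific value."""
    if value is None:
        return name
    return f"{name}:{value}"


def all_of(*constraints):
    """Boolean AND: join constraints with a period."""
    return ".".join(constraints)


def any_of(*constraints):
    """Boolean OR: join constraints with a pipe."""
    return "|".join(constraints)


# Reproduces the sample usage from the table above.
print(all_of(constraint("author", "William"), constraint("keywords")))
print(any_of(constraint("department", "Sales"),
             constraint("department", "Finance")))
```

The resulting string is what would be passed as the value of the requiredfields or partialfields parameter.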
As stated in the "Query Terms" section, all non-alphanumeric characters included in a
search query are treated as query term separators (just like space characters).
Similarly, Google uses these separators to divide metatag content into single entities,
or word tokens; that is, a word or a string that may or may not be a real word. The
separators, used in both queries and results, and their values are in the table. They are
not customizable.
Separator characters:
+ { ~ } ! | @ ` # [ $ ] % ^ : ; & ' * < ( > ) ? , . / = (space)

Separator    Value
\            92
"            34
\t           9
\r           13
\n           10
\v           11
\f           12
\177         177
Note: All meta tag names or values specified must be double URL-escaped. See
example above.
2.8 Limits
This section lists any limitations on the search requests sent to Google search.
Component                                      Limit
Search request length                          2048 bytes
Query terms (includes terms in parameter q     50
and any parameters starting with as_)
site: query terms (includes use of the         1 (per search request)
as_sitesearch parameter)
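A client can check these limits before sending a request. The following Python sketch is only an approximation under stated assumptions: check_limits is a hypothetical helper, the request length is measured on the UTF-8 bytes of the string passed in, and the caller supplies the list of query terms, since how terms are counted across q and the as_ parameters is not spelled out here.

```python
MAX_REQUEST_BYTES = 2048   # search request length
MAX_QUERY_TERMS = 50       # terms in q plus any as_ parameters
MAX_SITE_TERMS = 1         # site: query terms per search request


def check_limits(request, query_terms):
    """Return a list of human-readable limit violations (empty if none)."""
    problems = []
    size = len(request.encode("utf-8"))
    if size > MAX_REQUEST_BYTES:
        problems.append(f"request is {size} bytes, limit is {MAX_REQUEST_BYTES}")
    if len(query_terms) > MAX_QUERY_TERMS:
        problems.append(f"{len(query_terms)} query terms, limit is {MAX_QUERY_TERMS}")
    site_terms = [t for t in query_terms if t.startswith("site:")]
    if len(site_terms) > MAX_SITE_TERMS:
        problems.append(f"{len(site_terms)} site: terms, limit is {MAX_SITE_TERMS}")
    return problems


print(check_limits("/search?q=books&output=xml", ["books"]))
```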
3. Results Format
This section is broken into the following categories:


Custom HTML
XML
3.1 Custom HTML
The description of the custom HTML results section is broken down into the
following sections:


Custom HTML Output Overview
Internationalization
3.1.1 Custom HTML Output Overview
Google search provides the ability to generate custom HTML by incorporating an
XSLT (eXtensible Stylesheet Language Transformation) server into the search engine
infrastructure. Search requests submitted to the Google search engine, with the
output input parameter set to xml_no_dtd and a valid proxystylesheet parameter
value, will automatically be processed by the XSLT server as requests for custom
HTML output.
Using the XSL stylesheet specified by the proxystylesheet parameter, the XSLT
server will apply the transformation rules found in the XSL stylesheet to the standard
Google XML results and return the resulting output. While this document assumes
that the output generated by applying the XSL stylesheet will be HTML, almost any
output format can be generated by the application of the appropriate XSL stylesheet
rules. For any front end, the default XSL stylesheet can be customized or replaced by
the search administrator.
To customize the XSL stylesheet used to generate custom HTML output, please
review Google's XML output format to determine the XML tags that may be
transformed using a customized XSL stylesheet.
Additionally, you can leverage the proxycustom parameter to pass custom XML tags
to the XSLT server. Since the inclusion of custom XML does not generate search
results, this feature is useful for implementing additional static search pages, such as
an advanced search page.
Notes:
• XSL stylesheets used by the XSLT server will be cached for 15 minutes. To
  force the XSLT server to use the latest version of an XSL stylesheet, set the
  proxyreload input parameter to a value of 1 in your search request.
• XSL stylesheets which include other files may not be used with the Google
  search engine. Any XSL stylesheet which contains the following tags /
  functions will generate an error result: <xsl:import>, <xsl:include>,
  xmlns: and document()
• When requesting cached results in custom HTML output, the BLOB XML tag
  and associated value are automatically converted to the original text before the
  XSL stylesheet rules are applied. When using an XSL stylesheet which
  customizes cache results, simply use the values of the CACHE_LEGEND_TEXT,
  CACHE_LEGEND_NOTFOUND and CACHE_LEGEND_HTML XML tags directly
  instead of applying a rule on the BLOB sub-tag.
• If you use input or output encodings other than latin1, please consult the
  Internationalization section for more details.
More information on XSL and XSLT can be found on the W3C web site.
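As an illustration of the parameters discussed in this section, the Python sketch below assembles a custom HTML request. It is a sketch under assumptions: build_custom_html_path is a hypothetical helper, and the value "my_stylesheet" is a placeholder for whatever the search administrator has configured as a valid proxystylesheet value.

```python
from urllib.parse import urlencode


def build_custom_html_path(query, client, site, stylesheet,
                           reload_stylesheet=False):
    """Build a request that the XSLT server treats as a custom HTML request."""
    params = {
        "q": query,
        "output": "xml_no_dtd",         # required for custom HTML output
        "client": client,
        "site": site,
        "proxystylesheet": stylesheet,  # placeholder value supplied by the caller
    }
    if reload_stylesheet:
        # Bypass the 15-minute stylesheet cache.
        params["proxyreload"] = "1"
    return "/search?" + urlencode(params)


print(build_custom_html_path("books", "test", "test", "my_stylesheet",
                             reload_stylesheet=True))
```

Setting proxyreload on every request would defeat the stylesheet cache, so it is probably best reserved for use while a stylesheet is being edited.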
3.1.2 Internationalization
The Google search engine handles over 20 character encoding schemes. This section
will discuss any special considerations that must be made when using the custom
HTML output format with encoding schemes other than latin1.
In order to support all the encoding schemes supported by Google, the XSLT server
follows a process to ensure that the results are returned in the correct encoding
scheme. When requesting search results through the XSLT server, the server will
translate the results to the UTF8 encoding scheme before applying the selected XSL
stylesheet. Once the XSL stylesheet rules are applied to generate the results, then the
results will be converted to the encoding scheme specified in the output encoding
parameter, oe, of the search request. The one exception to this rule is cached result
pages, which get converted to the encoding scheme of the cached document after
XSLT processing.
Note: XSL stylesheets are associated with a front end. All XSL stylesheets must be in
latin1 or UTF8 formats.
3.2 XML
The description of the XML results format is broken down into the following
sections:




XML Output Overview
Character Encoding Conventions
Google XML Results DTD
Google XML Tag Definitions
3.2.1 XML Output Overview
For maximum flexibility, Google provides search results in XML format. Using the
Google XML results, you can use your own XML parser to customize the display for
your search users. For developers who want to specify an XSL stylesheet for
transformation of the XML results, instead of developing their own XML parser,
proceed to the Custom HTML section.
Notes:
• All element values will be valid HTML suitable for display, unless otherwise
  noted in the XML tag definitions. Some values are URLs which will need to
  be HTML encoded before displaying.
• All XML parsers used to parse Google results should be built to ignore any
  attributes or tags which are not documented. This will allow custom XML
  parsers to continue working without modification when Google adds more
  features to the XML output in the future.
• In any custom parameter value that contains spaces, each space is replaced
  with "_". The unmodified value can still be retrieved from the
  original_value attribute. For example:
  <PARAM name="temp" value="token_ring" original_value="token+ring" />
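A parser that follows this advice reads only the tags it knows and silently skips everything else. The Python sketch below uses the standard xml.etree.ElementTree module; the sample document is hand-written for illustration (it is not captured appliance output), and NEW_FEATURE is an invented tag standing in for a future addition.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; NEW_FEATURE is an invented, undocumented tag.
SAMPLE = """<GSP VER="3.2">
  <TM>0.05</TM>
  <Q>books</Q>
  <PARAM name="temp" value="token_ring" original_value="token+ring"/>
  <RES SN="1" EN="1">
    <M>1</M>
    <R N="1">
      <U>http://www.example.com/</U>
      <T>Example &lt;b&gt;books&lt;/b&gt;</T>
      <RK>5</RK>
      <NEW_FEATURE>ignored</NEW_FEATURE>
      <HAS/>
    </R>
  </RES>
</GSP>"""


def parse_results(xml_text):
    """Extract index, URL and title of each result; unknown tags are ignored."""
    root = ET.fromstring(xml_text)
    results = []
    for r in root.iterfind("RES/R"):
        results.append({
            "index": int(r.get("N")),
            "url": r.findtext("U"),
            "title": r.findtext("T", default=""),  # T is optional
        })
    return results


def parse_params(xml_text):
    """Map each PARAM name to its original (unmodified) value."""
    root = ET.fromstring(xml_text)
    return {p.get("name"): p.get("original_value") for p in root.iterfind("PARAM")}


print(parse_results(SAMPLE))
print(parse_params(SAMPLE))
```

Because the code looks elements up by name rather than walking every child in order, the extra NEW_FEATURE element has no effect on the output.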
3.2.2 Character Encoding Conventions
The first line of the Google XML results will indicate which character encoding is
used. See the XML Standard for more details.
Additionally, certain characters are required to be escaped when included as values in
XML tags. These characters are documented in the XML standard, and are also
reproduced in the table below. All other characters in the XML results will be
presented without modification.
Character    Escaped form
<            either &lt; or &#60;
&            either &amp; or &#38;
>            either &gt; or &#62;
'            either &apos; or &#39;
"            either &quot; or &#34;
3.2.3 Google XML Results DTD
Google XML results can be returned either with or without a reference to the most
recent DTD (Document Type Definition) describing Google's XML format. The DTD
is a guide to help search administrators and XML parsers understand the XML results
output. Since Google's XML grammar may change from time to time, you should not
configure your parser to use the DTD to validate the XML results.
Additionally, XML parsers should not be configured to fetch the DTD every time a
search request is performed. Since the DTD is updated infrequently, these fetches
create unnecessary delay and bandwidth requirements.
Google recommends that you use the xml_no_dtd output format to get XML results.
If you specify the xml output format in your search request, then the only difference
will be the inclusion of the following line in the XML results.
<!DOCTYPE GSP SYSTEM "google.dtd">
The DTD is available on the Google Search Appliance at
http://<appliance_hostname>/google.dtd
If there are other features you would like to see on the DTD, please consult with your
account representative. Not all features in the DTD may be available or supported at
this time.
3.2.4 Google XML Tag Definitions
This section provides an index and details of Google's XML results.
Sub-Tags Legend
? = optional sub-tag
* = zero or more instances of the sub-tag
+ = one or more instances of the sub-tag
| = Boolean OR
Index
The XML tags are listed in alphabetical order below, under the following initial letters:
B C F G H L M N O P Q R S T U
Details
BLOB
Format
Text (See Definition)
Sub-Tags
Definition
This tag contains HTML data in the encoding format
specified in the encoding attribute. Additionally, the data has
been BASE64 encoded to preserve data integrity of
cached results encoded in a different encoding scheme
than the results requested.
Attributes
Name: encoding
Format: Text (Encoding Scheme)
Description: The encoding scheme of the HTML data (See the Internationalization
section for a list of common encoding values)
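Recovering the text of a BLOB therefore takes two steps: BASE64 decoding, then character decoding. A minimal Python sketch follows; decode_blob is a hypothetical helper, and it assumes the value of the encoding attribute is a name that Python's codec registry recognizes (for example "latin1").

```python
import base64


def decode_blob(blob_text, encoding):
    """Decode BASE64 BLOB content, then interpret the bytes in the given encoding."""
    raw = base64.b64decode(blob_text)
    return raw.decode(encoding)


# Round-trip demonstration with locally generated data.
sample = base64.b64encode("caf\u00e9 <b>menu</b>".encode("latin1")).decode("ascii")
print(decode_blob(sample, "latin1"))
```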
C
Format
Sub-Tags
Definition
Indicates that the "cache:" special query term is
supported for this search result URL
Attributes
Name: SZ
Format: Text (Integer + "k")
Description: Provides the size of the cached version of the search result in
kilobytes ("k"). This field is not populated if no cached version of a document is
available, which can be the case if robots noarchive meta tags are used.
Name: CID
Format: Text
Description: Identifier of a document in Google's cache. To fetch the
document from the cache, send a search term built like this:
"cache:" + CID text + ":" + escaped URL. The escaped
URL is available in the UE tag. Send this search term
normally, as one would type it into the search form.
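The construction of the cache search term can be written out directly. In this Python sketch, build_cache_query is a hypothetical helper and the CID value is made up for illustration; the escaped URL is assumed to have been read from the UE tag of the result.

```python
def build_cache_query(cid, escaped_url):
    """Build the search term: "cache:" + CID text + ":" + escaped URL."""
    return "cache:" + cid + ":" + escaped_url


# "AbCdEf123" is an invented CID; the URL stands in for the UE tag value.
print(build_cache_query("AbCdEf123", "http://www.example.com/"))
```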
CACHE
Format
Sub-Tags
CACHE_URL, CACHE_REDIR_URL,
CACHE_LAST_MODIFIED, CACHE_LEGEND_FOUND?,
CACHE_LEGEND_NOTFOUND?, CACHE_CONTENT_TYPE,
CACHE_LANGUAGE, CACHE_ENCODING, CACHE_HTML
Definition
Provides encapsulation for the cached version of a
search result
Attributes
CACHE_CONTENT_TYPE
Format
Text (MIME type)
Sub-Tags
Definition
MIME type of the cached result as specified in the
HTTP header returned when the document was
crawled
Attributes
CACHE_ENCODING
Format
Text
Sub-Tags
Definition
The encoding scheme of the cached result as specified
in the HTTP header returned when the document was
crawled
(See the Internationalization section for a list of
common values)
Attributes
CACHE_HTML
Format
Text (HTML) (Custom HTML output only)
Sub-Tags
BLOB? (XML output only)
Definition
The cached version of the search result. All search
results are stored in HTML format after being
translated for indexing.
Attributes
CACHE_LANGUAGE
Format
Text (Google language tag)
Sub-Tags
Definition
The language of the cached result as determined by
Google's automatic language classification algorithm.
The value of this tag is the same as the values used for
the automatic language collections without the
"lang_" prefix.
Attributes
CACHE_LAST_MODIFIED
Format
Text
Sub-Tags
Definition
Date that the document was crawled, as specified in
the Date HTTP header when the document was
crawled for this index. The crawler will fetch
documents from its cache if the web server responds
with a 304 (not modified) status code to an if-modified-since request. In this case, the
CACHE_LAST_MODIFIED will be the date the
document was originally crawled and not the date of
the if-modified-since request.
Attributes
CACHE_LEGEND_FOUND
Format
Sub-Tags
CACHE_LEGEND_TEXT*
Definition
Provides encapsulation for query terms found in the
visible text of the cached result returned
Attributes
CACHE_LEGEND_NOTFOUND
Format
Text (Custom HTML output only)
Sub-Tags
BLOB? (XML output only)
Definition
Details of any query terms not visible in the cached
result returned
Attributes
CACHE_LEGEND_TEXT
Format
Text (Custom HTML output only)
Sub-Tags
BLOB (XML output only)
Definition
Details of a query term which is visible in the cached
result. Any query terms found in the cached result will
automatically be highlighted using the colors
described in the attributes of this tag.
Attributes
Name: fgcolor
Format: Color attribute
Description: The foreground color of the query term in the cached
result. This value can be used directly in a color attribute for
HTML tags.
Name: bgcolor
Format: Color attribute
Description: The background color of the query term in the cached
result. This value can be used directly in a color attribute for
HTML tags.
CACHE_REDIR_URL
Format
Text (Absolute URL)
Sub-Tags
Definition
Final URL of cached result after all redirects are
resolved
Attributes
CACHE_URL
Format
Text (Absolute URL)
Sub-Tags
Definition
Initial URL of cached result
Attributes
CRAWLDATE
Format
Text
Sub-Tags
Definition
This is an optional element that shows the date that
the page was crawled. It is shown only for pages
crawled within the past two days.
Attributes
CT
Format
HTML
Sub-Tags
Definition
Search comments
Example comment: Sorry, no content found for this
URL
Attributes
CUSTOM
Format
Sub-Tags
(Any custom XML specified in the search request)
Definition
Provides encapsulation for any custom XML tags
specified in the proxycustom input parameter
Attributes
FI
Format
Sub-Tags
Definition
Indicates that document filtering was performed
during this search
Note: See the section on Automatic Filtering for more
details
Attributes
FS
Format
Sub-Tags
Definition
Additional search result details
Attributes
Name: NAME
Format: Text
Description: Name of the result descriptor
Name: VALUE
Format: Text
Description: Value of the result descriptor
GSP
Format
Sub-Tags
(TM, Q, PARAM*, CUSTOM?, Spelling?,
Synonyms?, CT?, TT?, GM*, RES?) | CACHE
Definition
GSP = "Google Search Protocol"
Provides an encapsulation for all data returned in the
Google XML search results
Attributes
Name: VER
Format: Text
Description: Indicates version of the search results output. The current
output version is "3.2".
GD
Format
Text (HTML)
Sub-Tags
Definition
Contains the description of a KeyMatch result
Attributes
GL
Format
Text (URL)
Sub-Tags
Definition
Contains the URL of a KeyMatch result
Attributes
GM
Format
Sub-Tags
GL, GD?
Definition
Provides encapsulation for a single KeyMatch result
Attributes
HAS
Format
Sub-Tags
L?, C?
Definition
Provides encapsulation for any special features
supported for this search request
Attributes
HN
Format
Text (URL-escaped web directory)
Sub-Tags
Definition
Indicates that directory crowding has occurred and
that additional results are available from the directory
where this search result was found. The value of this
tag is ready to be used with the "site:" query term.
Attributes
Name: U
Format: Text
Description: HTML version of web directory
L
Format
Sub-Tags
Definition
Indicates that the "link:" special query term is
supported for this search result URL
Attributes
M
Format
Text (Integer)
Sub-Tags
Definition
The estimated total number of results for the search
Note: The estimate of the total number of results for a
search can be too high or too low. Please review the
appendix entitled, Estimated vs. Actual Number of
Results.
Attributes
MT
Format
Sub-Tags
Definition
Meta tag name and value pairs pulled from the search
result
Note: Only meta tags which are requested in the
search request will be returned
Attributes
Name: N
Format: Text
Description: Name of the meta tag
Name: V
Format: Text
Description: Value of the meta tag
NB
Format
Sub-Tags
PU?, NU?
Definition
Provides encapsulation for result set navigation
information
Note: The NB tag will only be present if either
previous or additional results are available
Attributes
NU
Format
Text (Relative URL)
Sub-Tags
Definition
Contains relative URL to the next results page
Note: The NU tag will only be present if additional
results are available
Attributes
OneSynonym
Format
HTML
Sub-Tags
Definition
A synonym suggestion for the submitted query in
HTML format.
Attributes
Name: Q
Format: Text
Description: The URL-escaped version of the synonym suggestion
PARAM
Format
Sub-Tags
Definition
The input parameters submitted to the Google search
engine to generate these results
Attributes
Name: name
Format: Text
Description: Input parameter name
Name: value
Format: HTML
Description: HTML formatted version of the input parameter value
Name: original_value
Format: Text
Description: Original URL-escaped version of the input parameter value
PU
Format
Text (Relative URL)
Sub-Tags
Definition
Contains relative URL to the previous results page
Note: The PU tag will only be present if previous
results are available
Attributes
Q
Format
HTML
Sub-Tags
Definition
The search query submitted to the Google search
engine to generate these results
Attributes
R
Format
Sub-Tags
U, T?, RK, FS?, MT*, S?, HAS, HN?
Definition
Provides encapsulation for the details of an individual
search result
Attributes
Name: N
Format: Text (Integer)
Description: Indicates the index (1-based) of this search result
Name: L
Format: Text (Integer)
Description: Indicates the recommended indentation level of the results.
Note: Currently this value will always be 1 unless directory
crowding occurs. In this case, the second directory result will
have a value of 2.
Name: MIME
Format: Text
Description: Indicates the MIME type of the search result
RES
Format
Sub-Tags
M, FI?, XT?, NB?, R*
Definition
Provides encapsulation for the details of the
individual search results
Attributes
Name: SN
Format: Text (Integer)
Description: Indicates the index (1-based) of the first search result
returned in this result set
Name: EN
Format: Text (Integer)
Description: Indicates the index (1-based) of the last search result
returned in this result set
RK
Format
Text (Integer in the range 0-10)
Sub-Tags
Definition
Provides a general rating of the relevance of the
search result
Attributes
S
Format
Text (HTML)
Sub-Tags
Definition
Search result snippet for the search result
Note: Query terms will be highlighted in bold in
the results, and line breaks will be included for proper
text wrapping.
Attributes
Spelling
Format
Sub-Tags
Suggestion+
Definition
Provides encapsulation for alternate spelling
suggestions for the submitted query. Only one
spelling suggestion is returned at this time.
Attributes
Suggestion
Format
HTML
Sub-Tags
Definition
An alternate spelling suggestion for the submitted
query in HTML format
Attributes
Name: Q
Format: Text
Description: The URL-escaped version of the spelling suggestion
Synonyms
Format
Sub-Tags
OneSynonym+
Definition
Provides encapsulation for synonym suggestions for
the submitted query. Up to 20 synonym suggestions
may be returned depending on the synonym list
associated with the front end by the search
administrator.
Attributes
T
Format
Text (HTML)
Sub-Tags
Definition
The title of the search result
Attributes
TM
Format
Text (Floating-point number)
Sub-Tags
Definition
Total server time to return search results, measured in
seconds.
Attributes
U
Format
Text (Absolute URL)
Sub-Tags
Definition
The URL of the search result.
Attributes
XT
Format
Sub-Tags
Definition
Indicates that the estimated total number of results
specified in this search result is exact.
Note: See the section on Automatic Filtering for more
details.
Attributes
Appendices
This section contains any appendices relevant to Google search:


Estimated vs. Actual Number of Results
URL Escaping
Appendix A: Estimated vs. Actual Number of Results
The Google search engine does not guarantee the ability to return a particular number
of results for any given search query. The total number of results provided by Google
in the search results is an estimate of the actual number of results for the query. This
number can be higher or lower than the actual number of results available. This
section covers any issues relating to this topic.
Behavior
When a search request is made to Google, the following behavior occurs:
1. If Google has results to satisfy the search request, then the requested number
of results will be returned.
2. If Google has results and the search request is for results beyond what is
available, the last page of results will be returned. The last page of results is
determined by dividing the total number of results into pages based on the
number of results requested.
3. If no results are available for the search request, then an empty result set will
be returned.
In order to determine if a particular results page is the last page of available results,
check for any of the following conditions:
1. The first result number returned does not match the first result number
requested.
2. The number of results returned is less than the number of results requested.
3. The results returned do not contain a link to the next result set.
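These three conditions translate into a simple check. The Python sketch below is illustrative: is_last_page is a hypothetical helper, sn and en are assumed to be the SN and EN attributes of the RES tag, has_next_link is assumed to reflect whether an NU tag was present, and the requested first result number is assumed to be expressed 1-based so it can be compared with SN. An empty result set (no RES tag) is not handled.

```python
def is_last_page(requested_start, requested_count, sn, en, has_next_link):
    """Return True if any of the three last-page conditions holds."""
    returned_count = en - sn + 1
    return (sn != requested_start                  # condition 1
            or returned_count < requested_count    # condition 2
            or not has_next_link)                  # condition 3


print(is_last_page(1, 10, 1, 10, True))     # a full first page with a next link
print(is_last_page(21, 10, 11, 15, False))  # request ran past the available results
```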
Automatic Filtering
Typically, the number of results actually returned is significantly reduced by the
automatic filtering that Google performs on all search results to weed out undesirable
results. This feature can be disabled per the instructions in the Automatic Filtering
section.
Any results which have been filtered will be identified in the results returned. For
example, the <FI> XML tag will be present in any XML search results where
automatic document filtering has occurred.
Google recommends that the search results page display a message on the last page of
the search results similar to the following message when automatic filtering occurs:
In order to show you the most relevant results, we have omitted some entries very
similar to the search results already displayed. If you like, you can repeat the search
with the omitted results included.
The text "repeat the search with the omitted results included" in the message should
be a hypertext link to submit the same
search again with the filter parameter set to the value 0. Google has found that this
method of informing users about automatic document filtering works well and is used
on the Google Internet search site.
Navigation
When the total number of results returned is an estimate, the navigation structure for
search results can be complicated. Google recommends two approaches for generating
a navigation scheme for your search results:
1. Only provide the search user with the ability to navigate to the previous results
page and the next results page. Google provides links to the previous and next
result set in the results returned when appropriate.
2. Provide the search user with the ability to jump to any search page in the
estimated number of results. If the user requests a results page beyond the results
that are actually available, the last results page will be returned and the
navigation structure should be updated at that time. Google uses this approach
on our Internet search site.
Appendix B: URL Escaping
In order to make a search request to the Google search engine through an HTTP URL
request, there are certain conventions that must be followed in order to allow the
search engine to correctly translate your search request.
The HTTP URL schema defines that only alphanumerics, the special characters
$-_.+!*'(), and the reserved characters ;/?:@=& can be used as values within an
HTTP URL request. Since reserved characters are used by the search engine to
decode the URL and some special characters are used to request search features,
all non-alphanumeric characters used as input parameter values should be URL
escaped.
In order to URL escape a string, all space characters should be converted to a "+"
character and all other non-alphanumeric characters should be replaced by a "%" character
followed by two hexadecimal digits representing the value of that character.
Some input parameters require that the values passed to Google search be
double URL escaped. This means that you apply the URL escaping to the
string twice in succession to generate the final value. See the input parameter
descriptions for more information.
Note: Additional information on URL escaping can be found at W3C and IETF web
sites.
Examples
Original String                          URL Escaped String
chicken -teriyaki                        chicken+%2Dteriyaki
admission form site:www.stanford.edu     admission+form+site%3Awww.stanford.edu

Original String                          Doubly URL Escaped String
William Shakespeare                      William%2BShakespeare
admission form site:www.stanford.edu     admission%2Bform%2Bsite%253Awww.stanford.edu
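The escaping rule and the examples in this appendix can be reproduced with a short Python function. This is a sketch under assumptions rather than Google's implementation: url_escape and double_url_escape are hypothetical names, the period is left unescaped because the examples leave it unescaped, and non-ASCII characters are escaped byte by byte from their UTF-8 encoding.

```python
def url_escape(text):
    """Space becomes "+"; other non-alphanumerics become %XX (uppercase hex)."""
    out = []
    for byte in text.encode("utf-8"):
        ch = chr(byte)
        # Assumption: "." stays unescaped, matching the examples in this appendix.
        if byte < 128 and (ch.isalnum() or ch == "."):
            out.append(ch)
        elif ch == " ":
            out.append("+")
        else:
            out.append(f"%{byte:02X}")
    return "".join(out)


def double_url_escape(text):
    """Apply the URL escaping twice in succession."""
    return url_escape(url_escape(text))


print(url_escape("chicken -teriyaki"))
print(double_url_escape("William Shakespeare"))
```

A hand-written function is used here because Python's urllib.parse.quote_plus leaves the hyphen unescaped, which would not reproduce the first example.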
Glossary
This glossary contains basic descriptions of acronyms and terms found in this
document which may be new to some readers.
Cached result - As part of its core technology, Google indexes all the content on a
page, rather than a portion of the content (percentage or meta tags). Each page that is
indexed is also available to be served in a cached HTML format (up to 4 million bytes
of each document before HTML conversion). When a user views a cached document,
each query term is highlighted in a different color, making it easy for the user to find
the information sought. Because all pages are cached, the user always has access to
content that has been indexed, even if the server where the live content is stored
happens to be refusing connections or is slow to return the page.
Collection - A collection is a subset or a view of the document index. Collections are
specified by URL patterns; some collections are created automatically by the Google
search engine. Collections are useful for allowing refined or advanced searches, for
limiting access to classified information, for group-level security, for
language-specific queries and for many other applications.
DTD - Document Type Definition. The purpose of a DTD is to define the legal
building blocks of an XML document. It defines the XML document structure with a
list of legal elements.
Encoding Scheme - Each language has an official encoding scheme which is used to
represent all of the language's characters in an 8-bit data stream format. These
encoding schemes are used by Google search to determine how to translate incoming
and outgoing search requests.
KeyMatch - Because you occasionally may want to return special results for specific
queries, Google search may be configured with the KeyMatch feature. Using
KeyMatch, the search administrator can designate special results that are returned in
addition to the standard results when specific queries are made. Google recommends
using KeyMatch carefully, as it can drastically decrease the quality of results if
overused.
Meta Tags - HTML tags which can be specified within an HTML document which
are not displayed to the end user, but which may contain document meta-data. Google
search uses meta tags with the NAME attribute to enhance and filter search results
when requested.
MIME - Multipurpose Internet Mail Extensions. The MIME type of a web document
(or search result) identifies the format of the document it is associated with. Some
sample MIME types include "text/html" for HTML documents, and "application/msword" for Microsoft Word documents.
Query - A string of query terms separated by the space character which is submitted
to Google search. The results returned for a particular query will satisfy all query
terms by default.
Query term - A single term which defines a unit of search for the Google search
engine to find in the index. A single query term can not contain any spaces or
punctuation.
UTF-8 - Unicode Transformation Format (8-bit). UTF-8 is a Unicode based
encoding scheme for describing language data by representing the data using 8-bit
codes. This encoding scheme is used by Google search to support multiple languages
simultaneously.
Web Directory - A subset of files on a web server stored under its own directory
name.
XML - eXtensible Markup Language. XML is a markup language, similar to HTML,
which was designed to describe data. The tags used in XML are not pre-defined, and
are described by a DTD or the data provider.
XSL - eXtensible Stylesheet Language. XSL is a language that is designed to
describe how an XML document should be displayed. XSL contains commands that
can be used to describe the transformation and formatting of an XML document for
display. XSL is used in the Google search environment to transform XML results into
custom HTML output.
XSLT - XSL Transformation. XSLT describes the process of transforming an XML
document into another format. Google search allows search administrators to use our
XSLT server to transform our standard XML results into their own custom HTML
output.