SharePoint 2010 - Search Guidance
SharePoint Deployment Planning Services
Prepared for
SDPS Delivery Partners
Monday, 8 February 2016
Version 1.0
Prepared by
SDPS Partner Organization
SDPS@microsoft.com
Contributors
SDPS Delivery Partner
The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of
the date of publication and is subject to change at any time without notice to you. This document and its contents are provided
AS IS without warranty of any kind, and should not be interpreted as an offer or commitment on the part of Microsoft, and
Microsoft cannot guarantee the accuracy of any information presented. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR
IMPLIED, IN THIS DOCUMENT.
The descriptions of other companies’ products in this document, if any, are provided only as a convenience to you. Any such
references should not be considered an endorsement or support by Microsoft. Microsoft cannot guarantee their accuracy, and
the products may change over time. Also, the descriptions are intended as brief highlights to aid understanding, rather than as
thorough coverage. For authoritative descriptions of these products, please consult their respective manufacturers.
This deliverable is provided AS IS without warranty of any kind and MICROSOFT MAKES NO WARRANTIES, EXPRESS OR
IMPLIED, OR OTHERWISE.
All trademarks are the property of their respective companies.
©2010 Microsoft Corporation. All rights reserved.
Microsoft and Windows are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or
other countries.
The names of actual companies and products mentioned herein may be the trademarks of their respective owners.
Table of Contents
Introduction
Purpose, or What to Be Tracking Towards
Search Overview
    Common Search Scenarios
What's New in SharePoint 2010 Search?
    New and Improved Capabilities for Information Workers
        New and Improved Query Capabilities
        New and Improved Search Results Capabilities
    New and Improved Capabilities for IT Professionals
Terminology
Planning for Search
    Understanding the End User
    Understanding the Corpus
Choosing the Right Technology
Topology Planning
    Architectural Components
        Crawler
        Indexing Engine
        Query Engine
        User Interface and Query Object Model
    Scalability and Availability
        Componentization and Scaling
        High Availability and Resiliency
    Topology Components (SharePoint Search 2010)
        Crawl Component
        Crawl Database
        Query Component
        Index Partition
        Host Distribution Rule
    Planning Objectives
Appendix
    Capacity Planning
        Analyzing Enterprises Corpuses
        Corpus Size
        Content Characteristics
        Content Metadata and Managed Properties
        Content Versions
INTRODUCTION
This document should be included in any SharePoint Deployment Planning Services (SDPS)
engagement where there is a need to provide users with the ability to discover content using
search. This guide is intended to supplement the core platform
guidance, and in some cases, may contain recommendations that supersede information that you
find in that document.
PURPOSE, OR WHAT TO BE TRACKING TOWARDS
SDPS is a planning offering and, in many cases, should be considered an accelerator for an
actual deployment or a launching point for deeper planning. As far as a Microsoft-funded SDPS
engagement is concerned, you will rarely have sufficient time to fully document a sophisticated
solution. As such, we describe a minimum set of topics that you should cover with your
customers, even if that coverage is in some cases superficial. The objectives are to:
• Understand the overall importance of search to your customer.
• Identify the search technology most appropriate for the customer.
• Document what you can about the content and its characterization.
• Ultimately, incorporate what you learn back into the logical and physical architecture
diagrams you will be providing back to your customer.
SEARCH OVERVIEW
Search is one of the most important aspects of any SharePoint 2010 deployment because it
allows users to quickly discover content that is relevant to their needs. In some instances,
a user may know many characteristics of the content they are looking for—perhaps they’ve seen a
specific document before but have forgotten where that document is stored. In other situations, a
user may not know content specifics beyond a single keyword. Even in the most well designed and
intuitive information architectures, search effectively allows users to spontaneously create a
taxonomy of their own that facilitates both the navigation and discovery of relevant information in
their SharePoint 2010 deployment.
This document includes planning resources for:
• SharePoint Search 2010
• FAST Search for SharePoint 2010
The following Microsoft TechNet resource centers and blogs are relevant to content in this guide:
• Enterprise Search Resource Center
• Enterprise Search Team Blog
Common Search Scenarios
While this list is by no means exhaustive, here are some reasons why a customer may be interested in
search:
• To support the specific needs of a single SharePoint Web application. Only content managed
by that web application is indexed and made available for queries. For small deployments, this
is a typical scenario.
• To support the needs of a larger SharePoint 2010 deployment that may include multiple Web
applications or tenants. Search can be configured either to isolate the visibility of content to
users of a particular web application or tenant, or to support cross-application queries.
• As a dedicated search deployment to support the needs of an enterprise. For instance,
content managed by disparate SharePoint 2010 deployments, file shares, and other content
repositories could be aggregated into a single enterprise index, thereby enabling users to
discover content without having to know where that content is stored.
• As a search “service farm” that might support a non-SharePoint 2010 application. For
example, a customer may have a public-facing Web site built in ASP. A SharePoint search
service farm could be used to index the content on that Web site and provide a surface
through which the Web site can broker queries.
WHAT'S NEW IN SHAREPOINT 2010 SEARCH?
You can use this section to gain a better understanding of what is new in enterprise search for
SharePoint 2010.
New and Improved Capabilities for Information Workers
SharePoint 2010 provides new capabilities for formulating and submitting queries, and for working
with search results.
New and Improved Query Capabilities
SharePoint 2010 enables end users to create and run more effective search queries. It also enables
users to issue search queries from the desktop in Windows 7.
The new query capabilities are:
• Boolean query syntax for free-text queries and for property queries
SharePoint 2010 supports use of the Boolean operators AND, OR, and NOT in search
queries. For example, a user can execute a query such as the following:
(“SharePoint Search” OR “Live Search”) AND (title:”Keyword Syntax” OR title:”Query Syntax”)
(A scripted example of submitting such a query through the query object model appears
after this list.)
• Prefix matching for search keywords and document properties
Search queries can use the * character as a wildcard at the end of a text string. For example,
the search query "comp*" would find documents that contain "computer" or "component"
or "competency". Similarly, the query "author:Ad*" would find documents created by
"Adam" or "Administrator". Therefore, the query "comp* author:ad*" would find documents
that contain "component" and that were created by "Adam", as well as documents
that contain "computer" and that were created by "Administrator".
• Suggestions while typing search queries
As a user types keywords in the Search box, the Search Center provides suggestions to help
complete the query. These suggestions are based on past queries from other users.
• Suggestions after users run queries
Search center also provides suggestions after a query has been run. These suggestions are
also based on past queries from other users, and are distinct from the 'did you mean'
feature.
• Connectors for enterprise search in Windows 7
From an Enterprise Search Center, users can easily create a connector for their SharePoint
searches in Windows 7. By typing search queries into the Windows 7 search box users can
find relevant documents from SharePoint and take advantage of Windows features such as
file preview and drag-and-drop for documents returned in those search results.
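The same query syntax is available to scripts and custom solutions through the query object model described later in this document. The following sketch assumes it is run from the SharePoint 2010 Management Shell on a farm server; the site collection URL is a placeholder and the query text is illustrative only.

# Minimal sketch: submit a Boolean keyword query through the query object model.
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue
$site  = Get-SPSite "http://intranet.contoso.com"          # hypothetical site collection
$query = New-Object Microsoft.Office.Server.Search.Query.KeywordQuery($site)
$query.QueryText   = '("SharePoint Search" OR "Live Search") AND title:"Query Syntax"'
$query.ResultTypes = [Microsoft.Office.Server.Search.Query.ResultType]::RelevantResults
# Execute returns a ResultTableCollection; load the relevant results into a DataTable for display.
$resultTables = $query.Execute()
$relevant     = $resultTables.Item([Microsoft.Office.Server.Search.Query.ResultType]::RelevantResults)
$table = New-Object System.Data.DataTable
$table.Load($relevant, [System.Data.LoadOption]::OverwriteChanges)
$table | Select-Object Title, Path | Format-Table -AutoSize
$site.Dispose()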
New and Improved Search Results Capabilities
SharePoint 2010 provides many improvements for getting and viewing search results. The new
search results capabilities are:
• Results display
The search results page includes a refinement panel, which provides a summary of search
results and enables users to browse and understand the results quickly. For example, for a
particular search query the summary in the refinement panel might show that there are
many Web pages in the search results and many documents by a particular author. A
summary might also indicate that there are mostly Microsoft Word® and Microsoft Excel®
documents in the top set of results. The refinement panel also enables users to filter
results—for example by kind of content (document, spreadsheet, presentation, Web page,
and so on), content location (such as SharePoint 2010 sites), content author, or date last
modified. A user can also filter by category based on managed properties and enterprise
content management (ECM) taxonomy nodes that an administrator configures.
• View in Browser
The View in Browser capability allows users to view most Microsoft Office documents in the
browser by using Office Web Applications.
Office Web Applications is the online companion to Word, Excel, Microsoft PowerPoint®
and Microsoft OneNote®, and it enables information workers to access documents from
anywhere. Users can view, share, and work collaboratively on documents by using personal
computers, mobile phones, and Web browsers. Office Web Applications is available to
users through Windows Live. It is also available to business customers with Microsoft Office
2010 volume licensing agreements and document management solutions based on
SharePoint 2010.
• People search
People search enables users to find other people in the organization not only by name, but
also by many other categories, such as department, job title, projects, expertise, and
location. People search improvements include:
o Improved relevance in people search results
Results relevance for people search is improved, especially for searches on names
and expertise.
o Self search
The effectiveness of people search increases as users add data to their profiles.
When a user performs a search, the search system recognizes this as a “self search”
and displays related metadata. The metadata can include information such as the
number of times the My Site profile was viewed and the terms that other people
typed that returned the user’s name. This can encourage users to add information
to their profile pages to help other users when they search. As users update their
My Site profiles, other users can find them more easily in subsequent searches. This
increases productivity by helping to connect people who have common business
interests and responsibilities.
o Phonetic name matching and nickname matching
Users can search for a person in the organization without knowing the exact
spelling of their name. For example, the search query “John Steal” could yield “John
Steele” in the search results; results for the search query “Jeff” include names that
contain “Geoff.” In addition, nickname matching makes it possible for a search
query for “Bill” to yield results that include “William.”
NOTE: Phonetic matching applies to the following languages supported by
SharePoint 2010:
- English
- Spanish
- French
- German
- Italian
- Korean
- Portuguese (Brazil)
- Russian
• Enhancements for relevance of search results
SharePoint 2010 provides improvements to increase the relevance and usefulness of search
results, such as the following:
o Ranking based on click-through history
If a document in a search result set is frequently clicked by users, this indicates that
information workers find the document useful. The document is therefore
promoted in the ranking of search results.
o Relevance based on extracted metadata
Document metadata is indexed along with document content. However,
information workers do not always update metadata correctly. For example, they
often re-purpose documents that were created by other people, and may not
update the author property. Therefore, the original author's name remains in the
property sheet, and is consequently indexed. However, the search system can
sometimes determine the author from a phrase in the document. For example, the
search system could infer the author from a phrase in the document such as "By
John Doe". In this case, SharePoint 2010 includes the original author, but also
maintains a shadow value of "John Doe". Both values are then treated equally when
a user searches for documents by specific authors.
New and Improved Capabilities for IT Professionals
SharePoint 2010 includes new ways for administrators to help provide the most benefits for end
users who are searching for information. IT professionals can take advantage of the following new
and improved features:
• Improved administrative interface
SharePoint 2010 includes the new search administration pages that were first available for
organizations that deployed Microsoft Office SharePoint Server 2007 and then installed the
Infrastructure Update for Microsoft Office Servers. This new interface centralizes the
location for performing administrative tasks. With SharePoint 2010, administrators have an
interface that provides the following advantages:
o A single starting point for all farm-wide administration tasks, including search
administration. The most common search tasks are highlighted.
o A central location where farm administrators and search administrators can
monitor server status and activity.
• Farm Configuration Wizard
After the Installation Wizard finishes, the Farm Configuration Wizard runs automatically.
The Farm Configuration Wizard helps simplify deployment of small farms. It provides the
option to automate much of the initial configuration process with default settings. For
example, when you use the Farm Configuration Wizard to deploy the first application server
in a farm, the wizard automatically creates a fully functional search system on that server,
including the following:
o A Search Center from which users can issue queries (if the person installing the
product selected this option in the Farm Installation Wizard).
o A fully functional search topology that can support an index of up to 10 million
crawled documents.
o The ability to crawl SharePoint 2010 sites in the server farm immediately after the
Farm Configuration Wizard finishes running.
• Search service administration independent of other shared services
In Office SharePoint Server 2007, the Office SharePoint Server Search service was bundled
with other shared services (such as Excel Calculation Services) in the Shared Services
Provider (SSP). In that architecture, you could not create a new Search service without
creating a new SSP. In contrast, in SharePoint 2010, you can create and manage Search
service applications independently of one another and independently of other service
applications. This is because of the new, more granular, Service Architecture of SharePoint
2010.
• Expanded support for automating administrative tasks (a Windows PowerShell sketch follows this list)
You can automate many search administration tasks by using Windows PowerShell™ 2.0
scripts. For example, you can use Windows PowerShell 2.0 scripts to manage content
sources and search system topology. Windows PowerShell support is new for SharePoint
2010.
• Increased performance, capacity, and reliability
SharePoint 2010 provides many new ways to configure and optimize a search solution for
better performance, capacity, and reliability, as follows:
o Scalability for increased crawling capability
In Office SharePoint Server 2007, a Shared Services Provider could only be
configured to use one indexer. With SharePoint 2010, you can scale the number of
crawl components by adding additional servers to your farm and configuring them
as crawlers. This enables you to do the following:
- Increase crawl frequency and volume, which helps the search system to
provide more comprehensive and up-to-date results.
- Increase performance by distributing the crawl load.
- Provide redundancy if a particular server fails.
o Scalability for increased throughput and reduced latency
You can increase the number of query components to do the following:
- Increase query throughput—that is, increase the number of queries that
the search system can handle at a time.
- Reduce query latency—that is, reduce the amount of time it takes to
retrieve search results. One of the general aims of enterprise search with
SharePoint 2010 is to implement sub-second query latencies for all
searches. To achieve this, you must ensure that no query server deals with
more than ten million items; you can achieve this by adding multiple query
servers to your farm, thereby taking advantage of the new index
partitioning features of SharePoint 2010. Office SharePoint Server 2007 did
not support the concept of index partitioning.
- Provide failover capability for query components.
• Topology management during normal operations
You can tune the existing search topology during regular farm operations while search
functionality remains available to users. For example, during usual operations, you can
deploy additional index partitions and query components to accommodate changing
conditions.
• Operations management
SharePoint 2010 provides new capabilities for monitoring farm operations and customizing
reports for enterprise search. Specifically, administrators can review status information and
topology information in the search administration pages of the Central Administration Web
site. They can also review crawl logs, as well as health reports, and can use System Center
Operations Manager to monitor and troubleshoot the search system.
• Health and performance monitoring
Health and performance monitoring features enable an administrator to monitor search
operations in the farm. This can be especially helpful for monitoring crawl status and query
performance.
SharePoint 2010 includes a health analysis tool that you can use to check for potential
configuration, performance, and usage problems automatically. Search administrators can
configure specific health reporting jobs to do the following:
o Run on a predefined schedule.
o Alert an administrator when problems are found.
o Formulate reports that can be used for performance monitoring, capacity planning,
and troubleshooting.
• Search Analytics Reports
SharePoint 2010 provides new reports that help you to analyze search system operations
and tune the search system to provide the best results for search queries. For example,
reports can include information about what terms are used most frequently in queries or
how many queries are issued during certain time periods. Information about peak query
times can help you decide about server farm topology and about best times to crawl.
• Searches of diverse content by crawling
SharePoint 2010 can search content in repositories other than SharePoint sites by crawling
or federating. For example, the search system can crawl content in repositories such as file
shares, Exchange public folders, and Lotus Notes using connectors included with
SharePoint 2010. Additional connectors for crawling databases and third-party application
data are created easily by using the Business Connectivity Services connector framework.
Support for creating connectors using SharePoint® Designer 2010 or Microsoft Visual
Studio® 2010 enables quicker and easier development compared to protocol handlers for
Office SharePoint Server 2007.
• Searches of diverse content by federation
SharePoint 2010 search results can include content from other search engines. For example,
an administrator might federate search results from www.bing.com or from a
geographically distributed internal location.
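As a concrete illustration of the Windows PowerShell support described above, the following read-only sketch inventories the search system for a farm with a single Search service application. It assumes the SharePoint 2010 Management Shell; scaling out is performed with the New-SPEnterpriseSearchCrawlComponent and New-SPEnterpriseSearchQueryComponent cmdlets, which are not shown here.

# Read-only inventory of the Search service application, its content sources, and topologies.
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue
$ssa = Get-SPEnterpriseSearchServiceApplication
# Content sources and their current crawl status.
Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa |
    Select-Object Name, Type, CrawlStatus, CrawlCompleted
# Crawl and query topologies currently defined for the service application.
Get-SPEnterpriseSearchCrawlTopology -SearchApplication $ssa
Get-SPEnterpriseSearchQueryTopology -SearchApplication $ssa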
TERMINOLOGY
It is important that you have a solid understanding of the terms and definitions used throughout
this document.
Term
Definition
Best Bet
Best Bets are URLs to documents that are associated with one or more keywords.
Typically these documents or sites are ones that you expect users will want to see at
the top of the search results list. Best Bets are returned by queries that include the
associated keywords, regardless of whether the URL has been indexed. Site
collection administrators can create keywords and associate Best Bets with them.
Connector
Connectors are components that communicate with specific types of systems, and
are used by the crawler to connect to and retrieve content to be indexed.
Connectors communicate with the systems being indexed by using appropriate
protocols. For example, the connector used to index shared folders communicates by
using the FILE:// protocol, whereas connectors used to index Web sites use the
HTTP:// or HTTPS:// protocols.
Content Source
Content sources are definitions of systems that will be crawled and indexed. For
example, administrators can create content sources to represent shared network
folders, SharePoint 2010 sites, other Web sites, Exchange public folders, third-party
applications, databases, and so on.
Crawl Rule
Crawl rules specify how crawlers retrieve content to be indexed from content
sources. For example, a crawl rule might specify that specific file types are to be
excluded from a crawl, or might specify that a specific user account is to be used to
crawl a given range of URLs.
Crawl Schedule
Crawl schedules specify the frequency and dates/times for crawling content sources.
Administrators create crawl schedules so that they do not have to start all crawl
processes manually.
Crawled Property
Crawled properties represent the metadata for content that is indexed. Typically,
crawled properties include column data for SharePoint list items, document
properties for Microsoft Office or other binary file types, and HTML metadata in
Web pages. Administrators map crawled properties to managed properties, in order
to provide useful search experiences. See Managed Properties for more details.
Crawler
The crawler is the component that uses connectors to retrieve content from content
sources.
Crawler Impact Rule
A crawler impact rule governs the load that the crawler places on source systems
when it crawls the content in those source systems. For example, one crawler impact
rule might specify that a specific content source that is not used heavily by
information workers should be crawled by requesting 64 documents simultaneously,
while another crawler impact rule might specify less aggressive crawl characteristics
for systems that are constantly in use by information workers.
Federation
Federation is the concept of retrieving search results from multiple search providers,
based on a single query performed by an information worker. For example, your
organization might include federation with Bing.com, so that results are returned by
SharePoint 2010 and Bing.com for a given query.
IFilter
IFilters are used by connectors to read the content in specific file types. For example,
the Word IFilter is used to read Word documents, while a PDF IFilter is used to read
PDF files.
Index
An index is a physical file that contains indexed content, and which is used by query
servers to satisfy a query.
Indexer
Indexers manage the content to be included in an index, and propagate that content
to query servers, where it is stored in index files.
Indexing Engine
See Indexer
Index Partition
See Partitioned Index
Managed Property
Administrators create managed properties by mapping them to one or more
crawled properties. For example, an administrator might create a managed property
named Client that maps to various crawled properties called Customer, Client, and
Cust from different content sources. Managed properties can then be used across
enterprise search solutions, such as in defining search scopes and in applying query
filters. (A scripted sketch of creating and mapping a managed property appears at the
end of this section.)
OpenSearch
OpenSearch is an industry standard that enables compliant search engines to be
used in federated scenarios. See Federation for more details.
Partitioned Index
SharePoint 2010 includes a new concept that enables administrators to spread the
load for queries across multiple query servers. This is achieved by creating subsets of
an index, and propagating individual subsets to different query servers. The subsets
are known as partitions. At query time, the query object model contacts each query
server that can satisfy the search so that all results to be returned to the user are
included.
Properties Database
Managed properties and security descriptors for search results are not stored in the
physical index files. Instead, they are stored in an efficient database that is
propagated to query servers. Query servers typically satisfy a query by retrieving
information from both the index file and the properties database.
Query Object Model
The query object model is responsible for accepting inputs from search user
interfaces, and for issuing appropriate queries to query servers. The search Web
Parts provided by SharePoint 2010 use the query object model to run queries.
Developers can also create custom user interfaces and solutions that run queries by
using the query object model.
Query Server
Query servers retrieve data from index files and property databases to satisfy
queries.
Ranking
Ranking defines the sort order in which results are returned from queries. Typically,
results are sorted in order of descending relevance, so that the most relevant
documents are presented near the top of the results page. However, information
workers might choose to apply a different sort order, such as by date modified.
Relevance
Relevance describes how well a given search satisfies a user’s information needs.
Relevance includes which documents are returned in the results (document recall)
and the order of those documents in the results (ranking).
Search Center
Search Center is a site based on the Search Center site template, and provides a
focused user interface that enables information workers to run queries and work
with search results.
Search Document
See Search Item
Search Item
A search item represents a document, list item, file, Web page, Exchange public
folder post, or database row that has been indexed. Search items are sometimes
referred to as search documents, but the key point is that these items are returned
by search queries.
Stemming
Words in each language can have multiple forms, but essentially mean the same
thing. For example, the verb 'To Write' includes forms such as writing, wrote, write,
and writes. Similarly, nouns normally include singular and plural versions, such as
book and books. The stemming feature in enterprise search can increase recall of
relevant documents by mapping one form of a word to its variants.
Stop Word
Stop words (sometimes known as noise words) are those words for which there is no
value in indexing them. Some stop words are part of the language (such as 'a', 'and',
and 'the'). There is no value in indexing these words as they are likely to be
contained in a high percentage of indexed items. Furthermore, information workers
rarely search for just these types of terms.
Synonym
Synonyms are words that mean the same thing as other words. For example, you
might consider laptop and notebook to mean the same thing. Administrators can
create synonyms for keywords that information workers are likely to search for in
their organization. Additionally, synonyms that can be used to improve recall of
relevant documents are stored in thesaurus files.
Word Breaker
Streams of words are retrieved from content sources, and those streams are broken
down into discrete words for indexing. Word breakers are the components that
break down streams into individual words. Streams to be indexed are normally
broken down by identifying spaces, punctuation marks, and the particular rules of
each language. Also, when a user enters multiple words into a search box, that query
is broken into discrete terms by a word breaker.
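To make the Crawled Property and Managed Property definitions above concrete, the following sketch creates a managed text property and maps a crawled property to it. The crawled property name ows_Customer is a placeholder, and new mappings are not reflected in query results until the content is recrawled.

# Sketch: create a managed property and map an existing crawled property to it.
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue
$ssa = Get-SPEnterpriseSearchServiceApplication
# Locate a crawled property surfaced by an earlier crawl (placeholder name).
$crawled = Get-SPEnterpriseSearchMetadataCrawledProperty -SearchApplication $ssa -Name "ows_Customer" |
    Select-Object -First 1
# Create a managed property of type Text (1) and map the crawled property to it.
$managed = New-SPEnterpriseSearchMetadataManagedProperty -SearchApplication $ssa -Name "Client" -Type 1
New-SPEnterpriseSearchMetadataMapping -SearchApplication $ssa -ManagedProperty $managed -CrawledProperty $crawled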
PLANNING FOR SEARCH
Understanding the End User
The end user is ultimately the most important factor that needs to be considered when you deploy
any application. Your search solution is no different: you will need to consider not only users'
wants and needs, but also who these individuals are and their relationship to the system.
Common questions are:
Question
Impact
Where are users in relation to the
system?
Particularly in global or Internet-facing deployments, the user base
may be connecting from a variety of different locations, each subject
to the unique characteristics of the network that connects them to the
system. In some instances, you may need to factor in performance
expectations, as described in the next section. However, search also can
benefit remote users as it might allow them to find content more
quickly, even when they know exactly what they need, with less
interaction with the system. For instance, a user might be looking for a
document that takes 10 clicks to reach using site navigation. If,
because of poor network connectivity to the system, it takes 10
seconds for each page request to be fulfilled, the total time to
destination is significantly greater than if the user can enter a succinct
search term on the home page and within ten seconds have a result
set that includes a direct link to the desired content. From a planning
perspective, make sure that you provide sufficient support for users
getting relevant results.
What are their performance
expectations of the system?
Commonly, end user performance expectations relate to the amount
of time it takes from the execution of a query to the time that the
system presents them with results. This may be covered in a service
level agreement or it may be subjective, based on experience with
other search systems. Factors that influence the perception of the
overall speed of the system can include such things as adequate
capacity planning, the relative location of the system to the end user,
and even what additional operations the system may have to do
before it can return a result set. When planning a search deployment,
you should work to quantify the performance expectations and
attempt to honor those expectations during your capacity planning. In
addition, there may be factors that you cannot manage within the
environment itself that can be mitigated through alternate approaches.
For instance, a remote user may have a poor network connection to a
corporate deployment and would benefit from interacting with a
regional system. This regional system could potentially federate results
from the corporate deployment or could itself index all or some of the
same content indexed in the head office environment.
Are end users already familiar with
search in Office SharePoint Server
2007?
If users are familiar with Office SharePoint Server 2007 search, then
they may be very familiar with how queries are accepted and qualified
(for instance, using Scopes). If they have been using earlier versions of
SharePoint, some improvements will be obvious in SharePoint 2010.
However, while some things may look the same, there are substantial
improvements in what a user can enter in the search box, such as the
new support for wildcards and Boolean operators. From a planning
perspective, it is important to stress end user training so that end users
are aware of the new capabilities.
Is there a search paradigm that users
may already be familiar with?
One possible motivator for this deployment may be to replace an
existing system. Regardless of whether SharePoint 2010 improves on
every capability offered by the previous system or not, their reception
of SharePoint 2010 may be biased by the absence of one or more
features that they were accustomed to having. For example, a
government agency may have had an advanced search form that
enabled users to quickly characterize the type of content they were
looking for using checkboxes. Users at a new deployment may only be
familiar with public facing search engines such as Bing or Google. An
even more subtle example is a situation where users expect two search
terms to be processed with a specific join term (“dog house” would be
processed as “dog and house”) or even treated as an explicit phrase.
From a planning perspective, end user training may reduce the
learning curve and also deflect any initial negative perceptions about
such things as relevance. For more sophisticated search interfaces that
training cannot accommodate, the rich SharePoint 2010 search
object model, coupled with a much more extensible set of out-of-box
search Web Parts, may be leveraged to support these needs.
Are end users going to be issuing
queries using a language other than the
default system language?
While language packs are discussed in the core platform guidance, you
must consider the impact that this has on search-related
configurations as well. For example, localization may need to be
considered for noise words, synonyms, best bets, and word breakers.
Are there any expectations that one
set of content will be kept separate
from another set of content?
End users may be confused, distracted, or annoyed by having their
queries return results across a diverse set of content. For example, a
user may only want to see Finance documents and not Human
Resources results when executing a query. Depending on the
underlying requirement, this may necessitate planning for separate
content sources, search scopes, search applications, or even separate
search farms. (A scripted sketch of defining a shared search scope for
this purpose appears after this table.)
Are there any expectations around
search availability?
This would typically be captured in a Service Level Agreement or SLA.
Basically, you want to understand what the true importance of search
is to an organization as it relates to this deployment. For example, in a
public facing Internet site, search may be the primary vector by which
end users get to the content they need, whereas in a small
deployment a search outage might be more tolerable. The greater the
need for availability, the greater the level of redundancy you should
build into your design.
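Where one set of content must simply be kept out of most queries rather than physically isolated, a shared search scope is often the lightest-weight option, as noted in the table above. The sketch below uses placeholder names and URLs, and the parameter names reflect the SharePoint 2010 scope cmdlets as best recalled; verify them with Get-Help New-SPEnterpriseSearchQueryScope and Get-Help New-SPEnterpriseSearchQueryScopeRule before relying on them.

# Sketch: define a shared "Finance" scope limited to one site subtree (placeholders throughout).
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue
$ssa = Get-SPEnterpriseSearchServiceApplication
$financeUrl = "http://intranet.contoso.com/finance"
$scope = New-SPEnterpriseSearchQueryScope -SearchApplication $ssa -Name "Finance" `
    -Description "Finance department content" -DisplayInAdminUI $true
# Include only items under the Finance site in the scope.
New-SPEnterpriseSearchQueryScopeRule -SearchApplication $ssa -Scope $scope -RuleType Url `
    -MatchingString $financeUrl -UrlScopeRuleType Folder -FilterBehavior Include -Url $financeUrl
# The scope takes effect after the next scheduled scope compilation.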
Understanding the Corpus
The corpus is the entire volume of content that the customer wishes to have their deployment crawl
and make available for query fulfillment. When you discuss this with the customer, it is generally a
good idea to capture the information you are collecting in a diagram. This diagram should initially
include a placeholder for the farm design and also capture who will be interacting with the system,
ideally segmenting this pool of users into separate objects, where each object represents a group of
users with a common set of needs or characteristics. As you gain a better grasp of the content
exposed through search, you may need to consolidate or break up the users into different groups.
Question
Impact
Where and what are the repositories
that the solution should be indexing?
Generally, the answer you get will map to SharePoint Content Sources,
but at this point you are most concerned with the various systems that
SharePoint 2010 will need to interact with in order to gain access to
that content. This has broad impact on your search planning, as it may
indicate a need for specific content connectors to communicate with
those repositories. The repository types supported out-of-box by
SharePoint 2010 are SharePoint Products and Technologies sites, Web
sites, Microsoft Exchange public folders, and file shares. Databases may
also be crawled using Business Connectivity Services (BCS); however,
this will necessitate the design or development of those connectors.
Microsoft also provides a connector for Lotus Notes. Some third-party
companies may provide connectors for other repositories, and the rich
API set provided with SharePoint 2010 permits the development of
custom connectors. (A scripted sketch of defining content sources and
crawl rules appears after this table.)
For each repository, how is content
secured from both an authentication
perspective as well as an
authorization one?
Some content repositories require SharePoint 2010 to first
authenticate against that repository before gaining access to the
content that it manages. This may necessitate having a special
privileged account solely for this purpose. Where content within a
repository has or can have unique access requirements, the crawl
account must have sufficient permission to read that content in order
for it to be included in the index. Dependent on the type of repository
SharePoint 2010 connects to, the repository may be unable to provide
any authorization restrictions back with the content. For example, a
customer may want a particular non-SharePoint 2010 Web site
crawled. Although the site requires users authenticate against the
system and SharePoint 2010 can honor this requirement, there is no
security information returned when SharePoint 2010 crawls that
content. Consequently, all users of the SharePoint 2010 environment
are able to see results that pertain to that secure system regardless of
whether they themselves have access to that system. There are a
number of strategies that you can pursue to prevent this from
happening, if the customer believes it is an issue. One tactic is to
attempt to index Web content using the BCS. This has a dependency
on the site being database driven, the data within the database being
receptive to being described using the BCS, and the database being visible
to the crawler. An additional layer of security can be applied to the
BCS, either directly aligning to the authorization scheme associated
with the web application or through the application of broader access
rules around the BCS. In other instances, while the content connector
may be able to return authorization information along with the
content, it may necessitate mapping credentials from the foreign
system back into a format appropriate to SharePoint 2010. This is the
case with the Lotus Notes connector. Finally, SharePoint 2010 still
supports the concept of custom security trimmers which, at query
time, can determine (typically by leveraging the user’s SharePoint 2010
credentials against the repository) whether a result should be included
or not.
What is the volume of content stored
in the repository?
The answer you get should be normalized in units of search items. For
SharePoint 2010 repositories, this count will include user profiles, list
items, documents, and pages. For a database, this is the number of
rows that are to be indexed. For Web sites, this is the number of
unique Web pages. Be aware that a customer may have a Web site
with a parameter-driven page whose rendering is controlled by a
given query string parameter. If, by altering that parameter, 4000
unique pages are delivered, the search items for this site would be
4000, not just one physical page. From a planning perspective, this
volume will impact the ability of SharePoint 2010 to crawl that content.
By using this example again, a full crawl will require the crawler to
make 4000 HTTP requests against the web content source. Other
content connectors, such as the BCS, may be able to crawl multiple
items in a single request, but more commonly, the more search items
there are, the more outbound requests by the crawler. This is one of
the primary impacts on crawl performance and can be countered by
the introduction of multiple crawlers targeting the same content
source.
Is the content repository ready for
you to crawl it?
In many cases ownership of secondary repositories may be different
from those managing the system you are designing. It is important to
get approval from the owner of the secondary repository so that you
understand how your crawl activity may impact that system,
particularly if that repository is not sized to service these types of
requests. If this type of load is of concern, defining less impactful crawl
schedules or even throttling the crawling of particular content
repositories may be part of the answer. For those instances where
ownership is one and the same, particularly for deployments where
search is indexing content managed by the same SharePoint 2010
instance that hosts it, you will still need to be cognizant of the impact
that crawling has on other activities on the system. Again crawl
schedules might be part of the solution, while another option may be
to setup a dedicated Web Front-End (WFE) server that is not included
in the end user rotation, so that only the crawler uses it when indexing
the deployment.
How frequently does the content in
this repository change?
Most repositories are never entirely static; new content is added,
existing content updated, and old content retired. This information is
one of the inputs that you may use to determine the crawl schedule
associated with a particular repository. If the change frequency within
a repository varies dependent on the datasets in the repository, you
may want to consider a strategy for dividing the repository into
multiple content sources to enable you to set different crawl
schedules, thus maximizing the freshness of the overall index.
Are there any predictors for growth?
The number of searchable items in the repository today is important,
but it is just as important to understand what is expected to be in that
repository a year or more from now. A customer that adds 100,000
SKUs to a product database each month will impact how you plan now for future capacity.
What level of parity should the index
have as it relates to the content in the
repository?
A full crawl always requests everything managed by the system. This volume of
requests impacts crawler performance, and the system being indexed incurs load
during a crawl. You also need to consider the
amount of time that it takes for a full crawl to complete; for some
content repositories, this may be measured in days. Incremental crawls
target only content that has experienced a change, but the crawler is
dependent on the system being crawled to be able to provide this
information. If it cannot (as is the case with Web sites), an incremental
crawl is, effectively, treated as a full crawl. It is largely an ongoing effort
to develop a crawl schedule that attends to end user expectations
on content freshness, takes into consideration the frequency of change
within that repository, and is sensitive to the resource
demands placed on both the system crawling and the system
being crawled. You might benefit from content source segmentation,
allowing volatile areas of content to be indexed more frequently than
others. Choosing off peak crawl schedules, scaling crawlers to enable
parallel indexing, and relying more on incremental crawls are possible
alternatives.
Does the repository include content
that should not be crawled?
This could be directories, certain types of files, or even certain named
files. SharePoint 2010 offers improved support for defining crawl rules
using regular expressions. It is possible to define patterns for content
that should not be included in the index.
What types of content are stored in
the repository?
For this, you are primarily concerned with the data structures in which
each of the search items in the repository is contained. For
databases, this is likely a row of text data. For a web site, it might be a
web page or a hyperlinked document. For SharePoint 2010, this could
be user profiles, a rich assortment of document types, as well as a list
item. While it is rarely possible to be exact regarding this, you should
strive to approximate how the volume of content captured earlier is
distributed across various data structures. Knowledge of the types will
assist you in several ways:
• While SharePoint 2010 includes a number of IFilters out of the box,
others may be required to truly gain access to content stored within
those documents.
• The amount of processing time associated with crawling one file type
versus another may be significant, even if we are talking milliseconds.
Ultimately, we are concerned about the aggregate impact.
• Different file types typically have different index densities. By this we
mean that more actual content may make its way into the index for a
small Microsoft Word file than for a large JPEG file.
• Some file types expose metadata along with their content. For
example, a Microsoft Word file may have properties on it that
supplement the content contained within the actual file – keywords,
customer name, and subtitle are a few examples. A web page may
expose HTML META tags that are exposed
by an IFilter as properties. Dependent on how you choose to
respond to these properties, they may end up in the property
store, increasing its overall size.
• SharePoint 2010 may need additional configuration to crawl
specific file extensions, even if an existing IFilter could crawl
the content.
Does content in one repository need
to be kept separate from another?
There may be regulatory or security motivators for doing this. As with
the similar topic in the end user section, there can be a number of
solutions to this problem. If the motivation is to prevent users from
seeing content that they should not be able to see, assuming the
content is already secured and the crawler can consume the
permissions on that content, isolation by way of security trimming
should already occur. It is also possible to logically isolate content by
creating multiple content sources that target different segments of a
repository, or even redefining the default search scope or defining new
scopes that, at query time, restrict the query to only a portion of the
corpus. If the customer demands more physical isolation, separate
crawl and query applications (along with their required databases)
could be setup to ensure that content is truly managed separately. If
the demands are even more extreme, it may necessitate production of
a separate farm to honor this isolation requirement.
For each type of content within a
repository, what is the average size of
that content?
This is often an estimate, but may influence some of your
recommendations on capacity planning. For files, every crawl operation
will demand moving the file across the network to the crawler. For
database search items, it is rows of data that move across the network.
You must consider network saturation, but you will also need to
remember that larger files could result in more of an index signature
than smaller files. Larger files that have more data require more
processing time to extract that data.
Is content within the repository
augmented by additional metadata?
While some content connectors surface little additional metadata, others surface
a great deal. For example, the file share connector will only surface such
things as where the file resides and security information around the
file. With the SharePoint 2010 content connector, however, a
document may be heavily decorated with system metadata, but also
be supplemented with data corresponding to custom columns in a
document library or even as a result of the document association with
a given content type. This metadata, depending on configuration, can
be saved in the property store. The more properties on a particular
piece of content, the larger its signature within the property store. This
is heavily dependent on a customer implementation, but you must
consider this when sizing the property database.
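Pulling several of the questions above together, the following sketch registers a file share as a content source, adds an exclusion crawl rule so that an archive folder is never indexed, and starts the first full crawl. All paths and names are placeholders; confirm the exact parameters with Get-Help in your own environment.

# Sketch: register a file share content source, exclude an archive path, and start a full crawl.
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue
$ssa = Get-SPEnterpriseSearchServiceApplication
# Content source for a departmental file share (placeholder path).
$cs = New-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa -Name "Finance File Share" `
    -Type File -StartAddresses "file://fileserver/finance"
# Wildcard exclusion rule; regular-expression rules are also supported (see the cmdlet help).
New-SPEnterpriseSearchCrawlRule -SearchApplication $ssa -Type ExclusionRule `
    -Path "file://fileserver/finance/archive/*"
# Start the first full crawl; later crawls would normally run on a defined crawl schedule.
$cs.StartFullCrawl()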
Additional planning details are described on the Search Environment Planning diagram available at:
http://www.microsoft.com/downloads/details.aspx?familyid=5655EACA-22DF-4089-BCD3-38A1F5318140&displaylang=en
CHOOSING THE RIGHT TECHNOLOGY
You will need to work with your customer to decide which of the two search platforms Microsoft offers for this release of SharePoint is most appropriate for their needs. While ultimately this decision may be driven by an interest in specific functional elements, there are a few key points that you can share.
• Both FAST and SharePoint Search share a unified API when interacting with their respective engines. Using Central Administration as an example, both are managed by similar administrative screens and support many of the key concepts discussed in this document. For instance, the content source definition and configuration screens are identical in both.
• FAST is largely a superset of the functionality offered by SharePoint Search 2010. Some key improvements are:
o Deep results refinement that effectively allows users to drill into a result set
o Improved web indexing support, including the indexing of client-side script
o Inline previews of certain document formats in the result set
• FAST can support an index that exceeds 500 million search items, while SharePoint Search 2010 can support only 100 million (both figures are heavily dependent on the content and the characteristics of that content).
• FAST offers more opportunities to tune such things as relevance.
Some counterpoints might be:
• Operational readiness for FAST may not be what it is for SharePoint. In organizations where earlier versions of SharePoint have already been deployed, the learning curve for moving to SharePoint Search 2010 will be less than it will be for FAST.
• There may be more significant hardware or software costs associated with incorporating FAST into the overall architecture.
• Because of the larger configuration surface exposed by FAST, initial and ongoing tuning might require additional operational overhead (not all of these configuration surfaces are exposed through Central Administration).
An overview of some of these differences is captured in:
http://www.microsoft.com/downloads/details.aspx?familyid=C422D3C7-1443-41E4-B0FEFC402EE4D8C1&displaylang=en
TOPOLOGY PLANNING
Architectural Components
There are four key architectural components that need to be understood prior to pursuing any
topology design work:
Crawler
This component invokes connectors that are capable of communicating with content sources.
Because SharePoint 2010 can crawl different types of content sources (such as SharePoint sites,
other Web sites, file shares, Lotus Notes databases, and data exposed by Business Connectivity
Services), a specific connector is used to communicate with each type of source. The crawler then
uses the connectors to connect to and traverse the content sources, according to crawl rules that an
administrator can define. For example, the crawler uses the file connector to connect to file shares
by using the FILE:// protocol, and then traverses the folder structure in that content source to
retrieve file content and metadata. Similarly, the crawler uses the Web connector to connect to
external Web sites by using the HTTP:// or HTTPS:// protocols, and then traverses the Web pages in
that content source by following hyperlinks to retrieve Web page content and metadata.
Connectors load specific IFilters to read the actual data contained in files. Refer to the Connector
Framework section later in this document for more information about connectors.
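To make the traversal behavior concrete, the sketch below imitates, in a deliberately simplified way, what a file-share connector does: walk the folder structure and emit item-level data for the indexing pipeline. It uses no SharePoint API, the share path is a placeholder, and real connectors additionally read file content through IFilters and capture security descriptors.

import os
from datetime import datetime, timezone

def crawl_file_share(root):
    """Toy illustration of a file-share traversal: walk the folder structure and
    emit the kind of item-level data a connector hands to the indexing pipeline
    (location, size, last modified). Real connectors also read content via IFilters
    and capture security descriptors; none of that is modelled here."""
    for folder, _subfolders, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            stat = os.stat(path)
            yield {
                "url": "file://" + path.replace("\\", "/"),
                "size_bytes": stat.st_size,
                "last_modified": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).isoformat(),
            }

if __name__ == "__main__":
    # Placeholder path; point it at any local folder to see the emitted items.
    for item in crawl_file_share(r"\\fileserver\share"):
        print(item)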
Indexing Engine
This component receives streams of data from the crawler, and determines how to store that
information in a physical, file-based index. For example, the indexer optimizes the storage space
requirements for words that have already been indexed, manages word-breaking and stemming in
certain circumstances, removes noise words, and determines how to store data in specific index
partitions if you have multiple query servers and partitioned indexes. Together with the crawler and
its connectors, the indexing engine meets the business requirements of ensuring that enterprise
data from multiple systems can be indexed. This includes collaborative data stored in SharePoint
sites, files in file shares, and data in custom business solutions, such as CRM databases, ERP
solutions, and so on.
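A toy inverted index illustrates the essential work described above (word breaking, noise-word removal, and term-to-document mapping). This is an illustration of the general technique only; it does not reflect SharePoint's actual storage format, compression, or language-aware word breakers.

import re
from collections import defaultdict

NOISE_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in"}  # illustrative only

def word_break(text):
    """Crude word breaker: lower-case and split on non-alphanumeric characters.
    The real engine applies language-aware word breaking and optional stemming."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def build_inverted_index(documents):
    """Map each term to the set of document IDs containing it, skipping noise words.
    This is the essential shape of a full-text index, minus compression,
    positional data, and index partitioning."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in word_break(text):
            if term not in NOISE_WORDS:
                index[term].add(doc_id)
    return index

if __name__ == "__main__":
    docs = {1: "Search in SharePoint 2010", 2: "Planning the search topology"}
    idx = build_inverted_index(docs)
    print(sorted(idx["search"]))  # -> [1, 2]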
Query Engine
Indexed data that is generated by the indexing engine is propagated to query servers in the
SharePoint farm where it is stored in one or more index files. This process is known as 'continuous
propagation'. That is, while indexed data is being generated or updated during the crawl process,
those changes are propagated to query servers, where they are applied to the index file (or files). In
this way, the data in the indexes on query servers experiences a very short latency. In essence, when
new data has been indexed (or existing data in the index has been updated), those changes will be
applied to the index files on query servers in just a few seconds. A server that is performing the
query server role responds to searches from users by searching its own index files, so it is important
that latency be kept to a minimum. SharePoint 2010 ensures that this is the case automatically. The
query server is responsible for retrieving results from the index in response to a query received via
the query object model. The query server is also responsible for the word-breaking, noise word
removal, and stemming (if stemming is enabled) for the search terms provided by the query object
model.
User Interface and Query Object Model
As mentioned above, searches are formed and issued to query servers by the query object model.
This is typically in response to a user performing a search from the user interface in a SharePoint
site, but it may also be in response to a search from a custom solution (either hosted in or out of
SharePoint 2010). Furthermore, the search might have been issued by custom code, such as from a
workflow, or from a custom navigation component. In any case, the query object model parses the
search terms, and issues the query to a query server in the SharePoint farm. The results of the query
are returned from the query server to the query object model, and the object model provides those
results to the user interface components (or other components that may have issued the query).
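One lightweight way to exercise this query path from outside the farm is the search results RSS feed exposed by the out-of-the-box Search Center. The sketch below assumes that feed is available at /_layouts/srchrss.aspx, uses a placeholder site URL, and omits authentication (typically Windows integrated); treat it as an illustration rather than a reference client.

# Illustrative only: issues a keyword query over HTTP against the out-of-the-box
# search results RSS feed. The site URL is a placeholder, the endpoint name and
# parameter are assumptions based on the default Search Center, and authentication
# is omitted for brevity.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

def search_rss(site_url, keywords):
    query = urllib.parse.urlencode({"k": keywords})
    url = f"{site_url.rstrip('/')}/_layouts/srchrss.aspx?{query}"
    with urllib.request.urlopen(url) as response:  # add an auth handler in practice
        feed = ET.parse(response)
    # Each <item> element in the feed is one search result.
    return [item.findtext("title") for item in feed.iter("item")]

if __name__ == "__main__":
    for title in search_rss("http://intranet.contoso.com", "expense reports"):
        print(title)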
Scalability and Availability
SharePoint 2010 enables you to add multiple instances of each of the crawler, indexing, and query
components. This level of flexibility means that you can scale your SharePoint farms. (Previous
versions of SharePoint Server did not allow you to scale the indexing components).
The aims of the enterprise search features in SharePoint 2010 are to provide sub-second query
latencies for all queries, regardless of the size of your farm, and to remove bottlenecks that were
present in previous versions of SharePoint Server. You can achieve these aims by implementing a
scaled-out architecture. SharePoint 2010 enables you to scale out every logical component in your
search architecture, unlike previous versions.
Componentization and Scaling
You can add multiple indexers to your farm to provide availability and to scale to achieve high
performance for the indexing process. Each indexer can crawl a discrete set of content sources, so
not all indexers need to index the entire corpus. This is a new capability for SharePoint 2010.
Furthermore, indexers no longer store full copies of the index; they simply crawl content sources
and propagate the indexes to query servers.
You can also add multiple query servers to provide availability and to scale to achieve high query
performance, as shown in Figure 9. If you add multiple query servers, you are really implementing
index partitioning; each query server maintains a subset of the entire logical index, and therefore
does not need to query the entire index (which could be a very large file) for every query. The
partitions are maintained automatically by SharePoint 2010, which uses a hash of each crawled
document's ID to determine in which partition a document belongs. The indexed data is then
propagated to the appropriate query server.
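The partition assignment can be pictured with a few lines of code. The hash function below is illustrative only; the hashing that SharePoint 2010 uses internally is not documented here, but the principle is the same: a stable hash of the crawled document's ID, reduced modulo the number of partitions.

import hashlib

def partition_for(document_id, partition_count):
    """Illustrative partition assignment: hash the crawled item's ID and take the
    remainder by the number of index partitions. SharePoint performs an equivalent
    mapping internally so that each query component holds roughly 1/N of the index."""
    digest = hashlib.md5(str(document_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % partition_count

if __name__ == "__main__":
    for doc_id in ("http://intranet/doc1.docx", "http://intranet/doc2.docx", 42):
        print(doc_id, "-> partition", partition_for(doc_id, 4))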
Another new feature is that property databases are also propagated to query servers so that
retrieving managed properties and security descriptors is much more efficient than in Microsoft
Office SharePoint Server 2007.
High Availability and Resiliency
Each search component also fulfills high-availability requirements by supporting mirroring.
Topology Components (SharePoint Search 2010)
Crawl Component
In SharePoint 2010, crawl components process crawls of content sources, propagate the resulting
index files to query components, and add information about the location and crawl schedule of
content sources to their associated crawl databases. Crawl components are associated with a single
Search service application. You can distribute the crawl load by adding crawl components to
different farm servers.
You assign a farm server to participate in crawling by creating a crawl component on that server. If
you want to balance the load of servicing crawls across multiple farm servers, add crawl
components to the farm and associate them with the servers that you want to crawl content.
Crawl Database
In SharePoint 2010, a crawl database contains data related to the location of content sources, crawl
schedules, and other information specific to crawl operations for a specific Search service
application. You can distribute the database load by adding crawl databases to different computers
that are running SQL Server. Crawl databases are associated with crawl components, and can be
dedicated to specific hosts by creating host distribution rules.
Query Component
In SharePoint 2010, query components return search results to the query originator. Each query
component is part of an index partition, which is associated with a specific property database that
contains metadata associated with a specific set of crawled content. You can distribute query load
by adding mirror query components to an index partition and placing them on different farm
servers.
Typically, a given index partition contains one or two query components depending on whether you
want to provide load balancing or failover capabilities to the index partition. You can add more
than two query components to an index partition, but in general, we recommend that in such cases,
you instead create a new index partition.
You assign a server to service queries by creating a query component on that server. If you want to
balance the load of servicing queries across multiple farm servers, add mirror query components to
an index partition and associate them with the servers you want to service queries.
Index Partition
In SharePoint 2010, an index partition is a group of query components. Each query component
holds a subset of the full text index and returns search results to the query originator. Each index
partition is associated with a specific property database that contains metadata associated with a
specific set of crawled content. You can distribute the load of query servicing by adding index
partitions to a Search service application and placing their query components on different farm
servers.
You assign a server to service queries by creating a query component on that server. If you want to
balance the load of servicing queries across multiple farm servers, add query components to an
index partition and associate them with the servers that you want to service queries.
Host Distribution Rule
In Search Server 2010, host distribution rules are used to associate a host with a specific crawl
database. By default, hosts are load-balanced across crawl databases based on space availability.
However, you may want to assign a host to a specific crawl database for availability and
performance optimization.
Planning Objectives
Using all the information that you have been able to collect thus far, you should be nearly prepared
to propose a search architecture that includes the following:
• The number of servers and server roles required to support search
• The number of crawlers required to support performance, isolation, and redundancy requirements
• The number of crawl databases and their association with the crawlers
• The number of query components required to support performance, isolation, and redundancy requirements
• The number and distribution of indexes across those query servers, including partitions
An excellent, actionable set of guidance can be found here:
http://www.microsoft.com/downloads/details.aspx?familyid=5A3CA177-FB9A-4901-97970C384277DB7C&displaylang=en
You should also have sufficient information to describe what could be considered a starting point for content source planning, including:
• A listing of all content repositories that this deployment will need to interact with. Note that in this instance, you should have already divided these repositories into approximations of SharePoint content sources
• A characterization of the type, volume, and size of content within each content source
• Any special considerations related to that content source – such as whether a BCS or other custom connector will need to be developed or procured
• A characterization of the frequency with which content changes within that content source and a description of how your customer needs to account for those changes with crawl schedules
APPENDIX
Capacity Planning
One of the factors that determine the ongoing success of the enterprise search solution is whether
your customers can plan for and specify disk space requirements for full-text catalog files and
search databases. These space requirements are affected by many factors, including the
characteristics and size of the corpus of information being indexed.
Analyzing Enterprise Corpuses
The relationship between a corpus and the disk space requirement for full-text catalog files is very
complex. Although the relationship is generally governed by corpus size, many other factors can
cause considerable variation in this relationship. There is also a complex relationship between a
corpus of information and the disk space requirement for the search database.
Corpus Size
The most apparent (although by no means only) factor that governs the full-text catalog space
requirements is the size of the corpus of information. Therefore, you must help your customers to
determine the corpus size before you calculate disk space requirements.
Customers can attempt to measure the corpus size by adding together the sizes of all of the files and other items to be indexed, such as the disk space used in file shares and the file sizes for all SharePoint 2010-hosted documents and other content. However, because content sizes for
similar files vary by system, this approach may yield misleading data. For example, SharePoint
content database sizes can vary depending on the versioning strategy for documents and other
items. Also, this approach can be unnecessarily time-consuming.
A more typical, and probably more robust and manageable, approach is to estimate corpus size rather than measure it. The steps for estimating corpus size are as follows:
1. Categorize the different content forms, such as files, Web pages, list items, and database items.
2. Multiply the average size of each content form by the number of items in each form, to obtain size estimates for each content form.
3. Add together all of the size estimates.
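The three-step estimate can be captured in a short script. The content forms and figures below are hypothetical placeholders; substitute the customer's own counts and averages.

# Corpus size estimate following the three steps above. The content forms and
# figures are hypothetical placeholders; substitute the customer's own numbers.
content_forms = {
    # form: (item count, average size in MB)
    "Office documents": (1_500_000, 0.6),
    "Web pages":        (  300_000, 0.05),
    "List items":       (2_000_000, 0.01),
    "Database items":   (5_000_000, 0.002),
}

def estimated_corpus_gb(forms):
    total_mb = sum(count * avg_mb for count, avg_mb in forms.values())
    return total_mb / 1024

if __name__ == "__main__":
    print(f"Estimated corpus size: {estimated_corpus_gb(content_forms):,.0f} GB")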
Customers must also estimate corpus growth characteristics, based on past growth patterns, and
gather expected growth characteristics from analysts and systems management staff in the
organization.
Content Characteristics
Although the main governing factor that affects full-text catalog size is the size of the corpus, the
relationship is not a simple one. The following characteristics of the content in the corpus can affect
this relationship in many ways:
• File Formats. These affect the ratio between total file sizes and full-text catalog size. File compression can also affect this ratio. For example, the compressed nature of a more recent Office Word format results in a smaller file size than if the equivalent content is stored in an Office Word 2003 file.
• Content Density. This is the ratio of textual content in files to embedded objects. For example, a PowerPoint presentation with 15 slides of images will have less density than a PowerPoint presentation with 15 slides of text. The former may have a larger file size, but the latter will have a larger index footprint.
• Content Uniqueness. This represents the uniqueness of the content that is being indexed. SharePoint 2010 tokenizes indexed words for efficient storage and lookup; the less unique the words that are being indexed, the lower the ratio between the corpus size and the full-text catalog size. This factor applies both to uniqueness of words within files, and to uniqueness of content between files:
o Uniqueness within files. If a 10 MB file contains technical content about SharePoint 2010, it is likely to have many occurrences of words such as SharePoint, search, Microsoft, enterprise, document, file, server, index, and query. Because of the tokenizing of these common words, the space required to index the file will be smaller than that required to index 10 MB of a novel that has a rich and varied vocabulary.
o Uniqueness of content. The full-text catalog produced by indexing a corpus that consists of many unique documents about various subjects is larger than one produced from a corpus that consists of many copies of similar documents. For example, if an organization stores a copy of terms and conditions in each project site within a site collection, the terms and conditions are likely to be very similar for each project, with perhaps only minor variations on a project-by-project basis. The words within these documents are tokenized by the indexer and result in a smaller full-text catalog than if each file had relatively unique content.
• Diminishing Uniqueness. Because all vocabularies are essentially limited, there is a relationship between total corpus size and the ratio of that size to the full-text catalog space requirements. This is simply a statistical phenomenon: 10 terabytes of data usually contain less unique content, as a proportion of the corpus size, than 1 terabyte of data. To illustrate this point further, as a corpus grows it tends to include more and more occurrences of words that have already been used elsewhere in the corpus, until at some point the corpus contains every word in the organization’s vocabulary and further additions to the corpus do not introduce new words.
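The effect is easy to demonstrate: feed any growing set of documents through the sketch below and the proportion of previously unseen words falls with each addition. The sample documents are trivial placeholders.

# Illustrates diminishing uniqueness: as more text is added, the proportion of
# previously unseen words shrinks. Pass any list of document strings; the sample
# below is a trivial placeholder.
import re

def new_word_ratios(documents):
    seen = set()
    ratios = []
    for text in documents:
        words = re.findall(r"[a-z0-9]+", text.lower())
        if not words:
            continue
        new = [w for w in words if w not in seen]
        seen.update(words)
        ratios.append(len(new) / len(words))
    return ratios

if __name__ == "__main__":
    sample = [
        "sharepoint search indexes enterprise content",
        "enterprise search content is indexed by sharepoint",
        "sharepoint content and enterprise content again",
    ]
    # The ratio of new words typically falls with each additional document.
    print([round(r, 2) for r in new_word_ratios(sample)])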
Content Metadata and Managed Properties
The amount of data stored in indexed and Managed Properties affects the ratio between the size of
the corpus and the sizes of both the full-text catalog and search property store. The three major
factors that affect these sizes are crawled properties, managed properties, and ACLs.
• Crawled Properties. These are the attributes that are discovered and indexed at crawl time,
including attributes from content source systems, such as the last modified date for files in
file shares, and the column data for items in SharePoint lists and libraries. They also include
embedded property values from the property sheets of specific file types, such as Microsoft
Office documents. Crawled property values are stored in the full-text catalog and so can
affect the ratio between the size of the corpus and the size of the full-text catalog file.
• Managed Properties. These represent a virtual mapping between one or more crawled
properties and the values for each item that are stored in the search database. Therefore,
the number of Managed Properties and their mappings to indexed properties can affect the
ratio between the number of files in a corpus and the size of the search database.
• ACLs. These represent the permissions that are stored in the search database for each
secured item. Therefore, the ratio between the number of items in the corpus and the size
of the search database depends on whether the items are secured.
Content Versions
Another factor that affects the ratio between corpus size and full-text catalog size is the versioning
strategy in the farm.
• SharePoint Versioning and Indexing. The SharePoint 2010 indexer only indexes one version
of each item, so it is not possible to index all of the versions of files in a document library,
or all of the versions of items in a list.
• Versioned Corpus and Index Ratios. If a corpus is characterized by many versions of items
in SharePoint lists or libraries, the ratio of the entire corpus (including all item versions) to
the size of the full-text catalog file is higher than if versioning in SharePoint lists and
libraries is disabled. You should draw your customer’s attention to this if the corpus size
measurement is based on content database sizes.
• Content Access Accounts and Versioning. The content access account affects the versioned
content that is being indexed (although it does not affect the ratio between corpus size and
index space requirements). SharePoint technologies can maintain multiple versions of a
page or document and present specific versions to different users based on their roles. For
example, if a user checks out and modifies a published page, and then saves it but does not
check it back in, the next time that she requests the page, she is presented with the
saved version. Anyone else who requests the page is presented with the latest published
version. Then, if the user makes further changes and checks the page back in and submits it
for approval, the next time she requests the page, she is presented with the edited version
that is waiting for approval. And any person who is in the approver’s role is also presented
with that version. However, all other readers are presented with the latest published
version. In the same way, when the indexer requests a page or file for indexing purposes,
SharePoint technologies present the version of the item that is appropriate for the account
that is being used to perform the crawl. Although there is no fixed rule for selecting content
access accounts, it is important to specify an appropriate account for the crawl. In general,
if only approved, published content is indexed, a reader’s account should be used to crawl
SharePoint content. However, for unpublished content, perhaps in a volatile authoring environment, an editor account, approver account, or another administrative account would be appropriate.