SharePoint 2010 - Search Guidance
SharePoint Deployment Planning Services
Prepared for SDPS Delivery Partners
Monday, 8 February 2016
Version 1.0
Prepared by SDPS Partner Organization, SDPS@microsoft.com
Contributors: SDPS Delivery Partner

The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication and is subject to change at any time without notice to you. This document and its contents are provided AS IS without warranty of any kind, should not be interpreted as an offer or commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS DOCUMENT.

The descriptions of other companies’ products in this document, if any, are provided only as a convenience to you. Any such references should not be considered an endorsement or support by Microsoft. Microsoft cannot guarantee their accuracy, and the products may change over time. Also, the descriptions are intended as brief highlights to aid understanding, rather than as thorough coverage. For authoritative descriptions of these products, please consult their respective manufacturers.

This deliverable is provided AS IS without warranty of any kind and MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, OR OTHERWISE.

All trademarks are the property of their respective companies. ©2010 Microsoft Corporation. All rights reserved. Microsoft and Windows are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. The names of actual companies and products mentioned herein may be the trademarks of their respective owners.

Page ii. SharePoint 2010 - Search guidance, SharePoint Deployment Planning Services, Version 1.0. Prepared by SDPS Partner Organization, last modified on 8 Feb. 16.
Table of Contents

Introduction
Purpose or What To Be Tracking Towards
Search Overview
  Common Search Scenarios
What's New in SharePoint 2010 Search?
  New and Improved Capabilities for Information Workers
    New and Improved Query Capabilities
    New and Improved Search Results Capabilities
  New and Improved Capabilities for IT Professionals
Terminology
Planning for Search
  Understanding the End User
  Understanding the Corpus
Choosing the Right Technology
Topology Planning
  Architectural Components
    Crawler
    Indexing Engine
    Query Engine
    User Interface and Query Object Model
  Scalability and Availability
    Componentization and Scaling
    High Availability and Resiliency
  Topology Components (SharePoint Search 2010)
    Crawl Component
    Crawl Database
    Query Component
    Index Partition
    Host Distribution Rule
  Planning Objectives
Appendix
  Capacity Planning
    Analyzing Enterprise Corpuses
    Corpus Size
    Content Characteristics
    Content Metadata and Managed Properties
    Content Versions

INTRODUCTION

This document is a component that should be included in any SharePoint Deployment Planning Services (SDPS) engagement where there is a need to provide users the ability to discover content using search. This guide is intended to supplement the core platform guidance and, in some cases, may contain recommendations that supersede information you find in that document.
PURPOSE OR WHAT TO BE TRACKING TOWARDS

SDPS is a planning offering and, in many cases, should be considered an accelerator for an actual deployment or a launching point for deeper planning. Where a Microsoft-funded SDPS engagement is concerned, you will rarely have sufficient time to fully document a sophisticated solution. As such, this guide describes a minimum set of topics that you should cover with your customers, even if that coverage is in some cases superficial. The objectives are to:

- Understand the overall importance of search to your customer.
- Identify the search technology most appropriate for the customer.
- Document what you can about the content and its characteristics.
- Ultimately, incorporate what you learn back into the logical and physical architecture diagrams you will be providing to your customer.

SEARCH OVERVIEW

Search is perhaps one of the most important aspects of any SharePoint 2010 deployment: it allows users to quickly discover content that is relevant to some need they have. In some instances, a user may know many characteristics of the content they are looking for; perhaps they have seen a specific document before but have forgotten where it is stored. In other situations, a user may not know content specifics beyond a single keyword. Even in the most well-designed and intuitive information architectures, search effectively allows users to spontaneously create a taxonomy of their own that facilitates both the navigation and discovery of relevant information in their SharePoint 2010 deployment.
This document includes planning resources for:

- SharePoint Search 2010
- FAST Search for SharePoint 2010

The following Microsoft TechNet resource centers and blogs are relevant to content in this guide:

- Enterprise Search Resource Center
- Enterprise Search Team Blog

Common Search Scenarios

While this list is by no means exhaustive, here are some reasons why a customer may be interested in search:

- To support the specific needs of a single SharePoint Web application. Only content managed by that Web application is indexed and made available for queries. For small deployments, this is a typical scenario.
- To support the needs of a larger SharePoint 2010 deployment that may include multiple Web applications or tenants. Search can be configured either to isolate the visibility of content to users of a particular Web application or tenant, or to support cross-application queries.
- As a dedicated search deployment to support the needs of an enterprise. For instance, content managed by disparate SharePoint 2010 deployments, file shares, and other content repositories could be aggregated into a single enterprise index, thereby enabling users to discover content without having to know where that content is stored.
- As a search “service farm” that supports some non-SharePoint 2010 application. For example, a customer may have a public-facing Web site built in ASP. A SharePoint search service farm could be used to index the content on that Web site and provide a surface through which the Web site can broker queries.

WHAT'S NEW IN SHAREPOINT 2010 SEARCH?

You can use this section to gain a better understanding of what is new in enterprise search for SharePoint 2010.
New and Improved Capabilities for Information Workers

SharePoint 2010 provides new capabilities for formulating and submitting queries, and for working with search results.

New and Improved Query Capabilities

SharePoint 2010 enables end users to create and run more effective search queries. It also enables users to issue search queries from the desktop in Windows 7. The new query capabilities are:

- Boolean query syntax for free-text queries and for property queries. SharePoint 2010 supports the Boolean operators AND, OR, and NOT in search queries. For example, a user can execute a query such as the following: (“SharePoint Search” OR “Live Search”) AND (title:”Keyword Syntax” OR title:”Query Syntax”)
- Prefix matching for search keywords and document properties. Search queries can use the * character as a wildcard at the end of a text string. For example, the search query "comp*" would find documents that contain "computer", "component", or "competency". Similarly, the query "author:Ad*" would find documents created by "Adam" or "Administrator". Therefore, the query "comp* author:ad*" would find documents that contain "component" and were created by "Adam", as well as documents that contain "computer" and were created by "Administrator".
- Suggestions while typing search queries. As a user types keywords in the Search box, the Search Center provides suggestions to help complete the query. These suggestions are based on past queries from other users.
- Suggestions after users run queries. The Search Center also provides suggestions after a query has been run. These suggestions are also based on past queries from other users, and are distinct from the 'did you mean' feature.
- Connectors for enterprise search in Windows 7. From an Enterprise Search Center, users can easily create a connector for their SharePoint searches in Windows 7.
By typing search queries into the Windows 7 search box, users can find relevant documents from SharePoint and take advantage of Windows features such as file preview and drag-and-drop for documents returned in those search results.

New and Improved Search Results Capabilities

SharePoint 2010 provides many improvements for getting and viewing search results. The new search results capabilities are:

- Results display. The search results page includes a refinement panel, which provides a summary of search results and enables users to browse and understand the results quickly. For example, for a particular search query the summary in the refinement panel might show that there are many Web pages in the search results and many documents by a particular author. A summary might also indicate that there are mostly Microsoft Word® and Microsoft Excel® documents in the top set of results. The refinement panel also enables users to filter results, for example by kind of content (document, spreadsheet, presentation, Web page, and so on), content location (such as SharePoint 2010 sites), content author, or date last modified. A user can also filter by category based on managed properties and enterprise content management (ECM) taxonomy nodes that an administrator configures.
- View in Browser. The View in Browser capability allows users to view most Microsoft Office documents in the browser by using Office Web Applications. Office Web Applications is the online companion to Word, Excel, Microsoft PowerPoint®, and Microsoft OneNote®, and it enables information workers to access documents from anywhere. Users can view, share, and work collaboratively on documents by using personal computers, mobile phones, and Web browsers. Office Web Applications is available to users through Windows Live.
It is also available to business customers with Microsoft Office 2010 volume licensing agreements and document management solutions based on SharePoint 2010.

- People search. People search enables users to find other people in the organization not only by name, but also by many other categories, such as department, job title, projects, expertise, and location. People search improvements include:
  - Improved relevance in people search results. Results relevance for people search is improved, especially for searches on names and expertise.
  - Self search. The effectiveness of people search increases as users add data to their profiles. When a user searches for his or her own name, the search system recognizes this as a “self search” and displays related metadata. The metadata can include information such as the number of times the My Site profile was viewed and the terms that other people typed that returned the user’s name. This can encourage users to add information to their profile pages to help other users when they search. As users update their My Site profiles, other users can find them more easily in subsequent searches. This increases productivity by helping to connect people who have common business interests and responsibilities.
  - Phonetic name matching and nickname matching. Users can search for a person in the organization without knowing the exact spelling of their name. For example, the search query “John Steal” could yield “John Steele” in the search results; results for the search query “Jeff” include names that contain “Geoff.” In addition, nickname matching makes it possible for a search query for “Bill” to yield results that include “William.” NOTE: Phonetic matching applies to the following languages supported by SharePoint 2010: English, Spanish, French, German, Italian,
Korean, Portuguese (Brazil), and Russian.

- Enhancements for relevance of search results. SharePoint 2010 provides improvements to increase the relevance and usefulness of search results, such as the following:
  - Ranking based on click-through history. If a document in a search result set is frequently clicked by users, this indicates that information workers find the document useful. The document is therefore promoted in the ranking of search results.
  - Relevance based on extracted metadata. Document metadata is indexed along with document content. However, information workers do not always update metadata correctly. For example, they often repurpose documents that were created by other people, and may not update the author property. Therefore, the original author's name remains in the property sheet, and is consequently indexed. However, the search system can sometimes determine the author from a phrase in the document, such as "By John Doe". In this case, SharePoint 2010 retains the original author, but also maintains a shadow value of "John Doe". Both values are then treated equally when a user searches for documents by specific authors.

New and Improved Capabilities for IT Professionals

SharePoint 2010 includes new ways for administrators to help provide the most benefit for end users who are searching for information. IT professionals can take advantage of the following new and improved features:

- Improved administrative interface. SharePoint 2010 includes the new search administration pages that were first available for organizations that deployed Microsoft Office SharePoint Server 2007 and then installed the Infrastructure Update for Microsoft Office Servers. This new interface centralizes the location for performing administrative tasks.
With SharePoint 2010, administrators have an interface that provides the following advantages:

  - A single starting point for all farm-wide administration tasks, including search administration. The most common search tasks are highlighted.
  - A central location where farm administrators and search administrators can monitor server status and activity.

- Farm Configuration Wizard. After the Installation Wizard finishes, the Farm Configuration Wizard runs automatically. The Farm Configuration Wizard helps simplify deployment of small farms by providing the option to automate much of the initial configuration process with default settings. For example, when you use the Farm Configuration Wizard to deploy the first application server in a farm, the wizard automatically creates a fully functional search system on that server, including the following:
  - A Search Center from which users can issue queries (if the person installing the product selected this option in the Farm Installation Wizard).
  - A fully functional search topology that can support an index of up to 10 million crawled documents.
  - The ability to crawl SharePoint 2010 sites in the server farm immediately after the Farm Configuration Wizard finishes running.
- Search service administration independent of other shared services. In Office SharePoint Server 2007, the Office SharePoint Server Search service was bundled with other shared services (such as Excel Calculation Services) in the Shared Services Provider (SSP). In that architecture, you could not create a new Search service without creating a new SSP. In contrast, in SharePoint 2010, you can create and manage Search service applications independently of one another and independently of other service applications.
This is because of the new, more granular service architecture of SharePoint 2010.

- Expanded support for automating administrative tasks. You can automate many search administration tasks by using Windows PowerShell™ 2.0 scripts. For example, you can use Windows PowerShell 2.0 scripts to manage content sources and search system topology. Windows PowerShell support is new for SharePoint 2010.
- Increased performance, capacity, and reliability. SharePoint 2010 provides many new ways to configure and optimize a search solution for better performance, capacity, and reliability, as follows:
  - Scalability for increased crawling capability. In Office SharePoint Server 2007, a Shared Services Provider could be configured to use only one indexer. With SharePoint 2010, you can scale the number of crawl components by adding servers to your farm and configuring them as crawlers. This enables you to do the following:
    - Increase crawl frequency and volume, which helps the search system to provide more comprehensive and up-to-date results.
    - Increase performance by distributing the crawl load.
    - Provide redundancy if a particular server fails.
  - Scalability for increased throughput and reduced latency. You can increase the number of query components to do the following:
    - Increase query throughput; that is, increase the number of queries that the search system can handle at a time.
    - Reduce query latency; that is, reduce the amount of time it takes to retrieve search results. One of the general aims of enterprise search with SharePoint 2010 is to achieve sub-second query latencies for all searches. To achieve this, you must ensure that no query server deals with more than ten million items; you can do so by adding multiple query servers to your farm and taking advantage of the new index partitioning features of SharePoint 2010. Office SharePoint Server 2007 did not support the concept of index partitioning.
    - Provide failover capability for query components.
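The index-partitioning idea described above (keeping each query server under ten million items and fanning a query out to every partition) can be sketched conceptually. This is an illustrative sketch only, not SharePoint's implementation; all names here (`PartitionedIndex`, `index_document`, `query`) are hypothetical.

```python
# Conceptual sketch of a partitioned index with query fan-out.
# Hypothetical names; this is NOT the SharePoint 2010 implementation.

from collections import defaultdict

MAX_ITEMS_PER_PARTITION = 10_000_000  # design target per query server

class PartitionedIndex:
    def __init__(self, partition_count):
        self.partition_count = partition_count
        # Each partition maps a term to the set of document IDs containing it.
        self.partitions = [defaultdict(set) for _ in range(partition_count)]

    def _partition_for(self, doc_id):
        # Hash-distribute documents so each partition holds a roughly
        # even subset of the corpus (and stays under its item cap).
        return hash(doc_id) % self.partition_count

    def index_document(self, doc_id, terms):
        partition = self.partitions[self._partition_for(doc_id)]
        for term in terms:
            partition[term].add(doc_id)

    def query(self, term):
        # Fan the query out to every partition and merge the results,
        # since each partition holds only a subset of the corpus.
        results = set()
        for partition in self.partitions:
            results |= partition.get(term, set())
        return results

index = PartitionedIndex(partition_count=2)
index.index_document("doc-a", ["sharepoint", "search"])
index.index_document("doc-b", ["search", "topology"])
index.index_document("doc-c", ["sharepoint"])

print(sorted(index.query("search")))      # doc-a and doc-b
print(sorted(index.query("sharepoint")))  # doc-a and doc-c
```

The sketch shows why adding query components increases throughput: each partition answers against a smaller index, and a coordinator merges the partial result sets.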
  - Topology management during normal operations. You can tune the existing search topology during regular farm operations while search functionality remains available to users. For example, you can deploy additional index partitions and query components to accommodate changing conditions.

- Operations management. SharePoint 2010 provides new capabilities for monitoring farm operations and customizing reports for enterprise search. Specifically, administrators can review status information and topology information in the search administration pages of the Central Administration Web site. They can also review crawl logs and health reports, and can use System Center Operations Manager to monitor and troubleshoot the search system.
- Health and performance monitoring. Health and performance monitoring features enable an administrator to monitor search operations in the farm. This can be especially helpful for monitoring crawl status and query performance. SharePoint 2010 includes a health analysis tool that you can use to check for potential configuration, performance, and usage problems automatically. Search administrators can configure specific health reporting jobs to do the following:
  - Run on a predefined schedule.
  - Alert an administrator when problems are found.
  - Produce reports that can be used for performance monitoring, capacity planning, and troubleshooting.
- Search Analytics Reports. SharePoint 2010 provides new reports that help you to analyze search system operations and tune the search system to provide the best results for search queries. For example, reports can include information about what terms are used most frequently in queries or how many queries are issued during certain time periods.
Information about peak query times can help you decide about server farm topology and about the best times to crawl.

- Searches of diverse content by crawling. SharePoint 2010 can search content in repositories other than SharePoint sites by crawling or federating. For example, the search system can crawl content in repositories such as file shares, Exchange public folders, and Lotus Notes by using connectors included with SharePoint 2010. Additional connectors for crawling databases and third-party application data are created easily by using the Business Connectivity Services connector framework. Support for creating connectors by using SharePoint® Designer 2010 or Microsoft Visual Studio® 2010 enables quicker and easier development compared to protocol handlers for Office SharePoint Server 2007.
- Searches of diverse content by federation. SharePoint 2010 search results can include content from other search engines. For example, an administrator might federate search results from www.bing.com or from a geographically distributed internal location.

TERMINOLOGY

It is important that you have a solid understanding of the terms and definitions used throughout this document.

Best Bet: Best Bets are URLs to documents that are associated with one or more keywords. Typically these documents or sites are ones that you expect users will want to see at the top of the search results list. Best Bets are returned by queries that include the associated keywords, regardless of whether the URL has been indexed. Site collection administrators can create keywords and associate Best Bets with them.

Connector: Connectors are components that communicate with specific types of systems, and are used by the crawler to connect to and retrieve content to be indexed.
Connectors communicate with the systems being indexed by using appropriate protocols. For example, the connector used to index shared folders communicates by using the FILE:// protocol, whereas connectors used to index Web sites use the HTTP:// or HTTPS:// protocols.

Content Source: Content sources are definitions of systems that will be crawled and indexed. For example, administrators can create content sources to represent shared network folders, SharePoint 2010 sites, other Web sites, Exchange public folders, third-party applications, databases, and so on.

Crawl Rule: Crawl rules specify how crawlers retrieve content to be indexed from content sources. For example, a crawl rule might specify that specific file types are to be excluded from a crawl, or that a specific user account is to be used to crawl a given range of URLs.

Crawl Schedule: Crawl schedules specify the frequency and dates/times for crawling content sources. Administrators create crawl schedules so that they do not have to start all crawl processes manually.

Crawled Property: Crawled properties represent the metadata for content that is indexed. Typically, crawled properties include column data for SharePoint list items, document properties for Microsoft Office or other binary file types, and HTML metadata in Web pages. Administrators map crawled properties to managed properties in order to provide useful search experiences. See Managed Property for more details.

Crawler: The crawler is the component that uses connectors to retrieve content from content sources.

Crawler Impact Rule: A crawler impact rule governs the load that the crawler places on source systems when it crawls the content in those source systems.
For example, one crawler impact rule might specify that a specific content source that is not used heavily by information workers should be crawled by requesting 64 documents simultaneously, while another might specify less aggressive crawl characteristics for systems that are constantly in use by information workers.

Federation: Federation is the concept of retrieving search results from multiple search providers, based on a single query performed by an information worker. For example, your organization might include federation with Bing.com, so that results are returned by both SharePoint 2010 and Bing.com for a given query.

IFilter: IFilters are used by connectors to read the content in specific file types. For example, the Word IFilter is used to read Word documents, while a PDF IFilter is used to read PDF files.

Index: An index is a physical file that contains indexed content, and which is used by query servers to satisfy a query.

Indexer: Indexers manage the content to be included in an index, and propagate that content to query servers, where it is stored in index files.

Indexing Engine: See Indexer.

Index Partition: See Partitioned Index.

Managed Property: Administrators create managed properties by mapping them to one or more crawled properties. For example, an administrator might create a managed property named Client that maps to various crawled properties called Customer, Client, and Cust from different content sources. Managed properties can then be used across enterprise search solutions, such as in defining search scopes and in applying query filters.

OpenSearch: OpenSearch is an industry standard that enables compliant search engines to be used in federated scenarios. See Federation for more details.
Partitioned Index
SharePoint 2010 includes a new concept that enables administrators to spread the query load across multiple query servers. This is achieved by creating subsets of an index, known as partitions, and propagating individual subsets to different query servers. At query time, the query object model contacts each query server that can satisfy the search, so that all results to be returned to the user are included.

Properties Database
Managed properties and security descriptors for search results are not stored in the physical index files. Instead, they are stored in an efficient database that is propagated to query servers. Query servers typically satisfy a query by retrieving information from both the index file and the properties database.

Query Object Model
The query object model is responsible for accepting inputs from search user interfaces and for issuing appropriate queries to query servers. The search Web Parts provided by SharePoint 2010 use the query object model to run queries. Developers can also create custom user interfaces and solutions that run queries by using the query object model.

Query Server
Query servers retrieve data from index files and property databases to satisfy queries.

Ranking
Ranking defines the sort order in which results are returned from queries. Typically, results are sorted in order of descending relevance, so that the most relevant documents are presented near the top of the results page. However, information workers might choose to apply a different sort order, such as by date modified.

Relevance
Relevance describes how well a given search satisfies a user's information needs. Relevance includes which documents are returned in the results (document recall)
and the order of those documents in the results (ranking).

Search Center
A Search Center is a site based on the Search Center site template, which provides a focused user interface that enables information workers to run queries and work with search results.

Search Document
See Search Item.

Search Item
A search item represents a document, list item, file, Web page, Exchange public folder post, or database row that has been indexed. Search items are sometimes referred to as search documents, but the key point is that these items are returned by search queries.

Stemming
Words in each language can have multiple forms that essentially mean the same thing. For example, the verb 'to write' includes forms such as writing, wrote, write, and writes. Similarly, nouns normally include singular and plural versions, such as book and books. The stemming feature in enterprise search can increase recall of relevant documents by mapping one form of a word to its variants.

Stop Word
Stop words (sometimes known as noise words) are words for which there is no value in indexing. Some stop words are part of the language (such as 'a', 'and', and 'the'). There is no value in indexing these words because they are likely to be contained in a high percentage of indexed items. Furthermore, information workers rarely search for just these types of terms.

Synonym
Synonyms are words that mean the same thing as other words. For example, you might consider laptop and notebook to mean the same thing. Administrators can create synonyms for keywords that information workers are likely to search for in their organization. Additionally, synonyms that can be used to improve recall of relevant documents are stored in thesaurus files.

Word Breaker
Streams of text are retrieved from content sources, and those streams are broken down into discrete words for indexing. Word breakers are the components that break streams down into individual words.
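Taken together, word breaking, stop word removal, stemming, and synonym expansion form a query-normalization pipeline. The sketch below models that pipeline in Python; the word lists and the trivial plural-stripping stemmer are illustrative assumptions, not SharePoint's actual linguistic components.

```python
import re

STOP_WORDS = {"a", "and", "the"}          # toy stop word list
SYNONYMS = {"laptop": ["notebook"]}       # toy thesaurus entry

def stem(word):
    # Toy stemmer: strip a plural 's' (real stemmers are language-aware).
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def normalize_query(text):
    terms = re.findall(r"[a-z0-9]+", text.lower())     # word breaking
    terms = [t for t in terms if t not in STOP_WORDS]  # stop word removal
    expanded = []
    for t in terms:
        t = stem(t)                                    # stemming
        expanded.append(t)
        expanded.extend(SYNONYMS.get(t, []))           # synonym expansion
    return expanded
```

For example, a query for "the laptops and books" would be reduced to the terms laptop, notebook, and book before the index is consulted.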
Streams to be indexed are normally broken down by identifying spaces, punctuation marks, and the particular rules of each language. Also, when a user enters multiple words into a search box, that query is broken into discrete terms by a word breaker.

PLANNING FOR SEARCH

Understanding the End User

The end user is ultimately the most important factor to consider when you deploy any application. Your search solution is no different: you will need to consider not only users' wants and needs, but also who these individuals are and their relationship to the system. Common questions are:

Question: Where are users in relation to the system?
Impact: Particularly in global or Internet-facing deployments, the user base may be connecting from a variety of different locations, each subject to the unique characteristics of the network that connects them to the system. In some instances, you may need to factor in performance expectations, as described in the next section. However, search can also benefit remote users, because it might allow them to find content more quickly, even when they know exactly what they need, with less interaction with the system. For instance, a user might be looking for a document that would take 10 clicks to reach using site navigation. If, because of poor network connectivity to the system, it takes 10 seconds for each page request to be fulfilled, the total time to destination is significantly greater than if the user can enter a succinct search term on the home page and within ten seconds have a result set that includes a direct link to the desired content. From a planning perspective, make sure that you provide sufficient support for users getting relevant results.

Question: What are their performance expectations of the system?
Impact: Commonly, end user performance expectations relate to the amount of time from the execution of a query to the moment the system presents results. This may be covered in a service level agreement, or it may be subjective, based on experience with other search systems. Factors that influence the perception of the overall speed of the system include adequate capacity planning, the relative location of the system to the end user, and even what additional operations the system may have to perform before it can return a result set. When planning a search deployment, you should work to quantify the performance expectations and attempt to honor those expectations during your capacity planning. In addition, there may be factors that you cannot manage within the environment itself but that can be mitigated through alternate approaches. For instance, a remote user may have a poor network connection to a corporate deployment and would benefit from interacting with a regional system. This regional system could federate results from the corporate deployment, or could itself index all or some of the same content indexed in the head office environment.

Question: Are end users already familiar with search in Office SharePoint Server 2007?
Impact: If users are familiar with Office SharePoint Server 2007 search, they may be very familiar with how queries are accepted and qualified (for instance, using scopes). If they have been using earlier versions of SharePoint, some improvements will be obvious in SharePoint 2010. However, while some things may look the same, there are substantial improvements in what a user can enter in the search box, such as the new support for wildcards and Boolean operators.
From a planning perspective, it is important to stress end user training so that end users are aware of the new capabilities.

Question: Is there a search paradigm that users may already be familiar with?
Impact: One possible motivator for the deployment may be to replace an existing system. Regardless of whether SharePoint 2010 improves on every capability offered by the previous system, users' reception of SharePoint 2010 may be biased by the absence of one or more features that they were accustomed to having. For example, a government agency may have had an advanced search form that enabled users to quickly characterize the type of content they were looking for using checkboxes. Users at a new deployment may only be familiar with public-facing search engines such as Bing or Google. An even more subtle example is a situation where users expect two search terms to be processed with a specific join term ("dog house" would be processed as "dog AND house") or even treated as an explicit phrase. From a planning perspective, end user training may reduce the learning curve and also deflect initial negative perceptions about such things as relevance. For more sophisticated search interfaces that training cannot accommodate, the rich SharePoint 2010 search object model, coupled with a much more extensible set of out-of-box search Web Parts, may be leveraged to support these needs.

Question: Are end users going to be issuing queries in a language other than the default system language?
Impact: While language packs are discussed in the core platform guidance, you must consider the impact that languages have on search-related configurations as well. For example, localization may need to be considered for noise words, synonyms, best bets, and word breakers.

Question: Are there any expectations that one set of content will be kept separate from another set of content?
Impact: End users may be confused, distracted, or annoyed by having their queries return results across a diverse set of content.
For example, a user may only want to see Finance documents, and not Human Resources results, when executing a query. Depending on the underlying requirement, this may necessitate planning for separate content sources, search scopes, search applications, or even separate search farms.

Question: Are there any expectations around search availability?
Impact: This would typically be captured in a service level agreement (SLA). Essentially, you want to understand the true importance of search to the organization as it relates to this deployment. For example, on a public-facing Internet site, search may be the primary vector by which end users get to the content they need, whereas in a small deployment a search outage might be more tolerable. The greater the need for availability, the greater the level of redundancy you should build into your design.

Understanding the Corpus

The corpus is the entire volume of content that the customer wishes to have the deployment crawl and make available for query fulfillment. When you discuss this with the customer, it is generally a good idea to capture the information you are collecting in a diagram. This diagram should initially include a placeholder for the farm design and also capture who will be interacting with the system, ideally segmenting this pool of users into separate objects, where each object represents a group of users with a common set of needs or characteristics. As you gain a better grasp of the content exposed through search, you may need to consolidate or break up the users into different groups.

Question: Where and what are the repositories that the solution should be indexing?
Impact: Generally, the answer you get will map to SharePoint content sources, but at this point you are most concerned with the various systems that SharePoint 2010 will need to interact with in order to gain access to that content. This has broad impact on your search planning, as it may indicate a need for specific content connectors to communicate with those repositories. The repository types supported out of the box by SharePoint 2010 are SharePoint Products and Technologies sites, Web sites, Microsoft Exchange public folders, and file shares. Databases may also be crawled using Business Connectivity Services (BCS); however, this will necessitate the design or development of BCS connectors. Microsoft also provides a connector for Lotus Notes. Some third-party companies provide connectors for other repositories, and the rich API set provided with SharePoint 2010 permits the development of custom connectors.

Question: For each repository, how is content secured, from both an authentication perspective and an authorization one?
Impact: Some content repositories require SharePoint 2010 to first authenticate against the repository before gaining access to the content that it manages. This may necessitate having a specially privileged account solely for this purpose. Where content within a repository has, or can have, unique access requirements, the crawl account must have sufficient permission to read that content in order for it to be included in the index. Depending on the type of repository SharePoint 2010 connects to, the repository may be unable to return any authorization restrictions with the content. For example, a customer may want a particular non-SharePoint Web site crawled. Although the site requires users to authenticate, and SharePoint 2010 can honor this requirement, no security information is returned when SharePoint 2010 crawls that content.
Consequently, all users of the SharePoint 2010 environment are able to see results that pertain to that secure system, regardless of whether they themselves have access to it. There are a number of strategies that you can pursue to prevent this from happening, if the customer believes it is an issue. One tactic is to index the Web content using the BCS. This depends on the site being database driven, the data within the database lending itself to description through the BCS, and the database being visible to the crawler. An additional layer of security can then be applied to the BCS, either directly aligning with the authorization scheme associated with the Web application or through the application of broader access rules around the BCS. In other instances, while the content connector may be able to return authorization information along with the content, it may be necessary to map credentials from the foreign system into a format appropriate to SharePoint 2010; this is the case with the Lotus Notes connector. Finally, SharePoint 2010 still supports the concept of custom security trimmers which, at query time, can determine (typically by evaluating the user's SharePoint 2010 credentials against the repository) whether a result should be included or not.

Question: What is the volume of content stored in the repository?
Impact: The answer you get should be normalized in units of search items. For SharePoint 2010 repositories, this count will include user profiles, list items, documents, and pages. For a database, this is the number of rows that are to be indexed. For Web sites, this is the number of unique Web pages. Be aware that a customer may have a Web site with a parameter-driven page whose rendering is driven by a given query string parameter.
If, by altering that parameter, 4000 unique pages are delivered, the search item count for this site would be 4000, not just one physical page. From a planning perspective, this volume will impact the ability of SharePoint 2010 to crawl that content. Using this example again, a full crawl will require the crawler to make 4000 HTTP requests against the Web content source. Other content connectors, such as the BCS, may be able to crawl multiple items in a single request, but more commonly, the more search items there are, the more outbound requests the crawler must make. This is one of the primary influences on crawl performance, and it can be countered by introducing multiple crawlers targeting the same content source.

Question: Is the content repository ready for you to crawl it?
Impact: In many cases, ownership of secondary repositories may be different from ownership of the system you are designing. It is important to get approval from the owner of the secondary repository so that you understand how your crawl activity may impact that system, particularly if that repository is not sized to service these types of requests. If this type of load is of concern, defining less impactful crawl schedules, or even throttling the crawling of particular content repositories, may be part of the answer. For those instances where ownership is one and the same, particularly for deployments where search is indexing content managed by the same SharePoint 2010 instance that hosts it, you will still need to be cognizant of the impact that crawling has on other activities on the system. Again, crawl schedules might be part of the solution, while another option may be to set up a dedicated Web front-end (WFE) server that is not included in the end user rotation, so that only the crawler uses it when indexing the deployment.

Question: How frequently does the content in this repository change?
Impact: Most repositories are never entirely static; new content is added, existing content is updated, and old content is retired.
This information is one of the inputs that you may use to determine the crawl schedule associated with a particular repository. If the change frequency within a repository varies depending on the datasets in the repository, you may want to consider a strategy for dividing the repository into multiple content sources, enabling you to set different crawl schedules and thus maximize the freshness of the overall index.

Question: Are there any predictors for growth?
Impact: The number of searchable items in the repository today is important, but it is just as important to understand what is expected to be in that repository a year or more from now. A customer that adds 100,000 SKUs to a product database each month will impact how you plan now for future capacity.

Question: What level of parity should the index have as it relates to the content in the repository?
Impact: A full crawl always requests everything managed by the system. This volume of requests impacts crawler performance, and the system being indexed incurs load during a crawl. You also need to consider the amount of time that it takes for a full crawl to complete; for some content repositories, this may be measured in days. Incremental crawls target only content that has changed, but the crawler is dependent on the system being crawled to provide this information. If it cannot (as is the case with Web sites), an incremental crawl is, effectively, treated as a full crawl. It is largely an ongoing effort to develop a crawl schedule that attends to end user expectations on content freshness, takes into consideration the frequency of change within the repository, and is also sensitive to the resource demands placed on both the system crawling and the system being crawled.
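The full-versus-incremental trade-off above can be made concrete with a rough, illustrative calculation of how many items a crawl must request; the function and the figures in the assertions are assumptions made for the sake of the example, not SharePoint measurements.

```python
def items_requested(total_items, changed_items, crawl_type, reports_changes):
    """Estimate how many items a crawl must request.

    A full crawl always requests everything. An incremental crawl requests
    only changed items, but only if the repository can report its changes;
    otherwise (for example, a plain Web site) it degrades to a full crawl.
    """
    if crawl_type == "full" or not reports_changes:
        return total_items
    return changed_items
```

For a 100,000-item repository where 2,000 items change between crawls, an incremental crawl against a change-aware repository requests 2,000 items, while the same crawl against a plain Web site still requests all 100,000.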
You might benefit from content source segmentation, allowing volatile areas of content to be indexed more frequently than others. Choosing off-peak crawl schedules, scaling crawlers to enable parallel indexing, and relying more on incremental crawls are other possible approaches.

Question: Does the repository include content that should not be crawled?
Impact: This could be directories, certain types of files, or even certain named files. SharePoint 2010 offers improved support for defining crawl rules using regular expressions, making it possible to define patterns for content that should not be included in the index.

Question: What types of content are stored in the repository?
Impact: Here you are primarily concerned with the data structures in which each of the search items in the repository is contained. For a database, this is likely a row of text data. For a Web site, it might be a Web page or a hyperlinked document. For SharePoint 2010, this could be user profiles, a rich assortment of document types, or list items. While it is rarely possible to be exact, you should strive to approximate how the volume of content captured earlier is distributed across the various data structures. Knowledge of the types will assist you in several ways: While SharePoint 2010 includes a number of IFilters out of the box, others may be required to truly gain access to content stored within documents. The amount of processing time associated with crawling one file type versus another may be significant, even if only by milliseconds; ultimately, we are concerned about the aggregate impact. Different file types typically have different index densities; by this we mean that more actual content may make its way into the index from a small Microsoft Word file than from a large JPEG file. Some file types expose metadata along with their content.
For example, a Microsoft Word file may have properties on it that supplement the content contained within the actual file; keywords, customer name, and subtitle are a few examples. A Web page may expose HTML META tags that are surfaced by an IFilter as properties. Depending on how you choose to respond to these properties, they may end up in the property store, increasing its overall size. SharePoint 2010 may need additional configuration to crawl specific file extensions, even if an existing IFilter could crawl the content.

Question: Does content in one repository need to be kept separate from another?
Impact: There may be regulatory or security motivators for doing this. As with the similar topic in the end user section, there can be a number of solutions to this problem. If the motivation is to prevent users from seeing content that they should not be able to see, then, assuming the content is already secured and the crawler can consume the permissions on that content, isolation by way of security trimming should already occur. It is also possible to logically isolate content by creating multiple content sources that target different segments of a repository, or by redefining the default search scope or defining new scopes that, at query time, restrict the query to only a portion of the corpus. If the customer demands more physical isolation, separate crawl and query applications (along with their required databases) could be set up to ensure that content is truly managed separately. If the demands are even more extreme, it may necessitate a separate farm to honor the isolation requirement.

Question: For each type of content within a repository, what is the average size of that content?
Impact: This is often an estimate, but it may influence some of your recommendations on capacity planning.
For files, every crawl operation will demand moving the file across the network to the crawler. For database search items, it is rows of data that move across the network. You must consider network saturation, but you will also need to remember that larger files can result in a larger index signature than smaller files, and files that contain more data require more processing time to extract that data.

Question: Is content within the repository augmented by additional metadata?
Impact: Some content connectors surface little metadata, while others can surface a great deal. For example, the file share connector will only surface such things as where the file resides and security information around the file. With the SharePoint 2010 content connector, however, a document may be heavily decorated with system metadata, and may also be supplemented with data corresponding to custom columns in a document library, or even data resulting from the document's association with a given content type. This metadata, depending on configuration, can be saved in the property store. The more properties on a particular piece of content, the larger its signature within the property store. This is heavily dependent on the customer implementation, but you must consider it when sizing the property database.

Additional planning details are described on the Search Environment Planning diagram available at: http://www.microsoft.com/downloads/details.aspx?familyid=5655EACA-22DF-4089-BCD338A1F5318140&displaylang=en

CHOOSING THE RIGHT TECHNOLOGY

You will need to work with your customer to decide which of the two search platforms that Microsoft offers for this release of SharePoint is most appropriate for their needs.
While ultimately this is a decision that may be driven by an interest in specific functional elements, there are a few key points that you can share:

- Both FAST and SharePoint Search share a unified API for interacting with their respective engines. In Central Administration, both are managed by similar administrative screens and support many of the key concepts discussed in this document; for instance, the content source definition and configuration screens are identical in both.
- FAST is largely a superset of the functionality offered by SharePoint Search 2010. Some key improvements are:
  o Deep results refinement that effectively allows users to drill into a result set
  o Improved Web indexing support, including the indexing of client-side script
  o Previews of certain document formats inline in the result set
- FAST can support an index that exceeds 500 million search items, while SharePoint Search 2010 can support only 100 million (both figures are heavily dependent on the content and its characterization).
- FAST offers more opportunities to tune such things as relevance.

Some counterpoints might be:

- Operational readiness for FAST may not be what it is for SharePoint. In organizations where earlier versions of SharePoint have already been deployed, the learning curve for moving to SharePoint Search 2010 will be less than for FAST.
- There may be more significant hardware or software costs associated with incorporating FAST into the overall architecture.
- Because of the larger configuration surface exposed by FAST, initial and ongoing tuning might require additional operational overhead (not all of these configuration surfaces are exposed through Central Administration).
An overview of some of these differences is captured in: http://www.microsoft.com/downloads/details.aspx?familyid=C422D3C7-1443-41E4-B0FEFC402EE4D8C1&displaylang=en

TOPOLOGY PLANNING

Architectural Components

There are four key architectural components that need to be understood before pursuing any topology design work:

Crawler
This component invokes connectors that are capable of communicating with content sources. Because SharePoint 2010 can crawl different types of content sources (such as SharePoint sites, other Web sites, file shares, Lotus Notes databases, and data exposed by Business Connectivity Services), a specific connector is used to communicate with each type of source. The crawler uses the connectors to connect to and traverse the content sources, according to crawl rules that an administrator can define. For example, the crawler uses the file connector to connect to file shares by using the FILE:// protocol, and then traverses the folder structure in that content source to retrieve file content and metadata. Similarly, the crawler uses the Web connector to connect to external Web sites by using the HTTP:// or HTTPS:// protocols, and then traverses the Web pages in that content source by following hyperlinks to retrieve Web page content and metadata. Connectors load specific IFilters to read the actual data contained in files. Refer to the Connector Framework section later in this document for more information about connectors.

Indexing Engine
This component receives streams of data from the crawler and determines how to store that information in a physical, file-based index.
For example, the indexer optimizes the storage space requirements for words that have already been indexed, manages word breaking and stemming in certain circumstances, removes noise words, and determines how to store data in specific index partitions if you have multiple query servers and partitioned indexes. Together with the crawler and its connectors, the indexing engine meets the business requirement of ensuring that enterprise data from multiple systems can be indexed. This includes collaborative data stored in SharePoint sites, files in file shares, and data in custom business solutions, such as CRM databases, ERP solutions, and so on.

Query Engine
Indexed data that is generated by the indexing engine is propagated to query servers in the SharePoint farm, where it is stored in one or more index files. This process is known as 'continuous propagation': while indexed data is being generated or updated during the crawl process, those changes are propagated to query servers, where they are applied to the index file (or files). In this way, the data in the indexes on query servers experiences very short latency. In essence, when new data has been indexed (or existing data in the index has been updated), those changes are applied to the index files on query servers within just a few seconds. A server that performs the query server role responds to searches from users by searching its own index files, so it is important that latency be kept to a minimum; SharePoint 2010 ensures that this is the case automatically. The query server is responsible for retrieving results from the index in response to a query received via the query object model. The query server is also responsible for the word breaking, noise word removal, and stemming (if stemming is enabled) of the search terms provided by the query object model.
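A query server's two lookups (the index for matching items, and, as noted in the Terminology section, the properties database for their metadata) can be sketched as follows. The inverted index and property store here are toy in-memory stand-ins with invented data, not SharePoint's on-disk structures.

```python
# Toy stand-ins for a query server's data: an inverted index mapping
# terms to document IDs, and a property store mapping IDs to metadata.
INVERTED_INDEX = {
    "budget": {1, 3},
    "report": {1, 2},
}
PROPERTY_STORE = {
    1: {"title": "Budget Report", "author": "Finance"},
    2: {"title": "Status Report", "author": "PMO"},
    3: {"title": "Budget Plan", "author": "Finance"},
}

def query(terms):
    """Intersect postings for all terms, then join in stored properties."""
    doc_ids = None
    for term in terms:
        postings = INVERTED_INDEX.get(term, set())
        doc_ids = postings if doc_ids is None else doc_ids & postings
    return [PROPERTY_STORE[d] for d in sorted(doc_ids or set())]
```

The split mirrors the real design: the index answers "which documents match?", while the property store answers "what should be displayed about them?".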
User Interface and Query Object Model
As mentioned above, searches are formed and issued to query servers by the query object model. This is typically in response to a user performing a search from the user interface in a SharePoint site, but it may also be in response to a search from a custom solution (hosted either in or outside of SharePoint 2010). Furthermore, the search might have been issued by custom code, such as from a workflow or a custom navigation component. In any case, the query object model parses the search terms and issues the query to a query server in the SharePoint farm. The results of the query are returned from the query server to the query object model, and the object model provides those results to the user interface components (or other components that may have issued the query).

Scalability and Availability

SharePoint 2010 enables you to add multiple instances of each of the crawler, indexing, and query components. This level of flexibility means that you can scale your SharePoint farms. (Previous versions of SharePoint Server did not allow you to scale the indexing components.) The aims of the enterprise search features in SharePoint 2010 are to provide sub-second query latencies for all queries, regardless of the size of your farm, and to remove bottlenecks that were present in previous versions of SharePoint Server. You can achieve these aims by implementing a scaled-out architecture. Unlike previous versions, SharePoint 2010 enables you to scale out every logical component in your search architecture.

Componentization and Scaling

You can add multiple indexers to your farm to provide availability and to scale to achieve high performance for the indexing process.
Each indexer can crawl a discrete set of content sources, so not all indexers need to index the entire corpus. This is a new capability in SharePoint 2010. Furthermore, indexers no longer store full copies of the index; they simply crawl content sources and propagate the indexes to query servers. You can also add multiple query servers to provide availability and to scale to achieve high query performance, as shown in Figure 9. If you add multiple query servers, you are really implementing index partitioning; each query server maintains a subset of the entire logical index, and therefore does not need to query the entire index (which could be a very large file) for every query. The partitions are maintained automatically by SharePoint 2010, which uses a hash of each crawled document's ID to determine the partition to which a document belongs. The indexed data is then propagated to the appropriate query server. Another new feature is that property databases are also propagated to query servers, so retrieving managed properties and security descriptors is much more efficient than in Microsoft Office SharePoint Server 2007.

High Availability and Resiliency

Each search component also fulfills high-availability requirements by supporting mirroring.

Topology Components (SharePoint Search 2010)

Crawl Component

In SharePoint 2010, crawl components process crawls of content sources, propagate the resulting index files to query components, and add information about the location and crawl schedule of content sources to their associated crawl databases. Crawl components are associated with a single Search service application. You can distribute the crawl load by adding crawl components to different farm servers.
You assign a farm server to participate in crawling by creating a crawl component on that server. If you want to balance the load of servicing crawls across multiple farm servers, add crawl components to the farm and associate them with the servers that you want to crawl content.

Crawl Database

In SharePoint 2010, a crawl database contains data related to the location of content sources, crawl schedules, and other information specific to crawl operations for a specific Search service application. You can distribute the database load by adding crawl databases to different computers that are running SQL Server. Crawl databases are associated with crawl components, and can be dedicated to specific hosts by creating host distribution rules.

Query Component

In SharePoint 2010, query components return search results to the query originator. Each query component is part of an index partition, which is associated with a specific property database that contains metadata associated with a specific set of crawled content. You can distribute query load by adding mirror query components to an index partition and placing them on different farm servers. Typically, a given index partition contains one or two query components, depending on whether you want to provide load balancing or failover capabilities to the index partition. You can add more than two query components to an index partition, but in general we recommend that in such cases you instead create a new index partition. You assign a server to service queries by creating a query component on that server. If you want to balance the load of servicing queries across multiple farm servers, add mirror query components to an index partition and associate them with the servers that you want to service queries.

Index Partition

In SharePoint 2010, an index partition is a group of query components. Each query component holds a subset of the full-text index and returns search results to the query originator.
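The hash-based mapping of crawled documents to index partitions, described earlier, can be illustrated with a minimal sketch. The hash function and document ID format below are illustrative assumptions; SharePoint's actual implementation is internal.

```python
import hashlib

# Minimal sketch of hash-based partition assignment, assuming a string
# document ID; SharePoint's real hash function and ID format are internal.
def partition_for(doc_id: str, partition_count: int) -> int:
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    # Use the first 4 bytes of the digest as an integer, modulo the
    # number of partitions, to pick a partition deterministically.
    return int.from_bytes(digest[:4], "big") % partition_count

# Every document maps to exactly one partition, so each query component
# only ever has to search its own subset of the logical index.
docs = ["doc-001", "doc-002", "doc-003", "doc-004"]
partitions = {d: partition_for(d, partition_count=2) for d in docs}
```

Because the assignment is a pure function of the document ID, a re-crawled document always lands in the same partition, which is what allows the propagation step to route index updates to the correct query server.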
Each index partition is associated with a specific property database that contains metadata associated with a specific set of crawled content. You can distribute the load of query servicing by adding index partitions to a Search service application and placing their query components on different farm servers. You assign a server to service queries by creating a query component on that server. If you want to balance the load of servicing queries across multiple farm servers, add query components to an index partition and associate them with the servers that you want to service queries.

Host Distribution Rule

In Search Server 2010, host distribution rules are used to associate a host with a specific crawl database. By default, hosts are load-balanced across crawl databases based on space availability. However, you may want to assign a host to a specific crawl database for availability and performance optimization.

Planning Objectives

Using all the information that you have been able to collect thus far, you should be nearly prepared to propose a search architecture that includes the following:
- The number of servers and server roles required to support search
- The number of crawlers required to support performance, isolation, and redundancy requirements
- The number of crawl databases and their association with the crawlers
- The number of query components required to support performance, isolation, and redundancy requirements
- The number and distribution of indexes across those query servers, including partitions

An excellent, actionable set of guidance can be found here: http://www.microsoft.com/downloads/details.aspx?familyid=5A3CA177-FB9A-4901-97970C384277DB7C&displaylang=en

You should also have sufficient information to describe a starting point for content source planning, including:
- A listing of all content repositories that this deployment will need to interact with. Note that at this point, you should have already divided these repositories into approximations of SharePoint content sources.
- A characterization of the type, volume, and size of content within each content source
- Any special considerations related to that content source, such as whether a BCS or other custom connector will need to be developed or procured
- A characterization of the frequency with which content changes within that content source, and a description of how your customer needs to account for those changes with crawl schedules

APPENDIX

Capacity Planning

One of the factors that determine the ongoing success of an enterprise search solution is whether your customers can plan for and specify disk space requirements for full-text catalog files and search databases. These space requirements are affected by many factors, including the characteristics and size of the corpus of information being indexed.

Analyzing Enterprise Corpuses

The relationship between a corpus and the disk space requirement for full-text catalog files is very complex. Although the relationship is generally governed by corpus size, many other factors can cause considerable variation in this relationship.
There is also a complex relationship between a corpus of information and the disk space requirement for the search database.

Corpus Size

The most apparent (although by no means only) factor that governs full-text catalog space requirements is the size of the corpus of information. Therefore, you must help your customers to determine the corpus size before you calculate disk space requirements. Customers can attempt to measure the corpus size by adding together the sizes of all of the files and other items to be indexed; for example, the disk space used in file shares and the file sizes of all documents and other content hosted in SharePoint 2010. However, because content sizes for similar files vary by system, this approach may yield misleading data. For example, SharePoint content database sizes can vary depending on the versioning strategy for documents and other items. This approach can also be unnecessarily time-consuming. A more typical, and probably more robust and manageable, approach is to estimate corpus size rather than measure it. The steps for estimating corpus size are as follows:
1. Categorize the different content forms, such as files, Web pages, list items, and database items.
2. Multiply the average size of each content form by the number of items in each form, to obtain size estimates for each content form.
3. Add together all of the size estimates.

Customers must also estimate corpus growth characteristics, based on past growth patterns, and gather expected growth characteristics from analysts and systems management staff in the organization.

Content Characteristics

Although the main governing factor that affects full-text catalog size is the size of the corpus, the relationship is not a simple one.
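The three-step corpus estimate above amounts to a simple calculation. The content forms, item counts, and average sizes in this sketch are hypothetical figures for illustration; real values come from the customer's own content inventory.

```python
# Hypothetical counts and average sizes for illustration only; real figures
# come from the customer's content inventory (step 1: categorize the forms).
content_forms = {
    "files":          {"count": 200_000, "avg_size_mb": 0.5},
    "web_pages":      {"count": 50_000,  "avg_size_mb": 0.05},
    "list_items":     {"count": 500_000, "avg_size_mb": 0.01},
    "database_items": {"count": 100_000, "avg_size_mb": 0.002},
}

def estimate_corpus_mb(forms: dict) -> float:
    # Step 2: average size x item count per form; step 3: sum the estimates.
    return sum(f["count"] * f["avg_size_mb"] for f in forms.values())

total_mb = estimate_corpus_mb(content_forms)
print(f"Estimated corpus size: {total_mb / 1024:.1f} GB")
# → Estimated corpus size: 105.2 GB
```

Applying an expected annual growth rate to the per-form counts, rather than to the total, lets the estimate track forms that grow at different speeds (for example, list items typically multiply faster than large files).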
The following characteristics of the content in the corpus can affect this relationship in many ways:
- File Formats. These affect the ratio between total file sizes and full-text catalog size. File compression can also affect this ratio. For example, the compressed nature of a more recent Office Word format results in a smaller file size than if the equivalent content is stored in an Office Word 2003 file.
- Content Density. This is the ratio of textual content in files to embedded objects. For example, a PowerPoint presentation with 15 slides of images has less density than a PowerPoint presentation with 15 slides of text. The former may have a larger file size, but the latter will have a larger index footprint.
- Content Uniqueness. This represents the uniqueness of the content that is being indexed. SharePoint 2010 tokenizes indexed words for efficient storage and lookup; the less unique the words that are being indexed, the lower the ratio between the corpus size and the full-text catalog size. This factor applies both to the uniqueness of words within files and to the uniqueness of content between files:
  o Uniqueness within files. If a 10 MB file contains technical content about SharePoint 2010, it is likely to have many occurrences of words such as SharePoint, search, Microsoft, enterprise, document, file, server, index, and query. Because of the tokenizing of these common words, the space required to index the file will be smaller than that required to index 10 MB of a novel that has a rich and varied vocabulary.
  o Uniqueness of content. The full-text catalog from indexing a corpus that consists of many unique documents about various subjects is larger than that from a corpus that consists of many copies of similar documents.
For example, if an organization stores a copy of its terms and conditions in each project site within a site collection, the terms and conditions are likely to be very similar for each project, with perhaps only minor variations on a project-by-project basis. The words within these documents are tokenized by the indexer and result in a smaller full-text catalog than if each file had relatively unique content.
- Diminishing Uniqueness. Because all vocabularies are essentially limited, there is a relationship between total corpus size and the ratio of that size to the full-text catalog space requirements. This is simply a statistical phenomenon: 10 terabytes of data usually contain less unique content, as a proportion of the corpus size, than 1 terabyte of data. To illustrate this point further, as a corpus grows, it tends to include more and more occurrences of words that have already been used elsewhere in the corpus, until at some point the corpus contains every word in the organization’s vocabulary and further additions to the corpus do not introduce new words.

Content Metadata and Managed Properties

The amount of data stored in indexed and managed properties affects the ratio between the size of the corpus and the sizes of both the full-text catalog and the search property store. The three major kinds of properties that affect these sizes are crawled properties, managed properties, and ACLs.
- Crawled Properties. These are the attributes that are discovered and indexed at crawl time, including attributes from content source systems, such as the last modified date for files in file shares and the column data for items in SharePoint lists and libraries. They also include embedded property values from the property sheets of specific file types, such as Microsoft Office documents. Crawled property values are stored in the full-text catalog and so can affect the ratio between the size of the corpus and the size of the full-text catalog file.
- Managed Properties. These represent a virtual mapping between one or more crawled properties and the values for each item that are stored in the search database. Therefore, the number of managed properties, and their mappings to indexed properties, can affect the ratio between the number of files in a corpus and the size of the search database.
- ACLs. These represent the permissions that are stored in the search database for each secured item. Therefore, the ratio between the number of items in the corpus and the size of the search database depends on whether the items are secured.

Content Versions

Another factor that affects the ratio between corpus size and full-text catalog size is the versioning strategy in the farm.
- SharePoint Versioning and Indexing. The SharePoint 2010 indexer indexes only one version of each item, so it is not possible to index all of the versions of files in a document library, or all of the versions of items in a list.
- Versioned Corpus and Index Ratios. If a corpus is characterized by many versions of items in SharePoint lists or libraries, the ratio of the size of the entire corpus (including all item versions) to the size of the full-text catalog file is higher than if versioning in SharePoint lists and libraries is disabled. You should draw your customer’s attention to this if the corpus size measurement is based on content database sizes.
- Content Access Accounts and Versioning. The content access account affects which versioned content is indexed (although it does not affect the ratio between corpus size and index space requirements). SharePoint technologies can maintain multiple versions of a page or document and present specific versions to different users based on their roles.
For example, if a user checks out and modifies a published page, and then saves it but does not check it back in, the next time that she requests the page she is presented with the saved version. Anyone else who requests the page is presented with the latest published version. Then, if the user makes further changes, checks the page back in, and submits it for approval, the next time she requests the page she is presented with the edited version that is awaiting approval. Any person in the approver role is also presented with that version; however, all other readers are presented with the latest published version. In the same way, when the indexer requests a page or file for indexing purposes, SharePoint presents the version of the item that is appropriate for the account that is being used to perform the crawl. Although there is no fixed rule for selecting content access accounts, it is important to specify an appropriate account for the crawl. In general, if only approved, published content should be indexed, a reader’s account should be used to crawl SharePoint content. However, if unpublished content must be indexed, perhaps in a volatile authoring environment, an editor account, approver account, or another administrative account would be appropriate.
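The version-selection behavior described above can be sketched as a small decision function. The page model, role names, and function here are purely illustrative assumptions, not a SharePoint API; the point is that the crawl account is just another requester, so its role determines which version lands in the index.

```python
# Conceptual sketch of the version-selection behavior described above.
# The page model, role names, and function are illustrative assumptions,
# not a SharePoint API.
def visible_version(page: dict, requester: str) -> str:
    # A version submitted for approval is shown to its author and to approvers.
    if page.get("pending_approval") and (
        requester == page.get("author") or requester in page.get("approvers", ())
    ):
        return "pending"
    # An author with a saved (not yet checked-in) draft sees that draft.
    if requester == page.get("author") and page.get("has_draft"):
        return "draft"
    # Everyone else, including a reader crawl account, sees the published version.
    return "published"

draft_page = {"author": "alice", "has_draft": True}
pending_page = {"author": "alice", "pending_approval": True, "approvers": {"bob"}}

print(visible_version(draft_page, "alice"))    # → draft
print(visible_version(pending_page, "bob"))    # → pending
print(visible_version(pending_page, "crawl"))  # → published
```

Run through the last line with an administrative crawl account added to the approvers set instead, and the index would contain the pending version; that is the trade-off the content access account choice controls.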