Smart Search in Kentico 6.0
1/18/2012
Miro Remias, Solution Architect

Agenda
• Smart Search: How It Works
• Index Types
• Analyzer Types
• Related Scheduled Tasks & Keys
• Example - Searching In Content Of Media Files

How It Works
Definition
Smart Search is index-based searching through the content of websites or other objects within the system (3rd party library - Lucene.Net, v 2.1.0).
Where to find the index file(s)?
File system: /App_Data/CMSModules/SmartSearch/<Index code name>
Customization: CMSSearchIndexPath ("App_Data\\CMSModules\\SmartSearch\\")
How to analyze the index file?
Luke - Lucene Index Toolbox (http://www.getopt.org/luke/)
Note: Don't forget to assign the disk write permission to the App_Data folder!

How It Works
“When a search request is sent to the system by a user, it is the index file that gets searched, which results in significantly better performance compared to linear SQL query search.”
Life cycle of a document/object in the index file:
A) When a document/object is created/updated/deleted, a new indexing task is logged in the database.
B) The database (CMS_SearchTask table) is automatically checked (on a regular basis) for the presence of indexing tasks.
C) The task is processed and the document/object is added to, updated in, or deleted from the index file.

How It Works - Database
• CMS_SearchIndex
• CMS_SearchIndexCulture
• CMS_SearchIndexSite
• CMS_SearchTask (API: SearchTaskInfo)
  SearchTaskType (nvarchar) - (SearchTaskTypeEnum: Update, Delete, Rebuild, Optimize, Process)
  SearchTaskObjectType (nvarchar) - (PredefinedObjectType: ABTEST, ACCOUNT, BIZFORM etc.)
  SearchTaskField (nvarchar) - usually the name of the ID field.
  SearchTaskValue (nvarchar) - usually the object/document ID.
  SearchTaskServerName (nvarchar) - server name in case web farms are used.
  SearchTaskStatus (SearchTaskStatusEnum: Ready, InProgress).
  SearchTaskPriority (int) - higher value = higher priority.
  SearchTaskCreated (datetime) - task creation date.
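The life cycle above (log a task, check the task table, process the task) can be illustrated with a small, language-neutral sketch. This is not Kentico code: it is a minimal Python model in which a list stands in for the CMS_SearchTask table and a dictionary stands in for the index file. The field comments refer to the columns listed above; the function names, the "FormID" field name, the tie-breaking by creation order, and the removal of finished tasks are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from itertools import count

# Stand-in for SearchTaskCreated: a monotonic counter instead of a datetime.
_created = count()


@dataclass
class SearchTask:
    task_type: str          # SearchTaskType: "Update" or "Delete" in this sketch
    object_type: str        # SearchTaskObjectType
    id_field: str           # SearchTaskField, usually the name of the ID field
    value: str              # SearchTaskValue, usually the object/document ID
    priority: int = 0       # SearchTaskPriority: higher value = higher priority
    status: str = "Ready"   # SearchTaskStatus: Ready or InProgress
    created: int = field(default_factory=lambda: next(_created))


def log_task(queue, task):
    """Step A: a content change logs an indexing task."""
    queue.append(task)


def process_tasks(queue, index, load_content):
    """Steps B and C: pick up ready tasks and apply them to the index."""
    ready = [t for t in queue if t.status == "Ready"]
    # Highest priority first; ties broken by creation order (an assumption).
    for task in sorted(ready, key=lambda t: (-t.priority, t.created)):
        task.status = "InProgress"
        key = (task.object_type, task.value)
        if task.task_type == "Delete":
            index.pop(key, None)
        else:
            # "Update" covers both newly created and changed objects here.
            index[key] = load_content(task)
        queue.remove(task)  # assumption: a finished task leaves the queue


# Usage: two update tasks, the second with a higher priority.
queue, index = [], {}
forms = {"1": "contact form", "2": "order form"}
log_task(queue, SearchTask("Update", "BIZFORM", "FormID", "1"))
log_task(queue, SearchTask("Update", "BIZFORM", "FormID", "2", priority=5))
process_tasks(queue, index, lambda task: forms[task.value])
# Both forms are now in `index` (task "2" was applied first) and `queue` is empty.
```

The point of the model is the decoupling: content changes only append a task record, and a separate, periodic step turns those records into index changes.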
Index Types
1) Custom index - indexes any kind of data depending on its implementation.
2) Custom tables - indexes records in custom tables.
3) Documents - indexes content of documents in the content tree.
4) Documents crawler - indexes the content of the HTML output generated by documents in the content tree.
   • Customization options:
     CMS.SiteProvider.SearchHelper.OnHtmlToPlainText - triggered when the HTML output is processed by a crawler.
     CMS.SiteProvider.SearchHelper.HtmlToPlainText() - converts HTML to plain text (body part).
     CMS.SiteProvider.SearchHelper.DownloadHtmlContent(url) - returns the complete HTML code of the page based on the provided URL.
     CMS.TreeEngine.TreeNode.GetSearchDocument() - returns a Lucene Document object.
5) Forums - indexes content of discussion forums.
6) General - indexes objects of a specified type. Any objects within the CMS can be searched this way.
7) Users - indexes details about system users.

Analyzer Types
Tokenized Field - indicates whether the content of the field should be processed by the analyzer when indexing. The general rule is to use this for Content fields and not for Searchable fields.
1) Custom - option of performing tokenization according to your particular requirements.
2) Keyword - tokenizes the entire stream as a single token. This is useful for data like zip codes, IDs, and some product names.
3) Simple - divides text at non-letter characters.
4) Standard - grammar-based analyzer (stop-words, shortcuts, ...), very efficient for English, but may not produce satisfactory results with other languages.
5) Starts with - tokenizes all prefixes contained in words, which allows searching for words that start with the entered string. Text is divided at whitespace characters.
   Example: abc => a ab abc
6) Stop - contains a collection of stop-words at which text is divided.
7) Subset - tokenizes all substrings in words, which allows searching for words that contain the entered string. Text is divided at whitespace characters.
   Example: abc => abc ab bc a b c
8) White space - divides text at whitespace characters.
Note: Stop words - dictionary containing words which will be omitted from indexing (e.g. 'and', 'or', ...) when a Stop or Standard analyzer is used (~\App_Data\CMSModules\SmartSearch\_StopWords).
The Starts with & Subset analyzers use the CMS.SiteProvider.SubSetAnalyzer class.

Related Scheduled Tasks & Keys
Scheduled tasks:
• Optimize search indexes (Search.IndexOptimizer) [Enabled] - performs index optimization (defragmentation resulting in better performance, particularly in the case of large indexes). By default executed once per week.
• Execute search tasks (Search.TaskExecutor) [Enabled] - executes indexing tasks (created and executed automatically when the indexed content changes) that were not completed successfully on their automatic execution. By default performed every 4 hours.
Web.config keys:
• CMSRemoveDiacriticsForIndexField (true) - indicates whether diacritics should be removed for the index field.
• CMSSearchStoreContentField (false) - indicates whether the content field should be stored in the index.
• CMSSmartSearchIndexCategories (false) - indicates whether document categories should be indexed.
• CMSSearchContentXpathValue ("//property[@name='text' or @name='contentbefore' or @name='contentafter']") - XPath selecting the web part fields that should be added to the search document content.
• CMSProcessSearchTasksByScheduler (false) - if true, smart search tasks are processed by the scheduler.
• CMSCreateTemplateSearchTasks (true) - any change made to a page template automatically triggers an update of all documents based on that template in the appropriate smart search indexes.

Example - Searching In Content Of Media Files
Requirements:
• Be able to search in content of media files.
• Whenever a media file is updated/deleted/inserted, the smart search index should be updated as well.
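The second requirement amounts to translating each media file event into an indexing task record. The sketch below is a minimal Python illustration of that mapping, not Kentico code. It assumes that inserts and updates both map to the Update value of SearchTaskTypeEnum (the enum lists no separate insert type) and that deletes map to Delete. The event names, the function name and the "FileID" field name are hypothetical.

```python
# Media file events (illustrative names) mapped to SearchTaskTypeEnum values
# named in this deck: inserts and updates both re-index, deletes remove.
EVENT_TO_TASK_TYPE = {
    "insert": "Update",
    "update": "Update",
    "delete": "Delete",
}


def task_for_event(event, file_id):
    """Build the indexing task record that a media file event should log."""
    try:
        task_type = EVENT_TO_TASK_TYPE[event]
    except KeyError:
        raise ValueError(f"unsupported media file event: {event!r}") from None
    return {
        "SearchTaskType": task_type,
        "SearchTaskField": "FileID",      # hypothetical ID field name
        "SearchTaskValue": str(file_id),  # the ID of the affected media file
        "SearchTaskStatus": "Ready",      # waiting to be processed
    }
```

For instance, `task_for_event("delete", 7)` yields a record whose SearchTaskType is "Delete" and whose SearchTaskValue is "7"; an event outside the three listed raises a ValueError rather than silently logging nothing.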
Process:
• Create custom index - the Rebuild operation needs to be able to read the media library file definitions [media_file table] plus the physical content of the files on the file system, and then index them.
• Create custom analyzer (optional).
• Use global events to react to insert/delete/update events of media files in order to create indexing tasks (CMS_SearchTask).
• Create a scheduled task for processing these indexing tasks.

Example - Creating Custom Index
Steps:
A.) Create a class file that implements the CMS.SiteProvider.ICustomSearchIndex interface.
B.) Implement the Rebuild method.
C.) Register the custom index in the CMS.
D.) Rebuild the index and test.
Note: Step-by-step guide available here: http://devnet.kentico.com/docs/devguide/smart_search_defining_custom_index_content.htm

Example - Creating Custom Analyzer
Steps:
A.) Create a class file that inherits from the Lucene.Net.Analysis.Analyzer class.
B.) Implement the TokenStream method.
C.) Create a class file that inherits from the Lucene.Net.Analysis.Tokenizer class.
D.) Implement the Next method.
E.) Register the custom analyzer for the index in the CMS.
F.) Rebuild the index and test.
Note: Step-by-step guide available here: http://devnet.kentico.com/docs/devguide/smart_search_using_a_custom_analyzer.htm

Example - Creating Smart Search Task
Steps:
A.) Register for the Insert, Update and Delete events of the MediaFileInfo object with global events.
B.) Create a SearchTaskInfo object (record in CMS_SearchTask).
   • Delete task - SearchTaskInfoProvider.CreateTask()
   • Update/Insert task - SearchTaskInfo

Example - Creating Smart Search Task Processor
Steps:
A.) Create a scheduled task class file that implements the ITask interface.
B.) Implement the Execute method.
C.) Register the scheduled task in the CMS.
Note: Step-by-step guide available here: http://devnet.kentico.com/docs/devguide/scheduling_a_custom_code.htm

Tips
• Documents that have their Exclude this document from search property enabled will not be indexed.
  This property can be configured by selecting a document from the content tree in CMS Desk and going to Content -> Edit -> Properties -> General.
• If Smart Search is not working - check the event log for possible exception/error to investigate (CMS Site Manager -> Administration -> Event log).
• Remember: Smart Search condition is not a SQL WHERE condition!
• Smart Search result DataSet contains columns: id, type, score, position, title, content, created, image (_customurl in case of custom index).

Questions

Sources
Developer’s Guide
• http://devnet.kentico.com/docs/devguide/smart_search_overview.htm

Contact
Miro Remias
• e-mail: miro@kentico.com
• consulting: http://www.kentico.com/Support/Consulting/Overview