Smart Search in Kentico 6.0
1/18/2012
Miro Remias, Solution Architect
Agenda
• Smart Search: How It Works
• Index Types
• Analyzer Types
• Related Scheduled Tasks & Keys
• Example - Searching In Content Of Media Files
How It Works
Definition
Smart Search is index-based searching through the content of websites or
other objects within the system. It is built on a 3rd party library, Lucene.Net (v 2.1.0).
Where to find the index file(s)?
File system: /App_Data/CMSModules/SmartSearch/<Index code name>
Customization: CMSSearchIndexPath ("App_Data\\CMSModules\\SmartSearch\\")
How to analyze the index file?
Luke - Lucene Index Toolbox (http://www.getopt.org/luke/)
Note: Make sure the application has disk write permission on the App_Data folder!
How It Works
“When a search request is sent to the system by a user, it is the index file
that gets searched, which results in significantly better performance
compared to linear SQL query search.”
Life cycle of a document/object in the index file:
A) When a document/object is created, updated, or deleted, a new
indexing task is logged in the database.
B) The database (CMS_SearchTask table) is checked automatically
at regular intervals for pending indexing tasks.
C) The task is processed and the document/object is
added to, updated in, or deleted from the index file.
How It Works - Database
• CMS_SearchIndex
• CMS_SearchIndexCulture
• CMS_SearchIndexSite
• CMS_SearchTask (API: SearchTaskInfo)
    SearchTaskType (nvarchar) - SearchTaskTypeEnum: Update, Delete, Rebuild, Optimize, Process.
    SearchTaskObjectType (nvarchar) - PredefinedObjectType: ABTEST, ACCOUNT, BIZFORM etc.
    SearchTaskField (nvarchar) - usually the name of the ID field.
    SearchTaskValue (nvarchar) - usually the object/document ID.
    SearchTaskServerName (nvarchar) - server name in case web farms are used.
    SearchTaskStatus - SearchTaskStatusEnum: Ready, InProgress.
    SearchTaskPriority (int) - higher value = higher priority.
    SearchTaskCreated (datetime) - task creation date.
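The task life cycle and the CMS_SearchTask fields above can be pictured with a small sketch. This is not Kentico code: the Python class, the in-memory dictionary standing in for the index file, the "media.file" object type name, the tie-break on creation date, and the removal of tasks after processing are all assumptions made for illustration. Only the field meanings (type, object type, field, value, status, priority, created) come from the slide.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Callable, Dict, List


@dataclass
class SearchTask:
    """Illustrative stand-in for one CMS_SearchTask record (not the real API)."""
    task_type: str          # "Update" or "Delete" (subset of SearchTaskTypeEnum)
    object_type: str        # e.g. a hypothetical "media.file"
    field_name: str         # usually the name of the ID field
    value: str              # usually the object/document ID
    priority: int = 0       # higher value = higher priority
    status: str = "Ready"   # "Ready" or "InProgress"
    created: datetime = field(default_factory=datetime.now)


def process_tasks(tasks: List[SearchTask],
                  index: Dict[str, str],
                  load_content: Callable[[str, str], str]) -> None:
    """Apply pending tasks to the index, highest priority first."""
    # Assumption: ties on priority are broken by creation date (oldest first).
    for task in sorted(tasks, key=lambda t: (-t.priority, t.created)):
        task.status = "InProgress"
        key = f"{task.object_type}:{task.value}"
        if task.task_type == "Delete":
            index.pop(key, None)          # remove the entry if it is indexed
        elif task.task_type == "Update":
            index[key] = load_content(task.object_type, task.value)
    # Assumption: processed tasks are removed from the queue.
    tasks.clear()
```

For example, queuing an Update for object 2 and a Delete for object 1 and then calling process_tasks leaves only object 2 in the dictionary, which mirrors steps A) to C) of the life cycle.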
Index Types
1) Custom index - indexes any kind of data depending on its implementation.
2) Custom tables - indexes records in custom tables.
3) Documents - indexes content of documents in the content tree.
4) Documents crawler - indexes the content of the HTML output generated by
documents in the content tree.
• Customization options:
    CMS.SiteProvider.SearchHelper.OnHtmlToPlainText
        Triggered when the HTML output is processed by a crawler.
    CMS.SiteProvider.SearchHelper.HtmlToPlainText()
        Converts HTML to plain text (body part).
    CMS.SiteProvider.SearchHelper.DownloadHtmlContent(url)
        Returns the complete HTML code of the page at the provided URL.
    CMS.TreeEngine.TreeNode.GetSearchDocument()
        Returns a Lucene Document object.
5) Forums - indexes content of discussion forums.
6) General - indexes objects of a specified type. Any objects within the CMS can be
searched this way.
7) Users - indexes details about system users.
Analyzer Types
Tokenized Field - indicates if the content of the field should be processed by the analyzer when
indexing. The general rule is to use this for Content fields and not for Searchable fields.
1) Custom - Option of performing tokenization according to your particular requirements.
2) Keyword - Tokenizes the entire stream as a single token. This is useful for data like zip
codes, ids, and some product names.
3) Simple - divides text at non-letter characters.
4) Standard - grammar-based analyzer (stop-words, shortcuts, ...), very efficient for
English, but may not produce satisfactory results with other languages.
5) Starts with - tokenizes all prefixes contained in words, which allows searching for
words that start with the entered string. Text is divided at whitespace characters.
Example: abc => a ab abc
6) Stop - contains a collection of stop-words at which text is divided.
7) Subset - tokenizes all substrings in words, which allows searching for words that
contain the entered string. Text is divided at whitespace characters.
Example: abc => abc ab bc a b c
8) White space - divides text at whitespace characters.
Note: Stop words - dictionary containing words that are omitted from indexing (e.g. 'and', 'or',
...) when a Stop or Standard analyzer is used (~\App_Data\CMSModules\SmartSearch\_StopWords).
The Starts with and Subset analyzers use the CMS.SiteProvider.SubSetAnalyzer class.
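The two examples given for the Starts with and Subset analyzers (abc => a ab abc and abc => abc ab bc a b c) can be reproduced with a short sketch. This is plain Python written for illustration, not the SubSetAnalyzer implementation. The function names are hypothetical, and the token order is chosen only to match the examples on the slide.

```python
from typing import List


def starts_with_tokens(text: str) -> List[str]:
    """Emit every prefix of every whitespace-separated word."""
    tokens: List[str] = []
    for word in text.split():
        tokens.extend(word[:end] for end in range(1, len(word) + 1))
    return tokens


def subset_tokens(text: str) -> List[str]:
    """Emit every substring of every whitespace-separated word, longest first."""
    tokens: List[str] = []
    for word in text.split():
        n = len(word)
        for length in range(n, 0, -1):            # n, n-1, ..., 1
            for start in range(0, n - length + 1):
                tokens.append(word[start:start + length])
    return tokens


print(starts_with_tokens("abc"))  # ['a', 'ab', 'abc']
print(subset_tokens("abc"))       # ['abc', 'ab', 'bc', 'a', 'b', 'c']
```

The sketch also shows why these analyzers produce larger indexes: a word of n characters yields n prefix tokens but n(n+1)/2 substring tokens.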
Related Scheduled Tasks & Keys
Scheduled tasks:
• Optimize search indexes (Search.IndexOptimizer) [Enabled] - performs index optimization
(defragmentation resulting in better performance, particularly in the case of large indexes).
By default executed once per week.
• Execute search tasks (Search.TaskExecutor) [Enabled] - executes indexing tasks (created
and executed automatically when the indexed content changes) that were not completed
successfully on their automatic execution. By default performed every 4 hours.
Web.config keys:
• CMSRemoveDiacriticsForIndexField (true) - indicates whether diacritics should be removed for
index field.
• CMSSearchStoreContentField (false) - indicates whether the content field should be stored in the index.
• CMSSmartSearchIndexCategories (false) - indicates whether document categories should be
indexed.
• CMSSearchContentXpathValue ("//property[@name='text' or @name='contentbefore' or
@name='contentafter']") - XPath expression selecting the web part fields that are added to the
search document content.
• CMSProcessSearchTasksByScheduler (false) - if true, smart search tasks are processed by the scheduler.
• CMSCreateTemplateSearchTasks (true) - any change made to a page template automatically
triggers an update of all documents based on that template in the appropriate smart
search indexes.
Example - Searching In Content Of Media Files
Requirements:
• Be able to search in content of media files.
• Whenever the media file is updated/deleted/inserted, smart
search index should be updated as well.
Process:
• Create custom index - the Rebuild operation needs to be able to read
media library file definitions [media_file table] plus their physical
content on the file system, and then index them.
• Create custom analyzer (optional).
• Use global events to react to insert/delete/update events of
media files in order to create indexing tasks (CMS_SearchTask).
• Create scheduled task for processing these indexing tasks.
Example - Creating Custom Index
Steps:
A.) Create a class file that implements the
CMS.SiteProvider.ICustomSearchIndex interface.
B.) Implement Rebuild method.
C.) Register the custom index in the CMS.
D.) Rebuild index and test.
Note: Step-by-step guide available here:
http://devnet.kentico.com/docs/devguide/smart_search_defining_custom_index_content.htm
Example - Creating Custom Analyzer
Steps:
A.) Create class file that inherits from Lucene.Net.Analysis.Analyzer class.
B.) Implement TokenStream method.
C.) Create class file that inherits from Lucene.Net.Analysis.Tokenizer class.
D.) Implement Next method.
E.) Register the custom analyzer for index in the CMS.
F.) Rebuild index and test.
Note: Step-by-step guide available here:
http://devnet.kentico.com/docs/devguide/smart_search_using_a_custom_analyzer.htm
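The split between an analyzer (which hands out a token stream) and a tokenizer (which returns one token per call) can be sketched as follows. This is a Python illustration of the pattern only, not Lucene.Net code: the class names, the comma-splitting rule, the lower-casing, and the use of None to signal the end of input are assumptions made for the example.

```python
from typing import List, Optional


class CommaTokenizer:
    """Hypothetical tokenizer: splits text at commas, one token per call."""

    def __init__(self, text: str) -> None:
        self._parts = [part.strip() for part in text.split(",")]
        self._pos = 0

    def next_token(self) -> Optional[str]:
        # Skip empty pieces; return None once the input is exhausted.
        while self._pos < len(self._parts):
            part = self._parts[self._pos]
            self._pos += 1
            if part:
                return part.lower()
        return None


class CommaAnalyzer:
    """Hypothetical analyzer: creates the tokenizer for a given field."""

    def token_stream(self, field_name: str, text: str) -> CommaTokenizer:
        # field_name is unused here; a real analyzer could vary by field.
        return CommaTokenizer(text)


def collect_tokens(analyzer: CommaAnalyzer, field_name: str, text: str) -> List[str]:
    """Drain the token stream the way an indexer would."""
    stream = analyzer.token_stream(field_name, text)
    tokens: List[str] = []
    while (token := stream.next_token()) is not None:
        tokens.append(token)
    return tokens


print(collect_tokens(CommaAnalyzer(), "tags", "Red, ,Green,BLUE"))
# ['red', 'green', 'blue']
```

Steps A.) to D.) map onto this shape: the analyzer class corresponds to CommaAnalyzer with its token_stream method, and the tokenizer class corresponds to CommaTokenizer with its next_token method.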
Example - Creating Smart Search Task
Steps:
A.) Register for Insert, Update, Delete events of MediaFileInfo
object with global events.
B.) Create SearchTaskInfo object (record in CMS_SearchTask).
• Delete task - SearchTaskInfoProvider.CreateTask()
• Update/Insert task - SearchTaskInfo
Example - Creating Smart Search Task Processor
Steps:
A.) Create a scheduled task class file that implements the ITask interface.
B.) Implement Execute method.
C.) Register scheduled task in CMS.
Note: Step-by-step guide available here:
http://devnet.kentico.com/docs/devguide/scheduling_a_custom_code.htm
Tips
• Documents that have their Exclude this document from search property
enabled will not be indexed. This property can be configured by selecting a
document from the content tree in CMS Desk and going to Content -> Edit
-> Properties -> General.
• If Smart Search is not working, check the event log for exceptions or
errors to investigate (CMS Site Manager -> Administration ->
Event log).
• Remember: Smart Search condition is not a SQL WHERE condition!
• Smart Search result DataSet contains columns: id, type, score, position,
title, content, created, image (_customurl in case of custom index).
Questions
Sources
Developer’s Guide
• http://devnet.kentico.com/docs/devguide/smart_search_overview.htm
Contact
Miro Remias
• e-mail: miro@kentico.com
• consulting: http://www.kentico.com/Support/Consulting/Overview