A Full Text Search Engine For BBS Lily
Speaker: Gu Rong (顾荣)
Advisor: Huang Yihua (黄宜华)
Email: gurongwalker@gmail.com

Contents
1. Background
2. Brief Intro to the Principle of Full Text Search Engine
3. Implementation of FTSE for BBS Lily
4. Maybe Google & Baidu Have Done These...
5. Conclusion

1. Background
1.1 What is a full text search engine?
1.2 Why do we need it?

What is a full text search engine?
"In a full text search, the search engine examines all of the words in every stored document as it tries to match search words supplied by the user." (from Wikipedia)

Why do we need an FTSE for BBS Lily?
Data in BBS Lily:
- Capacity (base): around 3 million posts in total.
- Speed (growth): over a thousand new posts every day.
- Granularity (post): each post is 1K~4K in size.

2. Brief Intro to the Principle of Full Text Search Engine
What happens after you press Enter?

Abstract IR Architecture
[Figure: online side: Query -> Representation Function -> Query Representation. Offline side: Documents -> Representation Function -> Document Representation -> Index. A Comparison Function matches the query representation against the index and returns Hits.]

About Representation Function
[Figure: Documents -> (case folding, tokenization, stopword removal, stemming) -> Bag of Words -> Inverted Index. Labels also shown: syntax, semantics, word knowledge, etc.]

A Simple Inverted Index Demo
Documents:
  Doc 1: one fish, two fish
  Doc 2: red fish, blue fish
  Doc 3: cat in the hat
  Doc 4: green eggs and ham

Inverted index (term, document frequency, postings as (doc, term frequency)):
  term    df   postings
  blue    1    (2, 1)
  cat     1    (3, 1)
  egg     1    (4, 1)
  fish    2    (1, 2) (2, 2)
  green   1    (4, 1)
  ham     1    (4, 1)
  hat     1    (3, 1)
  one     1    (1, 1)
  red     1    (2, 1)
  two     1    (1, 1)

Map/Reduce's Role…
Characteristics of the indexing problem:
1. Scalability matters.
2. It must be relatively fast.
3. It is a batch operation.
4. Updates may not be important.
5. Crawling is a challenge in itself.
Characteristics of the retrieval problem:
1. It must have sub-second response time.
2. For the web, only relatively few results are needed.
Suitable for Map/Reduce?
Indexing problem: perfect!
Retrieval problem: not so good...

3. Implementation of FTSE for BBS Lily
3.1 Outline of Work Flow
3.2 Crawl Web Pages & Mine Info
3.3 Indexing Process
3.4 Set up Web Retrieval Interface
3.5 Optimization

3.1 Outline of Work Flow
[Figure: Web Page 0 ... Web Page n -> Crawler (crawl & info mining) -> formatted files holding each post's content (title, context) and vice info (author, URL, hot) -> Map/Reduce (inverted index & ranking) -> token 0 ... token n, each with a list of <DID, Rank> pairs -> split index, plus an index for indices. Web retrieval: JSP page -> query string -> Term 0, Term 1 ... Term n -> search & merge -> target DIDs -> result list -> response.]

3.2 Crawl Web Pages & Mine Info
3.2.1 Target
3.2.2 Framework of Lily BBS
3.2.3 Strategy of Crawler
3.2.4 Strategy of Miner

Target of Crawler & Miner
A. Crawler: crawl every post from BBS Lily continuously, with fault tolerance.
B. Miner: mine the wanted info from each post that the crawler has fetched, and store it in a designed pattern.

Framework of BBS Lily (1)
[Figure: a tree. BBS Lily -> sections (section 0, section 1, section 2, ... up to the node labeled 12) -> boards (Board 0 ... Board n) -> posts (Post 0 ... Post n).]

Framework of BBS Lily (2)
[Figure]

Strategy of Crawler: DFS
[Figure: the same tree (BBS Lily -> sections -> boards -> posts), traversed depth-first.]
Tips:
- Traverse the catalog links to get the content.
- Automatically follow the link to the next page and repeat the routine.
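The depth-first strategy above can be sketched as follows. This is only a minimal illustration in Python, not the crawler built for this work. The in-memory SITE map stands in for real page fetching and link extraction, and the names crawl_dfs, get_links and is_post are hypothetical. Next-page links are treated as ordinary links, and a page that fails to load is skipped, which is one simple reading of "fault tolerance".

```python
from typing import Callable, Dict, Iterator, List

# Hypothetical stand-in for the BBS: each catalog page maps to the links it contains.
# "board_down" has no entry, which simulates a page that fails to load.
SITE: Dict[str, List[str]] = {
    "bbs": ["section0", "section1"],
    "section0": ["board0"],
    "section1": ["board1", "board_down"],
    "board0": ["post0", "post1", "board0?page=2"],  # last link is the "next page"
    "board0?page=2": ["post2"],
    "board1": ["post3"],
}


def get_links(url: str) -> List[str]:
    """Return the links found on a catalog page (raises KeyError if unavailable)."""
    return SITE[url]


def is_post(url: str) -> bool:
    """Assumption for this sketch: post URLs are recognizable by their prefix."""
    return url.startswith("post")


def crawl_dfs(
    root: str,
    links_of: Callable[[str], List[str]],
    post_test: Callable[[str], bool],
) -> Iterator[str]:
    """Yield every post reachable from root, depth-first, visiting each URL once."""
    stack = [root]
    seen = {root}
    while stack:
        url = stack.pop()
        if post_test(url):
            yield url
            continue
        try:
            links = links_of(url)
        except (KeyError, OSError):
            continue  # skip the failed page and keep crawling
        # Push in reverse so the first link on the page is explored first.
        for link in reversed(links):
            if link not in seen:
                seen.add(link)
                stack.append(link)


posts = list(crawl_dfs("bbs", get_links, is_post))
print(posts)  # ['post0', 'post1', 'post2', 'post3']
```

An explicit stack is used instead of recursion so that a deep board with many pages cannot exhaust the call stack.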
Strategy of Miner: Regex
1. Use HtmlParser to get the tags' content.
2. Extract the info by regex.
3. Store it in a designed pattern.
Each post is stored on one line in the pattern below:
URL'\007'hot'\007'author'\007'title'\007'content
See demo.

3.3 Indexing Process
3.3.1 Target
3.3.2 Filter Source File
3.3.3 Build Inverted Index
3.3.4 Partition Inverted Index File
3.3.5 Second-Level Index (Index for Indices)

Target of Indexing Process
Run a series of Map/Reduce operations to generate inverted indices with rank and position info.
[Figure: Txt_Filter (filter source file) -> Inverted Index -> Partition Index Table -> Index for Indices.]

Filter Source File
Although the source file stores posts in a well-designed pattern, we still need to filter it before building the inverted indices.
Reasons:
1. Examine and eliminate noise and duplications.
   - Noise example: "http://bbs.nju.edu.cn/...'\007'null'\007'null'\007'null'\007'null"
   - About duplications…
2. It is natural to pre-process the data before we really handle it.

Build Inverted Index
Details…
Building the inverted index is useful on its own; it is more useful if we also calculate and record some side info in the same pass. The side info includes rank, positions, etc.

Build Inverted Index: Side Info
1. TF-IDF (Term Frequency-Inverse Document Frequency), where $n_{i,j}$ is the number of times term $t_i$ occurs in document $d_j$:
   $$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}, \qquad \mathrm{idf}_i = \log \frac{|D|}{|\{d : t_i \in d\}|}, \qquad \mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i$$
   - $|D|$: total number of documents in the corpus
   - $|\{d : t_i \in d\}|$: number of documents where the term $t_i$ appears (that is, $n_{i,j} \neq 0$)
2. Position info does not need any calculation; it can be recorded as an integer pair (StartIndex, EndIndex).

Build Inverted Index: Structure
1. For each post in the filtered source file, its offset in the file can be used as its DID (document ID).
2. Each line of the inverted index file stores a term with its info, in the format below:
   term info
   info = SingleDIDInfo;SingleDIDInfo;SingleDIDInfo...
   SingleDIDInfo = DID:rank:positions
   positions = position%position%position%position...
   position = IsTitle|start|end
Example:
黑莓 48522292:162.6:1|2|4%0|804|806;42910773:106.26:0|456|458%0|560|562
(The term 黑莓 means "BlackBerry".)

Partition Inverted Index File
After the last step we have the inverted index file. However, the file is very big:
  Source file size    Inverted index file size
  48M                 72.5M
  182M                240M
  703M                828M

Second-Level Index (Index for Indices)
In the last step, we partitioned the inverted index file into a certain number of parts, for example 16. Each part file contains some term-info pairs. So, given a term, how can we know which part file it is in, and which line? We need an index for indices.
P.S. This really works: the second-level index file is less than 10% of the size of the source file.
  Source file size    Inverted index file size    Second-level index file size
  48.1M               72.5M                       2.375M
  182M                240M                        5.17M
  703M                828M                        10.5M

3.4 Set up Web Retrieval Interface
3.4.1 Target
3.4.2 Sort Pages

Target of Web Retrieval Interface
Build an interface that accepts the user's query and responds with search results.
1. Restrict the query string.
2. Sort the search results dynamically.
3. Return the results page by page.

Sort Pages
Merge the ranks and rank again. Here is a demo.
[Figure: the query string is word-segmented into Term 0, Term 1 and Term 2; each term yields a list of (document, rank) hits, e.g. Term 0: Doc1 10, Doc3 90, Doc7 20. The lists are merged and ranked again, giving the final order: Doc7 100, Doc3 95, Doc2 40, Doc6 40, Doc5 15, Doc1 10.]

3.5 Optimization
Optimization measures in different areas:
Reduce sort time
a) For each term, at most the top 1500 DIDs are kept.
b) Use TreeMap to sort.
Cache strategy
a) Keep some hot inverted index files in memory.
b) Cache replacement: LRU.
Reduce I/O operations
a) The response page is created dynamically.
b) Each request returns 10 records.

4. Maybe Google & Baidu Have Done...
1. Search in parallel.
2. An outstanding word segmentation algorithm, run in parallel.
3. A better ranking strategy: the rank should describe the relationship between a token and a DID precisely.
4. Record each user's query string:
   a) Feed it back to word segmentation.
   b) Provide a suggestion function (triggered by the input change event).
[Figure: user's query -> word segmentation.]

5. Conclusion
5.1 Summary
5.2 Highlights
5.3 About Map/Reduce…

Summary
The related work has three parts, as below:
- Crawler: a hard-coded crawler and miner, aimed at getting data from BBS Lily.
- Indexing process: runs as a sequence of MapReduce operations.
- Web retrieval: a web interface for users to retrieve info.

Highlights
From the application view:
This stuff is cool. It provides a friendly user experience when we want to search for something in BBS Lily.
From the technical view:
Using Map/Reduce to process data offline has provided several benefits:
1. The indexing code is simpler, smaller, and easier to understand.
2. We can keep conceptually unrelated computations separate.
3. The indexing process has become much easier to operate and maintain.

About Map/Reduce
Map/Reduce is not just a programming model; actually it's also a life model…

Many thanks to…
Teacher Huang; Yang Xiaoliang; Xiao Tao; Liu Yulong; Zhang Lu; NUAA & NJU …
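
Backup: Reading One Inverted Index Line (sketch)
The line format from section 3.3 (a term, then DID:rank:positions entries separated by ";", positions separated by "%", each position written as IsTitle|start|end) can be read back roughly as shown here. This is a hedged sketch in Python rather than the code used in this work. It assumes the term and its info are separated by whitespace and that IsTitle is "1" for a title hit. The names Position, Posting and parse_index_line are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Position:
    is_title: bool  # assumed: "1" means the hit is in the post title
    start: int
    end: int


@dataclass
class Posting:
    did: int        # document ID (the post's offset in the filtered source file)
    rank: float
    positions: List[Position]


def parse_index_line(line: str) -> Tuple[str, List[Posting]]:
    """Split one index line into its term and its list of postings."""
    # Assumption: a single run of whitespace separates the term from its info.
    term, info = line.strip().split(None, 1)
    postings = []
    for single_did_info in info.split(";"):
        did, rank, positions = single_did_info.split(":")
        parsed_positions = []
        for position in positions.split("%"):
            is_title, start, end = position.split("|")
            parsed_positions.append(Position(is_title == "1", int(start), int(end)))
        postings.append(Posting(int(did), float(rank), parsed_positions))
    return term, postings


# The example line from the "Build Inverted Index: Structure" slide.
EXAMPLE = "黑莓 48522292:162.6:1|2|4%0|804|806;42910773:106.26:0|456|458%0|560|562"
term, postings = parse_index_line(EXAMPLE)
for posting in postings:
    print(term, posting.did, posting.rank, len(posting.positions))
```

At query time, a retrieval layer could parse the line for each query term this way and then combine the per-document ranks before sorting.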