A Full Text Search Engine For BBS Lily

Presenter: Gu Rong
Advisor: Huang Yihua
Email:gurongwalker@gmail.com
Contents
1. Background
2. Brief Intro to the Principle of Full Text Search Engine
3. Implementation of FTSE for BBS Lily
4. Maybe Google & Baidu Have Done These...
5. Conclusion
1. Background
1.1 What is a full text search engine?
1.2 Why do we need it?
What is a full text search engine?
"In a full text search, the search engine examines all of the words in every stored document as it tries to match search words supplied by the user."
(From Wikipedia)
Why do we need a FTSE for BBS Lily?
Data in BBS Lily:
Base (capacity): the total amount is around 3 million posts.
Speed (increase): over a thousand new posts every day.
Granularity (post): each post's size is 1K~4K.
2. Brief Intro to the Principle of Full Text Search Engine
What happens after you press enter?
Abstract IR Architecture
Offline: Documents → Representation Function → Document Representation → Index
Online: Query → Representation Function → Query Representation
Comparison Function: matches the Query Representation against the Index → Hits
About Representation Function
Documents → (case folding, tokenization, stopword removal, stemming) → Bag of Words → Inverted Index
Not captured by the bag of words: syntax, semantics, word knowledge, etc.
A Simple Inverted Index Demo
Doc 1: one fish, two fish
Doc 2: red fish, blue hat
Doc 3: cat in the hat
Doc 4: green eggs and ham

Term    df   Postings (DocID:tf)
blue    1    2:1
cat     1    3:1
egg     1    4:1
fish    2    1:2, 2:1
green   1    4:1
ham     1    4:1
hat     2    2:1, 3:1
one     1    1:1
red     1    2:1
two     1    1:1
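The demo above can be reproduced with a few lines of Python. This is only a minimal sketch: the stopword list and the "strip a trailing s" rule are assumptions standing in for real stopword removal and stemming, and the function names are made up for illustration.

```python
import re
from collections import Counter, defaultdict

# Assumed stopword list, chosen only to cover the four demo sentences.
STOPWORDS = {"in", "the", "and"}


def tokenize(text):
    """Case-fold, split on non-letters, drop stopwords, crudely stem plurals."""
    result = []
    for tok in re.findall(r"[a-z]+", text.lower()):
        if tok in STOPWORDS:
            continue
        # Naive stand-in for a stemmer: "eggs" -> "egg".
        if tok.endswith("s") and len(tok) > 3:
            tok = tok[:-1]
        result.append(tok)
    return result


def build_inverted_index(docs):
    """Map each term to a list of (doc_id, term_frequency), ordered by doc_id."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term, tf in Counter(tokenize(docs[doc_id])).items():
            index[term].append((doc_id, tf))
    return dict(index)


docs = {
    1: "one fish, two fish",
    2: "red fish, blue hat",
    3: "cat in the hat",
    4: "green eggs and ham",
}
index = build_inverted_index(docs)
for term in sorted(index):
    print(term, index[term])
```

The document frequency of a term is simply the length of its posting list, so it does not need to be stored separately in this sketch.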
Map/Reduce's Role...
Indexing Problem
Characteristics:
1. scalability
2. relatively fast
3. batch operation
4. updates may not be important
5. crawling is a challenge in itself
Suitable for Map/Reduce? Perfect!
Retrieval Problem
Characteristics:
1. must have sub-second response time
2. for the web, only need relatively few results
Suitable for Map/Reduce? Not so good...
3. Implementation of FTSE for BBS Lily
3.1 Outline of Work Flow
3.2 Crawl Web Pages & Mine Info
3.3 Indexing Process
3.4 Set up Web Retrieval Interface
3.5 Optimization
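3.1 Outline of Work Flow
1. Crawl && Info Mining: the Crawler fetches Web Page 0, Web Page 1, ..., Web Page n and writes formatted files holding the content (title, context) and the vice info (author, URL, hot).
2. Inverted Index && Ranking (Map/Reduce): the formatted files are turned into an inverted index; each of token 0, token 1, ..., token n maps to a list of <DID, Rank> pairs.
3. Split: the inverted index is split into parts and an Index for Indices is built.
4. Web Retrieval: a JSP page takes the query string, breaks it into Term 0, Term 1, ..., Term n, searches and merges the per-term lists to get the target DIDs, and responds with the result list.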
3.2 Crawl Web Pages & Mine Info
3.2.1 Target
3.2.2 Framework of Lily BBS
3.2.3 Strategy of Crawler
3.2.4 Strategy of Miner
Target of Crawler & Miner
A. Crawler
Crawl every post from BBS Lily continuously, with fault tolerance.
B. Miner
Mine the wanted info from each post that the Crawler has fetched from the web; store it in a designed pattern.
Framework of BBS Lily (1)
BBS Lily
  Section 0, Section 1, Section 2, ... (12 sections)
    Board 0, Board 1, ..., Board n
      Post 0, Post 1, ..., Post n
Framework of BBS Lily (2)
Strategy of Crawler: DFS
Depth-first traversal of the hierarchy BBS Lily → section → board → post.
Tips
- Traverse the catalog links to get the content;
- Automatically follow the link to the Next Page and repeat the routine job.
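A depth-first crawl of this kind can be sketched without touching the network. In the sketch below the site is a plain dictionary from a page to the links found on it, a page with no entry is treated as a post, and a board's "next page" is just another link; all page names are hypothetical, and fault tolerance (retries, resuming) is left out.

```python
def crawl_dfs(site, root):
    """Visit pages depth-first and return post pages in visit order.

    site: dict mapping a catalog page to the list of links on it.
    A page that has no entry in `site` is treated as a post (a leaf).
    """
    posts, seen, stack = [], set(), [root]
    while stack:
        url = stack.pop()
        if url in seen:
            continue  # never fetch the same page twice
        seen.add(url)
        links = site.get(url)
        if links is None:
            posts.append(url)  # leaf: a post page
            continue
        # Push in reverse so the first link on the page is visited first.
        stack.extend(reversed(links))
    return posts


# Hypothetical miniature of the section -> board -> post hierarchy.
site = {
    "lily": ["sec0", "sec1"],
    "sec0": ["boardA"],
    "boardA": ["post1", "post2", "boardA/page2"],  # last link = Next Page
    "boardA/page2": ["post3"],
    "sec1": ["boardB"],
    "boardB": ["post4"],
}
print(crawl_dfs(site, "lily"))
```

Using an explicit stack instead of recursion keeps the traversal safe on deep "Next Page" chains; in a real crawler `site.get(url)` would be replaced by fetching the page and extracting its links.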
Strategy of Miner: Regex
1. Use HtmlParser to get the tags' content.
2. Extract info by regex.
3. Store it in a designed pattern.
[Each post is stored in one line, in the pattern below]
URL'\007'hot'\007'author'\007'title'\007'content
See Demo
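As an illustration of the one-line-per-post pattern, the sketch below writes and reads such a record in Python. The field order follows the pattern above; the helper names, the example URL, and the choice to replace embedded newlines and separators with spaces are assumptions, not details taken from the original miner.

```python
SEP = "\007"  # the BEL character used as field separator
FIELDS = ("url", "hot", "author", "title", "content")


def format_post(post):
    """Serialize one post as a single line: url, hot, author, title, content."""
    values = []
    for name in FIELDS:
        value = str(post[name])
        # Assumption: keep one record per line by flattening newlines and
        # any stray separator characters inside a field.
        values.append(value.replace("\n", " ").replace(SEP, " "))
    return SEP.join(values)


def parse_post(line):
    """Split one stored line back into a dict of the five fields."""
    parts = line.rstrip("\n").split(SEP)
    if len(parts) != len(FIELDS):
        raise ValueError("malformed record: expected %d fields" % len(FIELDS))
    return dict(zip(FIELDS, parts))


post = {
    "url": "http://example.invalid/post/1",  # hypothetical URL
    "hot": 12,
    "author": "someone",
    "title": "Hello",
    "content": "first line\nsecond line",
}
line = format_post(post)
print(parse_post(line))
```

A control character such as `\007` is a convenient separator because it practically never occurs in post text, so no escaping scheme is needed.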
3.3 Indexing Process
3.3.1 Target
3.3.2 Filter Source File
3.3.3 Build Inverted Index
3.3.4 Partition Inverted Index File
3.3.5 Second-Level Index (Index for Indices)
Target of Indexing Process
Txt_Filter → Inverted Index → Partition Index Table → Index for Indices
Indexing Process: run a series of Map/Reduce operations to generate inverted indices with rank and position info.
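To make the Map/Reduce step concrete, here is a small in-process imitation in Python. It is a sketch of the general pattern only, not the deck's Hadoop job: the function names are invented, tokens are split on whitespace (real Chinese posts would need word segmentation), and ranking is omitted.

```python
from collections import defaultdict


def map_post(did, text):
    """Map: emit (term, (did, position)) for every token of one post."""
    for position, term in enumerate(text.lower().split()):
        yield term, (did, position)


def reduce_term(term, values):
    """Reduce: group one term's positions by DID into a posting list."""
    by_did = defaultdict(list)
    for did, position in values:
        by_did[did].append(position)
    postings = sorted((did, sorted(positions)) for did, positions in by_did.items())
    return term, postings


def run_job(posts):
    """Tiny stand-in for the shuffle phase between map and reduce."""
    grouped = defaultdict(list)
    for did, text in posts.items():
        for term, value in map_post(did, text):
            grouped[term].append(value)
    return dict(reduce_term(term, values) for term, values in sorted(grouped.items()))


inverted = run_job({0: "red fish blue fish", 19: "one fish"})
print(inverted["fish"])
```

Because each mapper only looks at one post and each reducer only at one term, both phases can run on many machines at once, which is what makes the batch indexing problem a good fit.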
Filter Source File
Although the source file stores posts in a well-designed pattern, we still need to filter it before building the inverted indices.
Reasons
1. Examine and eliminate noise and duplicates
- “http://bbs.nju.edu.cn/...’\007’ null ‘\007’ \null ‘\007’ null ‘\007’ null”
- About duplicates…
2. It is natural to pre-process the data before we really handle it.
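A filter of this kind might look like the sketch below. It assumes the five-field `\007`-separated record layout, treats a record whose non-URL fields are all empty or "null" as noise, and treats a repeated URL as a duplicate; the deck does not say how duplicates were actually detected, so that rule is an assumption.

```python
SEP = "\007"


def filter_records(lines):
    """Yield only well-formed, non-empty, first-seen records."""
    seen_urls = set()
    for line in lines:
        parts = [p.strip() for p in line.rstrip("\n").split(SEP)]
        if len(parts) != 5:
            continue  # malformed line
        url, rest = parts[0], parts[1:]
        if all(field in ("", "null") for field in rest):
            continue  # noise: the crawler got a page without a post
        if url in seen_urls:
            continue  # assumed duplicate rule: same URL seen before
        seen_urls.add(url)
        yield line


records = [
    SEP.join(["u1", "3", "alice", "title", "body"]),
    SEP.join(["u2", " null ", " null ", " null ", " null"]),
    SEP.join(["u1", "3", "alice", "title", "body"]),
    "broken line",
]
kept = list(filter_records(records))
print(len(kept))
```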
Build Inverted Index
Details…
The process of building the inverted index is smart; it will be smarter if we calculate and record some side info properly at the same time.
The side info includes rank, positions, etc.
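Build Inverted Index: Side Info
Side Info
1. TF-IDF (Term Frequency-Inverse Document Frequency):
$\mathrm{idf}_i = \log \frac{|D|}{|\{d : t_i \in d\}|}$
• $|D|$: total number of documents in the corpus
• $|\{d : t_i \in d\}|$: number of documents where the term $t_i$ appears (that is, $n_{i,j} \neq 0$, where $n_{i,j}$ is the number of occurrences of $t_i$ in document $d_j$)
2. Position info does not need any calculation; it can be recorded as an integer pair like (StartIndex, EndIndex).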
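A small worked example may help here. The sketch below computes a TF-IDF weight per (document, term) pair, assuming the common definition tf = (occurrences of the term in the document) / (number of tokens in the document) multiplied by idf = log(|D| / df). How the deck scaled this value into its stored rank is not stated, so the numbers are illustrative only.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: {doc_id: [tokens]} -> {doc_id: {term: tf-idf weight}}."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    weights = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        weights[doc_id] = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return weights


corpus = {
    1: ["one", "fish", "two", "fish"],
    2: ["red", "fish", "blue", "hat"],
    3: ["cat", "hat"],
    4: ["green", "egg", "ham"],
}
w = tf_idf(corpus)
print(round(w[1]["fish"], 4), round(w[3]["cat"], 4))
```

Here "fish" in Doc 1 gets (2/4) * log(4/2), while "cat" in Doc 3 gets (1/2) * log(4/1): the rarer term scores higher even though it occurs only once.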
Build Inverted Index: Structure
1. For each post in the filtered source file, its offset in the file is used as its DID;
2. Each line of the inverted index file stores a term with its info; the details are as below:
term info
info=SingleDIDInfo;SingleDIDInfo;SingleDIDInfo....
SingleDIDInfo=DID:rank:positions
positions=position%position%position%position...
position=IsTitle|start|end
E.g. (the term 黑莓 means "BlackBerry")
黑莓 48522292:162.6:1|2|4%0|804|806;42910773:106.26:0|456|458%0|560|562
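The line grammar above can be parsed directly by splitting on its four separators. The following Python sketch does so for the example line; the function name and the dictionary layout of the result are illustrative choices, and it assumes the term and its info are separated by whitespace as in the example.

```python
def parse_index_line(line):
    """Parse 'term info' into (term, [posting, ...]).

    info      = SingleDIDInfo;SingleDIDInfo;...
    posting   = DID:rank:positions
    positions = position%position%...
    position  = IsTitle|start|end
    """
    term, info = line.rstrip("\n").split(None, 1)
    postings = []
    for single in info.split(";"):
        did, rank, positions = single.split(":")
        parsed_positions = []
        for position in positions.split("%"):
            is_title, start, end = position.split("|")
            parsed_positions.append((is_title == "1", int(start), int(end)))
        postings.append(
            {"did": int(did), "rank": float(rank), "positions": parsed_positions}
        )
    return term, postings


example = "黑莓 48522292:162.6:1|2|4%0|804|806;42910773:106.26:0|456|458%0|560|562"
term, postings = parse_index_line(example)
print(term, postings[0]["did"], postings[0]["rank"], postings[0]["positions"])
```

In the example, the first posting says the term occurs in the post at offset 48522292 with rank 162.6, once in the title (characters 2 to 4) and once in the body (characters 804 to 806).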
Partition Inverted Index File
After the last step, we have the inverted index file.
However, the file is very big...

Source file size    Inverted index file size
48M                 72.5M
182M                240M
703M                828M
Second-Level Index (Index for Indices)
In the last step, we partitioned the inverted index file into a certain number of parts, for example 16. Each part file contains some term-info pairs.
So, when a term is given, how can we know which part file it is in, and which line it is on?
We need an Index for Indices.
P.S. This really works. The second-level index file's size is less than 10% of the source file's.
Second-Level Index (Index for Indices): file sizes

Source file size    Inverted index file size    Second-level index file size
48.1M               72.5M                       2.375M
182M                240M                        5.17M
703M                828M                        10.5M
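One way to realize such an index for indices is to record, for every term, the part file it lives in and the byte offset of its line, so a lookup is one dictionary access plus one seek. The Python sketch below demonstrates this on two tiny temporary part files; the file names, the sample lines, and the use of byte offsets rather than line numbers are assumptions made for the illustration.

```python
import os
import tempfile


def build_second_level_index(part_paths):
    """Map each term to (part number, byte offset of its line)."""
    index = {}
    for part_no, path in enumerate(part_paths):
        with open(path, "rb") as f:
            offset = 0
            for raw in f:
                if raw.strip():
                    term = raw.split(None, 1)[0].decode("utf-8")
                    index[term] = (part_no, offset)
                offset += len(raw)
    return index


def lookup(term, index, part_paths):
    """Return the full inverted-index line of a term, or None if absent."""
    if term not in index:
        return None
    part_no, offset = index[term]
    with open(part_paths[part_no], "rb") as f:
        f.seek(offset)
        return f.readline().decode("utf-8").rstrip("\n")


# Two hypothetical part files with made-up postings.
tmp_dir = tempfile.mkdtemp()
parts = []
for i, lines in enumerate([["blue 2:1.0:0|0|4", "cat 3:1.0:0|0|3"], ["fish 1:2.0:0|4|8"]]):
    path = os.path.join(tmp_dir, "part-%05d" % i)
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write("\n".join(lines) + "\n")
    parts.append(path)

second_level = build_second_level_index(parts)
print(lookup("cat", second_level, parts))
```

The second-level index stores only a term and two numbers per entry, which is consistent with it being much smaller than the inverted index itself.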
3.4 Set up Web Retrieval Interface
3.4.1 Target
3.4.2 Sort Pages
Target of Web Retrieval Interface
Build an interface that accepts the user's query and responds with search results.
1. Restrict the query string;
2. Sort search results dynamically;
3. Return results page by page.
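Returning results page by page amounts to slicing the sorted result list. A minimal sketch, assuming 1-based page numbers and a default page size of 10 (the function name is hypothetical):

```python
def paginate(results, page, page_size=10):
    """Return the slice of `results` shown on the given 1-based page."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    start = (page - 1) * page_size
    return results[start:start + page_size]


hits = ["doc%d" % i for i in range(25)]
print(paginate(hits, 3))
```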
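Sort Pages
Merge the ranks and rank again~
Here is a demo.
[Figure: the query string goes through word segmentation into Term 0, Term 1 and Term 2. Each term contributes a list of (Doc, rank) hits, for example Doc3 90 and Doc3 05, Doc7 20 and Doc7 80. The lists are merged by adding up the ranks of the same document and then ranked again, giving: Doc7 100, Doc3 95, Doc2 40, Doc6 40, Doc5 15, Doc1 10.]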
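The merge-and-rank-again step can be sketched as summing each document's rank over all query terms and sorting by the total. In the Python example below the final totals match the demo, but the grouping of hits under the three terms is an assumption, since the slide layout does not make it recoverable.

```python
from collections import defaultdict


def merge_and_rank(term_hits):
    """Sum ranks per document across terms; sort by total rank, descending."""
    totals = defaultdict(float)
    for hits in term_hits:
        for did, rank in hits:
            totals[did] += rank
    # Ties are broken by document id so the order is deterministic.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# Assumed per-term hit lists (the grouping is hypothetical).
term0 = [("Doc1", 10), ("Doc3", 90), ("Doc7", 20)]
term1 = [("Doc2", 20), ("Doc3", 5), ("Doc7", 80)]
term2 = [("Doc2", 20), ("Doc5", 15), ("Doc6", 40)]

ranked = merge_and_rank([term0, term1, term2])
for did, rank in ranked:
    print(did, rank)
```

A document that matches several query terms accumulates rank from each of them, so Doc7 (20 + 80) overtakes Doc3 (90 + 5) even though Doc3 had the single highest per-term rank.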
3.5 Optimization
Optimization measures in different areas:
Reduce Sort Time
a) For each term, at most the top 1500 DIDs are kept.
b) Use TreeMap to sort.
Cache Strategy
a) Keep some hot inverted index files in memory.
b) Cache replacement: LRU.
Reduce I/O Operations
a) The response page is created dynamically.
b) Each request returns 10 records.
......
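Two of these measures are easy to illustrate. The Python sketch below truncates a posting list to its top-k entries and implements a small LRU cache. It is not the deck's code: the deck mentions Java's TreeMap for sorting, whereas this sketch swaps in `heapq.nlargest`, and the class and function names are made up.

```python
import heapq
from collections import OrderedDict


def top_k_postings(postings, k=1500):
    """Keep only the k highest-ranked (did, rank) pairs, best first."""
    return heapq.nlargest(k, postings, key=lambda posting: posting[1])


class LRUCache:
    """Least-recently-used cache, e.g. for hot inverted index entries."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used


cache = LRUCache(capacity=2)
cache.put("fish", [(1, 2.0), (2, 1.0)])
cache.put("hat", [(3, 1.0)])
cache.get("fish")            # "fish" is now the most recently used entry
cache.put("cat", [(3, 1.0)])  # evicts "hat"
print(cache.get("hat"), cache.get("fish"))
```

Capping each posting list bounds the work of the merge step at query time, and the cache avoids re-reading the index lines of popular terms from disk.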
4. Maybe Google & Baidu have done...
1. Search in parallel.
2. An outstanding word segmentation algorithm.
3. A better ranking strategy: describe the relationship between a token and a DID precisely.
4. Record each user's query string:
a) feed it back to word segmentation;
b) provide a suggestion function (triggered by the input change event).
......
5. Conclusion
5.1 Summary
5.2 Highlights
5.3 About Map/Reduce…
Summary
The work has three parts, as below:
Crawler: a hard-coded crawler and miner, aimed at getting data from BBS Lily.
Indexing process: the indexing process runs as a sequence of MapReduce operations.
Web retrieval: a web interface for users to retrieve info.
Highlights
Application view
This stuff is COOL~
It provides a friendly user experience when we want to search for something in our BBS Lily.
Technical view
Map/Reduce is used to process data offline. It has provided several benefits, such as:
1. The indexing code is simpler, smaller, and easier to understand.
2. We can keep conceptually unrelated computations separate.
3. The indexing process has become much easier to operate and maintain.
About Map/Reduce
Map/Reduce is not just a Programming Model, actually it's also a Life Model…
Many thanks to…
Teacher Huang;
Yang Xiaoliang;
Xiao Tao;
Liu Yulong;
Zhang Lu;
NUAA & NJU
…