PDD

Project Development Document for
“P2P Document Sharing Community”
Doc version   Author     Date
1.0           羅婉琪     2003.04.26
1.1           羅婉琪     2003.04.28
1.2           黃振修     2003.05.03
1.3           羅婉琪     2003.06.09
Abstract
In this document we present the Project Development Document, based on the
Project Requirement Document of PDSC. The project architecture and detailed
implementation issues are discussed below.
Architecture
Program components
Our system architecture is pure peer-to-peer.
(Architecture diagram: the Graphic User Interface sits on top of the Kernel, which
contains the API for GUI, the Host Lookup Thread, and the Server Thread; the Kernel
uses the Protocol layer (Client and Server) and the Document Keyword Processor; the
Database layer consists of the Local Shared File Database, the Host Database, and the
Task Database.)
Component description

Graphic User Interface:
Passes user requests to the kernel by calling the corresponding APIs.
Those APIs return immediately, and the results from the kernel are
reported by posting messages to the GUI's window handler. Since the GUI needs to
distinguish between those messages, every request requires a unique id.

Kernel:
API for GUI:
Called by the GUI. In order to return immediately, we create a thread for
each request. The thread then uses the client protocol to get what we need and reports
the result via messages, as sketched below.
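A minimal sketch of this calling convention, assuming a Win32 message-based GUI;
WM_PDSC_RESULT, SearchRequest, and DoSearchRequest are hypothetical names, not the
project's real API:

#include <windows.h>
#include <process.h>

const UINT WM_PDSC_RESULT = WM_APP + 1;   // the GUI tells results apart by this message plus the request id

struct SearchRequest {
    HWND hwndNotify;    // GUI window handler that receives the result message
    UINT requestId;     // unique id supplied by the GUI for this request
    char keyword[256];
};

static unsigned __stdcall SearchWorker(void* arg)
{
    SearchRequest* req = static_cast<SearchRequest*>(arg);
    int resultCount = 0;  // ... run the client search protocol here ...
    // report the result by posting a message; wParam carries the request id
    PostMessage(req->hwndNotify, WM_PDSC_RESULT, req->requestId, resultCount);
    delete req;
    return 0;
}

// Called by the GUI; returns immediately after spawning a worker thread.
void DoSearchRequest(HWND hwndNotify, UINT requestId, const char* keyword)
{
    SearchRequest* req = new SearchRequest;
    req->hwndNotify = hwndNotify;
    req->requestId = requestId;
    lstrcpynA(req->keyword, keyword, sizeof(req->keyword));
    HANDLE h = (HANDLE)_beginthreadex(NULL, 0, SearchWorker, req, 0, NULL);
    if (h) CloseHandle(h);
}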
Host Lookup Thread:
A thread that picks up some hosts from the host database,
uses the client protocol to get host information,
and then pushes the results back into the host database.
Server Thread:
A thread that serves the incoming connections from the server socket. It creates
a thread for each connection. The new thread records the client host
information, analyzes the request type, and calls the corresponding server protocol. If
the protocol finishes without any error, the client host information is added to the
host database.
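A minimal sketch of the accept loop described above, using Winsock; the names and the
dispatch details are assumptions, not the project's actual server code:

#include <winsock2.h>
#include <process.h>

static unsigned __stdcall ConnectionThread(void* arg)
{
    SOCKET client = (SOCKET)arg;
    // Record the client host information, read the request line, analyze the
    // request type (GET_HOST_LIST, GET_FILE_LIST, DOWNLOAD, SEARCH, RESPONSE)
    // and call the corresponding server protocol. On success, add the client
    // host to the host database.
    closesocket(client);
    return 0;
}

void ServerThread(SOCKET listenSocket)
{
    for (;;) {
        SOCKET client = accept(listenSocket, NULL, NULL);
        if (client == INVALID_SOCKET)
            break;                       // server socket closed or fatal error
        HANDLE h = (HANDLE)_beginthreadex(NULL, 0, ConnectionThread, (void*)client, 0, NULL);
        if (h) CloseHandle(h);
    }
}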

Protocol:
Client:
Protocol implementations for the client. Five functions are implemented now:
get host list, get file list, download, search, and response. The
search function is special: it first forwards the request to some
hosts chosen at random, then searches the local shared file database, and finally responds
if something is found. To avoid handling duplicate requests, it registers the
request in the task database before forwarding; if the request is already
registered, the function returns immediately. The forwarding is
multi-threaded to speed it up. The other functions, apart from search,
simply connect to the destination host directly.
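A minimal sketch of the search flow described above. The helper functions stand in
for the database and protocol modules; their names and signatures are assumptions:

#include <string>
#include <vector>

struct Host { std::string name, ip; int port; };

// Stubs standing in for the kernel/database/protocol modules (assumed interfaces).
static bool TaskDatabase_Register(const std::string&) { return true; }     // false if already registered
static std::vector<Host> HostDatabase_PickRandom(int) { return std::vector<Host>(); }
static void ForwardSearchAsync(const Host&, const std::string&, const std::string&) {}
static std::vector<std::string> LocalSharedFileDatabase_Search(const std::string&) { return std::vector<std::string>(); }
static void SendResponse(const Host&, const std::string&, const std::vector<std::string>&) {}

void ClientSearch(const Host& origin, const std::string& taskId, const std::string& key)
{
    if (!TaskDatabase_Register(taskId))
        return;                                  // duplicate request: drop it immediately

    std::vector<Host> targets = HostDatabase_PickRandom(3);
    for (size_t i = 0; i < targets.size(); ++i)  // each forward runs in its own thread
        ForwardSearchAsync(targets[i], taskId, key);

    std::vector<std::string> found = LocalSharedFileDatabase_Search(key);
    if (!found.empty())
        SendResponse(origin, taskId, found);     // RESPONSE goes back to the requester
}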
Server:
Protocol implementations for the server. There are five functions corresponding to
the client protocol. The search function first gets the request and then simply calls the
search function in the client protocol. The response function first gets the response
and then passes the result to the kernel.

Document Keyword Processor:
Maintains the shared documents' information. Monitors every file and re-computes the
keyword vector if a document is modified, then updates the local shared
file database.
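A minimal sketch of the monitoring idea using the Win32 change-notification API
(the project's own hook code may work differently):

#include <windows.h>

void MonitorSharedDirectory(const char* sharedPath)
{
    HANDLE hChange = FindFirstChangeNotificationA(
        sharedPath, TRUE,                          // watch subdirectories too
        FILE_NOTIFY_CHANGE_FILE_NAME | FILE_NOTIFY_CHANGE_LAST_WRITE);
    if (hChange == INVALID_HANDLE_VALUE)
        return;

    while (WaitForSingleObject(hChange, INFINITE) == WAIT_OBJECT_0) {
        // Re-compute keyword vectors for the modified documents and update the
        // local shared file database here.
        if (!FindNextChangeNotification(hChange))  // re-arm the notification
            break;
    }
    FindCloseChangeNotification(hChange);
}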

Database:
Local Shared File Database:
Saves a keyword vector for each document under the shared directory. Provides a
search function to get a file list from the database. The search function
computes a score between the search keywords and each document's
keyword vector, and returns the list of files whose score is higher than a threshold.
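A minimal sketch of the threshold search, with assumed type and function names; the
real scoring is described under "Document Search/Ranking" below:

#include <map>
#include <string>
#include <vector>

typedef std::map<std::string, int> KeywordVector;   // word -> relative frequency (percent)

// Trivial stand-in for the scoring described under "Document Search/Ranking".
static int ScoreDocument(const KeywordVector& doc, const std::string& keyword)
{
    KeywordVector::const_iterator it = doc.find(keyword);
    return it == doc.end() ? 0 : it->second * 100;
}

std::vector<std::string> SearchDatabase(
    const std::map<std::string, KeywordVector>& db,  // file name -> keyword vector
    const std::string& keyword, int threshold)
{
    std::vector<std::string> result;
    for (std::map<std::string, KeywordVector>::const_iterator it = db.begin();
         it != db.end(); ++it) {
        if (ScoreDocument(it->second, keyword) > threshold)
            result.push_back(it->first);         // keep files scoring above the threshold
    }
    return result;
}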
Host Database:
Saves the hosts in this community. Provides some basic operations for management,
such as dump, merge, load, and save.
Task Database:
Saves the requests received from the server. Provides a few operations to check for
existence.
Protocol specification
Name        Specification
hostname    host name.
ip          IPv4 network address, in a.b.c.d format.
port        service port.
time        time of last communication.
task        unique id for a search request.
key         search keywords.
count       total record count.
pathname    path name.
filename    file name.
filetype    file type, 1 for directory, 0 for file.
ctime       time of last file status change, in seconds.
mtime       time of last data modification, in seconds.
lsize       low four bytes of the file size, converted to an unsigned integer.
hsize       high four bytes of the file size, converted to an unsigned integer.
data        file data, transmitted in binary.
#           text after # is a comment.
Note: 'hsize' is not fully implemented, because not every file system can store files
larger than 4GB.
Former edition.
System Architecture:
Our system architecture is client-server: each client program gets information from the
server side, and only downloads or uploads files peer to peer.
(Diagram: clients get information from the server; file downloads go directly between peers.)
Work Flow
In the steps below, [Client] marks a message sent by the requesting client and [Server]
marks a message sent by the responding host.

Get host list:
1. GET_HOST_LIST hostname ip port\n [Client]
2. time count\n [Server]
repeat 3 “count” times
3. hostname ip port time\n [Server]
4. OK\n [Server]
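A minimal sketch of the client side of this exchange. Plain iostreams stand in for the
socket so the example stays self-contained; in the kernel the stream would wrap the
connection:

#include <iostream>
#include <sstream>
#include <string>
#include <vector>

struct HostRecord { std::string hostname, ip; int port; long time; };

std::vector<HostRecord> GetHostList(std::ostream& out, std::istream& in,
                                    const std::string& myName, const std::string& myIp, int myPort)
{
    out << "GET_HOST_LIST " << myName << " " << myIp << " " << myPort << "\n";

    long lastTime; int count;
    in >> lastTime >> count;                      // "time count\n" from the server

    std::vector<HostRecord> hosts;
    for (int i = 0; i < count; ++i) {             // "hostname ip port time\n", repeated
        HostRecord h;
        in >> h.hostname >> h.ip >> h.port >> h.time;
        hosts.push_back(h);
    }
    std::string ok;
    in >> ok;                                     // final "OK\n"
    return hosts;
}

For example, feeding the function an std::istringstream that holds a captured server
reply parses it into HostRecord entries.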

Get file list:
1. GET_FILE_LIST hostname ip port\n [Client]
2. pathname\n [Client]
3. count\n [Server]
repeat 4,5 “count” times
4. filename\n [Server]
5. filetype hsize lsize ctime mtime\n [Server]
6. OK\n [Server]

Download:
1. DOWNLOAD hostname ip port\n [Client]
2. pathname/filename\n [Client]
3. lsize\n [Server]
4. data [Server]
5. OK\n [Server]
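A minimal sketch of reading the server's reply to a DOWNLOAD request; again an
iostream stands in for the socket, and the helper name is hypothetical:

#include <fstream>
#include <iostream>
#include <string>

bool ReadDownloadReply(std::istream& in, const std::string& localPath)
{
    long lsize;
    in >> lsize;                    // "lsize\n"
    in.ignore();                    // skip the newline before the binary data

    std::ofstream out(localPath.c_str(), std::ios::binary);
    if (!out) return false;

    char buf[4096];
    for (long left = lsize; left > 0; ) {
        long chunk = left > 4096 ? 4096 : left;
        in.read(buf, chunk);        // binary transmission of "data"
        if (in.gcount() <= 0) return false;
        out.write(buf, in.gcount());
        left -= (long)in.gcount();
    }
    std::string ok;
    in >> ok;                       // final "OK\n"
    return ok == "OK";
}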

Search:
1. SEARCH hostname ip port\n [Client]
2. hostname ip port task\n [Client]
3. key\n [Client]
4. OK\n [Server]
Response:
1. RESPONSE hostname ip port\n [Client]
2. hostname ip port task\n [Client]
3. count\n [Client]
repeat 4,5 “count” times
4. pathname/filename\n [Client]
5. filetype hsize lsize ctime mtime\n [Client]
6. OK\n [Client]
Former edition.

Register
1. Connect to the Server with no UID. [Client]
2. The Server sends a UID to the Client. [Server]
(Sequence diagram: Client A → Server: Protocol 00; Server → Client A: Protocol 01.)
First connection to the Server; the client does not have an md5 key yet.
→
HELLO “” “Maggie”\n
.
←
HELLO “efx827ad3” “Maggie”\n
.
P.S. The Server adds the account, md5 key, and IP to the Computer List.

Connect
1. The user connects to the Server. [Client]
2. The Server checks the UID of this computer. [Server]
3. The Server updates some information in the Computer List. [Server]
P.S. The Computer List contains the UID, IP, file list, and file key.
(Sequence diagram: Client A → Server: Protocol 02; the Server checks Client A's UID
and updates its information.)
Not the first connection to the Server; the client already has an md5 key.
→
HELLO “efx827ad3” “Maggie”\n
.
←
HELLO “efx827ad3” “Maggie”\n
.
P.S. The Server updates the IP in the Computer List.

Browse Computer List
1. Connect to Server. [Client]
2. Server sends its Online Computer List. [Server]
3. Client shows the result. [Client]
(Sequence diagram: Client A → Server: Protocol 03; Server → Client A: Protocol 04,
the Online Computer List; the client shows the result.)
→
BWCOMP\n
.
←
BWCOMP OK\n
“Maggie”\n
“Apple”\n
“Tony”\n
.

Browse File List
1. The Client requests one computer from the Online Computer List. [Client]
2. The Server sends the file information of the requested computer. [Server]
3. The Client shows the result. [Client]
(Sequence diagram: Client A → Server: Protocol 05, request one computer;
Server → Client A: Protocol 06, send file information; the client shows the result.)
→
BWRES “Maggie”\n
.
←
BWRES OK\n
“paper.doc” “2003-04-28” “12KB”\n
“demo.ppt” “2003-04-10” “220KB”\n
“3DmodelRetrieval.pdf” “2002-11-25” “5MB”\n
“presentation0122.ppt” “2003-01-22” “34KB”\n
.

Search
1. The user enters the keyword, which is sent to the Server. [Client]
2. The Server searches all of its file information for the keyword. [Server]
3. The Server sends the search result to the Client. [Server]
4. The Client shows the result. [Client]
(Sequence diagram: Client A → Server: Protocol 07, send keyword; the Server searches
all its file information; Server → Client A: Protocol 08, send search result; the client
shows the result.)
→
SEARCH “Mpeg7” “Maggie”\n
.
P.S. “Mpeg7” is the keyword; “Maggie” is the account name (optional).
←
SEARCH OK\n
“Maggie” “Mpeg7tutorial.pdf” “2002-07-24” “7MB” “98”\n
“Apple” “Mpeg7encoder.doc” “2002-10-08” “56KB” “95”\n
“Maggie” “Mpeg7decoder.doc” “2002-11-27” “110KB” “92”\n
“Tony” “decoder.doc” “2002-11-28” “110KB” “60”\n
.

Download
1. Connect to the Server to ask for the IP. [Client]
2. The Server sends the IP to Client A. [Server]
(Sequence diagram: Client A → Server: Protocol 09, ask for the IP;
Server → Client A: Protocol 10, send the IP.)
→
GETIP “Maggie”\n
.
←
GETIP OK\n
“140.112.223.35”\n
.
1. Connect to a certain computer (Client B). [Client]
2. Client B sends the file to Client A. [Client]
(Sequence diagram: Client A → Client B: Protocol 11; Client B → Client A: Protocol 12.)
→
GET “Mpeg7decoder.doc”\n
.
←
Char(0) [File Length Char] (0) [Binary File Data]
←
Char(1)
P.S. Char(0) means the file exists; Char(1) means the file no longer exists.

Upload
1. Connect to a certain computer (client B). [Client]
2. Client B checks whether it can permit the upload. [Client]
3. Client B sends the result to Client A. [Client]
4. Client A sends the file to Client B, or not. [Client]
(Sequence diagram: Client A → Client B: Protocol 13; Client B checks whether it can
permit the upload; Client B → Client A: Protocol 14, send the result;
Client A → Client B: Protocol 15, send the file.)
→
CanSend\n
.
←
CanSend OK\n
.
←
CanSend FAIL\n
.
If the response is CanSend OK:
→
Send “Mpeg7decoder.doc” \n
.
←
Send OK\n
.
→
Char(0) [File Length Char] (0) [Binary File Data]
→
Char(1)

Update
1. When a hook event is received, it means the shared directory has been modified.
[Client]
2. Reprocess the Client's shared directory. [Client]
3. The Client sends the information and file list to the Server. [Client]
4. The Server indexes this information. [Server]
(Sequence diagram: Client A gets the hook event and reprocesses the shared directory;
Client A → Server: Protocol 16; the Server indexes the information.)
→
ADDFILE “Maggie” “Mpeg7decoder.doc” “2002-11-27” “110KB”
[document vector]\n
.
←
ADDFILE OK\n
.
→
DELFILE “Maggie” “Mpeg7decoder.doc” \n
.
←
DELFILE OK\n
.
Development Tools/Environment
- BCB 5.0
- Microsoft Visual C++ 6.0
- Whole Tomato Software Visual Assist 4.1
- CVS server and WinCvs 1.2 as client
Related Survey
1. Monitoring Windows events (hook)
2. Indexing algorithm
3. The PDF file format
Component Description
Document Digest
Document digest means extracting the representative portion of a shared document for
later searching. First we need to convert our supported document formats (Word,
PowerPoint, and PDF files) into plain text. We surveyed several open-source projects
for this kind of conversion:
xlhtml (http://chicago.sourceforge.net/xlhtml/)
word2x (http://www.gnu.org/directory/text/wordproc/word2x.html)
wvWare (http://www.wvware.com/)
antiword (http://www.winfield.demon.nl/)
xpdf (pdftotext) (http://www.foolabs.com/xpdf/download.html)
Among those tools, we decided to use xlhtml for converting PowerPoint files,
antiword for Word files, and pdftotext for PDF files. Because most of these
projects originated on Unix platforms, we also need to port them to the Win32
system and re-build them as libraries that can be linked with other modules.
Besides, we also use another library, libiconv, to convert the Unicode text of
Office documents into Big5 encoding.
We are not going to implement Big5 searching in the initial version, so the Big5
characters are filtered out. After that, we parse the plain text to get the words for
analysis. We also selected the 1764 most common English words, which are filtered out
during the word-frequency analysis, and the Porter stemming algorithm is used in this
digest process. A typical document digest may look like:
coding:3 transcoding:3 reduction:2 motion:2 video:2 scalable:2
resolution:2 spatial:2 temporal:1 bit-rate:1 dct:1 coefficients:1
vector:1 data:1 drift:1 residual:1 frame:1 architecture:1 vectors:1
blocks:1 ...
We simply try the simplest possible implementation: take a word's frequency
as its importance. Each parsed word is followed by a number representing its frequency
relative to the sum of all words' frequencies. For example,
“coding:3” means the word “coding” accounts for 3% of the total frequency.
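A minimal sketch of this frequency-based digest, assuming the stop-word filtering and
stemming have already produced a word list; dropping words that round to 0% is an
assumption, not necessarily what the project does:

#include <map>
#include <string>
#include <vector>

std::map<std::string, int> MakeDigest(const std::vector<std::string>& words)
{
    std::map<std::string, int> digest;
    if (words.empty())
        return digest;

    std::map<std::string, int> freq;              // raw word counts
    for (size_t i = 0; i < words.size(); ++i)
        ++freq[words[i]];

    for (std::map<std::string, int>::iterator it = freq.begin(); it != freq.end(); ++it) {
        // each word's share of the total word count, as an integer percentage
        int percent = (int)(100.0 * it->second / words.size() + 0.5);
        if (percent > 0)
            digest[it->first] = percent;          // "coding:3" = 3% of all words
    }
    return digest;
}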
Document Search/Ranking
Now we design a ranking algorithm for searching those document digests. First, if the
keyword is not found in the document digest, the score is simply zero. If the keyword is
found, we extract its frequency, multiply it by 100, and add a position bonus equal to
100 minus the keyword's position in the digest as a percentage of the digest length.
For example, if we search for “transcoding” in the document digest above, we get:
3*100 + (100 - 9/7400 * 100) = 400 (approximately)
We support multiple keywords combined with AND logic. Each keyword's score is
normalized by the highest score, and the average of the normalized scores is then
taken. If only one keyword is supplied, normalization is not performed.
The typical calling convention looks like this:
CDocDigest dg_doc;
dg_doc.startDigest("transcd_ov.ppt");
CDocSearch search;
search.setDigest(dg_doc.getDigest());
int point = search.Search("mpeg-4 transcoding");
// points are returned for each document digest, so the server
// can sort the documents according to their points
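For reference, a worked sketch (not the project's CDocSearch code) of the
single-keyword score described above:

#include <string>

// frequency * 100, plus a position bonus of (100 - position as a percentage of the
// digest length). Searching "transcoding" (frequency 3, found 9 characters into a
// 7400-character digest) gives 3*100 + (100 - 9/7400*100), roughly 400.
int ScoreKeyword(const std::string& digestText, const std::string& keyword, int frequencyPercent)
{
    if (digestText.empty())
        return 0;
    std::string::size_type pos = digestText.find(keyword);
    if (pos == std::string::npos)
        return 0;                                        // keyword not found: zero points
    double positionPercent = 100.0 * pos / digestText.length();
    return frequencyPercent * 100 + (int)(100.0 - positionPercent);
}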
Sharing Module
I am in charge of using the Digest / Modify / Share objects and making them work
together with the Client / Server parts.
The Digest function filters out the common words of a file, leaving the
concentrated, useful words in the file's digest. The Search function uses the
keyword the user types in to search the digests and gives each file a score,
showing how strongly the keyword appears in the file.
So what I need to do is find all of the files under the share root chosen by the user
and record the digests and related information of these files in a list: setd. When
the user closes the program, the records in the list are saved to a save_file. The next
time the user opens the program, we load the information and digests saved last time
from this save_file back into the setd list. One thing should be noted: the user may
have changed the files under the root in the meantime, so we have to check every file
under the root against the records in setd right away.
The user may also delete / create / modify files under the root while using the
program, so we have to detect any change immediately. To do this,
I use FileChangeNotify.cpp to monitor every file under the root against the setd list;
if anything changes, it is updated in setd immediately.
For the Search function, since the digest and related information of each file under
the root are already recorded, I just score every file by its digest and sort the files
by their points. We also set a threshold; files whose points exceed the threshold are
sent to the server.
My program is digest.h / digest.cpp. The three classes below were written by
Chang-Chiu; I simply use them: FileChangeNotify monitors any change under a root,
CDocDigest gets the digest of a file, and CDocSearch grades a file by its digest.
Below I describe every function in Digest.cpp.
digest::digest(): Checks whether the file we used to save the setd records last time exists.
void digest::downloadfile(): If the save_file exists, loads the records from this
file into the setd we use now.
digest::~digest(): When the user closes the program, writes the records of setd to the
save_file and sets 1 in begin_data.txt, indicating that the setd records have been saved
to the file successfully.
void digest::setdigestdata(string shared_path, string db_path):
Sets the share root and the name of the record file in the two variables shared_path
and save_file, and decides whether we need initialall or compare depending on whether
the save_file exists.
void digest::initialall() / void digest::findfile(string npath, string dipath)
Find all of the files under the root, build the digest and related information for each
one, and save them in the setd list. Two functions are needed because FindFirstFile,
the function we use to enumerate files, only finds files in one directory level; to
reach files in nested folders we use recursion, and the extra function makes the
variables easier to handle, as sketched below. Files that are not .ppt / .doc / .pdf
are filtered out.
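A minimal sketch of that recursive walk with FindFirstFile/FindNextFile; the digest
step is only indicated by a comment, and the extension check is case-sensitive for
brevity:

#include <windows.h>
#include <string>

static bool IsSupported(const std::string& name)
{
    size_t dot = name.rfind('.');
    if (dot == std::string::npos) return false;
    std::string ext = name.substr(dot);
    return ext == ".ppt" || ext == ".doc" || ext == ".pdf";
}

void FindFiles(const std::string& dir)
{
    WIN32_FIND_DATAA fd;
    HANDLE h = FindFirstFileA((dir + "\\*").c_str(), &fd);
    if (h == INVALID_HANDLE_VALUE) return;
    do {
        std::string name = fd.cFileName;
        if (name == "." || name == "..") continue;
        if (fd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) {
            FindFiles(dir + "\\" + name);          // recurse into subfolders
        } else if (IsSupported(name)) {
            // digest the file and store its record plus related information
            // in the setd list here
        }
    } while (FindNextFileA(h, &fd));
    FindClose(h);
}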
void digest::compare() / void digest::findcomfile(string npath, string dipath)
Check whether any change has happened under the root; if so, find the changed file and
update its record in setd.
void digest::searchkey(char* key)
Uses the key the user types in to grade every file and give it points, then sorts the
files by their points. If a file's points are greater than the threshold we set, the
file is reported to the server.
void digest::change_shared_path(string newpath)
Handles the situation where the user changes the share root while using our program.
UI Design

- The URL and Keyword edit components can both be triggered simply by pressing Enter.
But since some users may not know this, we still put a "Search" button beside the
edit component.

- The function list changes dynamically according to the selected item in the list.
For example, if the item is a "computer" or a "folder", right-clicking only offers
"Get File List"; if the item is a "file", right-clicking only offers "Download".
Task Assignment
1. Communication Protocol between Client and Server. – ID5
2. Client. – Maggie
3. Server. – ID5
4. File Format of PDF, Index Algorithm. – 振修
5. Digest, Hook. –宜儒
6. Test PDF, PPT component. –Slater
7. Test Doc. –燕君
Technical Issue
1. Peer to Peer, or Client – Server Architecture?
Because we chose to do full-text search, and considering search efficiency, we
decided on a peer-to-peer system architecture.
Former edition.
Peer to peer has a synchronization problem. When a node joins the system for the first
time, how can all the clients be informed? How do we lock the process that is currently
updating the Computer List? After discussion, we felt that using a single Server would
be easier to manage and easier to program, and different Servers could also represent
different groups.
2. How to solve the problem of non-fixed IP Client?
After the first handshake with the Server on entering the system, the Server sends
back a unique key (e.g. an md5 hash). On every later entry into the community, after
the handshake the client shows the Server its previous unique key, so the Server can
identify the client from the key and simply update its IP record. This part can be
handled entirely inside the protocol, so the user never notices it. Ready-made
libraries are available for the md5 hash.
3. How to update information of our indexes?
We use the "hook" method to monitor our shared folder: if the system detects that
the shared folder has been modified, it signals an event to our application, and our
application then digests those files immediately.
Former edition.
The original idea was to check every hour or two whether the shared directory had been
modified and, if so, to re-scan it and extract the indexes, but that approach is not
very timely. There is a technique called "hook" that lets Windows monitor the directory
for us: when the directory is modified, an event is generated immediately, and the
client program re-scans and extracts the indexes when it receives the event. This makes
updates much more timely and avoids a lot of unnecessary checking.
4. Protocol for “Update”?
The document index vector should be built on the client side, so when updating, besides
refreshing the file list, how should the vector also be uploaded? For now it is
tentatively appended directly to the end of the ADDFILE command; we will discuss this
again once the Index Algorithm is ready.
Memo
Former edition.
P.S. The Server must periodically clean up accounts that have not been used for a long time!!
Document ends here