CProDM_Presentation

Supervisor:
Mr. Phan Trường Lâm
Team information
Agenda
Introduction
Project plan
System Requirement Specifications
System Analysis and Design
Testing
Deployment and User Guide
Summary
Demo and Q&A
Introduction
Initial Idea
Literature Review of Existing System
Proposal & Product
Initial Idea
We decided to develop a new system that integrates:
• Collecting documents
• Organizing these documents
• Extracting keywords
• Ranking
• Searching
Literature Review of Existing System
Methods that these websites use to build their systems:
• Big database
• Search
• Ranking and highlighting of returned results
• Comparing documents to detect plagiarism
Literature Review
Achievements of the existing systems
Attractive
•Easy to use
•Speed & Reliability
•Quality Results
•Ensuring Security
Awareness
Limitations of the existing systems
Costs
Privacy
Proposal
•Public for everyone
•Inside and outside the University
•Collect and manage Capstone projects
•Support looking up Capstone projects
•Avoid repeating and copying ideas
•Ranking results
•Cheaper to build
•Refer to other materials
•Free to use
•Friendly interface like Google
Product
Mobile application (in future)
Web application
Project Plan
Development environment
Process
Project organization
Project schedule
Risk management
Development Environment
Hardware
2 GB of RAM
100 GB of hard disk
Core 2 Duo 2.0 GHz
1 GB of RAM
100 GB of hard disk
Core 2 Duo 2.0 GHz
Software
Process
Follow Waterfall model
Project organization
Controlling and Monitoring
• Meeting
• Assign tasks
• Track tasks
• Resolve issues
• Review tasks
• Report
Communication control
Online activity
• Email
• Chat
• Phone
Offline activity
• Kick-off project
• Team building
Project Schedule
Overall plan
Risk Management
• People risk
• Estimation risk
• Technology risk
• Requirement risk
• Schedule risk
System Requirement Specifications
User Requirements
System Requirements
Non-functional requirements
User Requirements
Lecturers and Students:
•Search project documents.
•Download documents.
Librarians:
•Edit profile.
•Search documents.
•Add/Edit/Delete document.
•Add/Edit/Delete category.
Administrator:
•Edit profile.
•Add/Edit/Delete account.
Other requirements
•Search results are ranked.
•Each document has the following information:
Name
Author
Supervisor
Category
Description
•Input files:
Keyword file
Abstract file
Full document file
Other materials
System Requirements
• Communicates with client computers over HTTP, using standard protocols for service-based interactions.
• Configuration
Server: Windows Server 2008 operating system, .NET Framework 3.5, SQL Server 2008, IIS 7
Client: Web browser
Non-functional Requirements
Usability
Availability
Reliability
Security
Performance
Maintainability
System Analysis and Design
Architectural design
Detail design
Database design
Coding convention
Extract Keyword algorithm
Ranking
Architectural design
Overall architecture
MVC architecture design pattern
Detail design
CProDMS Component Diagram
Database design
Entity diagram
Coding convention
Follow:
• Microsoft .NET Library Standards
• FxCop rules and Code Analysis for Managed Code Warnings
Extract Keyword Algorithm
Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information
(Yutaka Matsuo and Mitsuru Ishizuka, Dec. 10, 2003)
Introduction
Study Algorithm
Evaluation
Algorithm – What is the keyword?
Meaning
Frequency
Position
Algorithm – Step by step
Preprocessing
• Discard stop words
• Stem
• Extract frequency
Processing
• Select frequent terms
• Expected probability
• Calculate χ'² value
Output
Algorithm – Studying
Example:
Original text:
Information is the most powerful weapon in the modern society. Every day we are overflowed with a huge amount of data in form of electronic newspaper articles, emails, web pages and search results. Often, information we receive is incomplete, such that further search activities are required to enable correct interpretation and usage of this information.
Step 1, discarded stop words:
Information powerful weapon modern society day overflowed huge amount data electronic newspaper articles emails web pages search results Often information receive incomplete such further search activities required enable correct interpretation usage information
Step 2, stemmed words (using Porter Stemming Algorithm):
Information → informat, powerful → power, society → societi, overflowed → overflow, articles → articl, emails → email, pages → page, results → result, incomplete → incomplet, activities → activ, required → requir, interpretation → interpret, usage → usag
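Step 1 of the example can be sketched in a few lines. This is only an illustration: the stop list `STOP_WORDS` and the function name `discard_stop_words` are hypothetical and deliberately tiny, and stemming is left out because it needs a full Porter stemmer implementation.

```python
import re

# Hypothetical, deliberately small stop list (an assumption for this sketch);
# a real system would load a much fuller list.
STOP_WORDS = {"is", "the", "most", "in", "every", "we", "are", "with",
              "a", "of", "and", "to", "this", "that", "form"}

def discard_stop_words(text):
    """Split text into alphabetic words and drop those in the stop list."""
    words = re.findall(r"[A-Za-z]+", text)
    return [w for w in words if w.lower() not in STOP_WORDS]

sample = "Information is the most powerful weapon in the modern society."
kept = discard_stop_words(sample)
print(kept)
```

The words kept for the first sentence match the beginning of the Step 1 list in the example.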
Algorithm – Studying
Select frequent Term
According to the study, the number of keywords is about 10% of the number of terms in the document, and no more than 30 terms.
The top ten frequent terms (denoted as G) are listed with their probability of occurrence, normalized so that the probabilities sum to 1.
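A minimal sketch of this selection step is shown below. It assumes that "number of terms" means the number of distinct terms, which the slide does not state; the function name `select_frequent_terms` and its parameters are illustrative only.

```python
from collections import Counter

def select_frequent_terms(terms, percent=10, cap=30):
    """Return the frequent-term set G as {term: normalized probability}.

    Assumption: the size of G is `percent` % of the distinct terms,
    at least 1 and at most `cap`.
    """
    counts = Counter(terms)
    k = max(1, min(cap, len(counts) * percent // 100))
    top = counts.most_common(k)
    total = sum(c for _, c in top)
    # Normalize so that the probabilities over G sum to 1.
    return {t: c / total for t, c in top}

# Toy document: 20 distinct terms, two of them clearly frequent.
terms = ["a"] * 5 + ["b"] * 3 + [f"w{i}" for i in range(18)]
G = select_frequent_terms(terms)
print(G)
```

With 20 distinct terms the sketch keeps two of them, and their probabilities are rescaled to sum to 1.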
Algorithm – Studying
Co-occurrence and Importance
Two terms that appear in the same sentence are considered to co-occur once.
Example:
“The imitation game could then be played with the machine in question and the mimicking digital computer and the interrogator would be unable to distinguish them.”
“imitation” and “digital computer” have one co-occurrence.
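Sentence-level co-occurrence counting can be sketched as follows. The input is assumed to be a list of sentences that have already been preprocessed into terms; the function name `cooccurrence_counts` and the sample sentences are illustrative.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sentences):
    """Count, for each unordered pair of terms, the sentences containing both."""
    counts = Counter()
    for sentence in sentences:
        # Each pair of distinct terms in one sentence co-occurs once.
        for a, b in combinations(sorted(set(sentence)), 2):
            counts[(a, b)] += 1
    return counts

sentences = [
    ["imitation", "game", "machine", "digital computer", "interrogator"],
    ["digital computer", "machine"],
]
counts = cooccurrence_counts(sentences)
print(counts[("digital computer", "imitation")])
print(counts[("digital computer", "machine")])
```

Pairs are stored in sorted order so that (w1, w2) and (w2, w1) share one counter.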
Co-occurrence and Importance
The degree of bias of co-occurrence can be used as an indicator of term importance.
Algorithm – Studying
The statistical value of χ² is defined as
$$\chi^2(w) = \sum_{g \in G} \frac{(\mathrm{freq}(w, g) - n_w p_g)^2}{n_w p_g}$$
p_g: unconditional probability of a frequent term g ∈ G (the expected probability)
n_w: the total number of co-occurrences of term w and frequent terms G
freq(w, g): frequency of co-occurrence of term w and term g
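The sketch below computes χ² from p_g, n_w and freq(w, g). It assumes the usual chi-square form, the sum over g ∈ G of (observed − expected)² / expected with expected = n_w · p_g; the function name `chi_square` and the toy numbers are illustrative.

```python
def chi_square(w, freq, p, n_w):
    """Chi-square bias of term w against the frequent terms.

    freq: dict mapping (w, g) to the co-occurrence count freq(w, g)
    p:    dict mapping each frequent term g to its expected probability p_g
    n_w:  total number of co-occurrences of w with the frequent terms
    """
    total = 0.0
    for g, p_g in p.items():
        expected = n_w * p_g
        observed = freq.get((w, g), 0)
        total += (observed - expected) ** 2 / expected
    return total

p = {"g1": 0.5, "g2": 0.5}
biased = chi_square("w", {("w", "g1"): 8, ("w", "g2"): 2}, p, n_w=10)
unbiased = chi_square("w", {("w", "g1"): 5, ("w", "g2"): 5}, p, n_w=10)
print(biased, unbiased)
```

A term whose co-occurrences follow the expected probabilities scores 0; the more biased its co-occurrence distribution, the higher the value.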
Algorithm – Studying
We consider the length of each sentence and revise our definitions:
p_g: (the sum of the total number of terms in sentences where g appears) divided by (the total number of terms in the document)
n_w: the total number of terms in the sentences where w appears, including w
Algorithm – Studying
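We use the following function to measure the robustness of bias values. It subtracts the maximal term from the χ² value:
$$\chi'^2(w) = \chi^2(w) - \max_{g \in G} \frac{(\mathrm{freq}(w, g) - n_w p_g)^2}{n_w p_g}$$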
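A sketch of this robust variant follows, under the same assumptions about the per-term chi-square contributions (expected count n_w · p_g). The function name `chi_square_robust` and the toy numbers are illustrative.

```python
def chi_square_robust(w, freq, p, n_w):
    """Chi-square of w with the single largest contribution removed.

    freq: dict mapping (w, g) to freq(w, g)
    p:    dict mapping each frequent term g to p_g
    n_w:  total count used to scale the expected co-occurrences
    """
    contributions = [
        (freq.get((w, g), 0) - n_w * p_g) ** 2 / (n_w * p_g)
        for g, p_g in p.items()
    ]
    # Subtract the maximal term so one dominant frequent term
    # cannot produce a high score on its own.
    return sum(contributions) - max(contributions)

p = {"g1": 0.5, "g2": 0.25, "g3": 0.25}
freq = {("w", "g1"): 10, ("w", "g2"): 9, ("w", "g3"): 1}
value = chi_square_robust("w", freq, p, n_w=20)
print(value)
```

Here the contributions are 0, 3.2 and 3.2, so the plain sum would be 6.4 and the robust value is 3.2.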
Algorithm – Studying
To improve the extracted keywords, we cluster terms.
Two major approaches (Hofmann & Puzicha 1998) are:
• Similarity-based clustering
If terms w1 and w2 have a similar distribution of co-occurrence with other terms, w1 and w2 are considered to be the same cluster.
E.g.: Monday is a day in week. Tuesday is a day in week. Wednesday is a day in week.
• Pairwise clustering
If terms w1 and w2 co-occur frequently, w1 and w2 are considered to be the same cluster.
Algorithm – Studying
Similarity-based clustering centers upon Red Circles
Pairwise clustering focuses on Green Circles
Algorithm – Studying
Similarity-based clustering
Cluster a pair of terms according to their Jensen-Shannon divergence.
[Threshold and defining formulas: equation images not recovered]
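The sketch below uses the standard definition of the Jensen-Shannon divergence between two co-occurrence distributions. The slide's own formula and threshold are not available in the text, so the sketch only computes the divergence and leaves the clustering decision out; the function name `jensen_shannon` is illustrative.

```python
import math

def jensen_shannon(p, q):
    """Standard Jensen-Shannon divergence (natural log) of two distributions.

    p, q: dicts mapping a term to its co-occurrence probability.
    Returns 0 for identical distributions and log 2 for disjoint ones.
    """
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # Kullback-Leibler divergence of a from the mixture m.
        return sum(v * math.log(v / m[k]) for k, v in a.items() if v > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)

same = jensen_shannon({"day": 0.5, "week": 0.5}, {"day": 0.5, "week": 0.5})
disjoint = jensen_shannon({"day": 1.0}, {"game": 1.0})
print(same, disjoint)
```

Terms such as Monday and Tuesday, which co-occur with the same words, would give a value close to 0.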
Algorithm – Studying
Pairwise clustering
Cluster a pair of terms according to their mutual information.
[Threshold and defining formula: equation images not recovered]
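As an illustration, the sketch below computes pointwise mutual information, log(P(w1, w2) / (P(w1) P(w2))), with probabilities estimated as counts divided by a total n. Treating "mutual information" as this pointwise form is an assumption, and the threshold is not recoverable from the slide, so none is applied; the function name `mutual_information` is illustrative.

```python
import math

def mutual_information(freq_w1, freq_w2, freq_pair, n):
    """Pointwise mutual information of two terms (natural log).

    freq_w1, freq_w2: occurrence counts of each term
    freq_pair:        co-occurrence count of the pair
    n:                total count used to turn counts into probabilities
    """
    if freq_pair == 0:
        return float("-inf")  # the terms never co-occur
    p_pair = freq_pair / n
    return math.log(p_pair / ((freq_w1 / n) * (freq_w2 / n)))

often_together = mutual_information(10, 10, 5, 100)
independent = mutual_information(10, 10, 1, 100)
print(often_together, independent)
```

A pair that co-occurs more often than chance gets a positive value; a pair that co-occurs exactly as often as chance gets about 0.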
Algorithm – Evaluation
Precision: ratio of right keywords to the number of keywords in the list
Coverage: ratio of indispensable keywords in the list to all the indispensable terms
Frequency index: average frequency of the keywords in the list
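The three measures can be sketched as below. The function names and the toy keyword lists are illustrative assumptions, not data from the evaluation.

```python
def precision(extracted, right):
    """Share of the extracted keywords that are judged right."""
    return len(set(extracted) & set(right)) / len(extracted)

def coverage(extracted, indispensable):
    """Share of the indispensable terms that appear in the extracted list."""
    return len(set(extracted) & set(indispensable)) / len(indispensable)

def frequency_index(extracted, freq):
    """Average document frequency of the extracted keywords."""
    return sum(freq[t] for t in extracted) / len(extracted)

# Hypothetical toy data for illustration only.
extracted = ["machine", "computer", "game", "question"]
right = {"machine", "computer", "game"}
indispensable = {"machine", "computer", "imitation", "digital"}
freq = {"machine": 10, "computer": 6, "game": 3, "question": 1}

print(precision(extracted, right))
print(coverage(extracted, indispensable))
print(frequency_index(extracted, freq))
```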
Ranking – Why?
Ranking Result
Ranking
Use the formula below to calculate the rank of term t in a collection of documents
(Automatic Keyword Extraction for Database Search; First examiner: Prof. Dr. techn. Dipl.-Ing. Wolfgang Nejdl; Second examiner: Prof. Dr. Heribert Vollmer; Supervisor: MSc. Dipl.-Inf. Elena Demidova):
R(t) = Fd(t) * log(1 + N/N(t))    (1)
Fd(t): frequency of term t in the given document
N: total number of documents in the collection
N(t): total number of documents that contain term t
Ranking formula:
Rank = d * Rd(t) / R(t)    (2)
=> Rank = d * Rd(t) / (Fd(t) * log(1 + N/N(t)))    (3)
Rd(t): rank of term t in the document, which is extracted by the Extract Service
d: reliability coefficient
R(t): rank of term t in all the collection
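Formulas (1) to (3) can be sketched as follows. The base of the logarithm is not given, so the natural logarithm is assumed here; the function and parameter names are illustrative.

```python
import math

def collection_rank(fd_t, n_docs, n_docs_with_t):
    """Formula (1): R(t) = Fd(t) * log(1 + N / N(t)); natural log assumed."""
    return fd_t * math.log(1 + n_docs / n_docs_with_t)

def rank(d, rd_t, fd_t, n_docs, n_docs_with_t):
    """Formulas (2)/(3): Rank = d * Rd(t) / R(t)."""
    return d * rd_t / collection_rank(fd_t, n_docs, n_docs_with_t)

# Toy values: the term occurs 3 times in the document and appears
# in every one of the 100 documents of the collection.
r_t = collection_rank(fd_t=3, n_docs=100, n_docs_with_t=100)
score = rank(d=1.0, rd_t=6.0, fd_t=3, n_docs=100, n_docs_with_t=100)
print(r_t, score)
```

Keeping (1) in its own function mirrors the step from (2) to (3), where R(t) is substituted into the ranking formula.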
Searching
Testing
V - model
Test result
No | Tester | Module code | Pass | Fail | Untested | N/A | Number of test cases
1 | AnhNT | Master Page | 18 | 0 | 0 | 0 | 18
2 | AnhNT | Home Page | 12 | 0 | 0 | 0 | 12
3 | AnhNT | Search Result | 5 | 0 | 0 | 0 | 5
4 | AnhNT | User Account | 69 | 0 | 0 | 0 | 69
5 | AnhNT | Error Page | 8 | 0 | 0 | 0 | 8
6 | NamH | Category | 36 | 0 | 0 | 0 | 36
7 | NamH | Document | 47 | 0 | 0 | 0 | 47
8 | NamH | Authenticated | 81 | 0 | 0 | 0 | 81
9 | NamH | User Document Detail | 9 | 0 | 0 | 0 | 9
Sub total | | | 285 | 0 | 0 | 0 | 285
Test coverage: 100.00 %
Test successful coverage: 100.00 %
Deployment
Package Source Code
Client side
Server side
User guide
Summary
Strong points
• Enthusiasm
• Creativity
• Coping with change
Weak points
• Lack of technical skills
• Lack of management skills
Lessons learned
• Improve technical & management skills
• Release the product on time despite limited time and resources
• Improve communication & problem-solving skills
Demo & Q&A