Supervisor: Mr. Phan Trường Lâm

Team information

Agenda
• Introduction
• Project plan
• System Requirement Specifications
• System Analysis and Design
• Testing
• Deployment and User Guide
• Summary
• Demo and Q&A

Introduction
• Initial Idea
• Literature Review of Existing System
• Proposal & Product

Initial Idea
We decided to develop a new system that integrates:
• Collecting documents
• Organizing these documents
• Extracting keywords
• Ranking
• Searching

Literature Review of Existing System
Methods that these websites use to build their systems:
• Big database
• Search
• Ranking and highlighting returned results
• Comparing documents to detect plagiarism

Literature Review
Achievements of the existing systems:
• Attractive
• Easy to use
• Speed & reliability
• Quality results
• Ensuring security awareness
Limitations of the existing systems:
• Costs
• Privacy

Proposal
• Public for everyone
• Inside and outside the University
• Collect and manage Capstone projects
• Support looking up Capstone projects
• Avoid repeating and copying ideas
• Ranking results
• Cheaper to build
• Refer to other materials
• Free to use
• Friendly interface like Google

Product
• Web application
• Mobile application (in future)

Project Plan
• Development environment
• Process
• Project organization
• Project schedule
• Risk management

Development Environment
Hardware:
• 2 GB of RAM, 100 GB of hard disk, Core 2 Duo 2.0 GHz
• 1 GB of RAM, 100 GB of hard disk, Core 2 Duo 2.0 GHz
Software

Process
• Follow the Waterfall model

Project organization
Controlling and monitoring:
• Meeting
• Assign task
• Tracking task
• Issue resolve
• Review task
• Report
Communication control:
• Online activity: Email, Chat, Phone
• Offline activity: Kick-off project, Team building

Project Schedule
• Overall plan

Risk Management
• People risk
• Estimation risk
• Technology risk
• Requirement risk
• Schedule risk

System Requirement Specifications
• User Requirements
• System Requirements
• Non-functional requirements

User Requirements
Lecturers and Students:
• Search project documents.
• Download documents.
Librarians:
• Edit profile.
• Search documents.
• Add/Edit/Delete document.
• Add/Edit/Delete category.
Administrator:
• Edit profile.
• Add/Edit/Delete account.
Other requirements:
• Search results will be ranked.
• Each document has the following information: Name, Author, Supervisor, Category, Description.
• Input files: Keyword file, Abstract file, Full document file, Other materials.

System Requirements
The system communicates with client computers over HTTP, using standard protocols for service-based interactions.
Configuration
Server:
• Windows Server 2008 operating system
• .NET Framework 3.5
• SQL Server 2008
• IIS 7
Client:
• Web browser

Non-functional Requirements
• Usability
• Availability
• Reliability
• Security
• Performance
• Maintainability

System Analysis and Design
• Architectural design
• Detail design
• Database design
• Coding convention
• Extract Keyword algorithm
• Ranking

Architectural design
• Overall architecture
• MVC architecture design pattern

Detail design
• CProDMS Component Diagram

Database design
• Entity diagram

Coding convention
Follow:
• Microsoft .NET Library Standards
• FxCop rules and Code Analysis for Managed Code Warnings

Extract Keyword Algorithm
"Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information" (Yutaka Matsuo and Mitsuru Ishizuka, Dec. 10, 2003)
• Introduction
• Study Algorithm
• Evaluation

Algorithm – What is the keyword?
• Meaning
• Frequency
• Position

Algorithm – Step by step
Preprocessing:
• Discard stop words
• Stem
• Extract frequency
Processing:
• Select frequent terms
• Expected probability
• Calculate χ'² value
Output

Algorithm – Studying
Example:
Original text: Information is the most powerful weapon in the modern society. Every day we are overflowed with a huge amount of data in form of electronic newspaper articles, emails, web pages and search results. Often, information we receive is incomplete, such that further search activities are required to enable correct interpretation and usage of this information.
Step 1, discarded stop words: Information powerful weapon modern society day overflowed huge amount data electronic newspaper articles emails web pages search results Often information receive incomplete such further search activities required enable correct interpretation usage information
Step 2, stemmed words using the Porter Stemming Algorithm (word → stem pairs shown on the slide): Information → Informat, powerful → power, society → societi, overflowed → overflow, amount → amoun, articles → articl, emails → email, pages → page, results → result, incomplete → incomplet, activities → activ, required → requir, interpretation → interpret, usage → usag

Algorithm – Studying
Select frequent terms
According to the study, the number of keywords is about 10% of the number of terms in the document, and no more than 30 terms. The top ten frequent terms (denoted as G) are taken together with their probability of occurrence, normalized so that the sum is 1.

Algorithm – Studying
Co-occurrence and Importance
Two terms in a sentence are considered to co-occur once.
Example: The imitation game could then be played with the machine in question and the mimicking digital computer and the interrogator would be unable to distinguish them.
"imitation" and "digital computer" have one co-occurrence.

Algorithm – Studying
Co-occurrence and Importance
The degree of bias of co-occurrence can be used as an indicator of term importance.

Algorithm – Studying
The statistical value of χ² is defined as

$$\chi^2(w) = \sum_{g \in G} \frac{(\mathit{freq}(w, g) - n_w p_g)^2}{n_w p_g}$$

where:
• p_g: unconditional probability of a frequent term g ∈ G (the expected probability)
• n_w: the total number of co-occurrences of term w and frequent terms G
• freq(w, g): frequency of co-occurrence of term w and term g

Algorithm – Studying
We consider the length of each sentence and revise our definitions:
• p_g: (the sum of the total number of terms in sentences where g appears) divided by (the total number of terms in the document)
• n_w: the total number of terms in the sentences where w appears, including w

Algorithm – Studying
The following function measures the robustness of bias values. It subtracts the maximal term from the χ² value:

$$\chi'^2(w) = \chi^2(w) - \max_{g \in G} \frac{(\mathit{freq}(w, g) - n_w p_g)^2}{n_w p_g}$$

Algorithm – Studying
To improve the extracted keywords, we cluster terms. Two major approaches (Hofmann & Puzicha 1998) are:
• Similarity-based clustering: if terms w1 and w2 have a similar distribution of co-occurrence with other terms, w1 and w2 are considered to be in the same cluster. E.g.: Monday is a day in the week. Tuesday is a day in the week. Wednesday is a day in the week.
• Pairwise clustering: if terms w1 and w2 co-occur frequently, w1 and w2 are considered to be in the same cluster.
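The co-occurrence counting and χ'² scoring described in the slides above can be sketched in Python. This is a minimal illustration under stated assumptions, not the CProDMS implementation: the function name `chi_square_scores`, its `num_frequent` parameter, and the input shape (sentences already split into preprocessed terms) are hypothetical, and clustering of frequent terms is left out.

```python
from collections import Counter
from itertools import combinations


def chi_square_scores(sentences, num_frequent=10):
    """Score each term by its chi'^2 co-occurrence bias against the frequent terms G.

    `sentences` is a list of sentences, each a list of terms that are
    assumed to be preprocessed already (stop words discarded, stemmed).
    """
    term_freq = Counter(t for s in sentences for t in s)
    total_terms = sum(term_freq.values())
    frequent = [t for t, _ in term_freq.most_common(num_frequent)]  # the set G

    cooc = Counter()        # freq(w, g): sentences containing both w and g
    length_sum = Counter()  # total length of the sentences where a term appears
    for s in sentences:
        unique = set(s)     # two terms in one sentence co-occur once
        for t in unique:
            length_sum[t] += len(s)
        for a, b in combinations(sorted(unique), 2):
            cooc[a, b] += 1
            cooc[b, a] += 1

    scores = {}
    for w in term_freq:
        n_w = length_sum[w]  # revised n_w: terms in sentences where w appears
        terms = []
        for g in frequent:
            if g == w:
                continue
            p_g = length_sum[g] / total_terms  # revised p_g
            expected = n_w * p_g
            terms.append((cooc[w, g] - expected) ** 2 / expected)
        # chi'^2: subtract the maximal term from the chi^2 sum
        scores[w] = sum(terms) - max(terms) if terms else 0.0
    return scores
```

For instance, with the toy input `[["x", "y"], ["x", "y"], ["x", "z"]]` and `num_frequent=2`, the frequent terms are `x` and `y`, and the term `z` receives a score of about 0.5. Terms would then be sorted by score to pick the keywords.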
Algorithm – Studying
[Figure: similarity-based clustering centers on the red circles; pairwise clustering focuses on the green circles.]

Algorithm – Studying
Similarity-based clustering
Cluster a pair of terms whose Jensen-Shannon divergence is [threshold and formula not recoverable from the slide].

Algorithm – Studying
Pairwise clustering
Cluster a pair of terms whose mutual information is [threshold and formula not recoverable from the slide].

Algorithm – Evaluation
• Precision: ratio of right keywords to the number of keywords in the list
• Coverage: ratio of indispensable keywords in the list to all the indispensable terms
• Frequency index: average frequency of the keywords

Ranking – Why?
• Ranking result

Ranking
Use the formula of Term Frequency in a collection of documents to calculate rank ("Automatic Keyword Extraction for Database Search"; First examiner: Prof. Dr. techn. Dipl.-Ing. Wolfgang Nejdl; Second examiner: Prof. Dr. Heribert Vollmer; Supervisor: MSc. Dipl.-Inf. Elena Demidova):

R(t) = Fd(t) * log(1 + N/N(t))    (1)

where:
• R(t): rank of term t in the whole collection
• Fd(t): frequency of term t in the given document
• N: total number of documents in the collection
• N(t): total number of documents that contain term t

Ranking formula:

Rank = d * Rd(t) / R(t)    (2)
=> Rank = d * Rd(t) / (Fd(t) * log(1 + N/N(t)))    (3)

where:
• Rd(t): rank of term t in the document, as extracted by the Extract Service
• d: reliability coefficient

Searching

Testing
• V-model

Testing
Test result

No  Tester  Module code         Pass  Fail  Untested  N/A  Number of test cases
1   AnhNT   Master Page           18     0         0    0                    18
2   AnhNT   Home Page             12     0         0    0                    12
3   AnhNT   Search Result          5     0         0    0                     5
4   AnhNT   User Account          69     0         0    0                    69
5   AnhNT   Error Page             8     0         0    0                     8
6   NamH    Category              36     0         0    0                    36
7   NamH    Document              47     0         0    0                    47
8   NamH    Authenticated User    81     0         0    0                    81
9   NamH    Document Detail        9     0         0    0                     9
    Sub total                    285     0         0    0                   285

Test coverage: 100.00 %
Test successful coverage: 100.00 %

Deployment
• Package
• Source code
• Client side
• Server side
• User guide

Summary
Strong points:
• Enthusiasm
• Creativity
• Coping with change
Weak points:
• Lack of technical skills
• Lack of management skills
Lessons learned:
• Improve technical & management skills
• Release the product on time under time and resource constraints
• Improve communication & problem-solving skills

Demo & Q&A
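Supplement to the Ranking slides: formulas (1) to (3) can be sketched in Python as follows. This is an illustrative sketch only. The function and parameter names are hypothetical, and the natural logarithm is an assumption, since the slides do not state the base of `log`.

```python
import math


def collection_rank(fd_t, n_docs, n_docs_with_t):
    """Formula (1): R(t) = Fd(t) * log(1 + N / N(t)).

    fd_t          -- Fd(t), frequency of term t in the given document
    n_docs        -- N, total number of documents in the collection
    n_docs_with_t -- N(t), number of documents that contain term t
    The natural logarithm is assumed here.
    """
    return fd_t * math.log(1 + n_docs / n_docs_with_t)


def document_rank(d, rd_t, fd_t, n_docs, n_docs_with_t):
    """Formulas (2) and (3): Rank = d * Rd(t) / R(t).

    d    -- reliability coefficient
    rd_t -- Rd(t), rank of term t in the document from the extraction step
    """
    return d * rd_t / collection_rank(fd_t, n_docs, n_docs_with_t)
```

A search would compute `document_rank` for each document that matches the query term and sort the results by that value.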