TCM - Prince Sultan University

Prince Sultan University
College of Computer & Information Sciences
Department of Computer Science
CS371: Web Development
Term Project
Building a Text Corpus Management System
Objective:
Text corpora play an important role in computational linguistics, text mining, web mining, and
many other areas. Such corpora provide statistics and information that greatly simplify the
development of many applications, including machine translation, spelling correction, speech
recognition, text summarization, and opinion mining.
The main purpose of this project is to develop a web-based Arabic text corpus management
system that may be used to maintain and manage a large corpus of Arabic text documents. The
system should support collecting documents, categorizing documents, searching for a particular
document, listing the documents in a particular category, displaying the attributes of a
particular document, and summarizing statistics for a given document and for the whole corpus,
among other functions.
Details:
The Text Corpus Management System (TCM) will be used by the following categories of users:
1- The “Master” of the system
2- TCM Developers
3- TCM Collaborators
4- Researchers and other end users
This requires a multilevel security system in which each category of user has different
capabilities and rights.
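One possible way to model the multilevel security requirement is a role-to-permission table that is checked before every operation. The sketch below is only an illustration: the role names follow the four user categories listed above, but the action names and the rights given to each role are assumptions, since the handout does not define them.

```python
from enum import Enum, auto


class Role(Enum):
    MASTER = auto()
    DEVELOPER = auto()
    COLLABORATOR = auto()
    RESEARCHER = auto()


# Hypothetical permission sets: the handout does not say which role may do
# what, so these assignments are placeholders to be replaced by the real policy.
PERMISSIONS = {
    Role.MASTER: {"add", "delete", "update", "analyze", "read", "manage_users"},
    Role.DEVELOPER: {"add", "update", "analyze", "read"},
    Role.COLLABORATOR: {"add", "read"},
    Role.RESEARCHER: {"read"},
}


def is_allowed(role: Role, action: str) -> bool:
    """Return True if the given role may perform the given action."""
    return action in PERMISSIONS.get(role, set())
```

In a web application, a check such as `is_allowed(current_user_role, "delete")` would typically run on the server before the request is processed, not only in the user interface.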
The TCM should support the following operations in general:
• Adding a new item (document) to the system
• Deleting an item
• Updating the contents of an item
• Modifying the content of an item
• Analyzing an item to compute statistics
• Maintaining a global statistics portfolio for all required statistics, attributes and structures
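As an illustration of the analysis operations in the list above, the following sketch computes a few simple statistics for one document and aggregates them over a set of documents. The choice of statistics, the function names, and the very simple tokenizer (runs of Arabic letters, with short-vowel diacritics removed) are all assumptions; the required statistics are to be specified by the instructor.

```python
import re
from collections import Counter

# Runs of basic Arabic letters (U+0621 to U+064A); a deliberately simple tokenizer.
ARABIC_WORD = re.compile(r"[\u0621-\u064A]+")
# Common Arabic diacritics (tashkeel), removed before counting.
DIACRITICS = re.compile(r"[\u064B-\u0652]")


def document_statistics(text: str) -> dict:
    """Compute basic statistics for a single document."""
    tokens = ARABIC_WORD.findall(DIACRITICS.sub("", text))
    frequencies = Counter(tokens)
    return {
        "characters": len(text),
        "tokens": len(tokens),
        "distinct_tokens": len(frequencies),
        "frequencies": frequencies,
    }


def corpus_statistics(texts: list) -> dict:
    """Aggregate word frequencies over a list of document texts."""
    total = Counter()
    for text in texts:
        total.update(document_statistics(text)["frequencies"])
    return {
        "documents": len(texts),
        "tokens": sum(total.values()),
        "distinct_tokens": len(total),
        "frequencies": total,
    }
```

In a real system the global portfolio would probably be updated incrementally whenever an item is added, changed, or deleted, rather than recomputed from every document each time.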
The TCM should make it possible for developers to access the items programmatically for
analysis and other purposes.
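One common way to offer programmatic access is to expose each item in a machine-readable format such as JSON. The sketch below shows only the serialization step; the `Document` fields and the function names are hypothetical, and how the data is actually served (for example through a web API) is left open.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Document:
    # Hypothetical attributes; the real set depends on the final database design.
    doc_id: int
    title: str
    category: str
    text: str


def to_json(doc: Document) -> str:
    """Serialize a document; ensure_ascii=False keeps Arabic text readable."""
    return json.dumps(asdict(doc), ensure_ascii=False)


def from_json(payload: str) -> Document:
    """Rebuild a Document from the JSON produced by to_json."""
    return Document(**json.loads(payload))
```

A developer's client program could then call `from_json` on the response body to obtain the item for further analysis.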
Clearly, you need to design and maintain a suitable database to satisfy the above requirements.
Other detailed features will be provided by the instructor as appropriate.
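To make the database requirement more concrete, here is a minimal sketch of one possible schema with two tables, using Python's built-in `sqlite3` module. The table names, column names, and helper functions are assumptions chosen for illustration; the project may use any database system and a richer design.

```python
import sqlite3

# Hypothetical schema: one table for categories, one for documents.
SCHEMA = """
CREATE TABLE IF NOT EXISTS categories (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS documents (
    id          INTEGER PRIMARY KEY,
    title       TEXT NOT NULL,
    category_id INTEGER NOT NULL REFERENCES categories(id),
    content     TEXT NOT NULL,
    added_at    TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def open_corpus(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the corpus database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_document(conn, title: str, category: str, content: str) -> int:
    """Insert a document, creating its category if needed; return the new id."""
    conn.execute("INSERT OR IGNORE INTO categories (name) VALUES (?)", (category,))
    (category_id,) = conn.execute(
        "SELECT id FROM categories WHERE name = ?", (category,)
    ).fetchone()
    cursor = conn.execute(
        "INSERT INTO documents (title, category_id, content) VALUES (?, ?, ?)",
        (title, category_id, content),
    )
    conn.commit()
    return cursor.lastrowid


def titles_in_category(conn, category: str) -> list:
    """List document titles in a category, in insertion order."""
    rows = conn.execute(
        "SELECT d.title FROM documents d "
        "JOIN categories c ON c.id = d.category_id "
        "WHERE c.name = ? ORDER BY d.id",
        (category,),
    )
    return [row[0] for row in rows]


def delete_document(conn, doc_id: int) -> None:
    """Remove a document by id."""
    conn.execute("DELETE FROM documents WHERE id = ?", (doc_id,))
    conn.commit()
```

The queries use `?` placeholders rather than string formatting so that user-supplied titles and categories cannot alter the SQL.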
Bonus:
Another desired feature is a crawler component that collects Arabic documents from the web,
extracts their content, and stores the resulting documents in the TCM. This is a bonus feature
that earns additional marks.
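The content-extraction step of such a crawler might look like the sketch below, which uses only the Python standard library. The class and function names are hypothetical, and the approach (keep all visible text, skip `script` and `style` elements) is a simplifying assumption; a complete crawler would also need link following, duplicate detection, Arabic language filtering, and respect for each site's robots.txt rules.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of script and style tags."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def extract_text(html: str) -> str:
    """Return the visible text of an HTML page as one space-separated string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def fetch_text(url: str) -> str:
    """Download a page and extract its text (performs a network request)."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return extract_text(response.read().decode(charset, errors="replace"))
```

The text returned by `fetch_text` could then be passed to whatever routine adds a new item to the TCM.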