Prince Sultan University
College of Computer & Information Sciences
Department of Computer Science

CS371: Web Development
Term Project
Building a Text Corpus Management System

Objective:
Text corpora play a very important role in computational linguistics, text mining, web mining, and many other areas. Such corpora provide statistics and information that greatly simplify the development of many applications, including machine translation, spelling correction, speech recognition, text summarization, opinion mining, etc. The main purpose of this project is to develop a web-based Arabic text corpus management system that can be used to maintain and manage a large corpus of Arabic text documents. The system should support document collection, document categorization, searching for a particular document, listing the documents in a particular category, displaying the attributes of a particular document, summarizing statistics for a given document, summarizing statistics for the whole corpus, etc.

Details:
The Text Corpus Management System (TCM) will be used by the following categories of users:
1- The “Master” of the system
2- TCM Developers
3- TCM Collaborators
4- Researchers and other end users

This requires a multilevel security system with different capabilities and rights for each category of user.

The TCM should support the following operations in general:
- Adding a new item (document) to the system
- Deleting an item
- Updating the contents of an item
- Modifying the content of an item
- Analyzing an item to compute statistics
- Maintaining a global statistics portfolio for all required statistics, attributes and structures

The TCM should make it possible for developers to access the items programmatically for analysis and other purposes. Clearly, you need to design and maintain a suitable database to satisfy the above requirements.

Other detailed features will be provided by the instructor as appropriate.
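
As a starting point for the database and security design, the operations listed above can be sketched in Python with the standard sqlite3 module. This is only an illustration under stated assumptions, not a required design: the table layout, the function names, the PERMISSIONS mapping, and the rights given to each user category are all hypothetical, since the handout names the user categories but does not define their rights. Statistics are computed on whitespace-separated tokens, which is a simplification for Arabic text.

```python
import sqlite3
from collections import Counter

# Hypothetical role-to-permission mapping (an assumption for illustration;
# the handout does not specify what each user category may do).
PERMISSIONS = {
    "master": {"add", "update", "delete", "read"},
    "developer": {"add", "update", "read"},
    "collaborator": {"add", "read"},
    "researcher": {"read"},
}

# Hypothetical minimal schema: one row per corpus item (document).
SCHEMA = """
CREATE TABLE IF NOT EXISTS documents (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    category TEXT NOT NULL,
    content TEXT NOT NULL
);
"""


def open_corpus(path=":memory:"):
    """Open (or create) the corpus database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def require(role, action):
    """Raise PermissionError unless the role is allowed to perform the action."""
    if action not in PERMISSIONS.get(role, set()):
        raise PermissionError(f"{role!r} may not {action}")


def add_document(conn, role, title, category, content):
    require(role, "add")
    cur = conn.execute(
        "INSERT INTO documents (title, category, content) VALUES (?, ?, ?)",
        (title, category, content),
    )
    conn.commit()
    return cur.lastrowid


def update_document(conn, role, doc_id, content):
    require(role, "update")
    conn.execute("UPDATE documents SET content = ? WHERE id = ?", (content, doc_id))
    conn.commit()


def delete_document(conn, role, doc_id):
    require(role, "delete")
    conn.execute("DELETE FROM documents WHERE id = ?", (doc_id,))
    conn.commit()


def list_category(conn, role, category):
    """Return (id, title) pairs for every document in one category."""
    require(role, "read")
    rows = conn.execute(
        "SELECT id, title FROM documents WHERE category = ? ORDER BY id",
        (category,),
    )
    return rows.fetchall()


def document_stats(text):
    """Per-document statistics based on whitespace tokenization."""
    tokens = text.split()
    return {
        "characters": len(text),
        "tokens": len(tokens),
        "distinct_tokens": len(set(tokens)),
    }


def corpus_stats(conn, role):
    """Global statistics over every document currently stored."""
    require(role, "read")
    freq = Counter()
    n_docs = 0
    for (content,) in conn.execute("SELECT content FROM documents"):
        freq.update(content.split())
        n_docs += 1
    return {
        "documents": n_docs,
        "tokens": sum(freq.values()),
        "distinct_tokens": len(freq),
    }
```

For example, add_document(conn, "collaborator", ...) succeeds under this mapping, while delete_document(conn, "researcher", ...) raises PermissionError. In a real web application the role would come from the authenticated session rather than from a function argument, and the global statistics would more likely be stored and updated incrementally than recomputed on every request.
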
Bonus: Another desired feature is a crawler component that collects Arabic documents from the web, extracts their content, and stores the resulting documents in the TCM. This is a bonus feature that earns additional marks.
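
The content-extraction step of such a crawler might look roughly like the following Python sketch, which uses only the standard library. Everything here is an assumption for illustration: the names TextExtractor, extract_text, arabic_ratio, and fetch_arabic_document are hypothetical, and the 0.5 threshold for deciding that a page is "Arabic" is an arbitrary choice. Only the basic Arabic Unicode block (U+0600 to U+06FF) is checked.

```python
import re
from html.parser import HTMLParser
from urllib.request import urlopen

# Matches one character from the basic Arabic Unicode block.
ARABIC = re.compile(r"[\u0600-\u06FF]")


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    """Return the visible text of an HTML page as one space-joined string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def arabic_ratio(text):
    """Fraction of alphabetic characters that fall in the Arabic block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(1 for ch in letters if ARABIC.match(ch)) / len(letters)


def fetch_arabic_document(url, min_ratio=0.5):
    """Download a page and return its text, or None if it is not mostly Arabic."""
    with urlopen(url, timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        html = resp.read().decode(charset, errors="replace")
    text = extract_text(html)
    return text if arabic_ratio(text) >= min_ratio else None
```

The text returned by fetch_arabic_document could then be stored as a new item in the TCM. A complete crawler would also need link discovery, duplicate detection, and polite behavior toward the sites it visits, none of which is shown here.
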