Side Information for Mining Text Data

ABSTRACT

In many text mining applications, side-information is available along with the text documents. Such side-information may be of different kinds, such as document provenance information, the links in the document, user-access behavior from web logs, or other non-textual attributes which are embedded into the text document. Such attributes may contain a tremendous amount of information for clustering purposes. However, the relative importance of this side-information may be difficult to estimate, especially when some of the information is noisy. In such cases, it can be risky to incorporate side-information into the mining process, because it can either improve the quality of the representation for the mining process or add noise to the process. Therefore, we need a principled way to perform the mining process, so as to maximize the advantages of using this side-information. In this paper, we design an algorithm which combines classical partitioning algorithms with probabilistic models in order to create an effective clustering approach. We then show how to extend the approach to the classification problem. We present experimental results on a number of real data sets in order to illustrate the advantages of using such an approach.

Existing System:

In many text mining applications, side-information is available along with the text documents. Such side-information may be of different kinds, such as document provenance information, the links in the document, user-access behavior from web logs, or other non-textual attributes which are embedded into the text document. Such attributes may contain a tremendous amount of information for clustering purposes. However, the relative importance of this side-information may be difficult to estimate, especially when some of the information is noisy.

Disadvantage:

1. In such cases, it can be risky to incorporate side-information into the mining process, because it can either improve the quality of the representation for the mining process or add noise to the process.

Proposed System:

In this project, we design an algorithm which combines classical partitioning algorithms with probabilistic models in order to create an effective clustering approach. We then show how to extend the approach to the classification problem. We present experimental results on a number of real data sets in order to illustrate the advantages of using such an approach.

Advantage:

A probabilistic model on the side-information uses the partitioning information to estimate the coherence of the different clusters with the side attributes. This helps in abstracting out the noise in the membership behavior of the different attributes.

IMPLEMENTATION

Implementation is the stage of the project when the theoretical design is turned into a working system. It can therefore be considered the most critical stage in achieving a successful new system and in giving the user confidence that the new system will work and be effective. The implementation stage involves careful planning, investigation of the existing system and its constraints on implementation, design of methods to achieve the changeover, and evaluation of the changeover methods.

Main Modules:

1. User Module: In this module, users are provided with authentication and security to access the details presented in the system. Before accessing or searching the details, a user should already have an account; otherwise, the user should register first.

2. The COATES Algorithm: We will describe our algorithm for text clustering with side-information. We refer to this algorithm as COATES throughout the project, which reflects the fact that it is a COntent and Auxiliary attribute based TExt cluStering algorithm. The number of clusters k is assumed to be an input to the algorithm. As in the case of all text-clustering algorithms, it is assumed that stop-words have been removed and stemming has been performed in order to improve the discriminatory power of the attributes. (An illustrative sketch of this approach is given after this module list.)

3. Extension to Classification: We will discuss how to extend the approach to classification. We will extend our earlier clustering approach in order to incorporate supervision, and create a model which summarizes the class distribution in the data in terms of the clusters. Then, we will show how to use the summarized model for effective classification. First, we will introduce some notations and definitions which are specific to the classification problem. As before, we assume that we have a text corpus S of documents. (A rough classification sketch follows the clustering sketch after this module list.)

4. Experimental Results: We compare our clustering and classification methods against a number of baseline techniques on real and synthetic data sets. We refer to our clustering approach as the COntent and Auxiliary attribute based TExt cluStering (COATES) algorithm. As the baseline, we used two different methods, the first of which is an efficient projection-based clustering approach which adapts the k-means approach to text. This approach is widely known to provide excellent clustering results in a very efficient way. We refer to this algorithm as Schutze Silverstein [text only] in all figure legends in the experimental section.
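The exact COATES procedure is given in the base paper; the following is only a minimal illustrative sketch, in Java, of the general idea behind Module 2 and the Advantage noted above: a k-means-style assignment on the text content, an estimate of how coherent each auxiliary attribute is with each cluster (a smoothed estimate of P(attribute | cluster) computed from the current partition), and a combined reassignment that weights content similarity against that coherence. The Document class, the cosine measure, the Laplace smoothing, and the weight lambda are assumptions of this sketch, not details taken from the paper; centroid recomputation and iteration until convergence are omitted.

import java.util.*;

// Minimal illustrative sketch of a COATES-style clustering pass (not the exact algorithm
// of the base paper): content-based assignment, estimation of attribute/cluster coherence,
// and a combined reassignment.
public class CoatesSketch {

    static class Document {
        Map<String, Double> terms;      // stemmed, stop-word-free term frequencies
        Set<String> sideAttributes;     // e.g. provenance tags, link targets, access tokens
        Document(Map<String, Double> t, Set<String> a) { terms = t; sideAttributes = a; }
    }

    // Cosine similarity between two sparse term vectors.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            na += e.getValue() * e.getValue();
            Double v = b.get(e.getKey());
            if (v != null) dot += e.getValue() * v;
        }
        for (double v : b.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // One combined iteration; lambda in [0,1] trades off content against side-information.
    static int[] iterate(List<Document> docs, List<Map<String, Double>> centroids, double lambda) {
        int k = centroids.size();
        int[] assign = new int[docs.size()];

        // Phase 1: assign every document to its closest content centroid (k-means style).
        for (int i = 0; i < docs.size(); i++) {
            int best = 0; double bestSim = -1;
            for (int c = 0; c < k; c++) {
                double s = cosine(docs.get(i).terms, centroids.get(c));
                if (s > bestSim) { bestSim = s; best = c; }
            }
            assign[i] = best;
        }

        // Phase 2: estimate a smoothed P(attribute | cluster) from the current partition;
        // this is the "coherence" of each side attribute with each cluster.
        List<Map<String, Double>> attrProb = new ArrayList<>();
        int[] clusterSize = new int[k];
        for (int c = 0; c < k; c++) attrProb.add(new HashMap<>());
        for (int i = 0; i < docs.size(); i++) {
            clusterSize[assign[i]]++;
            for (String attr : docs.get(i).sideAttributes)
                attrProb.get(assign[i]).merge(attr, 1.0, Double::sum);
        }
        for (int c = 0; c < k; c++)
            for (Map.Entry<String, Double> e : attrProb.get(c).entrySet())
                e.setValue((e.getValue() + 1.0) / (clusterSize[c] + 2.0));   // Laplace smoothing

        // Phase 3: reassign using a weighted mix of content similarity and the average
        // coherence of the document's side attributes with each candidate cluster.
        for (int i = 0; i < docs.size(); i++) {
            Document d = docs.get(i);
            int best = 0; double bestScore = -Double.MAX_VALUE;
            for (int c = 0; c < k; c++) {
                double content = cosine(d.terms, centroids.get(c));
                double side = 0;
                for (String attr : d.sideAttributes)
                    side += attrProb.get(c).getOrDefault(attr, 1.0 / (clusterSize[c] + 2.0));
                if (!d.sideAttributes.isEmpty()) side /= d.sideAttributes.size();
                double score = lambda * content + (1 - lambda) * side;
                if (score > bestScore) { bestScore = score; best = c; }
            }
            assign[i] = best;
        }
        return assign;   // centroid recomputation and further iterations are omitted here
    }
}

The smoothing in Phase 2 is what dampens noisy attributes: an attribute that is spread evenly across clusters receives nearly uniform coherence values and therefore has little influence on the combined score.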
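Module 3 summarizes the class distribution of the training data in terms of clusters. As a rough illustration only, and not the paper's exact procedure, the sketch below assumes that such per-cluster summaries (a content centroid plus class counts) have already been built by a supervised variant of the clustering step, and classifies a new document by the dominant class of its closest cluster. All class and field names here are hypothetical.

import java.util.*;

// Rough illustration of classifying with a cluster-based summary model
// (hypothetical names; not the exact procedure of the base paper).
public class ClusterSummaryClassifier {

    // Per-cluster summary: a content centroid plus the class distribution of the
    // training documents that were assigned to the cluster.
    static class ClusterSummary {
        Map<String, Double> centroid = new HashMap<>();
        Map<String, Integer> classCounts = new HashMap<>();
    }

    private final List<ClusterSummary> summaries;

    ClusterSummaryClassifier(List<ClusterSummary> summaries) { this.summaries = summaries; }

    // Assign the test document to its closest centroid and return the dominant
    // class label recorded for that cluster.
    String classify(Map<String, Double> termVector) {
        ClusterSummary best = null;
        double bestSim = -1;
        for (ClusterSummary s : summaries) {
            double sim = cosine(termVector, s.centroid);
            if (sim > bestSim) { bestSim = sim; best = s; }
        }
        if (best == null) return "unknown";
        return best.classCounts.entrySet().stream()
                   .max(Map.Entry.comparingByValue())
                   .map(Map.Entry::getKey)
                   .orElse("unknown");
    }

    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            na += e.getValue() * e.getValue();
            Double v = b.get(e.getKey());
            if (v != null) dot += e.getValue() * v;
        }
        for (double v : b.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}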
System Configuration:

H/W System Configuration:

Processor             : Pentium III
Speed                 : 1.1 GHz
RAM                   : 256 MB (min)
Hard Disk             : 20 GB
Floppy Drive          : 1.44 MB
Key Board             : Standard Windows Keyboard
Mouse                 : Two or Three Button Mouse
Monitor               : SVGA

S/W System Configuration:

Operating System      : Windows 95/98/2000/XP
Application Server    : Tomcat 5.0/6.x
Front End             : HTML, Java, JSP
Scripts               : JavaScript
Server-side Script    : Java Server Pages
Database              : MySQL 5.0
Database Connectivity : JDBC
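For the software configuration listed above (MySQL accessed over JDBC from Java/JSP on Tomcat), a minimal connectivity sketch is shown below, using the kind of user-lookup query the User Module's registration check might issue. The schema name, table, column, and credentials are placeholders and would need to match the actual database.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Minimal JDBC connectivity sketch for the S/W configuration above (MySQL over JDBC).
// Schema, table, column, and credentials are placeholders.
public class DbConnectionSketch {
    public static void main(String[] args) throws SQLException {
        // With a pre-JDBC-4 MySQL driver, Class.forName("com.mysql.jdbc.Driver")
        // may be required before requesting the connection; newer drivers self-register.
        String url = "jdbc:mysql://localhost:3306/textmining";   // placeholder schema name
        try (Connection con = DriverManager.getConnection(url, "dbuser", "dbpassword");
             PreparedStatement ps = con.prepareStatement(
                     "SELECT username FROM users WHERE username = ?")) {   // placeholder table
            ps.setString(1, "demo");
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println("Registered user: " + rs.getString("username"));
                }
            }
        }
    }
}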