International Journal of Engineering Trends and Technology (IJETT) – Volume 4 Issue 5 – May 2013

Robust Alias Detection System for Accurate Identification of Aliases of a Given Person Name

K. Sarada#1, K.P.N.V. Satya Sree#2 and K.V. Narasimha Reddy#3

1 K. Sarada, Vignan’s Nirula Institute of Technology & Science for Women, Guntur, AP, India.
2 K.P.N.V. Satya Sree, Asst. Professor, Department of Computer Science & Engineering, Vignan’s Nirula Institute of Technology & Science for Women, Guntur, AP, India.
3 K.V. Narasimha Reddy, Asst. Professor, Department of Computer Science & Engineering, Vignan’s Nirula Institute of Technology & Science for Women, Guntur, AP, India.

Abstract— As the World Wide Web grows, content-based retrieval becomes more challenging and returns many irrelevant results. We propose a new Concept and Content Based Text Retrieval technique to provide more exact results. We use an associative mining technique together with semantic concepts. Our methodology indexes texts according to semantic concepts and generates association rules that are used for text retrieval. We extract high-level concepts and low-level features. The extracted feature vectors are indexed to a semantic concept, and we generate association rules with which the retrieval is done. We use both visual and texture features to represent the semantic concept. This CCBR approach is helpful in both data collection and the modeling of large-scale text databases.

Keywords— Privacy-preserving, data publishing, functional dependency, utility, data reconstruction.

I. INTRODUCTION

World Wide Web resources increase every day, and this growth makes information retrieval a challenging task. Text retrieval on the web in particular becomes more complicated as web resources grow. A few techniques exist for content-based text retrieval, but they fall short in returning relevant and appropriate results.
For example, Google provides concept-based text search, and it also returns many irrelevant results. For a content-based system to be successful, it needs to minimize the gap between the analyst's model of visual patterns and the computer's representation of information. A content-based system enables the user to access text databases easily, using query methods similar to reasoning. Research that uses semantic methods has been shown to better mimic the knowledge that represents visual patterns. Fonseca et al. [1] propose an ontology-driven aerial-information system for classifying content-based systems that uses complex query methods such as shape, multi-object relationships and semantics. In this paper, we propose a text retrieval technique that uses content-based and concept-based methods and association rules to link visual semantics to the concepts. We provide query by example and query by concept for efficient retrieval of texts.

When we deal with shapes, the only information usually available is the underlying geometry. Appropriate features are chosen to encode this geometry as richly as possible, without compromising on robustness. Quite clearly, the set of useful features varies depending on the particular application at hand. For example, invariance to articulations of part structures is very important in applications like gait-based human identification, whereas the same feature is not desired for applications like retrieval based on human pose. Our goal here is to develop a system that supports fast retrieval of shapes without needing any costly correspondence step during matching. To this end, we use (or propose) features that address most challenges faced by shape matching tasks, including invariance to object translation, rotation, scale, articulations, etc. In the proposed indexing framework, a given shape is represented using a collection of feature vectors, each characterizing a geometrical relationship between a pair of landmark points.
The features should be easily computable so that the matching algorithm is efficient and can scale up to large database sizes. For each landmark pair, depending on the application, all or a subset of the following geometrical characteristics are encoded in the corresponding feature vector.

II. BACKGROUND WORK

Quang Minh Vu et al. [13] proposed a technique for disambiguating people in Web search using a knowledge base. The work aimed to differentiate documents related to different people and to find the documents that refer to the same person. The authors [13] proposed a method that used Web directories as a knowledge base to match documents and compute their similarity index. Because web documents often contain noisy data, finding the topic of a web page was difficult. The authors [13] used several sets of documents on several topics to help find web pages’ topics and to extract important terms related to those topics. They [13] then used the important terms to calculate document-pair similarities.

Y. Matsuo et al. [14] introduced keyword extraction from a single document using word co-occurrence statistical information. In this paper [14], the authors presented a new keyword extraction algorithm that applies to a single document without using a corpus. Frequent terms are extracted first; then a set of co-occurrences between each term and the frequent terms, i.e., occurrences in the same sentences, is generated. The co-occurrence distribution shows the importance of a term in the document as follows. If the probability distribution of co-occurrence between term a and the frequent terms is biased to a particular subset of frequent terms, then term a is likely to be a keyword. The degree of bias of a distribution is measured by the χ² measure.
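As a rough illustration of the co-occurrence idea described for [14], the following Python sketch scores each term by a χ² statistic over its sentence-level co-occurrences with the most frequent terms. It is a simplified variant written for illustration, not the exact formulation of [14]: the function name `chi2_keyword_scores`, the `num_frequent` parameter, and the way expected probabilities are estimated (each frequent term's share of all co-occurrence counts) are assumptions.

```python
from collections import Counter


def chi2_keyword_scores(sentences, num_frequent=3):
    """Score terms by how biased their co-occurrence with frequent terms is.

    `sentences` is a list of token lists. This is a simplified sketch and
    an assumption about the details, not the exact method of [14].
    """
    # Step 1: pick the most frequent terms in the document.
    term_freq = Counter(tok for sent in sentences for tok in sent)
    frequent = [term for term, _ in term_freq.most_common(num_frequent)]

    # Step 2: cooc[w][g] = number of sentences containing both w and
    # frequent term g (a term is not paired with itself).
    cooc = {}
    for sent in sentences:
        present = set(sent)
        for w in present:
            for g in frequent:
                if g in present and g != w:
                    cooc.setdefault(w, Counter())[g] += 1

    # Step 3: expected probability of g = its share of all co-occurrences
    # (hypothetical estimator chosen for this sketch).
    column_totals = Counter()
    for counts in cooc.values():
        column_totals.update(counts)
    total = sum(column_totals.values())
    expected = {g: column_totals[g] / total
                for g in frequent if column_totals[g] > 0}

    # Step 4: chi-square of observed vs. expected co-occurrence per term.
    scores = {}
    for w, counts in cooc.items():
        n_w = sum(counts.values())
        scores[w] = sum(
            (counts[g] - n_w * p_g) ** 2 / (n_w * p_g)
            for g, p_g in expected.items()
        )
    return scores
```

A higher score means the term co-occurs unevenly with the frequent terms. In this simplified form the frequent terms themselves tend to score high because they are never paired with themselves, so a practical version would likely treat them separately.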
Their [14] algorithm showed performance comparable to tf-idf without using a corpus.

Wen-Hsiang Lu et al. [15] proposed an anchor-text mining approach for the translation of Web queries. To discover translation knowledge in diverse data resources on the Web, this article [15] proposed an effective approach to finding translation equivalents of query terms and constructing multilingual lexicons through the mining of Web anchor texts and link structures. Although Web anchor texts are wide-scoped hypertext resources, not every pair of languages contains sufficient anchor texts for effective extraction of translations for Web queries. For more generalized applications, the approach [15] was designed around a transitive translation model: the translation equivalents of a query term can be extracted via its translation in an intermediate language. To reduce interference from translation errors, the approach further integrates a competitive linking algorithm into the process of determining the most probable translation. A series of experiments was conducted [15], including performance tests on term translation extraction, cross-language information retrieval, and translation suggestions for practical Web search services. The experimental results showed that the proposed approach is effective in extracting translations of unknown queries, is easy to combine with a probabilistic retrieval model to improve cross-language retrieval performance, and is very useful when the considered language pairs lack a sufficient number of anchor texts [15].

III. METHODS

[Figure: block diagram of the system. Blocks: Crawled text; Preprocessing; Feature extraction; Visual and texture feature indexing; Association rule generation; Input text or concept; Identify concept; Compute relevance score; Perform CBIR and return results.]

Preprocessing: We preprocess the web-crawled text. First, the crawled text is converted to a fixed shape so that features map to a uniform size. The scaled text is converted to gray scale and edge detection is performed. We extract the shape feature from the edge-detected text. The extracted raw feature is normalized to a fixed size. We use general algorithms for edge detection on the input texts. The extracted texture feature is mapped to a uniform size for indexing.

Visual and Texture Feature Indexing: The extracted feature vectors are indexed to a semantic concept based on the relevance score. We compute the relevancy score against all the texts in a semantic concept. The feature vector is assigned the label of a semantic concept only if the texts under that concept are the most similar to the input text. We use cosine similarity to compute the similarity between two feature vectors. The identified feature vector is indexed into the semantic concept with the concept label. We compute similarity values with both visual and texture features.

Algorithm 1:
Step 1: Crawl texts from the internet.
Step 2: Apply the Sobel edge detector.
Step 3: Extract raw features.
Step 4: Normalize to the same size.
Step 5: Compute the relevancy score with the semantic concepts. Compute the cosine similarity (Euclidean distance) between the selected feature vector and a single vector under a semantic concept:
$V_{dis} = V_i - V_j$ (1)
where $V_i$ is the selected feature from the input set and $V_j$ is the selected feature under a semantic concept.
$S_{rs} = (N_k / T_k) \times 100$ (2)
where $N_k$ is the number of feature vectors matched and $T_k$ is the total number of features available under the particular semantic concept.
Step 6: Repeat Step 5 for all semantic concepts.
Step 7: Identify the concept to which the feature is related.
Step 8: Index the vector under the semantic concept.

Association Rule Generation: We extract the full feature subspace indexed into the system and generate decision rules. Each rule has a set of feature subspaces and a unique semantic. The association rules are generated using the Total from Partial approach. The generated rules are evaluated using the Wilcoxon signed-rank test. The newly sorted rule is added to the model.

Concept Query: The input concept is used to perform text retrieval. We calculate the similarity score with all the association rules available in the indexed system. Based on the concept identified, we compute the relevance score with all the textual features assigned to the texts in the concept category. We sort the texts according to the relevance score and return the results.

Algorithm 2:
Step 1: Receive the concept query.
Step 2: Compute the relevancy score with all the semantic concepts. Compute the cosine similarity (Euclidean distance) between the selected concept and a single term under a semantic concept:
$V_{dis} = V_i - V_j$ (1)
where $V_i$ is the input concept and $V_j$ is the selected term under a semantic concept.
$S_{rs} = (N_k / T_k) \times 100$ (2)
where $N_k$ is the number of keywords matched and $T_k$ is the total number of keywords available under the particular semantic concept.
Step 3: Repeat Step 2 for all semantic concepts.
Step 4: Identify the concept to which the concept query is related.
Step 5: Retrieve all texts under the semantic concept.
Step 6: Return results.

Content Query: In this method we preprocess the text, extract both visual and texture features, and normalize the feature vectors. Using the extracted feature vectors, we compute the relevance score with all the association rules. We compute the weight for each rule and sort by score. Based on the score, we extract the identified feature vectors and return them as results.
Step 1: Read the input text.
Step 2: Apply the Sobel edge detector.
Step 3: Extract raw features.
Step 4: Normalize to the same size.
Step 5: Compute the relevancy score with the semantic concepts. Compute the cosine similarity (Euclidean distance) between the selected feature vector and a single vector under a semantic concept:
$V_{dis} = V_i - V_j$ (1)
where $V_i$ is the selected feature from the input set and $V_j$ is the selected feature under a semantic concept.
$S_{rs} = (N_k / T_k) \times 100$ (2)
where $N_k$ is the number of feature vectors matched and $T_k$ is the total number of features available under the particular semantic concept.
Step 6: Repeat Step 5 for all semantic concepts.
Step 7: Identify the concept to which the feature is related.
Step 8: Retrieve relevant texts and return results.

IV. CONCLUSION

The proposed method will compute anchor-text-based co-occurrences among the given personal name and aliases, and will create a word co-occurrence graph by connecting the nodes that represent the name and aliases based on their first-order associations with each other. A graph mining algorithm that finds the hop distances between nodes will be used to identify the association orders between the name and aliases. Ranking SVM will be used to rank the anchor texts according to the co-occurrence statistics in order to identify the anchor texts in the first-order associations. The web search engine can expand a query on a personal name by tagging aliases in the order of their associations with the name to retrieve all relevant results, thereby improving recall and achieving a substantial MRR compared to that of previously proposed methods.

REFERENCES

[1] L. J. Latecki, R. Lakamper, and U. Eckhardt, “Shape descriptors for non-rigid shapes with a single closed contour,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2000, pp. 424–429.
[2] S. Belongie, J. Malik, and J. Puzicha, “Shape matching and object recognition using shape contexts,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 4, pp. 509–522, Apr. 2002.
[3] H. Ling and D. W. Jacobs, “Shape classification using the inner-distance,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 2, pp. 286–299, Feb. 2007.
[4] C. Rao, A. Yilmaz, and M. Shah, “View-invariant representation and recognition of actions,” Int. J. Comput. Vis., vol. 50, no. 2, pp. 203–226, 2002.
[5] Y. Wang, H. Jiang, M. Drew, L. Ze-Nian, and G. Mori, “Unsupervised discovery of action classes,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2006, pp. 1654–1661.
[6] D. Sharvit, J. Chan, H. Tek, and B. B. Kimia, “Symmetry-based indexing of image databases,” J. Vis. Commun. Image Represent., vol. 9, no. 4, pp. 366–380, 1998.
[7] T. B. Sebastian, P. N. Klein, and B. B. Kimia, “Recognition of shapes by editing their shock graphs,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 5, pp. 550–571, May 2004.
[8] B. Leibe and B. Schiele, “Analyzing appearance and contour based methods for object categorization,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2003.
[9] S. Biswas, G. Aggarwal, and R. Chellappa, “Efficient indexing for articulation invariant shape matching and retrieval,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2007, pp. 1–8.
[10] G. Mori and J. Malik, “Recognizing objects in adversarial clutter: Breaking a visual CAPTCHA,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2003, pp. 134–141.
[11] Z. Tu and A. L. Yuille, “Shape matching and recognition: Using generative models and informative features,” in Proc. Eur. Conf. Computer Vision, 2004, pp. 195–209.

[1] K. Sarada, M.Tech (CSE), Department of Computer Science & Engineering, Vignan’s Nirula Institute of Technology & Science for Women, Guntur.

[2] K.P.N.V. Satya Sree, Asst. Professor, Department of Computer Science & Engineering, Vignan’s Nirula Institute of Technology & Science for Women, Guntur. She has guided many projects in the area of Data Warehousing and Data Mining for the CSE & IT Departments. Her research interests are in the areas of Data Mining and Image Processing.
[3] K.V. Narasimha Reddy received the B.Tech (CSE) from JNTUH and the M.Tech (CSE) from JNTUK. He is currently working as an Assistant Professor & Head of the Department of Computer Science & Engineering at Vignan’s Nirula Institute of Technology & Science for Women, Guntur. He has guided many projects in the area of image processing for the CSE & IT Departments. His research interests are in the areas of Data Mining and Image Processing.

ISSN: 2231-5381 http://www.ijettjournal.org