International Journal of Engineering Trends and Technology (IJETT) – Volume 4 Issue 5- May 2013
Robust Alias Detection System for Accurate Identification of
Aliases of a Given Person Name
K. Sarada#1, K.P.N.V. Satya Sree#2 and K.V. Narasimha Reddy#3
1 K. Sarada, Vignan’s Nirula Institute of Technology & Science for Women, Guntur, AP, India.
2 K.P.N.V. Satya Sree, Asst. Professor, Department of Computer Science & Engineering, Vignan’s Nirula Institute of Technology &
Science for Women, Guntur, AP, India.
3 K.V. Narasimha Reddy, Asst. Professor, Department of Computer Science & Engineering, Vignan’s Nirula Institute of Technology &
Science for Women, Guntur, AP, India.
Abstract— Due to the increase in the size of World Wide Web,
content-based retrieval becomes more challenging and also
returns many irrelevant results. We propose a new Concept and
Content Based Text Retrieval (CCBR) technique to provide more
exact results. We use an associative mining technique with
semantic concepts. Our methodology indexes the texts according
to semantic concepts and generates association rules, which are
used for text retrieval. We extract the high-level concepts and
low-level features. The extracted feature vectors are indexed to
a semantic concept, and we generate association rules with which
the retrieval is performed. We use both visual and texture
features to represent the semantic concept. CCBR is helpful in
both data collection and modeling of large-scale text databases.
Keywords— Privacy-preserving, data publishing, functional
dependency, utility, data reconstruction.
I. INTRODUCTION
World Wide Web resources increase every day, and this
growth makes information retrieval a challenging task. Text
retrieval on the web in particular becomes more complicated
as web resources grow. A few techniques exist for
content-based text retrieval, but they fall short in providing
accurate and appropriate results. For example, Google provides
text search as a concept-based one, and it also returns many
irrelevant results.
For a content-based system to be successful, it needs to
minimize the gap between the analyst’s model of visual patterns
and the computer’s representation of information. A
content-based system enables the user to easily access text
databases using query methods similar to reasoning. Research
that uses semantic methods has been shown to better mimic the
knowledge that represents visual patterns. Fonseca et al. [1]
propose an ontology-driven aerial-information system for
classifying content-based systems that uses complex-query
methods such as shape, multi-object relationships and semantics.
In this paper, we propose a text retrieval technique that
uses content- and concept-based methods and association rules
to link visual semantics to the concepts. We provide query by
example and query by concept for the efficient retrieval of
texts. When we deal with shapes, the only information usually
available is the underlying geometry. Appropriate features are
chosen to encode this geometry as richly as possible, without
compromising on robustness. Quite clearly, the set of useful
features varies depending on the particular application at hand.
For example, invariance to articulations of part structures is
very important in applications like gait-based human
identification, whereas the same feature is not desired for
applications like retrieval based on human pose. Our goal here
is to develop a system that supports fast retrieval of shapes
without needing any costly correspondence step during
matching. To this end, we use (or propose) features that
address most challenges faced by shape matching tasks,
including invariance to object translation, rotation, scale,
articulations, etc. In the proposed indexing framework, a
given shape is represented using a collection of feature vectors,
each characterizing a geometrical relationship between a pair
of landmark points. The features should be easily computable
for the matching algorithm to be efficient and to be able to
scale up to large database sizes. For each landmark pair,
depending on the application, all or a subset of the following
geometrical characteristics are encoded in the corresponding
feature vector.
II. BACKGROUND WORK
Quang Minh Vu et al. [13] proposed a technique for the
disambiguation of people in Web search using a knowledge
base. The work aimed to separate documents related to
different people and to group the documents that refer to
the same person. In this paper [13] the authors proposed a
method that used Web directories as a knowledge base to find
matching documents and their similarity index. As web
documents often contain noisy data, finding the topic of a web
page was difficult. The authors [13] used several sets of
documents on several topics to help find the topics of web
pages and to extract important terms related to those topics.
They [13] then used the important terms to calculate document
pair similarities.
Y. Matsuo et al. [14] introduced keyword extraction from a
single document using word co-occurrence statistical
information. In this paper [14], the authors presented a new
keyword extraction algorithm that applies to a single document
without using a corpus. Frequent terms are extracted first,
then a set of co-occurrences between each term and the
frequent terms, i.e., occurrences in the same sentences, is
generated. The co-occurrence distribution shows the
importance of a term in the document as follows. If the
probability distribution of co-occurrence between term a and
the frequent terms is biased to a particular subset of frequent
terms, then term a is likely to be a keyword. The degree of
bias of a distribution is measured by the χ2-measure. Their [14]
algorithm showed performance comparable to tf-idf without
using a corpus.
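As an illustration of the χ2-based bias measure described for [14], a minimal sketch follows. It assumes tokenized sentences as input and estimates the expected probability of each frequent term from its share of all co-occurrences, which is a simplification chosen here and not necessarily the estimate used in [14]. The names chi_square_scores and num_frequent are hypothetical.

```python
from collections import Counter


def chi_square_scores(sentences, num_frequent=2):
    """Score each term by how biased its co-occurrence with frequent terms is.

    sentences: list of token lists. Assumption: the expected probability of a
    frequent term g is its share of all observed co-occurrences.
    """
    term_freq = Counter(t for s in sentences for t in s)
    frequent = [t for t, _ in term_freq.most_common(num_frequent)]

    # cooc[w][g] = number of sentences that contain both w and frequent term g
    cooc = {w: Counter() for w in term_freq}
    for s in sentences:
        present = set(s)
        for w in present:
            for g in frequent:
                if g != w and g in present:
                    cooc[w][g] += 1

    column_totals = Counter()
    for counts in cooc.values():
        column_totals.update(counts)
    total = sum(column_totals.values())
    expected_p = {g: column_totals[g] / total for g in frequent} if total else {}

    scores = {}
    for w, counts in cooc.items():
        n_w = sum(counts.values())
        score = 0.0
        for g in frequent:
            if g == w:
                continue
            expected = n_w * expected_p.get(g, 0.0)
            if expected > 0:
                score += (counts[g] - expected) ** 2 / expected
        scores[w] = score
    return scores


# Hypothetical toy document: "b" co-occurs only with "g1", "u" mostly with "g2".
toy_sentences = [
    ["g1", "g2", "u"], ["g1", "b"], ["g1", "b"], ["g2", "u"],
    ["g1", "b"], ["g2", "z"], ["g2", "u"],
]
toy_scores = chi_square_scores(toy_sentences)
```

In this toy input the term "b" receives a higher score than "u" because all of its co-occurrences fall on a single frequent term.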
Wen-Hsiang Lu et al. [15] proposed an anchor text mining
approach for the translation of Web queries. To discover
translation knowledge in diverse data resources on the Web,
this article [15] proposed an effective approach to finding
translation equivalents of query terms and constructing
multilingual lexicons through the mining of Web anchor texts
and link structures. Although Web anchor texts are
wide-scoped hypertext resources, not every pair of
languages contains sufficient anchor texts for effective
extraction of translations for Web queries. For more
generalized applications, the approach [15] was designed
based on a transitive translation model. The translation
equivalents of a query term can be extracted via its translation
in an intermediate language. To reduce interference from
translation errors, the approach further integrates a
competitive linking algorithm into the process of determining
the most probable translation. A series of experiments was
conducted [15], including performance tests on term
translation extraction, cross-language information retrieval,
and translation suggestions for practical Web search services.
The experimental results showed that the proposed approach
is effective in extracting translations of unknown queries,
is easy to combine with the probabilistic retrieval model to
improve cross-language retrieval performance, and is very
useful when the considered language pairs lack a sufficient
number of anchor texts [15].
[Figure: Block diagram of the proposed system. Blocks: Crawled text; Preprocessing; Feature extraction; Association rule generation; Input text or concept; Identify concept; Compute relevance score; Perform CBIR & return results; Visual and texture feature indexing.]
III. METHODS
Preprocessing: We preprocess the web-crawled text. First, the
crawled text is converted to a fixed shape so that features map
to a uniform size. The scaled text is converted to gray scale
and edge detection is performed. We extract the shape feature
from the edge-detected text, and the extracted raw feature is
normalized to a fixed size. We use general algorithms for edge
detection on the input texts. The extracted texture feature is
mapped to a uniform size for indexing.
Visual and Texture Feature Indexing:
The extracted feature vectors are indexed to a semantic
concept based on the relevance score. We compute the
relevance score against all the texts in a semantic concept.
The feature vector is labeled with a semantic concept only
if the texts under that concept are more similar to the input
text than those under the other concepts. We use cosine
similarity to compute the similarity between two feature
vectors. The identified feature vector is indexed into the
semantic concept with the label. We compute similarity values
with both visual and texture features.
Algorithm1:
Step1: Crawl texts from the internet.
Step2: Apply the Sobel edge detector.
Step3: Extract raw features.
Step4: Normalize to the same size.
Step5: Compute the relevance score with the semantic concepts.
Compute the cosine similarity (Euclidean distance) between the
selected feature vector and a single vector under the semantic
concept.
$V_{dis} = V_i - V_j$    (1)
$V_i$: selected feature from the input set.
$V_j$: selected feature under a semantic concept.
$S_{rs} = (N_k / T_k) \times 100$    (2)
$N_k$: number of feature vectors matched.
$T_k$: total number of features available under the particular
semantic concept.
Step6: Repeat Step 5 for all semantic concepts.
Step7: Identify the concept to which the feature is related.
Step8: Index the vector under the semantic concept.
Association Rule Generation
We extract the full feature subspace indexed into the system
and generate decision rules. Each rule consists of a feature
subspace and a unique semantic concept. The association rules
are generated using the total-from-partial approach. The
generated rules are evaluated using the Wilcoxon signed-rank
test. The newly sorted rule is added to the model.
Concept Query
The input concept is used to perform text retrieval. We
calculate the similarity score against all the association rules
available in the indexed system. Based on the concept
identified, we compute the relevance score against all the
textual features assigned to the texts in the concept category.
We sort the texts according to the relevance score and return
the results.
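The relevance score of equation (2) and the indexing step of Algorithm 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the paper does not state when two feature vectors count as "matched", so a distance threshold is assumed here, and the names relevance_score, index_vector and threshold are hypothetical.

```python
import math


def euclidean_distance(vi, vj):
    """Length of the difference vector V_i - V_j from equation (1)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vi, vj)))


def relevance_score(vi, concept_vectors, threshold=0.5):
    """S_rs = (N_k / T_k) * 100 for one semantic concept.

    Assumption: a stored vector is 'matched' when its distance to vi is at
    most `threshold`; the paper does not define the matching rule.
    """
    if not concept_vectors:
        return 0.0
    matched = sum(
        1 for vj in concept_vectors if euclidean_distance(vi, vj) <= threshold
    )
    return matched / len(concept_vectors) * 100


def index_vector(vi, index, threshold=0.5):
    """Score vi against every concept and store it under the best one."""
    scores = {
        concept: relevance_score(vi, vectors, threshold)
        for concept, vectors in index.items()
    }
    best = max(scores, key=scores.get)
    index[best].append(vi)
    return best, scores


# Hypothetical two-concept index with 2-D feature vectors.
demo_index = {
    "sports": [[0.0, 0.0], [0.1, 0.1], [5.0, 5.0]],
    "music": [[5.0, 5.0], [5.1, 5.0]],
}
demo_best, demo_scores = index_vector([0.0, 0.1], demo_index)
```

Two of the three "sports" vectors lie within the threshold, so that concept scores about 66.7 and receives the new vector, while "music" scores 0.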
Algorithm2:
Step1: Receive the concept query.
Step2: Compute the relevance score with all semantic concepts.
Compute the cosine similarity (Euclidean distance) between the
selected concept and a single term under the semantic concept.
$V_{dis} = V_i - V_j$    (1)
$V_i$: input concept.
$V_j$: selected term under a semantic concept.
$S_{rs} = (N_k / T_k) \times 100$    (2)
$N_k$: number of keywords matched.
$T_k$: total number of keywords available under the particular
semantic concept.
Step3: Repeat Step 2 for all semantic concepts.
Step4: Identify the concept to which the concept query is related.
Step5: Retrieve all texts under the semantic concept.
Step6: Return results.
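A minimal sketch of the concept query in Algorithm 2 is given below. It assumes that each semantic concept stores a keyword list and the texts indexed under it, and it replaces the distance of equation (1) with exact keyword matching for simplicity. The names keyword_score and concept_query and the sample data are hypothetical.

```python
def keyword_score(query_terms, concept_keywords):
    """S_rs = (N_k / T_k) * 100 with N_k counted by exact keyword match."""
    keywords = set(concept_keywords)
    if not keywords:
        return 0.0
    matched = len(set(query_terms) & keywords)
    return matched / len(keywords) * 100


def concept_query(query, concepts):
    """Return the best-matching concept name and the texts stored under it.

    concepts: {name: {"keywords": [...], "texts": [...]}} (assumed layout).
    """
    terms = query.lower().split()
    scores = {
        name: keyword_score(terms, entry["keywords"])
        for name, entry in concepts.items()
    }
    best = max(scores, key=scores.get)
    return best, concepts[best]["texts"]


# Hypothetical index with two semantic concepts.
sample_concepts = {
    "cricket": {"keywords": ["bat", "ball", "wicket", "over"],
                "texts": ["doc1", "doc2"]},
    "cinema": {"keywords": ["actor", "film"], "texts": ["doc3"]},
}
best_concept, retrieved = concept_query("bat and ball", sample_concepts)
```

Here the query matches two of the four "cricket" keywords, so that concept is selected and its texts are returned.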
Content Query:
In this method we preprocess the text, extract both visual
and texture features, and normalize the feature vectors.
Using the extracted feature vectors we compute the relevance
score against all the association rules. We compute the weight
for each rule and sort the rules by score. Based on the score
we extract the identified feature vectors and return them as
results.
Step1: Read the input text.
Step2: Apply the Sobel edge detector.
Step3: Extract raw features.
Step4: Normalize to the same size.
Step5: Compute the relevance score with the semantic concepts.
Compute the cosine similarity (Euclidean distance) between the
selected feature vector and a single vector under the semantic
concept.
$V_{dis} = V_i - V_j$    (1)
$V_i$: selected feature from the input set.
$V_j$: selected feature under a semantic concept.
$S_{rs} = (N_k / T_k) \times 100$    (2)
$N_k$: number of feature vectors matched.
$T_k$: total number of features available under the particular
semantic concept.
Step6: Repeat Step 5 for all semantic concepts.
Step7: Identify the concept to which the feature is related.
Step8: Retrieve relevant texts and return results.
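Steps 2 to 5 of the content query can be illustrated with the sketch below. It assumes the input has already been rendered as a 2-D grayscale grid (a list of rows of intensities), uses a histogram of edge magnitudes as one possible way to obtain a fixed-size feature, and compares features with cosine similarity. The paper does not specify these details, so the functions sobel_magnitude, edge_histogram and cosine_similarity are hypothetical choices.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(image):
    """Sobel gradient magnitude; border pixels are left at 0."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = sum(SOBEL_X[i][j] * image[r + i - 1][c + j - 1]
                     for i in range(3) for j in range(3))
            gy = sum(SOBEL_Y[i][j] * image[r + i - 1][c + j - 1]
                     for i in range(3) for j in range(3))
            out[r][c] = math.hypot(gx, gy)
    return out


def edge_histogram(edges, bins=4):
    """Fixed-size feature: share of pixels per edge-strength bin."""
    values = [v for row in edges for v in row]
    top = max(values)
    hist = [0] * bins
    for v in values:
        idx = min(int(v / top * bins), bins - 1) if top else 0
        hist[idx] += 1
    return [count / len(values) for count in hist]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical 4x4 input with a vertical step edge.
step_image = [[0, 0, 10, 10] for _ in range(4)]
step_edges = sobel_magnitude(step_image)
step_feature = edge_histogram(step_edges)
```

Because the histogram always has the same number of bins, inputs of different sizes yield feature vectors of the same length, which is what the comparison in Step 5 requires.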
IV. CONCLUSION
The proposed method will compute anchor text-based
co-occurrences between the given personal name and its aliases,
and will create a word co-occurrence graph by connecting
nodes representing the name and aliases based on their
first-order associations with each other. A graph mining
algorithm that finds the hop distances between nodes will be
used to identify the association orders between the name and
aliases. Ranking SVM will be used to rank the anchor texts
according to the co-occurrence statistics in order to identify
the anchor texts in the first-order associations. The web
search engine can then expand a query on a personal name by
tagging aliases in the order of their associations with the name
to retrieve all relevant results, thereby improving recall and
achieving a substantial mean reciprocal rank (MRR) compared
with previously proposed methods.
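The hop-distance step on the co-occurrence graph can be sketched with a breadth-first search, as below. This is only an illustration under assumptions: co-occurrences are given as unordered pairs of strings, the graph is unweighted, and the Ranking SVM stage is not modeled. The names build_graph and hop_distances and the placeholder node labels are hypothetical.

```python
from collections import defaultdict, deque


def build_graph(cooccurrence_pairs):
    """Undirected co-occurrence graph from (term, term) pairs."""
    graph = defaultdict(set)
    for a, b in cooccurrence_pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def hop_distances(graph, source):
    """Breadth-first search: number of hops from source to each reachable node."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    return distances


# Placeholder labels; a hop count of 1 corresponds to a first-order association.
pairs = [("name", "alias_a"), ("alias_a", "alias_b"), ("name", "alias_c")]
orders = hop_distances(build_graph(pairs), "name")
candidates_by_order = sorted(
    (term for term in orders if term != "name"), key=lambda t: (orders[t], t)
)
```

Sorting candidates by hop count places first-order associations ahead of second-order ones, which mirrors the order in which aliases would be tagged during query expansion.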
REFERENCES
[1] L. J. Latecki, R. Lakamper, and U. Eckhardt, “Shape
descriptors for non-rigid shapes with a single closed contour,” in
Proc. IEEE Conf. Computer Vision and Pattern Recognition,
2000, pp. 424–429.
[2] S. Belongie, J. Malik, and J. Puzicha, “Shape matching and
object recognition using shape contexts,” IEEE Trans. Pattern
Anal. Mach. Intell., vol. 24, no. 4, pp. 509–522, Apr. 2002.
[3] H. Ling and D. W. Jacobs, “Shape classification using the
inner-distance,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29,
no. 2, pp. 286–299, Feb. 2007.
[4] C. Rao, A. Yilmaz, and M. Shah, “View-invariant
representation and recognition of actions,” Int. J. Comput. Vis.,
vol. 50, no. 2, pp. 203–226, 2002.
[5] Y. Wang, H. Jiang, M. Drew, L. Ze-Nian, and G. Mori,
“Unsupervised discovery of action classes,” in Proc. IEEE Conf.
Computer Vision and Pattern Recognition, 2006, pp. 1654–1661.
[6] D. Sharvit, J. Chan, H. Tek, and B. B. Kimia,
“Symmetry-based indexing of image databases,” J. Vis. Commun.
Image Represent., vol. 9, no. 4, pp. 366–380, 1998.
[7] T. B. Sebastian, P. N. Klein, and B. B. Kimia, “Recognition
of shapes by editing their shock graphs,” IEEE Trans. Pattern
Anal. Mach. Intell., vol. 26, no. 5, pp. 550–571, May 2004.
[8] B. Leibe and B. Schiele, “Analyzing appearance and contour
based methods for object categorization,” in Proc. IEEE Conf.
Computer Vision and Pattern Recognition, 2003.
[9] S. Biswas, G. Aggarwal, and R. Chellappa, “Efficient
indexing for articulation invariant shape matching and
retrieval,” in Proc. IEEE Conf.
Computer Vision and Pattern Recognition, 2007, pp. 1–8.
[10] G. Mori and J. Malik, “Recognizing objects in adversarial
clutter: Breaking a visual captcha,” in Proc. IEEE Conf.
Computer Vision and Pattern Recognition, 2003, pp. 134–141.
[11] Z. Tu and A. L. Yuille, “Shape matching and recognition:
Using generative models and informative features,” in Proc. Eur.
Conf. Computer Vision, 2004, pp. 195–209.
[1] K. Sarada, M.Tech (CSE), Department of
Computer Science & Engineering at Vignan’s
Nirula Institute of Technology & Science for
Women, Guntur.
[2] K.P.N.V. Satya Sree, Asst. Professor,
Department of Computer Science &
Engineering at Vignan’s Nirula Institute of
Technology & Science for Women, Guntur.
She has guided many projects in the area of
Data Warehousing and Data Mining for the
CSE & IT Departments. Her research
interests are in the areas of Data Mining and Image
Processing.
[3] K.V. Narasimha Reddy received the
B.Tech (CSE) from JNTUH and the M.Tech
(CSE) from JNTUK. He is currently
working as an Assistant Professor & Head
of the Department of Computer Science &
Engineering at Vignan’s Nirula Institute of
Technology & Science for Women, Guntur. He has guided
many projects in the area of image processing for the CSE &
IT Departments. His research interests are in the areas of
Data Mining and Image Processing.
ISSN: 2231-5381
http://www.ijettjournal.org