International Journal of Engineering Trends and Technology (IJETT) – Volume 8 Number 5- Feb 2014
A Novel Co-Clustering Mechanism for Generation of Optimal Clusters
Kalpana Palla¹, P. Rajasekhar²
¹M.Tech Scholar, ²Assistant Professor
Dept. of CSE, Avanthi Institute of Engineering and Technology, Visakhapatnam.
Abstract: Co-clustering is an important research issue in the field of data mining, and musical data is one example domain for this proposal. We propose an efficient data clustering mechanism: an incremental clustering algorithm with a genetic feature. Clustering is a mostly unsupervised procedure, and the majority of clustering algorithms depend on certain assumptions in order to define the subgroups present in a data set. In our approach we perform incremental clustering on a music dataset consisting of artists and tags to obtain optimal clusters, in contrast to the traditional approaches.
I. INTRODUCTION
Clustering is the process of grouping similar objects based on the similarity between the data objects. This similarity can be measured by distance, by semantic similarity, or by any other non-trivial measure that captures the relation between the data objects. Grouping musical data based on the feature set of the musical data objects has attracted growing interest in recent years.
Although various approaches have been released over years of research, every approach has its own advantages and disadvantages, such as random selection of the centroids, local optima, and implementation complexity when dealing with tree structures, among others. In real-world scenarios, every data object need not have the same dimensionality.
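To make the notion of a distance-based similarity measure concrete, the following is a minimal illustrative sketch (not taken from the paper): two common distance functions applied to hypothetical feature vectors of musical data objects.

```python
# Illustrative sketch: two common distance measures for comparing
# feature vectors of musical data objects. All data are hypothetical.

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan_distance(a, b):
    """Sum of absolute coordinate differences (city-block distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

song_a = [0.2, 0.8, 0.1]   # hypothetical feature vector (e.g., tag weights)
song_b = [0.3, 0.6, 0.4]

print(euclidean_distance(song_a, song_b))  # about 0.374
print(manhattan_distance(song_a, song_b))  # about 0.6
```

A smaller distance indicates more similar objects; any such measure can serve as the grouping criterion.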
Music artist similarity has been an active research topic in music information retrieval for a long time, since it is especially useful for music recommendation and organization [1, 2]. Many characteristics can be brought into consideration when defining similarity, e.g., sound, lyrics, genre, style, and mood. Methods for calculating artist similarity include recent proposals based on the similarity information provided by the All Music Guide website (http://www.allmusic.com) as well as those based on user access history (e.g., see [10]). Although there has been considerable effort toward developing effective and efficient methods for calculating artist similarity, several challenges still exist. First, artist similarity varies considerably when considering different aspects of artists such as genre, mood, style, culture, and acoustics. Second, user access history data are often very sparse and hard to acquire. Third, even if we can obtain the categorical descriptions of two artists using All Music Guide, comparing the descriptions is not trivial, since there are semantic similarities among different descriptions. For example, given two mood terms, witty and thoughtful, we cannot simply quantify their similarity as 0 just because they are different words, or as 1 because they are synonyms.
ISSN: 2231-5381
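One common way to obtain such a graded (neither 0 nor 1) tag similarity, shown here purely as an illustration rather than as the paper's method, is to score two tags by the overlap of the artists they annotate, e.g., with a cosine measure. The artist sets below are invented for the example.

```python
# Illustrative sketch: graded similarity between two mood tags, scored by
# cosine similarity over the sets of artists each tag annotates.
# The artist sets are hypothetical.

def tag_similarity(artists_a, artists_b):
    """Cosine similarity of two tags treated as sets of annotated artists."""
    if not artists_a or not artists_b:
        return 0.0
    overlap = len(artists_a & artists_b)
    return overlap / (len(artists_a) * len(artists_b)) ** 0.5

witty = {"artist1", "artist2", "artist3"}
thoughtful = {"artist2", "artist3", "artist4"}

print(tag_similarity(witty, thoughtful))  # 2/3, i.e. neither 0 nor 1
```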
There are a number of key concepts to consider when comparing these approaches. The cold start problem refers to the fact that songs that are not annotated cannot be retrieved. This problem is related to popularity bias, in that popular songs (in the short head) tend to be annotated more thoroughly than unpopular songs (in the long tail) [11]. This often leads to a situation in which a short-head song is ranked above a long-tail song despite the fact that the long-tail song may be more semantically relevant. We prefer an approach that avoids the cold start problem (e.g., autotagging). If this is not possible, we prefer approaches in which we can explicitly control which songs are annotated (e.g., survey, games), rather than an approach in which only the more popular songs are annotated (e.g., social tags, web documents).
A strong labeling [3] is when a song has been explicitly labeled or not labeled with a tag, depending on whether or not the tag is relevant. This is opposed to a weak labeling, in which the absence of a tag from a song does not necessarily indicate that the tag is not relevant. For example, a song may feature drums but not be explicitly labeled with the tag "drum". Weak labeling is a problem if we want to design an MIR system with high recall, or if our goal is to collect a training data set for a supervised autotagging system that uses discriminative classifiers (e.g., [4, 7]).
It is also important to consider the size, structure, and extensibility of the tag vocabulary. In the context of text-based music retrieval, we consider a large and diverse set of semantic tags, where each tag describes some meaningful attribute or characterization of music. In this paper, we limit our focus to tags that can be used consistently by a large number of individuals when annotating novel songs based on the audio content alone. This does not include tags that are personal (e.g., "seen live"), judgmental (e.g., "horrible"), or represent external knowledge about the song (e.g., geographic origins of an artist). It should be noted that these tags are also useful for retrieval (and recommendation) and merit additional attention from the MIR community.
http://www.ijettjournal.org
Page 246
II. RELATED WORK
Music data usually consists of artists and their respective styles, and uses the relevance matrix between the artists and the number of songs related to each artist. The method initially computes the distance between the data objects (i.e., data retrieved from the relevance matrix) and places the data points in the nearest clusters until the termination condition is met.
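The assignment step described above can be sketched as follows; the relevance matrix, centroids, and counts are hypothetical and only illustrate the nearest-cluster placement.

```python
# Minimal sketch of the assignment step: each artist is a row of a
# relevance matrix (song counts per style) and is placed in the cluster
# whose centroid is nearest. All data and names are hypothetical.

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def assign_to_nearest(rows, centroids):
    """Return, for each row, the index of the nearest centroid."""
    return [min(range(len(centroids)), key=lambda k: manhattan(r, centroids[k]))
            for r in rows]

# relevance matrix: artists x styles (song counts)
relevance = [
    [5, 0, 1],   # artist A: mostly style 0
    [4, 1, 0],   # artist B: mostly style 0
    [0, 6, 2],   # artist C: mostly style 1
]
centroids = [[5, 0, 0], [0, 5, 2]]
print(assign_to_nearest(relevance, centroids))  # [0, 0, 1]
```

Repeating this assignment (and recomputing centroids) until the termination condition is met yields the clusters.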
Hierarchical clustering is the generation of tree-like cluster structures without user supervision. Hierarchical clustering algorithms organize input data either bottom-up (agglomerative) or top-down (divisive) [12]. In general, hierarchical agglomerative clustering is used more frequently than hierarchical divisive clustering. Co-clustering refers to the simultaneous clustering of more than one data type. Dhillon proposes bipartite spectral graph partitioning approaches to co-cluster words and documents. Long et al. propose a general principled model, called the relation summary network, for co-clustering heterogeneous data presented as a k-partite graph. While hierarchical clustering deals with only one type of data and co-clustering produces only one level of data organization, hierarchical co-clustering aims at simultaneously constructing hierarchical structures for two or more data types; that is, it attempts to achieve the function of both hierarchical clustering and co-clustering. Because of this unique nature, hierarchical co-clustering is receiving special attention from researchers [14], [15]. Xu et al. proposed a hierarchical divisive co-clustering algorithm [17] to simultaneously find document clusters and the associated word clusters. Shao et al. [16] incorporated this hierarchical divisive co-clustering algorithm into a novel artist similarity quantifying framework, for the purpose of assisting artist similarity quantification by utilizing the style and mood cluster information. In their framework, artist similarity is based on style similarity and mood similarity. Even though this hierarchical divisive co-clustering method exists, to the best of our knowledge few researchers have studied hierarchical agglomerative co-clustering methods.
In recent years, much work has been done on constrained clustering: integrating various forms of background knowledge into the clustering process. Existing constrained clustering methods have focused on the use of background information in the form of instance-level "must-link" and "cannot-link" constraints, which, as the names suggest, assert that a pair of data instances must be in the same cluster or must be in distinct clusters, respectively. Most of the constrained clustering algorithms in the literature are designed for partitional clustering methods, e.g., constrained K-means clustering, constrained spectral clustering, and constrained clustering using non-negative matrix factorization; little has been done on utilizing constraints for hierarchical clustering. Recently, a few works have incorporated constraints into hierarchical clustering, e.g., by extending a partially known hierarchy with the constraints to a full hierarchy, or by modifying the order of the cluster merging process.
III. PROPOSED SYSTEM
Co-clustering clusters more than one type of data simultaneously, such as documents and URLs in text clustering, or styles and artists in music data clustering. Our hierarchical approach merges similar data based on the relevance matrix and the similarity matrix between the data points. The hierarchical approach generates the optimal clusters by iteratively merging similar data objects after forming the data relations between the objects.
In this paper we propose an efficient clustering algorithm for clustering music data with respect to artists and tags. In the proposed method we combine the k-means algorithm with a genetic algorithm to obtain optimal clusters; genetic algorithms provide solutions for problems such as NP-hard problems. In place of the traditional random selection of the centroids, we propose a novel approach, i.e., intra-cluster variances, for specifying the number of centroids.
Initially we construct the relevance matrix between the artists and the styles or moods of the music objects, then compute the intra-cluster variance (or distance) between the data objects, compute the minimum distance with respect to the number of clusters specified by the user, and then forward the result to the evolutionary approach for optimal clusters.
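The intra-cluster variance criterion mentioned above can be sketched as follows; this is an illustrative quality score that an evolutionary search would minimize, and the points, labels, and centroids are hypothetical.

```python
# Sketch of the intra-cluster variance criterion: the mean squared distance
# of each point to its assigned centroid. A genetic search over clusterings
# would minimize this value. All data are hypothetical.

def intra_cluster_variance(points, labels, centroids):
    """Mean squared distance of each point to its assigned centroid."""
    total = 0.0
    for p, lbl in zip(points, labels):
        c = centroids[lbl]
        total += sum((x - y) ** 2 for x, y in zip(p, c))
    return total / len(points)

points = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0]]
labels = [0, 0, 1]                 # cluster assignment per point
centroids = [[1.1, 0.9], [5.0, 5.0]]
print(intra_cluster_variance(points, labels, centroids))
```

Lower values indicate tighter clusters, which is the fitness signal used to judge candidate clusterings.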
Algorithm: Musical Co-Clustering
Step 1: Initialize the musical data with artists and styles or moods
Step 2: Read the input number of clusters
Step 3: Compute the relevance matrix between artists and styles/moods
Step 4: For i := 0 to Max_Number_Of_Iterations:
Step 5: Compute the Manhattan distance between data points
Step 6: Store the minimum distance and minimum index
Step 7: Group the similar data points
Step 8: Return the clusters
Step 9: Terminate
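A minimal runnable sketch of the steps above is given below. The relevance matrix and parameter names are hypothetical, and the genetic refinement stage of the paper is omitted for brevity; only the Manhattan-distance grouping loop is shown.

```python
import random

# Runnable sketch of Steps 1-9: k-means-style grouping of artists by
# Manhattan distance over a relevance matrix. Data are hypothetical and
# the paper's genetic refinement stage is omitted.

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def musical_coclustering(relevance, k, max_iterations=50, seed=1):
    rng = random.Random(seed)
    centroids = rng.sample(relevance, k)       # pick k distinct rows to start
    labels = [0] * len(relevance)
    for _ in range(max_iterations):
        # Store, per point, the index of the minimum-distance centroid
        labels = [min(range(k), key=lambda j: manhattan(row, centroids[j]))
                  for row in relevance]
        # Recompute each centroid as the mean of its grouped points
        for j in range(k):
            members = [r for r, lbl in zip(relevance, labels) if lbl == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

relevance = [[5, 0], [4, 1], [0, 6], [1, 5]]   # artists x styles
print(musical_coclustering(relevance, k=2))
```

On this toy matrix the first two artists end up grouped together and the last two form the second cluster.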
Compute the relevance matrix between artists and styles/moods, which indicates the number of relations between data points. Compute the Manhattan distance between the data points and specify the minimum number of clusters. Initially, select the centroids at random, compute the Manhattan distance between all the data points, find the minimum distance and minimum index of the data points, and place each point in its respective cluster. Continue the same process: select another centroid (it should not repeat a previous centroid) and proceed in the same way until the maximum number of iterations is reached.
For optimal performance, compute with the agglomerative approach by merging data points that share the same set of values: combine the data points that have the same distance and cluster the data again in a hierarchical manner (a tree structure). The following diagram shows the clusters formed after the merging of data points.
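The agglomerative merging described above can be sketched as follows; single linkage and the sample points are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch of agglomerative (bottom-up) merging: repeatedly merge the two
# closest clusters until a single tree remains. Data are hypothetical.

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def agglomerate(points):
    """Single-linkage merging; returns a nested-tuple tree of point indices."""
    clusters = [((i,), [p]) for i, p in enumerate(points)]   # (tree, members)
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest single-linkage distance
        i, j = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda ab: min(manhattan(p, q)
                                      for p in clusters[ab[0]][1]
                                      for q in clusters[ab[1]][1]))
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

points = [[0, 0], [0, 1], [5, 5], [5, 6]]
print(agglomerate(points))  # (((0,), (1,)), ((2,), (3,)))
```

The nested tuples mirror the tree structure: nearby points merge first, and the two resulting subclusters merge last.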
IV. CONCLUSION
We conclude that we have proposed an efficient novel co-clustering of musical data. The user need not specify the number of clusters; it can be determined through intra-cluster variances. For the distance measure, we used the genetic feature to obtain the best fitness, i.e., optimal clusters.
REFERENCES
[1] J. J. Aucouturier, F. Pachet, P. Roy, and A. Beurive. Signal + context = better classification. In ISMIR, 2007.
[2] L. Barrington, M. Yazdani, D. Turnbull, and G. Lanckriet. Combining feature kernels for semantic music retrieval. In ISMIR, 2008.
[3] G. Carneiro, A. B. Chan, P. J. Moreno, and N. Vasconcelos. Supervised learning of semantic classes for image annotation and retrieval. IEEE PAMI, 29(3):394–410, 2007.
[4] O. Celma, P. Cano, and P. Herrera. Search Sounds: An audio crawler focused on weblogs. In ISMIR, 2006.
[5] S. Clifford. Pandora's long strange trip. Inc.com, 2007.
[6] J. S. Downie. Music information retrieval evaluation exchange (MIREX), 2005.
[7] D. Eck, P. Lamere, T. Bertin-Mahieux, and S. Green. Automatic generation of social tags for music recommendation. In Neural Information Processing Systems Conference (NIPS), 2007.
[8] W. Glaser, T. Westergren, J. Stearns, and J. Kraft. Consumer item matching method and system. US Patent 7003515, 2006.
[9] P. Knees, T. Pohle, M. Schedl, D. Schnitzer, and K. Seyerlehner. A document-centered approach to a natural language music search engine. In ECIR, 2008.
[10] P. Knees, T. Pohle, M. Schedl, and G. Widmer. A music search engine built upon audio-based and web-based similarity measures. In ACM SIGIR, 2007.
[11] P. Lamere and O. Celma. Music recommendation tutorial notes. ISMIR Tutorial, September 2007.
[12] E. L. M. Law, L. von Ahn, and R. Dannenberg. TagATune: a game for music and sound annotation. In ISMIR, 2007.
[13] M. Levy and M. Sandler. A semantic space for music derived from social tags. In ISMIR, 2007.
[14] M. Mandel and D. Ellis. A web-based game for collecting music metadata. In ISMIR, 2007.
[15] C. McKay and I. Fujinaga. Musical genre classification: Is it worth pursuing and how can it be improved? In ISMIR, 2006.
[16] F. Miller, M. Stiksel, and R. Jones. Last.fm in numbers. Last.fm press material, February 2008.
[17] M. Sordo, C. Laurier, and O. Celma. Annotating music collections: How content-based similarity helps to propagate labels. In ISMIR, 2007.
[18] D. Turnbull. Design and Development of a Semantic Music Discovery Engine. PhD thesis, UC San Diego, 2008.
[19] D. Turnbull, L. Barrington, D. Torres, and G. Lanckriet. Semantic annotation and retrieval of music and sound effects. IEEE TASLP, 16(2), 2008.
[20] D. Turnbull, R. Liu, L. Barrington, and G. Lanckriet. Using games to collect semantic information about music. In ISMIR, 2007.
BIOGRAPHIES
P. Rajasekhar completed his M.Tech at GITAM University. He is working as an Assistant Professor in the Dept. of Computer Science and Engineering, Avanthi Institute of Engineering and Technology, affiliated to JNTU, Kakinada. He has ten years of experience in teaching. His areas of interest are Software Engineering and Data Modeling.
Kalpana Palla completed her B.Tech (Information Technology) at Sri Vasavi Engineering College, affiliated to JNTU, Kakinada. She is pursuing an M.Tech (Software Engineering) at Avanthi Institute of Engineering and Technology, affiliated to JNTU, Kakinada. Her areas of interest are Software Engineering and Bioinformatics.