www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242

advertisement
www.ijecs.in
International Journal Of Engineering And Computer Science ISSN:2319-7242
Volume 4 Issue 1 January 2015, Page No. 10028-10042
Constructing Horizontal layout and Clustering Horizontal layout
by applying Fuzzy Concepts for Data mining Reasoning
Kalluri N V Satya Naresh, Divya Vani .Y
divyasudha99@gmail.com
Shri Vishnu Engineering College for Women
Bhimavaram, Andhra Pradesh, India
Abstract: Clustering is one of the significant tasks in data mining which is benevolent for bounteous users
by affording analysis and decision making. This paper inaugurates agile and dexterous way to conceive
horizontal layout and forthright usage of horizontal layout in data mining algorithms like clustering.
Predominantly educing a data set in data mining project for analysis is a time conceiving, striving task so
horizontal layouts are created and stored in database which averts the burden of performing data
preprocessing in data mining projects .The vertical layouts created by vertical aggregations in SQL are
impotent for data mining algorithms so horizontal aggregations are used to create horizontal layouts. It is
surpass to create horizontal layout instead of creating vertical layout as vertical layout only creates one
column per aggregated group by using normal SQL (Structured Query Language) aggregations and
horizontal layouts returns many values per aggregated group or row so they are useful for data mining
algorithms. Through CASE and SPJ methods horizontal aggregations are evaluated for creating horizontal
layouts dexterously and agilely. This paper induces how horizontal layout can be created easily with CASE
method than by using SPJ method. To prepare a data set for clustering takes more time and effort so the
created horizontal layout is obliged for clustering directly without wastage of time and effort. As in data
uncertainty is the key feature so by using soft computing concepts like Fuzzy Set, clustering of horizontal
layout is done, hence clustered data is serendipitous for users for analysis and decision making and the
whole process is elucidated with examples and experimental results.
Keywords: Horizontal Aggregation, Horizontal
layout, Vertical layout, Vertical Aggregation,
Horizontal layouts are dreadfully of assistance in data
Data
mining algorithms, so this paper utterly perambulates
mining
Concepts.
algorithms,
Clustering,
Fuzzy
about effortless creation and clustering of horizontal
layout by superintendence imprecise data.
1. Introduction:
Kalluri N V Satya Naresh, IJECS Volume 4 Issue 1 January, 2015 Page No.10028-10042
Page 10028
Generally erecting a data set for data mining projects is
ample information about itemized groups from data
a most time conceiving process. The vertical layouts
based on peculiar variables such as gender, name, age,
spawned by normal SQL aggregation functions
address, profession, phone number or income is called
(vertical aggregations) are discordant for using in data
as general aggregation. Utmost data mining algorithms
mining tasks or projects. Vertical layout spawned by
crave horizontal layout data set as input because
vertical aggregations dwelled of more no of rows
horizontal layout return values per aggregated row
which are not I/O (Input or Output) efficient and are
instead of one value per aggregated row. A latest class
impotent for using in data mining tasks or projects. So
of aggregate functions is contemplated to return a table
to disentangle the problem of erecting data sets
or data set having horizontal layout aggregating
horizontal
create
expressions of numeric and transposing the results.
horizontal layout easily. Horizontal layouts are
Functions which belong to this type of class are
augment I/O efficient than vertical layout for using in
horizontal
data mining algorithms like classification, regression
epitomize the dilatation form of traditional SQL
analysis, PDA, clustering. Horizontal layout can avoid
aggregations, which return a group of values or
the burden of creating data sets by performing data
columns in a horizontal layout per aggregated row or
preprocessing phase and data set creation phase with
group instead of a single column or value per
complex SQL queries. Vertical layouts have some
aggregated row.
aggregations
are
adopted
to
aggregations.
Horizontal
aggregations
limitations to use for data mining algorithms which are
erected by using normal SQL functions as they return
only one column per aggregated group or row, so
Horizontal layout is created by using functions called
horizontal aggregations which create many columns or
values per aggregated group or row instead of one
value per row. They are many advantages with
horizontal
aggregations
which
are
helpful
for
generating SQL code automatically and these are
evaluated by using SPJ and CASE methods in this
paper. In this paper it is clearly proved with example
that it is easy and time efficient to create horizontal
layout by using CASE method than using SPJ method.
Without performing any data mining pre-processing
tasks in-anticipation created horizontal layout is used
unswervingly for clustering saving time and effort.
Many vital operators and functions are needed to
compute aggregations in SQL. Sum is the ultimate
prevalently used aggregation of a column and assorted
other aggregation operators return the row count,
maximum, average and minimum over the groups of
rows. For accomplishing aggregations all the extant
operators have cramp to be used in data mining
intendments to create large data sets. For OLTP
(online transaction process) database schemas need to
be profoundly normalized. But conventionally data
mining, machine learning or statistical algorithms
carve aggregated data to be in synopsized form. Data
mining algorithms use suitable input as cross tabular
(horizontal) pattern so for this intendment essential
endeavor is required to compute aggregation.
Clustering of horizontal layout is performed by using
En masse creating a data set for data mining projects is
Fuzzy Concepts handling impreciseness and vagueness
a most time conceiving process. Horizontal layouts are
of data.
I/O and time efficient for using in data mining
The mechanism where information is gleaned, asserted
in a summary form and recycled for demographic
analysis is known as data aggregation. Intension to get
algorithms like classification, regression analysis,
PDA, clustering which can avoid the burden of
creating data sets by performing data preprocessing
Kalluri N V Satya Naresh, IJECS Volume 4 Issue 1 January, 2015 Page No.10028-10042
Page 10029
phase and data set creation phase with complex SQL
fixed than it is K-Means algorithm and if no of clusters
queries. Vertical layouts have some cramp to use for
are fixed than it is fuzzy-C Means algorithm.
data mining algorithms which are created by using
normal SQL functions as they return only one column
per aggregated group or row, so Horizontal layout is
created
by
using
functions
called
horizontal
aggregations which create many columns or values per
aggregated group or row instead of one value per row.
They
are
many
advantages
with
horizontal
As horizontal layout can be used precisely for data
mining algorithms or projects we are using well-nigh
for clustering because it is one of most important task
in data mining. Clustering of Horizontal layout can be
performed through Fuzzy C-Means algorithm.
2. Literature Review:
aggregations like procreate SQL code automatically
Database is formulating data to model pertinent
and evaluated by using SPJ and CASE methods.
aspects of
verisimilitude in a way to support
An advanced class function is Horizontal aggregation
processes
to return attributes or columns that are aggregated in a
Management System (DBMS) are specially developed
horizontal layout. Most algorithms require datasets
software applications that interact with applications,
with horizontal layout as input. It is tenacious task to
users and database to capture data and analyze data.
superintend data sets without rampart of DBMS.
DBMS is special software designed to allow define,
Intramural a Relational database it is worthier to try
create, update, query and administrate database. Some
with different subsets of dimensions and data points
known DBMS are MYSQL, PostgreSQL, MariaDB,
are easier, faster and flexible than working outside
SQLLite, Oracle, Microsoft SQL Server, DBase, SAP
with another alternative tool. Much like project, join,
HANA, FoxPro, Libre office Base, IBM DB2, and File
select, horizontal aggregation are performed by using
Marker Pro.
requiring
information.
Data
Base
operator and it is better to implement inside query
processor.
To select data from database SELECT statement is
used. Projection is selecting of the columns of table
In everyday and advanced applications intersperse of
soft computing and tools are invigorated by soft
computing. In real applications data uncertainty is the
clamorous feature and as hard computing cannot
handle vague and uncertain data soft computing is
used. Zadeh inaugurated the notion of graded
that one wishes to appear in the answer or table or data
set. SQL join is used to built data set or table based on
the common field between tables from two or more
tables to combine rows of tables. Left outer join
returns the matched tuples or rows from the right table
and all the tuples or rows from left table.
membership by perceiving the concept of Fuzzy set in
order to apprehend impreciseness in data, and theorize
Aggregation function groups multiple rows values to
the
most
form a single value based on certain condition. The
autonomous learning problem clustering is dealing
most commonly used aggregation functions are
with discovering a structure in a collection of
average (), maximum (), mode (), median (), count (),
unlabeled data. To cluster inexact and imprecise data
minimum (), sum (). These normal SQL aggregation
Fuzzy based clustering algorithms are used. In
functions are also called as vertical aggregation
clustering if the minimum no of elements in a cluster is
functions useful to create vertical layout. Group by
characteristic
function
of
sets.
The
clause performs gathering of all the rows that contains
Kalluri N V Satya Naresh, IJECS Volume 4 Issue 1 January, 2015 Page No.10028-10042
Page 10030
data in the opted columns and allows aggregation
preparation phase takes lot of time and effort. The
functions to operate on one or more columns.
horizontal layouts can be precisely used as input data
sets by data mining algorithms like classification,
Data mining is the process of extracting knowledge
from data. Data present in various data sources is
collected and stored in data warehouses than data
mining functionalities are performed on preprocessed
data giving results of user understandable form. All
tasks Data cleaning, transforming, reducing, regression
analysis, association rule generation, Classification,
clustering, outlier analysis comes under data mining
tasks. This paper deals with clustering among different
functionalities of data mining.
regression analysis, clustering and PDA without again
preparing data sets from data tables. Prevalent SQL
aggregation functions like min, avg, sum, and max can
be used to create vertical layout. Vertical layouts
elicited by using accustom SQL aggregation functions
but cannot be opted as I/O efficient for data mining
algorithms because they can generate only one column
per aggregated group and legion rows. Therefore a
horizontal layout is imperative having many columns
per aggregated group i.e returning many values per
Data Clustering is the technique of partitioning a
row. By excogitating functions like horizontal
dataset into distinct clusters depending upon the
aggregations educing horizontal layout can be comply.
property of same identity of elements. The Elements
Data mining tools can perforce generate SQL code. To
which are having identical features are kept in a single
assay horizontal aggregations methods like CASE and
cluster, whereas not so identical elements are kept in
SPJ can be afford.
different clusters. In 1965 Zadeh determined the sign
of fuzzy set and deliberated fuzzy set. Membership
2.3 Advantages of creating horizontal layouts using
horizontal aggregations and clustering them:
function is accredited with fuzzy set and considerate to
tackle with imprecise data.
(.) In data mining tools SQL code can be generated as
A fuzzy set is defined as A  S, where S is a set in an
horizontal aggregation constructs a template and
universe, is defined by its membership function
denoted by  A such that  A : X  [0,1] , that is every
y  A is associated with a real number  A ( y ) , called
the membership value of x, which satisfies 0<  A ( y )
<1.
automates to reproduce, optimize and test SQL queries
for correctness.
(.)
SQL queries generated axiomatically are more
efficient than queries generated by end user.
(.) The data set created by horizontal aggregations can
be created unswervingly in the database.
To cluster data by super visioning impreciseness by
using Fuzzy set concept, clustering is performed for
the created Horizontal layouts and the clustered data is
serendipitous for users to analysis and decision making
purposes.
2.2 Need For Creating Horizontal Layout:
(.) The Horizontal layouts created can be straightly
given as input for data mining algorithms like
classification, regression analysis, clustering and PDA.
(.) The clustered data created by clustering horizontal
layout is more serendipitous by users for analysis and
decision making.
Horizontal layout predominantly untangles the burden
of data mining projects as educing of data sets in data
Kalluri N V Satya Naresh, IJECS Volume 4 Issue 1 January, 2015 Page No.10028-10042
Page 10031
2.4 Definitions:
T is a database table with primary key P, C1,
C2,…….,Ci as discrete columns, N as one numeric
column and it is symbolized as T(P, C1, C2….Ci, N).
In OLAP terms it is interpreted as T is the fact table
having P as primary key, i dimensions, N as measure
column where M is the size of the table, C1, C2 …….Ci
Table
2.2
Vertical layout
are foreign keys in fact table and primary keys in
lookup tables. T is the input table, by executing SQL
queries tables TV, TH are created where Table TV is the
vertical layout table, TH is the horizontal layout.
Conversion of vertical layout to horizontal layout is
After giving the above SQL Query with SQL
aggregation function like sum, above table 2.2 is the
output for query which is called a vertical layout. As
this vertical layout is having only one aggregated
column and both C1, C2 acting as primary key it is not
the goal of horizontal aggregations.
useful for giving as input to data mining algorithms,
Let us consider the following table T as example
So horizontal tabular layout is required. The following
having P as primary key, C1, C2 as discrete columns
table 2.3 is horizontal layout having two aggregated
and N as numeric column.
columns and one primary key which is helpful for
giving as input in data mining tasks or algorithms.
Table 2.1 A
Database Table
Table
2.3
Horizontal layout
3. Methodology
Consider the query Select C1, C2, Sum (N) from T
group by C1, C2 order by C1, C2.
3.1 Horizontal Aggregations:
Horizontal aggregations are abetting in times where
the user wants to get output in horizontal form or
craves amalgamating vertical layout with aggregations
confide in on grouping columns. As vertical layout are
not that abundantly commodious for data mining
Kalluri N V Satya Naresh, IJECS Volume 4 Issue 1 January, 2015 Page No.10028-10042
Page 10032
algorithms horizontal layout are created by using
The syntax for erecting of Horizontal layout is as
horizontal
follows:
aggregations.
Horizontal
aggregations
revamp the vertical layout to horizontal layout by
transmogrifying the aggregation column N to list of
transposing columns Y1……….YK.
Consider an SQL Query that takes X1…..Xm as subset
from C1…..Cp1.
SELECT X1,….,Xj , Ha(N BY Y1,….,Yk) FROM F
GROUP BY X1,….,Xj .
Consider a palpable example of stores database
procuring stores information in Table transaction.
Table transaction is possessing strid, deptid, date,
The syntax for conceiving vertical layout is as follows.
Select X1….Xm, sum (N) from T group by X1….Xm.
The above query will outturn a vertical layout data set
possessing m+1 columns where the m columns
X1…Xm act as primary and Sum (N) is the only one
aggregated column.
month, year, day, rate, qty, totalsales, itemqty,
costAmt
as columns. Suppose if we appetite to find
out total sales for each storied by each day of the
week.
The normal SQL statement for the above query is
Select strid, day, sum (totalsales) from transaction
group by strid, day order by strid, day.
To metamorphose the Vertical layout to horizontal
layout, horizontal aggregation functions are used.
The
indispensable
desideratum
of
This gives a vertical layout like below
horizontal
aggregations is to transmogrify aggregated column N
by a list of columns Y1……Yk where the Y1….. Yk are
subset of columns X1……Xm and k<m. So to
inaugurate SQL code by horizontal aggregations there
are four input parameters T, X1….Xm, N, Y1….Yk
Where T is the Input table, X1….Xm are the grouping
columns, N is the aggregated column and Y1….Yk are
transposing columns.
The frame of reference for horizontal aggregation is
similar to the frame of reference for vertical
aggregation. The horizontal aggregation function is
connate by Ha(N BY Y1,….,Yk)
where Ha is the
standard SQL aggregation function , N is the
aggregation column and Y1………..Yk are the
Fig 3.1.1 Vertical layout created
by using vertical aggregations
transposing columns. Annexing of standard SQL
aggregation or vertical aggregation function is
This vertical layout is not useful for data mining tasks
rendered by using “By” clause which transmutes the
as it has only one aggregated column and both strid ,
aggregation column N to list of transposing columns
day of week act as primary key returning many
Y1………Yk which avails in conceiving a horizontal
records.
So by using horizontal aggregations
layout instead of vertical layout creation.
Kalluri N V Satya Naresh, IJECS Volume 4 Issue 1 January, 2015 Page No.10028-10042
Page 10033
horizontal layout is created having many aggregated
columns and only strid as primary key.
The SQL syntax with horizontal aggregations is as
follows:
Select strid, sum (total_sales BY day_of_Week) from
transaction group by strid.
Fig
3.2
System
Architecture
3.2.1 Module 1(Selection Process)
We need to select the table from database and select
the columns that we want to group by, aggregate,
transpose for which we want to create horizontal
Fig 3.1.2 Horizontal layout created
layout.
by using Horizontal aggregations
Select the group by column X1…..Xj
3.2 Creation and Clustering Horizontal Layout
Select the aggregate column N
This paper percolates creation of horizontal layout
with CASE, SPJ methods and clusters the resulted
Horizontal layout by using Fuzzy C-Means algorithm.
Select the transposing column Y1….Yk.
3.2.2 Module 2(Creation of Horizontal Layout)
An Example with results is also explained for
understanding. Horizontal layouts can be created by
In this module horizontal layouts are created by using
CASE, SPJ and Pivot methods but PIVOT and CASE
SPJ and CASE methods.
method give the same result with almost same time
complexity but CASE method is having better time
complexity than SPJ method. So we are only using
CASE and SPJ methods in our process, both gives
same result with different time complexities. Creation
and clustering horizontal layouts is done in three
modules. This is the proposed System architecture:
3.2.2.2 SPJ Method: In this caliber we aggregate the
column in horizontal way with the help of SPJ (Select,
Project, Join) method. The basic idea is to create one
table with a vertical aggregation for each result
column, and then join all those tables to produce FH.
We aggregate from F into d projected tables with d
Select-Project-Join-Aggregation
projection,
join,
aggregation).
queries
Each
(selection,
table
F1
corresponds to one sub grouping combination and has
{X1… Xj} as primary key and an aggregation on A as
Kalluri N V Satya Naresh, IJECS Volume 4 Issue 1 January, 2015 Page No.10028-10042
Page 10034
the only non key column. It is necessary to introduce
possible optimization is synchronizing table scans to
an additional table F0 that will be outer joined with
compute the d tables in one pass.
projected tables to get a complete result set.
Finally, to get TH we need d left outer joins with the T0
Three Main Steps in SPJ Method to create Horizontal
and d tables so that all individual aggregations are
layout:
properly assembled as a set of d dimensions for each
group. Outer joins set result columns to null for
(.)
First Table T0 is created having distinct
combination of group by columns X1,……..,Xj.
missing combinations for the given group. In general,
nulls should be the default value for groups with
For each unique combination of Transposing
missing combinations. We believe it would be
columns Y1 ,………,Yk , Tables T1 ,…….,Td are
incorrect to set the result to zero or some other number
created.
by default if there is no qualifying rows. Such
(.)
approach should be considered on a per CASE basis.
(.) Lastly Table T0 is left outer joined with each table
INSERT INTO TH SELECT T0.X1, T0.X2, . . . , T0.Xj,
T1 to Td.
T1.N, T2.N,. . . ,Td.N FROM T0 LEFT OUTER JOIN
How these tables are created is clearly explained
below.
T1 ON T0.X1 = T1.X1 and . . . and T0.Xj =T1.Xj LEFT
OUTER JOIN F2 ON T0.X1= T2.X1 and . . . and T0.Xj =
Table T0 defines the number of result rows, and builds
T2.Xj…………..LEFT OUTER JOIN Fd ON T0.X1 =
the primary key. T0 is populated so that it contains
Td.X1 and . . . and T0.Xj = Td.Xj.
every existing combination of X1,..……,Xj. Table F0
has X1,……,Xj as primary key and it does not have any
Real Time Example for SPJ method:
Consider a database having stores information and
non key column.
Transaction is a table in the database having StoreId,
INSERT INTO T0 SELECT DISTINCT X1,. . . , Xj
DepId, Date, Month, Year, Day, ItemId, Rate, Qty,
FROM T.
Amt as columns.
We should create tables T1 to Td .
Tables T1,, ……., Td contain individual aggregations
Suppose if want find total sales amount for each
storied by each day of week.
for each combination of R1, . . .,Rk. The primary key of
table T1….Td is Y1,……….,Yk and N is aggregated
The following queries should be computed to construct
column.
horizontal layout by using SPJ method
INSERT INTO T1 SELECT Xi,….Xj, V(N) FROM
Query1:
T/Tv
INSERT INTO F0 SELECT DISTINCT storeid
WHERE Y1 = v11 AND ………… Yk= Vk1 GROUP
BY Xi,….Xj.
Query2:
Then each table T1 aggregates only those rows that
correspond
FROM Transaction.
to
the
Ith
unique
combination
of
Y1……….Yk , given by the WHERE clause. A
INSERT INTO F1 SELECT storeid, sum (amt) AS
totalsalesamt
FROM
Transaction
WHERE
Day=’Mon’ GROUP BY storeid;
Kalluri N V Satya Naresh, IJECS Volume 4 Issue 1 January, 2015 Page No.10028-10042
Page 10035
.Query3:
OUTER JOIN F3 on F0.storeid=F3.storeid LEFT
OUTER JOIN F4 on F0.storeid=F4.storeid LEFT
INSERT INTO F2 SELECT storeid, sum (amt) AS
totalsalesamt
FROM
Transaction
OUTER JOIN F5 on F0.storeid=F5.storeid LEFT
WHERE
OUTER JOIN F6 on F0.storeid=F6.storeid LEFT
Day=’Tue’ GROUP BY storeid;
OUTER JOIN F7 on F0.storeid=F7.storeid.
.Query4:
By evaluating above queries we will get the horizontal
INSERT INTO F3 SELECT storeid, sum (amt) AS
layout that we want but it takes lot of effort as more
totalsalesamt FROM Transaction WHERE Day=’Wed’
sub queries should be written and more join operations
GROUP BY storeid;
should be performed. Consider the same above query,
.Query5:
to create vertical layout for this just one query is
INSERT INTO F4 SELECT storeid, sum (amt) AS
totalsalesamt
FROM
Transaction
WHERE
enough i.e select storied, day, sum (amt) from
Transaction group by storied, day.
But to create
horizontal layout we are writing 9 queries, so to reduce
Day=’Thu’ GROUP BY strid;
the effort and time complexity CASE method can be
Query6:
used to create horizontal layout easily with less effort.
INSERT INTO F5 SELECT storeid, sum (amt) AS
3.2.2.3 CASE Method:
totalsalesamt FROM Transaction WHERE Day=’Fri’
In this module we aggregate the column horizontally
GROUP BY storeid;
through CASE Method. The CASE statement returns a
.
value selected from a set of values based on Boolean
expressions. From a relational database theory point of
Query7:
view
this
is
equivalent
to
doing
a
simple
INSERT INTO F6 SELECT storeid, sum (amt) AS
projection/aggregation query where each non key
totalsalesamt FROM Transaction WHERE Day=’Sat’
value is given by a function that returns a number
GROUP BY storeid;
based on some conjunction of conditions. In a similar
manner to SPJ, the method directly aggregates from F.
.Query8:
INSERT INTO F7 SELECT storeid, sum (amt) AS
Horizontal aggregation queries can be evaluated by
totalsalesamt FROM F Transaction WHERE
directly aggregating from F and transposing rows at
Day=’Sun’ GROUP BY storeid;
the same time to produce FH. First, we need to get the
unique combinations of R1,….,Rk that define the
Query9:
INSERT
matching Boolean expression for result columns. The
INTO
FH
SELECT
F0.storied,
F1.totalsalesamt AS Mon-amt, F2.totalsalesamt
AS Tue-amt, F3.totalsalesamt
F4.totalsalesamt
from F is as follows:
AS Wed-amt,
AS Thu-amt, F5.totalsalesamt
AS fri-amt, F6.totalsalesamt
SQL code to compute horizontal aggregations directly
AS Sat-amt,
F7.totalsalesamt AS Sun-amt FROM F0 LEFT
OUTER JOIN F1 on F0.storeid=F1.storeid LEFT
OUTER JOIN F2 on F0.storrid=F2.storeid LEFT
V () is a standard (vertical) SQL aggregation that has
a
“CASE”
statement
as
argument.
Horizontal
aggregations need to set the result to null when there
are no qualifying rows for the specific horizontal
group to be consistent with the SPJ method and also
with the extended relational model.
Kalluri N V Satya Naresh, IJECS Volume 4 Issue 1 January, 2015 Page No.10028-10042
Page 10036
SQL Syntax for CASE method is given below, in the
previously created data set can be directly taken as
syntax T is the original table and TH is the horizontal
input for clustering instead of again creating data set.
layout:
The Horizontal layout clustered can be useful for
SELECT DISTINCT Y1,……,Yk FROM T.
analysis and decision making. As fuzzy C-means
algorithm can handle vagueness of data, so to cluster
INSERT INTO TH SELECT X1, . . . , Xj , V(CASE
Horizontal layouts fuzzy C-Means algorithm is used.
WHEN Y1 = v11 and . . . and Yk =vk1
THEN N ELSE null END)
…………………,V(CASE WHEN Y1 = v1d and . . .
and Yk = vkd THEN N ELSE null END)
FROM F GROUP BY X1, X2 . . . ,Yj.
3.2.3.1 FUZZY C-MEANS ALGORITHM:
As experienced in real life situations, the clustering of
datasets by hard c-means leads to a partition of the
dataset. But, this is unwanted in many cases and so the
applicability of hard c-means has been limited.
However, the concept of fuzzy sets, so that an element
Example: Suppose in a store database if we want find
out total items sold in each department of each store by
can belong to any number of clusters with different
membership values.
each day of week. The following query is evaluated to
create horizontal layout by using CASE method. select
The objective function is
StoreId, DepId, sum( CASE when Day='Fri' then Qty
n
c
J m (U , v)   ( ik ) m ' (dik ) 2
else null end),sum( CASE when Day='Mon' then Qty
k 1 i 1
else null end),sum( CASE when Day='Sat' then Qty
1  m'   and is
else null end),sum( CASE when Day='Thr' then Qty
m’ being a real number such that
else null end),sum( CASE when Day='Tue' then Qty
called the fuzzifier. ik  [0, 1] is the membership of
else null end),sum( CASE when Day='Wed' then Qty
else null end) from Trans1 Group By StoreId, DepId.
the kth pattern to vi .
Algorithm:
3.2.3 Module 3(Clustering)
STEP 1: Fix c ( 2  c  n ) and select a value m’
The main objective in this paper is to create a data set
Initialize the partition matrix
easily so that it can be useful directly in data mining
For r = 0, 1, 2,…. Do
tasks or projects avoiding data preprocessing phase.
The horizontal layout can be useful for any data
STEP 2: Calculate the ‘c’ centers
mining algorithm so we are using it directly for
clustering. The previously created horizontal layout is
taken as input for clustering.
vi( r ) , i  1, 2,...c
n
using the formula vij 

k 1
n
m'
ik

k 1
.xkj
m'
ik
Suppose if there is a stores data base, if we want to
find the stores that are having same total sales amount
STEP 3: Update the partition matrix for the r th step
for each day of week or if we want to cluster the stores
U (r )
based on total sales for each day of week than
Kalluri N V Satya Naresh, IJECS Volume 4 Issue 1 January, 2015 Page No.10028-10042
Page 10037
to
Taking
U ( r 1) = (
 ik( r 1)
), where
I k  {i | 2  c  n; d ik( r )  0}
Query1:
INSERT INTO F0 SELECT DISTINCT storeid
FROM Transaction.
Query2:
1
ik( r 1)
 c  d (r) 2/( m ' 1) 
 , if I k   ,
    ik( r ) 
 j 1  d jk 



 0, where
INSERT INTO F1 SELECT storeid, sum (amt) AS
totalsalesamt
FROM
Transaction
WHERE
Day=’Mon’ GROUP BY storied.
i  I'k  {1, 2,...c}  I k
.Query3:
STEP 4: If
U ( r 1)  U ( r )   L
STOP
Else go to STEP 2.
INSERT INTO F2 SELECT storeid, sum (amt) AS
totalsalesamt
FROM
Transaction
WHERE
Day=’Tue’ GROUP BY storied.
Here C denotes number of clusters, V denotes cluster
centers, X denotes data point, d denotes distance
between cluster centre and data point and U is the
partition matrix where each element of matrix
represents the membership value of a data point X
belonging to Cluster C.
.Query4:
INSERT INTO F1 SELECT storeid, sum (amt) AS
totalsalesamt FROM Transaction WHERE Day=’Wed’
4. Results:
GROUP BY storied.
By taking one real time example construction of
.Query5:
Horizontal layout by using SPJ method and CASE
INSERT INTO F1 SELECT storeid, sum (amt) AS
method is provided. After creating Horizontal layout, it
totalsalesamt
is taken as input data set for clustering and clustering
Day=’Thu’ GROUP BY strid.
FROM
Transaction
WHERE
is done using fuzzy C-means algorithm.
Query6:
Example:
INSERT INTO F1 SELECT storeid, sum (amt) AS
Consider a database having stores information.
totalsalesamt FROM Transaction WHERE Day=’Fri’
Transaction is a table in the database having StoreId,
GROUP BY storied.
DepId, Date, Month, Year, Day, ItemId, Rate, Qty,
Amt as columns.
Suppose if we want to find total sales amount for each
storied by each day of week.
Query7:
INSERT INTO F1 SELECT storeid, sum (amt) AS
totalsalesamt
FROM Transaction WHERE
Day=’Sat’ GROUP BY storied.
.Query8:
SPJ Method:
INSERT INTO F1 SELECT storeid, sum (amt) AS
The following queries should be computed to construct
totalsalesamt
horizontal layout by using SPJ method.
Day=’Sun’ GROUP BY storied.
FROM F Transaction WHERE
Kalluri N V Satya Naresh, IJECS Volume 4 Issue 1 January, 2015 Page No.10028-10042
Page 10038
then Qty else null end) from Trans1 Group By StoreId,
Query9:
INSERT
INTO
FH
SELECT
F0.storied,
DepId.The results are as follows:
F1.totalsalesamt AS Mon-amt, F2.totalsalesamt
First we need to select the Transaction table from
AS Tue-amt, F3.totalsalesamt
AS Wed-amt,
database containing stores information for which we
AS Thu-amt, F5.totalsalesamt
want to create horizontal layout. The input frame is as
F4.totalsalesamt
AS fri-amt, F6.totalsalesamt
AS Sat-amt,
follows.
F7.totalsalesamt AS Sun-amt FROM F0 LEFT
OUTER JOIN F1 on F0.storeid=F1.storeid LEFT
OUTER JOIN F2 on F0.storrid=F2.storeid LEFT
OUTER JOIN F3 on F0.storeid=F3.storeid LEFT
OUTER JOIN F4 on F0.storeid=F4.storeid LEFT
OUTER JOIN F5 on F0.storeid=F5.storeid LEFT
OUTER JOIN F6 on F0.storeid=F6.storeid LEFT
OUTER JOIN F7 on F0.storeid=F7.storeid.
By evaluating above queries we will get the horizontal
layout that we want but it takes lot of effort as more
sub queries should be written and more join operations
should be performed. Consider in the above query to
By pressing the select table button we can select the
create vertical layout just one query is enough i.e
Transaction table and by pressing display button the
select storied, day, sum(amt) from Transaction group
selected table is displayed as follows. After this by
by storied, day. But to create horizontal layout we are
pressing
writing 9 queries, so to reduce the effort CASE method
GENERATION frame will be displayed.
generate
button
the
SQL
CODE
can be used to create horizontal layout easily with less
effort.
CASE Method:
Suppose if we want find out total items sold in each
department of each store by each day of week. The
following query is evaluated to create horizontal layout
by using CASE method.
select StoreId, DepId, sum( CASE when Day='Fri'
then Qty else null end),sum( CASE when Day='Mon'
then Qty else null end),sum( CASE when Day='Sat'
then Qty else null end),sum( CASE when Day='Thr'
then Qty else null end),sum( CASE when Day='Tue'
In this frame if we press view Columns button all the
then Qty else null end),sum( CASE when Day='Wed'
columns of the selected table will be displayed.
Kalluri N V Satya Naresh, IJECS Volume 4 Issue 1 January, 2015 Page No.10028-10042
Page 10039
The clustering results are as follows:
By selecting the columns that we want to group by,
aggregate, transpose, aggregation function and method
This is the input frame where we need to select the
name and by clicking Generate button we get
data set that we want to cluster by using the browse
Horizontal layout as output.
button.
Here we are selecting the previously created horizontal
layout as input for clustering. The data storeids are
The above horizontal layout output is taken as input
clustered by using Fuzzy C-Means algorithm.
for clustering and clustering is performed by using
fuzzy C-means algorithm.
Suppose from the stores data base if we want to find
the stores that are having same total sales amount for
each day of week or if want cluster the stores based on
total sales for each day of week than previously
created data set can be directly taken as input for
clustering instead of again creating data set.
Kalluri N V Satya Naresh, IJECS Volume 4 Issue 1 January, 2015 Page No.10028-10042
Page 10040
4. Conclusion:
(.)Preparing data set for data mining projects takes
more effort and time but horizontal layout data set can
be
easily created
using horizontal
aggregation
functions.
(.)It is easy to create Horizontal Layout using CASE
than SPJ method as SPJ method consists computing
more sub queries where as in CASE method a single
query is enough to compute.
(.)Time
Complexity
(O(NlogN+dknlogn+dN))
of
is
CASE
better
method
than
time
complexity of SPJ method (O(Nlog(N))+dknlogn+dN
) where N is the size of the input table F, n is the size
of output table Horizontal layout, d is the distinct
combination of transposing columns and k is the
number of transposing columns .
[2] G. Bhargava, P. Goel, and B.R. Iyer, “Hypergraph
Based Reorderings of Outer Join Queries with
Complex Predicates, ”Proc. ACM SIGMOD Int’l
Conf. Management of Data (SIGMOD ’95), pp. 304315, 1995.
[3] J.A. Blakeley, V. Rao, I. Kunen, A. Prout, M.
Henaire, and C. Kleinerman, “.NET Database
Programmability and Extensibility in Microsoft SQL
Server,” Proc. ACM SIGMOD Int’l Conf.
Management of Data (SIGMOD ’08), pp. 1087-1098,
2008.
[4] J. Clear, D. Dunn, B. Harvey, M.L. Heytens, and P.
Lohman, “Non- Stop SQL/MX Primitives for
Knowledge Discovery,” Proc. ACM SIGKDD Fifth
Int’l Conf. Knowledge Discovery and Data Mining
(KDD ’99), pp. 425-429, 1999.
[5] E.F. Codd, “Extending the Database Relational
Model to Capture More Meaning,” ACM Trans.
Database Systems, vol. 4, no. 4, pp. 397-434, 1979.
(.)Fuzzy C-Means algorithm can give better clustering
results than K-Means and Hard C-Means algorithms as
it handles vagueness of data.
5. Future Work:
(.)Other data mining algorithms like classification,
regression analysis, Decision Making can also be
implemented by taking Horizontal layout as input.
(.)Horizontal layout can be clustered by using other
soft computing clustering algorithms to handle
impreciseness in data.
(.)Missing values in data is not handled, so rough set
concept can be used to handle missing data.
(.)To reduce the execution time of clustering algorithm
it can be parallelized using OPEN_MP.
REFERENCES
[1] Carlos Ordonez and Zhibo Chen.: “Horizontal
Aggregations in SQL to Prepare Data Sets for Data
Mining Analysis”, IEEE TRANSACTIONS ON
KNOWLEDGE AND DATA ENGINEERING, VOL.
24, NO. 4, APRIL 2012.
[6] C. Cunningham, G. Graefe, and C.A. GalindoLegaria, “PIVOT and UNPIVOT: Optimization and
Execution Strategies in an RDBMS,” Proc. 13th Int’l
Conf. Very Large Data Bases (VLDB ’04), pp. 9981009, 2004.
[7] C. Galindo-Legaria and A. Rosenthal, “Outer Join
Simplification
and
Reordering
for
Query
Optimization,” ACM Trans. Database Systems, vol.
22, no. 1, pp. 43-73, 1997.
[8] H. Garcia-Molina, J.D. Ullman, and J. Widom,
Database Systems: The Complete Book, first ed.
Prentice Hall, 2001.
[9] G. Graefe, U. Fayyad, and S. Chaudhuri, “On the
Efficient Gathering of Sufficient Statistics for
Classification from Large SQL Databases,” Proc.
ACM Conf. Knowledge Discovery and Data Mining
(KDD ’98), pp. 204-208, 1998.
[10] J. Gray, A. Bosworth, A. Layman, and H.
Pirahesh, “Data Cube: A Relational Aggregation
Operator Generalizing Group-by, Cross- Tab and SubTotal,” Proc. Int’l Conf. Data Eng., pp. 152-159, 1996.
[11] J. Han and M. Kamber, Data Mining: Concepts
and Techniques, first ed. Morgan Kaufmann, 2001.
[12] G. Luo, J.F. Naughton, C.J. Ellmann, and M.
Watzke, “Locking Protocols for Materialized
Kalluri N V Satya Naresh, IJECS Volume 4 Issue 1 January, 2015 Page No.10028-10042
Page 10041
Aggregate Join Views,” IEEE Trans. Knowledge and
Data Eng., vol. 17, no. 6, pp. 796-807, June 2005.
[13] C. Ordonez, “Horizontal Aggregations for
Building Tabular Data Sets,” Proc. Ninth ACM
SIGMOD Workshop Data Mining and Knowledge
Discovery (DMKD ’04), pp. 35-42, 2004.
[14] C. Ordonez, “Vertical and Horizontal Percentage
Aggregations,” Proc. ACM SIGMOD Int’l Conf.
Management of Data (SIGMOD ’04), pp. 866-871,
2004.
[15] C. Ordonez, “Integrating K-Means Clustering
with a Relational DBMS Using SQL,” IEEE Trans.
Knowledge and Data Eng., vol. 18, no. 2, pp. 188-201,
Feb. 2006.
[16] C. Ordonez, “Statistical Model Computation with
UDFs,” IEEE Trans. Knowledge and Data Eng., vol.
22, no. 12, pp. 1752-1765, Dec.
2010.
[17] C. Ordonez, “Data Set Preprocessing and
Transformation in a Database System,” Intelligent
Data Analysis, vol. 15, no. 4, pp. 613- 631, 2011.
[18] C. Ordonez and S. Pitchaimalai, “Bayesian
Classifiers Programmed in SQL,” IEEE Trans.
Knowledge and Data Eng., vol. 22, no. 1, pp. 139-144,
Jan. 2010.
[19] S. Sarawagi, S. Thomas, and R. Agrawal,
“Integrating Association Rule Mining with Relational
Database Systems: Alternatives and Implications,”
Proc. ACM SIGMOD Int’l Conf. Management of Data
(SIGMOD ’98), pp. 343-354, 1998.
[20] H. Wang, C. Zaniolo, and C.R. Luo, “ATLAS: A
Small But Complete SQL Extension for Data Mining
and Data Streams,” Proc. 29th Int’l Conf. Very Large
Data Bases (VLDB ’03), pp. 1113- 1116, 2003.
[21] A. Witkowski, S. Bellamkonda, T. Bozkaya, G.
Dorman, N. Folkert, A. Gupta, L. Sheng, and S.
Subramanian, “Spreadsheets in RDBMS for OLAP,”
Proc. ACM SIGMOD Int’l Conf. Management of Data
(SIGMOD ’03), pp. 52-63, 2003.
[22] Zadeh, L. A.: Fuzzy sets, Information and
Control, 8, (1965), pp.338–353.
[23]Sugeno, S.: Fuzzy measures and fuzzy integrals, in
Fuzzy Automata and Decision Process, edited by
M.Gupta, G.N. Sardis and B.R. Gaines (North
Holland, Amsterdam, New York), (1977), pp. 82-102.
[24]Attanasov, K. T.: Intuitionistic Fuzzy Sets, Fuzzy
Sets and Systems, 20, (1986), pp.87–96.
Kalluri N V Satya Naresh, IJECS Volume 4 Issue 1 January, 2015 Page No.10028-10042
Page 10042
Download